Prof. Dr. Oliver Wendt, Dr. habil. Mahdi Moeini, M.Sc. Manuel Hermes
Business Information Systems & Operations Research
Basics of Data Science, Summer Semester 2022
Technische Universität Kaiserslautern, www.bisor.de

Organizational Stuff

General
• Part 0: Data -> Information -> Knowledge: Concepts in Science and Engineering
• Part 1: Organizing the "Data Lake" (from data mining to data fishing)
• Part 2: Stochastic Models on structured attribute data
• Part 3: Getting Ready for the Digital Twin
• Lectures and exercises (no tutorials).
• OLAT PW:

(Title image: https://www.billyloizou.com/blog/bridging-the-gap-between-data-creativity)

Summary: In this section, we will see:
• What is Data Science?
• What is/are Data/Data Set?
• Sources of Data
• Ecosystem of Data Science
• Legal, Ethical, and Social Issues
• Tasks of Data Science

What is Data Science?
• Explosion of data:
o social networks,
o Internet of Things, ....
• Heterogeneity:
o big and unstructured data that might be noisy, ....
• Technologies:
o huge storage capacity, clouds,
o computing power,
o algorithms,
o statistics and computation techniques

What is Data Science?
(Figure: search interest over time; trends.google.com, accessed March 29, 2022)

Data Science vs. Machine Learning vs. Big Data
• Common point: focus on improving decision making through analyzing data
• Differences:
o Machine Learning: focus on algorithms
o Big Data: focus on structured data
o Data Science: focus on unstructured data

The Data Science Venn Diagram
http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
Jake VanderPlas. Python Data Science Handbook. O'Reilly Media (2016)

A First Try to Define Data Science
"A data scientist is someone who is better at statistics than any software engineer and better at software engineering than any statistician." (Josh Wills, Director of Data Science at Cloudera)
"A data scientist is someone who is worse at statistics than any statistician and worse at software engineering than any software engineer." (Will Cukierski, Data Scientist at Kaggle)
"The field of people who decide to print 'Data Scientist' on their business cards and get a salary bump."
https://www.prooffreader.com/2016/09/battle-of-data-science-venn-diagrams.html

According to David Donoho, data science activities can be classified as follows:
• Data Gathering, Preparation, and Exploration
• Data Representation and Transformation
• Computing with Data
• Data Visualization and Presentation
• Data Modeling
• Science about Data Science
David Donoho. 50 Years of Data Science. Journal of Computational and Graphical Statistics, 26:4, 745-766 (2017).

Definition: Data Science concerns the recognition, formalization, and exploitation of data phenomenology emerging from the digital transformation of business, society, and science itself.
Definition: Data Science comprises a set of principles, problem definitions, algorithms, and processes with the objective of extracting nontrivial and useful patterns from large data sets.
David Donoho. Data Science: The End of Theory?
John D. Kelleher, Brendan Tierney. Data Science. The MIT Press (2018)

What is Data Science?
Some Examples: Data Science in History
Mid-19th century:
• Devastating outbreaks of cholera had plagued London.
• Dr. John Snow prepared a dot map of the Soho district in 1854.
• The clustering pattern on the map showed that the Broad Street pump was the likely source of contamination.
• Using this evidence, the authorities disabled the pump, and cholera incidence in the neighborhood consequently declined.

Some Examples: Data Science Today
• Precision medicine
• Data science in sales and marketing
• Governments using data science
• Data science in professional sports
• Predicting elections
N. Silver. The Signal and the Noise. Penguin (2012).
E. A. Ashley. Towards Precision Medicine. Nature Reviews Genetics (2017).
John D. Kelleher, Brendan Tierney. Data Science. The MIT Press (2018)

The Hype Cycle
https://en.wikipedia.org/wiki/Hype_cycle
(Figure: Jeremykemp at English Wikipedia, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=10547051)

Some Myths about Data Science
• Data science is an autonomous process that finds the answers to our problems
• Every data science project requires big data and should use deep learning
• Data science is easy to do
• Data science pays for itself in a very short time
John D. Kelleher, Brendan Tierney. Data Science. The MIT Press (2018)

What is/are Data/Data Set?
• A datum, or a piece of information, is an abstraction of a real-world entity (i.e., a person, an object, or an event).
• Individual abstractions are called variables, features, or attributes.
• Attribute: each entity is described by several attributes.
• A data set consists of the data concerning a collection of entities, such that each entity is described in terms of a collection of attributes.
Hint: "data set" and "analytics record" are often used as equivalent terms.
https://www.merriam-webster.com/dictionary/datum
John D. Kelleher, Brendan Tierney. Data Science. The MIT Press (2018)

• The standard attribute types are numeric, nominal, and ordinal.
• Numeric attributes describe measurable quantities as integer or real values.
• Nominal (categorical) attributes typically take values from a finite collection or set.
• Ordinal attributes are used to rank objects within a class.
John D. Kelleher, Brendan Tierney. Data Science. The MIT Press (2018)

• Structured data: data that can be stored in a table or spreadsheet, such that each instance/row in the table has the same structure and set of attributes.
• Unstructured data: data where each instance in the data set may have its own internal structure, and this structure can be different for each instance.
John D. Kelleher, Brendan Tierney. Data Science. The MIT Press (2018)

• Raw data: attributes that are direct abstractions of an object or an event.
• Derived data: data that we derive from other pieces of data.
• Types of raw data:
o Captured data: collected by performing a direct measurement or an observation.
o Exhaust data: a by-product of a process whose primary objective is not capturing data.
Kitchin, Rob. The Data Revolution: Big Data, Open Data, Data Infrastructures, and Their Consequences. Sage (2014).
John D. Kelleher, Brendan Tierney. Data Science. The MIT Press (2018)
What is/are Data/Data Set?
• Metadata: data that describes other data.
Example: the US National Security Agency's (NSA) surveillance program PRISM collected a large amount of metadata about people's phone conversations. The agency was not recording the content of the calls (i.e., there was no wiretapping); instead, the NSA collected data about the calls, e.g., who the recipient was, the duration of the call, etc.
John D. Kelleher, Brendan Tierney. Data Science. The MIT Press (2018)
Pomerantz, Jeffrey. Metadata. Cambridge, MA: MIT Press (2015).

Sources of Data
Typically, 80% of project time is spent on getting the data ready.
Problems: unclear variable names, missing values, misspelled text, numbers stored as text (in spreadsheets), outliers, etc.
Example: the "thinking face" emoji. Unicode number: U+1F914, HTML code: &#129300;, CSS code: \1F914
John D. Kelleher, Brendan Tierney. Data Science. The MIT Press (2018)
https://unicode-table.com/en/1F914/

• In-house data: the data is ready, e.g., your company provides it.
o Advantages: it is fast and ready to use, you can work with the data within your company, and you can cooperate with the people who collected and created it.
o Potential disappointments: the data may not be well gathered, well documented, or well maintained; the person who gathered it may have already left the company; what you actually need may not be in the data; ....
Poulson Barton. Data Science Foundations: Fundamentals. (2019)

Open data: data that is (1) gratis and (2) free to use.
• Government, e.g.,
o The home of the U.S. Government's open data (https://www.data.gov/)
o The Global Open Data Index (https://index.okfn.org)
• Science, e.g.,
o Nature: https://www.nature.com/sdata
o The Open Science Data Cloud (OSDC): https://www.opensciencedatacloud.org
• Social media, e.g., Google Trends (https://trends.google.com/) and Yahoo Finance (https://finance.yahoo.com)

Other sources of data (with some overlaps):
• Application Programming Interfaces (APIs)
• Data scraping (be careful!)
• Creating data (be careful!)
• Passive collection of data (be careful!)
• Generating data
Poulson Barton. Data Science Foundations: Fundamentals. (2019)

Ecosystem of Data Science
Data Science Ecosystem: the term refers to the set of programming languages, software packages and tools, methods and algorithms, general infrastructure, etc. that an organization uses to gather, store, analyze, and get maximum advantage from data in its data science projects.
There are different sets of technologies used for data science:
● commercial products,
● open-source tools,
● a mixture of open-source tools and commercial products.
https://online.hbs.edu/blog/post/data-ecosystem
John D. Kelleher, Brendan Tierney. Data Science. The MIT Press (2018)

A typical data architecture for data science:
• The data-sources layer: e.g., online transaction processing (OLTP) systems in banking, finance, call centers, etc.
• The data-storage layer: supports data sharing, data warehousing, and data analytics across the organization; it has two parts:
o Most organizations use data-sharing software.
o Managing (storing and analyzing) big data, e.g., with Hadoop.
• The applications layer: the prepared data is used to analyze the specific challenge of the data project.
John D. Kelleher, Brendan Tierney. Data Science. The MIT Press (2018)

Legal, Ethical, and Social Issues
• Legal issues: for example, privacy laws:
o The EU's General Data Protection Regulation (GDPR), the California Consumer Privacy Act (CCPA), etc. If an organization seriously violates the GDPR, it can face fines running into the billions.
• Ethical issues: concerning authenticity, fairness (equality, equity, and need), etc.
• Social issues: public opinion should be respected and taken into consideration.
David Martens. Data Science Ethics: Concepts, Techniques and Cautionary Tales. Oxford University Press (2022)
Poulson Barton. Data Science Foundations: Fundamentals. (2019)

(Figure: data protection and privacy legislation worldwide; https://unctad.org/page/data-protection-and-privacy-legislation-worldwide, accessed December 2021)

Data science in interaction with human and artificial intelligence:
● Recommendations: after processing the data, algorithms make recommendations, but a human decides whether to take them or leave them.
● Human-in-the-loop decisions: algorithms make and execute their own decisions, but humans are present to exercise control.
● Human-accessible decisions: algorithms make decisions automatically and execute them, but the process should be accessible and interpretable.
● Machine-centric decisions: machines communicate with each other, e.g., in the Internet of Things (IoT).
David Martens. Data Science Ethics: Concepts, Techniques and Cautionary Tales. Oxford University Press (2022)
Poulson Barton. Data Science Foundations: Fundamentals. (2019)

Standard Tasks of Data Science
Most data science projects belong to one of the following classes of tasks:
• Clustering
• Anomaly (outlier) detection
• Association-rule mining
• Prediction
John D. Kelleher, Brendan Tierney. Data Science. The MIT Press (2018)

• Clustering:
Example: by identifying customers and their preferences and needs, data science can support companies' marketing and sales campaigns via targeted marketing.
The standard data science approach: formulate the problem as a clustering task, where clustering sorts the instances of a data set into subgroups of similar instances. For this purpose, we need to know the number of subgroups (e.g., via some domain knowledge) and a range of attributes that describe the customers, e.g., demographic information (age, gender, etc.), location (ZIP code), and so on.
John D. Kelleher, Brendan Tierney. Data Science. The MIT Press (2018)

• Anomaly (outlier) detection:
In anomaly detection (or outlier analysis), we search for and identify instances that do not match or conform to the typical instances in a data set, e.g., fraudulent activities such as fraudulent credit card transactions.
A typical approach:
1. Using domain expertise, define some rules.
2. Then use, e.g., SQL (or another language) to check the business databases or data warehouse (see the sketch below).
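A minimal sketch of step 2, assuming a hypothetical Transactions table with CardNumber, Amount, and TransactionTime columns; the rule and its thresholds are invented for illustration, not taken from the source:

-- Hypothetical rule: flag cards with more than 5 transactions
-- or more than 10,000 in total spend on a single day.
SELECT CardNumber,
       CAST(TransactionTime AS DATE) AS TransactionDay,
       COUNT(*)    AS TransactionCount,
       SUM(Amount) AS TotalAmount
FROM   Transactions
GROUP  BY CardNumber, CAST(TransactionTime AS DATE)
HAVING COUNT(*) > 5 OR SUM(Amount) > 10000;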
Another approach: training a prediction model to classify instances as anomalous versus normal.
John D. Kelleher, Brendan Tierney. Data Science. The MIT Press (2018)

• Association-rule mining:
For example, data science can be used in cross-selling, where the vendor suggests to customers who are currently buying some products that they might also be interested in buying other similar, related, or even complementary products. For the purpose of cross-selling, we need to identify associations between products. This can be done with unsupervised data-analysis techniques.
John D. Kelleher, Brendan Tierney. Data Science. The MIT Press (2018)

• Prediction (classification):
Example: in customer relationship management (CRM), a typical task consists in propensity modeling, i.e., estimating the likelihood that a customer will make a certain decision, e.g., leaving a service.
Customer churn: when customers leave one service, e.g., a cell phone company, to join another one.
Using (trained) classification models, the data science task is to help detect (predict) churn, i.e., to classify a customer as a churn risk or not.
John D. Kelleher, Brendan Tierney. Data Science. The MIT Press (2018)

• Prediction (regression):
Example: price prediction, which consists in estimating the price of a product at some point in the future.
A typical approach: use regression, because price prediction consists in estimating the value of a continuous attribute.
John D. Kelleher, Brendan Tierney. Data Science. The MIT Press (2018)

Part 1, Section 2 → Relational Database Models: Modeling and Querying Structured Attributes of Objects

Part 1: Organizing the "Data Lake" (from data mining to data fishing)
• Relational Database Models: Modeling and Querying Structured Attributes of Objects
• Graph- and Network-based Data Models: Modeling and Querying Structured Relations of Objects
• Information Retrieval: Document Mining and Querying of ill-structured Data
• Streaming Data and High Frequency Distributed Sensor Data
• The Semantic Web: Ontologist's Dream (or nightmare?) of how to integrate evolving heterogeneous data lakes

Relational Database Models: Modeling and Querying Structured Attributes of Objects

Summary: In this section, we will see:
• Introduction
• The Relational Model
• Relational Database Management System (RDBMS)
• How to Design a Relational Database
• Database Normalization
• Other Types of Databases

Introduction
• The data storage problem
• One-dimensional arrays
https://www.weforum.org/agenda/2015/02/a-brief-history-of-big-data-everyone-should-read/

Using a List to Organize Data
Holiday Pictures: Alexanderplatz.jpg, Brandenburg Gate.jpg, Eiffel Tower.jpg, London Eye.jpg, Louvre.jpg
(Images from the Internet.)
Organize Data in a Table:

Country  City    Picture               Date      Person
Germany  Berlin  Brandenburg Gate.jpg  1.7.2021  Joshua
Germany  Berlin  Alexanderplatz.jpg    1.7.2021  Hans
England  London  London Eye.jpg        1.9.2021  Hans
France   Paris   Eiffel Tower.jpg      1.8.2021  Joshua
France   Paris   Louvre.jpg            1.8.2021  Hans

Pros and Cons of a Table Structure:
• It is easy to add attributes with additional columns
• It is easy to add new records with additional rows
• We have to store repetitive information
• It is not easy to accommodate special circumstances
Adam Wilbert. Relational Databases: Essential Training. (2019)

Relational Databases: the same data split across three tables

Pictures:
Picture#  FileName              Location  Date
001       Brandenburg Gate.jpg  1         1.7.2021
002       Alexanderplatz.jpg    1         1.7.2021
003       Eiffel Tower.jpg      2         1.8.2021
004       Louvre.jpg            2         1.8.2021
005       London Eye.jpg        3         1.9.2021

Locations:
Location  City    Country
1         Berlin  Germany
2         Paris   France
3         London  England

People:
Picture#  Person
001       Hans
002       Joshua
003       Joshua
004       Hans
005       Hans
005       Joshua
005       Sarah

The Relational Model

The Relational Model:
• It was originally developed by the computer scientist E. F. Codd in his paper "A Relational Model of Data for Large Shared Data Banks", published in 1970
• Key points:
o The retrieval of information is separated from its storage
o Using some rules, data is organized across multiple tables that are related to each other
Ramon A. Mata-Toledo and Pauline K. Cushman. Fundamentals of Relational Databases. McGraw-Hill (2000)
Andreas Meier and Michael Kaufmann. SQL & NoSQL Databases. Springer (2019)
Adam Wilbert. Relational Databases: Essential Training. (2019)

The Pictures Database (extended with picture 006, River Thames.jpg, taken in London):

Pictures:
Picture#  FileName              Location  Date
001       Brandenburg Gate.jpg  1         1.7.2021
002       Alexanderplatz.jpg    1         1.7.2021
003       Eiffel Tower.jpg      2         1.8.2021
004       Louvre.jpg            2         1.8.2021
005       London Eye.jpg        3         1.9.2021
006       River Thames.jpg      3         1.9.2021

People:
Picture#  Person
001       Hans
002       Joshua
003       Joshua
004       Hans
005       Hans
005       Joshua
005       Sarah

Locations:
Location  City    Country
1         Berlin  Germany
2         Paris   France
3         London  England

Relational Database Management System (RDBMS)

Some RDBMS systems and vendors
• Microsoft SQL Server (SQL: Structured Query Language)
• PostgreSQL
• Azure SQL Database
• IBM Db2
• Oracle
• MySQL
• SQLite
https://realpython.com/python-sql-libraries/
Adam Wilbert. Relational Databases: Essential Training. (2019)
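Before turning to RDBMS tasks, here is a minimal sketch of how the three Pictures tables above could be queried together in SQL. The table and column names follow the example, but the exact schema is an assumption (in particular, Picture# is renamed PictureID, since # is not a portable identifier character):

-- Which pictures show Hans, and where were they taken?
SELECT pi.FileName, lo.City, lo.Country
FROM   Pictures  AS pi
JOIN   Locations AS lo ON lo.Location  = pi.Location
JOIN   People    AS pe ON pe.PictureID = pi.PictureID
WHERE  pe.Person = 'Hans';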
RDBMS Tasks
• Creating and modifying the structure of the data
• Defining names for tables and columns
• Creating key columns and building relationships
• Manipulating records and performing CRUD operations (see the sketch below):
o Create new records of data
o Read data that exists
o Update values of data
o Delete records from the database
I. Robinson, J. Webber, and E. Eifrem. Graph Databases: New Opportunities for Connected Data. O'Reilly (2015)
Adam Wilbert. Relational Databases: Essential Training. (2019)

Further RDBMS Tasks
• Performing regular backups
• Maintaining copies of the database
• Controlling access permissions
• Creating reports, including visualizations
• Creating forms
How to interact with an RDBMS?
• Part 1: Graphical interface
• Part 2: Coding with SQL (Structured Query Language)
Adam Wilbert. Relational Databases: Essential Training. (2019)

Database Components
• Relations
• Domains
• Tuples
Alternative terms for these database components:
• Tables
• Columns
• Records
Andreas Meier and Michael Kaufmann. SQL & NoSQL Databases. Springer (2019)
Adam Wilbert. Relational Databases: Essential Training. (2019)

Data Table
Example: a Person table with the columns Person Name, Favorite Color, and Eye Color. The columns are also called fields or attributes; the rows are also called records.

How to Design a Relational Database
• Find out which information should be stored
• Pay attention to what you want to extract or get out of the database
• Create table groups collecting the information
• Hint: imagine tables as "nouns" and columns as "adjectives" of the nouns
Ramon A. Mata-Toledo and Pauline K. Cushman. Fundamentals of Relational Databases. McGraw-Hill (2000)
Andreas Meier and Michael Kaufmann. SQL & NoSQL Databases. Springer (2019)
Adam Wilbert. Relational Databases: Essential Training. (2019)

Example: Entity Relationship Diagram
Customers: CustomerID, FirstName, LastName, StreetAddress, City, State, Zip
Orders: OrderID, CustomerID, ProductName, Quantity
Adam Wilbert. Relational Databases: Essential Training. (2019)

One-to-Many Relationship: one customer (1) can place many orders (N).
Crow's Foot Notation: the "many" end of the relationship line is drawn with a crow's foot.
Gordon C. Everest, "Basic Data Structure Models Explained With A Common Example", Computing Systems, Proceedings Fifth Texas Conference on Computing Systems, Austin, TX, October 18-19, 1976, pages 39-46.

Example: Entity Relationship Diagram
Products: ProductName, PartNumber, Size, Color, Price, Supplier, QuantityInStock
Suppliers: SupplierName, PhoneNumber, StreetAddress, City, State, Zip
Allen G. Taylor. SQL For Dummies, 9th Edition (2019)
Adam Wilbert. Relational Databases: Essential Training. (2019)
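As a concrete illustration of the CRUD operations listed above, here is a minimal SQL sketch against the Customers table from the ER example; the column values are invented for illustration:

-- Create: insert a new customer record
INSERT INTO Customers (CustomerID, FirstName, LastName, City)
VALUES (1001, 'Hans', 'Schmidt', 'Kaiserslautern');

-- Read: retrieve existing data
SELECT FirstName, LastName FROM Customers WHERE City = 'Kaiserslautern';

-- Update: change values of existing data
UPDATE Customers SET City = 'Berlin' WHERE CustomerID = 1001;

-- Delete: remove the record from the database
DELETE FROM Customers WHERE CustomerID = 1001;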
Database Diagram: Data Types
Example: choosing a type for each column, e.g., Order (Number), Name (Text), BirthDate (Date), Salary (Currency).
Benefits of data types:
• Efficient storage
• Data consistency and improved quality
• Improved performance
Ramon A. Mata-Toledo and Pauline K. Cushman. Fundamentals of Relational Databases. McGraw-Hill (2000)
Andreas Meier and Michael Kaufmann. SQL & NoSQL Databases. Springer (2019)
Adam Wilbert. Relational Databases: Essential Training. (2019)

Data Type Categories
• Character or text: char(5), nchar(20), varchar(100), ...
• Numerical data: tinyint, int, decimal/float
• Currency, times, dates
• Other data types: geographic coordinates, binary files, etc.
Ramon A. Mata-Toledo and Pauline K. Cushman. Fundamentals of Relational Databases. McGraw-Hill (2000)
Andreas Meier and Michael Kaufmann. SQL & NoSQL Databases. Springer (2019)
Allen G. Taylor. SQL For Dummies, 9th Edition (2019)
Adam Wilbert. Relational Databases: Essential Training. (2019)

Example: Data Types

Products:                          Suppliers:
ProductName      varchar(100)     SupplierID     int
PartNumber       int              SupplierName   varchar(100)
Size             varchar(20)      PhoneNumber    char(15)
Color            varchar(20)      StreetAddress  varchar(100)
Price            decimal          City           varchar(50)
Supplier         varchar(100)     State          char(3)
QuantityInStock  int              Zip            char(5)

Allen G. Taylor. SQL For Dummies, 9th Edition (2019)
Adam Wilbert. Relational Databases: Essential Training. (2019)

Primary Key
• Definition: A primary key is the column or columns whose values uniquely identify each row in a table.
• Guaranteed to be unique (forever)
Some ways to define primary keys:
o Natural keys: there may already be unique identifiers in your data
o Composite keys: a concatenation of multiple columns
o Surrogate keys: created just for the database (product ID, ...)
https://www.ibm.com/docs/en/iodg/11.3?topic=reference-primary-keys
Adam Wilbert. Relational Databases: Essential Training. (2019)

Example: Determining a Primary Key

FirstName  LastName  FavoriteColor
Hans       Kurz      Green
Hans       Long      Yellow
Joshua     Müller    Green

None of these columns is guaranteed to stay unique, so a surrogate key is added:

PersonalID (PK)  FirstName  LastName  FavoriteColor
001              Hans       Kurz      Green
002              Hans       Long      Yellow
003              Joshua     Müller    Green

Some Notes on Naming Tables and Columns:
• Consistency
• Capitalization
• Spaces
• Avoid reserved words: command keywords should not be used in your own names
• Avoid acronyms: use full and legible terms
Adam Wilbert. Relational Databases: Essential Training. (2019)

Integrity and Validity
Data integrity ensures that information is identical to its source and has not been accidentally or maliciously altered, modified, or destroyed.
Validation is about the evaluations used to determine compliance and accordance with security requirements.
• Data validation: invalid data is not allowed into the database, and the user receives an error message for inappropriate inputs/values.
• Unique values: when a value may appear only once, e.g., a primary key.
• Business rules: take your organization's constraints into account.
https://oa.mo.gov/sites/default/files/CC-DataIntegrityandValidation.pdf
Adam Wilbert. Relational Databases: Essential Training. (2019)
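Pulling the data types and the primary key together, here is a minimal sketch of the Suppliers table in MS SQL Server syntax; the types follow the example above, while the constraint name is an assumption:

CREATE TABLE Suppliers (
    SupplierID    int          NOT NULL,
    SupplierName  varchar(100) NOT NULL,
    PhoneNumber   char(15),
    StreetAddress varchar(100),
    City          varchar(50),
    State         char(3),
    Zip           char(5),
    -- surrogate primary key: uniquely identifies each row
    CONSTRAINT PK_Suppliers PRIMARY KEY (SupplierID)
);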
Example: Unique Constraint
In the Products table, ProductName varchar(100) should be UNIQUE. When creating the table [Products], you should add the following line:
• In MS SQL Server: CONSTRAINT [UK_Products_ProductName] UNIQUE ( [ProductName] ),
Adam Wilbert. Relational Databases: Essential Training. (2019)

• NULL values: indicate that data is not known, is not specified, or is not applicable.
• NOT NULL: indicates a required column.
Example: should the birthdate of customers or employees be required?

People:
FirstName  Birthdate
Joshua     April 25, 1990
Albert     (NULL)
Valentin   July 14, 2000

Adam Wilbert. Relational Databases: Essential Training. (2019)

Indexes
• Indexes are added to columns that are used in frequent searches.
o Clustered indexes: primary keys
o Non-clustered indexes: all other indexes
• Issues with adding too many indexes:
o They reduce the speed of adding new records
o Note: you can still search on non-indexed fields
Andreas Meier and Michael Kaufmann. SQL & NoSQL Databases. Springer (2019)
Allen G. Taylor. SQL For Dummies, 9th Edition (2019)
Adam Wilbert. Relational Databases: Essential Training. (2019)

Example: Creating an Index
• In MS SQL Server: CREATE INDEX [idx_Suppliers_SupplierName] ON [Suppliers] ([SupplierName])
Adam Wilbert. Relational Databases: Essential Training. (2019)

Check (Integrity) Constraints
These are built directly into the design of the table; entered data is then checked and validated before being saved to the table.
• Numerical checks: can include a range of acceptable values.
• Character checks: can be used to limit the possible cases.
Andreas Meier and Michael Kaufmann. SQL & NoSQL Databases. Springer (2019)
Allen G. Taylor. SQL For Dummies, 9th Edition (2019)
Adam Wilbert. Relational Databases: Essential Training. (2019)

Example: Check Constraint
• In MS SQL Server: [State] char(3) NOT NULL CONSTRAINT CHK_State CHECK (State = 'RLP' OR State = 'NRW'),
Adam Wilbert. Relational Databases: Essential Training. (2019)
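A small usage sketch of how the check constraint behaves at insert time; the supplier names and IDs are invented for illustration:

-- Accepted: 'RLP' satisfies CHK_State
INSERT INTO Suppliers (SupplierID, SupplierName, State)
VALUES (1, 'Pfalz Parts GmbH', 'RLP');

-- Rejected: 'BAV' violates CHK_State, so the RDBMS raises
-- an error and the row is never saved to the table.
INSERT INTO Suppliers (SupplierID, SupplierName, State)
VALUES (2, 'Bavaria Supply AG', 'BAV');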
Relationships

The Pictures Database: the Pictures, People, and Locations tables shown earlier.

Example: Creating Relationships and Primary Keys (PK)
• Pictures: Picture# (PK), FileName, Location, Date
• People: Picture# (PK), Person (PK) (a composite primary key)
• Locations: Location (PK), City, Country
Adam Wilbert. Relational Databases: Essential Training. (2019)

Example: Creating Relationships and Links
• Pictures.Location links to Locations.Location; People.Picture# links to Pictures.Picture#
Adam Wilbert. Relational Databases: Essential Training. (2019)

Example: Creating Relationships and Foreign Keys (FK)
• Pictures: Picture# (PK), FileName, Location (FK), Date
• People: Picture# (PK, FK), Person (PK)
• Locations: Location (PK), City, Country
Adam Wilbert. Relational Databases: Essential Training. (2019)

Some Notes:
• Generally, relationships are created on the foreign key (FK)
• Use the same data types for the FK and PK columns
• The FK and PK columns may have the same or different names
Ramon A. Mata-Toledo and Pauline K. Cushman. Fundamentals of Relational Databases. McGraw-Hill (2000)
Andreas Meier and Michael Kaufmann. SQL & NoSQL Databases. Springer (2019)
Allen G. Taylor. SQL For Dummies, 9th Edition (2019)
Adam Wilbert. Relational Databases: Essential Training. (2019)

Exercise: Create a relationship between the following tables. Hint: first, you need to change one of the columns in the Products table. (A possible solution sketch follows below.)
Products: ProductName varchar(100) UNIQUE, PartNumber int, Size varchar(20), Color varchar(20), Price decimal, Supplier varchar(100), QuantityInStock int
Suppliers: SupplierID int, SupplierName varchar(100), PhoneNumber char(15), StreetAddress varchar(100), City varchar(50), State char(3), Zip char(5)

Optionality and Cardinality
• Optionality: the minimum number of related records, usually 0 or 1.
o If a course must have a responsible person, optionality = 1
o If a course might have a responsible person, optionality = 0
• Cardinality: the maximum number of related records, usually 1 or many (N).
o If a course can have only one responsible person, cardinality = 1
o If a course can have several responsible persons, cardinality = N
• "1 .. N" denotes the range from optionality = 1 (must case) to cardinality = N (unspecified maximum).

Example: Database Diagram and Optionality-Cardinality
In the Pictures database diagram:
• Pictures-People: each picture can appear in People 0 .. N times; each People row refers to exactly one picture (1 .. 1).
• Locations-Pictures: each location can have 0 .. N pictures; each picture has exactly one location (1 .. 1).
Andreas Meier and Michael Kaufmann. SQL & NoSQL Databases. Springer (2019)
Allen G. Taylor. SQL For Dummies, 9th Edition (2019)
Adam Wilbert. Relational Databases: Essential Training. (2019)
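One possible solution to the exercise above, as a hedged sketch in MS SQL Server syntax; the constraint name is an assumption:

-- Replace the textual Supplier column with a SupplierID column
-- that has the same data type as the Suppliers primary key ...
ALTER TABLE Products DROP COLUMN Supplier;
ALTER TABLE Products ADD SupplierID int;

-- ... and create the relationship on the foreign key.
ALTER TABLE Products
  ADD CONSTRAINT FK_Products_Suppliers
  FOREIGN KEY (SupplierID) REFERENCES Suppliers (SupplierID);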
Mapping optionality and cardinality to constraints:
Optionality: 1 = NOT NULL; 0 = NULL allowed.
Cardinality: 1 = UNIQUE constraint; N = no constraint.

Optionality .. Cardinality    Constraints
1 .. 1                        NOT NULL + UNIQUE
0 .. N                        NULL + not unique
1 .. N                        NOT NULL + not unique
0 .. 1                        NULL + UNIQUE

Andreas Meier and Michael Kaufmann. SQL & NoSQL Databases. Springer (2019)
Allen G. Taylor. SQL For Dummies, 9th Edition (2019)
Adam Wilbert. Relational Databases: Essential Training. (2019)

One-to-Many Relationships
Example: Library Database

LibraryUsers:
CardNumber  UserName
50001       Valentin
50002       Laura
50003       Christian

BookLoans:
CardNumber  BookName               CheckoutDate
50001       Operations Research    15.10.2021
50001       SQL & NoSQL Databases  01.11.2021
50001       SQL For Dummies        01.12.2021

Ramon A. Mata-Toledo and Pauline K. Cushman. Fundamentals of Relational Databases. McGraw-Hill (2000)
Andreas Meier and Michael Kaufmann. SQL & NoSQL Databases. Springer (2019)
Allen G. Taylor. SQL For Dummies, 9th Edition (2019)
Adam Wilbert. Relational Databases: Essential Training. (2019)

One-to-One Relationships
Example: Employee Database
Employees: EmployeeID (PK), FirstName, LastName, Position, OfficeNumber
HumanResources: EmployeeID (PK), Salary, JobRating
The relationship between Employees and HumanResources is 1 .. 1 on both ends.
Ramon A. Mata-Toledo and Pauline K. Cushman. Fundamentals of Relational Databases. McGraw-Hill (2000)
Andreas Meier and Michael Kaufmann. SQL & NoSQL Databases. Springer (2019)
Allen G. Taylor. SQL For Dummies, 9th Edition (2019)
Adam Wilbert. Relational Databases: Essential Training. (2019)

Many-to-Many Relationships
Example: Class Schedule Database
Students: StudentID (PK), StudentName
Courses: CourseID (PK), CourseName, RoomName
A student can take 0 .. N courses, and a course can have 0 .. N students.
Ramon A. Mata-Toledo and Pauline K. Cushman. Fundamentals of Relational Databases. McGraw-Hill (2000)
Andreas Meier and Michael Kaufmann. SQL & NoSQL Databases. Springer (2019)
Allen G. Taylor. SQL For Dummies, 9th Edition (2019)
Adam Wilbert. Relational Databases: Essential Training. (2019)

Example: Class Schedule Database, resolving the many-to-many relationship with a junction table:
StudentCourses: CourseID (PK, FK), StudentID (PK, FK), Grade
Each StudentCourses row refers to exactly one student (1 .. 1) and exactly one course (1 .. 1), while each student and each course can appear 0 .. N times.
Ramon A. Mata-Toledo and Pauline K. Cushman. Fundamentals of Relational Databases. McGraw-Hill (2000)
Andreas Meier and Michael Kaufmann. SQL & NoSQL Databases. Springer (2019)
Allen G. Taylor. SQL For Dummies, 9th Edition (2019)
Adam Wilbert. Relational Databases: Essential Training. (2019)

Self Joins (Recursive Relationships)
• Other names: recursive relationship, self-referencing relationship
• Relationship rules and types: the same rules and types as between two tables

Employees Table:
EmployeeID (PK)  Name    SupervisorID
1008             Maxim
1009             Joshua  1008
1020             Sven    1008
1021             Sarah   1020

Ramon A. Mata-Toledo and Pauline K. Cushman. Fundamentals of Relational Databases. McGraw-Hill (2000)
Andreas Meier and Michael Kaufmann. SQL & NoSQL Databases. Springer (2019)
Adam Wilbert. Relational Databases: Essential Training. (2019)

Employee Organizational Chart: Maxim supervises Joshua and Sven; Sven supervises Sarah.
Self Join Diagram: Employees with EmployeeID (PK), Name, and SupervisorID (FK, referencing EmployeeID in the same table)
Check constraint: SupervisorID ≠ EmployeeID
Adam Wilbert. Relational Databases: Essential Training. (2019)
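A minimal sketch of querying the recursive relationship above, joining the Employees table to itself:

-- List each employee together with the supervisor's name.
-- The same table is used twice, under two different aliases.
SELECT e.Name AS Employee,
       s.Name AS Supervisor
FROM   Employees AS e
LEFT JOIN Employees AS s ON s.EmployeeID = e.SupervisorID;
-- LEFT JOIN keeps Maxim, whose SupervisorID is NULL.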
Cascade Updates and Deletes:
Example: in the Pictures database, the Pictures table references the Locations table via the Location column. Suppose the Location key of Paris is changed from 2 to 7 in the Locations table. With cascade updates, the matching Location values in the Pictures rows (Eiffel Tower.jpg and Louvre.jpg) are automatically changed from 2 to 7 as well; cascade deletes propagate the removal of a Locations row to the rows that reference it in the same way.

How to implement cascade changes
• Note that cascade updates and deletes do not concern the insertion of new data
• If you choose to switch off the cascade functionality, you can still protect data integrity, e.g., against accidental changes
• How to activate it depends on the platform you use
• In SQL: ON UPDATE CASCADE and ON DELETE CASCADE
Ramon A. Mata-Toledo and Pauline K. Cushman. Fundamentals of Relational Databases. McGraw-Hill (2000)
Andreas Meier and Michael Kaufmann. SQL & NoSQL Databases. Springer (2019)
Allen G. Taylor. SQL For Dummies, 9th Edition (2019)
Adam Wilbert. Relational Databases: Essential Training. (2019)
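A minimal sketch of a foreign key with cascading behavior, using the Pictures/Locations example; the constraint name is an assumption, and the exact syntax varies slightly by platform:

ALTER TABLE Pictures
  ADD CONSTRAINT FK_Pictures_Locations
  FOREIGN KEY (Location) REFERENCES Locations (Location)
  ON DELETE CASCADE   -- deleting a location removes its pictures
  ON UPDATE CASCADE;  -- renumbering a location also updates Pictures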
Database Normalization
• Normalization consists of a set of rules that describe the proper design of a database.
• The rules for table structure are called "normal forms" (NFs).
• There are first, second, and third normal forms, which should be satisfied in order.
• A database has a good design if it satisfies the "third normal form" (3NF).
First Normal Form (1NF)
• It requires that every field of a table contain a single piece of data.
Ramon A. Mata-Toledo and Pauline K. Cushman. Fundamentals of Relational Databases. McGraw-Hill (2000)
Allen G. Taylor. SQL For Dummies, 9th Edition (2019)
Adam Wilbert. Relational Databases: Essential Training. (2019)

Example 1: 1NF not satisfied (the Person field of picture 005 holds three values)

Picture#  FileName              Person
001       Brandenburg Gate.jpg  Hans
002       Alexanderplatz.jpg    Joshua
003       Eiffel Tower.jpg      Joshua
004       Louvre.jpg            Hans
005       London Eye.jpg        Hans, Joshua, Sarah
006       River Thames.jpg

A bad solution to the issue:

Picture#  FileName              Person1  Person2  Person3
001       Brandenburg Gate.jpg  Hans
002       Alexanderplatz.jpg    Joshua
003       Eiffel Tower.jpg      Joshua
004       Louvre.jpg            Hans
005       London Eye.jpg        Hans     Joshua   Sarah
006       River Thames.jpg

Example 1: satisfying 1NF: split the data into the Pictures and People tables shown earlier, with one People row per person per picture.
Ramon A. Mata-Toledo and Pauline K. Cushman. Fundamentals of Relational Databases. McGraw-Hill (2000)
Allen G. Taylor. SQL For Dummies, 9th Edition (2019)
Adam Wilbert. Relational Databases: Essential Training. (2019)

Example 2: satisfying 1NF
Address: Gottlieb Daimler Str. 42, Kaiserslautern, RLP 67663

Street                 Building  City            State  PostalCode
Gottlieb Daimler Str.  42        Kaiserslautern  RLP    67663

Ramon A. Mata-Toledo and Pauline K. Cushman. Fundamentals of Relational Databases. McGraw-Hill (2000)
Allen G. Taylor. SQL For Dummies, 9th Edition (2019)
Adam Wilbert. Relational Databases: Essential Training. (2019)

Second Normal Form (2NF)
• A table fulfills the 2NF if all fields of the primary key are required to determine the other (non-key) fields.
Example: 2NF is violated when LastName is added to the People table, because LastName depends on Person alone, not on the full (Picture#, Person) key:

People:
Picture# (PK)  Person (PK)  LastName
001            Hans         Schmidt
002            Joshua       Schmidt
003            Joshua       Schmidt
004            Hans         Schmidt
005            Hans         Schmidt
005            Joshua       Schmidt
005            Sarah        Woods

Example: satisfying 2NF: move the name attributes into their own Person table:

People:
Picture# (PK)  Person (PK)
001            1
002            2
003            2
004            1
005            1
005            2
005            3

Person:
Person (PK)  FirstName  LastName
1            Hans       Schmidt
2            Joshua     Schmidt
3            Sarah      Woods

Third Normal Form (3NF)
• A table fulfills 3NF if all non-key fields are independent of any other non-key field.
Example: 3NF is violated, because the non-key field StateName depends on the non-key field StateAbbv:

Person:
Person (PK)  FirstName  LastName  StateAbbv  StateName
1            Hans       Schmidt   RLP        Rheinland-Pfalz
2            Joshua     Schmidt   BWG        Baden-Württemberg
3            Sarah      Woods     NRW        Nordrhein-Westfalen

Ramon A. Mata-Toledo and Pauline K. Cushman. Fundamentals of Relational Databases. McGraw-Hill (2000)
Allen G. Taylor. SQL For Dummies, 9th Edition (2019)
Adam Wilbert. Relational Databases: Essential Training. (2019)

Example: satisfying 3NF: move the state names into a separate State table:

Person:
Person (PK)  FirstName  LastName  StateAbbv
1            Hans       Schmidt   RLP
2            Joshua     Schmidt   BWG
3            Sarah      Woods     NRW

State:
StateAbbv (PK)  StateName
RLP             Rheinland-Pfalz
BWG             Baden-Württemberg
NRW             Nordrhein-Westfalen

Denormalization
• The objective of the normalization process is to remove redundant information from the database and make the database work properly.
• In contrast, the aim of denormalization is to introduce redundancy: it deliberately violates one of the normal forms when the database designer has a good reason to do so, e.g., to increase performance in some application contexts.
Allen G. Taylor. SQL For Dummies, 9th Edition (2019)
Adam Wilbert. Relational Databases: Essential Training. (2019)

Example: denormalizing a table: the normalized Person and State tables above are merged back into one table:

Person (PK)  FirstName  LastName  StateAbbv  StateName
1            Hans       Schmidt   RLP        Rheinland-Pfalz
2            Joshua     Schmidt   BWG        Baden-Württemberg
3            Sarah      Woods     NRW        Nordrhein-Westfalen

Other Types of Databases

Graph Databases (GD)
• Nodes and edges are used to store information.
• In a GD, each node can have relationships with (be connected to) any other node.
• In a GD, nodes can be used to represent different kinds and types of information.
• Application area: typically used for modeling, representing, and studying social networks.
Andreas Meier and Michael Kaufmann. SQL & NoSQL Databases. Springer (2019)
Allen G. Taylor. SQL For Dummies, 9th Edition (2019)
Adam Wilbert. Relational Databases: Essential Training. (2019)
Example: a social graph whose nodes represent people (with properties such as birthplace, age, gender, job title, and salary) and whose relationships include "married to", "friends with", "parent of", and "has played".

Document Databases (DD)
• Document databases are used to store documents, where each document represents a single object
• Document databases support files in different formats
• In document databases, a variety of operations can be performed on documents, e.g., reading, classifying, ....
Andreas Meier and Michael Kaufmann. SQL & NoSQL Databases. Springer (2019)
Allen G. Taylor. SQL For Dummies, 9th Edition (2019)
Adam Wilbert. Relational Databases: Essential Training. (2019)

Example: Document Databases

NoSQL Databases (not relational)

Part 1, Section 3 → Graph- and Network-based Data Models: Modeling and Querying Structured Relations of Objects

Part 1: Organizing the "Data Lake" (from data mining to data fishing), agenda as above.

Summary: In this section, we will see:
• A short introduction.
• What is a "graph database"?
• Why do we need "graph databases"?
• How can we use a "graph database"?
If you are interested in this topic, for further and deeper knowledge on graph databases, please refer to the references and books on graph databases, e.g.,
● Ian Robinson, Jim Webber, and Emil Eifrem. Graph Databases: New Opportunities for Connected Data. O'Reilly (2015)
● Andreas Meier and Michael Kaufmann. SQL & NoSQL Databases. Springer (2019)
● Dave Bechberger, Josh Perryman. Graph Databases in Action. Manning (2020)

NoSQL Databases
https://hostingdata.co.uk/nosql-database/

Basic structure of a relational database management system.
Andreas Meier and Michael Kaufmann. SQL & NoSQL Databases. Springer (2019)

Variety of sources for Big Data.
Andreas Meier and Michael Kaufmann. SQL & NoSQL Databases. Springer (2019)

NoSQL: Nonrelational SQL, Not Only SQL, No to SQL?
The term NoSQL is basically used for nonrelational data management systems that meet the following conditions:
(1) Data is not stored in table data structures.
(2) The database query language is not SQL.
NoSQL databases support various database models.
Andreas Meier and Michael Kaufmann. SQL & NoSQL Databases. Springer (2019)
Ian Robinson, Jim Webber, and Emil Eifrem. Graph Databases: New Opportunities for Connected Data. O'Reilly (2015)

Basic structure of a NoSQL database management system
• Mostly, NoSQL database management systems use a massively distributed storage architecture.
• Multiple consistency models, e.g., strong consistency, weak consistency, etc.
Andreas Meier and Michael Kaufmann. SQL & NoSQL Databases. Springer (2019)
The definition of NoSQL databases according to the web-based NoSQL Archive:
A web-based storage system is a NoSQL database system if the following requirements are met:
• Model: it does not use the relational database model.
• At least three Vs: volume, variety, and velocity.
• Schema: no fixed database schema.
• Architecture: it supports horizontal scaling and massively distributed web applications.
• Replication: data replication is supported.
• Consistency assurance: consistency is ensured.
NoSQL Archive: http://nosql-database.org/, retrieved February 17, 2015
Andreas Meier and Michael Kaufmann. SQL & NoSQL Databases. Springer (2019)

Three different NoSQL databases.
Andreas Meier and Michael Kaufmann. SQL & NoSQL Databases. Springer (2019)
Ian Robinson, Jim Webber, and Emil Eifrem. Graph Databases: New Opportunities for Connected Data. O'Reilly (2015)
Adam Fowler. NoSQL For Dummies. Wiley (2015)

Graph Databases

A graph is a network composed of nodes (also called vertices) and connections, the so-called edges (or arcs, if directed).
(Figure: an example graph with nodes a-i, illustrating the following terms:)
• Adjacent • Degree of a node • Path • Cycle • Tree • Forest • Connected graph • Disconnected graph • Directed graph
HMM: OR, 7.1 and 7.2

• Graph databases leverage relationships in highly connected data with the objective of generating insights.
• Indeed, when we have connected data of significant size or value, a graph database is the best choice for representing and querying that data.
• Large companies realized the importance of graphs and graph databases long ago, but in recent years graph infrastructures have become more and more common and are now used by many organizations.
• Despite this renaissance of graph data and graph thinking in information management, it is interesting and important to note that graph theory itself is not new.
Ian Robinson, Jim Webber, and Emil Eifrem. Graph Databases: New Opportunities for Connected Data. O'Reilly (2015)

What Is a Graph?
In formal language, a graph is a collection or set of vertices (nodes) and edges that connect the vertices (nodes).
Example: a small social graph.
Ian Robinson, Jim Webber, and Emil Eifrem. Graph Databases: New Opportunities for Connected Data. O'Reilly (2015)

Example: a simple graph model for publishing messages in a social network, with the relationships CURRENT and PREVIOUS.
Question: How can you identify Ruth's timeline?
Ian Robinson, Jim Webber, and Emil Eifrem. Graph Databases: New Opportunities for Connected Data. O'Reilly (2015)

Why Graph Databases? (i) Relational Databases Lack Relationships
• Initially, relational databases were designed to codify tabular structures. Even though they do this task very well, relational databases struggle when they try to model the ad hoc relationships that we encounter in the real world (see the sketch below).
• Relationships do exist in relational databases, but only at the modeling stage, just for the purpose of joining tables. Moreover, this becomes an issue in highly connected domains.
Ian Robinson, Jim Webber, and Emil Eifrem. Graph Databases: New Opportunities for Connected Data. O'Reilly (2015)
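To see the problem concretely, here is a hedged sketch of a friends-of-friends query in SQL, assuming hypothetical Persons(PersonID, Name) and Friendships(PersonID, FriendID) tables; each additional hop requires yet another self-join of Friendships, which quickly becomes expensive and hard to read:

-- Friends of Jim's friends: every hop adds one more join.
SELECT DISTINCT p2.Name
FROM   Persons     p
JOIN   Friendships f1 ON f1.PersonID = p.PersonID
JOIN   Friendships f2 ON f2.PersonID = f1.FriendID
JOIN   Persons     p2 ON p2.PersonID = f2.FriendID
WHERE  p.Name = 'Jim'
  AND  p2.PersonID <> p.PersonID;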
Why Graph Databases? (ii) NoSQL Databases Also Lack Relationships
• Most NoSQL databases store collections of disconnected objects (whether documents, values, or columns).
• In NoSQL databases, we can use aggregate identifiers to add relationships, but this can quickly become excessively expensive.
• Moreover, NoSQL databases don't support operations that point backward.
Ian Robinson, Jim Webber, and Emil Eifrem. Graph Databases: New Opportunities for Connected Data. O'Reilly (2015)

Why Graph Databases? (iii) Graph Databases Embrace Relationships
• In the previous models, if there is any implicit connection in the data, the data models and the databases are blind to it. In the graph world, however, connected data is stored truly as connected data.
• Graph models are flexible in the sense that they allow adding new nodes and new relationships without any need to migrate data or to compromise the existing network, i.e., the original data and its intent remain unchanged.
Ian Robinson, Jim Webber, and Emil Eifrem. Graph Databases: New Opportunities for Connected Data. O'Reilly (2015)

Example:
Ian Robinson, Jim Webber, and Emil Eifrem. Graph Databases: New Opportunities for Connected Data. O'Reilly (2015)

Graph Databases
A graph database management system (or, for short, a graph database) is an online database management system that can perform Create, Read, Update, and Delete (CRUD) operations on a graph data model.
https://www.avolutionsoftware.com/abacus/the-graph-database-advantage-for-enterprise-architects/
Ian Robinson, Jim Webber, and Emil Eifrem. Graph Databases: New Opportunities for Connected Data. O'Reilly (2015)

Properties of Graph Databases
• (a) The underlying storage: some graph databases use native graph storage, which is optimized and designed for storing and managing graphs; other graph databases do not use native graph storage.
• (b) The processing engine: in some definitions, the connected nodes of a graph physically "point" to each other in the database. Such a graph database uses so-called index-free adjacency, or native graph processing. In a broader definition, which we use in this course too, a graph database is one that can perform CRUD operations on graph data.
Ian Robinson, Jim Webber, and Emil Eifrem. Graph Databases: New Opportunities for Connected Data. O'Reilly (2015)

Figure: an overview of some of the graph databases.
Ian Robinson, Jim Webber, and Emil Eifrem. Graph Databases: New Opportunities for Connected Data. O'Reilly (2015)

Graph Compute Engines
A graph compute engine is a technology that is used to run graph algorithms (such as identifying clusters, ...) on large datasets.
Some graph compute engines: Cassovary, Pegasus, and Giraph.
https://github.com/twitter/cassovary
http://www.cs.cmu.edu/~pegasus/
https://giraph.apache.org/
Ian Robinson, Jim Webber, and Emil Eifrem. Graph Databases: New Opportunities for Connected Data. O'Reilly (2015)

Data Modeling with Graphs
• Question: how do we model with graphs?
Models and Goals
• What is "modeling"? An abstraction process motivated by a specific goal, purpose, or need.
• How to model? There is no unique, natural way!
Ian Robinson, Jim Webber, and Emil Eifrem. Graph Databases: New Opportunities for Connected Data. O'Reilly (2015)
Labels and Paths in the Graph
• In a given network, nodes might play one or more roles; e.g., some nodes might represent users, whereas others might represent orders or products, etc. To attribute roles to a node, we can use "labels". Since a node can take on various roles (simultaneously), we may need to associate more than one label with a given node.
• Using labels, we can ask the database to perform different tasks, e.g., find all the nodes labeled "Product".
• In a graph model, the natural representation of relationships is "paths". Hence, querying (or traversing) a graph model is done by following paths.
Ian Robinson, Jim Webber, and Emil Eifrem. Graph Databases: New Opportunities for Connected Data. O'Reilly (2015)

The Labeled Property Graph Model
• A labeled property graph is composed of nodes, relationships, labels, and properties.
• Nodes hold properties.
• Nodes can have one or more labels, which group nodes and indicate their role(s).
• Nodes are connected by relationships, which are named, have a direction, and point from a start node to an end node.
• As with nodes, relationships can have properties too.
In addition to a labeled property graph model, we need a query language to create, manipulate, and query data in a graph database.
Ian Robinson, Jim Webber, and Emil Eifrem. Graph Databases: New Opportunities for Connected Data. O'Reilly (2015)

Which Query Language?
• Which graph database query language? → Cypher
• Why Cypher? → Standard and widely deployed, easy to learn and understand (especially if you have a background in SQL).
• There are other graph database query languages, e.g., SPARQL and Gremlin.
https://neo4j.com/developer/cypher/
Ian Robinson, Jim Webber, and Emil Eifrem. Graph Databases: New Opportunities for Connected Data. O'Reilly (2015)

Querying Graphs: An Introduction to Cypher
• Using Cypher, we can ask the database to search for data that matches a given pattern.
Identifiers: Ian, Jim, and Emil
Example: ASCII-art representation of this diagram in Cypher:
(emil)<-[:KNOWS]-(jim)-[:KNOWS]->(ian)-[:KNOWS]->(emil)
Ian Robinson, Jim Webber, and Emil Eifrem. Graph Databases: New Opportunities for Connected Data. O'Reilly (2015)

Some Notation:
• We draw nodes with parentheses.
• We draw relationships with --> and <-- (where < and > indicate the direction of the relationship). Between the dashes, the relationship name is set off by square brackets [] and prefixed with a colon.
• Similarly, we prefix node labels with a colon.
• We use curly braces, i.e., {}, to specify node (and relationship) property key-value pairs.
Ian Robinson, Jim Webber, and Emil Eifrem. Graph Databases: New Opportunities for Connected Data. O'Reilly (2015)

(emil:Person {name:'Emil'})
 <-[:KNOWS]-(jim:Person {name:'Jim'})
 -[:KNOWS]->(ian:Person {name:'Ian'})
 -[:KNOWS]->(emil)
• Identifiers: Ian, Jim, and Emil
• Property: name
• Label: Person
Example: the identifier "emil" is assigned to a node in the dataset; this node has the label "Person" and a "name" property with the value "Emil".
Ian Robinson, Jim Webber, and Emil Eifrem. Graph Databases: New Opportunities for Connected Data. O'Reilly (2015)

Cypher is made up of clauses/keywords, for example a MATCH clause followed by a RETURN clause.
Example 1: Find Person nodes in the graph that have a name of 'Tom Hanks'. MATCH (tom:Person {name: 'Tom Hanks'}) RETURN tom Example 2: Find the Movie nodes that Tom Hanks has directed. MATCH (:Person {name: 'Tom Hanks'})-[:DIRECTED]->(movie:Movie) RETURN movie https://neo4j.com/developer/cypher/querying/ https://gist.github.com/DaniSancas/1d5265fc159a95ff457b940fc5046887 Ian Robinson, Jim Webber, and Emil Eifrem. Graph Databases: New Opportunities for Connected Data. O'Reilly (2015) Basics of Data Science 31 Example 3: In the following Cypher query, we use these clauses to find the mutual friends of a user whose name is Jim. MATCH (a:Person {name:'Jim'})-[:KNOWS]->(b)-[:KNOWS]->(c), (a)-[:KNOWS]->(c) RETURN b, c Alternatively: MATCH (a:Person)-[:KNOWS]->(b)-[:KNOWS]->(c), (a)-[:KNOWS]->(c) WHERE a.name = 'Jim' RETURN b, c Ian Robinson, Jim Webber, and Emil Eifrem. Graph Databases: New Opportunities for Connected Data. O'Reilly (2015) Basics of Data Science 32 The RETURN clause Using this clause, we specify which nodes, properties, and relationships in the matched data must be returned to the user. Some other Cypher clauses • WHERE: filters results that match a pattern. • CREATE and CREATE UNIQUE: create nodes and relationships. • DELETE: removes nodes, properties, and relationships. • FOREACH: performs an update action on a list. • START: specifies one or more explicit starting points (i.e., nodes or relationships) in the given graph. Ian Robinson, Jim Webber, and Emil Eifrem. Graph Databases: New Opportunities for Connected Data. O'Reilly (2015) Basics of Data Science 33 Appendix: • Some definitions from graph theory (for self-study) Basics of Data Science 34 A graph G is a network composed of nodes (also called vertices) and connections, the so-called edges (or arcs if directed). We may denote such a graph by G=(V,E), where V is the set of vertices (nodes) and E is the set of edges. Suppose that the graph G has n vertices and m edges, and let V = {v1, …, vn} be the set of vertices and E = {e1, …, em} be the set of edges. Each edge is defined by two nodes, for example: e1 = (v1, v2). Two nodes vi and vj are adjacent if they are connected by an edge. Basics of Data Science 35 (*) The degree of a vertex is the number of edges incident to it. Example: the degree of v3 is 3. (*) A path is a sequence of edges that connects a sequence of (adjacent) vertices. Example: v2, v1, v3, v4 Basics of Data Science 36 (*) A cycle is a sequence of vertices starting and ending at the same vertex. The length of a cycle is the number of edges in the cycle. Example: {v1, v2, v3, v1} defines a cycle of length 3. Basics of Data Science 37 Some definitions: (*) Connected graph: There is a path between any two vertices. (*) Disconnected graph: There are at least 2 vertices such that there is no path connecting them. Basics of Data Science 38 (*) Tree: Any connected graph that has no cycle. Example: (*) Forest: a set of trees. Example: Basics of Data Science 39 (*) Complete graph (or a clique): All vertices are adjacent to each other. Example: (*) Planar graph: The graph can be drawn in the plane such that its edges do not cross (i.e., do not overlap). Example: Basics of Data Science 40 (*) Directed graph: each edge has a direction, i.e., it points from one vertex to another. (*) Weighted graph: a graph whose vertices or edges have been assigned weights. Example: a graph whose edges carry the weights 20, 15, 5, and 50. Basics of Data Science 41
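To connect Cypher to the course's Python toolchain, here is a hedged sketch of running the mutual-friends query from Example 3 with the official neo4j Python driver; the connection URI, the credentials, and the existence of a suitable dataset are assumptions made for illustration:

from neo4j import GraphDatabase

# Placeholder connection details; adjust to your own Neo4j instance.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

query = """
MATCH (a:Person {name:'Jim'})-[:KNOWS]->(b)-[:KNOWS]->(c),
      (a)-[:KNOWS]->(c)
RETURN b.name AS friend1, c.name AS friend2
"""

with driver.session() as session:
    for record in session.run(query):       # run the Cypher query
        print(record["friend1"], record["friend2"])

driver.close()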
Prof. Dr. Oliver Wendt Dr. habil. Mahdi Moeini Business Information Systems & Operations Research Basics of Data Science Summer Semester 2022 Part 1, Section 4 → Information Retrieval: Document Mining and Querying of ill-structured Data Part 1: Organizing the "Data Lake" (from data mining to data fishing) • Relational Database Models: Modeling and Querying Structured Attributes of Objects • Graph- and Network-based Data Models: Modeling and Querying Structured Relations of Objects • Information Retrieval: Document Mining and Querying of ill-structured Data • Streaming Data and High Frequency Distributed Sensor Data • The Semantic Web: Ontologist's Dream (or nightmare?) of how to integrate evolving heterogeneous data lakes Basics of Data Science 2 Summary: In this section, we will see: • A short introduction • What is a "document database"? • Why do we need "document databases"? • How can we use a "document database"? Basics of Data Science 3 Information Retrieval (IR): Information retrieval might be defined as follows: • "Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers)." • The term "unstructured data" refers to data that does not have a structure which is clear, semantically apparent, and easy for a computer to understand. This is unlike what we find in relational databases. • Information retrieval also includes supporting users in processing, browsing, filtering, or clustering collections of (retrieved) documents. Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze. An Introduction to Information Retrieval. Cambridge University Press (2009) Basics of Data Science 4 Examples: • In a web search, an IR system should find something (text, …) out of billions of documents that are stored on millions/billions of servers and computers. • Personal information retrieval: Email programs contain not only search features but also text classification, e.g., a spam filter that diverts junk e-mails to specific folder(s). Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze. An Introduction to Information Retrieval. Cambridge University Press (2009) Basics of Data Science 5 Document Databases Basics of Data Science 6 Document Databases (DD) or Document Stores • The focus of document databases is on storage and access methods that are optimized for documents, rather than for the rows or records that are common in a relational database management system (RDBMS). • Document databases are used to store documents, where each document represents a single object. • DDs support files in different formats, and a variety of operations can be performed on documents, e.g., reading, classifying, putting in a collection, …. • No need to have a final structure in advance: the organization of a DD comes completely from the individual documents that are stored in it. Adam Wilbert. Relational Databases: Essential Training. (2019) Basics of Data Science 7 Example: Document Databases Basics of Data Science 8 Document Database Solutions • There are several platforms for document databases, e.g., MongoDB, MarkLogic, CouchDB (Apache), etc. • MongoDB (https://www.mongodb.com/) is probably the most popular document database system. MongoDB integrates extremely well with Python (PyMongo). • MongoDB is a schema-free, document-oriented database that uses collection-oriented storage; collections are analogous to tables in a relational database. Each collection contains documents (possibly nested), and a document is a set of fields, each one being a key-value pair.
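Since the slides mention PyMongo, here is a minimal sketch of these ideas in Python; the database and collection names and the sample document are invented for illustration:

from pymongo import MongoClient

# Connect to a local MongoDB instance (the URI is a placeholder).
client = MongoClient("mongodb://localhost:27017/")
db = client["course_demo"]     # database
lectures = db["lectures"]      # collection (analogous to a table)

# Insert a document; no schema has to be declared in advance.
lectures.insert_one({"topic": "Document Databases", "week": 4,
                     "readings": [{"author": "Wilbert", "year": 2019}]})

# Query by a field of the (possibly nested) document.
doc = lectures.find_one({"topic": "Document Databases"})
print(doc["week"])  # 4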
Olivier Curé and Guillaume Blin (editors). RDF Database Systems: Triples Storage and SPARQL Query Processing (Chapter 2). Elsevier (2015) Basics of Data Science 9 Prof. Dr. Oliver Wendt Dr. habil. Mahdi Moeini Business Information Systems & Operations Research Basics of Data Science Summer Semester 2022 Part 1, Section 5 → Streaming Data and High Frequency Distributed Sensor Data Part 1: Organizing the "Data Lake" (from data mining to data fishing) • Relational Database Models: Modeling and Querying Structured Attributes of Objects • Graph- and Network-based Data Models: Modeling and Querying Structured Relations of Objects • Information Retrieval: Document Mining and Querying of ill-structured Data • Streaming Data and High Frequency Distributed Sensor Data • The Semantic Web: Ontologist's Dream (or nightmare?) of how to integrate evolving heterogeneous data lakes Basics of Data Science 2 Summary: In this section, we will see: • A short introduction • What is "streaming data"? • What are the challenges of "streaming data"? • Notes on query languages for streaming data. Basics of Data Science 3 • Big Data: we might define big data as the case when the dataset is so large that we cannot manage it without nonconventional technologies or algorithms to extract information and knowledge. • Big data can be characterized by the three "V"s of big data management, i.e., Volume (more and more data), Variety (different types of data), and Velocity (arriving continuously). • According to Gartner, the three-V concept is summarized as follows: "high volume, velocity and variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making." Albert Bifet, Ricard Gavaldà, Geoff Holmes, Bernhard Pfahringer. Machine Learning for Data Streams with Practical Examples in MOA. MIT Press (2018) Doug Laney. 3-D Data Management: Controlling Data Volume, Velocity and Variety. META Group Research Note, 2001. https://www.gartner.com/ Basics of Data Science 4 There are some other Vs that have been added: • Variability: The structure of the data changes over time. • Value: we consider data as a valuable object only if it helps us in making better decisions. • Validity and Veracity: It is important to notice that some or all of the data might not be fully reliable, and it is an important task to manage and control this uncertainty. Albert Bifet, Ricard Gavaldà, Geoff Holmes, Bernhard Pfahringer. Machine Learning for Data Streams with Practical Examples in MOA. MIT Press (2018) Basics of Data Science 5 Using big data technologies has the objective of improving service and life quality, for example: • Business: big data can be used to improve service quality through customer personalization and churn detection. • Technology: using big data technologies, we can reduce the processing time of data from days and hours to just a few seconds. • Health: by mining the medical information and records of people, we can monitor health conditions. • Smart cities: collecting huge volumes of data and processing them effectively permits us to ensure sustainable economic development, better use of natural resources, and a higher quality of life. Albert Bifet, Ricard Gavaldà, Geoff Holmes, Bernhard Pfahringer. Machine Learning for Data Streams with Practical Examples in MOA. MIT Press (2018) Basics of Data Science 6 Real-Time Analytics (on demand vs.
continuous) Real-time analytics is a particular case of big data. According to Gartner, "real-time analytics is the discipline that applies logic and mathematics to data to provide insights for making better decisions quickly". Data Streams: As an algorithmic abstraction in real-time analytics, a data stream is a possibly infinite sequence of items or elements, where each item has a timestamp and a temporal order. In such a sequence, the items arrive one by one, and our objective consists in developing algorithms that make predictions or detect patterns in real time. https://www.gartner.com/en/information-technology/glossary/real-time-analytics Albert Bifet, Ricard Gavaldà, Geoff Holmes, Bernhard Pfahringer. Machine Learning for Data Streams with Practical Examples in MOA. MIT Press (2018) Basics of Data Science 7 Time, Memory, and Accuracy: In the stream mining process, we are interested in algorithms that require short computation time and a low volume of memory while achieving the highest possible accuracy. Applications: Streaming data arise in many contexts, for example: • Sensor data and the Internet of Things: we find sensors almost everywhere, in industry and in our cities. • Telecommunication data: with billions of phones in the world, telecommunication companies collect a huge amount of phone call data. Albert Bifet, Ricard Gavaldà, Geoff Holmes, Bernhard Pfahringer. Machine Learning for Data Streams with Practical Examples in MOA. MIT Press (2018) Basics of Data Science 8 Applications (cntd.): • Social media: by using social networks, e.g., Facebook, Twitter, Instagram, and LinkedIn, we continuously produce data for the corresponding companies. • Marketing and e-commerce: online businesses collect a huge amount of data in real time, which can be used for different purposes, e.g., fraud detection. • Epidemics and disasters: data streams can be used in detecting epidemics and natural disasters. • Electricity demand prediction: energy providers want to know the demand quantity in advance. Albert Bifet, Ricard Gavaldà, Geoff Holmes, Bernhard Pfahringer. Machine Learning for Data Streams with Practical Examples in MOA. MIT Press (2018) Basics of Data Science 9 • Example: An electrical grid, where each dot represents a sensor. Joao Gama. Knowledge Discovery from Data Streams. Chapman and Hall/CRC (2010) Basics of Data Science 10 • Hardware technology: makes it possible to collect and store data continuously. • Sources of continuous data generation: surfing on the internet, using a phone or credit card, etc. • Challenges: o Due to the high-speed nature of data streams, data processing can be done in only one pass. o Temporal locality: stream data may evolve over time. o Unbounded memory requirements: huge (unlimited) volume of generated data streams. Charu C. Aggarwal. Data Streams: Models and Algorithms. Springer (2007) Basics of Data Science 11 • Tradeoff between Accuracy and Efficiency: The algorithm must ensure a tradeoff between the accuracy of the result and the computation time and required space (memory). • The data are not independent and identically distributed. • Visualization: It is a big challenge to effectively present numerical results and information obtained from a huge amount of data. • Hidden big data: A large amount of potentially useful data is not used, for many reasons. Charu C. Aggarwal. Data Streams: Models and Algorithms. Springer (2007) Mohamed Medhat Gaber, Arkady Zaslavsky and Shonali Krishnaswamy.
A Survey of Classification Methods in Data Streams. In: Data Streams: Models and Algorithms. Springer (2007) Joao Gama. Knowledge Discovery from Data Streams. Chapman and Hall/CRC (2010) Albert Bifet, Ricard Gavaldà, Geoff Holmes, Bernhard Pfahringer. Machine Learning for Data Streams with Practical Examples in MOA. MIT Press (2018) Basics of Data Science 12 • Data stream: The term data stream is used for a countably infinite sequence of items/elements with the objective of representing data elements that become available progressively over time. • A stream-based application: analyzes elements that become available from streams in order to instantly produce new results, with the objective of providing fast reactions where required. • Types of data stream models: o Structured: data elements exhibit a certain schema or format. o Unstructured: may contain arbitrary formats and contents. Alessandro Margara, Tilmann Rabl. Definition of Data Streams. In: Encyc. Big Data Tech., Springer (2019) Basics of Data Science 13 • Types of structured streams: I. The turnstile model: this is the most general model, where a vector of elements is used to model the stream. Each element of the stream is an update to an entry of the underlying vector, whose size is the size of the domain of the streaming elements. II. The cash register model: in this model, stream elements can only be added to the underlying vector. III. The time series model: this model considers each stream element as a new and independent entry to the vector. Consequently, the underlying vector is constantly growing and, in general, can be unbounded. This model is frequently used in current stream processing engines. Alessandro Margara, Tilmann Rabl. Definition of Data Streams. In: Encyc. Big Data Tech., Springer (2019) Basics of Data Science 14 • Stream processing: consists in analyzing streaming data on the fly with the objective of producing updated results as soon as new data are received. • "Time" in stream processing: in many stream processing applications, time plays a central role, i.e., either we need to update the results to take recent data into account, or we want to detect temporal trends. • "Windows" in stream processing: windows are used to define bounded segments of elements over an unbounded data stream. Alessandro Margara, Tilmann Rabl. Definition of Data Streams. In: Encyc. Big Data Tech., Springer (2019) Basics of Data Science 15 • "Windows" are used to compute some statistical information, e.g., the average of some input data. • The most common types of windows: o count-based windows: the size is defined in terms of the number of elements, o time-based windows: the size is defined in terms of a time frame. • In both types, we have: o sliding windows: the window progresses continuously upon arrival of new data elements, o tumbling windows: the window collects multiple elements and then moves forward by its full length, so consecutive windows do not overlap. Alessandro Margara, Tilmann Rabl. Definition of Data Streams. In: Encyc. Big Data Tech., Springer (2019) https://docs.microsoft.com/en-us/stream-analytics-query/tumbling-window-azure-stream-analytics Basics of Data Science 16 Examples: https://docs.microsoft.com/en-us/stream-analytics-query/sliding-window-azure-stream-analytics https://docs.microsoft.com/en-us/stream-analytics-query/tumbling-window-azure-stream-analytics Nicoló Rivetti. Introduction to Stream Processing Algorithms. In: Encyc. Big Data Tech., Springer (2019) Datar M, Gionis A, Indyk P, Motwani R. Maintaining stream statistics over sliding windows. SIAM J. Comput. 31(6):1794–1813 (2002) Basics of Data Science 17
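To make the window types concrete, here is a small, self-contained Python sketch (not from the slides) of a count-based sliding window and a count-based tumbling window, each computing an average over a toy stream:

from collections import deque

stream = [3, 5, 4, 8, 6, 2, 7, 9]
size = 3

# Count-based sliding window: progresses by one element at a time.
window = deque(maxlen=size)
for x in stream:
    window.append(x)
    if len(window) == size:
        print("sliding avg:", sum(window) / size)

# Count-based tumbling window: collects `size` elements, then moves
# forward by its full length, so consecutive windows do not overlap.
buffer = []
for x in stream:
    buffer.append(x)
    if len(buffer) == size:
        print("tumbling avg:", sum(buffer) / size)
        buffer.clear()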
Monitoring massive data streams: There are two main approaches: • Sampling: In sampling approaches, all elements are read once, but only a subset of them is kept for further processing. There are several methods for selecting the samples, which are expected to be representative. In a competitive market, it might be necessary to keep the sampling policy secret; otherwise, an adversary can benefit from it. • Summaries: A summary approach scans each piece of stream input data on the fly and locally keeps compact sketches or synopses that contain the most representative and important information. Albert Bifet, Ricard Gavaldà, Geoff Holmes, Bernhard Pfahringer. Machine Learning for Data Streams with Practical Examples in MOA. MIT Press (2018) Nicoló Rivetti. Introduction to Stream Processing Algorithms. In: Encyc. Big Data Tech., Springer (2019) Basics of Data Science 18 Query languages for processing streaming data: • An essential difference between a stream query language and a conventional one: stream queries continue to produce answers as new elements arrive; hence, queries are stored and their results evolve over time. • Most stream query languages try to extend SQL. o Academic languages: CQL, SQuAl, ESL, etc. o Commercial languages: StreamSQL, CCL, EQL, StreaQuel, etc. • The differences between stream query languages lie mainly in their approach to addressing the requirements of stream processing: language closure, windowing, correlation, and pattern matching. Mitch Cherniack and Stan Zdonik. Stream-Oriented Query Languages and Operators. In: Encyclopedia of Database Systems. Springer (2009) Basics of Data Science 19 Prof. Dr. Oliver Wendt Dr. habil. Mahdi Moeini Business Information Systems & Operations Research Basics of Data Science Summer Semester 2022 Part 1, Section 6 → The Semantic Web: Ontologist's Dream (or nightmare?) of how to integrate evolving heterogeneous data lakes Part 1: Organizing the "Data Lake" (from data mining to data fishing) • Relational Database Models: Modeling and Querying Structured Attributes of Objects • Graph- and Network-based Data Models: Modeling and Querying Structured Relations of Objects • Information Retrieval: Document Mining and Querying of ill-structured Data • Streaming Data and High Frequency Distributed Sensor Data • The Semantic Web: Ontologist's Dream (or nightmare?) of how to integrate evolving heterogeneous data lakes Basics of Data Science 2 Summary: In this section, we will see: • What is a "Data Lake"? • What is an "Ontology"? • What is the "Semantic Web"? • What is "Data Integration"? • How to integrate evolving heterogeneous data sources? Basics of Data Science 3 Data lake Basics of Data Science 4 Definition: a data lake is a data repository in which we store different datasets coming from multiple sources, in their original structures. Data Lakes versus Data Warehouses: • A data warehouse is a database that is optimized to analyze relational data coming from transactional systems. Moreover, (i) the data is cleaned, and (ii) the data structure and schema are defined in advance. • A data lake stores relational as well as non-relational data. In a data lake, when data is captured, we are not aware of the structure or schema of the data. Christoph Q. and Rihan H., Data Lake. In: Encyc.
Big Data Tech., Springer (2019) https://aws.amazon.com/big-data/datalakes-and-analytics/what-is-a-data-lake/ Basics of Data Science 5 Tasks and features of a "data lake": • Extracting data and metadata from multiple, heterogeneous sources. • Ingesting the extracted data into a storage system. • Transforming, cleaning, and integrating data with other datasets. • Providing the possibility to explore and to query the data and metadata. Christoph Q. and Rihan H., Data Lake. In: Encyc. Big Data Tech., Springer (2019) https://lakefs.io/data-lakes/ Basics of Data Science 6 Data lake architecture Four layers: • ingestion, • storage, • transformation, • interaction. Christoph Q. and Rihan H., Data Lake. In: Encyc. Big Data Tech., Springer (2019) Basics of Data Science 7 Ingestion Layer: This layer has the responsibility of importing data. More precisely, ingesting data and extracting metadata should be done automatically (as far as possible). Here, data quality (DQ) control is used to ensure the quality of the ingested data. Christoph Q. and Rihan H., Data Lake. In: Encyc. Big Data Tech., Springer (2019) Basics of Data Science 8 Storage Layer: The main components are: • the metadata repository: it stores all the metadata of the data lake, no matter whether it has been partially collected automatically or will be added manually later. • the raw data repositories: the data are ingested in their original formats, and we need different storage systems for different data types, e.g., for relational, XML, graph, etc. There should be a data access interface for querying. Christoph Q. and Rihan H., Data Lake. In: Encyc. Big Data Tech., Springer (2019) Basics of Data Science 9 Transformation Layer: This layer transforms the raw data into a desired target structure. For this purpose, a data lake has a data transformation engine where data can be cleaned, transformed, and integrated in a scalable way. Christoph Q. and Rihan H., Data Lake. In: Encyc. Big Data Tech., Springer (2019) Basics of Data Science 10 Interaction Layer: The focus of this layer is on interactions between users and the data lake. The components data exploration and metadata manager are in close relationship to provide access to and exploration of the data by the users. Christoph Q. and Rihan H., Data Lake. In: Encyc. Big Data Tech., Springer (2019) Basics of Data Science 11 Ontology Basics of Data Science 12 Terminology of Ontology: Originating from ancient Greek (from philosophy), the term ontology is composed of on (genitive ontos), meaning "being", and logia, which refers to "science" or "study". Hence, the term ontology might be interpreted as "the study of being". In recent years, computer scientists have adopted the term ontology. Ontology: Classically, in artificial intelligence (AI), ontologies were defined as a kind of "knowledge representation" or "knowledge model". In computer science, an ontology can be defined as "a set of representational primitives with which to model a domain of knowledge or discourse". In this definition, primitives refer to "concepts and relations" or "classes and attributes" (or properties or other things that define relations between elements/terms). Eva Blomqvist. Ontologies for Big Data. In: Encyc. Big Data Tech., Springer (2019) Gruber T. Ontology. In: Encyclopedia of Database Systems. Springer (2009) Grigoris Antoniou, Paul Groth, Frank van Harmelen, Rinke Hoekstra. A Semantic Web Primer.
MIT Press (2012) Basics of Data Science 13 An example: in the setting of a university, faculty members, staff, students, courses, lecture rooms, and disciplines are some important classes of concepts for which we can define different relationships. In this context, ontologies may also contain information like: ● Properties, e.g., "A" teaches "B", ● Restrictions, e.g., only professors can have PhD students, ● Statements, e.g., professors and other staff are disjoint. In the web context, by using ontologies we create a shared understanding of a given domain. The most important ontology languages: ● Resource Description Framework (RDF): a vocabulary description language. ● Web Ontology Language (OWL): a richer vocabulary description language. Grigoris Antoniou, Paul Groth, Frank van Harmelen, Rinke Hoekstra. A Semantic Web Primer. MIT Press (2012) Basics of Data Science 14 An example of an RDF graph: Grigoris Antoniou, Paul Groth, Frank van Harmelen, Rinke Hoekstra. A Semantic Web Primer. MIT Press (2012) Basics of Data Science 15 Applications: The following list includes some of the tasks for which ontologies can be used: • Data integration: ontologies can act as a model that unifies the representation and linking of various datasets. • Data access: ontologies can act as vocabularies with the objective of understanding and querying datasets. • Data analysis, cleaning, and constraint checking: ontologies provide opportunities for performing analytical queries. • Integration with ML approaches: ontologies can be used as a structure for input and output features. Gruber T. Ontology. In: Encyclopedia of Database Systems. Springer (2009) Basics of Data Science 16 Semantic Web Basics of Data Science 17 Semantic Web: "to make the web more accessible to computers." More precisely, in its current state, computers use values and keywords to search for information, which is then sent from servers to users. This is all that is done in the current web; all the intelligent work is done by humans. The idea of the Semantic Web (or web of data) consists in making the web richer for machines, i.e., the web becomes a source of machine-readable and machine-understandable data. https://www.merriam-webster.com/dictionary/semantic Grigoris Antoniou, Paul Groth, Frank van Harmelen, Rinke Hoekstra. A Semantic Web Primer. MIT Press (2012) Basics of Data Science 18 The Semantic Web follows these design principles: 1. Creating standard formats for structured and semi-structured data on the web, 2. Making not only datasets but also individual data elements and their relations accessible on the web, 3. Making the semantics of the data explicit and understandable by machines. Semantic Web technology uses labeled graphs, Uniform Resource Identifiers (URIs) to identify the data elements and their relations in the datasets, and ontologies to formally represent the semantics of the data. In this context, RDF and OWL are used as "knowledge representation" languages. Grigoris Antoniou, Paul Groth, Frank van Harmelen, Rinke Hoekstra. A Semantic Web Primer. MIT Press (2012) Basics of Data Science 19 Querying the Semantic Web: SPARQL: a query language, similar in spirit to SQL but specifically designed for RDF, used to select and extract information from knowledge expressed in RDF. Triplestore or RDF store: software that stores RDF data and executes SPARQL queries. A sample SPARQL query (a sketch follows below): Grigoris Antoniou, Paul Groth, Frank van Harmelen, Rinke Hoekstra. A Semantic Web Primer. MIT Press (2012) Basics of Data Science 20
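Because the sample query itself is shown as a figure, here is a small, hedged substitute in Python using the rdflib package; the namespace and the facts in the toy graph are invented for illustration:

from rdflib import Graph, Literal, Namespace, RDF

EX = Namespace("http://example.org/")
g = Graph()

# A toy RDF graph consisting of two triples about a person.
g.add((EX.alice, RDF.type, EX.Professor))
g.add((EX.alice, EX.teaches, Literal("Data Science")))

# A SPARQL query selecting professors and the courses they teach.
query = """
PREFIX ex: <http://example.org/>
SELECT ?person ?course
WHERE {
    ?person a ex:Professor .
    ?person ex:teaches ?course .
}
"""

for person, course in g.query(query):
    print(person, course)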
An example: Grigoris Antoniou, Paul Groth, Frank van Harmelen, Rinke Hoekstra. A Semantic Web Primer. MIT Press (2012) Basics of Data Science 21 Definition: Information integration, or data integration, consists in posing a single query that involves several data sources with the aim of receiving a single answer. Integration-Oriented Ontology: Using an integration-oriented ontology, we want to conceptualize a domain of interest in order to automate data integration from evolving heterogeneous sources of data; this is done using Semantic Web technologies. In this way, we make a connection between domain concepts and the underlying data sources. Then, the ontology-mediated queries of a user (a data analyst) are automatically translated into the corresponding query languages of the available sources. Sergi Nadal and Alberto Abelló. Integration-Oriented Ontology. In: Encyc. Big Data Tech., Springer (2019) Basics of Data Science 22 To implement data-integration settings, we can use Semantic Web technologies. Thanks to the flexibility and simplicity of ontologies, they are used to define a unified interface for heterogeneous environments. In fact, ontologies are structured into two levels: • TBox: represents terminology, • ABox: represents assertions. In this context, the Resource Description Framework (RDF) can be used to represent the knowledge for automated processing. Sergi Nadal and Alberto Abelló. Integration-Oriented Ontology. In: Encyc. Big Data Tech., Springer (2019) Basics of Data Science 23 Example: a query execution in the Semantic Web. Sergi Nadal and Alberto Abelló. Integration-Oriented Ontology. In: Encyc. Big Data Tech., Springer (2019) Basics of Data Science 24 Example: a query execution in integration-oriented ontologies. Sergi Nadal and Alberto Abelló. Integration-Oriented Ontology. In: Encyc. Big Data Tech., Springer (2019) Basics of Data Science 25 Prof. Dr. Oliver Wendt Dr. habil. Mahdi Moeini Business Information Systems & Operations Research Basics of Data Science Summer Semester 2022 Part 2, Section 1 → From linear to non-linear regression models Part 2: Stochastic Models on structured attribute data: • From Linear to Non-Linear Regression models • Support Vector Machines • Deep? Neural Network Models • Learning from Data Streams: Training and Updating Deterministic and Stochastic Models • Reinforcement Learning Basics of Data Science 2 Summary: In this chapter, we will see: • What is "Machine Learning"? • Some Basic Concepts • Regression Analysis • Linear Regression • Validity of the Linear Regression Model • Logistic Regression Basics of Data Science 3 What you should know for this part of the course: • Some statistics • Python • Data collection and data cleaning Basics of Data Science 4 Machine Learning: • Machine learning, or ML, is a field that is devoted to understanding and building methods that learn automatically from data, with the objective of improving performance on some set of tasks. • Different ML methods: supervised, unsupervised, semi-supervised, and reinforcement learning methods. • Example of supervised machine learning: teacher and student. Laura Igual and Santi Seguí. Introduction to Data Science: A Python Approach to Concepts, Techniques and Applications.
Springer (2017) Basics of Data Science 5 Supervised learning problems: • Regression: continuous variables • Classification: categorical variables
House # | Location factor | Building year | Surface
1 | 0.9 | 2000 | 50
2 | 0.8 | 1995 | 120
3 | 1.2 | 1980 | 80
Basics of Data Science 6 General steps of creating an ML model: • Cleaning the available data. • Splitting the data: o Testing dataset: for testing the performance of the model. o Training dataset: for training the model. o Validation dataset: for adjusting the model. Tools: • Python and Python packages (NumPy, pandas, Matplotlib, and scikit-learn) Laura Igual and Santi Seguí. Introduction to Data Science: A Python Approach to Concepts, Techniques and Applications. Springer (2017) Basics of Data Science 7 Some notes on supervised learning: • Supervised machine learning models need our help! o Clean data o Training o Testing • Evaluation metrics: accuracy and precision. o Measured in terms of percentages. • Challenges: preparing and cleaning the data. • Advantages: working with labeled data, low complexity, and easier interpretation. Basics of Data Science 8 Graphical Representation Scatter Plot • A scatter plot (Chambers 1983) reveals relationships or associations between two variables. A scatter plot usually consists of a large body of data. • The relationship between two variables is called correlation. • The closer the data points come to forming a straight line when plotted, the higher the correlation between the two variables, i.e., the stronger the relationship. • If the data points make a straight line going from the origin out to high x- and y-values, then the variables are said to have a positive correlation. • If the line goes from a high value on the y-axis down to a high value on the x-axis, the variables have a negative correlation. Basics of Data Science 9 Graphical Representation Scatter plots are especially useful when there is a large number of data points. They provide the following information about the relationship between two variables: • Strength • Shape: linear, curved, etc. • Direction: positive or negative • Presence of outliers A correlation between the variables results in the clustering of data points along a line. Basics of Data Science 10 Graphical Representation The following is an example of a scatter plot suggestive of a positive linear relationship. Basics of Data Science 11 Basic Statistics Correlation Coefficient (r) It is a coefficient that indicates the strength of the association between any two metric variables. • The sign (+ or -) indicates the direction of the relationship. • The value can range from "+1" to "-1", with: o "+1" indicating a perfect positive relationship, o "0" indicating no relationship, o and "-1" indicating a perfect negative or reverse relationship (as one variable grows larger, the other variable grows smaller). Basics of Data Science 12 Regression Analysis Basics of Data Science 13 Regression Analysis In regression, we want to make predictions. • How is sales volume affected by the weather? • How does the oil price affect the bread price? • How does the oil price affect inflation? • How does the amount of a drug absorbed by the patient's body affect the blood pressure? Common point: we ask for a response (dependent variable) which can be written as a combination of one or more predictors (independent variables). In regression, we build a model to predict the response (dependent variable) from the independent variables. Laura Igual and Santi Seguí. Introduction to Data Science: A Python Approach to Concepts, Techniques and Applications. Springer (2017) Basics of Data Science 14
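Before turning to regression, here is a quick sketch of scatter plots and the correlation coefficient r with the course's Python tools (the sample data are invented):

import numpy as np
import matplotlib.pyplot as plt

x = np.array([1, 2, 3, 4, 5, 6])
y = np.array([2.0, 2.9, 4.2, 4.8, 6.1, 7.0])

r = np.corrcoef(x, y)[0, 1]   # Pearson's correlation coefficient
print(round(r, 3))            # close to +1: strong positive correlation

plt.scatter(x, y)             # the points cluster along a rising line
plt.xlabel("x")
plt.ylabel("y")
plt.show()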
Linear Regression In regression, our objective consists in building a model to describe the relation between the response (dependent variable) $y \in \mathbb{R}$ and a combination of one or more (independent) variables $x_i \in \mathbb{R}$. In a linear regression model, we describe the response $y$ as a linear combination of $m$ variables $x_i$: $y = \beta_1 x_1 + \ldots + \beta_m x_m$. Based on the number of predictors, we have two types of linear regression: 1. Simple Linear Regression: one response and one predictor. 2. Multiple Linear Regression: one response and two or more predictors. Laura Igual and Santi Seguí. Introduction to Data Science: A Python Approach to Concepts, Techniques and Applications. Springer (2017) Sanjeev J. Wagh, Manisha S. Bhende, and Anuradha D. Thakare. Fundamentals of Data Science. CRC Press (2022) Basics of Data Science 15 Linear Regression Simple linear regression: Assume that $n$ samples $(x_1, y_1), \ldots, (x_n, y_n)$ are given; then the regression line is defined as follows: $y = \beta_0 + \beta_1 x$. The parameter $\beta_0$: the intercept or constant term. The parameter $\beta_1$: the slope. For the observations $(x_i, y_i)$ we have $y_i = \beta_0 + \beta_1 x_i + e_i$, where $e_i$ is the error term. Laura Igual and Santi Seguí. Introduction to Data Science: A Python Approach to Concepts, Techniques and Applications. Springer (2017) Sanjeev J. Wagh, Manisha S. Bhende, and Anuradha D. Thakare. Fundamentals of Data Science. CRC Press (2022) Basics of Data Science 16 Linear Regression Example: Laura Igual and Santi Seguí. Introduction to Data Science: A Python Approach to Concepts, Techniques and Applications. Springer (2017) Basics of Data Science 17 Linear Regression Ordinary Least Squares (OLS): an approach to find the values of the $\beta$ parameters by minimizing the squared distance of the predicted values from the actual values: $\|\beta_0 + \beta_1 x - y\|_2^2 = \sum_{i=1}^{n} (\beta_0 + \beta_1 x_i - y_i)^2$. The Residual Sum of Squares (RSS) of the prediction is a quadratic convex function and has a unique global minimum at $\hat{w} = (\hat{\beta}_0, \hat{\beta}_1)$, where $\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$ and $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$, where $\bar{x}$ and $\bar{y}$ are the sample means. Laura Igual and Santi Seguí. Introduction to Data Science: A Python Approach to Concepts, Techniques and Applications. Springer (2017) Basics of Data Science 18
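A minimal NumPy sketch of these closed-form OLS estimates (the data points are invented for illustration):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# beta1_hat = sum((x_i - x_bar)(y_i - y_bar)) / sum((x_i - x_bar)^2)
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
# beta0_hat = y_bar - beta1_hat * x_bar
beta0 = y.mean() - beta1 * x.mean()

print(beta0, beta1)   # fitted regression line: y = beta0 + beta1 * x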
Validity of the Linear Regression Model 1. Linearity: There should be a linear relationship between the response variable and the predictor(s). How to check: • Scatterplot. • Covariance analysis (COV). • Correlation analysis (e.g., Pearson's correlation coefficient). • etc. Sanjeev J. Wagh, Manisha S. Bhende, and Anuradha D. Thakare. Fundamentals of Data Science. CRC Press (2022) https://www.alpharithms.com/simple-linear-regression-modeling-502111/ Basics of Data Science 19 Validity of the Linear Regression Model 2. Normality: The error terms (residuals) should follow a normal distribution. How to check: • Sometimes we can ignore it! • A visual check can be done by Quantile-Quantile (Q-Q) plots. • Other tests: Omnibus test (in Python), Shapiro-Wilk, Kolmogorov-Smirnov, etc. Sanjeev J. Wagh, Manisha S. Bhende, and Anuradha D. Thakare. Fundamentals of Data Science. CRC Press (2022) https://www.alpharithms.com/simple-linear-regression-modeling-502111/ Basics of Data Science 20 Validity of the Linear Regression Model 3. Independence: There should be no correlation among the error terms. If there is correlation among the error terms, this is called "autocorrelation". How to check: • Durbin-Watson test: available in the statsmodels package of Python. • Breusch-Godfrey test. • Ljung-Box test. Sanjeev J. Wagh, Manisha S. Bhende, and Anuradha D. Thakare. Fundamentals of Data Science. CRC Press (2022) https://www.statsmodels.org/stable/generated/statsmodels.stats.stattools.durbin_watson.html Basics of Data Science 21 Validity of the Linear Regression Model 4. No Multicollinearity: The predictors should be independent of each other. In the absence of this independence, there is a multicollinearity issue. How to check: - Sensitivity check of the regression coefficients - Farrar-Glauber test - Condition number test - etc. Sanjeev J. Wagh, Manisha S. Bhende, and Anuradha D. Thakare. Fundamentals of Data Science. CRC Press (2022) https://www.statsmodels.org/stable/generated/statsmodels.stats.stattools.durbin_watson.html Basics of Data Science 22 Validity of the Linear Regression Model 5. Homoscedasticity (constant variance): When the variance of the error terms (residuals) appears constant over a range of predictor variables, the data are said to be homoscedastic. To have a valid linear regression model, we should have "homoscedasticity", i.e., all error terms (residuals) should have the same variance. How to check: - Scatterplot. - Breusch-Pagan test (available in statsmodels in Python). - Levene's test. - Park test, etc. Sanjeev J. Wagh, Manisha S. Bhende, and Anuradha D. Thakare. Fundamentals of Data Science. CRC Press (2022) https://www.statsmodels.org/stable/generated/statsmodels.stats.stattools.durbin_watson.html Basics of Data Science 23 Validity of the Linear Regression Model Example: Heteroscedasticity. https://www.statsmodels.org/stable/generated/statsmodels.stats.stattools.durbin_watson.html Basics of Data Science 24 Nonlinear Regression? Basics of Data Science 25 Relative performance as a function of layer size, resource dimensions and evaluations (panels: 50.000 evaluations and 500.000 evaluations) Basics of Data Science 26 Concentration in network effect markets for low price products Basics of Data Science 27 Concentration in network effect markets for high price products Basics of Data Science 28 Market concentration as a joint function of network centrality and closeness Basics of Data Science 29 Does the surplus for the monopolists depend on the closeness of the network structure? Basics of Data Science 30 Does anticipation in agents' decision making lead to significantly different concentration and overall welfare? Basics of Data Science 31 How to deal with integer covariates? Basics of Data Science 32 Fitness depending on Population Size and Sampling Rate Basics of Data Science 33 Logistic Regression Basics of Data Science 34 Logistic Regression • Logistic regression is a classification approach. • Logistic regression is mainly used for qualitative (categorical) response variables. In fact, for qualitative (categorical) response variables, linear regression is not a reliable option because: o (a) a linear regression method cannot handle qualitative response variables with more than two classes; o (b) a linear regression method does not provide meaningful estimates even if there are only two classes. • Logistic regression is suitable for binary qualitative response values. G. James, Daniela Witten, Trevor Hastie, Robert Tibshirani. An Introduction to Statistical Learning in R. Springer (2021) Basics of Data Science 35 Logistic Regression • Given: a set of training observations $(x_1, y_1), \ldots, (x_n, y_n)$ • We use the given data to build a classifier.
• The question: how should we model the relationship between the predictor $x$ and $p(x) = \mathrm{prob}(y = 1 \mid x)$? • In linear regression, we used $p(x) := y = \beta_0 + \beta_1 x$. • However, this does not fit well to the case of 0/1 classification. G. James, Daniela Witten, Trevor Hastie, Robert Tibshirani. An Introduction to Statistical Learning in R. Springer (2021) Basics of Data Science 36 Logistic Regression Linear versus Logistic Regression G. James, Daniela Witten, Trevor Hastie, Robert Tibshirani. An Introduction to Statistical Learning in R. Springer (2021) Basics of Data Science 37 Logistic Regression To overcome this issue, we need to formulate $p(x)$ using a function that gives us values between 0 and 1, no matter the value of the variable $x$. There are many functions that do this job; however, in logistic regression, the following logistic function, which is nonlinear, is used: $p(x) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}$. With this function, the value of $p(x)$ never drops below 0 and never goes above 1. This function always produces an S-shaped curve between 0 and 1. G. James, Daniela Witten, Trevor Hastie, Robert Tibshirani. An Introduction to Statistical Learning in R. Springer (2021) Laura Igual and Santi Seguí. Introduction to Data Science: A Python Approach to Concepts, Techniques and Applications. Springer (2017) Basics of Data Science 38 Logistic Regression "Odds": After rearranging terms in the logistic function, we obtain: $\frac{p(x)}{1 - p(x)} = e^{\beta_0 + \beta_1 x}$. Here, the quantity $\frac{p(x)}{1 - p(x)}$ is called the "odds". The odds can take on any value between 0 and $\infty$. Indeed, values of the odds close to 0 and $\infty$ indicate very low and very high probabilities of response = 1, respectively. G. James, Daniela Witten, Trevor Hastie, Robert Tibshirani. An Introduction to Statistical Learning in R. Springer (2021) Basics of Data Science 39 Logistic Regression "Log Odds" or "Logit": The left-hand side of the following equation is called the "log odds" or "logit": $\log \frac{p(x)}{1 - p(x)} = \beta_0 + \beta_1 x$. • In linear regression, $\beta_1$ gives the average change in $y$ if we increase the value of $x$ by 1 unit. • In logistic regression, by increasing the value of $x$ by 1 unit, the value of the logit is changed by $\beta_1$. • Such a relation does not hold for $p(x)$. However, if $\beta_1$ is positive (negative), then increasing $x$ will increase (decrease) $p(x)$ too. G. James, Daniela Witten, Trevor Hastie, Robert Tibshirani. An Introduction to Statistical Learning in R. Springer (2021) Basics of Data Science 40 Logistic Regression • The $\beta$ parameters in the logistic regression model have to be estimated. • There are several approaches to achieve this objective. • The preferred approach is the method of maximum likelihood. • The mathematical formulation of the likelihood function: $\ell(\beta_0, \beta_1) = \prod_{i: y_i = 1} p(x_i) \prod_{i': y_{i'} = 0} (1 - p(x_{i'}))$ • Objective: estimating the $\hat{\beta}_0$ and $\hat{\beta}_1$ that maximize this likelihood function. • In Python: logistic regression with scikit-learn and its LogisticRegression class. G. James, Daniela Witten, Trevor Hastie, Robert Tibshirani. An Introduction to Statistical Learning in R. Springer (2021) Basics of Data Science 41
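A minimal scikit-learn sketch of fitting such a model with the LogisticRegression class (the toy data are invented; note that scikit-learn applies L2 regularization by default, so the estimates are penalized maximum likelihood estimates):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy binary classification data: one predictor x, response y in {0, 1}.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression()
clf.fit(X, y)                       # estimates beta0 and beta1

print(clf.intercept_, clf.coef_)    # estimated beta0 and beta1
print(clf.predict_proba([[3.5]]))   # [P(y=0|x), P(y=1|x)] for x = 3.5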
Prof. Dr. Oliver Wendt M.Sc. Manuel Hermes Business Information Systems & Operations Research • Basics of Data Science • Summer Semester 2022 Part 2, Section 2 → Deep? Neural Network Models Part 2: Stochastic Models on structured attribute data: • From Linear to Non-Linear Regression models • Deep? Neural Network Models • Support Vector Machines • Learning from Data Streams: Training and Updating Deterministic and Stochastic Models • Reinforcement Learning Basics of Data Science – ANN 2 Artificial Neural Networks (I) Basics of Data Science – ANN 1 3 Application Areas (I) • Text recognition, OCR – Handwritten/scanned text http://www.plate-recognition.info Basics of Data Science – ANN 1 http://geeknizer.com 4 Application Areas (II) • Facial recognition systems / Deep fakes (GANs) http://www.giga.de Basics of Data Science – ANN 1 5 Application Areas (III) • Early warning systems (tornados) • Time series analysis (weather, shares) • Virtual agents, AI in games and simulations • Medical diagnostics • Autonomous vehicles (ATO / driverless cars) Basics of Data Science – ANN 1 6 Why ANN? (I) • Many problems can't be solved by using explicit knowledge (i.e., knowledge that can be stored symbolically via characters (language, writing)) • Implicit knowledge is needed (= being able to do sth. without knowing how to do it) -> Transfer learning • Examples: – Facial recognition systems (many pixels, few feasible solutions) – Autonomous vehicles Basics of Data Science – ANN 1 7 Why ANN? (II) • Example: Time series analysis – Advantages: • Non-linear relations are easier to represent with non-linear activation functions • Flexible, as they don't need information about probability distributions or formal model specifications • No assumptions necessary for building the forecast model • Many parameters are given by the application area and the available data • ANNs are quite robust against noise • Training during the forecast is possible; adaptations according to changing relations can be made (Continuous Learning, CL) Basics of Data Science – ANN 1 8 Why ANN? (III) • Example: Time series analysis – Disadvantages: • The learning process is time-consuming • Knowledge only by learning (i.e., known relations can't be implemented in advance -> Transfer learning) • Not all parameters are fixed in advance; they have to be specified -> frequently, there are no satisfying heuristics -> time-consuming Basics of Data Science – ANN 1 9 The biological role model The human brain consists of: ca. 10^11 neurons and ca. 10^13 connections. (Figure: a biological neuron with cell body, nucleus, nucleolus, dendrites, axon, and synaptic nodes.)
Dendrites: direct the incoming signals to the cell nucleus. Cell nucleus: processes the incoming signals. Axon: forwards the output signals to other cells. Synapses: connections between the axon of one neuron and the dendrite of another one. Basics of Data Science – ANN 1 10 The Artificial Neuron (Figure: input signals x1, x2, …, xn "activate" the transfer function and local memory, producing the output signal y; copies of the output signal are passed on.) Basics of Data Science – ANN 1 11 Artificial Neural Network – Basics • An artificial neural network consists of several neurons/units and connections • Neurons/Units: – Input unit – Hidden unit – Bias unit – Output unit • Connections: – Directed – Weighted • Several units of the same type form a layer Basics of Data Science – ANN 1 12 Input and propagation function • $a_i$ = activity level of the sending unit • $w_{ij}$ = weight of the connection between neurons i and j • Input of unit j: $\text{input}_{ji} = a_i w_{ij}$ • Net input of unit j: $\text{netinput}_j = \sum_i \text{input}_{ji} = \sum_i a_i w_{ij}$ (propagation function) http://www.neuronalesnetz.de Basics of Data Science – ANN 1 13 Activation Function, Output and Bias • Activation function: relationship between net input and activity level of a neuron • The activity level is transformed into output by an output function (often the identity function) Basics of Data Science – ANN 1 14 Different Activation Functions • Linear function • Threshold function (Rectified Linear Unit) • Binary function • Sigmoid function Basics of Data Science – ANN 1 15 Sigmoid Activation Function $f(x) = \frac{1}{1 + e^{-x}}$ (Figure: activity level as a function of the net input.) Basics of Data Science – ANN 1 16 Rectified Linear Unit (ReLU) Activation Function Pauly et al., 2017 Basics of Data Science – ANN 1 17 Bias Units • Bias units: – Have an activity level of +1 – Weight to another unit positive or negative – Positive weight: unit stays active (high activity desired) – Negative weight: unit stays inactive (barrier) http://www.neuronalesnetz.de Basics of Data Science – ANN 1 18 Classification of ANN • Different classifications possible, e.g. by… – Number of layers / hidden layers – Activation function (e.g. ReLU, binary, sigmoid) – Learning paradigm ((un-)supervised, reinforcement learning, stochastic) – Feedforward, Feedback (Recurrent) Basics of Data Science – ANN 1 19 Classification of ANN Neural Networks Supervised Learning Feedforward Perceptron Multi-Layer-Perceptron GANs (Generative Adversarial Networks) Radial Basis Function Unsupervised Learning Feedback Feedforward Feedback ARTMAP (Predictive ART) Kohonen Maps Adaptive Resonance Theory (ART) Reinforcement learning LSTM (Long short-term memory) GRU (Gated Recurrent Unit) Basics of Data Science – ANN 1 20 Learning – Training set vs. Test set • Training set: – Input vectors (where the desired output or response is known) – Really used for training the system – Weights are adjusted according to the result • Test set: – Input vectors (where the desired output or response is known) – Verification of the learning effects – No adjustment of weights • Ratio of training set vs. test set: about 70/30 • Also important: the order of the patterns presented Basics of Data Science – ANN 1 21 Example – OCR (Figure: sample letter images classified as A / not A / not A / A / not A.) Basics of Data Science – ANN 1 22 Learning Paradigms • Unsupervised learning • Supervised learning • Reinforcement learning Basics of Data Science – ANN 1 23
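Returning to the propagation and activation functions defined above, here is a minimal NumPy sketch of a single neuron's forward pass (the activity levels and weights are invented):

import numpy as np

a = np.array([0.5, 1.0, 0.2])    # activity levels a_i of the sending units
w = np.array([0.8, -0.4, 0.3])   # weights w_ij of the incoming connections

net_input = np.sum(a * w)        # propagation function: sum_i a_i * w_ij

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # sigmoid activation function

activity = sigmoid(net_input)    # activity level of the receiving unit
print(net_input, activity)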
Unsupervised and Supervised Learning • Supervised learning – For a training set of data, input vectors and correct output vectors are known – Search for optimal network weights minimizing an error measure (e.g. mean squared error) on the training set – Hopefully generalizing to minimizing the error in the application phase • Unsupervised learning – Correct output vectors are not known – Goal: finding patterns in the input data – Application field: similar to linear models: interdependence analysis Basics of Data Science – ANN 1 24 Reinforcement Learning • No labelled data is needed -> no output vector is known • Optimizes a cumulative reward • High degree of generality • High potential for decision problems from various disciplines like control theory, operations research, multi-agent systems, … • Can learn complex strategies (but a problem-specific parameter set is still needed) • Hopefully generalizing to minimizing the error in the application phase Basics of Data Science – ANN 1 25 Network Topology • Feedforward Network • Feedback Network Basics of Data Science – ANN 1 26 Network Topology Feedforward Network (Figure: input layer, hidden layer, output layer; signals flow from input to output.) Basics of Data Science – ANN 1 27 Network Topology Feedback Network (Recurrent Network) (Figure: input layer, hidden layer, output layer with recurrent connections.) Basics of Data Science – ANN 1 28 Network Topology Feedback Network (Recurrent Network) • Feedback networks contain recurrent arcs to neurons in the same or a previous layer • Examples: – Adaptive Resonance Theory (ART) – ARTMAP (Predictive ART) – GRU (Gated Recurrent Unit) Basics of Data Science – ANN 1 29 Output Vector Designs • Preferred: one-hot coding – Used for categorical data, example: OCR – 1 output neuron for each distinct output – Desired output: 1 active neuron, all others inactive – Advantage: evaluation of output quality • Other examples: – Gray code – Categorical encoders (NLP) – Embeddings (conversion to N-dimensional vectors) Basics of Data Science – ANN 1 30 Artificial Neural Networks (II) Basics of Data Science – ANN 2 31 Classification – Perceptron Neural Networks Supervised Learning Feedforward Perceptron Multi-Layer-Perceptron GANs (Generative Adversarial Networks) Radial Basis Function Unsupervised Learning Feedback Feedforward Feedback ARTMAP (Predictive ART) Kohonen Maps Adaptive Resonance Theory (ART) LSTM (Long short-term memory) GRU (Gated Recurrent Unit) Basics of Data Science – ANN 2 32 The (Single-Layer) Perceptron Origin: McCulloch and Pitts 1943 $y = f\left(\sum_{i=1}^{n} w_i x_i\right)$ x: vector of inputs x1-xn (dendrites), w: vector of weights, y: output, f(.): activation function Basics of Data Science – ANN 2 33 Single-Layer Perceptron (SLP) • Earliest kind of neural network • Simple associative memory • Binary activation function • Only capable of learning linearly separable patterns • Used for simple classification problems Basics of Data Science – ANN 2 34 XOR Problem: (lack of) Linear Separability Source: Prof. S. Krüger Basics of Data Science – ANN 2 35 XOR Problem: (lack of) Linear Separability Source: Prof. S. Krüger Basics of Data Science – ANN 2 36 XOR Problem Solution: Multi-Layer Perceptron (MLP) Source: Prof. S. Krüger Basics of Data Science – ANN 2 37
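To see why solving XOR needs a hidden layer, here is a small sketch of a 2-2-1 multi-layer perceptron with binary threshold activations; the weights are set by hand (chosen for illustration, not learned):

def step(x):
    return 1 if x > 0 else 0        # binary threshold activation

def mlp_xor(x1, x2):
    h1 = step(x1 + x2 - 0.5)        # hidden unit 1 computes OR(x1, x2)
    h2 = step(x1 + x2 - 1.5)        # hidden unit 2 computes AND(x1, x2)
    return step(h1 - h2 - 0.5)      # output: OR and not AND, i.e., XOR

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, "->", mlp_xor(x1, x2))   # prints 0, 1, 1, 0

No single-layer perceptron can reproduce this mapping, since XOR is not linearly separable.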
Neural Networks Supervised Learning Feedforward Perceptron Multi-Layer-Perceptron GANs (Generative Adversarial Networks) Radial Basis Function Unsupervised Learning Feedback Feedforward Feedback ARTMAP (Predictive ART) Kohonen Maps Adaptive Resonance Theory (ART) LSTM (Long short-term memory) GRU (Gated Recurrent Unit) Basics of Data Science – ANN 2 38 The Perceptron Origin: McCulloch and Pitts 1943 $y = f\left(\sum_{i=1}^{n} w_i x_i\right)$ x: vector of inputs x1-xn (dendrites), w: vector of weights, y: output, f(.): activation function Basics of Data Science – ANN 2 39 Multi-Layer Perceptron (MLP) • One of the most popular neural network models • The activation function was historically mostly a sigmoidal function; nowadays ReLU is commonly used due to its computational efficiency • Important proofs (for the sigmoid activation function only!): – A two-layer perceptron can approximate any nonlinear function – A three-layer perceptron is sufficient to separate any (convex or non-convex) polyhedral decision region Basics of Data Science – ANN 2 40 Multi-Layer Perceptron (MLP) • Consists of multiple layers of nodes in a directed graph, with each layer fully connected to the next one (one input and output layer, one or more hidden layers) • Except for the input nodes, each node is a neuron (or processing element) with a nonlinear activation function • The MLP utilizes a supervised learning technique called backpropagation for training the network Basics of Data Science – ANN 2 41 Backpropagation • Formulated in 1974 by Paul Werbos, widely used since it was published by David Rumelhart, Geoffrey Hinton and Ronald Williams in 1986 • Generalization of the Delta rule in MLPs • The most well-known learning procedure • Historically, an "external teacher" is required that knows the correct output value. RL algorithms work with the reward signal only. • Special case of a gradient procedure based on the mean squared error Basics of Data Science – ANN 2 42 A side note on… Gradient procedures: • Also known as the steepest descent method – Beginning: an approximated value – Go in the direction of the negative gradient (indicates the direction of the steepest descent from the approximated value) – Stop when there is no numerical improvement • Convergence is often very slow Basics of Data Science – ANN 2 43 Backpropagation is prone to the following problems: • Very slow on flat parts of the parameter space • Oscillating between steep walls of a (multimodal) error function Basics of Data Science – ANN 2 44 A side note on… Squared error: • Squared difference between estimated and true values: $E = \frac{1}{n} \sum_{i=1}^{n} (t_i - o_i)^2$, where E = error, n = number of patterns, $t_i$ = target value, $o_i$ = output Basics of Data Science – ANN 2 45
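A minimal sketch of the steepest descent idea on a squared error with a single parameter (the toy function and step size are invented for illustration):

# Minimize E(w) = (w - 3)^2 by steepest descent.
def grad(w):
    return 2 * (w - 3)      # dE/dw

w = 0.0                     # starting (approximated) value
lr = 0.1                    # step size
for _ in range(50):
    w = w - lr * grad(w)    # step in the direction of the negative gradient

print(round(w, 4))          # close to the minimum at w = 3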
Backpropagation – Algorithm • The input pattern is propagated forward through the network • The output is compared with the target; the difference is the error of the network • The error is back-propagated from the output to the input layer • Weights are changed according to their influence on the error -> if the same input is used again, the output moves closer to the target Basics of Data Science – ANN 2 46 Multi-Layer Perceptrons: Special Forms Neural Networks Supervised Learning Feedforward Perceptron Multi-Layer-Perceptron Radial Basis Function Unsupervised Learning Feedback Feedforward Feedback ARTMAP (Predictive ART) Kohonen Maps Adaptive Resonance Theory (ART) Basics of Data Science – ANN 2 47 Special Forms of MLPs: Autoencoders Source: https://towardsdatascience.com/generating-images-with-autoencoders-77fd3a8dd368 Basics of Data Science – ANN 2 48 Special Forms of MLPs: Generative Adversarial Networks (GANs) Source: https://www.freecodecamp.org/news/an-intuitive-introduction-to-generative-adversarial-networks-gans-7a2264a81394/ Basics of Data Science – ANN 2 49 Artificial Neural Networks (III) Basics of Data Science – ANN 3 50 Neural Networks Supervised Learning Feedforward Perceptron Multi-Layer-Perceptron GANs (Generative Adversarial Networks) Radial Basis Function Unsupervised Learning Feedback Feedforward Feedback ARTMAP (Predictive ART) Kohonen Maps Adaptive Resonance Theory (ART) LSTM (Long short-term memory) GRU (Gated Recurrent Unit) Basics of Data Science – ANN 3 51 Radial Basis Functions (RBF) • Real-valued function • The value depends only on the distance from some point c (the center, which can be the origin): $\phi(x, c) = \phi(\lVert x - c \rVert)$ • The norm is usually the Euclidean distance; others are possible, too • Suitable for classification problems • Additionally, RBFs can be used to approximate functions or to solve partial differential equations Basics of Data Science – ANN 3 52 Radial Basis Function (RBF) Networks • Similar to perceptrons, but with exactly 3 layers (input, hidden, output) • Feedforward, fully connected, no shortcuts Basics of Data Science – ANN 3 53 Radial Basis Function (RBF) Networks • Input neurons: – Only direct the input, without any weights, to the next layer • Output neurons: – Activation function: identity function – Propagation function: weighted sum • Hidden neurons (= RBF neurons): – Propagation function: norm (distance between net input and the center of the neuron, i.e., the difference between the input vector and the center vector) – Activation function: radial basis function Basics of Data Science – ANN 3 54 Radial Basis Function (RBF) Networks Basics of Data Science – ANN 3 55 Radial Basis Function (RBF) Networks • Learning and training via adjustment of – the centers of the RBF neurons, – the widths of the Gaussian functions, – the weights of the connections between the RBF and output layers Basics of Data Science – ANN 3 56 RBF NNs are More Suitable for Probabilistic Pattern Classification (Figure: an MLP separates classes with a hyperplane, whereas an RBF network uses kernel functions.) Basics of Data Science – ANN 3 57
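A minimal sketch of a Gaussian radial basis function, a typical activation for the RBF neurons (the center, width, and the particular Gaussian parametrization used here are assumptions for illustration):

import numpy as np

def gaussian_rbf(x, c, width=1.0):
    # phi(x, c) = phi(||x - c||): the value depends only on the distance to the center.
    return np.exp(-np.linalg.norm(x - c) ** 2 / (2 * width ** 2))

c = np.array([0.0, 0.0])                       # center of the RBF neuron
print(gaussian_rbf(np.array([0.0, 0.0]), c))   # 1.0 at the center
print(gaussian_rbf(np.array([2.0, 0.0]), c))   # decays with distance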
Basics of Data Science – Support Vector Machines 2

Support Vector Machines

Basics of Data Science – Support Vector Machines 3

SVM in a nutshell (1)
• Beginning: training set of objects (vectors) with known classification, represented in a vector space

Basics of Data Science – Support Vector Machines 4

SVM in a nutshell (2)
• Task: find a hyperplane separating the objects into two classes

Basics of Data Science – Support Vector Machines 5

SVM in a nutshell (3)
• Important: maximize the distance from the vectors nearest to the hyperplane (i.e., the margin); this is needed for better classification in case test objects don't match the training objects exactly

Basics of Data Science – Support Vector Machines 6

SVM in a nutshell (4)
• Not all training vectors need to be considered (some are too far away from the hyperplane or "hidden" by other vectors)
• The hyperplane only depends on the nearest vectors (called support vectors)

Basics of Data Science – Support Vector Machines 7

SVM in a nutshell (5)
• Linear separability: hyperplanes cannot be bent, so the objects need to be linearly separable

Basics of Data Science – Support Vector Machines 8

SVM in a nutshell (6)
• Most real-world data is not linearly separable

Basics of Data Science – Support Vector Machines 9

SVM in a nutshell (7)
• Solution approach:

Basics of Data Science – Support Vector Machines 10

SVM in a nutshell (8)
• Transfer the vector space (incl. all training vectors) into a higher-dimensional space (up to infinitely high) and find the hyperplane there
• Re-transfer into the low-dimensional space: the linear hyperplane turns into a nonlinear one, but the training vectors are exactly separated into two classes
• Problems:
  1. The transformation into the higher dimension is computationally expensive
  2. The representation in the lower dimension is very complex and thus not usable

Basics of Data Science – Support Vector Machines 11

SVM in a nutshell (9)
• Kernel trick: use proper kernel functions that
  1. describe the hyperplane in the high-dimensional space AND
  2. are manageable in the lower dimension
→ The separation in the high-dimensional space is in fact possible without ever calculating the transformation explicitly.

Basics of Data Science – Support Vector Machines 12
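A tiny numeric sketch of this trick, under the assumption of the degree-2 polynomial kernel: the kernel value computed in the original 2D space equals the dot product in an explicit 6-dimensional feature space, which therefore never has to be constructed.

```python
import numpy as np

# Explicit feature map for the degree-2 polynomial kernel in 2D (illustrative):
# phi(x) = (x1^2, x2^2, sqrt(2)*x1*x2, sqrt(2)*x1, sqrt(2)*x2, 1)
def phi(x):
    return np.array([x[0]**2, x[1]**2,
                     np.sqrt(2) * x[0] * x[1],
                     np.sqrt(2) * x[0], np.sqrt(2) * x[1], 1.0])

def poly_kernel(x, z):
    return (x @ z + 1.0) ** 2   # K(x, z) = (<x, z> + 1)^2

x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(phi(x) @ phi(z))   # dot product in the 6-dimensional feature space: 4.0
print(poly_kernel(x, z)) # the same value, computed in the original 2D space
```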
Support Vector Machines

Basics of Data Science – Support Vector Machines 1

Preliminaries
• Task of this class of algorithms: detect and exploit complex patterns in data (e.g., by clustering, classifying, ranking, or cleaning the data)
• Typical problems:
  – How to represent complex patterns (computational problem)
  – How to exclude unstable patterns / overfitting (statistical problem)

Basics of Data Science – Support Vector Machines 2

Very Informal Reasoning
• The class of kernel methods implicitly defines the class of possible patterns by introducing a notion of similarity between data
• Example: similarity between documents
  – by length
  – by topic
  – by language …
• Choice of similarity → choice of relevant features

Basics of Data Science – Support Vector Machines 3

More Formal Reasoning
• Kernel methods exploit information about the inner products between data items
• Many standard algorithms can be rewritten so that they only require inner products between the data (inputs)
• Kernel functions = inner products in some (potentially very complex) feature space
• If the kernel is given, there is no need to specify what features of the data are being used

Basics of Data Science – Support Vector Machines 4

Overview
• Linear Learning Machines (LLM)
• Kernel Induced Feature Spaces
• Generalization Theory
• Optimization Theory
• Support Vector Machines (SVM)

Basics of Data Science – Support Vector Machines 5

Modularity
• Any kernel-based learning algorithm is composed of two modules:
  – a general purpose learning machine
  – a problem specific kernel function
• Any kernel-based algorithm can be fitted with any kernel
• Kernels themselves can be constructed in a modular way
→ Great for software engineering (and for analysis)

Basics of Data Science – Support Vector Machines 6

Linear Learning Machines
• Simplest case: classification → the decision function is a hyperplane in input space
• The Perceptron Algorithm (Rosenblatt, 1957)
• It is useful to analyze the Perceptron algorithm before looking at SVMs and kernel methods in general

Basics of Data Science – Support Vector Machines 7

Linear Learning Machines – Basic Notation
• Input space
• Output space
• Hypothesis
• Real-valued function
• Training set
• Test error
• Dot product

Basics of Data Science – Support Vector Machines 8
Linear Learning Machines – Dot Product?
• Inner product / scalar product (here: dot product) between vectors: $\langle x, z \rangle = \sum_i x_i z_i$
• Hyperplane: $\langle w, x \rangle + b = 0$ (in Hesse normal form, good for calculating distances of points to the plane; w = normal vector of the plane, b = distance from the origin)

Basics of Data Science – Support Vector Machines 9

Linear Learning Machines – Perceptron
• Linear separation of the input space: $f(x) = \mathrm{sign}(\langle w, x \rangle + b)$, with sign = +1 on one side of the hyperplane, sign = −1 on the other, and sign = 0 on the hyperplane itself

Basics of Data Science – Support Vector Machines 10

Linear Learning Machines – Perceptron Algorithm
• Update rule (ignoring the threshold): if $y_i (\langle w, x_i \rangle + b) \le 0$ then $w \leftarrow w + \eta\, y_i x_i$

Basics of Data Science – Support Vector Machines 11

Linear Learning Machines – Observations
• The solution is a linear combination of training points: $w = \sum_i \alpha_i y_i x_i$, with $\alpha_i \ge 0$
• Only informative points are used (mistake driven)
• The coefficient $\alpha_i$ of a point in the combination reflects its 'difficulty'

Basics of Data Science – Support Vector Machines 12

Excursion: Duality
• Primal program vs. dual program

Basics of Data Science – Support Vector Machines 13

Linear Learning Machines – Dual Representation
• The decision function can be re-written as follows: $f(x) = \mathrm{sign}\left(\sum_i \alpha_i y_i \langle x_i, x \rangle + b\right)$

Basics of Data Science – Support Vector Machines 14

Linear Learning Machines – Dual Representation
• And also the update rule can be rewritten as follows: if $y_i \left( \sum_j \alpha_j y_j \langle x_j, x_i \rangle + b \right) \le 0$ then $\alpha_i \leftarrow \alpha_i + \eta$
• Note: in the dual representation, the data appear only inside dot products

Basics of Data Science – Support Vector Machines 15

Linear Learning Machines – Duality: First Property of SVMs
• DUALITY is the first feature of Support Vector Machines
• SVMs are Linear Learning Machines represented in a dual fashion
• Data appear only within dot products (in the decision function and in the training algorithm)

Basics of Data Science – Support Vector Machines 16

Linear Learning Machines – Limitations of LLMs
• Linear classifiers cannot deal with
  – non-linearly separable data
  – noisy data
• This formulation only deals with vectorial data

Basics of Data Science – Support Vector Machines 17

Linear Learning Machines – Non-Linear Classifiers
• Alternative 1: creating a network of simple linear classifiers (neurons): a neural network (problems: local minima; many parameters; heuristics needed to train; etc.)
• Alternative 2: map the data into a richer feature space including non-linear features, then use a linear classifier

Basics of Data Science – Support Vector Machines 18

Overview
• Linear Learning Machines (LLM)
• Kernel Induced Feature Spaces
• Generalization Theory
• Optimization Theory
• Support Vector Machines (SVM)

Basics of Data Science – Support Vector Machines 19

Linear Learning Machines – Learning in the Feature Space
• Map the data into a feature space where they are linearly separable: $x \mapsto \phi(x)$

Basics of Data Science – Support Vector Machines 20

Linear Learning Machines – Problems with Feature Space
• Working in high-dimensional feature spaces solves the problem of expressing complex functions
• BUT: there is a computational problem (working with very large vectors)
• And a generalization theory problem (curse of dimensionality)

Basics of Data Science – Support Vector Machines 21

Kernel-Induced Feature Spaces – Implicit Mapping to Feature Space
We will introduce kernels that:
• solve the computational problem of working with many dimensions
• can make it possible to use infinite dimensions, efficiently in time/space
• offer other advantages, both practical and conceptual

Basics of Data Science – Support Vector Machines 22

Kernel-Induced Feature Spaces – Implicit Mapping to Feature Space
• In the dual representation, the data points only appear inside dot products: $f(x) = \mathrm{sign}\left(\sum_i \alpha_i y_i \langle \phi(x_i), \phi(x) \rangle + b\right)$
• The dimensionality of the feature space is not necessarily important
• We may not even know the map $\phi$
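A minimal sketch of the dual (kernel) perceptron following the two re-written formulas above; the Gaussian kernel, its width, and the toy XOR-style data are illustrative assumptions.

```python
import numpy as np

def kernel_perceptron(X, y, kernel, epochs=10):
    """Dual perceptron: the data enter only through kernel evaluations."""
    n = len(X)
    alpha, b = np.zeros(n), 0.0
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    for _ in range(epochs):
        for i in range(n):
            # dual decision function: sum_j alpha_j y_j K(x_j, x_i) + b
            if y[i] * (np.sum(alpha * y * K[:, i]) + b) <= 0:
                alpha[i] += 1.0   # dual, mistake-driven update rule
                b += y[i]
    return alpha, b

# Toy usage with a Gaussian (RBF) kernel; the width is an assumed choice.
rbf = lambda x, z: np.exp(-np.sum((x - z) ** 2))
X = np.array([[0.0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([-1, 1, 1, -1])   # XOR-like labels: not linearly separable
alpha, b = kernel_perceptron(X, y, rbf)
pred = [np.sign(sum(a * yi * rbf(xi, x) for a, yi, xi in zip(alpha, y, X)) + b)
        for x in X]
print(pred)   # the kernel-induced feature space makes the data separable
```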
Basics of Data Science – Support Vector Machines 23

Kernel-Induced Feature Spaces – Kernels
• A kernel is a function that returns the value of the dot product between the images of its two arguments: $K(x, z) = \langle \phi(x), \phi(z) \rangle$
• Given a function K, it is possible to verify that it is a kernel

Basics of Data Science – Support Vector Machines 24

Kernel-Induced Feature Spaces – Kernels
• One can use LLMs in a feature space by simply rewriting them in dual representation and replacing dot products with kernels: $f(x) = \mathrm{sign}\left(\sum_i \alpha_i y_i K(x_i, x) + b\right)$

Basics of Data Science – Support Vector Machines 25

Kernel-Induced Feature Spaces – The Kernel Matrix
• (aka the Gram matrix): $K_{ij} = K(x_i, x_j)$

Basics of Data Science – Support Vector Machines 26

Kernel-Induced Feature Spaces – The Kernel Matrix
• The kernel matrix is the central structure in kernel machines
• Information 'bottleneck': it contains all the information necessary for the learning algorithm
• It fuses information about the data AND the kernel

Basics of Data Science – Support Vector Machines 27

Kernel-Induced Feature Spaces – Mercer's Theorem
• Many interesting properties:
  – The kernel matrix is symmetric positive definite
  – Any symmetric positive definite matrix
    • can be regarded as a kernel matrix
    • is an inner product matrix in some feature space

Basics of Data Science – Support Vector Machines 28

Symmetric Positive Definite Matrix
• Symmetric, e.g.:
$A = \begin{pmatrix} 1 & 2 & -3 & 5 \\ 2 & 4 & 1 & 3 \\ -3 & 1 & -2 & 2 \\ 5 & 3 & 2 & 6 \end{pmatrix}$
• Positive definite: quadratic form $q_A(x) = x^T A\, x > 0$ for all $x \ne 0$

Basics of Data Science – Support Vector Machines 29

Kernel-Induced Feature Spaces – More Formally: Mercer's Theorem
• Every (semi-)positive definite, symmetric function is a kernel: i.e., there exists a mapping $\phi$ such that it is possible to write $K(x, z) = \langle \phi(x), \phi(z) \rangle$

Basics of Data Science – Support Vector Machines 30

Kernel-Induced Feature Spaces – Examples of Kernels
• Simple examples of kernels are the polynomial kernel $K(x, z) = (\langle x, z \rangle + 1)^d$ and the Gaussian kernel $K(x, z) = \exp\left(-\|x - z\|^2 / (2\sigma^2)\right)$

Basics of Data Science – Support Vector Machines 32

Kernel-Induced Feature Spaces – Example: Polynomial Kernels

Basics of Data Science – Support Vector Machines 33/34

Kernel-Induced Feature Spaces – Example: Polynomial Kernels
(Figure: decision boundaries induced by polynomial kernels. Source: http://goo.gl/8JowV)

Basics of Data Science – Support Vector Machines 35

Kernel-Induced Feature Spaces – Example: The Two Spirals
• Separated by a hyperplane in feature space (Gaussian kernels)

Basics of Data Science – Support Vector Machines 36

Kernel-Induced Feature Spaces – Making Kernels
• The set of kernels is closed under some operations. If K, K' are kernels, then:
  – K + K' is a kernel
  – cK is a kernel, if c > 0
  – aK + bK' is a kernel, for a, b > 0
  – and many more …
• One can construct complex kernels from simple ones: modularity!
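A small sketch of this modularity: composing two base kernels with positive coefficients and checking that the resulting Gram matrix is still symmetric positive semi-definite (Mercer's condition). The particular kernels and weights are arbitrary choices.

```python
import numpy as np

def k_linear(x, z):
    return x @ z

def k_rbf(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

def k_sum(k1, k2, a=1.0, b=1.0):
    """a*K + b*K' is again a kernel for a, b > 0 (closure property)."""
    return lambda x, z: a * k1(x, z) + b * k2(x, z)

k_combined = k_sum(k_linear, k_rbf, a=0.5, b=2.0)

# The Gram matrix of a valid kernel is symmetric positive semi-definite:
X = np.random.default_rng(1).normal(size=(20, 3))
G = np.array([[k_combined(x, z) for z in X] for x in X])
print(np.allclose(G, G.T), np.linalg.eigvalsh(G).min() >= -1e-9)  # True True
```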
Basics of Data Science – Support Vector Machines 37

Kernel-Induced Feature Spaces – Second Property of SVMs
SVMs are Linear Learning Machines that
• use a dual representation AND
• operate in a kernel-induced feature space (that is: $f(x) = \sum_i \alpha_i y_i K(x_i, x) + b$ is a linear function in the feature space implicitly defined by K)

Basics of Data Science – Support Vector Machines 38

Kernel-Induced Feature Spaces – Kernels over General Structures
• Kernels over sets, over sequences, over trees, …
• Applied in text categorization, bioinformatics, …

Basics of Data Science – Support Vector Machines 39

Kernel-Induced Feature Spaces – A Bad Kernel …
• … would be a kernel whose kernel matrix is mostly diagonal: all points orthogonal to each other, no clusters, no structure …

Basics of Data Science – Support Vector Machines 40

Kernel-Induced Feature Spaces – No Free Kernel
• If the mapping leads into a space with too many irrelevant features, the kernel matrix becomes diagonal
• Some prior knowledge of the target is needed to choose a good kernel

Basics of Data Science – Support Vector Machines 41

Kernel-Induced Feature Spaces – Other Kernel-Based Algorithms
• Not just LLMs can use kernels:
  – clustering
  – Principal Component Analysis
  – others …
• A dual representation is often possible

Basics of Data Science – Support Vector Machines 42

Overview
• Linear Learning Machines (LLM)
• Kernel Induced Feature Spaces
• Generalization Theory
• Optimization Theory
• Support Vector Machines (SVM)

Basics of Data Science – Support Vector Machines 1

Linear Learning Machines – Problems with Feature Space
• Working in high-dimensional feature spaces solves the problem of expressing complex functions
• BUT: there is a computational problem (working with very large vectors)
• And a generalization theory problem (curse of dimensionality)

Basics of Data Science – Support Vector Machines 2

The Generalization Problem
• The curse of dimensionality:
  – It is easy to overfit in high-dimensional spaces
  – Regularities could be found in the training set that are accidental → they would not be found again in a test set
• The SVM problem is ill posed:
  – Finding one hyperplane that separates the data → many such hyperplanes may exist!
• How to choose the best possible hyperplane?

Basics of Data Science – Support Vector Machines 3

The Generalization Problem
• Many methods exist to choose a good hyperplane (inductive principles):
  – Bayes,
  – statistical learning theory / PAC
• Each can be used
• We will focus on a simple case motivated by statistical learning theory (it will give the basic SVM)

Basics of Data Science – Support Vector Machines 4

Generalization Theory – Statistical (Computational) Learning Theory
• Generalization bounds on the risk of overfitting
  – PAC setting: probably approximately correct
  – assumption of i.i.d. data
• Standard bounds from VC (Vapnik–Chervonenkis) theory give upper and lower bounds proportional to the VC dimension
• The VC dimension of LLMs is proportional to the dimension of the space (which can be huge)

Basics of Data Science – Support Vector Machines 5

Generalization Theory – Vapnik–Chervonenkis Dimension
• A measure of the capacity of a statistical classification algorithm
• Defined as the cardinality of the largest set of points that the algorithm can shatter
• A core concept in Vapnik–Chervonenkis theory

Basics of Data Science – Support Vector Machines 6

Generalization Theory – Assumptions and Definitions
• Distribution D over the input space X
• Train and test points drawn randomly (i.i.d.) from D
• Training error of a hypothesis: the fraction of points in S misclassified by the hypothesis
• Test error of a hypothesis: the probability under D to misclassify a point x
• VC dimension h: the size of the largest subset of X shattered by the hypothesis class (every dichotomy implemented)

Basics of Data Science – Support Vector Machines 7

Generalization Theory – Vapnik–Chervonenkis Dimension
• Allows us to predict the error on test points from the error on the training set
• For sample size m and VC dimension h << m it holds with probability 1 − η that
$\text{test error} \;\le\; \text{training error} + \sqrt{\frac{h\left(\ln\frac{2m}{h} + 1\right) - \ln\frac{\eta}{4}}{m}}$

Basics of Data Science – Support Vector Machines 8

Generalization Theory – VC Bounds
(Figure: VC bounds as a function of the sample size m.)

Basics of Data Science – Support Vector Machines 9

Generalization Theory – VC Bounds
• But often the VC dimension h >> m, so the bound is very weak
• It does not tell us which hyperplane to choose
• However: margin-based bounds exist, too!!

Basics of Data Science – Support Vector Machines 10

Generalization Theory – Margin-Based Bounds
• The VC worst-case bound still holds, but if we are lucky (the margin is large) the other bounds can be applied and better generalization can be achieved
• Best hyperplane: the maximal margin one
• The margin is large if the kernel is chosen well

Basics of Data Science – Support Vector Machines 11

Generalization Theory – Maximal Margin Classifier
• Minimize the risk of overfitting by choosing the maximal margin hyperplane in feature space
• Third feature of SVMs: maximize the margin
• SVMs control capacity by increasing the margin, not by reducing the number of degrees of freedom (dimension-free capacity control)

Basics of Data Science – Support Vector Machines 12/13

Generalization Theory – Max Margin = Minimal Norm
• Distance between the two convex hulls

Basics of Data Science – Support Vector Machines 18

Generalization Theory – The Primal Problem
• Minimize: $\frac{1}{2}\|w\|^2$
• subject to: $y_i (\langle w, x_i \rangle + b) \ge 1$ for all i

Basics of Data Science – Support Vector Machines 19

Overview
• Linear Learning Machines (LLM)
• Kernel Induced Feature Spaces
• Generalization Theory
• Optimization Theory
• Support Vector Machines (SVM)

Basics of Data Science – Support Vector Machines 20

Optimization Theory
• The problem of finding the maximal margin hyperplane is a constrained optimization problem (quadratic programming)
• Use Lagrange theory (or Kuhn–Tucker theory)
• Lagrangian: $L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_i \alpha_i \left[ y_i (\langle w, x_i \rangle + b) - 1 \right]$

Basics of Data Science – Support Vector Machines 21

Optimization Theory – From Primal to Dual
• Differentiate and substitute: $\frac{\partial L}{\partial w} = 0 \Rightarrow w = \sum_i \alpha_i y_i x_i$ and $\frac{\partial L}{\partial b} = 0 \Rightarrow \sum_i \alpha_i y_i = 0$

Basics of Data Science – Support Vector Machines 22

Optimization Theory – The Dual Problem
• Maximize: $W(\alpha) = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle$
• subject to: $\alpha_i \ge 0$ and $\sum_i \alpha_i y_i = 0$
• The duality again! We can use kernels!

Basics of Data Science – Support Vector Machines 23

Optimization Theory – Convexity
• This is a quadratic optimization problem → convex → no local minima!!!
☺☺☺
• (Second effect of Mercer's conditions)
• Solvable in polynomial time …
• (Convexity is another fundamental property of SVMs)

Basics of Data Science – Support Vector Machines 24

Optimization Theory – Kuhn–Tucker Theorem
Properties of the solution:
• Duality: we can use kernels
• KKT conditions: $\alpha_i \left[ y_i (\langle w, x_i \rangle + b) - 1 \right] = 0$ for all i
• Sparseness: only the points nearest to the hyperplane (margin = 1) have positive weight
• They are called support vectors

Basics of Data Science – Support Vector Machines 25

Optimization Theory – KKT Conditions Imply Sparseness
• Sparseness: another fundamental property of SVMs

Basics of Data Science – Support Vector Machines 26

Optimization Theory – XOR Example: Polynomial Kernel
$K(x_i, x_j) = (x_i^T x_j + 1)^2$

Basics of Data Science – Support Vector Machines 27

Optimization Theory – XOR Example: Gaussian Kernel
$K(x_i, x_j) = \exp\left(-\|x_i - x_j\|^2 / (2\sigma^2)\right)$

Basics of Data Science – Support Vector Machines 28

Optimization Theory – Another Example: Gaussian Kernel
$K(x_i, x_j) = \exp\left(-\|x_i - x_j\|^2 / (2\sigma^2)\right)$

Basics of Data Science – Support Vector Machines 29

Overview
• Linear Learning Machines (LLM)
• Kernel Induced Feature Spaces
• Generalization Theory
• Optimization Theory
• Support Vector Machines (SVM)

Basics of Data Science – Support Vector Machines 30

Support Vector Machines – Properties of SVMs: Summary
✓ Duality
✓ Kernels
✓ Margin
✓ Convexity
✓ Sparseness

Basics of Data Science – Support Vector Machines 31

Support Vector Machines – Dealing with Noise
• In the case of non-separable data in feature space, the margin distribution can be optimized

Basics of Data Science – Support Vector Machines 32

Support Vector Machines – The Soft-Margin Classifier
• Minimize: $\frac{1}{2}\|w\|^2 + C \sum_i \xi_i$ (1-norm) or $\frac{1}{2}\|w\|^2 + C \sum_i \xi_i^2$ (2-norm)
• subject to: $y_i (\langle w, x_i \rangle + b) \ge 1 - \xi_i$, $\xi_i \ge 0$

Basics of Data Science – Support Vector Machines 33

Support Vector Machines – Maximal Margin versus Soft Margin
• Max margin / 2-norm soft margin / 1-norm soft margin

Basics of Data Science – Support Vector Machines 36

Support Vector Machines – The Regression Case
• For regression, all the above properties are retained by introducing the ε-insensitive loss

Basics of Data Science – Support Vector Machines 37

Support Vector Machines – Regression: the ε-tube

Basics of Data Science – Support Vector Machines 38

Support Vector Machines – Implementation Techniques
• Maximizing a quadratic function, subject to a linear equality constraint (and inequalities as well)

Basics of Data Science – Support Vector Machines 39

Support Vector Machines – Simple Approximation
• Initially, complex QP packages were used.
• Stochastic gradient ascent (sequentially updating one weight at a time) gives an excellent approximation in most cases

Basics of Data Science – Support Vector Machines 40

Support Vector Machines – Sequential Minimal Optimization
• SMO: update two weights simultaneously
• Realizes gradient ascent without leaving the linear constraint (J. Platt)
• Online versions exist (Li–Long; Gentile)

Basics of Data Science – Support Vector Machines 41

Support Vector Machines – Comparison to Neural Networks
• Support Vector Machines: model estimation relatively quick (convex QP); model evaluation complexity-dependent → could be slow
• Artificial Neural Networks: model estimation relatively slow (gradient search); compact model → fast evaluation

Basics of Data Science – Support Vector Machines 42
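To tie the properties together, a short scikit-learn sketch: fitting soft-margin SVMs with different kernels on a non-linearly-separable toy data set and reporting how many support vectors carry positive weight (sparseness). The data set and parameter values are illustrative.

```python
from sklearn import datasets
from sklearn.svm import SVC

# Two interleaving half-moons: not linearly separable in input space.
X, y = datasets.make_moons(n_samples=200, noise=0.15, random_state=0)

for kernel in ("linear", "poly", "rbf"):
    clf = SVC(kernel=kernel, C=1.0).fit(X, y)
    # Sparseness: only the support vectors have positive dual weight alpha_i.
    print(kernel, "training accuracy:", clf.score(X, y),
          "support vectors:", clf.n_support_.sum())
```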
Prof. Dr. Oliver Wendt, Dr. habil. Mahdi Moeini
Business Information Systems & Operations Research
Basics of Data Science, Summer Semester 2022
Part 2, Section 4 → Reinforcement Learning

Part 2: Stochastic Models on structured attribute data:
• From Linear to Non-Linear Regression models
• Deep? Neural Network Models
• Support Vector Machines
• Reinforcement Learning
• Learning from Data Streams: Training and Updating Deterministic and Stochastic Models

Basics of Data Science: Reinforcement Learning 2

Reinforcement Learning

Basics of Data Science: Reinforcement Learning 3

Recap: Supervised Learning (Source: Bishop 2006, p. 7)
• Optimize (i.e., minimize) the mean squared error
• Based on training samples of data points
• Hopefully generalizing well to forecast future data points
• Assuming a uniform relevance of the parameter space (independent variables)?
• What to do if there is no teacher / trainer?

Basics of Data Science: Reinforcement Learning 4

Supervised? Learning … from Simulation?
• expected profit $y = f(x_1, x_2)$

Basics of Data Science: Reinforcement Learning 5

Markov Process
• Markov property: state transitions must be history independent, i.e., the transition probability T(s, s')
  – of reaching a state s' at time t+1
  – from the current state s at time t
  does NOT depend on any earlier state or transition
• Markov chain: a stochastic state-transition process complying with the Markov property

Basics of Data Science: Reinforcement Learning 6

Markov Decision Process
• Markov Decision Processes (MDP) are defined by:
  – a set of states S
  – a set of actions A
  – a reward function R: S × A → ℝ
  – a state transition probability T: S × A × S → [0, 1], giving the probability of reaching state s' from state s when action a is taken
  – a policy π: S → A prescribing which action to take in a given state

Basics of Data Science: Reinforcement Learning 7

Reinforcement Learning
• Reinforcement learning: established as a "scientific community" for about 20 years
• Origins / influences: cybernetics, psychology, statistics, robotics, artificial intelligence, neurosciences
• Goal: programming of agents by reward and punishment, without the necessity to explicitly specify the agents' action strategies
• Method: agents act in a dynamic environment and learn by trial and error

Basics of Data Science: Reinforcement Learning 8

Reinforcement Learning
– The agent is connected with the environment via "sensors"
– In each interaction step, the agent receives as input a reward signal r and feedback concerning the environmental state s
– The agent chooses an action a as output, which may or may not change the environmental state
– The agent gets to know the value of its action only via the reinforcement / reward signal
– The goal of the agent is to maximize the long-run sum of all reward signals received

Basics of Data Science: Reinforcement Learning 9

Reinforcement Learning
(Figure: agent–environment loop. Given state $s_t$ and reward $r_t$, the agent emits action $a_t$; the environment returns $r_{t+1}$ and $s_{t+1}$.)

Basics of Data Science: Reinforcement Learning 10

RL Model Types
• Models with finite horizon:
  – optimization of the reward over h steps: $E\left[\sum_{t=0}^{h} r_t\right]$
  – non-stationary policy at the end of the time horizon
  – presumes a finite life-time of the agent
  – stationary policy, if h is a "floating" horizon
• Discounted models with infinite horizon:
  – optimization of the discounted reward over an infinite number of steps: $E\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$
• Models with average reward: $\lim_{h \to \infty} \frac{1}{h}\, E\left[\sum_{t=0}^{h} r_t\right]$

Basics of Data Science: Reinforcement Learning 11

Reinforcement Learning vs. Neighboring Domains
• Adaptive control:
  – the structure of the dynamic model is not to be changed
  – adaptation problems are reduced to parameter estimation of the control strategy
• Supervised learning (neural networks):
  – RL does not get training samples
  – a reinforcement system has to explore the environment to enhance its performance
  → exploration vs. exploitation trade-off

Basics of Data Science: Reinforcement Learning 12
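A minimal sketch of one common way to handle this trade-off, the ε-greedy rule used again in the Q-learning slides below; ε = 0.1 is an assumed value.

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon pick a random action (exploration),
    otherwise the currently best-valued action (exploitation)."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))  # explore
    return int(np.argmax(q_values))              # exploit

print(epsilon_greedy(np.array([0.2, 0.8, 0.5])))  # usually 1, sometimes random
```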
State-Value Function
The state-value function of an arbitrary policy π:
$V^\pi(s) = E_\pi\left[ R_t \mid s_t = s \right] = E_\pi\left[ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s \right]$

Basics of Data Science: Reinforcement Learning 13

Action-Value Function
The action-value function $Q^\pi$ of an arbitrary policy π:
$Q^\pi(s, a) = E_\pi\left[ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s, a_t = a \right]$

Basics of Data Science: Reinforcement Learning 14

Optimal State- and Action-Value Function
Optimal state-value function $V^*$: $V^*(s) = \max_\pi V^\pi(s)$
Optimal action-value function $Q^*$: $Q^*(s, a) = E\left[ r_{t+1} + \gamma V^*(s_{t+1}) \mid s_t = s, a_t = a \right]$
$V^*(s) = \max_a \left[ r(s, a) + \gamma \sum_{s'} T(s, a, s')\, V^*(s') \right]$

Basics of Data Science: Reinforcement Learning 15

Dynamic Programming
• Explore the decision tree by trial and error of all possibilities and find the best way
• Offline version: possible solutions are calculated ex ante and the strategy is stored in a look-up table
• Online version: new solution paths are explored and evaluated during "runtime"
• PROBLEM: exponential growth of the state space

Basics of Data Science: Reinforcement Learning 16

Value Iteration
Algorithm: Value-Iteration
  initialise V(s) arbitrarily
  iterate until the decision policy is good enough
    iterate for s ∈ S
      iterate for a ∈ A
        Q(s, a) := R(s, a) + γ Σ_{s' ∈ S} T(s, a, s') V(s')
      end
      V(s) := max_a Q(s, a)
    end
  end

Basics of Data Science: Reinforcement Learning 17

Policy Iteration
Algorithm: Policy-Iteration
  initialise the decision policy π' arbitrarily
  repeat
    π := π'
    calculate the value function of the decision policy π by solving the linear system of equations
      V_π(s) := R(s, π(s)) + γ Σ_{s' ∈ S} T(s, π(s), s') V_π(s')
    improve the decision policy for each state:
      π'(s) := argmax_a ( R(s, a) + γ Σ_{s' ∈ S} T(s, a, s') V_π(s') )
  until π = π'

Basics of Data Science: Reinforcement Learning 18

Monte Carlo Method
– learning via experience
– learning in episodes
– no total decision tree necessary
– generation of average returns for the determination of V(s)

Basics of Data Science: Reinforcement Learning 19

First-Visit Monte Carlo Method
• Generate an episode; choose a policy π
• Run through the whole episode
• Calculate the average return R for each V(s) visited
• Use all returns after the particular s in the episode
• In the next episode, calculate average returns only for those states not visited in prior episodes

Basics of Data Science: Reinforcement Learning 20

First-Visit Monte Carlo Method
(Example figure: two episode branches with returns $r_7 = 6$ and $r_8 = 9$ lead to state-value estimates V(s) = 4.34, V(s') = 5.5, V(s'') = 6 and V(s'') = 9.)

Basics of Data Science: Reinforcement Learning 21

Every-Visit Monte Carlo Method
• Generate an episode; choose a policy π
• Run through the whole episode
• Calculate the average return R for each V(s) visited
• Use all returns after the particular s in the episode
• In the next episode, update V(s) for all states visited, no matter whether they were visited before or not

Basics of Data Science: Reinforcement Learning 22

Monte Carlo Method
(Example figure: with returns $r_7 = 6$ and $r_8 = 9$, the old estimates V(s) = 4.34 and V(s') = 5.5 are updated to V(s) = 5 and V(s') = 6.5.)
Update rule: $V(s_t) \leftarrow V(s_t) + \alpha \left[ R_t - V(s_t) \right]$

Basics of Data Science: Reinforcement Learning 23

Temporal-Difference Learning
• Combines dynamic programming with the Monte Carlo method
• Uses episodes
• Uses estimates for V(s) at the beginning of the episode
• Corrects the estimated value of $V(s_t)$ via the sum of the immediate return and the state-value function of the following state
• The episode does not need to be completed for the calculation of the estimated values!
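A minimal sketch of the TD(0) update on a toy two-step episode; the transition format, α, and γ are illustrative assumptions.

```python
import numpy as np

def td0_episode(V, episode, alpha=0.2, gamma=1.0):
    """One TD(0) sweep: V(s_t) <- V(s_t) + alpha*[r_{t+1} + gamma*V(s_{t+1}) - V(s_t)].
    `episode` is a list of (state, reward, next_state) transitions (assumed)."""
    for s, r, s_next in episode:
        V[s] += alpha * (r + gamma * V[s_next] - V[s])
    return V

V = np.zeros(3)                        # states 0, 1, 2 (state 2 is terminal)
episode = [(0, 0.0, 1), (1, 5.0, 2)]   # toy episode ending with reward 5
for _ in range(20):                    # note: no completed returns are needed
    td0_episode(V, episode)
print(V)   # estimates move toward [5, 5, 0] for this deterministic toy chain
```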
Basics of Data Science: Reinforcement Learning 24

Temporal-Difference Learning
Update rule: $V(s_t) \leftarrow V(s_t) + \alpha \left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right]$

Basics of Data Science: Reinforcement Learning 25

Temporal-Difference Learning – Example
(Figure sequence, α = 0.2, rewards $r_7 = 2$ and $r_8 = 5$: repeated application of the update rule over episodes 1, 2, 3, …, 20, 21 propagates the rewards backwards through the state-value estimates $V^\pi(s_t)$, $V^\pi(s_{t+1})$, $V^\pi(s_{t+2})$.)

Basics of Data Science: Reinforcement Learning 26–33

On-/Off-Policy Methods
• On-policy method: the policy that generates the decisions and the policy used to estimate V(s) are identical
• Off-policy method: the action policy and the policy for updating the estimates are different

Basics of Data Science: Reinforcement Learning 34

Q-Learning, On-Policy
On-Policy Temporal-Difference Algorithm:
  initialize Q(s, a) arbitrarily
  repeat for each episode
    initialize s
    select a from s using a policy derived from Q
    repeat (for each step of the episode):
      perform action a and observe r, s'
      select a' from s' using a policy derived from Q
      Q(s, a) ← Q(s, a) + α [ r + γ Q(s', a') − Q(s, a) ]
      s ← s'; a ← a'
    until s is a terminal state

Basics of Data Science: Reinforcement Learning 35

Q-Learning, Off-Policy
Q-Learning: Off-Policy Temporal-Difference Learning
– The optimal path is not determined by updates of V(s), but by updates of Q(s, a)
– The action policy determines the path
– The estimation policy is used to update Q(s, a)
– The action policy is ε-greedy; the estimation policy is greedy
– Advantage: the global optimum is found with higher probability

Basics of Data Science: Reinforcement Learning 36

Q-Learning
Repeat for each episode:
1. Start from a given s
2. Choose an action a, starting from s, using the chosen behavioural policy, e.g., ε-greedy
3. Observe the reward r and the subsequent state s'
4. Update Q as follows: $Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$
5. Move from s to s'
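A minimal sketch of off-policy Q-learning (ε-greedy action policy, greedy estimation policy) on an assumed four-state chain; all parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny deterministic chain: states 0..3, state 3 is terminal (illustrative).
# Actions: 0 = left, 1 = right; reward 1 only when the terminal state is reached.
def step(s, a):
    s_next = min(s + 1, 3) if a == 1 else max(s - 1, 0)
    return s_next, float(s_next == 3)

Q = np.zeros((4, 2))
alpha, gamma, epsilon = 0.5, 0.9, 0.1

for _ in range(200):                      # episodes
    s = 0
    while s != 3:
        # epsilon-greedy action policy, greedy estimation policy (off-policy)
        a = int(rng.integers(2)) if rng.random() < epsilon else int(np.argmax(Q[s]))
        s_next, r = step(s, a)
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next

print(np.argmax(Q, axis=1))   # greedy action per state; states 0-2 learn "right"
```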
Basics of Data Science: Reinforcement Learning 37

Literature
• D.P. Bertsekas, J.N. Tsitsiklis: Neuro-Dynamic Programming, Athena Scientific, Belmont, MA, 1996
• M.L. Puterman: Markov Decision Processes: Discrete Stochastic Dynamic Programming, Wiley, New York, 1994
• R.S. Sutton, A.G. Barto: Reinforcement Learning: An Introduction, second edition, MIT Press, 2018

Basics of Data Science: Reinforcement Learning 38

Basics of Data Science
Decision Trees & Random Forests
Daniel Schermer
Technische Universität Kaiserslautern, Department of Business Studies & Economics, Chair of Business Information Systems & Operations Research
https://bisor.wiwi.uni-kl.de
18.07.2022

Daniel Schermer (TU Kaiserslautern) Basics of Data Science 18.07.2022 1 / 50

Introduction – Contents
1 Introduction
2 CART Algorithm
3 Simple Examples
4 Bias-Variance Tradeoff
5 Ensembles, Bagging & Random Forests
6 Advanced Examples
7 Conclusion

Daniel Schermer (TU Kaiserslautern) Basics of Data Science 18.07.2022 2 / 50

Introduction – Decision Trees – Motivation
Consider the following: whenever heart attack patients are admitted to a hospital, several variables are monitored. Based on these observations, decision trees allow us to construct simple rule-based systems.

Minimum systolic blood pressure over 24h period ≤ 91?
  yes → high risk
  no → Age ≤ 65?
    yes → low risk
    no → Sinus tachycardia present?
      yes → high risk
      no → low risk

Daniel Schermer (TU Kaiserslautern) Basics of Data Science 18.07.2022 3 / 50

Introduction – Decision Trees
It is convenient to introduce features and labels¹:
• x1: Date
• x2: Age
• x3: Height
• x4: Weight
• x5: Minimum systolic blood pressure, 24h
• x6: Sinus tachycardia present?
• y: 0 (low risk) or 1 (high risk)
The same tree in these terms: x5 ≤ 91? yes → 1; no → x2 ≤ 65? yes → 0; no → x6? yes → 1, no → 0.
Given a sample x = (22.07.2022, 25, 175cm, 70kg, 115, 0), what is the prognosis?
¹ A label can also be a numerical value, e.g., remaining life expectancy.

Daniel Schermer (TU Kaiserslautern) Basics of Data Science 18.07.2022 4 / 50

Introduction – Decision Trees
From a graph-theoretic perspective, a decision tree is a directed rooted tree with three main elements:
• the root (decision) node,
• (internal) decision nodes,
• leaf nodes.
(Figure: a rooted tree with a root node, internal decision nodes D1–D3, and leaf nodes L1–L5.)

Daniel Schermer (TU Kaiserslautern) Basics of Data Science 18.07.2022 5 / 50

Introduction – Decision Trees – Learning Sample L
A learning sample L consists of data:
$L = \{(x_1, y_1), \ldots, (x_n, y_n = f(x_n))\}$ where $x_i \in X$ and $y_i \in Y$.
We distinguish two general types of variables:
• A variable is called categorical if it takes values in a finite set with no natural ordering (e.g., color).
• A variable is called numerical if its values are real numbers (e.g., blood pressure, age).
Generally, $x_i$ is a vector consisting of one or more numerical or categorical features (variables). We assume the label $y_i$ to be either a numerical or a categorical (e.g., temperature, yes/no) variable.
A decision tree partitions L based on X to group similar parts of Y together. The CART algorithm² achieves such a partitioning recursively by finding splits θ greedily.
We define the cardinality of L as n, i.e., $|L| = |X| = |Y| = n$.
² Other noteworthy methods are ID3 or C4.5.
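Read as code, the rule-based system from the motivation slide might look as follows; the branch orientation is taken from the leaf labels shown above and should be treated as an illustration, not as the exact clinical rule.

```python
# Hedged sketch of the slide's tree as nested rules (branch orientation assumed).
def prognosis(age, min_systolic_bp_24h, sinus_tachycardia):
    """Return 1 (high risk) or 0 (low risk)."""
    if min_systolic_bp_24h <= 91:
        return 1                       # high risk
    if age <= 65:
        return 0                       # low risk
    return 1 if sinus_tachycardia else 0

# The sample from the slide: x = (22.07.2022, 25, 175cm, 70kg, 115, 0)
print(prognosis(age=25, min_systolic_bp_24h=115, sinus_tachycardia=0))  # 0: low risk
```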
Daniel Schermer (TU Kaiserslautern) Basics of Data Science 18.07.2022 6 / 50

CART Algorithm
1 Introduction
2 CART Algorithm
3 Simple Examples
4 Bias-Variance Tradeoff
5 Ensembles, Bagging & Random Forests
6 Advanced Examples
7 Conclusion

Daniel Schermer (TU Kaiserslautern) Basics of Data Science 18.07.2022 7 / 50

CART Algorithm – Classification and Regression Trees (CART) – Mathematical Formulation
We differentiate two cases when splitting L on a feature f.
For a categorical feature $f_c$: let S be the set of all values that the variables $x_i \in L$ exhibit on feature $f_c$, and let T be a subset of S, i.e., $T \subset S$. Then $\theta(L, f_c, T)$ splits L as follows:
$L^{left}$ contains all $(x_i, y_i) \in L$ for which $x_i(f_c) \in T$ (1)
$L^{right}$ contains all $(x_i, y_i) \in L$ for which $x_i(f_c) \in S \setminus T$ (2)
For a numerical feature $f_n$: $\theta = (L, f_n, t)$ contains a threshold t such that L is split as follows:
$L^{left}$ contains all $(x_i, y_i) \in L$ for which $x_i(f_n) \le t$ (3)
$L^{right}$ contains all $(x_i, y_i) \in L$ for which $x_i(f_n) > t$ (4)

Daniel Schermer (TU Kaiserslautern) Basics of Data Science 18.07.2022 8 / 50

CART Algorithm – Mathematical Formulation
The goodness of a split $G(L, \theta)$ is computed using an impurity or loss function H(·), the choice of which depends on the task being solved (see Slide 10):
$G(L, \theta) = \frac{n_{left}}{n} H(L^{left}(\theta)) + \frac{n_{right}}{n} H(L^{right}(\theta))$ (5)
When learning decision trees, we want to minimize the loss function, i.e., we want to find the best split $\theta^*$ such that the goodness G(·) implied by the partitioning into $L^{left}$ and $L^{right}$ is minimal:
$\theta^* = \arg\min_\theta G(L, \theta)$ (6)
After the first split $\theta^*$, we can recurse for the newly created $L^{left}$ and $L^{right}$ until a termination criterion is met (or no reduction in impurity is possible).

Daniel Schermer (TU Kaiserslautern) Basics of Data Science 18.07.2022 9 / 50

CART Algorithm – Commonly Used Loss Functions
If the target is a classification with values $\{k_1, \ldots, k_K\}$ and $p_k$ is the frequency with which class k occurs in L, then the Gini impurity is defined as follows:
$H(L) = 1 - \sum_{k=1}^{K} p_k^2$ (7)
If the target is a continuous value, then the Mean Squared Error (MSE) is defined as follows:
$\bar{y} = \frac{1}{n} \sum_{y_i \in L} y_i$ (8)
$H(L) = \frac{1}{n} \sum_{y_i \in L} (y_i - \bar{y})^2$ (9)

Daniel Schermer (TU Kaiserslautern) Basics of Data Science 18.07.2022 10 / 50

CART Algorithm – Recap
To recap, we now have all components of the CART algorithm:
1 A procedure to partition L based on θ for categorical or numerical features f (see Slide 8).
2 A general measure to assess the goodness of a split based on loss functions (see Slide 9).
3 We use the Gini impurity (classification) and the MSE (regression) as loss functions (see Slide 10).
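Equations (5) and (7) translate directly into a few lines of Python; as a usage example, the snippet reproduces the Wind split computed on the next slide.

```python
from collections import Counter

def gini(labels):
    """Gini impurity H(L) = 1 - sum_k p_k^2 (Eq. 7)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def goodness(left_labels, right_labels):
    """Goodness of a split G(L, theta) (Eq. 5): size-weighted impurities."""
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n * gini(left_labels)
            + len(right_labels) / n * gini(right_labels))

# Usage: the Wind split from the next slide, T = {Strong}.
strong = ["No", "No", "Yes", "Yes", "Yes", "No"]                 # days 2, 6, 10, 11, 12, 14
weak = ["No", "Yes", "Yes", "Yes", "Yes", "No", "Yes", "Yes"]    # the remaining days
print(round(goodness(strong, weak), 2))   # 0.43, matching the slide
```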
Daniel Schermer (TU Kaiserslautern) Basics of Data Science 18.07.2022 11 / 50

Simple Examples
1 Introduction
2 CART Algorithm
3 Simple Examples
4 Bias-Variance Tradeoff
5 Ensembles, Bagging & Random Forests
6 Advanced Examples
7 Conclusion

Daniel Schermer (TU Kaiserslautern) Basics of Data Science 18.07.2022 12 / 50

Classification Tree — Example: Learning Sample L
(columns Outlook, Temperature, Humidity, Wind are the features; Play Tennis is the label)

Day | Outlook  | Temperature | Humidity | Wind   | Play Tennis
 1  | Sunny    | Hot         | High     | Weak   | No
 2  | Sunny    | Hot         | High     | Strong | No
 3  | Overcast | Hot         | High     | Weak   | Yes
 4  | Rain     | Mild        | High     | Weak   | Yes
 5  | Rain     | Cool        | Normal   | Weak   | Yes
 6  | Rain     | Cool        | Normal   | Strong | No
 7  | Overcast | Cool        | Normal   | Weak   | Yes
 8  | Sunny    | Mild        | High     | Weak   | No
 9  | Sunny    | Cool        | Normal   | Weak   | Yes
10  | Rain     | Mild        | Normal   | Strong | Yes
11  | Sunny    | Mild        | Normal   | Strong | Yes
12  | Overcast | Mild        | High     | Strong | Yes
13  | Overcast | Hot         | Normal   | Weak   | Yes
14  | Rain     | Mild        | High     | Strong | No

Daniel Schermer (TU Kaiserslautern) Basics of Data Science 18.07.2022 13 / 50

Classification Tree — Example: Wind as root decision node
We need to test every possible split to find $\theta^*$. We start (arbitrarily) with the feature f = Wind.
$\theta_1(L, \text{Wind}, T = \{\text{Strong}\})$ yields $T = \{\text{Strong}\} \Rightarrow S \setminus T = \{\text{Weak}\}$:
$H(L^{left}) = 1 - (3/6)^2 - (3/6)^2 = 0.50$
$H(L^{right}) = 1 - (6/8)^2 - (2/8)^2 = 0.38$
$G(L, \theta_1) = \frac{6}{14} \cdot 0.50 + \frac{8}{14} \cdot 0.38 = 0.43$
Resulting split: Wind = Strong: 3 Yes, 3 No; Wind = Weak: 6 Yes, 2 No.

Daniel Schermer (TU Kaiserslautern) Basics of Data Science 18.07.2022 14 / 50

Classification Tree — Example: Humidity as root decision node
We continue (arbitrarily) with the feature Humidity.
$\theta_2(L, \text{Humidity}, T = \{\text{High}\})$ yields $T = \{\text{High}\} \Rightarrow S \setminus T = \{\text{Normal}\}$:
$H(L^{left}) = 1 - (3/7)^2 - (4/7)^2 = 0.49$
$H(L^{right}) = 1 - (6/7)^2 - (1/7)^2 = 0.24$
$G(L, \theta_2) = \frac{7}{14} \cdot 0.49 + \frac{7}{14} \cdot 0.24 = 0.37$
Resulting split: Humidity = High: 3 Yes, 4 No; Humidity = Normal: 6 Yes, 1 No.

Daniel Schermer (TU Kaiserslautern) Basics of Data Science 18.07.2022 15 / 50

Classification Tree — Example: Outlook or Temperature as root decision node
Using Outlook and Temperature as candidate features is more complex:
If we split L on Outlook (S = {Sunny, Overcast, Rain}), we can build the following subsets T:
▶ Option 1: T = {Sunny, Overcast} and S \ T = {Rain}
▶ Option 2: T = {Sunny, Rain} and S \ T = {Overcast}
▶ Option 3: T = {Overcast, Rain} and S \ T = {Sunny}
If we split L on Temperature (S = {Cool, Mild, Hot}), we can build the following subsets T:
▶ Option 1: T = {Cool, Mild} and S \ T = {Hot}
▶ Option 2: T = {Mild, Hot} and S \ T = {Cool}
▶ Option 3: T = {Cool, Hot} and S \ T = {Mild}
We investigate all of these options on the following slides (the sketch below enumerates them programmatically).
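A short sketch that enumerates every categorical split θ(L, f, T) on the learning sample and prints its goodness; it should reproduce the values derived by hand on the surrounding slides. Note that T and its complement describe the same partition and therefore appear twice.

```python
from itertools import combinations
from collections import Counter

days = [  # (Outlook, Temperature, Humidity, Wind, PlayTennis)
    ("Sunny","Hot","High","Weak","No"), ("Sunny","Hot","High","Strong","No"),
    ("Overcast","Hot","High","Weak","Yes"), ("Rain","Mild","High","Weak","Yes"),
    ("Rain","Cool","Normal","Weak","Yes"), ("Rain","Cool","Normal","Strong","No"),
    ("Overcast","Cool","Normal","Weak","Yes"), ("Sunny","Mild","High","Weak","No"),
    ("Sunny","Cool","Normal","Weak","Yes"), ("Rain","Mild","Normal","Strong","Yes"),
    ("Sunny","Mild","Normal","Strong","Yes"), ("Overcast","Mild","High","Strong","Yes"),
    ("Overcast","Hot","Normal","Weak","Yes"), ("Rain","Mild","High","Strong","No"),
]

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

features = {"Outlook": 0, "Temperature": 1, "Humidity": 2, "Wind": 3}
for name, f in features.items():
    values = sorted({d[f] for d in days})
    for size in range(1, len(values)):          # all proper subsets T of S
        for T in combinations(values, size):
            left = [d[4] for d in days if d[f] in T]
            right = [d[4] for d in days if d[f] not in T]
            G = len(left) / 14 * gini(left) + len(right) / 14 * gini(right)
            print(f"{name}, T={set(T)}: G = {G:.2f}")
```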
Daniel Schermer (TU Kaiserslautern) Basics of Data Science 18.07.2022 16 / 50

Classification Tree — Example: Outlook as root decision node, Option 1
We continue with Outlook.
$\theta_3(L, \text{Outlook}, T = \{\text{Sunny, Overcast}\})$ yields $S \setminus T = \{\text{Rain}\}$:
$H(L^{left}) = 1 - (6/9)^2 - (3/9)^2 = 0.44$
$H(L^{right}) = 1 - (3/5)^2 - (2/5)^2 = 0.48$
$G(L, \theta_3) = \frac{9}{14} \cdot 0.44 + \frac{5}{14} \cdot 0.48 = 0.45$
Resulting split: Sunny, Overcast: 6 Yes, 3 No; Rain: 3 Yes, 2 No.

Daniel Schermer (TU Kaiserslautern) Basics of Data Science 18.07.2022 17 / 50

Classification Tree — Example: Outlook as root decision node, Option 2
We continue with Outlook.
$\theta_4(L, \text{Outlook}, T = \{\text{Sunny, Rain}\})$ yields $S \setminus T = \{\text{Overcast}\}$:
$H(L^{left}) = 1 - (5/10)^2 - (5/10)^2 = 0.50$
$H(L^{right}) = 1 - (4/4)^2 - (0/4)^2 = 0.00$
$G(L, \theta_4) = \frac{10}{14} \cdot 0.50 + \frac{4}{14} \cdot 0.00 = 0.36$
Resulting split: Sunny, Rain: 5 Yes, 5 No; Overcast: 4 Yes, 0 No.

Daniel Schermer (TU Kaiserslautern) Basics of Data Science 18.07.2022 18 / 50

Classification Tree — Example: Outlook as root decision node, Option 3
We continue with Outlook.
$\theta_5(L, \text{Outlook}, T = \{\text{Overcast, Rain}\})$ yields $S \setminus T = \{\text{Sunny}\}$:
$H(L^{left}) = 1 - (7/9)^2 - (2/9)^2 = 0.35$
$H(L^{right}) = 1 - (2/5)^2 - (3/5)^2 = 0.48$
$G(L, \theta_5) = \frac{9}{14} \cdot 0.35 + \frac{5}{14} \cdot 0.48 = 0.40$
Resulting split: Overcast, Rain: 7 Yes, 2 No; Sunny: 2 Yes, 3 No.

Daniel Schermer (TU Kaiserslautern) Basics of Data Science 18.07.2022 19 / 50

Classification Tree — Example: Temperature as root decision node, Option 1
We continue with Temperature.
$\theta_6(L, \text{Temperature}, T = \{\text{Cool, Mild}\})$ yields $S \setminus T = \{\text{Hot}\}$:
$H(L^{left}) = 1 - (7/10)^2 - (3/10)^2 = 0.42$
$H(L^{right}) = 1 - (2/4)^2 - (2/4)^2 = 0.50$
$G(L, \theta_6) = \frac{10}{14} \cdot 0.42 + \frac{4}{14} \cdot 0.50 = 0.44$
Resulting split: Cool, Mild: 7 Yes, 3 No; Hot: 2 Yes, 2 No.

Daniel Schermer (TU Kaiserslautern) Basics of Data Science 18.07.2022 20 / 50

Classification Tree — Example: Temperature as root decision node, Option 2
We continue with Temperature.
$\theta_7(L, \text{Temperature}, T = \{\text{Mild, Hot}\})$ yields $S \setminus T = \{\text{Cool}\}$:
$H(L^{left}) = 1 - (6/10)^2 - (4/10)^2 = 0.48$
$H(L^{right}) = 1 - (3/4)^2 - (1/4)^2 = 0.38$
$G(L, \theta_7) = \frac{10}{14} \cdot 0.48 + \frac{4}{14} \cdot 0.38 = 0.45$
Resulting split: Mild, Hot: 6 Yes, 4 No; Cool: 3 Yes, 1 No.

Daniel Schermer (TU Kaiserslautern) Basics of Data Science 18.07.2022 21 / 50

Classification Tree — Example: Temperature as root decision node, Option 3
We continue with Temperature.
$\theta_8(L, \text{Temperature}, T = \{\text{Cool, Hot}\})$ yields $S \setminus T = \{\text{Mild}\}$:
$H(L^{left}) = 1 - (5/8)^2 - (3/8)^2 = 0.47$
$H(L^{right}) = 1 - (4/6)^2 - (2/6)^2 = 0.44$
$G(L, \theta_8) = \frac{8}{14} \cdot 0.47 + \frac{6}{14} \cdot 0.44 = 0.46$
Resulting split: Cool, Hot: 5 Yes, 3 No; Mild: 4 Yes, 2 No.

Daniel Schermer (TU Kaiserslautern) Basics of Data Science 18.07.2022 22 / 50

Classification Tree — Example: First split
We select Outlook for our first (root) decision node and use the split $\theta_4(L, \text{Outlook}, T = \{\text{Sunny, Rain}\})$ because:
$\theta^* = \theta_4 = \arg\min_\theta G(L, \theta)$
Currently, we have two leaf nodes: Sunny, Rain (5 Yes, 5 No) and Overcast (4 Yes, 0 No). For each node that does not perfectly classify the labels (H ≠ 0), we can recurse the procedure. This yields the decision tree shown on Slide 24.
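For comparison, a scikit-learn sketch that fits a CART tree on the same learning sample. Since sklearn requires numeric input, the categorical features are one-hot encoded here; the learned binary splits (e.g., on a dummy column such as Outlook_Overcast) therefore need not coincide node-for-node with the hand-built tree.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

data = pd.DataFrame({
    "Outlook":     ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Rain", "Overcast",
                    "Sunny", "Sunny", "Rain", "Sunny", "Overcast", "Overcast", "Rain"],
    "Temperature": ["Hot", "Hot", "Hot", "Mild", "Cool", "Cool", "Cool", "Mild",
                    "Cool", "Mild", "Mild", "Mild", "Hot", "Mild"],
    "Humidity":    ["High", "High", "High", "High", "Normal", "Normal", "Normal",
                    "High", "Normal", "Normal", "Normal", "High", "Normal", "High"],
    "Wind":        ["Weak", "Strong", "Weak", "Weak", "Weak", "Strong", "Weak",
                    "Weak", "Weak", "Strong", "Strong", "Strong", "Weak", "Strong"],
    "PlayTennis":  ["No", "No", "Yes", "Yes", "Yes", "No", "Yes", "No", "Yes",
                    "Yes", "Yes", "Yes", "Yes", "No"],
})
X = pd.get_dummies(data.drop(columns="PlayTennis"))  # one-hot encode categories
clf = DecisionTreeClassifier(criterion="gini").fit(X, data["PlayTennis"])
print(export_text(clf, feature_names=list(X.columns)))
```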
Daniel Schermer (TU Kaiserslautern) Basics of Data Science 18.07.2022 23 / 50

Classification Tree — After many more splits …
(Final tree: Outlook — Overcast → Yes; Sunny, Rain → Humidity. Humidity = High → Outlook: Sunny → No; Rain → Wind: Weak → Yes, Strong → No. Humidity = Normal → Temperature: Hot, Mild → Yes; Cool → Wind: Weak → Yes, Strong → No.)

Daniel Schermer (TU Kaiserslautern) Basics of Data Science 18.07.2022 24 / 50

Regression Tree — Example
We have the following learning sample L = {X, Y}:
X = {x1 = 0, x2 = 1, x3 = 2, x4 = 3, x5 = 4}
Y = {y1 = 0, y2 = 1, y3 = 2, y4 = 1, y5 = 1}, with $y_i = f(x_i)$
We introduced the following partitioning scheme (Slide 8):
$L^{left}$ contains all $(x_i, y_i) \in L$ for which $x_i(f_n) \le t$
$L^{right}$ contains all $(x_i, y_i) \in L$ for which $x_i(f_n) > t$
If the samples are sorted in ascending order for feature f, a common idea is to consider all midpoint thresholds:
$t_1 = \frac{x_1(f) + x_2(f)}{2}, \quad t_2 = \frac{x_2(f) + x_3(f)}{2}, \quad \ldots$
Here, we have just a single feature (the x value).
(Figure: the five sample points in the x-y plane.)

Daniel Schermer (TU Kaiserslautern) Basics of Data Science 18.07.2022 25 / 50

Regression Tree — Example: All relevant thresholds
The first threshold is $t_1 = \frac{x_1 + x_2}{2} = 0.5$:
$L^{left} = \{x_1, y_1\}$: $\bar{y}^{left} = \frac{1}{1} \sum_{i=1}^{1} y_i = 0$, $H(L^{left}) = \frac{1}{1} \sum_{i=1}^{1} (y_i - \bar{y}^{left})^2 = 0$
$L^{right} = \{x_2, \ldots, x_5, y_2, \ldots, y_5\}$: $\bar{y}^{right} = \frac{1}{4} \sum_{i=2}^{5} y_i = 1.25$, $H(L^{right}) = \frac{1}{4} \sum_{i=2}^{5} (y_i - \bar{y}^{right})^2 = 0.1875$
$G = \frac{1}{5} \cdot 0 + \frac{4}{5} \cdot 0.1875 = 0.15$
The second threshold is $t_2 = \frac{x_2 + x_3}{2} = 1.5$:
$L^{left} = \{x_1, x_2, y_1, y_2\}$: $\bar{y}^{left} = 0.5$, $H(L^{left}) = 0.25$
$L^{right} = \{x_3, x_4, x_5, y_3, y_4, y_5\}$: $\bar{y}^{right} = 1.33$, $H(L^{right}) = 0.2222$
$G = \frac{2}{5} \cdot 0.25 + \frac{3}{5} \cdot 0.2222 = 0.2333$

Daniel Schermer (TU Kaiserslautern) Basics of Data Science 18.07.2022 26 / 50
Regression Tree — Example: All relevant thresholds
The third threshold is $t_3 = \frac{x_3 + x_4}{2} = 2.5$:
$L^{left} = \{x_1, x_2, x_3, y_1, y_2, y_3\}$: $\bar{y}^{left} = 1$, $H(L^{left}) = \frac{2}{3}$
$L^{right} = \{x_4, x_5, y_4, y_5\}$: $\bar{y}^{right} = 1$, $H(L^{right}) = 0$
$G = \frac{3}{5} \cdot \frac{2}{3} + \frac{2}{5} \cdot 0 = 0.4$
The fourth threshold is $t_4 = \frac{x_4 + x_5}{2} = 3.5$:
$L^{left} = \{x_1, \ldots, x_4, y_1, \ldots, y_4\}$: $\bar{y}^{left} = 1$, $H(L^{left}) = \frac{1}{2}$
$L^{right} = \{x_5, y_5\}$: $\bar{y}^{right} = 1$, $H(L^{right}) = 0$
$G = \frac{4}{5} \cdot \frac{1}{2} + \frac{1}{5} \cdot 0 = 0.4$

Daniel Schermer (TU Kaiserslautern) Basics of Data Science 18.07.2022 27 / 50

Regression Tree — Example: First Iteration
The optimal split $\theta^*$ is $\theta(L, 0.5)$. We see that the tree provides a piecewise-constant approximation by using axis-aligned splits.
(Figure: root node with 5 samples, ȳ = 1, MSE = 0.4, split x ≤ 0.5; left leaf: 1 sample, ȳ = 0, MSE = 0; right leaf: 4 samples, ȳ = 1.25, MSE = 0.188.)

Daniel Schermer (TU Kaiserslautern) Basics of Data Science 18.07.2022 28 / 50

Regression Tree — Example: After many more splits …
The procedure can be recursed, yielding an improved piecewise-constant approximation after each iteration.
(Figure: the fully grown tree splits further at x ≤ 2.5 and x ≤ 1.5, until every leaf is pure: MSE = 0 in all leaves.)

Daniel Schermer (TU Kaiserslautern) Basics of Data Science 18.07.2022 29 / 50

Bias-Variance Tradeoff
1 Introduction
2 CART Algorithm
3 Simple Examples
4 Bias-Variance Tradeoff
5 Ensembles, Bagging & Random Forests
6 Advanced Examples
7 Conclusion

Daniel Schermer (TU Kaiserslautern) Basics of Data Science 18.07.2022 30 / 50

Bias-Variance Tradeoff – High-Level Overview
Bias: bias measures the average amount by which the predictions of a model $\hat{y}_i$ differ from the true value $y_i$.
• Low bias: weak assumptions regarding the functional relationship between the input and the output.
• High bias: strong assumptions regarding the functional relationship between the input and the output.
Variance: variance measures the variability of the predictions when a model is learnt over different L.
• Low variance ⇒ small changes in L cause small changes in the model.
• High variance ⇒ small changes in L cause large changes in the model.

Daniel Schermer (TU Kaiserslautern) Basics of Data Science 18.07.2022 31 / 50

Bias-Variance Tradeoff – High-Level Overview
In general, the trade-off between bias and variance is non-trivial!
(Figure: total error = bias² + variance as a function of model complexity; the optimum model complexity lies at the minimum of the total error.)

Daniel Schermer (TU Kaiserslautern) Basics of Data Science 18.07.2022 32 / 50

Bias-Variance Tradeoff – Decision Trees & Example
Decision trees have low bias and high variance:
▶ We make (almost) no assumption about the functional relationship underlying L.
▶ A small change in L can lead to a completely different decision tree.
▶ This puts decision trees at risk of overfitting (not generalizing well)!
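Both behaviours can be seen on the regression example from above with scikit-learn: a depth-limited tree reproduces the single hand-computed split, while an unrestricted tree memorizes the learning sample (zero training error, i.e., the high-variance, overfitting behaviour described here).

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([[0], [1], [2], [3], [4]], dtype=float)
y = np.array([0, 1, 2, 1, 1], dtype=float)

# Depth-1 stump: reproduces the single best split found by hand (x <= 0.5).
stump = DecisionTreeRegressor(max_depth=1).fit(X, y)
print(stump.tree_.threshold[0])   # 0.5

# Unrestricted tree: zero training error, i.e., it memorizes the sample.
full = DecisionTreeRegressor().fit(X, y)
print(full.predict(X))            # exactly [0, 1, 2, 1, 1]
```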
A more intuitive example may be fitting a polynomial function of degree d on a learning sample.

Daniel Schermer (TU Kaiserslautern) Basics of Data Science 18.07.2022 33 / 50

Bias-Variance Tradeoff – Decision Trees & Example
There are several ways to address the bias-variance tradeoff for tree-based learners:
• Pre-regularization (pre-pruning): stop growing the tree prematurely.
  ▶ Minimum number of samples required for each split or leaf.
  ▶ Maximum depth of the tree.
  ▶ Goodness required to make a new split.
• Post-regularization (post-pruning): grow a full tree, then prune it.
• Ensembling, bagging & random forests.

Daniel Schermer (TU Kaiserslautern) Basics of Data Science 18.07.2022 34 / 50

Ensembles, Bagging & Random Forests
1 Introduction
2 CART Algorithm
3 Simple Examples
4 Bias-Variance Tradeoff
5 Ensembles, Bagging & Random Forests
6 Advanced Examples
7 Conclusion

Daniel Schermer (TU Kaiserslautern) Basics of Data Science 18.07.2022 35 / 50

Ensembles – Introduction
Some of the most powerful machine learning models are ensemble methods. An ensemble combines two or more base predictors, aiming to create a more powerful model. We can distinguish two types of approaches:
▶ Averaging methods (bagging, random forests, …)
▶ Boosting methods (AdaBoost, gradient boosted decision trees, …)

Daniel Schermer (TU Kaiserslautern) Basics of Data Science 18.07.2022 36 / 50

Bagging – Bootstrapping & Aggregating
Bagging (bootstrapping and aggregating):
Bootstrapping: randomly sample with replacement from L until we have a new learning sample L̃ of the same size. Iterate this procedure B times, until we have bootstrapped learning samples $\tilde{L}_1, \ldots, \tilde{L}_B$. Learn B predictors (e.g., decision trees, using CART), one for each of $\tilde{L}_1, \ldots, \tilde{L}_B$.
Aggregating:
• For classification: the predicted label is the majority vote amongst the B predictors.
• For regression: the predicted value is the average amongst the values predicted by the B predictors.

Daniel Schermer (TU Kaiserslautern) Basics of Data Science 18.07.2022 37 / 50

Bagging – Bootstrapping & Aggregating
Consider our previous regression example where we had:
L = {(x1, y1), (x2, y2), (x3, y3), (x4, y4), (x5, y5)} = {(0, 0), (1, 1), (2, 2), (3, 1), (4, 1)}
Random sampling with replacement might yield the following B bootstrap samples:
$\tilde{L}_1$ = {(x3, y3), (x4, y4), (x3, y3), (x1, y1), (x2, y2)}
$\tilde{L}_2$ = {(x4, y4), (x5, y5), (x5, y5), (x2, y2), (x1, y1)}
…
$\tilde{L}_B$ = {(x1, y1), (x2, y2), (x4, y4), (x5, y5), (x3, y3)}

Daniel Schermer (TU Kaiserslautern) Basics of Data Science 18.07.2022 38 / 50

Bagging – Bootstrapping & Aggregating
We call the arrangement of B trees that results from bagging a decision forest.
(Figure: the learning sample L is bootstrapped into B samples, one tree is learnt per sample, and the trees' outputs are aggregated into the prediction: mean in regression, majority vote in classification.)

Daniel Schermer (TU Kaiserslautern) Basics of Data Science 18.07.2022 39 / 50

Random Forests – Overview
Random forest ⇔ a decision forest that is constructed with extra randomness³:
▶ In principle, a random forest is grown using CART.
▶ However, whenever we look for a split in a tree, we only consider a random subset of the features.
▶ Generally, this random subspace is very small.
Random forests typically perform more favorably than a decision forest. However, they also come with additional hyperparameters!
³ Injecting the right amount of randomness is not trivial!
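A compact scikit-learn sketch contrasting a single tree, a bagged decision forest, and a random forest on a synthetic classification task; the data set and hyperparameters are illustrative, so the exact scores will vary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

models = {
    # Single CART tree: low bias, high variance.
    "single tree": DecisionTreeClassifier(random_state=0),
    # Decision forest: bagged trees, every split sees all features.
    "decision forest": BaggingClassifier(DecisionTreeClassifier(),
                                         n_estimators=100, random_state=0),
    # Random forest: bagged trees plus a random feature subset at each split.
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
for name, model in models.items():
    print(name, cross_val_score(model, X, y, cv=5).mean().round(3))
```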
Daniel Schermer (TU Kaiserslautern) Basics of Data Science 18.07.2022 40 / 50

Random Forests – Example
Consider our previous classification example. Bootstrapping may yield B learning samples:
L̃1 = {Day 5, Day 2, Day 11, …}, L̃2 = {Day 8, Day 1, Day 4, …}, …
When training T1 on L̃1, we might consider only, e.g., Outlook and Wind for the first split (selected randomly).
When training T2 on L̃2, we might consider only, e.g., Outlook and Temperature for the first split (selected randomly).
This procedure can be repeated until tree TB, and then recursed for each tree.

Daniel Schermer (TU Kaiserslautern) Basics of Data Science 18.07.2022 41 / 50

Advanced Examples
1 Introduction
2 CART Algorithm
3 Simple Examples
4 Bias-Variance Tradeoff
5 Ensembles, Bagging & Random Forests
6 Advanced Examples
7 Conclusion

Daniel Schermer (TU Kaiserslautern) Basics of Data Science 18.07.2022 42 / 50

Advanced Regression Example
Assume that we have the following function:
$f(x) = e^{-x^2} + 1.5\, e^{-(x-2)^2}, \quad x \in [-5, 5]$
with noise coming from $N\{\mu, \sigma^2\} = N\{0, 0.01\}$.
We do the following:
• Draw 200 learning samples $L_1, \ldots, L_{200}$ from the noisy f(x).
• Use $L_1, \ldots, L_{100}$ for training 100 decision trees and forests.
• We compare the average performance of the learnt trees and forests versus a hypothetical regressor that best matches (on average) the remaining learning samples $L_{101}, \ldots, L_{200}$.
(Figure: the function f(x) on [-5, 5].)

Daniel Schermer (TU Kaiserslautern) Basics of Data Science 18.07.2022 43 / 50

(Figure: two panels, Decision Tree vs. Decision Forest, each showing f(x) and the average prediction $E_L\, \hat{y}(x)$, together with the decomposition into error(x), bias²(x), and variance(x).)

Advanced Classification Example
We have L where each $x_i$ is an 8-by-8-pixel greyscale matrix and each $y_i \in \{0, \ldots, 9\}$; n = 1797.
This is a straightforward classification task:
• We have $8^2 = 64$ features (corresponding to each pixel) with greyscale values in [0, 1].
• We have a single label $y_i \in \{0, \ldots, 9\}$.
On the following slide we compare the performance of a Decision Tree, a Decision Forest (100 trees), and a Random Forest (100 trees).
▶ We use 1437 samples (≈ 80%) for learning the classifiers.
▶ We use 360 samples (≈ 20%) for testing the classifiers.
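The comparison on the next slide can be reproduced along these lines with scikit-learn; the random split means the exact accuracies will differ somewhat from the values reported there.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)   # 1797 samples, 64 pixel features
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

for name, clf in [
    ("Decision Tree", DecisionTreeClassifier(random_state=0)),
    ("Decision Forest", BaggingClassifier(DecisionTreeClassifier(),
                                          n_estimators=100, random_state=0)),
    ("Random Forest", RandomForestClassifier(n_estimators=100, random_state=0)),
]:
    print(name, round(clf.fit(X_tr, y_tr).score(X_te, y_te), 3))
```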
Advanced Classification Example
[Figure: confusion matrices on the 360 test samples for the Decision Tree, the Decision Forest, and the Random Forest; rows are the true labels 0-9, columns are the predicted labels 0-9.]
Accuracies: 79.0% (Decision Tree), 88.3% (Decision Forest), 93.3% (Random Forest).

Conclusion

Decision Trees
Advantages:
▶ Simple to understand and interpret.
▶ Require (almost) no data preparation.
▶ Require (almost) no hyperparameters.
Disadvantages:
▶ Prone to overfitting (high variance).
▶ Regression trees are weak at extrapolation.
▶ Unstable (a change in $L$ yields a different tree).

Decision Forests & Random Forests
Advantages:
▶ Powerful and typically more accurate.
▶ Require (almost) no data preparation.
▶ Several trees make the forest stable.
Disadvantages:
▶ No longer easily interpretable.
▶ Computationally more expensive.
▶ More hyperparameters than decision trees.

Recommended Web Resources & Python Code
https://scikit-learn.org/stable/modules/tree.html
https://scikit-learn.org/stable/auto_examples/tree/plot_unveil_tree_structure.html
https://scikit-learn.org/stable/auto_examples/tree/plot_tree_regression.html
https://scikit-learn.org/stable/auto_examples/ensemble/plot_bias_variance.html
https://scikit-learn.org/stable/auto_examples/classification/plot_digits_classification.html
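As a pointer into the resources above, the following is a minimal sketch, assuming scikit-learn, of inspecting a fitted tree's structure as readable rules (cf. the interpretability advantage in the conclusion); the Iris data set and max_depth = 2 are our own choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# export_text prints the learned splits as nested if/else rules.
print(export_text(tree, feature_names=iris.feature_names))
```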
Literature
Breiman, Leo (1996). "Bagging Predictors". In: Machine Learning 24.2, pp. 123-140.
Breiman, Leo (2001). "Random Forests". In: Machine Learning 45.1, pp. 5-32.
Breiman, Leo et al. (1984). Classification and Regression Trees. 1st ed. Routledge.
Freund, Yoav and Robert E. Schapire (1997). "A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting". In: Journal of Computer and System Sciences 55.1, pp. 119-139.
Hastie, Trevor, Robert Tibshirani, and J. H. Friedman (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. Springer Series in Statistics. New York, NY: Springer.
Ho, Tin Kam (1998). "The Random Subspace Method for Constructing Decision Forests". In: IEEE Transactions on Pattern Analysis and Machine Intelligence 20.8, pp. 832-844.
Mitchell, Tom M. (1997). Machine Learning. McGraw-Hill Series in Computer Science. New York: McGraw-Hill.
Pedregosa, F. et al. (2011). "Scikit-learn: Machine Learning in Python". In: Journal of Machine Learning Research 12, pp. 2825-2830.
Quinlan, J. R. (1986). "Induction of Decision Trees". In: Machine Learning 1.1, pp. 81-106.
scikit-learn: Machine Learning in Python — scikit-learn 1.1.1 documentation (2022).