Prof. Dr. Oliver Wendt
Dr. habil. Mahdi Moeini, M.Sc. Manuel Hermes
Business Information Systems & Operations Research
Basics of Data Science
Summer Semester 2022
Technische Universität Kaiserslautern
www.bisor.de
Organizational Stuff
General
• Part 0: Data -> Information -> Knowledge: Concepts in Science and Engineering
• Part 1: Organizing the “Data Lake” (from data mining to data fishing)
• Part 2: Stochastic Models on structured attribute data
• Part 3: Getting Ready for the Digital Twin
• Lectures and exercises (no tutorials).
• OLAT PW:
Summary:
In this section, we will see:
• What is Data Science?
• What is/are Data/Data Set?
• Sources of Data
• Ecosystem of Data Science
• Legal, Ethical, and Social Issues
• Tasks of Data Science
What is Data Science?
• Explosion of data:
o Social networks,
o Internet of Things, ….
• Heterogeneity:
o big and unstructured data that might be noisy, ….
• Technologies:
o huge storage capacity, clouds,
o computing power,
o algorithms,
o statistics and computation techniques
What is Data Science?
trends.google.com (accessed March 29, 2022)
What is Data Science?
Data Science vs. Machine Learning vs. Big Data
• Common point: focus on improving decision making through analyzing data
• Difference:
o Machine Learning: focus on algorithms
o Big Data: focus on structured data
o Data Science: focus on unstructured data
What is Data Science?
Data Science Venn Diagram
http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
Jake Vander Plas. Python Data Science Handbook. O'Reilly Media, (2016)
What is Data Science?
A First Try to Define Data Science
“A data scientist is someone who is better at statistics than any
software engineer and better at software engineering than any
statistician.”
Josh Wills, Director of Data Science at Cloudera
“A data scientist is someone who is worse at statistics than any
statistician and worse at software engineering than any software
engineer.”
Will Cukierski, Data Scientist at Kaggle
“The field of people who decide to print ‘Data Scientist’ on their
business cards and get a salary bump.”
https://www.prooffreader.com/2016/09/battle-of-data-science-venn-diagrams.html
What is Data Science?
According to David Donoho, data science activities are classified as
follows:
• Data Gathering, Preparation, and Exploration
• Data Representation and Transformation
• Computing with Data
• Data Visualization and Presentation
• Data Modeling
• Science about Data Science
David Donoho. 50 Years of Data Science, Journal of Computational and Graphical Statistics, 26:4, 745-766, (2017).
What is Data Science?
Definition:
Data Science concerns the recognition, formalization, and
exploitation of data phenomenology emerging from digital
transformation of business, society and science itself.
Definition:
Data Science holds a set of principles, problem definitions,
algorithms, and processes with the objective of extracting
nontrivial and useful patterns from large data sets.
David Donoho. Data Science: The End of Theory?
John D. Kelleher, Brendan Tierney. Data Science. The MIT Press, (2018)
What is Data Science?
Some Examples: Data Science in History
Mid-19th century:
● Devastating outbreaks of cholera had plagued London.
● Dr. John Snow prepared a dot map of the Soho district in 1854.
● The clustering pattern on the map showed that the Broad Street pump was the likely source of contamination.
● Using this evidence, the authorities disabled the pump, and cholera incidence in the neighborhood subsequently declined.
What is Data Science?
Some Examples: Data Science Today
• Precision medicine
• Data Science in Sales and Marketing
• Governments Using Data Science
• Data Science in Professional Sports
• Predicting elections
N. Silver. The Signal and the Noise. Penguin (2012).
E. A. Ashley. Towards Precision Medicine. Nature Reviews Genetics (2017).
John D. Kelleher, Brendan Tierney. Data Science. The MIT Press, (2018)
What is Data Science?
The Hype Cycle
https://en.wikipedia.org/wiki/Hype_cycle
Jeremykemp at English Wikipedia, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=10547051
What is Data Science?
Some Myths about Data Science
• Data science is an autonomous process that finds the answers
to our problems
• Every data science project requires big data and should use
deep learning
• Data science is easy to do
• Data science pays for itself in a very short time
John D. Kelleher, Brendan Tierney. Data Science. The MIT Press, (2018)
What is/are Data/Data Set?
• A datum or a piece of information is an abstraction of a real-world entity (i.e., a person, an object, or an event).
• The terms variable, feature, and attribute are used interchangeably for such an individual abstraction.
• Attribute: each entity is described by several attributes.
• A data set consists of the data concerning a collection of entities, such that each entity is described in terms of a collection of attributes.
Hint: “data set” and “analytics record” are often used as equivalent terms.
https://www.merriam-webster.com/dictionary/datum
John D. Kelleher, Brendan Tierney. Data Science. The MIT Press, (2018)
What is/are Data/Data Set?
• The standard attribute types: numeric, nominal, and ordinal.
• Numeric attributes are used to describe measurable quantities,
which are integer or real values.
• Nominal (categorical) attributes take typically values from a
finite collection or set.
• Ordinal attributes are used to rank objects, i.e., their values have a meaningful order.
John D. Kelleher, Brendan Tierney. Data Science. The MIT Press, (2018)
What is/are Data/Data Set?
• Structured data: are data that can be stored in a table or
spreadsheet, such that each instance/row in the table has the
same structure and set of attributes.
• Unstructured data: are data where each instance in the data set
may have its own internal and specific structure, such that this
so-called structure can be different for each instance.
John D. Kelleher, Brendan Tierney. Data Science. The MIT Press, (2018)
What is/are Data/Data Set?
• Raw data: when attributes are raw abstractions regarding an
object or an event
• Derived data: if we derive them from other pieces of data
• Types of raw data:
o captured data: when we perform a direct measurement or an
observation to collect the data.
o exhaust data: these are by-product that we obtain from a process,
where the primary objective is not capturing data.
Kitchin, Rob. The Data Revolution: Big Data, Open Data, Data Infrastructures, and Their Consequences. Sage, (2014).
John D. Kelleher, Brendan Tierney. Data Science. The MIT Press, (2018)
What is/are Data/Data Set?
• Metadata: this is the data that describes other data
Example:
The US National Security Agency’s (NSA) surveillance program PRISM collected large amounts of metadata about people’s phone conversations: the agency was not recording the content of the calls (i.e., there was no wiretapping), but it was collecting data about the calls, e.g., who the recipient was, the duration of the call, etc.
John D. Kelleher, Brendan Tierney. Data Science. The MIT Press, (2018)
Pomerantz, Jeffrey. Metadata. Cambridge, MA: MIT Press (2015).
Sources of Data
Typically, 80% of project time is spent on getting the data ready; only about 20% remains for the actual analysis.
Problems: unclear variable names, missing values, misspelled text, numbers stored as text (in spreadsheets), outliers, etc.

Example: Emoji
Unicode number: U+1F914
HTML code: &#129300;
CSS code: \1F914
John D. Kelleher, Brendan Tierney. Data Science. The MIT Press, (2018)
https://unicode-table.com/en/1F914/
Sources of Data
• In-house data: when the data is readily available, e.g., your company provides it.
o Advantages: it is fast to obtain and ready to use, you stay within the frame of your company, and you can cooperate with the people who collected and created the data.
o Potential disappointments: the data may be poorly gathered, poorly documented, or poorly maintained; the person who gathered it may have already left the company; and what you actually need may not be in the data, ….
Poulson Barton. Data Science Foundations: Fundamentals. (2019)
Sources of Data
Open data is data that is (1) gratis (available at no cost) and (2) free to use.
• Government, e.g.,
o The home of the U.S. Government’s open data (https://www.data.gov/)
o The global open data index (https://index.okfn.org)
• Science, e.g.,
o Nature: https://www.nature.com/sdata
o The Open Science Data Cloud (OSDC):
https://www.opensciencedatacloud.org
• Social Media, e.g., Google trends (https://trends.google.com/) and
Yahoo finance (https://finance.yahoo.com)
Sources of Data
Other sources of data (with some overlaps):
• Application Programming Interface (API)
• Data Scraping (be careful!)
• Creating Data (be careful!)
• Passive Collection of Data (be careful!)
• Generating Data
Poulson Barton. Data Science Foundations: Fundamentals. (2019)
Ecosystem of Data Science
Data Science Ecosystem: The term refers to the set of programming languages, software packages and tools, methods and algorithms, general infrastructure, etc. that an organization uses to gather, store, analyze, and get maximum advantage from data in its data science projects.
There are different sets of technologies used for the purpose of data science:
● commercial products, ● open-source tools, ● a mixture of open-source tools and commercial products.
https://online.hbs.edu/blog/post/data-ecosystem
John D. Kelleher, Brendan Tierney. Data Science. The MIT Press, (2018)
Ecosystem of Data Science
A typical data architecture for data science.
John D. Kelleher, Brendan Tierney. Data Science. The MIT Press, (2018)
Ecosystem of Data Science
• The data sources layer: e.g., online transaction processing (OLTP) systems in banking, finance, call centers, etc.
• The data-storage layer: enables data sharing, data warehousing, and data analytics across an organization; it has two parts:
o Most organizations use data-sharing software.
o Storage and analytics for big data, handled by, e.g., Hadoop.
• The applications layer: the prepared data is used to analyze the specific challenge of the data science project.
John D. Kelleher, Brendan Tierney. Data Science. The MIT Press, (2018)
Legal, Ethical, and Social Issues
• Legal Issues: for example, privacy laws:
o The EU's General Data Protection Regulation (GDPR), California
Consumer Privacy Act (CCPA), etc.
If an organization seriously violates the GDPR, the resulting fines can run into the billions.
• Ethical Issues: concerning authenticity, fairness (equality,
equity, and need), etc.
• Social Issues: public opinion should be respected and taken into
consideration.
David Martens. Data Science Ethics: Concepts, Techniques and Cautionary Tales. Oxford University Press (2022)
Poulson Barton. Data Science Foundations: Fundamentals. (2019)
Legal, Ethical, and Social Issues
https://unctad.org/page/data-protection-and-privacy-legislation-worldwide (accessed December 2021)
Legal, Ethical, and Social Issues
Data science in interaction with human and artificial intelligence:
● Recommendations: after processing the data, algorithms make recommendations, but a human decides whether to take or leave them.
● Human-in-the-Loop decisions: algorithms make and execute their own decisions, but humans are present to monitor and intervene.
● Human-Accessible decisions: algorithms make and execute decisions automatically, but the process should be accessible and interpretable.
● Machine-Centric decisions: machines communicate with one another, e.g., in the Internet of Things (IoT).
David Martens. Data Science Ethics: Concepts, Techniques and Cautionary Tales. Oxford University Press (2022)
Poulson Barton. Data Science Foundations: Fundamentals. (2019)
Standard Tasks of Data Science
Most data science projects belong to one of the following classes
of tasks:
• Clustering
• Anomaly (outlier) detection
• Association-rule mining
• Prediction
John D. Kelleher, Brendan Tierney. Data Science. The MIT Press, (2018)
Standard Tasks of Data Science
• Clustering:
Example: by identifying customers and their preferences and needs, data science can support the marketing and sales campaigns of companies via targeted marketing.
The standard data science approach: formulate the problem as a clustering task, where clustering sorts the instances of a data set into subgroups of similar instances. For this purpose, we need to know the number of subgroups (e.g., via some domain knowledge) and a range of attributes that describe customers for clustering, e.g., demographic information (age, gender, etc.), location (ZIP code), and so on.
John D. Kelleher, Brendan Tierney. Data Science. The MIT Press, (2018)
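As a hedged illustration of the data-preparation side of this task, the attribute table that feeds a clustering algorithm can often be assembled with a single query. The table and column names below are assumptions for illustration, not part of the slides:

```sql
-- Hypothetical: build one row of clustering attributes per customer.
SELECT
    c.CustomerID,
    c.Age,                            -- demographic attribute
    c.Gender,                         -- demographic attribute
    c.Zip,                            -- location attribute
    COUNT(o.OrderID)  AS NumOrders,   -- behavioral attribute
    AVG(o.OrderTotal) AS AvgSpending  -- behavioral attribute
FROM Customers AS c
LEFT JOIN Orders AS o ON o.CustomerID = c.CustomerID
GROUP BY c.CustomerID, c.Age, c.Gender, c.Zip;
```

The resulting table would then be handed to a clustering algorithm such as k-means.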
Standard Tasks of Data Science
• Anomaly (outlier) detection:
In anomaly detection (or outlier analysis), we search for and identify instances that do not conform to the typical instances in a data set, e.g., fraudulent activities such as fraudulent credit card transactions.
A typical approach:
1. Using domain expertise, define some rules.
2. Then use, e.g., SQL (or another language) to check business databases or the data warehouse.
Another approach: training a prediction model to classify instances as anomalous versus normal.
John D. Kelleher, Brendan Tierney. Data Science. The MIT Press, (2018)
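A minimal sketch of the rule-based approach described above; the table, columns, and thresholds are assumptions chosen for illustration:

```sql
-- Hypothetical rules defined with domain expertise:
-- flag transactions that are unusually large or occur at unusual hours.
SELECT TransactionID, CustomerID, Amount, TransactionTime
FROM Transactions
WHERE Amount > 10000                                   -- rule 1: very large amount
   OR DATEPART(HOUR, TransactionTime) BETWEEN 2 AND 4  -- rule 2: unusual time of day
ORDER BY Amount DESC;
```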
Standard Tasks of Data Science
• Association-rule mining:
For example, data science can be used in cross-selling, where the vendor suggests other similar, related, or even complementary products to customers who are currently buying some products.
For the purpose of cross-selling, we need to identify associations between products. This can be done with unsupervised data-analysis techniques.
John D. Kelleher, Brendan Tierney. Data Science. The MIT Press, (2018)
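A simple co-occurrence count in the spirit of association-rule mining; the OrderItems table and its columns are assumed for illustration:

```sql
-- Hypothetical: count how often two products appear in the same order.
SELECT a.ProductName AS ProductA,
       b.ProductName AS ProductB,
       COUNT(*)      AS TimesBoughtTogether
FROM OrderItems AS a
JOIN OrderItems AS b
  ON a.OrderID = b.OrderID
 AND a.ProductName < b.ProductName   -- avoid self-pairs and duplicate pairs
GROUP BY a.ProductName, b.ProductName
ORDER BY TimesBoughtTogether DESC;
```

Frequently co-occurring pairs are candidates for cross-selling suggestions.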
Standard Tasks of Data Science
• Prediction (classification):
Example: In customer relationship management (CRM), a typical task consists in propensity modeling, i.e., estimating the likelihood that a customer will make a certain decision, e.g., leaving a service.
Customer churn: when customers leave one service, e.g., a cell phone company, to join another one.
By training classification models, the data science task is to help detect (predict) churn, i.e., to classify a customer as a churn risk or not.
John D. Kelleher, Brendan Tierney. Data Science. The MIT Press, (2018)
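Before such a model can be trained, each customer needs a churn label. One common heuristic, sketched here with assumed table and column names and an assumed 90-day threshold, is to label customers with no recent activity as churned:

```sql
-- Hypothetical: derive a churn label to use as the training target.
SELECT CustomerID,
       CASE WHEN DATEDIFF(DAY, LastActivityDate, GETDATE()) > 90
            THEN 1 ELSE 0
       END AS Churned   -- 1 = churned / churn risk, 0 = active
FROM Customers;
```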
Standard Tasks of Data Science
• Prediction (regression):
Example: Price prediction, which consists in estimating the price of a product at some point in the future.
• A typical approach: using regression, because price prediction consists in estimating the value of a continuous attribute.
John D. Kelleher, Brendan Tierney. Data Science. The MIT Press, (2018)
Prof. Dr. Oliver Wendt
Dr. habil. Mahdi Moeini
Business Information Systems & Operations Research
Basics of Data Science
Summer Semester 2022
Part 1, Section 2 → Relational Database Models: Modeling and Querying Structured Attributes of Objects

Part 1: Organizing the “Data Lake” (from data mining to data fishing)
• Relational Database Models: Modeling and Querying Structured Attributes of Objects
• Graph- and Network-based Data Models: Modeling and Querying Structured Relations of Objects
• Information Retrieval: Document Mining and Querying of ill-structured Data
• Streaming Data and High Frequency Distributed Sensor Data
• The Semantic Web: Ontologist’s Dream (or nightmare?) of how to integrate evolving heterogeneous data lakes
Relational Database Models: Modeling and Querying Structured Attributes of Objects
Summary:
In this section, we will see:
• Introduction
• The Relational Model
• Relational Database Management System (RDBMS)
• How to Design a Relational Database
• Database Normalization
• Other Types of Databases
Introduction
• Data Storage Problem
• One-dimensional array
https://www.weforum.org/agenda/2015/02/a-brief-history-of-big-data-everyone-should-read/
Using a List to Organize Data
Holidays Pictures:
Alexanderplatz.jpg
Brandenburg Gate.jpg
Eiffel Tower.jpg
London Eye.jpg
Louvre.jpg
Images from Internet.
Organize Data in a Table:

| Country | City   | Picture              | Date     | Person |
| Germany | Berlin | Brandenburg Gate.jpg | 1.7.2021 | Joshua |
| Germany | Berlin | Alexanderplatz.jpg   | 1.7.2021 | Hans   |
| England | London | London Eye.jpg       | 1.9.2021 | Hans   |
| France  | Paris  | Eiffel Tower.jpg     | 1.8.2021 | Joshua |
| France  | Paris  | Louvre.jpg           | 1.8.2021 | Hans   |
Pros and Cons of a Table Structure:
Pros:
• It is easy to add attributes with additional columns
• It is easy to add new records with additional rows
Cons:
• We have to store repetitive information
• It is not easy to accommodate special circumstances
Adam Wilbert. Relational Databases: Essential Training. (2019)
Relational Databases:

Pictures
| Picture# | FileName             | Location | Date     |
| 001      | Brandenburg Gate.jpg | 1        | 1.7.2021 |
| 002      | Alexanderplatz.jpg   | 1        | 1.7.2021 |
| 003      | Eiffel Tower.jpg     | 2        | 1.8.2021 |
| 004      | Louvre.jpg           | 2        | 1.8.2021 |
| 005      | London Eye.jpg       | 3        | 1.9.2021 |

People
| Picture# | Person |
| 001      | Hans   |
| 002      | Joshua |
| 003      | Joshua |
| 004      | Hans   |
| 005      | Hans   |
| 005      | Joshua |
| 005      | Sarah  |

Locations
| Location | City   | Country |
| 1        | Berlin | Germany |
| 2        | Paris  | France  |
| 3        | London | England |
The Relational Model
The Relational Model:
• It was originally developed by computer scientist E. F. Codd in his paper “A Relational Model of Data for Large Shared Data Banks”, published in 1970
• Key points:
o The retrieval of information is separated from its storage
o Using a set of rules, data is organized across multiple tables that are related to each other
Ramon A. Mata-Toledo and Pauline K. Cushman. Fundamentals of Relational Databases. McGraw-Hill (2000)
Andreas Meier and Michael Kaufmann. SQL & NoSQL Databases. Springer (2019)
Adam Wilbert. Relational Databases: Essential Training. (2019)
The Pictures Database:

Pictures
| Picture# | FileName             | Location | Date     |
| 001      | Brandenburg Gate.jpg | 1        | 1.7.2021 |
| 002      | Alexanderplatz.jpg   | 1        | 1.7.2021 |
| 003      | Eiffel Tower.jpg     | 2        | 1.8.2021 |
| 004      | Louvre.jpg           | 2        | 1.8.2021 |
| 005      | London Eye.jpg       | 3        | 1.9.2021 |
| 006      | River Thames.jpg     | 3        | 1.9.2021 |

People
| Picture# | Person |
| 001      | Hans   |
| 002      | Joshua |
| 003      | Joshua |
| 004      | Hans   |
| 005      | Hans   |
| 005      | Joshua |
| 005      | Sarah  |

Locations
| Location | City   | Country |
| 1        | Berlin | Germany |
| 2        | Paris  | France  |
| 3        | London | England |
Relational Database Management System (RDBMS)
Some RDBMS products and vendors
• Microsoft SQL Server (SQL: Structured Query Language)
• PostgreSQL
• Azure SQL Database
• IBM Db2
• Oracle
• MySQL
• SQLite
https://realpython.com/python-sql-libraries/
Adam Wilbert. Relational Databases: Essential Training. (2019)
RDBMS Tasks
• Creating and modifying the structure of the data
• Defining names for tables and columns
• Creating key columns and building relationships
• Manipulating records and performing CRUD operations
o Create new records of data
o Read data that exists
o Update values of data
o Delete records from the database
I. Robinson, J. Webber, and E. Eifrem. Graph Databases New Opportunities for Connected Data. O'Reilly (2015)
Adam Wilbert. Relational Databases: Essential Training. (2019)
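The four CRUD operations map directly onto SQL statements; a minimal sketch against an assumed People table:

```sql
-- Create: insert a new record
INSERT INTO People (PersonID, FirstName, LastName) VALUES (4, 'Laura', 'Becker');

-- Read: retrieve data that exists
SELECT FirstName, LastName FROM People WHERE PersonID = 4;

-- Update: change values of existing data
UPDATE People SET LastName = 'Schmidt' WHERE PersonID = 4;

-- Delete: remove records from the database
DELETE FROM People WHERE PersonID = 4;
```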
Further RDBMS Tasks
• Performing regular backups
• Maintaining copies of the database
• Controlling access permissions
• Creating reports including visualizations
• Creating forms
How to interact with an RDBMS?
• Part 1: Graphical interface
• Part 2: Coding with SQL (Structured Query Language)
Adam Wilbert. Relational Databases: Essential Training. (2019)
Database Components (formal term and alternative term)
• Relations (tables)
• Domains (columns)
• Tuples (records)
Andreas Meier and Michael Kaufmann. SQL & NoSQL Databases. Springer (2019)
Adam Wilbert. Relational Databases: Essential Training. (2019)
Data Table

| Person Name | Favorite Color | Eye Color |

Columns are also called fields or attributes; rows are also called records.
How to Design a Relational Database
How to Design a Relational Database
• Find out which information should be stored
• Pay attention to what you want to extract or get out of the database
• Create tables that group related information
• Hint: imagine tables as “nouns” and columns as “adjectives” of the nouns
Ramon A. Mata-Toledo and Pauline K. Cushman. Fundamentals of Relational Databases. McGraw-Hill (2000)
Andreas Meier and Michael Kaufmann. SQL & NoSQL Databases. Springer (2019)
Adam Wilbert. Relational Databases: Essential Training. (2019)
Example: Entity Relationship Diagram

Customers: CustomerID, FirstName, LastName, StreetAddress, City, State, Zip
Orders: OrderID, CustomerID, ProductName, Quantity
Adam Wilbert. Relational Databases: Essential Training. (2019)
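A sketch of how this one-to-many design could be declared in MS SQL Server; the data types are assumptions, since the diagram does not specify them:

```sql
CREATE TABLE Customers (
    CustomerID    int NOT NULL PRIMARY KEY,
    FirstName     varchar(50),
    LastName      varchar(50),
    StreetAddress varchar(100),
    City          varchar(50),
    State         char(3),
    Zip           char(5)
);

CREATE TABLE Orders (
    OrderID     int NOT NULL PRIMARY KEY,
    CustomerID  int NOT NULL
        REFERENCES Customers (CustomerID),  -- foreign key: the "many" side
    ProductName varchar(100),
    Quantity    int
);
```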
Example: Entity Relationship Diagram
One-to-Many Relationship: one Customer has many Orders (1 : N)

Customers: CustomerID, FirstName, LastName, StreetAddress, City, State, Zip
Orders: OrderID, CustomerID, ProductName, Quantity
Example: Entity Relationship Diagram
One-to-Many Relationship in Crow’s Foot Notation: the “crow’s foot” symbol marks the many (N) side of the relationship between Customers and Orders.

Customers: CustomerID, FirstName, LastName, StreetAddress, City, State, Zip
Orders: OrderID, CustomerID, ProductName, Quantity
Gordon C. Everest, "Basic Data Structure Models Explained With A Common Example" Computing Systems, Proceedings Fifth Texas
Conference on Computing Systems, Austin, TX, 1976 October 18-19, pages 39-46.
Example: Entity Relationship Diagram

Products: ProductName, PartNumber, Size, Color, Price, Supplier, QuantityInStock (each column initially shown with the placeholder type int)
Suppliers: SupplierName, PhoneNumber, StreetAddress, City, State, Zip (each column initially shown with the placeholder type int)
Allen G. Taylor. SQL For Dummies, 9th Edition (2019)
Adam Wilbert. Relational Databases: Essential Training. (2019)
Database Diagram: Data Types

| Column    | Data Type |
| Order     | Number    |
| Name      | Text      |
| BirthDate | Date      |
| Salary    | Currency  |

Benefits of Data Types
• Efficient storage
• Data consistency and improved quality
• Improved performance

Ramon A. Mata-Toledo and Pauline K. Cushman. Fundamentals of Relational Databases. McGraw-Hill (2000)
Andreas Meier and Michael Kaufmann. SQL & NoSQL Databases. Springer (2019)
Adam Wilbert. Relational Databases: Essential Training. (2019)
Data Type Categories
• Character or text: char(5), nchar(20), varchar(100),
• Numerical data: tinyint, int, decimal/float
• Currency, times, dates,
• Other data types: geographic coordinates, binary files, etc.
Ramon A. Mata-Toledo and Pauline K. Cushman. Fundamentals of Relational Databases. McGraw-Hill (2000)
Andreas Meier and Michael Kaufmann. SQL & NoSQL Databases. Springer (2019)
Allen G. Taylor. SQL For Dummies, 9th Edition (2019)
Adam Wilbert. Relational Databases: Essential Training. (2019)
Example: Data Types

Products
| Column          | Data Type    |
| ProductName     | varchar(100) |
| PartNumber      | int          |
| Size            | varchar(20)  |
| Color           | varchar(20)  |
| Price           | decimal      |
| Supplier        | varchar(100) |
| QuantityInStock | int          |

Suppliers
| Column        | Data Type    |
| SupplierID    | int          |
| SupplierName  | varchar(100) |
| PhoneNumber   | char(15)     |
| StreetAddress | varchar(100) |
| City          | varchar(50)  |
| State         | char(3)      |
| Zip           | char(5)      |
Allen G. Taylor. SQL For Dummies, 9th Edition (2019)
Adam Wilbert. Relational Databases: Essential Training. (2019)
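In MS SQL Server, these type choices could be declared as follows (a sketch; keys and constraints are added on the following slides):

```sql
CREATE TABLE Suppliers (
    SupplierID    int,
    SupplierName  varchar(100),
    PhoneNumber   char(15),
    StreetAddress varchar(100),
    City          varchar(50),
    State         char(3),
    Zip           char(5)
);
```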
Primary Key
• Definition: A primary key is the column or columns that contain values
that are used to uniquely identify each row in a table.
• Guaranteed to be unique (forever)
Some ways to define primary keys:
o Natural keys: there may already be unique identifiers in your data
o Composite keys: concatenation of multiple columns
o Surrogate keys: created just for the database (product id, …).
https://www.ibm.com/docs/en/iodg/11.3?topic=reference-primary-keys
Adam Wilbert. Relational Databases: Essential Training. (2019)
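A surrogate key is often generated by the database itself; in MS SQL Server this can be sketched with an IDENTITY column (illustrative names):

```sql
CREATE TABLE Products (
    ProductID   int IDENTITY(1,1) PRIMARY KEY,  -- surrogate key, auto-generated
    ProductName varchar(100)
);
```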
Example: Determining a Primary Key

No column uniquely identifies a row (first names and colors repeat):
| FirstName | LastName | FavoriteColor |
| Hans      | Kurz     | Green         |
| Hans      | Long     | Yellow        |
| Joshua    | Müller   | Green         |

Adding a surrogate key:
| PersonalID (PK) | FirstName | LastName | FavoriteColor |
| 001             | Hans      | Kurz     | Green         |
| 002             | Hans      | Long     | Yellow        |
| 003             | Joshua    | Müller   | Green         |
Some Notes on Naming Tables and Columns:
• Consistency
• Capitalization
• Spaces
• Avoid using reserved words
Command keywords should not be used in your defined names
• Avoid acronyms
Use full and legible terms instead
Adam Wilbert. Relational Databases: Essential Training. (2019)
Integrity and Validity
Data Integrity ensures that information is identical to its source and has not been accidentally or maliciously altered, modified, or destroyed.
Validation is about the evaluations used to determine compliance with security requirements.
• Data Validation: invalid data is not allowed into the database, and the user receives an error message for inappropriate inputs/values.
• Unique Values: when a value can appear only once, e.g., a primary key.
• Business Rules: take into account your organization’s constraints.
https://oa.mo.gov/sites/default/files/CC-DataIntegrityandValidation.pdf
Adam Wilbert. Relational Databases: Essential Training. (2019)
Example: Unique Constraint

Products: ProductName varchar(100) UNIQUE, PartNumber int, Size varchar(20), Color varchar(20), Price decimal, Supplier varchar(100), QuantityInStock int
Suppliers: SupplierID int, SupplierName varchar(100), PhoneNumber char(15), StreetAddress varchar(100), City varchar(50), State char(3), Zip char(5)

In creating the table [Products], you should add the following line:
• In MS SQL Server:
CONSTRAINT [UK_Products_ProductName] UNIQUE ([ProductName]),
Adam Wilbert. Relational Databases: Essential Training. (2019)
• NULL values: indicate that data is not known, is not specified, or is not applicable.
• NOT NULL values: indicate a required column.
Example: Birthdate of customers or employees?

People
| FirstName | Birthdate      |
| Joshua    | April 25, 1990 |
| Albert    | (NULL)         |
| Valentin  | July 14, 2000  |
Adam Wilbert. Relational Databases: Essential Training. (2019)
Indexes
• Indexes are added to any column which is used in frequent searches.
o Clustered indexes: primary keys
o Non-clustered indexes: all other indexes
• Issues with adding too many indexes
o They reduce the speed of adding new records
o Note: you can still search on non-indexed fields
Andreas Meier and Michael Kaufmann. SQL & NoSQL Databases. Springer (2019)
Allen G. Taylor. SQL For Dummies, 9th Edition (2019)
Adam Wilbert. Relational Databases: Essential Training. (2019)
Example: Creating an Index

Products: ProductName varchar(100) UNIQUE, PartNumber int, Size varchar(20), Color varchar(20), Price decimal, Supplier varchar(100), QuantityInStock int
Suppliers: SupplierID int, SupplierName varchar(100), PhoneNumber char(15), StreetAddress varchar(100), City varchar(50), State char(3), Zip char(5)

• In MS SQL Server:
CREATE INDEX [idx_Suppliers_SupplierName] ON [Suppliers] ([SupplierName])
Adam Wilbert. Relational Databases: Essential Training. (2019)
Check (Integrity) Constraints
These constraints are built directly into the design of the table; entered data is then checked and validated before being saved to the table.
• Numerical checks: e.g., restrict values to an acceptable range.
• Character checks: e.g., limit the set of possible values.
Andreas Meier and Michael Kaufmann. SQL & NoSQL Databases. Springer (2019)
Allen G. Taylor. SQL For Dummies, 9th Edition (2019)
Adam Wilbert. Relational Databases: Essential Training. (2019)
Example: Check Constraint

Products: ProductName varchar(100) UNIQUE, PartNumber int, Size varchar(20), Color varchar(20), Price decimal, Supplier varchar(100), QuantityInStock int
Suppliers: SupplierID int, SupplierName varchar(100), PhoneNumber char(15), StreetAddress varchar(100), City varchar(50), State char(3), Zip char(5)

• In MS SQL Server:
[State] char(3) NOT NULL
CONSTRAINT CHK_State CHECK (State = 'RLP' OR State = 'NRW'),
Adam Wilbert. Relational Databases: Essential Training. (2019)
Relationships
The Pictures Database (the Pictures, People, and Locations tables introduced above).
Example: Creating Relationships and Primary Keys (PK)

Pictures: Picture# (PK), FileName, Location, Date
People: Picture# (PK), Person (PK)
Locations: Location (PK), City, Country
Adam Wilbert. Relational Databases: Essential Training. (2019)
Example: Creating Relationships and Foreign Keys (FK)

Pictures: Picture# (PK), FileName, Location (FK), Date
People: Picture# (PK, FK), Person (PK)
Locations: Location (PK), City, Country
Adam Wilbert. Relational Databases: Essential Training. (2019)
Some Notes:
• Generally, relationships are created on the foreign key (FK)
• The FK and PK columns must have the same data types
• The FK and PK columns may have the same or different names
Ramon A. Mata-Toledo and Pauline K. Cushman. Fundamentals of Relational Databases. McGraw-Hill (2000)
Andreas Meier and Michael Kaufmann. SQL & NoSQL Databases. Springer (2019)
Allen G. Taylor. SQL For Dummies, 9th Edition (2019)
Adam Wilbert. Relational Databases: Essential Training. (2019)
Exercise: Create a relationship between the following tables.
Hint: First, you need to change one of the columns in the Products table.

Products: ProductName varchar(100) UNIQUE, PartNumber int, Size varchar(20), Color varchar(20), Price decimal, Supplier varchar(100), QuantityInStock int
Suppliers: SupplierID int, SupplierName varchar(100), PhoneNumber char(15), StreetAddress varchar(100), City varchar(50), State char(3), Zip char(5)
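One possible solution sketch, following the hint: the text column Supplier is replaced by a SupplierID foreign key matching Suppliers.SupplierID (which is assumed here to be the primary key of Suppliers):

```sql
-- Replace the varchar Supplier column with an int key column ...
ALTER TABLE Products DROP COLUMN Supplier;
ALTER TABLE Products ADD SupplierID int;

-- ... and declare the relationship on the foreign key.
ALTER TABLE Products
    ADD CONSTRAINT FK_Products_Suppliers
    FOREIGN KEY (SupplierID) REFERENCES Suppliers (SupplierID);
```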
Optionality and Cardinality

• Optionality: the minimum number of related records, usually 0 or 1.
• Cardinality: the maximum number of related records, usually 1 or many (N).

Examples:
• If a course must have a responsible person, optionality = 1.
• If a course might have a responsible person, optionality = 0.
• If a course can have only one responsible person, cardinality = 1.
• If a course can have several responsible persons, cardinality = N.

1 .. N denotes the range from optionality = 1 (must) to cardinality = N (unspecified maximum).
Example: Database Diagram and Optionality-Cardinality

Pictures: Picture# (PK), FileName, Location (FK), Date
People: Picture# (PK, FK), Person (PK)
Locations: Location (PK), City, Country

• Pictures to People: a picture appears in 0 .. N People rows; each People row refers to exactly 1 .. 1 picture.
• Locations to Pictures: a location has 0 .. N pictures; each picture has exactly 1 .. 1 location.

Andreas Meier and Michael Kaufmann. SQL & NoSQL Databases. Springer (2019)
Allen G. Taylor. SQL For Dummies, 9th Edition (2019)
Adam Wilbert. Relational Databases: Essential Training. (2019)
Mapping Optionality and Cardinality to Constraints

Optionality: 1 = NOT NULL, 0 = NULL allowed
Cardinality: 1 = UNIQUE constraint, N = no constraint

| Optionality .. Cardinality | Constraints           |
| 1 .. 1                     | NOT NULL + UNIQUE     |
| 0 .. N                     | NULL + not unique     |
| 1 .. N                     | NOT NULL + not unique |
| 0 .. 1                     | NULL + UNIQUE         |
Andreas Meier and Michael Kaufmann. SQL & NoSQL Databases. Springer (2019)
Allen G. Taylor. SQL For Dummies, 9th Edition (2019)
Adam Wilbert. Relational Databases: Essential Training. (2019)
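Translated into SQL, a 1 .. 1 relationship end on a foreign key column combines both constraints; a sketch using the employee example that appears later in this section:

```sql
CREATE TABLE HumanResources (
    -- optionality 1 -> NOT NULL; cardinality 1 -> UNIQUE
    EmployeeID int NOT NULL UNIQUE REFERENCES Employees (EmployeeID),
    Salary     decimal(10,2),
    JobRating  int
);
```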
One-to-Many Relationships
Example: Library Database

LibraryUsers
| CardNumber | UserName  |
| 50001      | Valentin  |
| 50002      | Laura     |
| 50003      | Christian |

BookLoans
| CardNumber | BookName              | CheckoutDate |
| 50001      | Operations Research   | 15.10.2021   |
| 50001      | SQL & NoSQL Databases | 01.11.2021   |
| 50001      | SQL For Dummies       | 01.12.2021   |
Ramon A. Mata-Toledo and Pauline K. Cushman. Fundamentals of Relational Databases. McGraw-Hill (2000)
Andreas Meier and Michael Kaufmann. SQL & NoSQL Databases. Springer (2019)
Allen G. Taylor. SQL For Dummies, 9th Edition (2019)
Adam Wilbert. Relational Databases: Essential Training. (2019)
One-to-One Relationships
Example: Employee Database

Employees: EmployeeID (PK), FirstName, LastName, Position, OfficeNumber
HumanResources: EmployeeID (PK), Salary, JobRating

Employees (1 .. 1) to (1 .. 1) HumanResources
Ramon A. Mata-Toledo and Pauline K. Cushman. Fundamentals of Relational Databases. McGraw-Hill (2000)
Andreas Meier and Michael Kaufmann. SQL & NoSQL Databases. Springer (2019)
Allen G. Taylor. SQL For Dummies, 9th Edition (2019)
Adam Wilbert. Relational Databases: Essential Training. (2019)
Many-to-Many Relationships
Example: Class Schedule Database

Students: StudentID (PK), StudentName
Courses: CourseID (PK), CourseName, RoomName

Students (0 .. N) to (0 .. N) Courses
Ramon A. Mata-Toledo and Pauline K. Cushman. Fundamentals of Relational Databases. McGraw-Hill (2000)
Andreas Meier and Michael Kaufmann. SQL & NoSQL Databases. Springer (2019)
Allen G. Taylor. SQL For Dummies, 9th Edition (2019)
Adam Wilbert. Relational Databases: Essential Training. (2019)
Example: Class Schedule Database with a junction table

StudentCourses: CourseID (PK, FK), StudentID (PK, FK), Grade
Students: StudentID (PK), StudentName
Courses: CourseID (PK), CourseName, RoomName

Students (1 .. 1) to (0 .. N) StudentCourses; Courses (1 .. 1) to (0 .. N) StudentCourses
Ramon A. Mata-Toledo and Pauline K. Cushman. Fundamentals of Relational Databases. McGraw-Hill (2000)
Andreas Meier and Michael Kaufmann. SQL & NoSQL Databases. Springer (2019)
Allen G. Taylor. SQL For Dummies, 9th Edition (2019)
Adam Wilbert. Relational Databases: Essential Training. (2019)
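The junction table resolves the many-to-many relationship into two one-to-many relationships; a sketch of its declaration (data types assumed):

```sql
CREATE TABLE StudentCourses (
    CourseID  int NOT NULL REFERENCES Courses (CourseID),
    StudentID int NOT NULL REFERENCES Students (StudentID),
    Grade     decimal(3,1),
    -- composite primary key: each student can take each course only once
    CONSTRAINT PK_StudentCourses PRIMARY KEY (CourseID, StudentID)
);
```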
Self Joins (Recursive Relationships)
• Other names: recursive relationship, self-referencing relationship
• Relationship rules and types: the same rules and types as between two tables

Employees Table
| EmployeeID (PK) | Name   | SupervisorID |
| 1008            | Maxim  | (NULL)       |
| 1009            | Joshua | 1008         |
| 1020            | Sven   | 1008         |
| 1021            | Sarah  | 1020         |

Ramon A. Mata-Toledo and Pauline K. Cushman. Fundamentals of Relational Databases. McGraw-Hill (2000)
Andreas Meier and Michael Kaufmann. SQL & NoSQL Databases. Springer (2019)
Adam Wilbert. Relational Databases: Essential Training. (2019)
Employee Organizational Chart: Maxim supervises Joshua and Sven; Sven supervises Sarah.

Self Join Diagram
Employees: EmployeeID (PK), Name, SupervisorID (FK referencing Employees.EmployeeID)
Check constraint: SupervisorID ≠ EmployeeID
Adam Wilbert. Relational Databases: Essential Training. (2019)
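A typical self-join query on this table, listing each employee together with the name of their supervisor (a sketch):

```sql
SELECT e.Name AS Employee,
       s.Name AS Supervisor
FROM Employees AS e
LEFT JOIN Employees AS s               -- the table joined to itself
       ON e.SupervisorID = s.EmployeeID;
```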
Cascade Updates and Deletes:
Example:

Pictures
| Picture# | FileName             | Location | Date     |
| 001      | Brandenburg Gate.jpg | 1        | 1.7.2021 |
| 002      | Alexanderplatz.jpg   | 1        | 1.7.2021 |
| 003      | Eiffel Tower.jpg     | 2        | 1.8.2021 |
| 004      | Louvre.jpg           | 2        | 1.8.2021 |
| 005      | London Eye.jpg       | 3        | 1.9.2021 |

Locations
| Location | City   | Country |
| 1        | Berlin | Germany |
| 2        | Paris  | France  |
| 3        | London | England |

If a key value in Locations is changed, e.g., from 2 to 7, a cascade update propagates the new value to every matching Location entry in Pictures, so the references stay consistent; likewise, a cascade delete removes the dependent Pictures rows when their Locations row is deleted.
How to implement cascade changes
• Note that cascade update and delete don’t concern insertion of new data
• If you choose to switch off the cascade functionality, then you can still
protect data integrity, e.g., from accidental changes
• How to activate: depends on the platform that you use
• In SQL: ON UPDATE CASCADE and ON DELETE CASCADE
Ramon A. Mata-Toledo and Pauline K. Cushman. Fundamentals of Relational Databases. McGraw-Hill (2000)
Andreas Meier and Michael Kaufmann. SQL & NoSQL Databases. Springer (2019)
Allen G. Taylor. SQL For Dummies, 9th Edition (2019)
Adam Wilbert. Relational Databases: Essential Training. (2019)
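A sketch of a foreign key declared with cascading behavior, using the tables of the running example:

```sql
ALTER TABLE Pictures
    ADD CONSTRAINT FK_Pictures_Locations
    FOREIGN KEY (Location) REFERENCES Locations (Location)
    ON UPDATE CASCADE   -- a changed Locations key propagates to Pictures
    ON DELETE CASCADE;  -- deleting a location removes its dependent pictures
```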
Database Normalization
Database Normalization
• Normalization consists of a set of rules that describe the proper design of a database.
• The rules for table structure are called “normal forms” (NFs).
• There are first, second, and third normal forms, which should be satisfied in order.
• A database has a good design if it satisfies “third normal form” (3NF).
First Normal Form (1NF)
• It requires that all fields of a table contain a single piece of data
Ramon A. Mata-Toledo and Pauline K. Cushman. Fundamentals of Relational Databases. McGraw-Hill (2000)
Allen G. Taylor. SQL For Dummies, 9th Edition (2019)
Adam Wilbert. Relational Databases: Essential Training. (2019)
Example 1: 1NF not satisfied

| Picture# | FileName             | Person              |
| 001      | Brandenburg Gate.jpg | Hans                |
| 002      | Alexanderplatz.jpg   | Joshua              |
| 003      | Eiffel Tower.jpg     | Joshua              |
| 004      | Louvre.jpg           | Hans                |
| 005      | London Eye.jpg       | Hans, Joshua, Sarah |
| 006      | River Thames.jpg     |                     |

A bad solution to the issue:

| Picture# | FileName             | Person1 | Person2 | Person3 |
| 001      | Brandenburg Gate.jpg | Hans    |         |         |
| 002      | Alexanderplatz.jpg   | Joshua  |         |         |
| 003      | Eiffel Tower.jpg     | Joshua  |         |         |
| 004      | Louvre.jpg           | Hans    |         |         |
| 005      | London Eye.jpg       | Hans    | Joshua  | Sarah   |
| 006      | River Thames.jpg     |         |         |         |
Example 1: satisfying 1NF

Pictures
| Picture# | FileName             | Location | Date     |
| 001      | Brandenburg Gate.jpg | 1        | 1.7.2021 |
| 002      | Alexanderplatz.jpg   | 1        | 1.7.2021 |
| 003      | Eiffel Tower.jpg     | 2        | 1.8.2021 |
| 004      | Louvre.jpg           | 2        | 1.8.2021 |
| 005      | London Eye.jpg       | 3        | 1.9.2021 |

People
| Picture# | Person |
| 001      | Hans   |
| 002      | Joshua |
| 003      | Joshua |
| 004      | Hans   |
| 005      | Hans   |
| 005      | Joshua |
| 005      | Sarah  |
Ramon A. Mata-Toledo and Pauline K. Cushman. Fundamentals of Relational Databases. McGraw-Hill (2000)
Allen G. Taylor. SQL For Dummies, 9th Edition (2019)
Adam Wilbert. Relational Databases: Essential Training. (2019)
Example 2: satisfying 1NF

Address: Gottlieb Daimler Str. 42, Kaiserslautern, RLP 67663

| Street                | Building | City           | State | PostalCode |
| Gottlieb Daimler Str. | 42       | Kaiserslautern | RLP   | 67663      |
Ramon A. Mata-Toledo and Pauline K. Cushman. Fundamentals of Relational Databases. McGraw-Hill (2000)
Allen G. Taylor. SQL For Dummies, 9th Edition (2019)
Adam Wilbert. Relational Databases: Essential Training. (2019)
Second Normal Form (2NF)
• A table fulfills 2NF if all of the fields in the primary key are required to determine the other fields, i.e., the non-key fields.

People (2NF satisfied)
| Picture# (PK) | Person (PK) |
| 001           | Hans        |
| 002           | Joshua      |
| 003           | Joshua      |
| 004           | Hans        |
| 005           | Hans        |
| 005           | Joshua      |
| 005           | Sarah       |

People (2NF violated: LastName is determined by Person alone, not by the full key)
| Picture# (PK) | Person (PK) | LastName |
| 001           | Hans        | Schmidt  |
| 002           | Joshua      | Schmidt  |
| 003           | Joshua      | Schmidt  |
| 004           | Hans        | Schmidt  |
| 005           | Hans        | Schmidt  |
| 005           | Joshua      | Schmidt  |
| 005           | Sarah       | Woods    |
Example: satisfying 2NF

People
| Picture# (PK) | Person (PK) |
| 001           | 1           |
| 002           | 2           |
| 003           | 2           |
| 004           | 1           |
| 005           | 1           |
| 005           | 2           |
| 005           | 3           |

Person
| Person (PK) | FirstName | LastName |
| 1           | Hans      | Schmidt  |
| 2           | Joshua    | Schmidt  |
| 3           | Sarah     | Woods    |
Third Normal Form (3NF)
• A table fulfills 3NF if all of the non-key fields are independent from any other non-key field.

Example: 3NF is violated (the non-key field StateName is determined by the non-key field StateAbbv)

Person
| Picture# (PK) | FirstName | LastName | StateAbbv | StateName           |
| 1             | Hans      | Schmidt  | RLP       | Rheinland-Pfalz     |
| 2             | Joshua    | Schmidt  | BWG       | Baden-Württemberg   |
| 3             | Sarah     | Woods    | NRW       | Nordrhein-Westfalen |
Ramon A. Mata-Toledo and Pauline K. Cushman. Fundamentals of Relational Databases. McGraw-Hill (2000)
Allen G. Taylor. SQL For Dummies, 9th Edition (2019)
Adam Wilbert. Relational Databases: Essential Training. (2019)
Example: satisfying 3NF

Person
| Picture# (PK) | FirstName | LastName | StateAbbv |
| 1             | Hans      | Schmidt  | RLP       |
| 2             | Joshua    | Schmidt  | BWG       |
| 3             | Sarah     | Woods    | NRW       |

State
| StateAbbv (PK) | StateName           |
| RLP            | Rheinland-Pfalz     |
| BWG            | Baden-Württemberg   |
| NRW            | Nordrhein-Westfalen |
Denormalization
• The objective of the normalization process is to remove redundant information from the database and make the database work properly.
• In contrast, the aim of denormalization is to introduce redundancy, deliberately violating one of the normal forms when the database designer has a good reason to do so, e.g., to increase performance in some application contexts.
Allen G. Taylor. SQL For Dummies, 9th Edition (2019)
Adam Wilbert. Relational Databases: Essential Training. (2019)
Example: denormalizing a Table
Normalized tables:

Person
| Picture# (PK) | FirstName | LastName | StateAbbv |
| 1             | Hans      | Schmidt  | RLP       |
| 2             | Joshua    | Schmidt  | BWG       |
| 3             | Sarah     | Woods    | NRW       |

State
| StateAbbv (PK) | StateName           |
| RLP            | Rheinland-Pfalz     |
| BWG            | Baden-Württemberg   |
| NRW            | Nordrhein-Westfalen |
Example: denormalizing a Table
The denormalized result:

| Picture# (PK) | FirstName | LastName | StateAbbv | StateName           |
| 1             | Hans      | Schmidt  | RLP       | Rheinland-Pfalz     |
| 2             | Joshua    | Schmidt  | BWG       | Baden-Württemberg   |
| 3             | Sarah     | Woods    | NRW       | Nordrhein-Westfalen |
Other Types of Databases
Graph Databases (GD)
• Nodes and edges are used to store information.
• In a GD, each node can be connected to (have relationships with) any other node.
• In a GD, nodes can be used to represent different kinds and types of information.
• Application area: typically used for modeling, representing, and studying social networks.
Andreas Meier and Michael Kaufmann. SQL & NoSQL Databases. Springer (2019)
Allen G. Taylor. SQL For Dummies, 9th Edition (2019)
Adam Wilbert. Relational Databases: Essential Training. (2019)
Example: a small social graph with relationship types such as “Married to”, “Friends with”, “Parent of”, and “Has played”, where nodes carry properties such as:
● Birthplace
● Age
● Gender
● Job title
● Salary
Document Databases (DD)
• Document databases are used to store documents, where each document represents a single object
• Document databases support files in different formats
• In document databases, a variety of operations can be performed on documents, e.g., reading, classifying, ….
Andreas Meier and Michael Kaufmann. SQL & NoSQL Databases. Springer (2019)
Allen G. Taylor. SQL For Dummies, 9th Edition (2019)
Adam Wilbert. Relational Databases: Essential Training. (2019)
NoSQL Databases
(Not relational)
Prof. Dr. Oliver Wendt
Dr. habil. Mahdi Moeini
Business Information Systems & Operations Research
Basics of Data Science
Summer Semester 2022
Part 1, Section 3 → Graph- and Network-based Data Models: Modeling and Querying Structured Relations of Objects

Part 1: Organizing the “Data Lake” (from data mining to data fishing)
• Relational Database Models: Modeling and Querying Structured Attributes of Objects
• Graph- and Network-based Data Models: Modeling and Querying Structured Relations of Objects
• Information Retrieval: Document Mining and Querying of ill-structured Data
• Streaming Data and High Frequency Distributed Sensor Data
• The Semantic Web: Ontologist’s Dream (or nightmare?) of how to integrate evolving heterogeneous data lakes
Summary:
In this section, we will see:
• A short introduction.
• What is a “graph database”?
• Why do we need “graph database”?
• How can we use a “graph database”?
If you are interested in this topic, for further and deeper knowledge on graph
databases, please refer to references and books on graph databases, e.g.,
● Ian Robinson, Jim Webber, and Emil Eifrem. Graph Databases New Opportunities for Connected Data. O'Reilly (2015)
● Andreas Meier and Michael Kaufmann. SQL & NoSQL Databases. Springer (2019)
● Dave Bechberger, Josh Perryman. Graph Databases in Action. Manning (2020)
NoSQL Databases
https://hostingdata.co.uk/nosql-database/
Basic structure of a relational database management system.
Andreas Meier and Michael Kaufmann. SQL & NoSQL Databases. Springer (2019)
Variety of sources for Big Data
Andreas Meier and Michael Kaufmann. SQL & NoSQL Databases. Springer (2019)
NoSQL: Nonrelational SQL, Not Only SQL, No to SQL?
The term NoSQL is basically used for nonrelational data management systems that meet the following conditions:
(1) Data is not stored in table structures.
(2) The database query language is not SQL.
NoSQL databases support various database models.
Andreas Meier and Michael Kaufmann. SQL & NoSQL Databases. Springer (2019)
Ian Robinson, Jim Webber, and Emil Eifrem. Graph Databases New Opportunities for Connected Data. O'Reilly (2015)
Basic structure of a NoSQL database management system
• Mostly, NoSQL database management systems use a massively distributed storage architecture.
• They offer multiple consistency models, e.g., strong consistency, weak consistency, etc.
Andreas Meier and Michael Kaufmann. SQL & NoSQL Databases. Springer (2019)
The definition of NoSQL databases based on the web-based NoSQL Archive:
A Web-based storage system is a NoSQL database system if the following
requirements are met:
• Model: it does not use the relational database model.
• At least three Vs: volume, variety, and velocity.
• Schema: no fixed database schema.
• Architecture: it supports horizontal scaling and massively distributed web
applications.
• Replication: data replication is supported.
• Consistency assurance: consistency is ensured.
NoSQL Archive: http://nosql-database.org/, retrieved February 17, 2015
Andreas Meier and Michael Kaufmann. SQL & NoSQL Databases. Springer (2019)
Three different NoSQL databases.
Andreas Meier and Michael Kaufmann. SQL & NoSQL Databases. Springer (2019)
Ian Robinson, Jim Webber, and Emil Eifrem. Graph Databases New Opportunities for Connected Data. O'Reilly (2015)
Adam Fowler. NoSQL For Dummies (2015)
Graph Databases
A graph is a network composed of nodes (also called vertices) and connections, the so-called edges (or arcs, if directed).

[Figure: an example graph with nodes a to i]

Basic notions:
• Adjacent nodes
• Path
• Degree of a node
• Connected graph
• Disconnected graph
• Directed graph
• Cycle
• Tree
• Forest

HMM: OR, 7.1 and 7.2
• Graph databases leverage relationships in highly connected data with the objective of generating insights.
• Indeed, when we have connected data of significant size or value, a graph database is the best choice to represent and query that data.
• Large companies realized the importance of graphs and graph databases a long time ago, but in recent years graph infrastructures have become more and more common and are used by many organizations.
• Despite this renaissance of graph data and graph thinking in information management, it is interesting and important to note that graph theory itself is not new.
Ian Robinson, Jim Webber, and Emil Eifrem. Graph Databases New Opportunities for Connected Data. O'Reilly (2015)
What Is a Graph?
Formally, a graph is a collection or set of vertices (nodes) and edges that connect the vertices (nodes).
Example: A small social graph
Ian Robinson, Jim Webber, and Emil Eifrem. Graph Databases New Opportunities for Connected Data. O'Reilly (2015)
Example: A simple graph model for
publishing messages in social network.
Relationships: CURRENT and PREVIOUS.
Question: How can you identify Ruth's timeline?
Ian Robinson, Jim Webber, and Emil Eifrem. Graph Databases New Opportunities for Connected Data. O'Reilly (2015)
Why Graph Databases?
(i) Relational Databases Lack Relationships
• Initially, relational databases were designed to codify tabular structures. Even though they do this very well, they struggle to model the ad hoc relationships that we encounter in the real world.
• Relationships do exist in relational databases, but only at the modeling stage, for the purpose of joining tables. This becomes an issue in highly connected domains.
Ian Robinson, Jim Webber, and Emil Eifrem. Graph Databases New Opportunities for Connected Data. O'Reilly (2015)
Why Graph Databases?
(ii) NoSQL Databases Also Lack Relationships
• Most NoSQL databases store collections of disconnected objects (whether documents, values, or columns).
• In NoSQL databases, we can use aggregate identifiers to add relationships, but this can quickly become excessively expensive.
• Moreover, NoSQL databases don’t support operations that point backward.
Ian Robinson, Jim Webber, and Emil Eifrem. Graph Databases New Opportunities for Connected Data. O'Reilly (2015)
Why Graph Databases?
(iii) Graph Databases Embrace Relationships
• In the models studied previously, the data models and databases are blind to any implicit connections in the data. In the graph world, however, connected data is truly stored as connected data.
• Graph models are flexible in the sense that they allow adding new nodes and new relationships without any need to migrate data or compromise the existing network, i.e., the original data and its intent remain unchanged.
Ian Robinson, Jim Webber, and Emil Eifrem. Graph Databases New Opportunities for Connected Data. O'Reilly (2015)
Example:
Ian Robinson, Jim Webber, and Emil Eifrem. Graph Databases New Opportunities for Connected Data. O'Reilly (2015)
Graph Databases
A graph database management system (or, for short, a graph database) is an online database management system that can perform Create, Read, Update, and Delete (CRUD) operations on a graph data model.
https://www.avolutionsoftware.com/abacus/the-graph-database-advantage-for-enterprise-architects/
Ian Robinson, Jim Webber, and Emil Eifrem. Graph Databases New Opportunities for Connected Data. O'Reilly (2015)
Properties of Graph Databases
• (a) The underlying storage:
Some graph databases use native graph storage, which is optimized and designed for storing and managing graphs. Other graph databases do not use native graph storage.
• (b) The processing engine:
In some definitions, the connected nodes of a graph physically “point” to each other in the database. Such a graph database uses so-called index-free adjacency, or native graph processing.
In a broader definition, which we use in this course too, a graph database is one that can perform CRUD operations on graph data.
Ian Robinson, Jim Webber, and Emil Eifrem. Graph Databases New Opportunities for Connected Data. O'Reilly (2015)
Figure: An overview of some of the graph databases.
Ian Robinson, Jim Webber, and Emil Eifrem. Graph Databases New Opportunities for Connected Data. O'Reilly (2015)
Graph Compute Engines
A graph compute engine is a technology that is used to run graph
computational algorithms (such as identifying clusters, …) on large datasets.
Some Graph Compute Engines: Cassovary, Pegasus, and Giraph.
https://github.com/twitter/cassovary
http://www.cs.cmu.edu/~pegasus/
https://giraph.apache.org/
Ian Robinson, Jim Webber, and Emil Eifrem. Graph Databases New Opportunities for Connected Data. O'Reilly (2015)
Data Modeling with Graphs
• Question: how do we model data as graphs?
Models and Goals
• What is “modeling”? An abstraction process motivated by a specific goal, purpose, or need.
• How to model? There is no unique, natural way!
Ian Robinson, Jim Webber, and Emil Eifrem. Graph Databases New Opportunities for Connected Data. O'Reilly (2015)
Labels and Paths in the Graph
• In a given network, the nodes might play one or more roles; e.g., some nodes might represent users, whereas others might represent orders or products, etc. To attribute roles to a node, we can use “labels”. Since a node can take various roles (simultaneously), we might need to associate more than one label with a given node.
• Using labels, we can ask the database to perform different tasks, e.g., find all the nodes labeled “product”.
• In a graph model, relationships are naturally represented by “paths”. Hence, querying (or traversing) a graph model is done by following paths.
Ian Robinson, Jim Webber, and Emil Eifrem. Graph Databases New Opportunities for Connected Data. O'Reilly (2015)
The Labeled Property Graph Model
• A labeled property graph is composed of nodes, relationships, labels, and
properties.
• Nodes hold properties.
• Nodes can have one or more labels to group nodes and indicate their role(s).
• Nodes are connected by relationships, which are named, have direction and
point from a start node to an end node.
• Similar to the case of nodes, relationships can have properties too.
In addition to a labeled property graph model, we need a query language to
create, manipulate, and query data in a graph database.
Ian Robinson, Jim Webber, and Emil Eifrem. Graph Databases New Opportunities for Connected Data. O'Reilly (2015)
Which Query Language?
• Which graph database query language? → Cypher
• Why Cypher? → Standard and widely deployed, easy to learn and
understand (if you have a background in SQL).
• There are other graph database query languages, e.g., SPARQL and Gremlin.
https://neo4j.com/developer/cypher/
Ian Robinson, Jim Webber, and Emil Eifrem. Graph Databases New Opportunities for Connected Data. O'Reilly (2015)
Querying Graphs: An Introduction to Cypher
• Using Cypher, we can ask the database to search for data that corresponds
and matches a given pattern.
Identifiers: Ian, Jim, and Emil
Example: ASCII art representation of this diagram in Cypher:
(emil)<-[:KNOWS]-(jim)-[:KNOWS]->(ian)-[:KNOWS]->(emil)
Ian Robinson, Jim Webber, and Emil Eifrem. Graph Databases New Opportunities for Connected Data. O'Reilly (2015)
Basics of Data Science
28
Some Notation:
• We draw nodes with parentheses.
• We draw relationships with --> and <-- (the signs < and > indicate the direction of the relationship). Between the dashes, the relationship name is set off by square brackets [] and prefixed with a colon.
• Similarly, we put a colon as a prefix to node labels.
• We use curly braces, i.e., {}, to specify node (and relationship) property key-value pairs.
Ian Robinson, Jim Webber, and Emil Eifrem. Graph Databases New Opportunities for Connected Data. O'Reilly (2015)
Basics of Data Science
29
(emil:Person {name:'Emil'})
<-[:KNOWS]-(jim:Person {name:'Jim'})
-[:KNOWS]->(ian:Person {name:'Ian'})
-[:KNOWS]->(emil)
• Identifiers: Ian, Jim, and Emil
• Property: name
• Label: Person
Example: The identifier “Emil” is assigned to a node in the dataset; this node has the label Person and a name property with the value 'Emil'.
Ian Robinson, Jim Webber, and Emil Eifrem. Graph Databases New Opportunities for Connected Data. O'Reilly (2015)
Basics of Data Science
30
Cypher is made up of clauses/keywords, for example a MATCH clause followed by a RETURN clause.
Example 1: Find Person nodes in the graph that have a name of ‘Tom Hanks'.
MATCH (tom:Person {name: 'Tom Hanks'})
RETURN tom
Example 2: Find which ‘Movie’s Tom Hanks has directed.
MATCH (:Person {name: 'Tom Hanks'})-[:DIRECTED]->(movie:Movie)
RETURN movie
https://neo4j.com/developer/cypher/querying/
https://gist.github.com/DaniSancas/1d5265fc159a95ff457b940fc5046887
Ian Robinson, Jim Webber, and Emil Eifrem. Graph Databases New Opportunities for Connected Data. O'Reilly (2015)
Basics of Data Science
31
Example 3: In the following Cypher query,
we use these clauses to find the mutual
friends of a user whose name is Jim.
MATCH (a:Person {name:'Jim'})-[:KNOWS]->(b)-[:KNOWS]->(c),
(a)-[:KNOWS]->(c)
RETURN b, c
Alternatively:
MATCH (a:Person)-[:KNOWS]->(b)-[:KNOWS]->(c), (a)-[:KNOWS]->(c)
WHERE a.name = 'Jim'
RETURN b, c
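For illustration, such a query could be run from Python with the official neo4j driver; in this sketch, the connection URI and credentials are placeholders:

from neo4j import GraphDatabase

# placeholder connection details
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

query = """
MATCH (a:Person {name:'Jim'})-[:KNOWS]->(b)-[:KNOWS]->(c),
      (a)-[:KNOWS]->(c)
RETURN b, c
"""

with driver.session() as session:
    for record in session.run(query):  # one record per matched pattern
        print(record["b"], record["c"])
driver.close()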
Ian Robinson, Jim Webber, and Emil Eifrem. Graph Databases New Opportunities for Connected Data. O'Reilly (2015)
Basics of Data Science
32
The RETURN clause
Using this clause, we specify which nodes, properties, and relationships in the
matched data must be returned to the user.
Some other Cypher clauses
• WHERE: used for filtering results that match a pattern.
• CREATE and CREATE UNIQUE: used for creating nodes and relationships.
• DELETE: removes nodes, properties, and relationships.
• FOREACH: performs an update action on a list.
• START: we use it to specify one or more explicit starting points (i.e., nodes
or relationships) in the given graph.
Ian Robinson, Jim Webber, and Emil Eifrem. Graph Databases New Opportunities for Connected Data. O'Reilly (2015)
Basics of Data Science
33
Appendix:
• Some definitions from graph theory
(for self-study)
Basics of Data Science
34
A graph G is a network composed of nodes (also called vertices) and connections,
the so-called edges (or arcs if directed).
We may denote such a graph by G=(V,E), where V is the set of vertices (nodes) and E is the set of edges.
Suppose that the graph G has n vertices and m edges, and let V={v1, …, vn} be the set of vertices and E={e1, …, em} the set of edges.
Each edge is defined by two nodes, for example: e1 = (v1, v2).
Two nodes vi and vj are adjacent if they are connected by an edge.
Figure: example graph with vertices v1, v2, v3, v4.
Basics of Data Science
35
(*) The degree of a vertex is the number of edges incident to it. Example: the degree of v3 is 3.
(*) A path is a sequence of edges that connects a sequence of (adjacent) vertices.
Example: v2, v1, v3, v4
Figure: example graph with vertices v1, v2, v3, v4.
Basics of Data Science
36
(*) A cycle is a sequence of vertices starting and ending at the same vertex. The length of a cycle is the number of edges in the cycle.
Example: {v1, v2, v3, v1} defines a cycle of length 3.
Figure: example graph with vertices v1, v2, v3, v4.
Basics of Data Science
37
Some definitions:
(*) Connected graph: There is a path between any two vertices.
(*) Disconnected graph: There are at least 2 vertices such that there is no path connecting them.
Basics of Data Science
38
(*) Tree: Any connected graph that has no cycle.
Example:
(*) Forest: a set of trees.
Example:
Basics of Data Science
39
(*) Complete graph (or clique): All vertices are adjacent to each other.
Example:
(*) Planar graph: The graph can be drawn in the plane such that the edges do not cross (i.e., do not overlap).
Example:
Basics of Data Science
40
(*) Directed graphs: graphs whose edges have a direction.
(*) Weighted graph: a graph whose vertices or edges have been assigned weights.
Figure: example weighted graph with vertices v1, …, v4 and edge weights 5, 15, 20, 50.
Basics of Data Science
41
Prof. Dr. Oliver Wendt
Dr. habil. Mahdi Moeini
Business Information Systems & Operations Research
Basics of Data Science
Summer Semester 2022
Part 1, Section 4 → Information Retrieval: Document Mining
and Querying of ill-structured Data
Part 1: Organizing the “Data Lake” (from data mining to data fishing)
• Relational Database Models: Modeling and Querying Structured Attributes
of Objects
• Graph- and Network-based Data Models: Modeling and Querying
Structured Relations of Objects
• Information Retrieval: Document Mining and Querying of ill-structured Data
• Streaming Data and High Frequency Distributed Sensor Data
• The Semantic Web: Ontologist’s Dream (or nightmare?) of how to integrate
evolving heterogeneous data lakes
Basics of Data Science
2
Summary:
In this section, we will see:
• A short introduction
• What is a “document database”?
• Why do we need “document database”?
• How can we use a “document database”?
Basics of Data Science
3
Information Retrieval (IR): Information retrieval might be defined as follows:
• “Information retrieval (IR) is finding material (usually documents) of an
unstructured nature (usually text) that satisfies an information need from
within large collections (usually stored on computers).”
• The term “unstructured data” refers to data that does not have a clear, semantically apparent structure that is easy for a computer to understand. This is unlike what we find in relational databases.
• Information retrieval also involves supporting users in processing, browsing, filtering, or clustering collections of (retrieved) documents.
Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze. An Introduction to Information Retrieval. Cambridge University Press (2009)
Basics of Data Science
4
Examples:
• In a web search, an IR system should find something (text, …) out of billions
of documents that are stored on millions/billions of servers and computers.
• Personal information retrieval:
Email programs contain not only search features but also text classification, e.g., a spam filter that diverts junk e-mails to specific folder(s).
Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze. An Introduction to Information Retrieval. Cambridge University Press (2009)
Basics of Data Science
5
Document Databases
Basics of Data Science
6
Document Databases (DD) or Document Stores
• The focus of document databases is on storage and access methods optimized for documents, as opposed to the rows or records common in a relational database management system (RDBMS).
• Document databases are used to store documents, where each document
represents a single object.
• DDs support files in different formats, and in DDs a variety of operations can be performed on documents, e.g., reading, classifying, putting in a collection, ….
• No need to have a final structure in advance: the organization of a DD comes
completely from the individual documents that are stored in the DD.
Adam Wilbert. Relational Databases: Essential Training. (2019)
Basics of Data Science
7
Example: Document Databases
Basics of Data Science
8
Document Databases Solutions
• There are several platforms for document databases, e.g., MongoDB,
MarkLogic, CouchDB (Apache), etc.
• MongoDB (https://www.mongodb.com/) is probably the most popular
document database system. MongoDB integrates extremely well with
Python (PyMongo).
• MongoDB is a schema-free, document-oriented database that uses collection-oriented storage; collections are analogous to tables in a relational database. Each collection contains documents (possibly nested), and a document is a set of fields, each one being a key-value pair (see the sketch below).
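As a small illustration, a sketch with PyMongo (the connection URI, database, and collection names are placeholders):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")  # placeholder URI
db = client["shop"]                                  # hypothetical database

# a document is a set of key-value pairs; no schema has to be defined upfront
db.products.insert_one({"name": "laptop", "price": 999, "tags": ["electronics"]})

# query documents by field value
for doc in db.products.find({"price": {"$lt": 1500}}):
    print(doc)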
Olivier Curé and Guillaume Blin (editors). RDF Database Systems: Triples Storage and SPARQL Query Processing (Chapter 2). Elsevier (2015)
Basics of Data Science
9
Prof. Dr. Oliver Wendt
Dr. habil. Mahdi Moeini
Business Information Systems & Operations Research
Basics of Data Science
Summer Semester 2022
Part 1, Section 5 → Streaming Data and High Frequency
Distributed Sensor Data
Part 1: Organizing the “Data Lake” (from data mining to data fishing)
• Relational Database Models: Modeling and Querying Structured Attributes
of Objects
• Graph- and Network-based Data Models: Modeling and Querying
Structured Relations of Objects
• Information Retrieval: Document Mining and Querying of ill-structured Data
• Streaming Data and High Frequency Distributed Sensor Data
• The Semantic Web: Ontologist’s Dream (or nightmare?) of how to integrate
evolving heterogeneous data lakes
Basics of Data Science
2
Summary:
In this section, we will see:
• A short introduction
• What is a “streaming data”?
• Challenges of “streaming data”?
• Notes on query languages for streaming data.
Basics of Data Science
3
• Big Data: we might define big data as the case when the dataset is so large that we cannot manage it without nonconventional technologies or algorithms to extract information and knowledge.
• Big data can be characterized by the three “V”s of big data management, i.e., Volume (more and more data), Variety (different types of data), and Velocity (arriving continuously).
• According to Gartner, the three-V concept is summarized as follows:
“high volume, velocity and variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.”
Albert Bifet, Ricard Gavaldà, Geoff Holmes, Bernhard Pfahringer. Machine Learning for Data Streams with Practical Examples in MOA. MIT
Press (2018)
Doug Laney. 3-D Data Management: Controlling Data Volume, Velocity and Variety. META Group Research Note, 2001.
https://www.gartner.com/
Basics of Data Science
4
There are some other Vs that have been added:
• Variability: The structure of the data changes over time.
• Value: we consider data valuable only if it helps us in making better decisions.
• Validity and Veracity: some of the data might not be fully reliable, and it is an important task to manage and control this uncertainty.
Albert Bifet, Ricard Gavaldà, Geoff Holmes, Bernhard Pfahringer. Machine Learning for Data Streams with Practical Examples in MOA. MIT
Press (2018)
Basics of Data Science
5
Using big data technologies has the objective of improving service and life
quality, for example:
• Business: big data can be used to improve service quality through customer personalization and churn detection.
• Technology: using big data technologies, we can reduce processing time of
data from days and hours to just some seconds.
• Health: by mining medical information and records of people, we can
monitor health conditions.
• Smart cities: collecting huge volume of data and processing them effectively
permits us to ensure sustainable economic development, better use of
natural resources, and a higher life quality.
Albert Bifet, Ricard Gavaldà, Geoff Holmes, Bernhard Pfahringer. Machine Learning for Data Streams with Practical Examples in MOA. MIT
Press (2018)
Basics of Data Science
6
Real-Time Analytics (on demand vs. continuous)
Real-time analytics is a particular case of big data. According to Gartner, “real-time analytics is the discipline that applies logic and mathematics to data to provide insights for making better decisions quickly”.
Data Streams:
As an algorithmic abstraction in real-time analytics, data streams are defined as a (possibly infinite) sequence of items or elements, such that each item has a timestamp and a temporal order. In such a sequence, the items arrive one by one, and our objective consists in developing algorithms that make predictions or detect patterns in real time.
https://www.gartner.com/en/information-technology/glossary/real-time-analytics
Albert Bifet, Ricard Gavaldà, Geoff Holmes, Bernhard Pfahringer. Machine Learning for Data Streams with Practical Examples in MOA. MIT
Press (2018)
Basics of Data Science
7
Time, Memory, and Accuracy:
In the stream mining process, we are interested in algorithms that require short computation time and little memory while achieving the highest possible accuracy.
Applications:
Streaming data take place in many contexts, for example:
• Sensor data and the Internet of Things: we find sensors almost everywhere,
in industry and in our cities.
• Telecommunication data: with billions of phones in the world,
telecommunication companies collect a huge amount of phone call data.
Albert Bifet, Ricard Gavaldà, Geoff Holmes, Bernhard Pfahringer. Machine Learning for Data Streams with Practical Examples in MOA. MIT
Press (2018)
Basics of Data Science
8
Applications (cntd.):
• Social media: by using social networks, e.g., Facebook, Twitter, Instagram, and LinkedIn, we continuously produce data for the corresponding companies.
• Marketing and e-commerce: online businesses collect a huge amount of
data in real time, which can be used for different purposes, e.g., fraud
detection.
• Epidemics and disasters: data streams can be used for detecting epidemics and natural disasters.
• Electricity demand prediction: energy providers want to know in advance
the quantity of demand.
Albert Bifet, Ricard Gavaldà, Geoff Holmes, Bernhard Pfahringer. Machine Learning for Data Streams with Practical Examples in MOA. MIT
Press (2018)
Basics of Data Science
9
• Example: An electrical grid, where each dot represents a sensor.
Joao Gama. Knowledge Discovery from Data Streams. Chapman and Hall CRC (2010)
Basics of Data Science
10
• Hardware technology: possible to collect and store data continuously.
• Sources of data generation continuously: surfing on the internet, using a
phone or credit card, etc.
• Challenges:
o Due to the high-speed nature of data streams, data processing can be done in only one pass.
o Temporal locality: stream data may evolve over time.
o Unbounded memory requirements: a huge (unbounded) volume of data streams is generated.
Charu C. Aggarwal. Data Streams Models and Algorithms. Springer (2007)
Basics of Data Science
11
• Tradeoff between Accuracy and Efficiency: The algorithm must ensure a
tradeoff between the accuracy of the result and the computation time and
the required space (memory).
• The data are not independent and identically distributed.
• Visualization: It is a big challenge to present effectively numerical results
and information obtained from a huge amount of data.
• Hidden big data: A large amount of potentially useful data are not used for
many reasons.
Charu C. Aggarwal. Data Streams: Models and Algorithms. Springer (2007)
Mohamed Medhat Gaber, Arkady Zaslavsky and Shonali Krishnaswamy. A Survey of Classification Methods in Data Streams. In: Data Streams: Models and Algorithms. Springer (2007)
Joao Gama. Knowledge Discovery from Data Streams. Chapman and Hall CRC (2010)
Albert Bifet, Ricard Gavaldà, Geoff Holmes, Bernhard Pfahringer. Machine Learning for Data Streams with Practical Examples in MOA. MIT
Press (2018)
Basics of Data Science
12
• Data stream: The term data stream is used for a countably infinite
sequence of items/elements with the objective of representing data
elements that become available progressively over time.
• A stream-based application: analyzes elements that become available from streams to instantly produce new results, with the objective of providing fast reactions if required.
• Types of data streams models:
o Structured: data elements exhibit a certain schema or format.
o Unstructured: may contain arbitrary formats and contents.
Alessandro Margara, Tilmann Rabl. Definition of Data Streams. In: Encyc. Big Data Tech., Springer (2019)
Basics of Data Science
13
• Types of structured streams:
I. The turnstile model: the most general model, where a vector of elements is used to model the stream. Each element of the stream is an update to an element of the underlying vector, whose size is the domain of the streaming elements.
II. The cash register model: in this model, stream elements can only be added to the underlying vector.
III. The time series model: this model considers each stream element as a new and independent entry to the vector. Consequently, the underlying vector is constantly growing and, in general, can be unbounded. This model is frequently used in current stream processing engines.
Alessandro Margara, Tilmann Rabl. Definition of Data Streams. In: Encyc. Big Data Tech., Springer (2019)
Basics of Data Science
14
• Stream processing: it consists in analyzing streaming data on-the-fly with
the objective of producing updated results as soon as new data are
received.
• “Time” in stream processing: in many stream processing applications, time plays a central role, i.e., either we need to update the results by taking into account the most recent data, or we want to detect temporal trends.
• “Windows” in stream processing: windows are used to define bounded
segments of elements over an unbounded data stream.
Alessandro Margara, Tilmann Rabl. Definition of Data Streams. In: Encyc. Big Data Tech., Springer (2019)
Basics of Data Science
15
• “Windows” are used to compute some statistical information, e.g., average
of some input data.
• The most common types of windows:
o count-based windows: the size is defined in terms of the number of
elements,
o time-based windows: the size is defined in terms of a time frame.
• In both types, we have:
o sliding windows: the window progresses continuously upon arrival of new data elements,
o tumbling windows: they can collect multiple elements before moving (see the sketch below).
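As a minimal sketch (illustrative count-based windows of size 3 over a toy stream), the two behaviours can be contrasted as follows:

from collections import deque

stream = [4, 7, 1, 9, 3, 5]  # illustrative data stream

# Count-based sliding window: advances by one element at a time
window = deque(maxlen=3)
for item in stream:
    window.append(item)
    if len(window) == 3:
        print("sliding avg:", sum(window) / 3)

# Count-based tumbling window: collects 3 elements, then moves on
buffer = []
for item in stream:
    buffer.append(item)
    if len(buffer) == 3:
        print("tumbling avg:", sum(buffer) / 3)
        buffer.clear()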
Alessandro Margara, Tilmann Rabl. Definition of Data Streams. In: Encyc. Big Data Tech., Springer (2019)
https://docs.microsoft.com/en-us/stream-analytics-query/tumbling-window-azure-stream-analytics
Basics of Data Science
16
Examples:
https://docs.microsoft.com/en-us/stream-analytics-query/sliding-window-azure-stream-analytics
https://docs.microsoft.com/en-us/stream-analytics-query/tumbling-window-azure-stream-analytics
Nicoló Rivetti. Introduction to Stream Processing Algorithms. In: Encyc. Big Data Tech., Springer (2019)
Datar M, Gionis A, Indyk P, Motwani R. Maintaining stream statistics over sliding windows. SIAM J. Comput. 31(6):1794–1813 (2002)
Basics of Data Science
17
Monitoring massive data streams:
There are two main approaches:
• Sampling: In the sampling approaches, all elements are read once, but only a subset of them is kept for further processing. There are several methods for selecting the samples, which are expected to be representative (one is sketched below). In a competitive market, it might be necessary to keep the sampling policy secret; otherwise, an adversary can benefit from it.
• Summaries: A summary approach scans each piece of stream input data on-the-fly and keeps locally compact sketches or synopses that contain the most representative and important information.
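One classic method of this kind is reservoir sampling, which keeps a uniform random sample of fixed size k while reading the stream once; a minimal sketch:

import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of size k from a stream read once."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # keep the new item with probability k / (i + 1)
            j = random.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

print(reservoir_sample(range(10_000), 5))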
Albert Bifet, Ricard Gavaldà, Geoff Holmes, Bernhard Pfahringer. Machine Learning for Data Streams with Practical Examples in MOA. MIT
Press (2018)
Nicoló Rivetti. Introduction to Stream Processing Algorithms. In: Encyc. Big Data Tech., Springer (2019)
Basics of Data Science
18
Query languages for processing streaming data:
• An essential difference between a stream query language and a conventional one: stream queries continue to produce answers as new elements arrive; hence, queries are stored and evolve over time.
• Most stream query languages try to extend SQL.
o Academic languages: CQL, SQuAl, ESL, etc.
o Commercial languages: StreamSQL, CCL, EQL, StreaQuel, etc.
• Stream query languages differ mainly in their approach to addressing the requirements of stream processing: language closure, windowing, correlation, and pattern matching.
Mitch Cherniack and Stan Zdonik. Stream-Oriented Query Languages and Operators. In: Encyclopedia of Database Systems. Springer (2009)
Basics of Data Science
19
Prof. Dr. Oliver Wendt
Dr. habil. Mahdi Moeini
Business Information Systems & Operations Research
Basics of Data Science
Summer Semester 2022
Part 1, Section 6 → The Semantic Web: Ontologist's Dream (or
nightmare?) of how to integrate evolving heterogeneous data
lakes
Part 1: Organizing the “Data Lake” (from data mining to data fishing)
• Relational Database Models: Modeling and Querying Structured Attributes
of Objects
• Graph- and Network-based Data Models: Modeling and Querying
Structured Relations of Objects
• Information Retrieval: Document Mining and Querying of ill-structured Data
• Streaming Data and High Frequency Distributed Sensor Data
• The Semantic Web: Ontologist's Dream (or nightmare?) of how to integrate
evolving heterogeneous data lakes
Basics of Data Science
2
Summary:
In this section, we will see:
• What is “Data Lake”?
• What is “Ontology”?
• What is a “Semantic Web”?
• What is “Data Integration”?
• How to integrate evolving heterogeneous data sources?
Basics of Data Science
3
Data lake
Basics of Data Science
4
Definition: a data lake is a data repository in which we store different datasets coming from multiple sources and in their original structures.
Data Lakes versus Data Warehouses:
• A data warehouse is a database that is optimized to analyze relational data
that come from transactional systems. Moreover,
(i) the data is cleaned,
(ii) the data structure and schema are defined in advance.
• A data lake stores relational as well as non-relational data. In a data lake, the structure or schema of the data may not be known when the data is captured.
Christoph Q. and Rihan H., Data Lake. In: Encyc. Big Data Tech., Springer (2019)
https://aws.amazon.com/big-data/datalakes-and-analytics/what-is-a-data-lake/
Basics of Data Science
5
Tasks and features of a “data lake”:
• Extracting data and metadata from multiple and heterogeneous sources.
• Ingesting the extracted data into a storage system.
• Transforming, cleaning, and integrating data with other datasets.
• Providing the possibility to explore and query the data and metadata.
Christoph Q. and Rihan H., Data Lake. In: Encyc. Big Data Tech., Springer (2019)
https://lakefs.io/data-lakes/
Basics of Data Science
6
Data lake architecture
Four layers:
• ingestion,
• storage,
• transformation,
• interaction.
Christoph Q. and Rihan H., Data Lake. In: Encyc. Big Data Tech., Springer (2019)
Basics of Data Science
7
Ingestion Layer: This layer is responsible for importing data. More precisely, ingesting data and extracting metadata should be done automatically (as far as possible). Here, data quality (DQ) control is used to ensure the quality of the ingested data.
Christoph Q. and Rihan H., Data Lake. In: Encyc. Big Data Tech., Springer (2019)
Basics of Data Science
8
Storage Layer: The main components are:
• the metadata repository: it stores all the metadata of the data lake, no matter whether it was partially collected automatically or will be added manually later.
• the raw data repositories: the data are ingested in original formats, and we
need different storage systems for different data types, e.g., for relational,
XML, graph, etc.
There should be a data access interface for querying.
Christoph Q. and Rihan H., Data Lake. In: Encyc. Big Data Tech., Springer (2019)
Basics of Data Science
9
Transformation Layer:
This layer transforms the raw data into a desired target structure. For this
purpose, a data lake has a data transformation engine where data can be
cleaned, transformed, and integrated in a scalable way.
Christoph Q. and Rihan H., Data Lake. In: Encyc. Big Data Tech., Springer (2019)
Basics of Data Science
10
Interaction Layer:
The focus of this layer is on interactions between users and the data lake.
The components data exploration and metadata manager are in close
relationship to provide access and exploration of the data by the users.
Christoph Q. and Rihan H., Data Lake. In: Encyc. Big Data Tech., Springer (2019)
Basics of Data Science
11
Ontology
Basics of Data Science
12
Terminology of Ontology: Originating from ancient Greek philosophy, the term ontology is composed of on (genitive ontos), a form of the verb “to be”, and logia, which refers to “science” or “study”. Hence, the term ontology might be interpreted as “the study of being”. In recent years, computer scientists have adopted the term ontology.
Ontology: Classically, in artificial intelligence (AI), ontologies were defined as a
kind of “knowledge representation” and “knowledge models”. In computer
science, an ontology can be defined as “a set of representational primitives
with which to model a domain of knowledge or discourse”. In this definition,
primitives refer to “concepts and relations” or “classes and attributes” (or
properties or other things that define relations between elements/terms).
Eva Blomqvist. Ontologies for Big Data. In: Encyc. Big Data Tech., Springer (2019)
Gruber T. Ontology. In: Encyclopedia of database systems. Springer (2009)
Grigoris Antoniou, Paul Groth, Frank van Harmelen, Rinke Hoekstra. A Semantic Web Primer. MIT Press (2012)
Basics of Data Science
13
An example: in the setting of a university, faculty members, staff, students, courses, lecture rooms, and disciplines are some important classes of concepts for which we can define different relationships.
In this context, ontologies may also contain information like:
● Properties, e.g., “A” teaches “B”,
● Restrictions, e.g., only professors can have PhD students,
● Statements, e.g., professors and other staff are disjoint.
In the web context, by using ontologies we create a shared understanding of a given domain. The most important ontology languages:
● Resource Description Framework (RDF): a vocabulary description language.
● Web Ontology Language (OWL): a richer vocabulary description language.
Grigoris Antoniou, Paul Groth, Frank van Harmelen, Rinke Hoekstra. A Semantic Web Primer. MIT Press (2012)
Basics of Data Science
14
An example of an RDF graph:
Grigoris Antoniou, Paul Groth, Frank van Harmelen, Rinke Hoekstra. A Semantic Web Primer. MIT Press (2012)
Basics of Data Science
15
Applications:
The following list includes some of the tasks for which ontologies can be used:
• Data integration: ontologies can act as a model that unifies representation
and linking various datasets.
• Data access: ontologies can act as vocabularies with the objective of
understanding and querying datasets.
• Data analysis, cleaning, and constraint checking: provides opportunities for
performing analytical queries.
• Integration with ML approaches: ontologies can be used as a structure of
input and output features.
Gruber T. Ontology. In: Encyclopedia of database systems. Springer (2009)
Basics of Data Science
16
Semantic Web
Basics of Data Science
17
Semantic Web: “to make the web more accessible to computers.”
More precisely, in the current state, computers use values and keywords to search for information, which is then sent from servers to users. This is all the work done by the current web; all intelligent work is done by humans.
The idea of the semantic web (or web of data) consists in making the web richer for machines, i.e., the web becomes a source of machine-readable and machine-understandable data.
https://www.merriam-webster.com/dictionary/semantic
Grigoris Antoniou, Paul Groth, Frank van Harmelen, Rinke Hoekstra. A Semantic Web Primer. MIT Press (2012)
Basics of Data Science
18
The Semantic Web follows these design principles:
1. Creating a standard format for structured and semi-structured data on the web,
2. Creating not only datasets but also individual data elements and their relations, and making all of them accessible on the web,
3. Making the semantics of the data explicit and understandable by machines.
Semantic Web technology uses labeled graphs, Uniform Resource Identifiers
(URI) to identify the data elements and their relations in the datasets, and
ontologies to formally represent the semantics of the data.
In this context, RDF and OWL are used as “knowledge representation”
languages.
Grigoris Antoniou, Paul Groth, Frank van Harmelen, Rinke Hoekstra. A Semantic Web Primer. MIT Press (2012)
Basics of Data Science
19
Querying the Semantic Web:
SPARQL: a query language, similar to SQL but specifically designed for RDF, to select and extract information from knowledge expressed in RDF.
Triplestore or RDF store: software that stores RDF data and executes SPARQL queries.
A sample SPARQL code:
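(The original sample code is shown as a figure.) For illustration, a SPARQL query can be executed from Python with rdflib; in this sketch the file name and the FOAF vocabulary are assumptions:

import rdflib

g = rdflib.Graph()
g.parse("people.ttl", format="turtle")  # hypothetical RDF file

# select all names from FOAF data
query = """
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name WHERE { ?person foaf:name ?name }
"""
for row in g.query(query):
    print(row.name)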
Grigoris Antoniou, Paul Groth, Frank van Harmelen, Rinke Hoekstra. A Semantic Web Primer. MIT Press (2012)
Basics of Data Science
20
An example:
Grigoris Antoniou, Paul Groth, Frank van Harmelen, Rinke Hoekstra. A Semantic Web Primer. MIT Press (2012)
Basics of Data Science
21
Definition: Information integration, or data integration, consists in posing a
single query that involves several data sources with the aim of receiving a
single answer.
Integration-Oriented Ontology:
Using an integration-oriented ontology, we want to conceptualize a domain of interest in order to automate data integration from evolving heterogeneous sources of data; this is done using Semantic Web technologies.
In this way, we make a connection between domain concepts and the underlying data sources. Then, the ontology-mediated queries of a user (a data analyst) are automatically translated to the corresponding query language of the available sources.
Sergi Nadal and Alberto Abelló. Integration-Oriented Ontology. In: Encyc. Big Data Tech., Springer (2019)
Basics of Data Science
22
To implement data-integration settings, we can use Semantic Web technologies. Thanks to the flexibility and simplicity of ontologies, they are used to define a unified interface for heterogeneous environments.
In fact, ontologies are structured into two levels
• TBox: represents terminology,
• ABox: represents assertions.
In this context, Resource Description Framework (RDF) can be used to
represent the knowledge for an automated processing.
Sergi Nadal and Alberto Abelló. Integration-Oriented Ontology. In: Encyc. Big Data Tech., Springer (2019)
Basics of Data Science
23
Example: a query execution in the Semantic Web.
Sergi Nadal and Alberto Abelló. Integration-Oriented Ontology. In: Encyc. Big Data Tech., Springer (2019)
Basics of Data Science
24
Example: a query execution in integration-oriented ontologies.
Sergi Nadal and Alberto Abelló. Integration-Oriented Ontology. In: Encyc. Big Data Tech., Springer (2019)
Basics of Data Science
25
Prof. Dr. Oliver Wendt
Dr. habil. Mahdi Moeini
Business Information Systems & Operations Research
Basics of Data Science
Summer Semester 2022
Part 2, Section 1 → From linear to non-linear regression models
Part 2: Stochastic Models on structured attribute data:
• From Linear to Non-Linear Regression models
• Support Vector Machines
• Deep? Neural Network Models
• Learning from Data Streams: Training and Updating Deterministic and
Stochastic Models
• Reinforcement Learning
Basics of Data Science
2
Summary:
In this chapter, we will see:
• What is “Machine Learning”?
• Some Basic Concepts
• Regression Analysis
• Linear Regression
• Validity of Linear Regression Model
• Logistic Regression
Basics of Data Science
3
What you should know for this part of the course:
• Some statistics
• Python
• Data collection and data cleaning
Basics of Data Science
4
Machine Learning:
• Machine learning, or ML, is a field that is devoted to understanding and
building methods that learn automatically from data with the objective of
improving performance on some set of tasks.
• Different ML methods: supervised, unsupervised, semi-supervised, and
reinforcement learning methods.
• Example of supervised machine learning: teacher and student.
Laura Igual and Santi Seguí. Introduction to Data Science A Python Approach to Concepts, Techniques and Applications. Springer (2017)
Basics of Data Science
5
Supervised learning problems:
• Regression: continuous variables
• Classification: categorical variables
House # | Location factor | Building year | Surface
1 | 0.9 | 2000 | 50
2 | 0.8 | 1995 | 120
3 | 1.2 | 1980 | 80
Basics of Data Science
6
General steps of creating a ML model:
• Cleaning the available data.
• Splitting data:
o Testing dataset: for testing the performance of the model.
o Training dataset: for training the model.
o Validation dataset: for adjusting the model.
Tools:
• Python and python packages (NumPy, pandas, Matplotlib, and scikit-learn)
Laura Igual and Santi Seguí. Introduction to Data Science A Python Approach to Concepts, Techniques and Applications. Springer (2017)
Basics of Data Science
7
Some notes on supervised learning:
• Supervised machine learning models need our help!
o Clean data
o Training
o Testing
• Evaluation metrics: accuracy and precision.
o Measured in terms of percentage.
• Challenges: Preparing and cleaning data.
• Advantages: working with labeled data, low complexity, and easier
interpretation.
Basics of Data Science
8
Graphical Representation
Scatter Plot
• A scatter plot (Chambers 1983) reveals relationships or associations between two variables. A scatter plot usually shows a large body of data.
• The relationship between two variables is called correlation.
• The more closely the data points cluster around a straight line when plotted, the higher the correlation between the two variables, or the stronger the relationship.
• If the data points form a straight line going from the origin out to high x- and y-values, the variables are said to have a positive correlation.
• If the line goes from a high value on the y-axis down to a high value on the x-axis, the variables have a negative correlation.
Basics of Data Science
9
Graphical Representation
Scatter plots are especially useful when there is a large number of data points.
They provide the following information about the relationship between two
variables:
• Strength
• Shape: linear, curved, etc.
• Direction - positive or negative
• Presence of outliers
A correlation between the variables results in the clustering of data points
along a line.
Basics of Data Science
10
Graphical Representation
The following is an example of a scatter plot suggestive of a positive linear
relationship.
Basics of Data Science
11
Basic Statistics
Correlation Coefficient (r)
It is a coefficient that indicates the strength of the association between any
two metric variables.
• The sign (+ or -) indicates the direction of the relationship.
• The value can range from “+1” to “-1”, with:
o “+1” indicating a perfect positive relationship,
o “0” indicating no relationship,
o and “-1” indicating a perfect negative or reverse relationship (as one
variable grows larger, the other variable grows smaller)
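For instance, a minimal check with NumPy on illustrative data:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])   # roughly 2*x: strong positive relation

r = np.corrcoef(x, y)[0, 1]   # Pearson correlation coefficient
print(round(r, 3))            # close to +1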
Basics of Data Science
12
Regression Analysis
Basics of Data Science
13
Regression Analysis
In regression, we want to make predictions.
• How is sales volume affected by the weather?
• How does oil price affect bread price?
• How does oil price affect inflation?
• How does the amount of a drug absorbed by the patient’s body affect the
blood pressure?
Common point: ask for a response (dependent variable) which can be written
as a combination of one or more predictors (independent variables).
In regression, we build a model to predict the response (dependent variable)
from the independent variables.
Laura Igual and Santi Seguí. Introduction to Data Science A Python Approach to Concepts, Techniques and Applications. Springer (2017)
Basics of Data Science
14
Linear Regression
In regression, our objective consists in building a model to describe the relation between the response (dependent variable) $y \in \mathbb{R}$ and a combination of one or more (independent) variables $x_i \in \mathbb{R}$.
In the linear regression model, we describe the response $y$ as a linear combination of $m$ variables $x_i$:
$$y = \beta_1 x_1 + \dots + \beta_m x_m$$
Based on the number of predictors, we have two types of linear regression:
1. Simple Linear Regression: one response and one predictor.
2. Multiple Linear Regression: one response and two or more predictors.
Laura Igual and Santi Seguí. Introduction to Data Science A Python Approach to Concepts, Techniques and Applications. Springer (2017)
Sanjeev J. Wagh, Manisha S. Bhende, and Anuradha D. Thakare. Fundamentals of Data Science. CRC Press (2022)
Basics of Data Science
15
Linear Regression
Simple linear regression: Assume that $n$ samples $(x_1, y_1), \dots, (x_n, y_n)$ are given; then the regression line is defined as follows:
$$y = \beta_0 + \beta_1 x$$
The parameter $\beta_0$: intercept or the constant term.
The parameter $\beta_1$: the slope.
For the observations $(x_i, y_i)$:
$$y_i = \beta_0 + \beta_1 x_i + e_i$$
where $e_i$ denotes the error term.
Laura Igual and Santi Seguí. Introduction to Data Science A Python Approach to Concepts, Techniques and Applications. Springer (2017)
Sanjeev J. Wagh, Manisha S. Bhende, and Anuradha D. Thakare. Fundamentals of Data Science. CRC Press (2022)
Basics of Data Science
16
Linear Regression
Example:
Laura Igual and Santi Seguí. Introduction to Data Science A Python Approach to Concepts, Techniques and Applications. Springer (2017)
Basics of Data Science
17
Linear Regression
The Ordinary Least Squares (OLS): an approach to find the values for the $\beta$ parameters by minimizing the squared distance of the predicted values from the actual values:
$$\|\beta_0 + \beta_1 x - y\|_2^2 = \sum_{i=1}^{n} (\beta_0 + \beta_1 x_i - y_i)^2$$
The Residual Sum of Squares (RSS) of the prediction is a quadratic convex function and has a unique global minimum at $\hat{w} = (\hat{\beta}_0, \hat{\beta}_1)$, where
$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$
where $\bar{x}$ and $\bar{y}$ are the sample means.
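A minimal sketch of these formulas in Python (illustrative data):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

x_bar, y_bar = x.mean(), y.mean()
beta1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)  # slope
beta0 = y_bar - beta1 * x_bar                                         # intercept
print(beta0, beta1)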
Laura Igual and Santi Seguí. Introduction to Data Science A Python Approach to Concepts, Techniques and Applications. Springer (2017)
Basics of Data Science
18
Validity of Linear Regression Model
1. Linearity: There should be a linear relationship between the response
variable and the predictor(s).
How to check:
• Scatterplot.
• Covariance Analysis (COV).
• Correlation Analysis (e.g., Pearson’s Correlation Coefficient).
• etc.
Sanjeev J. Wagh, Manisha S. Bhende, and Anuradha D. Thakare. Fundamentals of Data Science. CRC Press (2022)
https://www.alpharithms.com/simple-linear-regression-modeling-502111/
Basics of Data Science
19
Validity of Linear Regression Model
2. Normality: The error terms (residuals) should follow a normal distribution.
How to check:
• Sometimes we can ignore it!
• Visual check can be done by Quantile-Quantile (Q-Q) plots.
• Other tests: Omnibus test (in Python), Shapiro-Wilk, Kolmogorov-Smirnov,
etc.
Sanjeev J. Wagh, Manisha S. Bhende, and Anuradha D. Thakare. Fundamentals of Data Science. CRC Press (2022)
https://www.alpharithms.com/simple-linear-regression-modeling-502111/
Basics of Data Science
20
Validity of Linear Regression Model
3. Independence: There should be no correlation among the error terms.
In fact, if there is correlation among the error terms, then this is
called "autocorrelation".
How to check:
• Durbin-Watson test: available in the package statsmodels of Python
• Breusch-Godfrey test.
• Ljung-Box test.
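For example, a sketch using statsmodels on hypothetical residuals resid of a fitted model (values near 2 suggest no autocorrelation):

import numpy as np
from statsmodels.stats.stattools import durbin_watson

resid = np.random.default_rng(0).normal(size=100)  # hypothetical residuals
dw = durbin_watson(resid)   # Durbin-Watson statistic
print(dw)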
Sanjeev J. Wagh, Manisha S. Bhende, and Anuradha D. Thakare. Fundamentals of Data Science. CRC Press (2022)
https://www.statsmodels.org/stable/generated/statsmodels.stats.stattools.durbin_watson.html
Basics of Data Science
21
Validity of Linear Regression Model
4. No Multicollinearity: The predictors should be independent of each other.
In the absence of this independence, there is a multicollinearity issue.
How to check:
- Sensitivity check of the regression coefficients
- Farrar–Glauber test
- Condition Number Test
- etc.
Sanjeev J. Wagh, Manisha S. Bhende, and Anuradha D. Thakare. Fundamentals of Data Science. CRC Press (2022)
https://www.statsmodels.org/stable/generated/statsmodels.stats.stattools.durbin_watson.html
Basics of Data Science
22
Validity of Linear Regression Model
5. Homoscedasticity (constant variance):
When the variance of the error terms (residuals) appears constant over a range
of predictor variables, the data are said to be homoscedastic. To have a valid
linear regression model, we should have “homoscedasticity”, i.e., all error
terms (residuals) should have the same variance.
How to check:
- Scatterplot.
- Breusch-Pagan test (exists in the statsmodels of Python).
- Levene’s test.
- Park test, etc.
Sanjeev J. Wagh, Manisha S. Bhende, and Anuradha D. Thakare. Fundamentals of Data Science. CRC Press (2022)
https://www.statsmodels.org/stable/generated/statsmodels.stats.stattools.durbin_watson.html
Basics of Data Science
23
Validity of Linear Regression Model
Example: Heteroscedasticity.
https://www.statsmodels.org/stable/generated/statsmodels.stats.stattools.durbin_watson.html
Basics of Data Science
24
Nonlinear Regression?
Basics of Data Science
25
Relative performance as a function of
layer size, resource dimensions and evaluations
Figure: two panels comparing results after 50,000 and 500,000 evaluations.
Basics of Data Science
26
Concentration in network effect markets
for low price products
Basics of Data Science
27
Concentration in network effect markets
for high price products
Basics of Data Science
28
Market concentration as a joint function of
network centrality and closeness
Basics of Data Science
29
Does the surplus for the monopolists depend
on the closeness of the network structure?
Basics of Data Science
30
Does anticipation in agents' decision making lead to significantly different concentration and overall welfare?
Basics of Data Science
31
How to deal with integer covariates?
Basics of Data Science
32
Fitness depending on
Population Size and Sampling Rate
Basics of Data Science
33
Logistic Regression
Basics of Data Science
34
Logistic Regression
• Logistic regression is a classification approach.
• Logistic regression is mainly used for qualitative (categorical) response variables. In fact, for qualitative (categorical) response variables, linear regression is not a reliable option because:
o (a) a linear regression method cannot handle qualitative response variables with more than two classes;
o (b) a linear regression method does not provide meaningful estimates even if there are only two classes.
• Logistic regression is suitable for binary qualitative response values.
G. James, Daniela Witten, Trevor Hastie, Robert Tibshirani. An Introduction to Statistical Learning in R. Springer (2021)
Basics of Data Science
35
Logistic Regression
• Given: a set of training observations $(x_1, y_1), \dots, (x_n, y_n)$.
• We use the given data to build a classifier.
• The question: how should we model the relationship between the predictor $x$ and $p(x) = \Pr(y = 1 \mid x)$?
• In linear regression, we used $p(x) := y = \beta_0 + \beta_1 x$.
• However, this does not fit well to the case of 0/1 classification.
G. James, Daniela Witten, Trevor Hastie, Robert Tibshirani. An Introduction to Statistical Learning in R. Springer (2021)
Basics of Data Science
36
Logistic Regression
Linear versus Logistic Regression
G. James, Daniela Witten, Trevor Hastie, Robert Tibshirani. An Introduction to Statistical Learning in R. Springer (2021)
Basics of Data Science
37
Logistic Regression
To overcome this issue, we need to formulate $p(x)$ using a function that gives values between 0 and 1 no matter the value of the variable $x$. Many functions do this job; in logistic regression, however, the following (nonlinear) logistic function is used:
$$p(x) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}$$
With this function, the value of $p(x)$ never drops below 0 and never goes over 1. The function always produces an S-shaped curve between 0 and 1.
G. James, Daniela Witten, Trevor Hastie, Robert Tibshirani. An Introduction to Statistical Learning in R. Springer (2021)
Laura Igual and Santi Seguí. Introduction to Data Science A Python Approach to Concepts, Techniques and Applications. Springer (2017)
Basics of Data Science
38
Logistic Regression
“Odds”:
After rearranging terms in the logistic function, we obtain:
$$\frac{p(x)}{1 - p(x)} = e^{\beta_0 + \beta_1 x}$$
The quantity $\frac{p(x)}{1 - p(x)}$ is called the “odds”. The odds can take on any value between 0 and $\infty$. Indeed, values of the odds close to 0 and $\infty$ indicate very low and very high probabilities of response = 1, respectively.
G. James, Daniela Witten, Trevor Hastie, Robert Tibshirani. An Introduction to Statistical Learning in R. Springer (2021)
Basics of Data Science
39
Logistic Regression
“Log Odds” or “Logit”:
The left-hand side of the following equation is called the “log odds” or “logit”:
$$\log\left(\frac{p(x)}{1 - p(x)}\right) = \beta_0 + \beta_1 x$$
• In linear regression, $\beta_1$ gives the average change in $y$ if we increase the value of $x$ by 1 unit.
• Increasing the value of $x$ by 1 unit changes the value of the logit by $\beta_1$.
• Such a relation does not hold for $p(x)$; but if $\beta_1$ is positive (negative), then increasing $x$ will increase (decrease) $p(x)$ too.
G. James, Daniela Witten, Trevor Hastie, Robert Tibshirani. An Introduction to Statistical Learning in R. Springer (2021)
Basics of Data Science
40
Logistic Regression
• The $\beta$ parameters in the logistic regression model have to be estimated.
• There are several approaches to achieve this objective.
• The preferred approach is the method of maximum likelihood.
• The mathematical formulation of the likelihood function:
$$\ell(\beta_0, \beta_1) = \prod_{i:\,y_i = 1} p(x_i) \prod_{i':\,y_{i'} = 0} \left(1 - p(x_{i'})\right)$$
• Objective: estimating $\hat{\beta}_0$ and $\hat{\beta}_1$ that maximize this likelihood function.
• In Python: logistic regression is available in scikit-learn via its LogisticRegression class (see the sketch below).
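A minimal sketch with illustrative one-dimensional data:

import numpy as np
from sklearn.linear_model import LogisticRegression

# illustrative data: one predictor, binary response
X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression()     # fits beta_0, beta_1 by maximum likelihood
model.fit(X, y)
print(model.intercept_, model.coef_)   # estimated beta_0 and beta_1
print(model.predict_proba([[2.0]]))    # class probabilities [1 - p(x), p(x)] at x = 2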
G. James, Daniela Witten, Trevor Hastie, Robert Tibshirani. An Introduction to Statistical Learning in R. Springer (2021)
Basics of Data Science
41
Prof. Dr. Oliver Wendt
M.Sc. Manuel Hermes
Business Information Systems & Operations Research
Basics of Data Science
Summer Semester 2022
Part 2, Section 2 → Deep? Neural Network Models
Part 2: Stochastic Models on structured attribute data:
• From Linear to Non-Linear Regression models
• Deep? Neural Network Models
• Support Vector Machines
• Learning from Data Streams: Training and Updating Deterministic and
Stochastic Models
• Reinforcement Learning
Basics of Data Science – ANN
2
Artificial Neural Networks (I)
Basics of Data Science – ANN 1
3
Application Areas (I)
• Text recognition, OCR
– Handwritten/scanned text
http://www.plate-recognition.info
Basics of Data Science – ANN 1
http://geeknizer.com
4
Application Areas (II)
• Facial recognition system / Deep fakes (GAN)
http://www.giga.de
Basics of Data Science – ANN 1
5
Application Areas (III)
• Early warning systems (tornados)
• Time-Series-Analysis (weather, shares)
• Virtual agents, AI in games and simulations
• Medical diagnostics
• Autonomous vehicles (ATO / driverless cars)
Basics of Data Science – ANN 1
6
Why ANN? (I)
• Many problems can‘t be solved by using explicit knowledge
(i.e. knowledge that can be stored symbolically via characters
(language, writing))
• Implicit knowledge is needed (= being able to do sth. without
knowing how to do it) -> Transfer learning
• Examples:
– Facial Recognition Systems (many pixels, few feasible solutions)
– Autonomous Vehicles
Basics of Data Science – ANN 1
7
Why ANN? (II)
• Example: Time-Series-Analysis
– Advantages:
• Non-linear relations are easier to represent with non-linear
activation functions
• Flexible, as they don‘t need information about probability
distributions or formal model specifications
• No assumptions necessary for building the forecast model
• Many parameters are given by the application area and the
available data
• ANN are quite robust against noise
• Training during forecast is possible, adaptions according to
changing relations can be made (Continuous Learning, CL)
Basics of Data Science – ANN 1
8
Why ANN? (III)
• Example: Time-Series-Analysis
– Disadvantages:
• Learning process is time-consuming
• Knowledge only by learning (i.e. known relations can‘t be
implemented in advance -> Transfer learning)
• Not all parameters are fixed in advance and have to be
specified
-> frequently, there are no satisfying heuristics
-> time-consuming
Basics of Data Science – ANN 1
9
The biological role model
The human brain consists of ca. 10¹¹ neurons and ca. 10¹³ connections.
• Dendrites: direct the incoming signals to the cell nucleus.
• Cell nucleus: processes the incoming signals.
• Axon: forwards the output signals to other cells.
• Synapses: connections between the axon of one neuron and the dendrite of another one.
Figure: biological neuron (cell body, nucleus, nucleolus, dendrites, axon, synaptic node).
Basics of Data Science – ANN 1
10
The Artificial Neuron
Figure: an artificial neuron with input signals x1, x2, …, xn, a transfer function with local memory (“activate”), and an output signal y; copies of the output signal are passed on.
Basics of Data Science – ANN 1
11
Artificial Neural Network - Basics
• An artificial neural network consists of several
neurons/units and connections
• Neurons/Units:
– Input unit
– Hidden unit
– Bias unit
– Output unit
• Connections:
– Directed
– Weighted
• Several units of the same type form a layer
Basics of Data Science – ANN 1
12
Input and propagation function
• ai = activity level of the sending unit
• wij = weight of the connection between neuron i and j
• Input of unit j:
inputji = aiwij
• Net input of unit j:
netinputj = Σi inputji = Σi aiwij
(propagation function)
http://www.neuronalesnetz.de
Basics of Data Science – ANN 1
13
Activation Function,
Output and Bias
• Activation function:
relationship between net input and
activity level of a neuron
• Activity level is transformed into output by an
output function (often the identity function)
Basics of Data Science – ANN 1
14
Different Activation Functions
• Linear function
• Threshold function
(Rectified linear Unit)
• Binary function
• Sigmoid function
Basics of Data Science – ANN 1
15
Sigmoid Activation Function
$$f(x) = \frac{1}{1 + e^{-x}}$$
Figure: S-shaped sigmoid curve (activity level vs. net input).
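Putting the propagation function and the sigmoid activation together, a minimal sketch of a single unit in Python (illustrative weights and activity levels):

import numpy as np

def sigmoid(x):
    # activity level f(x) = 1 / (1 + e^(-x))
    return 1.0 / (1.0 + np.exp(-x))

# activity levels a_i of the sending units (illustrative values)
a = np.array([0.5, 1.0, 0.2])
# weights w_ij of the connections into unit j (illustrative values)
w = np.array([0.4, -0.6, 0.9])

netinput = np.dot(a, w)     # propagation function: sum_i a_i * w_ij
activity = sigmoid(netinput)
print(netinput, activity)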
Basics of Data Science – ANN 1
16
Rectified Linear Unit (ReLU)
Activation Function
Pauly et al., 2017
Basics of Data Science – ANN 1
17
Bias Units
• Bias units:
– Have an activity level of +1
– Weight to another unit can be positive or negative
– Positive weight: unit stays active (high activity desired)
– Negative weight: unit stays inactive (barrier)
http://www.neuronalesnetz.de
Basics of Data Science – ANN 1
18
Classification of ANN
• Different classifications possible, e.g. by…
– Number of layers / hidden layers
– Activation function (e.g. ReLU, binary, sigmoid)
– Learning paradigm ((un-)supervised, reinforcement
learning, stochastic)
– Feedforward, Feedback (Recurrent)
Basics of Data Science – ANN 1
19
Classification of ANN
Figure: a taxonomy of neural networks.
Neural Networks
• Supervised learning
– Feedforward: Perceptron, Multi-Layer Perceptron, GANs (Generative Adversarial Networks), Radial Basis Function
– Feedback: ARTMAP (Predictive ART), LSTM (Long short-term memory), GRU (Gated Recurrent Unit)
• Unsupervised learning
– Feedforward: Kohonen Maps
– Feedback: Adaptive Resonance Theory (ART)
• Reinforcement learning
Basics of Data Science – ANN 1
20
Learning - Training set vs. Test set
• Training set:
– Input vector (where desired output or response is known)
– Really used for training the system
– Weights are adjusted according to the result
• Test set:
– Input vector (where desired output or response is known)
– Verification of the learning effects
– No adjustment of weights
• Relation training set vs. test set about 70/30
• Also important: order of the patterns presented
Basics of Data Science – ANN 1
21
Data Example - OCR
Figure: handwritten input patterns labeled “A” or “not A”.
Basics of Data Science – ANN 1
22
Learning Paradigms
• Unsupervised learning
• Supervised learning
• Reinforcement learning
Basics of Data Science – ANN 1
23
Unsupervised and Supervised Learning
• Supervised learning
– For a training set of data input vectors and correct output
vectors are known
– Search for optimal network weights minimizing an error
measure (e.g. mean squared error) on the training set
– Hopefully generalizing to minimizing the error in the
application phase
• Unsupervised Learning
– Correct output vectors are not known
– Goal: finding patterns in the input data
– Application field: similar to linear models: interdependence analysis
Basics of Data Science – ANN 1
24
Reinforcement Learning
• No labelled data is needed -> no output vector is known
• Optimizes a cumulative reward
• High degree of generality
• High potential for decision problems from various disciplines like control theory, Operations Research, multi-agent systems, …
• Can learn complex strategies (but a problem-specific parameter set is still needed)
• Hopefully generalizing to minimizing the error in the application phase
Basics of Data Science – ANN 1
25
Network Topology
• Feedforward Network
• Feedback Network
Basics of Data Science – ANN 1
26
Network Topology
Feedforward Network
Figure: feedforward network (input layer → hidden layer → output layer).
Basics of Data Science – ANN 1
27
Network Topology
Feedback Network (Recurrent Network)
Figure: recurrent network (input layer → hidden layer → output layer, with feedback connections).
Basics of Data Science – ANN 1
28
Network Topology
Feedback Network (Recurrent Network)
• Feedback networks contain recurrent arcs to the same or a previous layer
• Examples:
– Adaptive Resonance Theory (ART)
– ARTMAP (Predictive ART)
– GRU (Gated Recurrent Unit)
Basics of Data Science – ANN 1
29
Output Vector designs
• Preferred: one-hot coding
– Used for categorical data; example: OCR
– One output neuron for each distinct output
– Desired output: 1 active neuron, all others inactive
– Advantage: evaluation of output quality
• Other examples:
– Grey code
– Categorical encoders (NLP)
– Embeddings (conversion to N-dimensional vectors)
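As a small illustration (assuming integer class labels 0 … k−1), one-hot vectors can be built with NumPy:

import numpy as np

labels = np.array([0, 2, 1])           # illustrative class labels
num_classes = 3
one_hot = np.eye(num_classes)[labels]  # one row per label, a single 1 per row
print(one_hot)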
Basics of Data Science – ANN 1
30
Artificial Neural Networks (II)
Basics of Data Science – ANN 2
31
Classification - Perceptron
Figure: the taxonomy of neural networks (as above), highlighting the Perceptron branch (supervised learning, feedforward).
Basics of Data Science – ANN 2
32
The (Single-Layer) Perceptron
Origin: McCulloch and Pitts 1943
$$y = f\left(\sum_{i=1}^{n} w_i x_i\right)$$
x: vector of inputs $x_1, \dots, x_n$ (dendrites)
w: vector of weights
y: output vector
f(·): activation function
Basics of Data Science – ANN 2
33
Single-Layer Perceptron (SLP)
• Earliest kind of neural network
• Simple associative memory
• Binary activation function
• Only capable of learning linearly separable patterns
• Used for simple classification problems (see the sketch below)
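As a minimal sketch (illustrative, not from the slides): a single-layer perceptron with a binary activation function learning the linearly separable AND function via the classic perceptron learning rule:

import numpy as np

# AND gate: linearly separable training patterns
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
t = np.array([0, 0, 0, 1])           # target outputs

w = np.zeros(2)                      # weights
b = 0.0                              # bias (threshold)
eta = 0.1                            # learning rate

for epoch in range(20):
    for x_i, t_i in zip(X, t):
        o = 1 if np.dot(w, x_i) + b > 0 else 0   # binary activation
        # perceptron rule: adjust weights in proportion to the error
        w += eta * (t_i - o) * x_i
        b += eta * (t_i - o)

print(w, b)  # defines a separating hyperplane for AND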
Basics of Data Science – ANN 2
34
XOR-Problem
(lack of) Linear Separability
Source: Prof. S. Krüger
Basics of Data Science – ANN 2
35
XOR-Problem
(lack of) Linear Separability
Source: Prof. S. Krüger
Basics of Data Science – ANN 2
36
XOR-Problem Solution:
Multi-Layer Perceptron (MLP)
Source: Prof. S. Krüger
Basics of Data Science – ANN 2
37
Figure: the taxonomy of neural networks (as above), highlighting the Multi-Layer Perceptron (supervised learning, feedforward).
Basics of Data Science – ANN 2
38
The Perceptron
Origin: McCulloch and Pitts 1943
$$y = f\left(\sum_{i=1}^{n} w_i x_i\right)$$
x: vector of inputs $x_1, \dots, x_n$ (dendrites)
w: vector of weights
y: output vector
f(·): activation function
Basics of Data Science – ANN 2
39
Multi-Layer Perceptron (MLP)
• One of the most popular neural network models
• The activation function was historically mostly a sigmoid function; nowadays, ReLU is commonly used due to its computational efficiency
• Important proofs (for sigmoid activation fn only!):
– Two-layer perceptron can approximate any nonlinear
function
– Three-layer perceptron sufficient to separate any (convex
or non-convex) polyhedral decision region
Basics of Data Science – ANN 2
40
Multi-Layer Perceptron (MLP)
• Consists of multiple layers of nodes in a directed
graph, with each layer fully connected to the next
one (one input and output layer, one or more
hidden layers)
• Except for the input nodes, each node is a neuron
(or processing element) with a nonlinear activation
function
• MLP utilizes a supervised learning technique called
backpropagation for training the network
Basics of Data Science – ANN 2
41
Backpropagation
• Formulated in 1974 by Paul Werbos; widely used since it was published by David Rumelhart, Geoffrey Hinton and Ronald Williams in 1986
• Generalization of the Delta-Rule in MLPs
• Most well-known learning procedure
• Historically, an „external teacher“ is required
knowing the correct output value. RL Algorithms
work with the reward signal only.
• Special case of a gradient procedure based on the
mean squared error
Basics of Data Science – ANN 2
42
A side note on…
Gradient procedures:
• Also known as the steepest descent method
– Begin with an approximate value
– Go in the direction of the negative gradient (it indicates the direction of steepest descent from the current value)
– Stop when there is no numerical improvement
• Convergence is often very slow
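A minimal sketch of the steepest descent method in Python (NumPy assumed; the quadratic test function is a hypothetical example with a narrow valley):

import numpy as np

def gradient_descent(grad, x0, lr=0.1, tol=1e-8, max_iter=10_000):
    """Steepest descent: step against the gradient until improvement stalls."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        step = lr * grad(x)
        if np.linalg.norm(step) < tol:      # no numerical improvement -> stop
            break
        x = x - step
    return x

# Example: minimize f(x, y) = (x - 3)^2 + 10 * y^2.
grad_f = lambda x: np.array([2 * (x[0] - 3), 20 * x[1]])
print(gradient_descent(grad_f, x0=[0.0, 1.0], lr=0.05))   # close to [3, 0]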
Basics of Data Science – ANN 2
43
Backpropagation
Prone to the following problems:
• Very slow on flat parts of the parameter space
• Oscillating between steep walls of a (multimodal) error function
Basics of Data Science – ANN 2
44
A side note on…
Squared error:
• Squared difference between estimated and true values:

$E = \frac{1}{n} \sum_{i=1}^{n} (t_i - o_i)^2$

E: error; n: number of patterns; tᵢ: target value; oᵢ: output
Basics of Data Science – ANN 2
45
Backpropagation – Algorithm
• The input pattern is propagated forward through the network
• The output is compared with the target; the difference is the error of the network
• The error is propagated backwards from the output to the input layer
• Weights are changed according to their influence on the error → if the same input is used again, the output moves closer to the target
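A minimal backpropagation sketch in Python (NumPy assumed): a two-layer sigmoid MLP trained on the XOR problem from above with gradient steps on the squared error. Layer size, learning rate, and seed are arbitrary choices; a different seed or learning rate may occasionally be needed:

import numpy as np

rng = np.random.default_rng(0)

# XOR data: not linearly separable, but a two-layer MLP can learn it.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([[0], [1], [1], [0]], dtype=float)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

W1 = rng.normal(size=(2, 4)); b1 = np.zeros(4)   # hidden layer
W2 = rng.normal(size=(4, 1)); b2 = np.zeros(1)   # output layer
lr = 1.0

for _ in range(5000):
    # forward pass
    h = sigmoid(X @ W1 + b1)
    o = sigmoid(h @ W2 + b2)
    # backward pass: propagate the error from the output to the input layer
    delta_o = (o - t) * o * (1 - o)              # dE/dz at the output
    delta_h = (delta_o @ W2.T) * h * (1 - h)     # dE/dz at the hidden layer
    # gradient step on the squared error
    W2 -= lr * h.T @ delta_o; b2 -= lr * delta_o.sum(axis=0)
    W1 -= lr * X.T @ delta_h; b1 -= lr * delta_h.sum(axis=0)

print(o.round(2))   # typically approx. [[0], [1], [1], [0]]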
Basics of Data Science – ANN 2
46
Multi-Layer Perceptrons
Special Forms
Neural Networks
• Supervised learning
– Feedforward: Perceptron, Multi-Layer Perceptron, Radial Basis Function
– Feedback: ARTMAP (Predictive ART)
• Unsupervised learning
– Feedforward: Kohonen Maps
– Feedback: Adaptive Resonance Theory (ART)
Basics of Data Science – ANN 2
47
Special Forms of MLPs
Autoencoders
Source: https://towardsdatascience.com/generating-images-with-autoencoders-77fd3a8dd368
Basics of Data Science – ANN 2
48
Special Forms of MLPs
Generative Adversarial Networks (GANs)
Source: https://www.freecodecamp.org/news/an-intuitive-introduction-to-generative-adversarial-networks-gans-7a2264a81394/
Basics of Data Science – ANN 2
49
Artificial Neural Networks (III)
Basics of Data Science – ANN 3
50
Neural Networks
• Supervised learning
– Feedforward: Perceptron, Multi-Layer Perceptron, GANs (Generative Adversarial Networks), Radial Basis Function
– Feedback: ARTMAP (Predictive ART), LSTM (Long Short-Term Memory), GRU (Gated Recurrent Unit)
• Unsupervised learning
– Feedforward: Kohonen Maps
– Feedback: Adaptive Resonance Theory (ART)
Basics of Data Science – ANN 3
51
Radial Basis Functions (RBF)
• Real-valued function whose value depends only on the distance from some point c (the center, which can be the origin):

$\phi(x, c) = \phi(\lVert x - c \rVert)$

• The norm is usually the Euclidean distance; others are possible, too
• Suitable for classification problems
• Additionally, RBFs can be used to approximate functions or to solve partial differential equations
Basics of Data Science – ANN 3
52
Radial Basis Function (RBF) Networks
• Similar to perceptrons, but with exactly 3 layers (input, hidden, output)
• Feedforward, fully connected, no shortcuts
Basics of Data Science – ANN 3
53
Radial Basis Function (RBF) Networks
• Input neurons:
– Simply pass the input on to the next layer, without weights
• Output neurons:
– Activation function: identity function
– Propagation function: weighted sum
• Hidden neurons (= RBF neurons):
– Propagation function: norm (distance between the net input and the center of the neuron, i.e., the difference between input vector and center vector)
– Activation function: radial basis function
Basics of Data Science – ANN 3
54
Radial basis function (RBF)
Networks
Basics of Data Science – ANN 3
55
Radial Basis Function (RBF) Networks
• Learning and training via adjustment of:
– the centers of the RBF neurons
– the widths of the Gaussian functions
– the weights of the connections between the RBF and output layer
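A minimal sketch of an RBF network forward pass in Python (NumPy assumed; centers, widths, and weights are hypothetical hand-picked values rather than learned ones):

import numpy as np

def rbf_forward(x, centers, widths, weights):
    """RBF network: hidden units compute exp(-||x - c||^2 / (2*sigma^2)),
    the output neuron takes the weighted sum (identity activation)."""
    dists = np.linalg.norm(centers - x, axis=1)          # propagation: distance to the centers
    activations = np.exp(-dists**2 / (2 * widths**2))    # Gaussian radial basis function
    return activations @ weights                         # linear output layer

centers = np.array([[0.0, 0.0], [1.0, 1.0]])
widths = np.array([0.5, 0.5])
weights = np.array([1.0, -1.0])
print(rbf_forward(np.array([0.2, 0.1]), centers, widths, weights))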
Basics of Data Science – ANN 3
56
RBF NN is More Suitable for Probabilistic Pattern Classification
(figure: the MLP separates classes with hyperplanes, the RBF network with local kernel functions)
Basics of Data Science – ANN 3
57
Prof. Dr. Oliver Wendt
Dr. habil. Mahdi Moeini
Business Information Systems & Operations Research
Basics of Data Science
Summer Semester 2022
Part 2, Section 3 → Support Vector Machines
Part 2: Stochastic Models on structured attribute data:
• From Linear to Non-Linear Regression models
• Deep? Neural Network Models
• Support Vector Machines
• Learning from Data Streams: Training and Updating Deterministic and
Stochastic Models
• Reinforcement Learning
Basics of Data Science – Support Vector Machines
2
Support Vector Machines
Basics of Data Science – Support Vector Machines
3
SVM in a nutshell (1)
• Beginning:
Training set of objects (vectors) with known
classification, represented in a vector space
Basics of Data Science – Support Vector Machines
4
SVM in a nutshell (2)
• Task:
Find hyperplane separating objects into two classes
Basics of Data Science – Support Vector Machines
5
SVM in a nutshell (3)
• Important:
Maximize the distance from the vectors nearest to the hyperplane (i.e., the margin); this is needed for better classification, in case test objects don't match the training objects exactly
(figure: separating hyperplane with maximal margin)
Basics of Data Science – Support Vector Machines
6
SVM in a nutshell (4)
• Not all training vectors need to be considered (too
far away from hyperplane,
„hidden“ by other vectors)
• Hyperplane only depends
on nearest vectors (called
support vectors)
Basics of Data Science – Support Vector Machines
7
SVM in a nutshell (5)
• Linear separability:
Hyperplanes cannot be bent, so objects need to be
linearly separable
Basics of Data Science – Support Vector Machines
8
SVM in a nutshell (6)
• Most real data is not linearly separable
Basics of Data Science – Support Vector Machines
9
SVM in a nutshell (7)
• Solution approach:
Basics of Data Science – Support Vector Machines
10
SVM in a nutshell (8)
• Transfer the vector space (incl. all training vectors) into a higher-dimensional space (up to infinitely high) and find the hyperplane there
• Re-transfer into the low-dimensional space: the linear hyperplane turns into a nonlinear one, but the training vectors are exactly separated into two classes
• Problems:
1. Transformation into the higher dimension is computationally expensive
2. The representation in the lower dimension is very complex and thus not usable
Basics of Data Science – Support Vector Machines
11
SVM in a nutshell (9)
• Kernel trick:
Use proper kernel functions that
1. describe the hyperplane in the high-dimensional space
AND
2. are manageable in the lower dimension
The transformation into the higher dimension is in fact possible without ever calculating it explicitly
Basics of Data Science – Support Vector Machines
12
Support Vector Machines
Basics of Data Science – Support Vector Machines
1
Preliminaries
• Task of this class of algorithms:
detect and exploit complex patterns in data
(e.g., by clustering, classifying, ranking, cleaning, …
the data)
• Typical problems:
– How to represent complex patterns
(computational problem)
– How to exclude unstable patterns / overfitting
(statistical problem)
Basics of Data Science – Support Vector Machines
2
Very Informal Reasoning
• The class of kernel methods implicitly defines the
class of possible patterns by introducing a notion of
similarity between data
• Example: Similarity between documents
– By length
– By topic
– By language …
• Choice of similarity -> Choice of relevant features
Basics of Data Science – Support Vector Machines
3
More formal reasoning
• Kernel methods exploit information about the
inner products between data items
• Many standard algorithms can be rewritten to only
require inner products between data (inputs)
• Kernel functions = inner products in some feature
space (potentially very complex)
• If kernel given, no need to specify what features of
the data are being used
Basics of Data Science – Support Vector Machines
4
Overview
• Linear Learning Machines (LLM)
• Kernel Induced Feature Spaces
• Generalization Theory
• Optimization Theory
• Support Vector Machines (SVM)
Basics of Data Science – Support Vector Machines
5
Modularity
• Any kernel-based learning algorithm is composed of
two modules:
– A general purpose learning machine
– A problem specific kernel function
• Any kernel-based algorithm can be fitted with
any kernel
• Kernels themselves can be constructed in a modular
way
→ Great for software engineering (and for analysis)
Basics of Data Science – Support Vector Machines
6
Linear Learning Machines
• Simplest case: classification
→ Decision function is a hyperplane in input space
• The Perceptron Algorithm (Rosenblatt, 1957)
• Useful to analyze the Perceptron algorithm,
before looking at SVMs and Kernel Methods in
general
Basics of Data Science – Support Vector Machines
7
Linear Learning Machines
Basic Notation
• Input space X
• Output space Y
• Hypothesis h
• Real-valued function f : X → ℝ
• Training set S = ((x₁, y₁), …, (xₘ, yₘ))
• Test error ε
• Dot product ⟨x, z⟩
Basics of Data Science – Support Vector Machines
8
Linear Learning Machines
Dot Product?
• Inner product / scalar product (here: dot product) between vectors: ⟨x, z⟩
• Hyperplane: ⟨w, x⟩ + b = 0
(in Hesse normal form, good for calculating distances of points to the plane;
w = normal vector of the plane, b = distance from the origin)
Basics of Data Science – Support Vector Machines
9
Linear Learning Machines
Perceptron
• Linear separation of the input space: f(x) = sign(⟨w, x⟩ + b)
(figure: hyperplane with regions sign = +1, sign = −1, and boundary sign = 0)
Basics of Data Science – Support Vector Machines
10
Linear Learning Machines
Perceptron Algorithm
• Update rule (ignoring the threshold):
if $y_i (\langle w, x_i \rangle + b) \le 0$
then $w \leftarrow w + \eta\, y_i x_i$
Basics of Data Science – Support Vector Machines
11
Linear Learning Machines
Observations
• The solution is a linear combination of training points:
$w = \sum_i \alpha_i y_i x_i, \qquad \alpha_i \ge 0$
• Only informative points are used (mistake driven)
• The coefficient $\alpha_i$ of a point in the combination reflects its 'difficulty'
Basics of Data Science – Support Vector Machines
12
Excursion: Duality
Primal program:
Dual program:
Basics of Data Science – Support Vector Machines
13
Linear Learning Machines
Dual Representation
• The decision function can be re-written as follows:
$f(x) = \operatorname{sign}\left(\sum_i \alpha_i y_i \langle x_i, x \rangle + b\right)$
Basics of Data Science – Support Vector Machines
14
Linear Learning Machines
Dual Representation
• And the update rule can also be rewritten as follows:
• If $y_j \left( \sum_i \alpha_i y_i \langle x_i, x_j \rangle + b \right) \le 0$
then $\alpha_j \leftarrow \alpha_j + 1$
• Note:
In the dual representation, the data appear only inside dot products
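A minimal sketch of the dual (kernel-ready) perceptron in Python (NumPy assumed). The data enter only through dot products — here already generalized to an arbitrary kernel K; with a degree-2 polynomial kernel it separates the XOR pattern, which the primal perceptron cannot:

import numpy as np

def dual_perceptron(X, y, K, epochs=10):
    """Dual perceptron: the data appear only inside K(x_i, x_j)."""
    n = len(X)
    alpha = np.zeros(n)
    G = np.array([[K(X[i], X[j]) for j in range(n)] for i in range(n)])  # Gram matrix
    for _ in range(epochs):
        for j in range(n):
            if y[j] * np.sum(alpha * y * G[:, j]) <= 0:   # mistake on x_j -> update
                alpha[j] += 1
    return alpha

K = lambda u, v: (np.dot(u, v) + 1) ** 2                  # polynomial kernel, degree 2
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])                              # XOR labels
alpha = dual_perceptron(X, y, K)
pred = [np.sign(np.sum(alpha * y * np.array([K(xi, xj) for xi in X]))) for xj in X]
print(pred)   # signs -1, 1, 1, -1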
Basics of Data Science – Support Vector Machines
15
Linear Learning Machines
Duality: First Property of SVMs
• DUALITY is the first feature of Support Vector
Machines
• SVMs are Linear Learning Machines represented in a
dual fashion
• Data appear only within dot products (in decision
function and in training algorithm)
Basics of Data Science – Support Vector Machines
16
Linear Learning Machines
Limitations of LLMs
• Linear classifiers cannot deal with
– Non-linearly separable data
– Noisy data
• This formulation only deals with vectorial data
Basics of Data Science – Support Vector Machines
17
Linear Learning Machines
Non-Linear Classifiers
• Alternative 1:
Creating a network of simple linear classifiers
(neurons): a Neural Network
(Problems: local minima; many parameters;
heuristics needed to train; etc)
• Alternative 2:
Map data into a richer feature space including
non-linear features,
then use a linear classifier
Basics of Data Science – Support Vector Machines
18
Overview
• Linear Learning Machines (LLM)
• Kernel Induced Feature Spaces
• Generalization Theory
• Optimization Theory
• Support Vector Machines (SVM)
Basics of Data Science – Support Vector Machines
19
Linear Learning Machines
Learning in the Feature Space
• Map data into a feature space where they are
linearly separable
Basics of Data Science – Support Vector Machines
20
Linear Learning Machines
Problems with Feature Space
• Working in high dimensional feature spaces solves
the problem of expressing complex functions
BUT:
• There is a computational problem
(working with very large vectors)
• And a generalization theory problem
(curse of dimensionality)
Basics of Data Science – Support Vector Machines
21
Kernel-induced Feature Spaces
Implicit Mapping to Feature Space
We will introduce Kernels:
• Solve the computational problem of working with
many dimensions
• Can make it possible to use infinite dimensions –
efficiently in time/space
• Other advantages, both practical and conceptual
Basics of Data Science – Support Vector Machines
22
Kernel-induced Feature Spaces
Implicit Mapping to Feature Space
• In the dual representation, the data points only appear inside dot products: $\langle \phi(x_i), \phi(x_j) \rangle$
• The dimensionality of the feature space is not necessarily important
• We may not even know the map $\phi$ explicitly
Basics of Data Science – Support Vector Machines
23
Kernel-induced Feature Spaces
Kernels
• A kernel is a function that returns the value of the dot product between the images of its two arguments:
$K(x, z) = \langle \phi(x), \phi(z) \rangle$
• Given a function K, it is possible to verify that it is a kernel
Basics of Data Science – Support Vector Machines
24
Kernel-induced Feature Spaces
Kernels
• One can use LLMs in a feature space by simply rewriting the algorithm in dual representation and replacing dot products with kernels:
$f(x) = \operatorname{sign}\left(\sum_i \alpha_i y_i K(x_i, x) + b\right)$
Basics of Data Science – Support Vector Machines
25
Kernel-induced Feature Spaces
The Kernel Matrix
• (aka the Gram matrix): $K_{ij} = K(x_i, x_j)$
Basics of Data Science – Support Vector Machines
26
Kernel-induced Feature Spaces
The Kernel Matrix
• The kernel matrix is the central structure in kernel
machines
• Information ‘bottleneck’: contains all necessary
information for the learning algorithm
• Fuses information about the data AND the kernel
Basics of Data Science – Support Vector Machines
27
Kernel-induced Feature Spaces
Mercer’s Theorem
• Many interesting properties:
– The kernel matrix is symmetric positive definite
– Any symmetric positive definite matrix
   • can be regarded as a kernel matrix, and
   • is an inner product matrix in some feature space
Basics of Data Science – Support Vector Machines
28
Symmetric positive definite matrix
• Symmetric, e.g.:

$A = \begin{pmatrix} 1 & 2 & -3 & 5 \\ 2 & 4 & 1 & 3 \\ -3 & 1 & -2 & 2 \\ 5 & 3 & 2 & 6 \end{pmatrix}$

• Positive definite:
quadratic form $q_A(x) = x^{T} A\, x > 0$ for all $x \neq 0$
Example:
Basics of Data Science – Support Vector Machines
29
Kernel-induced Feature Spaces
More Formally: Mercer's Theorem
• Every (semi-)positive definite, symmetric function is a kernel: i.e., there exists a mapping $\phi$ such that it is possible to write
$K(x, z) = \langle \phi(x), \phi(z) \rangle$
Basics of Data Science – Support Vector Machines
30
Kernel-induced Feature Spaces
Examples of Kernels
• Simple examples of kernels are:
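A minimal sketch in Python (NumPy assumed) of three standard kernels — linear, polynomial, and Gaussian, the two of which reappear in the XOR examples later — together with the kernel (Gram) matrix:

import numpy as np

linear   = lambda x, z: np.dot(x, z)
poly     = lambda x, z, d=2, c=1.0: (np.dot(x, z) + c) ** d
gaussian = lambda x, z, sigma=1.0: np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

def gram_matrix(X, kernel):
    """Kernel (Gram) matrix K_ij = K(x_i, x_j) -- symmetric positive semi-definite."""
    n = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(gram_matrix(X, gaussian).round(3))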
Basics of Data Science – Support Vector Machines
32
Kernel-induced Feature Spaces
Example: Polynomial Kernels
Basics of Data Science – Support Vector Machines
33
Kernel-induced Feature Spaces
Example: Polynomial Kernels
Basics of Data Science – Support Vector Machines
34
Kernel-induced Feature Spaces
Example: Polynomial Kernels
Source: http://goo.gl/8JowV
Basics of Data Science – Support Vector Machines
35
Kernel-induced Feature Spaces
Example: the two spirals
• Separated by a
hyperplane in
feature space
(gaussian kernels)
Basics of Data Science – Support Vector Machines
36
Kernel-induced Feature Spaces
Making Kernels
• The set of kernels is closed under some operations. If K, K' are kernels, then:
– K + K' is a kernel
– cK is a kernel, if c > 0
– aK + bK' is a kernel, for a, b > 0
– and many more…
• Complex kernels can be constructed from simple ones: modularity!
Basics of Data Science – Support Vector Machines
37
Kernel-induced Feature Spaces
Second Property of SVMs:
SVMs are Linear Learning Machines that
• use a dual representation
AND
• operate in a kernel-induced feature space
(that is: $f(x) = \sum_i \alpha_i y_i K(x_i, x) + b$ is a linear function in the feature space implicitly defined by K)
Basics of Data Science – Support Vector Machines
38
Kernel-induced Feature Spaces
Kernels over General Structures
• Kernels over sets, over sequences, over trees, …
• Applied in text categorization, bioinformatics, …
Basics of Data Science – Support Vector Machines
39
Kernel-induced Feature Spaces
A bad kernel …
• … would be a kernel whose kernel matrix is mostly
diagonal: all points orthogonal to each other, no
clusters, no structure …
Basics of Data Science – Support Vector Machines
40
Kernel-induced Feature Spaces
No Free Kernel
• If the mapping goes into a space with too many irrelevant features, the kernel matrix becomes (mostly) diagonal
• Some prior knowledge of the target is needed to choose a good kernel
Basics of Data Science – Support Vector Machines
41
Kernel-induced Feature Spaces
Other kernel-based algorithms
• Not just LLMs can use kernels:
– clustering
– Principal Component Analysis
– others…
• A dual representation is often possible
Basics of Data Science – Support Vector Machines
42
Overview
• Linear Learning Machines (LLM)
• Kernel Induced Feature Spaces
• Generalization Theory
• Optimization Theory
• Support Vector Machines (SVM)
Basics of Data Science – Support Vector Machines
1
Linear Learning Machines
Problems with Feature Space
• Working in high dimensional feature spaces solves
the problem of expressing complex functions
BUT:
• There is a computational problem
(working with very large vectors)
• And a generalization theory problem
(curse of dimensionality)
Basics of Data Science – Support Vector Machines
2
The Generalization Problem
• The curse of dimensionality
– It is easy to overfit in high-dimensional spaces
– Regularities found in the training set could be accidental → they would not be found again in a test set
• The SVM problem is ill-posed
– Finding one hyperplane that separates the data
– → many such hyperplanes may exist!
• How to choose the best possible hyperplane?
Basics of Data Science – Support Vector Machines
3
The Generalization Problem
• Many methods exist to choose a good hyperplane
(inductive principles)
– Bayes,
– Statistical learning theory / PAC
• Each can be used
• We will focus on a simple case motivated by
statistical learning theory (will give the basic SVM)
Basics of Data Science – Support Vector Machines
4
Generalization Theory
Statistical (Computational) Learning Theory
• Generalization bounds on the risk of overfitting
– PAC setting: probably approximately correct
– assumption of i.i.d. data
• Standard bounds from VC (Vapnik–Chervonenkis)
theory give upper and lower bound proportional to
VC dimension
• VC dimension of LLMs proportional to dimension of
space (can be huge)
Basics of Data Science – Support Vector Machines
5
Generalization Theory
Vapnik–Chervonenkis dimension
• Measure of the capacity of a statistical classification
algorithm
• Defined as the cardinality of the largest set of
points that the algorithm can shatter
• Core concept in Vapnik–Chervonenkis theory
Basics of Data Science – Support Vector Machines
6
Generalization Theory
Assumptions and Definitions
• Distribution D over the input space X
• Training and test points drawn randomly (i.i.d.) from D
• Training error of hyp:
fraction of points in S misclassified by hyp
• Test error of hyp:
probability under D of misclassifying a point x
• VC dimension h:
size of the largest subset of X shattered by hyp (every dichotomy implemented)
Basics of Data Science – Support Vector Machines
7
Generalization Theory
Vapnik–Chervonenkis dimension
• Allows predicting the error on test points from the error on the training set
• For sample size m and VC dimension h << m, it holds with probability 1 − η that

$\text{test error} \;\le\; \text{training error} + \sqrt{\dfrac{h\left(\ln\frac{2m}{h} + 1\right) - \ln\frac{\eta}{4}}{m}}$
Basics of Data Science – Support Vector Machines
8
Generalization Theory
VC Bounds
(figure: behaviour of the VC bound as the sample size m grows)
Basics of Data Science – Support Vector Machines
9
Generalization Theory
VC Bounds
• But often the VC dimension h >> m, so the bound is very weak
• It does not tell us which hyperplane to choose
• However: margin-based bounds exist, too!
Basics of Data Science – Support Vector Machines
10
Generalization Theory
Margin Based Bounds
• The VC worst-case bound still holds, but if we are lucky (the margin is large), the other bounds can be applied and better generalization can be achieved
• Best hyperplane: the maximal margin one
• The margin is large if the kernel is chosen well
Basics of Data Science – Support Vector Machines
11
Generalization Theory
Maximal Margin Classifier
• Minimize the risk of overfitting by choosing the maximal margin hyperplane in feature space
• Third feature of SVMs: maximize the margin
• SVMs control capacity by
– increasing the margin, …
– not by reducing the number of degrees of freedom (dimension-free capacity control)
Basics of Data Science – Support Vector Machines
12
Generalization Theory
Maximal Margin Classifier
Basics of Data Science – Support Vector Machines
13
Generalization Theory
Max Margin = Minimal Norm
• Distance between
the two convex hulls
Basics of Data Science – Support Vector Machines
18
Generalization Theory
The primal problem
• Minimize: $\frac{1}{2}\langle w, w \rangle$
subject to: $y_i (\langle w, x_i \rangle + b) \ge 1 \quad \forall i$
Basics of Data Science – Support Vector Machines
19
Overview
• Linear Learning Machines (LLM)
• Kernel Induced Feature Spaces
• Generalization Theory
• Optimization Theory
• Support Vector Machines (SVM)
Basics of Data Science – Support Vector Machines
20
Optimization Theory
• The problem of finding the maximal margin hyperplane is a constrained optimization problem (quadratic programming)
• Use Lagrange theory (or Kuhn–Tucker theory)
• Lagrangian:
$L(w, b, \alpha) = \tfrac{1}{2}\langle w, w \rangle - \sum_i \alpha_i \left[ y_i (\langle w, x_i \rangle + b) - 1 \right]$
Basics of Data Science – Support Vector Machines
21
Optimization Theory
From Primal to Dual
• Differentiate and substitute:
$\frac{\partial L}{\partial w} = 0 \;\Rightarrow\; w = \sum_i \alpha_i y_i x_i, \qquad \frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_i \alpha_i y_i = 0$
Basics of Data Science – Support Vector Machines
22
Optimization Theory
The Dual Problem
• Maximize: $W(\alpha) = \sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle$
• Subject to: $\alpha_i \ge 0, \quad \sum_i \alpha_i y_i = 0$
• The duality again! We can use kernels!
Basics of Data Science – Support Vector Machines
23
Optimization Theory
Convexity
• This is a Quadratic Optimization problem
→ Convex
→ No local minima !!! ☺☺☺
• (Second effect of Mercer’s conditions)
• Solvable in polynomial time …
• (Convexity is another fundamental property of SVMs)
Basics of Data Science – Support Vector Machines
24
Optimization Theory
Kuhn–Tucker Theorem
Properties of the solution:
• Duality: we can use kernels
• KKT conditions: $\alpha_i \left[ y_i (\langle w, x_i \rangle + b) - 1 \right] = 0 \quad \forall i$
• Sparseness: only the points nearest to the hyperplane (margin = 1) have positive weight
• They are called support vectors
Basics of Data Science – Support Vector Machines
25
Optimization Theory
KKT Conditions Imply Sparseness
• Sparseness: another fundamental property of SVMs
Basics of Data Science – Support Vector Machines
26
Optimization Theory
XOR-Example: Polynomial Kernel
$K(x_i, x_j) = (\langle x_i, x_j \rangle + 1)^2$
Basics of Data Science – Support Vector Machines
27
Optimization Theory
XOR-Example: Gaussian Kernel
$K(x_i, x_j) = \exp\left(-\lVert x_i - x_j \rVert^2 / (2\sigma^2)\right)$
Basics of Data Science – Support Vector Machines
28
Optimization Theory
Another example: Gaussian Kernel
$K(x_i, x_j) = \exp\left(-\lVert x_i - x_j \rVert^2 / (2\sigma^2)\right)$
Basics of Data Science – Support Vector Machines
29
Overview
• Linear Learning Machines (LLM)
• Kernel Induced Feature Spaces
• Generalization Theory
• Optimization Theory
• Support Vector Machines (SVM)
Basics of Data Science – Support Vector Machines
30
Support Vector Machines
Properties of SVMs – Summary
✓ Duality
✓ Kernels
✓ Margin
✓ Convexity
✓ Sparseness
Basics of Data Science – Support Vector Machines
31
Support Vector Machines
Dealing with noise
• In the case of non-separable data in feature space,
the margin distribution can be optimized
Basics of Data Science – Support Vector Machines
32
Support Vector Machines
The Soft-Margin Classifier
Minimize: $\tfrac{1}{2}\langle w, w \rangle + C \sum_i \xi_i$
Or: $\tfrac{1}{2}\langle w, w \rangle + C \sum_i \xi_i^2$
Subject to: $y_i (\langle w, x_i \rangle + b) \ge 1 - \xi_i, \quad \xi_i \ge 0$
Basics of Data Science – Support Vector Machines
33
Support Vector Machines
Maximal Margin versus Soft Margin
(figure: the three formulations side by side — max margin, 2-norm soft margin, 1-norm soft margin)
Basics of Data Science – Support Vector Machines
36
Support Vector Machines
The regression case
• For regression, all the above properties are retained by introducing the ε-insensitive loss:
$|y - f(x)|_{\varepsilon} = \max\left(0,\; |y - f(x)| - \varepsilon\right)$
Basics of Data Science – Support Vector Machines
37
Support Vector Machines
Regression: the ε-tube
Basics of Data Science – Support Vector Machines
38
Support Vector Machines
Implementation Techniques
• Maximizing a quadratic function, subject to a linear
equality constraint (and inequalities as well)
Basics of Data Science – Support Vector Machines
39
Support Vector Machines
Simple Approximation
• Initially, complex QP packages were used.
• Stochastic gradient ascent (sequentially updating one weight at a time) gives an excellent approximation in most cases
Basics of Data Science – Support Vector Machines
40
Support Vector Machines
Sequential Minimal Optimization
• SMO: update two weights simultaneously
• Realizes gradient ascent without leaving the linear constraint (J. Platt)
• Online versions exist (Li-Long; Gentile)
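For practical use, a library implementation is usually preferred. A minimal usage sketch with scikit-learn (assuming it is installed); its SVC solves the dual QP internally with an SMO-type algorithm:

import numpy as np
from sklearn.svm import SVC

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])                      # the XOR labels from above

clf = SVC(kernel="rbf", gamma=1.0, C=1.0)         # Gaussian kernel, soft margin C
clf.fit(X, y)
print(clf.predict(X))                             # -> [-1  1  1 -1]
print(clf.support_vectors_)                       # the support vectors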
Basics of Data Science – Support Vector Machines
41
Support Vector Machines
Comparison to Neural Networks

                           | Model Estimation                  | Model Evaluation
Support Vector Machines    | relatively quick (convex QP)      | complexity dependent → could be slow
Artificial Neural Networks | relatively slow (gradient search) | compact model → fast
Basics of Data Science – Support Vector Machines
42
Prof. Dr. Oliver Wendt
Dr. habil. Mahdi Moeini
Business Information Systems & Operations Research
Basics of Data Science
Summer Semester 2022
Part 2, Section 4 → Reinforcement Learning
1
Part 2: Stochastic Models on structured attribute data:
• From Linear to Non-Linear Regression models
• Deep? Neural Network Models
• Support Vector Machines
• Reinforcement Learning
• Learning from Data Streams: Training and Updating Deterministic and
Stochastic Models
Basics of Data Science: Reinforcement Learning
2
Reinforcement Learning
Basics of Data Science: Reinforcement Learning
3
Recap:
Supervised Learning
Source: Bishop 2006, p. 7
• Optimize (i.e., minimize) the mean squared error
• Based on training samples of data points
• Hopefully generalizing well to forecast future data points
• Assuming a uniform relevance of the parameter space (independent variables)?
• What to do if there is no teacher / trainer?
Basics of Data Science: Reinforcement Learning
4
Supervised? Learning
…from Simulation?
(figure: response surface of the expected profit y = f(x1, x2))
Basics of Data Science: Reinforcement Learning
5
Markov Process
• Markov property
state transitions must be history independent, i.e., the transition probability T(s, s')
– of reaching a state s' at time t+1
– from the current state s at time t
does NOT depend on any earlier state or transition
• Markov chain
a stochastic state-transition process complying with the Markov property
Basics of Data Science: Reinforcement Learning
6
Markov Decision Process
• Markov Decision Processes (MDP) are defined by:
– a set of states S
– a set of actions A
– a reward function R: S × A → ℝ
– a state transition probability T: S × A × S → [0, 1], giving the probability of reaching state s' from state s when action a is taken
– a policy π: S → A, prescribing which action to take in a given state
Basics of Data Science: Reinforcement Learning
7
Reinforcement Learning
• Reinforcement Learning: established as a scientific community for more than 20 years
• Origins / influences: cybernetics, psychology, statistics, robotics, artificial intelligence, neurosciences
• Goal: programming agents by reward and punishment, without the necessity to explicitly specify the agents' action strategies
• Method: agents act in a dynamic environment and learn by trial and error
Basics of Data Science: Reinforcement Learning
8
Reinforcement Learning
– the agent is connected to the environment via "sensors"
– in each interaction step the agent receives as input a reward signal r and feedback concerning the environmental state s
– the agent chooses an action a as output, which may or may not change the environmental state
– the agent gets to know the value of its action only via the reinforcement / reward signal
– the goal of the agent is to maximize the long-run sum of all reward signals received
Basics of Data Science: Reinforcement Learning
9
Reinforcement Learning
(figure: agent–environment loop — the agent observes state s_t and reward r_t, chooses action a_t, and the environment returns r_{t+1} and s_{t+1})
Basics of Data Science: Reinforcement Learning
10
RL Model Types
• Models with finite horizon
– Optimization of the reward over h steps: $E\left[\sum_{t=0}^{h} r_t\right]$
– Non-stationary policy at the end of the time horizon
– presumes a finite lifetime of the agent
– stationary policy, if h is a "floating" horizon
• Discounted models with infinite horizon
– Optimization of the discounted reward over an infinite number of steps: $E\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$
• Models with average reward: $\lim_{h \to \infty} E\left[\frac{1}{h}\sum_{t=0}^{h} r_t\right]$
Basics of Data Science: Reinforcement Learning
11
Reinforcement Learning
vs. neighboring domains
• Adaptive Control
– the structure of the dynamic model is not to be changed
– adaptation problems are reduced to parameter estimation of the control strategy
• Supervised learning (neural networks)
– RL does not get training samples
– the reinforcement system has to explore the environment to enhance its performance
– → exploration vs. exploitation trade-off
Basics of Data Science: Reinforcement Learning
12
State–Value Function
The state–value function of an arbitrary policy π:

$V^{\pi}(s) = E_{\pi}\left[ R_t \mid s_t = s \right] = E_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1} \;\middle|\; s_t = s \right]$

(figure: states s, s', s'' with values V(s), V(s'), V(s''))
Basics of Data Science: Reinforcement Learning
13
Action–Value Function
The action–value function Q of an arbitrary policy π:

$Q^{\pi}(s, a) = E_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1} \;\middle|\; s_t = s,\, a_t = a \right]$

(figure: states s₁, s₂ with action values Q(s₁, a₁), Q(s₂, a₃))
Basics of Data Science: Reinforcement Learning
14
Optimal State and Action–Value Function
Optimal state-value function V*:
$V^{*}(s) = \max_{\pi} V^{\pi}(s)$
Optimal action-value function Q*:
$Q^{*}(s, a) = E\left[ r_{t+1} + \gamma\, V^{*}(s_{t+1}) \mid s_t = s,\, a_t = a \right]$
Hence (Bellman optimality equation):
$V^{*}(s) = \max_{a} \left[ r(s, a) + \gamma \sum_{s'} T(s, a, s')\, V^{*}(s') \right]$
Basics of Data Science: Reinforcement Learning
15
Dynamic Programming
• explore the decision tree by trial of all possibilities and find the best path
• Offline version:
possible solutions are calculated ex ante and the strategy is stored in a look-up table
• Online version:
new solution paths are explored and evaluated during "runtime"
• PROBLEM: exponential growth of the state space
Basics of Data Science: Reinforcement Learning
16
Value-Iteration
Algorithm: Value-Iteration
  initialise V(s) arbitrarily
  iterate until the decision policy is good enough
    iterate for s ∈ S
      iterate for a ∈ A
        Q(s, a) := R(s, a) + γ Σ_{s'∈S} T(s, a, s') V(s')
      end
      V(s) := max_a Q(s, a)
    end
  end
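A minimal value-iteration sketch in Python (NumPy assumed; the 2-state, 2-action MDP is a hypothetical example):

import numpy as np

def value_iteration(R, T, gamma=0.9, eps=1e-6):
    """R[s, a]: reward; T[s, a, s']: transition probability; returns V and a greedy policy."""
    V = np.zeros(R.shape[0])                    # initialise V(s) arbitrarily
    while True:
        Q = R + gamma * T @ V                   # Q(s,a) = R(s,a) + gamma * sum_s' T(s,a,s') V(s')
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < eps:     # policy is "good enough"
            return V_new, Q.argmax(axis=1)
        V = V_new

R = np.array([[0.0, 1.0], [2.0, 0.0]])
T = np.array([[[0.9, 0.1], [0.2, 0.8]],         # T[s=0, a, s']
              [[1.0, 0.0], [0.5, 0.5]]])        # T[s=1, a, s']
V, policy = value_iteration(R, T)
print(V, policy)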
Basics of Data Science: Reinforcement Learning
17
Policy-Iteration
Algorithm: Policy-Iteration
  initialise the decision policy π' arbitrarily
  repeat
    π = π'
    calculate the value function of the decision policy π by solving the linear system of equations
      V(s) = R(s, π(s)) + γ Σ_{s'∈S} T(s, π(s), s') V(s')
    improve the decision policy for each state:
      π'(s) := arg max_a [ R(s, a) + γ Σ_{s'∈S} T(s, a, s') V(s') ]
  until π = π'
Basics of Data Science: Reinforcement Learning
18
Monte-Carlo-Method
– learning via experience
– learning in episodes
– no complete decision tree necessary
– generation of average returns for the determination of V(s)
Basics of Data Science: Reinforcement Learning
19
first visit Monte-Carlo-Method
• generate an episode; choose a policy π
• run through the whole episode
• calculate the average return R for each V(s) visited
• use all returns after the first visit of a particular s in the episode
• in the next episode, calculate the average return of V(s) only for those states not visited in prior episodes
Basics of Data Science: Reinforcement Learning
20
first visit Monte-Carlo-Method
Example:
(figure: episode with returns r₇ = 6 and r₈ = 9; resulting values V(s) = 4.34, V(s') = 5.5, and V(s'') = 6 resp. 9)
Basics of Data Science: Reinforcement Learning
21
every visit Monte-Carlo-Method
• generate an episode; choose a policy π
• run through the whole episode
• calculate the average return R for each V(s) visited
• use all returns after a particular s in the episode
• in the next episode, update V(s) for all states visited, no matter whether they were visited before or not
Basics of Data Science: Reinforcement Learning
22
Monte-Carlo-Method
Example:
(figure: returns r₇ = 6 and r₈ = 9 update the old values V(s) = 4.34, V(s') = 5.5, V(s'') = 6 to the new values V(s) = 5, V(s') = 6.5, V(s'') = 9)
Update rule: $V(s_t) \leftarrow V(s_t) + \alpha \left[ R_t - V(s_t) \right]$
Basics of Data Science: Reinforcement Learning
23
Temporal-Difference-Learning
• combines Dynamic Programming with the Monte-Carlo method
• uses episodes
• uses estimates for V(s) at the beginning of the episode
• corrects the estimated value of V(s_t) via the sum of the immediate return and the state value of the following state
• the episode does not need to be completed for the calculation of the estimates!
24
Temporal-Difference-Learning
(figure: episode s_t → s_{t+1} → s_{t+2} with returns r₇ and r₈)
Update rule:
$V(s_t) \leftarrow V(s_t) + \alpha \left[ r_{t+1} + \gamma\, V(s_{t+1}) - V(s_t) \right]$
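A minimal TD(0) sketch in Python (NumPy assumed): a hypothetical two-step episode with rewards 2 and 5 and α = 0.2, loosely modeled on the example slides, replayed repeatedly with γ = 1 as a simplifying assumption:

import numpy as np

def td0_update(V, s, r, s_next, alpha=0.2, gamma=1.0):
    """One TD(0) step: V(s) <- V(s) + alpha * [r + gamma * V(s') - V(s)]."""
    V[s] += alpha * (r + gamma * V[s_next] - V[s])
    return V

V = np.zeros(3)
for _ in range(21):                       # replay the same episode 21 times
    V = td0_update(V, 1, 2.0, 2)          # s_{t+1} -> s_{t+2}, reward 2
    V = td0_update(V, 0, 5.0, 1)          # s_t     -> s_{t+1}, reward 5
print(V.round(2))                         # each V(s) converges toward r + gamma * V(next)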
Basics of Data Science: Reinforcement Learning
25
Temporal-Difference-Learning
Example
(figure series: the same episode, with rewards r₇ = 2 and r₈ = 5 and step size α = 0.2, is replayed over episodes 1, 2, 3, …, 20, 21; after each step the state values V(s_t), V(s_{t+1}), V(s_{t+2}) are updated via
$V(s_t) \leftarrow V(s_t) + \alpha \left[ r_{t+1} + \gamma\, V(s_{t+1}) - V(s_t) \right]$
and gradually converge, e.g. V(s_t) ≈ 14.9 after episode 20)
Basics of Data Science: Reinforcement Learning
33
On/Off-Policy Methods
On-policy method:
the policy that generates the decisions and the policy used to estimate V(s) are identical
Off-policy method:
the action policy and the policy for updating the estimates are different
Basics of Data Science: Reinforcement Learning
34
Q-Learning — on policy
On-Policy Temporal-Difference Algorithm (SARSA)
  initialize Q(s, a) arbitrarily
  repeat for each episode
    initialize s
    select a from s using a policy derived from Q
    repeat (for each step of the episode):
      perform action a and observe r, s'
      select a' from s' using a policy derived from Q
      Q(s, a) ← Q(s, a) + α [ r + γ Q(s', a') − Q(s, a) ]
      s ← s'; a ← a'
    until s is a terminal state
Basics of Data Science: Reinforcement Learning
35
Q-Learning — off policy
Q-Learning: Off-Policy Temporal-Difference Learning
– the optimal path is not determined by updating V(s), but by updating Q(s, a)
– the action policy determines the path
– the estimation policy is used to update Q(s, a)
– the action policy is ε-greedy; the estimation policy is greedy
– Advantage: the global optimum is found with higher probability
Basics of Data Science: Reinforcement Learning
36
Q-Learning
Repeat for each episode:
1. Start from a given s
2. Choose an action a, starting from s, using the chosen behavioural policy, e.g., ε-greedy
3. Observe the reward r and the subsequent state s'
4. Update Q as follows:
$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$
5. Move from s to s'
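A minimal tabular Q-learning sketch in Python (NumPy assumed; the 5-state chain environment, start-state distribution, and hyperparameters are hypothetical choices):

import numpy as np

rng = np.random.default_rng(0)

# Toy chain: states 0..4, actions 0 = left / 1 = right, reward 1 on reaching terminal state 4.
def step(s, a):
    s_next = min(s + 1, 4) if a == 1 else max(s - 1, 0)
    return (1.0 if s_next == 4 else 0.0), s_next, s_next == 4

def q_learning(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.3):
    """Tabular Q-learning: epsilon-greedy action policy, greedy estimation policy."""
    Q = np.zeros((5, 2))
    for _ in range(episodes):
        s = int(rng.integers(0, 4))                 # random non-terminal start state
        for _ in range(200):                        # cap the episode length
            a = int(rng.integers(2)) if rng.random() < epsilon else int(np.argmax(Q[s]))
            r, s_next, done = step(s, a)
            # off-policy update: bootstrap with the greedy value max_a' Q(s', a')
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next
            if done:
                break
    return Q

Q = q_learning()
print(Q[:4].argmax(axis=1))   # greedy policy: move right in states 0..3 -> [1 1 1 1]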
Basics of Data Science: Reinforcement Learning
37
Literature
• D. P. Bertsekas, J. N. Tsitsiklis: Neuro-Dynamic Programming, Athena Scientific, Belmont, MA, 1996
• M. L. Puterman: Markov Decision Processes: Discrete Stochastic Dynamic Programming, Wiley, New York, 1994
• R. S. Sutton, A. G. Barto: Reinforcement Learning: An Introduction, second edition, MIT Press, 2018
Basics of Data Science: Reinforcement Learning
38
Basics of Data Science
Decision Trees & Random Forests
Daniel Schermer
Technische Universität Kaiserslautern
Department of Business Studies & Economics
Chair of Business Information Systems & Operations Research
https://bisor.wiwi.uni-kl.de
18.07.2022
Daniel Schermer (TU Kaiserslautern)
Basics of Data Science
18.07.2022
1 / 50
Introduction
Contents
1 Introduction
2 CART Algorithm
3 Simple Examples
4 Bias-Variance Tradeoff
5 Ensembles, Bagging & Random Forests
6 Advanced Examples
7 Conclusion
Daniel Schermer (TU Kaiserslautern)
Basics of Data Science
18.07.2022
2 / 50
Introduction
Decision Trees
Motivation
Consider the following:
Whenever heart attack patients are admitted to a hospital, several variables are monitored.
Based on these observations, decision trees allow us to construct simple rule-based systems.
Is the minimum systolic blood pressure over the 24h period ≤ 91?
  yes → high risk
  no → Is the patient's age ≤ 65?
    yes → low risk
    no → Is sinus tachycardia present?
      yes → high risk
      no → low risk
Daniel Schermer (TU Kaiserslautern)
Basics of Data Science
18.07.2022
3 / 50
Introduction
Decision Trees
Introduction
It is convenient to introduce features and labels¹:
x1: Date
x2: Age
x3: Height
x4: Weight
x5: Minimum systolic blood pressure, 24h
x6: Sinus tachycardia present?
y: 0 (low risk) or 1 (high risk)

The same tree in feature notation:
x5 ≤ 91? yes → 1; no → x2 ≤ 65? yes → 0; no → x6? yes → 1, no → 0

Given a sample x = (22.07.2022, 25, 175cm, 70kg, 115, 0), what is the prognosis?
¹ A label can also be a numerical value, e.g., remaining life expectancy.
Daniel Schermer (TU Kaiserslautern)
Basics of Data Science
18.07.2022
4 / 50
Introduction
Decision Trees
Introduction
From a graph theoretic perspective, a decision tree is a directed rooted tree with three main elements:
The root (decision) node.
(Internal) decision nodes.
Leaf nodes.
(figure: a rooted tree with root node, internal decision nodes D1–D3, and leaf nodes L1–L5)
Daniel Schermer (TU Kaiserslautern)
Basics of Data Science
18.07.2022
5 / 50
Introduction
Decision Trees
Introduction
Learning Sample L
A learning sample L consists of data: L = {(x₁, y₁), …, (xₙ, yₙ = f(xₙ))} where xᵢ ∈ X and yᵢ ∈ Y.
We distinguish two general types of variables:
A variable is called categorical if it takes values in a finite set with no natural ordering (e.g., color).
A variable is called numerical if its values are real numbers (e.g., blood pressure, age).
Generally, xᵢ is a vector consisting of one or more numerical or categorical features (variables).
We assume the label yᵢ to be either a numerical or a categorical (e.g., temperature, yes/no) variable.
A decision tree partitions L based on X to group similar parts of Y together.
The CART algorithm² achieves such a partitioning recursively by finding splits θ greedily.
We define the cardinality of L as n, i.e., |L| = |X| = |Y| = n.
² Other noteworthy methods are ID3 or C4.5.
Daniel Schermer (TU Kaiserslautern)
Basics of Data Science
18.07.2022
6 / 50
CART Algorithm
1 Introduction
2 CART Algorithm
3 Simple Examples
4 Bias-Variance Tradeoff
5 Ensembles, Bagging & Random Forests
6 Advanced Examples
7 Conclusion
Daniel Schermer (TU Kaiserslautern)
Basics of Data Science
18.07.2022
7 / 50
CART Algorithm
Classification and Regression Trees (CART)
Mathematical Formulation
We differentiate two cases when splitting L on a feature f:
For a categorical feature f_c: let S be the set of all values that the variables xᵢ ∈ L exhibit on feature f_c, and let T be a subset of S, i.e., T ⊂ S. Then θ(L, f_c, T) splits L as follows:
  L^left contains all (xᵢ, yᵢ) ∈ L for which xᵢ(f_c) ∈ T        (1)
  L^right contains all (xᵢ, yᵢ) ∈ L for which xᵢ(f_c) ∈ S \ T   (2)
For a numerical feature f_n: θ = (L, f_n, t) contains a threshold t such that L is split as follows:
  L^left contains all (xᵢ, yᵢ) ∈ L for which xᵢ(f_n) ≤ t        (3)
  L^right contains all (xᵢ, yᵢ) ∈ L for which xᵢ(f_n) > t       (4)
Daniel Schermer (TU Kaiserslautern)
Basics of Data Science
18.07.2022
8 / 50
CART Algorithm
Classification and Regression Trees (CART)
Mathematical Formulation
The goodness of a split G(L, θ) is computed using an impurity or loss function H(·), the choice of which depends on the task being solved (see Slide 10):

$G(L, \theta) = \frac{n^{left}}{n} H(L^{left}(\theta)) + \frac{n^{right}}{n} H(L^{right}(\theta))$   (5)

When learning decision trees, we want to minimize the loss function, i.e., we want to find the best split θ* such that the goodness G(·) implied by the partitioning into L^left and L^right is minimal:

$\theta^{*} = \arg\min_{\theta} G(L, \theta)$   (6)

After the first split θ*, we can recurse for the newly created L^left and L^right until a termination criterion is met (or no reduction in impurity is possible).
Daniel Schermer (TU Kaiserslautern)
Basics of Data Science
18.07.2022
9 / 50
CART Algorithm
Classification and Regression Trees (CART)
Mathematical Formulation
Commonly Used Loss Functions
If the target is a classification with values {k₁, …, k_K} and p_k is the frequency with which class k occurs in L, then the Gini impurity is defined as follows:

$H(L) = 1 - \sum_{k=1}^{K} p_k^2$   (7)

If the target is a continuous value, then the Mean Squared Error (MSE) is defined as follows:

$\bar{y} = \frac{1}{n} \sum_{y_i \in L} y_i$   (8)

$H(L) = \frac{1}{n} \sum_{y_i \in L} (y_i - \bar{y})^2$   (9)
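A minimal sketch in Python (NumPy assumed) of the Gini impurity (Eq. 7) and the goodness of a split (Eq. 5); the numbers reproduce the Wind split computed in the Play-Tennis example on the following slides:

import numpy as np

def gini(labels):
    """Gini impurity H(L) = 1 - sum_k p_k^2 (Eq. 7)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def goodness(left, right):
    """Weighted impurity of a split (Eq. 5)."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

left  = ["No", "No", "Yes", "Yes", "Yes", "No"]                  # Wind = Strong: 3 Yes, 3 No
right = ["Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "No"]   # Wind = Weak:   6 Yes, 2 No
print(round(gini(left), 2), round(gini(right), 2), round(goodness(left, right), 2))
# -> 0.5 0.38 0.43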
Daniel Schermer (TU Kaiserslautern)
Basics of Data Science
18.07.2022
10 / 50
CART Algorithm
Classification and Regression Trees (CART)
Recap
To recap, we now have all components of the CART algorithm:
1 A procedure to partition L based on θ for categorical or numerical features f (see Slide 8).
2 A general measure to assess the goodness of a split based on loss functions (see Slide 9).
3 We use the Gini impurity (classification) and the MSE (regression) as loss functions (see Slide 10).
Daniel Schermer (TU Kaiserslautern)
Basics of Data Science
18.07.2022
11 / 50
Simple Examples
1 Introduction
2 CART Algorithm
3 Simple Examples
4 Bias-Variance Tradeoff
5 Ensembles, Bagging & Random Forests
6 Advanced Examples
7 Conclusion
Daniel Schermer (TU Kaiserslautern)
Basics of Data Science
18.07.2022
12 / 50
Simple Examples
Classification Tree — Example
Learning Sample L (Day = sample index; Outlook, Temperature, Humidity, Wind = features; Play Tennis = label)

Day | Outlook  | Temperature | Humidity | Wind   | Play Tennis
 1  | Sunny    | Hot         | High     | Weak   | No
 2  | Sunny    | Hot         | High     | Strong | No
 3  | Overcast | Hot         | High     | Weak   | Yes
 4  | Rain     | Mild        | High     | Weak   | Yes
 5  | Rain     | Cool        | Normal   | Weak   | Yes
 6  | Rain     | Cool        | Normal   | Strong | No
 7  | Overcast | Cool        | Normal   | Weak   | Yes
 8  | Sunny    | Mild        | High     | Weak   | No
 9  | Sunny    | Cool        | Normal   | Weak   | Yes
10  | Rain     | Mild        | Normal   | Strong | Yes
11  | Sunny    | Mild        | Normal   | Strong | Yes
12  | Overcast | Mild        | High     | Strong | Yes
13  | Overcast | Hot         | Normal   | Weak   | Yes
14  | Rain     | Mild        | High     | Strong | No
Daniel Schermer (TU Kaiserslautern)
Basics of Data Science
18.07.2022
13 / 50
Simple Examples
Classification Tree — Example
Wind as root decision node
We need to test every possible split to find θ*.
We start (arbitrarily) with the feature f = Wind.
θ₁(L, Wind, T = {Strong}) yields:
T = {Strong} ⇒ S \ T = {Weak}
H(L^left) = 1 − (3/6)² − (3/6)² = 0.50
H(L^right) = 1 − (6/8)² − (2/8)² = 0.38
G(L, θ₁) = (6/14) · 0.50 + (8/14) · 0.38 = 0.43
(learning sample table as on Slide 13)
Resulting split:
Wind = Strong → Tennis: 3Y, 3N
Wind = Weak → Tennis: 6Y, 2N
Daniel Schermer (TU Kaiserslautern)
Basics of Data Science
18.07.2022
14 / 50
Simple Examples
Classification Tree — Example
Humidity as root decision node
We continue (arbitrarily) with the feature Humidity.
θ₂(L, Humidity, T = {High}) yields:
T = {High} ⇒ S \ T = {Normal}
H(L^left) = 1 − (3/7)² − (4/7)² = 0.49
H(L^right) = 1 − (6/7)² − (1/7)² = 0.24
G(L, θ₂) = (7/14) · 0.49 + (7/14) · 0.24 = 0.37
(learning sample table as on Slide 13)
Resulting split:
Humidity = High → Tennis: 3Y, 4N
Humidity = Normal → Tennis: 6Y, 1N
Daniel Schermer (TU Kaiserslautern)
Basics of Data Science
18.07.2022
15 / 50
Simple Examples
Classification Tree — Example
Outlook or Temperature as root decision node
Using Outlook and Temperature as candidate features is more complex:
If we split L on Outlook (S = {Sunny, Overcast, Rain}), we can build the following subsets T:
▶ Option 1: T = {Sunny, Overcast} and S \ T = {Rain}
▶ Option 2: T = {Sunny, Rain} and S \ T = {Overcast}
▶ Option 3: T = {Overcast, Rain} and S \ T = {Sunny}
If we split L on Temperature (S = {Cool, Mild, Hot}), we can build the following subsets T:
▶ Option 1: T = {Cool, Mild} and S \ T = {Hot}
▶ Option 2: T = {Mild, Hot} and S \ T = {Cool}
▶ Option 3: T = {Cool, Hot} and S \ T = {Mild}
We investigate all of these options on the following slides.
Daniel Schermer (TU Kaiserslautern)
Basics of Data Science
18.07.2022
16 / 50
Simple Examples
Classification Tree — Example
Outlook as root decision node, Option 1
We continue with Outlook.
θ₃(L, Outlook, T = {Sunny, Overcast}) yields:
T = {Sunny, Overcast} ⇒ S \ T = {Rain}
H(L^left) = 1 − (6/9)² − (3/9)² = 0.44
H(L^right) = 1 − (3/5)² − (2/5)² = 0.48
G(L, θ₃) = (9/14) · 0.44 + (5/14) · 0.48 = 0.45
(learning sample table as on Slide 13)
Resulting split:
Outlook ∈ {Sunny, Overcast} → Tennis: 6Y, 3N
Outlook = Rain → Tennis: 3Y, 2N
Daniel Schermer (TU Kaiserslautern)
Basics of Data Science
18.07.2022
17 / 50
Simple Examples
Classification Tree — Example
Outlook as root decision node, Option 2
We continue with Outlook.
θ₄(L, Outlook, T = {Sunny, Rain}) yields:
T = {Sunny, Rain} ⇒ S \ T = {Overcast}
H(L^left) = 1 − (5/10)² − (5/10)² = 0.50
H(L^right) = 1 − (4/4)² − (0/4)² = 0.00
G(L, θ₄) = (10/14) · 0.50 + (4/14) · 0.00 = 0.36
(learning sample table as on Slide 13)
Resulting split:
Outlook ∈ {Sunny, Rain} → Tennis: 5Y, 5N
Outlook = Overcast → Tennis: 4Y, 0N
Daniel Schermer (TU Kaiserslautern)
Basics of Data Science
18.07.2022
18 / 50
Simple Examples
Classification Tree — Example
Outlook as root decision node, Option 3
We continue with Outlook.
θ₅(L, Outlook, T = {Overcast, Rain}) yields:
T = {Overcast, Rain} ⇒ S \ T = {Sunny}
H(L^left) = 1 − (7/9)² − (2/9)² = 0.35
H(L^right) = 1 − (2/5)² − (3/5)² = 0.48
G(L, θ₅) = (9/14) · 0.35 + (5/14) · 0.48 = 0.40
(learning sample table as on Slide 13)
Resulting split:
Outlook ∈ {Overcast, Rain} → Tennis: 7Y, 2N
Outlook = Sunny → Tennis: 2Y, 3N
Daniel Schermer (TU Kaiserslautern)
Basics of Data Science
18.07.2022
19 / 50
Simple Examples
Classification Tree — Example
Temperature as root decision node, Option 1
We continue with Temperature.
θ₆(L, Temperature, T = {Cool, Mild}) yields:
T = {Cool, Mild} ⇒ S \ T = {Hot}
H(L^left) = 1 − (7/10)² − (3/10)² = 0.42
H(L^right) = 1 − (2/4)² − (2/4)² = 0.50
G(L, θ₆) = (10/14) · 0.42 + (4/14) · 0.50 = 0.44
(learning sample table as on Slide 13)
Resulting split:
Temperature ∈ {Cool, Mild} → Tennis: 7Y, 3N
Temperature = Hot → Tennis: 2Y, 2N
Daniel Schermer (TU Kaiserslautern)
Basics of Data Science
18.07.2022
20 / 50
Simple Examples
Classification Tree — Example
Temperature as root decision node, Option 2
We continue with Temperature.
θ₇(L, Temperature, T = {Mild, Hot}) yields:
T = {Mild, Hot} ⇒ S \ T = {Cool}
H(L^left) = 1 − (6/10)² − (4/10)² = 0.48
H(L^right) = 1 − (3/4)² − (1/4)² = 0.38
G(L, θ₇) = (10/14) · 0.48 + (4/14) · 0.38 = 0.45
(learning sample table as on Slide 13)
Resulting split:
Temperature ∈ {Mild, Hot} → Tennis: 6Y, 4N
Temperature = Cool → Tennis: 3Y, 1N
Daniel Schermer (TU Kaiserslautern)
Basics of Data Science
18.07.2022
21 / 50
Simple Examples
Classification Tree — Example
Temperature as root decision node, Option 3
We continue with Temperature.
θ₈(L, Temperature, T = {Cool, Hot}) yields:
T = {Cool, Hot} ⇒ S \ T = {Mild}
H(L^left) = 1 − (5/8)² − (3/8)² = 0.47
H(L^right) = 1 − (4/6)² − (2/6)² = 0.44
G(L, θ₈) = (8/14) · 0.47 + (6/14) · 0.44 = 0.46
(learning sample table as on Slide 13)
Resulting split:
Temperature ∈ {Cool, Hot} → Tennis: 5Y, 3N
Temperature = Mild → Tennis: 4Y, 2N
Daniel Schermer (TU Kaiserslautern)
Basics of Data Science
18.07.2022
22 / 50
Simple Examples
Classification Tree — Example
First split
We select Outlook for our first (root) decision node and use the split θ₄(L, Outlook, T = {Sunny, Rain}) because:
θ* = θ₄ = arg min_θ G(L, θ)
Currently, we have two leaf nodes:
Outlook ∈ {Sunny, Rain} → Tennis: 5Y, 5N
Outlook = Overcast → Tennis: 4Y, 0N
For each node that does not perfectly classify the labels (H ≠ 0), we can recurse the procedure.
This yields the decision tree shown on Slide 24.
(learning sample table as on Slide 13)
Daniel Schermer (TU Kaiserslautern)
Basics of Data Science
18.07.2022
23 / 50
Simple Examples
Classification Tree
After many more splits …
Outlook = Overcast → Yes
Outlook ∈ {Sunny, Rain} → Humidity:
  Humidity = High → Outlook:
    Sunny → No
    Rain → Wind: Weak → Yes; Strong → No
  Humidity = Normal → Temperature:
    {Hot, Mild} → Yes
    Cool → Wind: Weak → Yes; Strong → No
(learning sample table as on Slide 13)
Daniel Schermer (TU Kaiserslautern)
Basics of Data Science
18.07.2022
24 / 50
Simple Examples
Regression Tree — Example
We have the following learning sample L = {X, Y}:
X = {x₁ = 0, x₂ = 1, x₃ = 2, x₄ = 3, x₅ = 4}
f(xᵢ) = Y = {y₁ = 0, y₂ = 1, y₃ = 2, y₄ = 1, y₅ = 1}
We introduced the following partitioning scheme (Slide 8):
L^left contains all (xᵢ, yᵢ) ∈ L for which xᵢ(f_n) ≤ t
L^right contains all (xᵢ, yᵢ) ∈ L for which xᵢ(f_n) > t
If the samples are sorted in ascending order for feature f, a common idea is to consider all midpoint thresholds:
t₁ = (x₁(f) + x₂(f)) / 2, t₂ = (x₂(f) + x₃(f)) / 2, …
Here, we have just a single feature (the x value).
(figure: scatter plot of the five sample points)
Daniel Schermer (TU Kaiserslautern)
Basics of Data Science
18.07.2022
25 / 50
Simple Examples
Regression Tree — Example
All relevant thresholds
The first threshold is t₁ = (x₁ + x₂)/2 = 0.5:
  L^left = {(x₁, y₁)}: ȳ^left = 0, H(L^left) = 0
  L^right = {(x₂, y₂), …, (x₅, y₅)}: ȳ^right = 1.25, H(L^right) = 0.1875
  G = (1/5) · 0 + (4/5) · 0.1875 = 0.15
The second threshold is t₂ = (x₂ + x₃)/2 = 1.5:
  L^left = {(x₁, y₁), (x₂, y₂)}: ȳ^left = 0.5, H(L^left) = 0.25
  L^right = {(x₃, y₃), …, (x₅, y₅)}: ȳ^right = 1.33, H(L^right) = 0.2222
  G = (2/5) · 0.25 + (3/5) · 0.2222 = 0.2333
Daniel Schermer (TU Kaiserslautern)
Basics of Data Science
18.07.2022
26 / 50
Simple Examples
Regression Tree — Example
All relevant thresholds
The third threshold is t₃ = (x₃ + x₄)/2 = 2.5:
  L^left = {(x₁, y₁), (x₂, y₂), (x₃, y₃)}: ȳ^left = 1, H(L^left) = 2/3
  L^right = {(x₄, y₄), (x₅, y₅)}: ȳ^right = 1, H(L^right) = 0
  G = (3/5) · (2/3) + (2/5) · 0 = 0.4
The fourth threshold is t₄ = (x₄ + x₅)/2 = 3.5:
  L^left = {(x₁, y₁), …, (x₄, y₄)}: ȳ^left = 1, H(L^left) = 1/2
  L^right = {(x₅, y₅)}: ȳ^right = 1, H(L^right) = 0
  G = (4/5) · (1/2) + (1/5) · 0 = 0.4
Daniel Schermer (TU Kaiserslautern)
Basics of Data Science
18.07.2022
27 / 50
Simple Examples
Regression Tree — Example
First Iteration
The optimal split θ* is θ(L, x, 0.5).
We see that the tree provides a piecewise constant approximation by using axis-aligned splits.
Root: 5 samples, ȳ = 1, MSE = 0.4, split x ≤ 0.5
  True → 1 sample, ȳ = 0, MSE = 0
  False → 4 samples, ȳ = 1.25, MSE = 0.188
Daniel Schermer (TU Kaiserslautern)
Basics of Data Science
18.07.2022
28 / 50
Simple Examples
Regression Tree — Example
After many more splits …
The procedure can be recursed, yielding an improved piecewise-constant approximation after each iteration.
Root: 5 samples, ȳ = 1, MSE = 0.4, split x ≤ 0.5
  True → 1 sample, ȳ = 0, MSE = 0
  False → 4 samples, ȳ = 1.25, MSE = 0.188, split x ≤ 2.5
    True → 2 samples, ȳ = 1.5, MSE = 0.25, split x ≤ 1.5
      True → 1 sample, ȳ = 1, MSE = 0
      False → 1 sample, ȳ = 2, MSE = 0
    False → 2 samples, ȳ = 1, MSE = 0
Daniel Schermer (TU Kaiserslautern)
Basics of Data Science
18.07.2022
29 / 50
Bias-Variance Tradeoff
1 Introduction
2 CART Algorithm
3 Simple Examples
4 Bias-Variance Tradeoff
5 Ensembles, Bagging & Random Forests
6 Advanced Examples
7 Conclusion
Daniel Schermer (TU Kaiserslautern)
Basics of Data Science
18.07.2022
30 / 50
Bias-Variance Tradeoff
Bias-Variance Tradeoff
High-Level Overview
Bias
Bias measures the average amount by which the predictions of a model ŷi differ from the true value yi .
Low Bias: Weak assumptions regarding the functional relationship between the input and output.
High Bias: Strong assumptions regarding the functional relationship between the input and output.
Variance
Variance measures the variability of the predictions when a model is learnt over different L.
Low variance ⇒ Small changes in L cause small changes in the model.
High variance ⇒ Small changes in L cause large changes in the model.
Daniel Schermer (TU Kaiserslautern)
Basics of Data Science
18.07.2022
31 / 50
Bias-Variance Tradeoff
Bias-Variance Tradeoff
High-Level Overview
In general, the trade-off between bias and variance is non-trivial!
(figure: total error = bias² + variance as a function of model complexity; the optimum model complexity minimizes the total error)
Daniel Schermer (TU Kaiserslautern)
Basics of Data Science
18.07.2022
32 / 50
Bias-Variance Tradeoff
Bias-Variance Tradeoff
Decision Trees & Example
Decision trees have low bias and high variance:
▶ We make (almost) no assumption about the functional relationship underlying L.
▶ A small change in L can lead to a completely different decision tree.
▶ This puts decision trees at risk of overfitting (not generalizing well)!
A more intuitive example may be fitting a polynomial function of degree d on a learning sample.
Daniel Schermer (TU Kaiserslautern)
Basics of Data Science
18.07.2022
33 / 50
Bias-Variance Tradeoff
Bias-Variance Tradeoff
Decision Trees & Example
There are several ways to address the bias-variance tradeoff for tree-based learners:
Pre-regularization (pre-pruning): stop growing the tree prematurely.
▶ Minimum number of samples required for each split or leaf.
▶ Maximum depth of the tree.
▶ Goodness required to make a new split.
Post-regularization (post-pruning): grow a full tree, then prune it.
Ensembling, bagging & random forests.
Daniel Schermer (TU Kaiserslautern)
Basics of Data Science
18.07.2022
34 / 50
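In scikit-learn terms (a sketch with illustrative values, not recommended settings), the pre-pruning rules listed above map onto constructor arguments, and post-pruning is available via cost-complexity pruning:

from sklearn.tree import DecisionTreeRegressor

tree = DecisionTreeRegressor(
    min_samples_split=5,         # minimum number of samples required for a split
    min_samples_leaf=2,          # minimum number of samples required per leaf
    max_depth=4,                 # maximum depth of the tree
    min_impurity_decrease=0.01,  # goodness required to make a new split
    ccp_alpha=0.0,               # > 0 enables post-pruning (cost-complexity pruning)
)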
Ensembles, Bagging & Random Forests

1. Introduction
2. CART Algorithm
3. Simple Examples
4. Bias-Variance Tradeoff
5. Ensembles, Bagging & Random Forests
6. Advanced Examples
7. Conclusion
Ensembles, Bagging & Random Forests
Ensembles
Introduction

Some of the most powerful machine learning models are ensemble methods.
An ensemble combines two or more base predictors, aiming to create a more powerful model.
We can distinguish two types of approaches:
▶ Averaging methods (Bagging, Random Forests, . . . )
▶ Boosting methods (AdaBoost, Gradient Boosted Decision Trees, . . . )
Ensembles, Bagging & Random Forests
Bagging
Bootstrapping & Aggregating

Bagging (Bootstrapping and Aggregating)

Bootstrapping:
Randomly sample with replacement from $L$ until we have a new learning sample $\tilde{L}$ of the same size.
Iterate this procedure $B$ times, until we have $\tilde{L}_1, \ldots, \tilde{L}_B$ bootstrapped learning samples.
Learn $B$ predictors (e.g., decision trees, using CART), one on each of $\tilde{L}_1, \ldots, \tilde{L}_B$.

Aggregating:
For classification: the predicted label is the majority vote amongst the $B$ predictors.
For regression: the predicted value is the average amongst the values predicted by the $B$ predictors.

A minimal from-scratch sketch of this procedure follows below.
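A minimal from-scratch sketch of bagging for regression (hypothetical helper functions; scikit-learn's BaggingRegressor packages the same idea):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagging_fit(X, y, B=100, seed=0):
    # Bootstrapping: draw B samples of size n with replacement, fit a tree on each.
    rng = np.random.default_rng(seed)
    n = len(X)
    trees = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)  # indices sampled with replacement
        trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
    return trees

def bagging_predict(trees, X):
    # Aggregating: average the individual predictions (regression case).
    return np.mean([t.predict(X) for t in trees], axis=0)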
Ensembles, Bagging & Random Forests
Bagging
Bootstrapping & Aggregating

Consider our previous regression example where we had:
$L = \{(x_1, y_1), (x_2, y_2), (x_3, y_3), (x_4, y_4), (x_5, y_5)\} = \{(0, 0), (1, 1), (2, 2), (3, 1), (4, 1)\}$

Random sampling with replacement might yield the following $B$ bootstrap samples:
$\tilde{L}_1 = \{(x_3, y_3), (x_4, y_4), (x_3, y_3), (x_1, y_1), (x_2, y_2)\}$
$\tilde{L}_2 = \{(x_4, y_4), (x_5, y_5), (x_5, y_5), (x_2, y_2), (x_1, y_1)\}$
...
$\tilde{L}_B = \{(x_1, y_1), (x_2, y_2), (x_4, y_4), (x_5, y_5), (x_3, y_3)\}$
Ensembles, Bagging & Random Forests
Bagging
Bootstrapping & Aggregating

We call the arrangement of $B$ trees that results from bagging a decision forest.

[Figure: the learning sample $L$ is bootstrapped into $B$ samples, each of which grows a tree (Tree 1, Tree 2, . . . , Tree $B$); the individual outputs are aggregated, by the mean in regression or a majority vote in classification, into the final prediction.]
Ensembles, Bagging & Random Forests
Random Forests
Overview

Random Forest ⇔ a decision forest that is constructed with extra randomness³:
▶ In principle, a random forest is grown using CART.
▶ However, whenever we look for a split in a tree, we only consider a random subset of the features.
▶ Generally, this random subspace is very small.

Random Forests typically perform more favorably than a decision forest.
However, they also come with additional hyperparameters, as the sketch below shows.

³ Injecting the right amount of randomness is not trivial!
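In scikit-learn (a sketch with illustrative settings), the extra randomness is controlled by max_features, the size of the random feature subset considered at each split:

from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(
    n_estimators=100,     # B: the number of trees in the forest
    max_features="sqrt",  # random subset of sqrt(#features) candidates per split
    random_state=0,
)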
Ensembles, Bagging & Random Forests
Random Forests
Example

Consider our previous classification example.
Bootstrapping may yield $B$ learning samples:
$L_1$ = {Day 5, Day 2, Day 11, . . . },
$L_2$ = {Day 8, Day 1, Day 4, . . . },
...
When training $T_1$ on $L_1$, we might consider only, e.g., Outlook and Wind for the first split (selected randomly).
When training $T_2$ on $L_2$, we might consider only, e.g., Outlook and Temperature for the first split (selected randomly).
This procedure can be repeated until tree $T_B$, and then recursed for each tree.

Day | Outlook | Temperature | Humidity | Wind | Play Tennis
1 | Sunny | Hot | High | Weak | No
2 | Sunny | Hot | High | Strong | No
3 | Overcast | Hot | High | Weak | Yes
4 | Rain | Mild | High | Weak | Yes
5 | Rain | Cool | Normal | Weak | Yes
6 | Rain | Cool | Normal | Strong | No
7 | Overcast | Cool | Normal | Weak | Yes
8 | Sunny | Mild | High | Weak | No
9 | Sunny | Cool | Normal | Weak | Yes
10 | Rain | Mild | Normal | Strong | Yes
11 | Sunny | Mild | Normal | Strong | Yes
12 | Overcast | Mild | High | Strong | Yes
13 | Overcast | Hot | Normal | Weak | Yes
14 | Rain | Mild | High | Strong | No
Advanced Examples

1. Introduction
2. CART Algorithm
3. Simple Examples
4. Bias-Variance Tradeoff
5. Ensembles, Bagging & Random Forests
6. Advanced Examples
7. Conclusion
Advanced Examples
Advanced Regression Example

Assume that we have the following function:
$f(x) = e^{-x^2} + 1.5\,e^{-(x-2)^2}, \quad x \in [-5, 5]$, with noise drawn from $N(\mu, \sigma^2) = N(0, 0.01)$

We do the following:
Draw 200 learning samples $L_1, \ldots, L_{200}$ from the noisy $f(x)$.
Use $L_1, \ldots, L_{100}$ for training 100 decision trees and forests.
Compare the average performance of the learnt trees and forests against a hypothetical regressor that best matches (on average) the remaining learning samples $L_{101}, \ldots, L_{200}$.
A sketch of this experiment follows below.

[Figure: $f(x)$ over $x \in [-5, 5]$.]
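A sketch of the experiment (the number of points per learning sample is my own assumption; the scikit-learn bias-variance example linked at the end of this deck follows the same pattern) estimates bias²(x) and variance(x) pointwise from the 100 learnt trees:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
f = lambda x: np.exp(-x**2) + 1.5 * np.exp(-(x - 2) ** 2)

def draw_sample(n=80):  # n per learning sample is an assumption
    x = rng.uniform(-5.0, 5.0, n)
    return x.reshape(-1, 1), f(x) + rng.normal(0.0, 0.1, n)  # sigma^2 = 0.01

x_grid = np.linspace(-5.0, 5.0, 200).reshape(-1, 1)
preds = np.array([
    DecisionTreeRegressor().fit(*draw_sample()).predict(x_grid)
    for _ in range(100)  # one tree per learning sample L_1, ..., L_100
])

bias2 = (preds.mean(axis=0) - f(x_grid.ravel())) ** 2  # squared bias, pointwise
variance = preds.var(axis=0)                           # variance over the 100 trees
print(bias2.mean(), variance.mean())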
Advanced Examples
Advanced Regression Example

$f(x) = e^{-x^2} + 1.5\,e^{-(x-2)^2}$ with noise drawn from $N(\mu, \sigma^2) = N(0, 0.01)$

[Figure: two columns of plots, Decision Tree (left) vs. Decision Forest (right). Top row: $f(x)$ together with the average prediction $E_L\,\hat{y}(x)$. Bottom row: $\text{error}(x)$, $\text{bias}^2(x)$, and $\text{variance}(x)$.]
Advanced Examples
Advanced Classification Example

We have $L$ where each $x_i$ is an 8-by-8-pixel greyscale matrix and each $y_i \in \{0, \ldots, 9\}$; $n = 1797$.
This is a straightforward classification task:
We have $8^2 = 64$ features (corresponding to each pixel) with greyscale values in $[0, 1]$.
We have a single label $y_i \in \{0, \ldots, 9\}$.
On the following slide we compare the performance of a Decision Tree, a Decision Forest (100 trees), and a Random Forest (100 trees):
▶ We use 1437 samples (≈ 80%) for learning the classifiers.
▶ We use 360 samples (≈ 20%) for testing the classifiers.
A sketch reproducing this comparison follows below.
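The comparison can be reproduced with a sketch along these lines (assuming scikit-learn; the random seed, and hence the exact accuracies, are assumptions). A decision forest is emulated here as bagged trees that consider all features at every split:

from sklearn.datasets import load_digits
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)  # n = 1797 samples, 64 greyscale features
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=360, random_state=0)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    # Decision forest: bagging with plain CART trees (all features per split).
    "Decision Forest": BaggingClassifier(
        DecisionTreeClassifier(), n_estimators=100, random_state=0
    ),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
for name, model in models.items():
    print(name, model.fit(X_tr, y_tr).score(X_te, y_te))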
Advanced Examples
Advanced Classification Example

[Figure: confusion matrices (true label vs. predicted label, digits 0–9) for the Decision Tree, the Decision Forest, and the Random Forest.]

Accuracies: 79.0% (Decision Tree), 88.3% (Decision Forest), 93.3% (Random Forest).
Conclusion

1. Introduction
2. CART Algorithm
3. Simple Examples
4. Bias-Variance Tradeoff
5. Ensembles, Bagging & Random Forests
6. Advanced Examples
7. Conclusion
Conclusion
Conclusion

Decision Trees
Advantages:
Simple to understand and interpret.
Require (almost) no data preparation.
Require (almost) no hyperparameters.
Disadvantages:
Prone to overfitting (high variance).
Regression trees are weak at extrapolation.
Unstable (a change in $L$ yields a different tree).

Decision Forests & Random Forests
Advantages:
Powerful and typically more accurate.
Require (almost) no data preparation.
Several trees make the forest stable.
Disadvantages:
No longer easily interpretable.
Computationally more expensive.
More hyperparameters than decision trees.
Conclusion
Recommended Web Resources & Python Code

https://scikit-learn.org/stable/modules/tree.html
https://scikit-learn.org/stable/auto_examples/tree/plot_unveil_tree_structure.html
https://scikit-learn.org/stable/auto_examples/tree/plot_tree_regression.html
https://scikit-learn.org/stable/auto_examples/ensemble/plot_bias_variance.html
https://scikit-learn.org/stable/auto_examples/classification/plot_digits_classification.html
Conclusion
Literature

Breiman, Leo (1996). "Bagging predictors". In: Machine Learning 24.2, pp. 123–140.
Breiman, Leo (2001). "Random Forests". In: Machine Learning 45.1, pp. 5–32.
Breiman, Leo et al. (1984). Classification And Regression Trees. 1st ed. Routledge.
Freund, Yoav and Robert E. Schapire (1997). "A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting". In: Journal of Computer and System Sciences 55.1, pp. 119–139.
Hastie, Trevor, Robert Tibshirani, and J. H. Friedman (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. Springer Series in Statistics. New York, NY: Springer.
Ho, Tin Kam (1998). "The random subspace method for constructing decision forests". In: IEEE Transactions on Pattern Analysis and Machine Intelligence 20.8, pp. 832–844.
Mitchell, Tom M. (1997). Machine Learning. McGraw-Hill Series in Computer Science. New York: McGraw-Hill.
Pedregosa, F. et al. (2011). "Scikit-learn: Machine Learning in Python". In: Journal of Machine Learning Research 12, pp. 2825–2830.
Quinlan, J. R. (1986). "Induction of decision trees". In: Machine Learning 1.1, pp. 81–106.
scikit-learn: Machine Learning in Python — scikit-learn 1.1.1 documentation (2022).