
Final Machine learning Autumn-22

MACHINE LEARNING AND DATA MINING
Final syllabus topics and solve
https://drive.google.com/drive/folders/1CoNE3BkgOJNj6q9lhPhjcsqyK0fYDUAG?usp=sharing
TASNIM (C191267)
Contents
1 Segment 4
    1.1 Data Warehouse Modeling
        1.1.1 Data Cube and OLAP
        1.1.2 Data Warehouse Design and Usage
2 Classification
    2.1.1 Basic Concepts
    2.1.2 Decision Tree Induction
    2.1.3 Bayes Classification Methods
    2.1.4 Rule-Based Classification
    2.1.5 Model Evaluation and Selection
3 Classification Advanced Topics
    3.1.1 Techniques to Improve
    3.1.2 Classification Accuracy
    3.1.3 Ensemble Methods
    3.1.4 Bayesian Belief Networks
    3.1.5 Classification by Backpropagation
    3.1.6 Support Vector Machines
    3.1.7 Lazy Learners (or Learning from Your Neighbors)
    3.1.8 Other Classification Methods
4 Cluster Analysis
    4.1.1 Basic Concepts
    4.1.2 Partitioning Methods
    4.1.3 Hierarchical Methods
    4.1.4 Density-Based Methods
5 Segment 5
    5.1 Outliers Detection and Analysis
    5.2 Outliers Detection Methods
        5.2.1 Mining Contextual and Collective Outliers
6 From class lecture
    6.1 K-Nearest Neighbor (KNN) Algorithm for Machine Learning
    6.2 Apriori vs FP-Growth in Market Basket Analysis – A Comparative Guide
    6.3 Random Forest Algorithm
    6.4 Regression Analysis in Machine Learning
    6.5 Confusion Matrix in Machine Learning
Exercise
1 Segment 4
1.1 Data Warehouse Modeling:
1.1.1 Data Cube and OLAP,
1.1.2 Data Warehouse Design and Usage
1. a) What do you mean by a data warehouse? If you have a proper database, do you still need a data warehouse? Justify your answer.
b) Consider Agora Super Shop. Based on "Agora", develop and draw a sample data cube.
c) Why do we need a data mart?
1. a) Draw the multi-tiered architecture of a data warehouse and explain briefly. [4]
b) Compare star and snowflake schemas with necessary figures. [3]
c) Explain the different choices for data cube materialization. [3]
What Is a Data Warehouse?
Data warehousing provides architectures and tools for business executives to systematically organize, understand, and use their data to make strategic decisions. Data warehouse systems are valuable tools in today's competitive, fast-evolving world. In the last several years, many firms have spent millions of dollars building enterprise-wide data warehouses. Many people feel that, with competition mounting in every industry, data warehousing is the latest must-have marketing weapon, a way to retain customers by learning more about their needs.
A data warehouse is often defined as a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process. Let's take a closer look at each of these key features.
• Subject-oriented: A data warehouse is organized around major subjects such as customer, supplier, product, and sales. Rather than concentrating on the day-to-day operations and transaction processing of an organization, a data warehouse focuses on the modeling and analysis of data for decision makers. Hence, data warehouses typically provide a simple and concise view of particular subject issues by excluding data that are not useful in the decision support process.
• Integrated: A data warehouse is usually constructed by integrating multiple heterogeneous sources, such as relational databases, flat files, and online transaction records. Data cleaning and data integration techniques are applied to ensure consistency in naming conventions, encoding structures, attribute measures, and so on.
• Time-variant: Data are stored to provide information from an historic perspective (e.g., the past 5–10 years). Every key structure in the data warehouse contains, either implicitly or explicitly, a time element.
• Nonvolatile: A data warehouse is always a physically separate store of data transformed from the application data found in the operational environment. Due to this separation, a data warehouse does not require transaction processing, recovery, and concurrency control mechanisms. It usually requires only two operations in data accessing: initial loading of data and access of data.
Differences between Operational Database Systems and Data Warehouses
Because most people are familiar with commercial relational database systems, it is easy to understand what a data warehouse is by
comparing these two kinds of systems. The major task of online operational database systems is to perform online transaction and
query processing. These systems are called online transaction processing (OLTP) systems. They cover most of the day-to-day
operations of an organization such
as purchasing, inventory, manufacturing, banking, payroll, registration, and accounting. Data warehouse systems, on the other hand,
serve users or knowledge workers in the role of data analysis and decision making. Such systems can organize and present data in
various formats in order to accommodate the diverse needs of different users. These systems are known as online analytical
processing (OLAP) systems.
The major distinguishing features of OLTP and OLAP are summarized as follows:
Users and system orientation: An OLTP system is customer-oriented and is used for transaction and query processing by clerks, clients, and information technology professionals. An OLAP system is market-oriented and is used for data analysis by knowledge workers, including managers, executives, and analysts.
Data contents: An OLTP system manages current data that, typically, are too detailed to be easily used for decision making. An OLAP system manages large amounts of historic data, provides facilities for summarization and aggregation, and stores and manages information at different levels of granularity. These features make the data easier to use for informed decision making.
But, Why Have a Separate Data Warehouse?
Reasons for having a separate data warehouse:
1. Performance: Operational databases are optimized for specific tasks, while data warehouse queries are more complex and
computationally intensive. Processing OLAP queries directly on operational databases would degrade performance for
operational tasks.
2. Concurrency: Operational databases support concurrent transactions and employ concurrency control mechanisms. Applying
these mechanisms to OLAP queries would hinder concurrent transactions and reduce OLTP throughput.
3. Data Structures and Uses: Operational databases store detailed raw data, while data warehouses require historic data for
decision support. Operational databases lack complete data for decision making, requiring consolidation from diverse sources
in data warehouses.
Although separate databases are necessary now, vendors are working to optimize operational databases for OLAP queries, potentially
reducing the separation between OLTP and OLAP systems.
Data Warehousing: A Multitiered Architecture
Data warehouses often adopt a three-tier architecture, as presented in Figure 4.1.
(Figures omitted: the three-tier data warehouse architecture, as shown in the class slide and in the textbook.)
The three-tier data warehousing architecture consists of a bottom tier with a warehouse database server, a middle tier with an OLAP server, and
a top tier with front-end client tools.
The bottom tier, implemented as a relational database system, handles data extraction, cleaning, transformation, and loading functions. It
collects data from operational databases or external sources, merges similar data, and updates the data warehouse. Gateways like ODBC, OLEDB,
and JDBC are used for data extraction, and a metadata repository stores information about the data warehouse.
The middle tier consists of an OLAP server, which can be based on a relational OLAP (ROLAP) or a multi-dimensional OLAP (MOLAP) model.
ROLAP maps multidimensional operations to relational operations, while MOLAP directly implements multidimensional data and operations. The
OLAP server is responsible for processing and analyzing the data stored in the data warehouse.
The top tier is the front-end client layer that provides tools for querying, reporting, analysis, and data mining. It includes query and reporting
tools, analysis tools, and data mining tools. These tools allow users to interact with the data warehouse, perform ad-hoc queries, generate
reports, and gain insights through analysis and data mining techniques.
Overall, the three-tier data warehousing architecture facilitates efficient data extraction, processing, and analysis, enabling organizations to make
informed decisions and leverage business intelligence.
Data Warehouse Models: Enterprise Warehouse, Data Mart, and Virtual Warehouse
The three data warehouse models are the enterprise warehouse, data mart, and virtual warehouse.
1. Enterprise Warehouse: An enterprise warehouse collects data from various sources across the entire organization. It provides
corporate-wide data integration and is cross-functional in scope. It contains detailed and summarized data, and its size can
range from gigabytes to terabytes or more. It requires extensive business modeling and may take a long time to design and
build. Examples include a data warehouse that integrates sales, marketing, finance, and inventory data from multiple
departments within a company.
2. Data Mart: A data mart is a subset of data from the enterprise warehouse that is tailored for a specific group of users or a
particular subject area. It focuses on selected subjects, such as sales, marketing, or customer data. Data marts usually contain
summarized data and are implemented on low-cost departmental servers. They have a shorter implementation cycle compared
to enterprise warehouses. Examples include a marketing data mart that provides data on customer demographics, purchase
history, and campaign performance for the marketing department.
3. Virtual Warehouse: A virtual warehouse is a set of views over operational databases. It provides efficient query processing by
materializing only selected summary views. Virtual warehouses are relatively easy to build but require excess capacity on
operational database servers. They can be used when real-time access to operational data is required, but with the benefits of a
data warehouse's querying capabilities.
Pros and Cons of Top-Down and Bottom-Up Approaches:
• Top-Down Approach: In the top-down approach, an enterprise warehouse is developed first, providing a systematic solution
and minimizing integration problems. However, it can be expensive, time-consuming, and lacks flexibility due to the difficulty
of achieving consensus on a common data model for the entire organization.
• Bottom-Up Approach: The bottom-up approach involves designing and deploying independent data marts, offering flexibility,
lower cost, and faster returns on investment. However, integrating these disparate data marts into a consistent enterprise data
warehouse can pose challenges.
It is recommended to develop data warehouse systems in an incremental and evolutionary manner. This involves defining a high-level
corporate data model first, which provides a consistent view of data across subjects and reduces future integration problems. Then,
independent data marts can be implemented in parallel with the enterprise warehouse, gradually expanding the data warehouse
ecosystem.
Data Cube: A Multidimensional Data Model:
What Are the Elements of a Data Cube?
Now that we've laid the foundations, let's get acquainted with the data cube terminology. Here is a summary of the individual elements, starting from the definition of a data cube itself:
• A data cube is a multi-dimensional data structure.
• A data cube is characterized by its dimensions (e.g., Products, States, Date).
• Each dimension is associated with corresponding attributes (for example, the attributes of the Products dimension are T-Shirt, Shirt, Jeans and Jackets).
• The dimensions of a cube allow for a concept hierarchy (e.g., the T-Shirt attribute in the Products dimension can have its own, such as T-Shirt Brands).
• All dimensions connect in order to create a certain fact – the finest part of the cube.
• A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions.
• Dimension tables, such as item (item_name, brand, type), or time (day, week, month, quarter, year).
• The fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables.
• In data warehousing literature, an n-D base cube is called a base cuboid. The topmost 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid. The lattice of cuboids forms a data cube.
What Are the Data Cube Operations?
Data cubes are a very convenient tool whenever one needs to build summaries or extract certain portions of the entire dataset. We will cover the following (a small code sketch follows the list):
• Rollup – decreases dimensionality by aggregating data along a certain dimension
• Drill-down – increases dimensionality by splitting the data further
• Slicing – decreases dimensionality by choosing a single value from a particular dimension
• Dicing – picks a subset of values from each dimension
• Pivoting – rotates the data cube
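To make these operations concrete, here is a minimal sketch (not part of the original notes) that mimics them on a tiny sales table with pandas; the column names and values are made up purely for illustration.

```python
# Cube-style operations on a small sales fact table, using pandas (illustrative only).
import pandas as pd

sales = pd.DataFrame({
    "year":    [2021, 2021, 2022, 2022, 2022],
    "quarter": ["Q1", "Q2", "Q1", "Q1", "Q2"],
    "item":    ["T-Shirt", "Jeans", "T-Shirt", "Jacket", "Jeans"],
    "state":   ["Dhaka", "Dhaka", "Chittagong", "Dhaka", "Sylhet"],
    "dollars_sold": [400, 650, 500, 900, 700],
})

# Rollup: aggregate away the quarter dimension (quarter -> year).
rollup = sales.groupby(["year", "item"])["dollars_sold"].sum()

# Drill-down is the reverse: include a finer level (year, quarter).
drilldown = sales.groupby(["year", "quarter", "item"])["dollars_sold"].sum()

# Slice: fix one dimension to a single value (year == 2022).
slice_2022 = sales[sales["year"] == 2022]

# Dice: pick a subset of values from several dimensions.
dice = sales[sales["item"].isin(["T-Shirt", "Jeans"]) & sales["state"].isin(["Dhaka"])]

# Pivot: rotate the cube so items become rows and states become columns.
pivot = sales.pivot_table(index="item", columns="state",
                          values="dollars_sold", aggfunc="sum", fill_value=0)

print(rollup, drilldown, slice_2022, dice, pivot, sep="\n\n")
```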
Advantages of data cubes:
• Multi-dimensional analysis
• Interactivity
• Speed and efficiency
• Data aggregation
• Helps in giving a summarized view of data.
• Data cubes store large data in a simple way.
• Data cube operations provide quick and better analysis.
• Improve performance of data.
Disadvantages of data cubes:
• Complexity
• Data size limitations
• Performance issues
• Data integrity
• Cost
• Inflexibility
Stars, Snowflakes, and Fact Constellations: Schemas for Multidimensional Data Models
A schema is a logical description of the entire database. It includes the name and description of records of all record types, including all associated data items and aggregates. Much like a database, a data warehouse also needs to maintain a schema. A database uses the relational model, while a data warehouse uses the Star, Snowflake, or Fact Constellation schema. In this section, we will discuss the schemas used in a data warehouse.
Star Schema:
• Each dimension in a star schema is represented with only one dimension table.
• This dimension table contains the set of attributes.
• The following diagram shows the sales data of a company with respect to four dimensions, namely time, item, branch, and location.
• There is a fact table at the center. It contains the keys to each of the four dimensions.
• The fact table also contains the measures, namely dollars sold and units sold.
A small illustrative code sketch of such a star schema is given below.
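The sketch below (assuming pandas; all keys and values are made up) shows how the central fact table references its dimension tables and how a "star join" resolves the keys for reporting.

```python
# Illustrative star schema: one fact table with foreign keys and measures,
# plus separate dimension tables, joined for a report.
import pandas as pd

time_dim = pd.DataFrame({"time_key": [1, 2], "quarter": ["Q1", "Q2"], "year": [2022, 2022]})
item_dim = pd.DataFrame({"item_key": [10, 11], "item_name": ["T-Shirt", "Jeans"],
                         "brand": ["BrandA", "BrandB"], "type": ["clothing", "clothing"]})
branch_dim = pd.DataFrame({"branch_key": [100], "branch_name": ["Agora Dhanmondi"]})
location_dim = pd.DataFrame({"location_key": [7], "city": ["Dhaka"], "country": ["Bangladesh"]})

# Fact table: keys to each dimension plus the measures dollars_sold and units_sold.
sales_fact = pd.DataFrame({
    "time_key": [1, 2], "item_key": [10, 11],
    "branch_key": [100, 100], "location_key": [7, 7],
    "dollars_sold": [1200.0, 800.0], "units_sold": [60, 20],
})

# Star join: resolve the keys against the dimension tables.
report = (sales_fact
          .merge(time_dim, on="time_key")
          .merge(item_dim, on="item_key")
          .merge(branch_dim, on="branch_key")
          .merge(location_dim, on="location_key"))

print(report[["year", "quarter", "item_name", "city", "dollars_sold", "units_sold"]])
```

A snowflake schema would simply normalize some of these dimension tables further, as described next.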
Snowflake Schema
• Some dimension tables in the snowflake schema are normalized.
• The normalization splits up the data into additional tables.
• Unlike the star schema, the dimension tables in a snowflake schema are normalized. For example, the item dimension table of the star schema is normalized and split into two dimension tables, namely the item and supplier tables.
• Now the item dimension table contains the attributes item_key, item_name, type, brand, and supplier_key.
• The supplier key is linked to the supplier dimension table. The supplier dimension table contains the attributes supplier_key and supplier_type.
Fact Constellation Schema
• A fact constellation has multiple fact tables. It is also known as a galaxy schema.
• The following diagram shows two fact tables, namely sales and shipping.
• The sales fact table is the same as that in the star schema.
• The shipping fact table has five dimensions, namely item_key, time_key, shipper_key, from_location, and to_location.
• The shipping fact table also contains two measures, namely dollars sold and units sold.
• It is also possible to share dimension tables between fact tables. For example, the time, item, and location dimension tables are shared between the sales and shipping fact tables.
Extraction, Transformation, and Loading (ETL)
▪ Data extraction – get data from multiple, heterogeneous, and external sources
▪ Data cleaning – detect errors in the data and rectify them when possible
▪ Data transformation – convert data from legacy or host format to warehouse format
▪ Load – sort, summarize, consolidate, compute views, check integrity, and build indices and partitions
▪ Refresh – propagate the updates from the data sources to the warehouse
A toy end-to-end sketch of these steps follows below.
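This is a toy sketch only, assuming pandas and SQLite standing in for the warehouse; the table and column names are hypothetical, and the "extract" step is simulated with an in-memory table instead of an operational source or gateway.

```python
# Toy ETL pipeline: extract -> clean -> transform -> load (SQLite as the "warehouse").
import pandas as pd
import sqlite3

# Extract: data pulled from a (simulated) operational source.
raw = pd.DataFrame({
    "order_id":   [1, 2, 3, 4],
    "order_date": ["2022-01-05", "2022-01-07", None, "2022-02-11"],
    "country":    ["bd", "BD", "bd", "IN"],
    "amount":     [120.0, 80.0, 55.0, None],
})

# Clean: detect simple errors and rectify or drop them.
clean = raw.dropna(subset=["order_date", "amount"]).copy()
clean["country"] = clean["country"].str.upper()

# Transform: convert to the warehouse format (consistent names, derived attributes).
clean = clean.rename(columns={"amount": "dollars_sold"})
clean["order_date"] = pd.to_datetime(clean["order_date"])
clean["year"] = clean["order_date"].dt.year

# Load: summarize, consolidate, and write into the warehouse.
summary = clean.groupby(["year", "country"], as_index=False)["dollars_sold"].sum()
with sqlite3.connect(":memory:") as conn:
    summary.to_sql("sales_summary", conn, if_exists="replace", index=False)
    print(pd.read_sql("SELECT * FROM sales_summary", conn))

# Refresh would simply rerun this pipeline to propagate source updates.
```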
2 Classification:
2.1.1 Basic Concepts,
https://www.datacamp.com/blog/classification-machine-learning
2.1.2 Decision Tree Induction,
https://www.softwaretestinghelp.com/decision-tree-algorithm-examples-data-mining/
2.1.3 Bayes Classification Methods,
2.1.3.1 Naïve Bayes Classifier Algorithm
The Naïve Bayes algorithm is a supervised learning algorithm, based on Bayes' theorem and used for solving classification problems.
It is mainly used in text classification with high-dimensional training datasets.
The Naïve Bayes classifier is one of the simplest and most effective classification algorithms; it helps build fast machine learning models that can make quick predictions.
It is a probabilistic classifier, which means it predicts on the basis of the probability of an object.
Some popular examples of the Naïve Bayes algorithm are spam filtering, sentiment analysis, and classifying articles.
Why is it called Naïve Bayes?
The Naïve Bayes algorithm is comprised of two words, Naïve and Bayes, which can be described as:
Naïve: It is called naïve because it assumes that the occurrence of a certain feature is independent of the occurrence of other features. For example, if a fruit is identified on the basis of color, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple. Hence each feature individually contributes to identifying it as an apple, without depending on the others.
Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.
Bayes' Theorem:
Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to determine the probability of a hypothesis with prior
knowledge. It depends on the conditional probability.
The formula for Bayes' theorem is given as:
P(A|B) = P(B|A) * P(A) / P(B)
Where,
P(A|B) is Posterior probability: Probability of hypothesis A on the observed event B.
P(B|A) is Likelihood probability: Probability of the evidence given that the hypothesis is true.
P(A) is Prior Probability: Probability of hypothesis before observing the evidence.
P(B) is Marginal Probability: Probability of Evidence.
Working of Naïve Bayes' Classifier:
Working of the Naïve Bayes classifier can be understood with the help of the example below:
Suppose we have a dataset of weather conditions and a corresponding target variable "Play". Using this dataset, we need to decide whether we should play or not on a particular day according to the weather conditions. To solve this problem, we need to follow the steps below:
1. Convert the given dataset into frequency tables.
2. Generate the likelihood table by finding the probabilities of the given features.
3. Use Bayes' theorem to calculate the posterior probability.
Problem: If the weather is sunny, should the player play or not?
Solution: To solve this, first consider the dataset below:

      Outlook     Play
0     Rainy       Yes
1     Sunny       Yes
2     Overcast    Yes
3     Overcast    Yes
4     Sunny       No
5     Rainy       Yes
6     Sunny       Yes
7     Overcast    Yes
8     Rainy       No
9     Sunny       No
10    Sunny       Yes
11    Rainy       No
12    Overcast    Yes
13    Overcast    Yes

Frequency table for the weather conditions:

Weather     Yes    No
Overcast    5      0
Rainy       2      2
Sunny       3      2
Total       10     4

Likelihood table for the weather conditions:

Weather     No             Yes
Overcast    0              5              5/14 = 0.35
Rainy       2              2              4/14 = 0.29
Sunny       2              3              5/14 = 0.35
All         4/14 = 0.29    10/14 = 0.71

Applying Bayes' theorem:
P(Yes|Sunny) = P(Sunny|Yes) * P(Yes) / P(Sunny)
P(Sunny|Yes) = 3/10 = 0.3
P(Sunny) = 0.35
P(Yes) = 0.71
So P(Yes|Sunny) = 0.3 * 0.71 / 0.35 = 0.60 (approximately)
P(No|Sunny) = P(Sunny|No) * P(No) / P(Sunny)
P(Sunny|No) = 2/4 = 0.5
P(No) = 0.29
P(Sunny) = 0.35
So P(No|Sunny) = 0.5 * 0.29 / 0.35 = 0.41 (approximately)
As we can see from the above calculation, P(Yes|Sunny) > P(No|Sunny).
Hence, on a sunny day, the player can play the game.
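The calculation above can be reproduced in a few lines of Python; this sketch (assuming pandas is available) rebuilds the frequency table and applies Bayes' theorem for Outlook = Sunny.

```python
# Reproduce the worked example: frequency table + Bayes' theorem for "Sunny".
import pandas as pd

outlook = ["Rainy", "Sunny", "Overcast", "Overcast", "Sunny", "Rainy", "Sunny",
           "Overcast", "Rainy", "Sunny", "Sunny", "Rainy", "Overcast", "Overcast"]
play = ["Yes", "Yes", "Yes", "Yes", "No", "Yes", "Yes",
        "Yes", "No", "No", "Yes", "No", "Yes", "Yes"]
df = pd.DataFrame({"Outlook": outlook, "Play": play})

freq = pd.crosstab(df["Outlook"], df["Play"])      # frequency table (counts)

p_yes = (df["Play"] == "Yes").mean()               # 10/14 ~ 0.71
p_no = (df["Play"] == "No").mean()                 # 4/14  ~ 0.29
p_sunny = (df["Outlook"] == "Sunny").mean()        # 5/14  ~ 0.35
p_sunny_given_yes = freq.loc["Sunny", "Yes"] / freq["Yes"].sum()   # 3/10 = 0.3
p_sunny_given_no = freq.loc["Sunny", "No"] / freq["No"].sum()      # 2/4  = 0.5

p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
p_no_given_sunny = p_sunny_given_no * p_no / p_sunny
print(round(p_yes_given_sunny, 2), round(p_no_given_sunny, 2))     # about 0.60 vs 0.40
```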
Advantages of Naïve Bayes Classifier:
Naïve Bayes is one of the fastest and easiest ML algorithms for predicting the class of a dataset.
It can be used for binary as well as multi-class classification.
It performs well in multi-class predictions compared to other algorithms.
It is the most popular choice for text classification problems.
Disadvantages of Naïve Bayes Classifier:
Naive Bayes assumes that all features are independent or unrelated, so it cannot learn the relationship between features.
Applications of Naïve Bayes Classifier:
It is used for Credit Scoring.
It is used in medical data classification.
It can be used in real-time predictions because Naïve Bayes Classifier is an eager learner.
It is used in Text classification such as Spam filtering and Sentiment analysis.
Types of Naïve Bayes Model:
There are three types of Naive Bayes Model, which are given below:
Gaussian: The Gaussian model assumes that features follow a normal distribution. This means if predictors take continuous values
instead of discrete, then the model assumes that these values are sampled from the Gaussian distribution.
Multinomial: The Multinomial Naïve Bayes classifier is used when the data is multinomially distributed. It is primarily used for document classification problems, i.e., determining which category a particular document belongs to, such as sports, politics, or education. The classifier uses the frequency of words as the predictors.
Bernoulli: The Bernoulli classifier works similarly to the Multinomial classifier, but the predictor variables are independent Boolean variables, such as whether a particular word is present or not in a document. This model is also well known for document classification tasks.
2.1.4 Rule-Based Classification,
IF-THEN Rules
A rule-based classifier makes use of a set of IF-THEN rules for classification. We can express a rule in the following form −
IF condition THEN conclusion
Let us consider a rule R1,
R1: IF age = youth AND student = yes
THEN buy_computer = yes
Points to remember −
The IF part of the rule is called the rule antecedent or precondition.
The THEN part of the rule is called the rule consequent.
The antecedent (condition) part consists of one or more attribute tests, and these tests are logically ANDed.
The consequent part consists of the class prediction.
Note − We can also write rule R1 as follows −
R1: (age = youth) ∧ (student = yes) ⇒ (buys_computer = yes)
If the condition holds true for a given tuple, then the antecedent is satisfied.
Rule Extraction
Here we will learn how to build a rule-based classifier by extracting IF-THEN rules from a decision tree.
Points to remember −
To extract a rule from a decision tree −
One rule is created for each path from the root to a leaf node.
To form the rule antecedent, each splitting criterion along the path is logically ANDed.
The leaf node holds the class prediction, forming the rule consequent.
A tiny illustrative sketch of a rule-based classifier is given below.
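The sketch below is illustrative only (it is not a standard library API): rules are stored as antecedent/consequent pairs, and the first rule whose antecedent is satisfied fires; the second rule and the attribute values are hypothetical.

```python
# A minimal rule-based classifier: IF-THEN rules with ANDed attribute tests.
rules = [
    ({"age": "youth", "student": "yes"}, "buys_computer = yes"),              # rule R1
    ({"age": "senior", "credit_rating": "excellent"}, "buys_computer = yes"), # hypothetical R2
]

def classify(record, rules, default="buys_computer = no"):
    for antecedent, consequent in rules:
        if all(record.get(attr) == value for attr, value in antecedent.items()):
            return consequent      # first rule whose antecedent is satisfied fires
    return default                 # fallback class when no rule covers the record

print(classify({"age": "youth", "student": "yes", "income": "medium"}, rules))
# buys_computer = yes
```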
2.1.5 Model Evaluation and Selection.
https://neptune.ai/blog/ml-model-evaluation-and-selection
3 Classification Advanced Topics:
3.1.1 Techniques to Improve,
3.1.2 Classification Accuracy:
3.1.3 Ensemble Methods
Ensemble learning helps improve machine learning results by combining several models. This approach allows the production of
better predictive performance compared to a single model. Basic idea is to learn a set of classifiers (experts) and to allow them to vote.
Advantage: Improvement in predictive accuracy.
Disadvantage: It is difficult to understand an ensemble of classifiers.
Why do ensembles work?
Dietterich(2002) showed that ensembles overcome three problems –
Statistical Problem –
The Statistical Problem arises when the hypothesis space is too large for the amount of available data. Hence, there are many
hypotheses with the same accuracy on the data and the learning algorithm chooses only one of them! There is a risk that the accuracy
of the chosen hypothesis is low on unseen data!
Computational Problem –
The Computational Problem arises when the learning algorithm cannot guarantee finding the best hypothesis.
Representational Problem –
The Representational Problem arises when the hypothesis space does not contain any good approximation of the target class(es).
Main Challenge for Developing Ensemble Models?
The main challenge is not to obtain highly accurate base models, but rather to obtain base models which make different kinds of
errors. For example, if ensembles are used for classification, high accuracies can be accomplished if different base models misclassify
different training examples, even if the base classifier accuracy is low.
Methods for Independently Constructing Ensembles –
Majority Vote
Bagging and Random Forest
Randomness Injection
Feature-Selection Ensembles
Error-Correcting Output Coding
Methods for Coordinated Construction of Ensembles –
Boosting
Stacking
Reliable Classification: Meta-Classifier Approach
Co-Training and Self-Training
Types of Ensemble Classifier –
Bagging:
Bagging (Bootstrap Aggregation) is used to reduce the variance of a decision tree. Given a set D of d tuples, at each iteration i a training set Di of d tuples is sampled with replacement from D (i.e., a bootstrap sample). A classifier model Mi is then learned from each training set Di. Each classifier Mi returns its class prediction. The bagged classifier M* counts the votes and assigns the class with the most votes to X (an unknown sample).
Implementation steps of Bagging –
Multiple subsets are created from the original data set with equal tuples, selecting observations with replacement.
A base model is created on each of these subsets.
Each model is learned in parallel from each training set and independent of each other.
The final predictions are determined by combining the predictions from all the models.
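A minimal scikit-learn sketch of these steps, on a toy dataset, is given below; the dataset choice and parameter values are illustrative only (the library's default base estimator for bagging is a decision tree). Exact parameter names can vary slightly between scikit-learn versions.

```python
# Bagging: decision trees trained on bootstrap samples, predictions combined by voting.
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

bagging = BaggingClassifier(
    n_estimators=25,      # number of bootstrap samples / base models
    bootstrap=True,       # sample with replacement from the training set D
    random_state=0,
)
bagging.fit(X_train, y_train)
print("bagging accuracy:", bagging.score(X_test, y_test))
```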
Random Forest:
Random Forest is an extension over bagging. Each classifier in the ensemble is a decision tree classifier and is generated using a
random selection of attributes at each node to determine the split. During classification, each tree votes and the most popular class is
returned.
Implementation steps of Random Forest –
Multiple subsets are created from the original data set, selecting observations with replacement.
A subset of features is selected randomly and whichever feature gives the best split is used to split the node iteratively.
Each tree is grown to its largest possible size (without pruning).
Repeat the above steps and prediction is given based on the aggregation of predictions from n number of trees.
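A corresponding scikit-learn sketch for Random Forest follows; again the dataset and parameter values are only illustrative.

```python
# Random Forest: bagged trees with a random subset of features examined at each split.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,      # number of trees in the ensemble
    max_features="sqrt",   # random attribute subset considered at each node split
    random_state=0,
)
forest.fit(X_train, y_train)
print("random forest accuracy:", forest.score(X_test, y_test))
```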
3.1.4 Bayesian Belief Networks,
A Bayesian belief network is a graphical representation of the probabilistic relationships among a set of random variables. Each variable is conditionally independent of its non-descendants given its parents, so the joint probability is factored into terms of the form P(attribute | parents), i.e., the probability of an attribute conditioned on its parent attribute(s).
(Note: A classifier assigns data in a collection to desired categories.)
Consider this example:
In the figure (omitted here), we have an alarm 'A', a node installed in the house of a person 'gfg', which rings based on two events, burglary 'B' and fire 'F'; these are the parent nodes of the alarm node. The alarm node in turn is the parent of two person nodes, 'P1' and 'P2'.
Upon an instance of burglary or fire, 'P1' and 'P2' call the person 'gfg'. But there are a few drawbacks in this case: sometimes 'P1' may forget to call 'gfg' even after hearing the alarm, as he has a tendency to forget things quickly. Similarly, 'P2' sometimes fails to call 'gfg', as he is only able to hear the alarm from a certain distance.
Q) Find the probability that 'P1' is true (P1 has called 'gfg') and 'P2' is true (P2 has called 'gfg') when the alarm 'A' rang, but no burglary 'B' and no fire 'F' occurred.
=> P(P1, P2, A, ~B, ~F) [where P1, P2 and A are 'true' events and '~B' and '~F' are 'false' events]
[Note: the values mentioned below are not calculated here; they are given (observed) values.]
Burglary ‘B’ –
• P (B=T) = 0.001 (‘B’ is true i.e burglary has occurred)
• P (B=F) = 0.999 (‘B’ is false i.e burglary has not occurred)
Fire ‘F’ –
• P (F=T) = 0.002 (‘F’ is true i.e fire has occurred)
• P (F=F) = 0.998 (‘F’ is false i.e fire has not occurred)
Alarm 'A' –

B    F    P(A=T)   P(A=F)
T    T    0.95     0.05
T    F    0.94     0.06
F    T    0.29     0.71
F    F    0.001    0.999

The alarm 'A' node can be 'true' or 'false' (i.e., it may have rung or may not have rung). It has two parent nodes, burglary 'B' and fire 'F', which can be 'true' or 'false' (i.e., may have occurred or may not have occurred) depending upon different conditions.
Person 'P1' –

A    P(P1=T)   P(P1=F)
T    0.95      0.05
F    0.05      0.95

The person 'P1' node can be 'true' or 'false' (i.e., he may have called the person 'gfg' or not). It has one parent node, the alarm 'A', which can be 'true' or 'false' (i.e., it may have rung or may not have rung, upon burglary 'B' or fire 'F').
Person 'P2' –

A    P(P2=T)   P(P2=F)
T    0.80      0.20
F    0.01      0.99

The person 'P2' node can be 'true' or 'false' (i.e., he may have called the person 'gfg' or not). It has one parent node, the alarm 'A', which can be 'true' or 'false' (i.e., it may have rung or may not have rung, upon burglary 'B' or fire 'F').
Solution: Considering the observed probabilities –
With respect to the question, P(P1, P2, A, ~B, ~F), we need the probability of 'P1', which we find with regard to its parent node, the alarm 'A'. To get the probability of 'P2', we again condition on its parent node, the alarm 'A'. We find the probability of the alarm 'A' node with regard to '~B' and '~F', since burglary 'B' and fire 'F' are the parent nodes of the alarm 'A'.
From the tables above, we can deduce:
P(P1, P2, A, ~B, ~F)
= P(P1|A) * P(P2|A) * P(A|~B, ~F) * P(~B) * P(~F)
= 0.95 * 0.80 * 0.001 * 0.999 * 0.998
≈ 0.00076
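The same calculation can be translated directly into a few lines of Python; the conditional probability values below are simply the ones given in the tables above.

```python
# Joint probability P(P1, P2, A, ~B, ~F) from the CPTs of the belief network.
p_b = {True: 0.001, False: 0.999}                 # burglary prior
p_f = {True: 0.002, False: 0.998}                 # fire prior
p_a = {(True, True): 0.95, (True, False): 0.94,   # P(A=T | B, F)
       (False, True): 0.29, (False, False): 0.001}
p_p1 = {True: 0.95, False: 0.05}                  # P(P1=T | A)
p_p2 = {True: 0.80, False: 0.01}                  # P(P2=T | A)

joint = p_p1[True] * p_p2[True] * p_a[(False, False)] * p_b[False] * p_f[False]
print(round(joint, 5))   # about 0.00076
```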
3.1.5 Classification by Backpropagation,
backpropagation:
Backpropagation is a widely used algorithm for training feedforward neural networks. It computes the gradient of the loss function with respect to the network weights, and it does so far more efficiently than naively computing the gradient with respect to each weight separately. This efficiency makes it possible to use gradient methods to train multi-layer networks and update weights to minimize loss; variants such as gradient descent or stochastic gradient descent are often used.
The backpropagation algorithm works by computing the gradient of the loss function with respect to each weight via the chain rule,
computing the gradient layer by layer, and iterating backward from the last layer to avoid redundant computation of intermediate terms
in the chain rule.
Features of Backpropagation:
• It is a gradient descent method, as used in the simple perceptron network with a differentiable unit.
• It differs from other networks in the way the weights are calculated during the learning period of the network.
• Training is done in three stages:
  - feed-forward of the input training pattern
  - calculation and backpropagation of the error
  - updating of the weights
Working of Backpropagation:
Neural networks use supervised learning to generate output vectors from the input vectors that the network operates on. The network compares the generated output with the desired output and computes an error if they do not match. It then adjusts the weights according to this error to move the output closer to the desired output.
Backpropagation Algorithm:
Step 1: Inputs X, arrive through the preconnected path.
Step 2: The input is modeled using true weights W. Weights are usually chosen randomly.
Step 3: Calculate the output of each neuron from the input layer to the hidden layer to the output layer.
Step 4: Calculate the error in the outputs
Backpropagation Error= Actual Output – Desired Output
Step 5: From the output layer, go back to the hidden layer to adjust the weights to reduce the error.
Step 6: Repeat the process until the desired output is achieved.
Parameters:
x = input training vector, x = (x1, x2, …, xn)
t = target vector, t = (t1, t2, …, tm)
δk = error term at output unit k
δj = error term at hidden unit j
α = learning rate
v0j = bias of hidden unit j
Training Algorithm:
Step 1: Initialize the weights to small random values.
Step 2: While the stopping condition is false, do steps 3 to 10.
Step 3: For each training pair, do steps 4 to 9 (feed-forward).
Step 4: Each input unit receives the input signal xi and transmits it to all units of the layer above (the hidden units).
Step 5: Each hidden unit zj (j = 1 to a) sums its weighted input signals,
zinj = v0j + Σ (i = 1 to n) xi vij
applies its activation function, zj = f(zinj), and sends this signal to all units in the layer above (the output units).
Each output unit yk (k = 1 to m) sums its weighted input signals,
yink = w0k + Σ (j = 1 to a) zj wjk
and applies its activation function to calculate the output signal,
yk = f(yink)
Backpropagation of Error:
Step 6: Each output unit yk (k = 1 to m) receives a target pattern corresponding to the input pattern; its error term is calculated as
δk = (tk – yk) f′(yink)
Step 7: Each hidden unit zj (j = 1 to a) sums its delta inputs from the units in the layer above,
δinj = Σ (k = 1 to m) δk wjk
and its error term is calculated as
δj = δinj f′(zinj)
Updating of weights and biases:
Step 8: Each output unit yk (k = 1 to m) updates its bias and weights (j = 0 to a). The weight correction term is
Δwjk = α δk zj
and the bias correction term is
Δw0k = α δk
Therefore wjk(new) = wjk(old) + Δwjk and w0k(new) = w0k(old) + Δw0k.
Each hidden unit zj (j = 1 to a) updates its bias and weights (i = 0 to n). The weight correction term is
Δvij = α δj xi
and the bias correction term is
Δv0j = α δj
Therefore vij(new) = vij(old) + Δvij and v0j(new) = v0j(old) + Δv0j.
Step 9: Test the stopping condition. The stopping condition can be the minimization of the error or a maximum number of epochs.
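A compact NumPy sketch of this training loop is shown below. It assumes a tiny 2-4-1 sigmoid network learning the XOR problem; the layer sizes, learning rate, and number of epochs are illustrative choices, not part of the notes.

```python
# Minimal batch backpropagation: feed-forward, error terms, weight updates.
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # input patterns
t = np.array([[0], [1], [1], [0]], dtype=float)               # target vector (XOR)

def f(x):
    return 1.0 / (1.0 + np.exp(-x))        # sigmoid activation

def f_prime(out):
    return out * (1.0 - out)               # derivative of the sigmoid, given f(x)

V = rng.normal(scale=0.5, size=(2, 4)); v0 = np.zeros(4)   # input -> hidden weights, biases
W = rng.normal(scale=0.5, size=(4, 1)); w0 = np.zeros(1)   # hidden -> output weights, bias
alpha = 0.5                                                # learning rate

for epoch in range(20000):
    # Feed-forward (steps 4-5)
    z = f(X @ V + v0)                      # hidden layer outputs z_j
    y = f(z @ W + w0)                      # output layer outputs y_k
    # Backpropagation of error (steps 6-7)
    delta_k = (t - y) * f_prime(y)         # output error terms
    delta_j = (delta_k @ W.T) * f_prime(z) # hidden error terms
    # Weight and bias updates (step 8)
    W += alpha * z.T @ delta_k
    w0 += alpha * delta_k.sum(axis=0)
    V += alpha * X.T @ delta_j
    v0 += alpha * delta_j.sum(axis=0)

print(np.round(y.ravel(), 2))   # typically close to [0, 1, 1, 0]
```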
Need for Backpropagation:
Backpropagation is “backpropagation of errors” and is very useful for training neural networks. It’s fast, easy to implement, and
simple. Backpropagation does not require any parameters to be set, except the number of inputs. Backpropagation is a flexible method
because no prior knowledge of the network is required.
Types of Backpropagation
There are two types of backpropagation networks.
Static backpropagation: Static backpropagation is a network designed to map static inputs for static outputs. These types of networks
are capable of solving static classification problems such as OCR (Optical Character Recognition).
Recurrent backpropagation: Recursive backpropagation is another network used for fixed-point learning. Activation in recurrent
backpropagation is feed-forward until a fixed value is reached. Static backpropagation provides an instant mapping, while recurrent
backpropagation does not provide an instant mapping.
Advantages:
It is simple, fast, and easy to program.
It has no parameters to tune apart from the number of inputs.
It is flexible and efficient.
There is no need for users to learn any special functions.
Disadvantages:
It is sensitive to noisy data and irregularities; noisy data can lead to inaccurate results.
Performance is highly dependent on the input data.
Training can take a lot of time.
A matrix-based approach is preferred over a mini-batch approach.
https://www.javatpoint.com/pytorch-backpropagation-process-in-deep-neural-network
3.1.6 Support Vector Machines,
https://www.geeksforgeeks.org/support-vector-machine-algorithm/
https://www.javatpoint.com/machine-learning-support-vector-machine-algorithm
3.1.7 Lazy Learners (or Learning from Your Neighbors):
KNN is often referred to as a lazy learner. This means that the algorithm does not use the training data points to do any generalization; in other words, there is no explicit training phase. The lack of generalization means that KNN keeps all the training data. It is a non-parametric learning algorithm because it does not assume anything about the underlying data. Lazy learning is also known as instance-based learning and memory-based learning; it postpones most of the processing and computation until a query or prediction request arrives.
Here, the algorithm stores the training data set in its original form without deriving general rules from it. When we have a new object to process,
the algorithm searches the training data for the most similar objects and uses them to produce the output, like the k-nearest neighbors (kNN)
algorithm:
In the example above, kNN classifies an unknown point by checking its neighborhood when it arrives as the input.
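A short scikit-learn sketch of this lazy-learning behaviour is given below: fit() essentially just stores the training set, and the neighbour search happens at prediction time. The dataset and the value of k are illustrative only.

```python
# k-nearest neighbors as a lazy learner: no model is built, the data is memorized.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)   # k = 5 nearest neighbours
knn.fit(X_train, y_train)                   # "training" = storing the data
print("kNN accuracy:", knn.score(X_test, y_test))
```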
Difference between eager and lazy learners in data mining
https://www.baeldung.com/cs/lazy-vs-eager-learning
3.1.8 Other Classification Methods
4 Cluster Analysis:
4.1.1 Basic Concepts,
What Is Cluster Analysis?
Cluster analysis or simply clustering is the process of partitioning a set of data objects (or observations) into subsets. Each subset is a
cluster, such that objects in a cluster are similar to one another, yet dissimilar to objects in other clusters. The set of clusters resulting
from a cluster analysis can be referred to as a clustering. In this context, different clustering methods may generate different clusterings on the same data set. The partitioning is not performed by humans, but by the clustering algorithm. Hence, clustering is
useful in that it can lead to the discovery of previously unknown groups within the data.
The basic concept of clustering in data mining is to group similar data objects together based on their characteristics or
attributes.
However, there are several requirements that need to be addressed for effective clustering:
1. Scalability: Clustering algorithms should be able to handle large datasets containing millions or billions of objects. Scalable
algorithms are necessary to avoid biased results that may occur when clustering on a sample of the data.
2. Handling Different Types of Attributes: Clustering algorithms should be able to handle various types of data, including
numeric, binary, nominal, ordinal, and complex data types such as graphs, sequences, images, and documents.
3. Discovery of Clusters with Arbitrary Shape: Algorithms should be capable of detecting clusters of any shape, not just spherical
clusters. This is important when dealing with real-world scenarios where clusters can have diverse and non-standard shapes.
4. Reduced Dependency on Domain Knowledge: Clustering algorithms should minimize the need for users to provide domain
knowledge and input parameters. The quality of clustering should not heavily rely on user-defined parameters, which can be
challenging to determine, especially for high-dimensional datasets.
5. Robustness to Noisy Data: Clustering algorithms should be robust to outliers, missing data, and errors commonly found in real-world datasets. Noise in the data should not significantly impact the quality of the resulting clusters.
6. Incremental Clustering and Insensitivity to Input Order: Algorithms should be capable of incorporating incremental updates
and new data into existing clustering structures without requiring a complete recomputation. Additionally, the order in which
data objects are presented should not drastically affect the resulting clustering.
7. Capability of Clustering High-Dimensionality Data: Clustering algorithms should be able to handle datasets with a large
number of dimensions or attributes, even in cases where the data is sparse and highly skewed.
8. Constraint-Based Clustering: Clustering algorithms should be able to consider and satisfy various constraints imposed by real-world applications. Constraints may include spatial constraints, network constraints, or specific requirements related to the
clustering task.
9. Interpretability and Usability: Clustering results should be interpretable, comprehensible, and usable for users. Clustering
should be tied to specific semantic interpretations and application goals, allowing users to understand and apply the results
effectively.
These requirements highlight the challenges and considerations involved in developing clustering algorithms that can effectively
analyze and group data objects based on their similarities and characteristics.
4.1.2 Partitioning Methods,
The simplest and most fundamental version of cluster analysis is partitioning, which organizes the objects of a set into several
exclusive groups or clusters. To keep the problem specification concise, we can assume that the number of clusters is given as
background knowledge. This parameter is the starting point for partitioning methods. Formally, given a data set, D, of n objects, and k,
the number of clusters to form, a partitioning algorithm organizes the objects into k partitions (k ≤ n), where each partition represents
a cluster. The clusters are formed to optimize an objective partitioning criterion, such as a dissimilarity function based on distance, so
that the objects within a cluster are “similar” to one another and “dissimilar” to objects in other clusters in terms of the data set
attributes.
The most well-known and commonly used partitioning methods are
▪ the k-means method
▪ the k-medoids method
Centroid-Based Technique:
The k-Means Method: The k-means algorithm takes the input parameter k and partitions a set of n objects into k clusters so that the resulting intracluster similarity is high but the intercluster similarity is low. Cluster similarity is measured with regard to the mean value of the objects in a cluster, which can be viewed as the cluster's centroid or center of gravity. The k-means algorithm proceeds as follows:
▪ First, it randomly selects k of the objects, each of which initially represents a cluster mean or center.
▪ For each of the remaining objects, an object is assigned to the cluster to which it is most similar, based on the distance between the object and the cluster mean.
▪ It then computes the new mean for each cluster.
▪ This process iterates until the criterion function converges.
Typically, the square-error criterion is used, defined as
E = Σ (i = 1 to k) Σ (p ∈ Ci) |p − mi|²
where E is the sum of the square error for all objects in the data set, p is the point in space representing a given object, and mi is the mean of cluster Ci.
The k-means partitioning algorithm: The k-means algorithm for partitioning, where each cluster’s center is represented by the mean
value of the objects in the cluster.
(Figure omitted: clustering of a set of objects based on the k-means method.)
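A brief sketch of these steps, assuming scikit-learn and a made-up 2-D toy data set, is shown below; labels_, cluster_centers_, and inertia_ correspond to the cluster assignments, the means mi, and the square-error criterion E.

```python
# k-means: assign points to the nearest centroid, recompute means, iterate.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.0], [1.5, 2.0], [1.2, 0.8],      # toy 2-D data set
              [8.0, 8.0], [8.5, 9.0], [7.8, 8.2]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("labels:   ", km.labels_)            # cluster id of each object
print("centroids:", km.cluster_centers_)   # the cluster means m_i
print("SSE (E):  ", km.inertia_)           # the square-error criterion
```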
The k-Medoids Method
The k-means algorithm is sensitive to outliers because an object with an extremely large value may substantially distort the distribution
of data. This effect is particularly exacerbated due to the use of the square-error function. Instead of taking the mean value of the objects
in a cluster as a reference point, we can pick actual objects to represent the clusters, using one representative object per cluster.
Each remaining object is assigned to the cluster of the representative object to which it is most similar. The partitioning method is then performed based on the principle of minimizing the sum of the dissimilarities between each object and its corresponding reference point. That is, an absolute-error criterion is used, defined as
E = Σ (j = 1 to k) Σ (p ∈ Cj) dist(p, oj)
where E is the sum of the absolute error for all objects in the data set, p is the point in space representing a given object in cluster Cj, and oj is the representative object of Cj. The initial representative objects are chosen arbitrarily.
The iterative process of replacing representative objects by non representative objects continues as long as the quality of the resulting
clustering is improved. This quality is estimated using a cost function that measures the average dissimilarity between an object and the
representative object of its cluster. To determine whether a non representative object, oj random, is a good replacement for a current
representative object, oj, the following four cases are examined for each of the nonrepresentative objects.
Case 1: p currently belongs to representative object oj. If oj is replaced by o_random as a representative object and p is closest to one of the other representative objects oi, i ≠ j, then p is reassigned to oi.
Case 2: p currently belongs to representative object oj. If oj is replaced by o_random as a representative object and p is closest to o_random, then p is reassigned to o_random.
Case 3: p currently belongs to representative object oi, i ≠ j. If oj is replaced by o_random as a representative object and p is still closest to oi, then the assignment does not change.
Case 4: p currently belongs to representative object oi, i ≠ j. If oj is replaced by o_random as a representative object and p is closest to o_random, then p is reassigned to o_random.
The k-Medoids Algorithm: The k-medoids algorithm for partitioning based on medoid or central objects.
https://educatech.in/classical-partitioning-methods-in-data-mining
4.1.3 Hierarchical Methods
https://www.saedsayad.com/clustering_hierarchical.htm
4.1.3.1 Hierarchical Clustering
Hierarchical clustering involves creating clusters that have a predetermined ordering from top to bottom. For example, all files and folders on the
hard disk are organized in a hierarchy. There are two types of hierarchical clustering, Divisive and Agglomerative.
Divisive method
In divisive or top-down clustering method we assign all of the observations to a single cluster and then partition the cluster to two least similar
clusters using a flat clustering method (e.g., K-Means). Finally, we proceed recursively on each cluster until there is one cluster for each observation.
There is evidence that divisive algorithms produce more accurate hierarchies than agglomerative algorithms in some circumstances, but they are conceptually more complex.
Agglomerative method
In agglomerative or bottom-up clustering method we assign each observation to its own cluster. Then, compute the similarity (e.g., distance)
between each of the clusters and join the two most similar clusters. Finally, repeat steps 2 and 3 until there is only a single cluster left. The related
algorithm is shown below.
Before any clustering is performed, it is required to determine the proximity matrix containing the distance between each point using a distance
function. Then, the matrix is updated to display the distance between each cluster. The following three methods differ in how the distance between
each cluster is measured.
Single Linkage
In single linkage hierarchical clustering, the distance between two clusters is defined as the shortest distance between two points in each cluster.
For example, the distance between clusters “r” and “s” to the left is equal to the length of the arrow between their two closest points.
Complete Linkage
In complete linkage hierarchical clustering, the distance between two clusters is defined as the longest distance between two points in each cluster.
For example, the distance between clusters “r” and “s” to the left is equal to the length of the arrow between their two furthest points.
Average Linkage
In average linkage hierarchical clustering, the distance between two clusters is defined as the average distance between each point in one cluster and every point in the other cluster. For example, the distance between clusters "r" and "s" to the left is equal to the average length of the arrows connecting the points of one cluster to the points of the other.
Example: Clustering the following 7 data points.

      X1    X2
A     10     5
B      1     4
C      5     8
D      9     2
E     12    10
F     15     8
G      7     7
Step 1: Calculate the distances between all data points using the Euclidean distance function. The shortest distance (2.24) is between data points C and G, so they are merged first.

       A       B       C       D       E       F
B     9.06
C     5.83    5.66
D     3.16    8.25    7.21
E     5.39   12.53    7.28    8.54
F     5.83   14.56   10.00    8.49    3.61
G     3.61    6.71    2.24    5.39    5.83    8.06
Step 2: We use "Average Linkage" to measure the distance between the "C,G" cluster and the other data points (in this example the cluster distances are computed between cluster centroids).

        A       B      C,G      D
B      9.06
C,G    4.72    6.10
D      3.16    8.25    6.26
E      5.39   12.53    6.50    8.54
F      5.83   14.56    9.01    8.49    3.61 (E)

The smallest distance (3.16) is now between A and D, so they are merged next.

Step 3:

        A,D     B      C,G
B      8.51
C,G    5.32    6.10
E      6.96   12.53    6.50
F      7.11   14.56    9.01    3.61 (E)

The smallest distance (3.61) is between E and F, so they are merged next.

Step 4:

        A,D     B      C,G
B      8.51
C,G    5.32    6.10
E,F    6.80   13.46    7.65

The smallest distance (5.32) is between the "A,D" and "C,G" clusters, so they are merged next.

Step 5:

          A,D,C,G     B
B          6.91
E,F        6.73     13.46

The smallest distance (6.73) is between "A,D,C,G" and "E,F", so they are merged next.

Step 6:

          A,D,C,G,E,F
B            9.07

Finally, B is merged with the rest, completing the hierarchy.

Final dendrogram: (figure omitted; the SciPy sketch below reproduces these merges and draws the dendrogram)
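The example can be reproduced with SciPy, as sketched below. The notes label the linkage "Average", but the distances in the tables match centroid distances, so method="centroid" is used here; switching to method="average" would give classical average linkage. Matplotlib is assumed for the plot.

```python
# Hierarchical clustering of points A-G and the resulting dendrogram.
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt

points = np.array([[10, 5], [1, 4], [5, 8], [9, 2], [12, 10], [15, 8], [7, 7]])
labels = ["A", "B", "C", "D", "E", "F", "G"]

Z = linkage(points, method="centroid")   # "average" would use average linkage instead
print(Z)   # each row: the two clusters merged and the distance at which they merge

dendrogram(Z, labels=labels)
plt.title("Hierarchical clustering of points A-G")
plt.show()
```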
4.1.4 Density-Based Methods
What is Density-based clustering?
Density-Based Clustering refers to one of the most popular unsupervised learning methodologies used in model building and machine
learning algorithms. The data points in the region separated by two clusters of low point density are considered as noise. The
surroundings with a radius ε of a given object are known as the ε neighborhood of the object. If the ε neighborhood of the object
comprises at least a minimum number, MinPts of objects, then it is called a core object.
https://www.javatpoint.com/density-based-clustering-in-data-mining
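A minimal DBSCAN sketch (assuming scikit-learn; the points are made up) illustrates these definitions: eps plays the role of the radius ε, min_samples the role of MinPts, and points labelled -1 are treated as noise.

```python
# DBSCAN: core objects have at least min_samples points within radius eps.
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.0], [1.2, 1.1], [0.9, 0.8],     # dense region 1
              [8.0, 8.0], [8.1, 8.2], [7.9, 8.1],     # dense region 2
              [4.5, 12.0]])                           # an isolated (noise) point

db = DBSCAN(eps=0.5, min_samples=3).fit(X)
print(db.labels_)        # e.g. [0 0 0 1 1 1 -1]; -1 marks noise
```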
5 Segment 5
5.1 Outliers Detection and Analysis:
An outlier is an object that deviates significantly from the rest of the objects. Outliers can be caused by measurement or execution errors. The analysis of outlier data is referred to as outlier analysis or outlier mining.
Why outlier analysis?
Most data mining methods discard outliers as noise or exceptions. However, in some applications, such as fraud detection, the rare events can be more interesting than the regularly occurring ones, and hence outlier analysis becomes important in such cases.
Why Should You Detect Outliers?
In the machine learning pipeline, data cleaning and preprocessing is an important step as it helps you better understand the data.
During this step, you deal with missing values, detect outliers, and more.
As outliers are very different values—abnormally low or abnormally high—their presence can often skew the results of statistical
analyses on the dataset. This could lead to less effective and less useful models.
But dealing with outliers often requires domain expertise, and none of the outlier detection techniques should be applied without
understanding the data distribution and the use case.
For example, in a dataset of house prices, if you find a few houses priced at around $1.5 million—much higher than the median house
price, they’re likely outliers. However, if the dataset contains a significantly large number of houses priced at $1 million and above—
they may be indicative of an increasing trend in house prices. So it would be incorrect to label them all as outliers. In this case, you
need some knowledge of the real estate domain.
The goal of outlier detection is to remove the points—which are truly outliers—so you can build a model that performs well on unseen
test data. We’ll go over a few techniques that’ll help us detect outliers in data.
How to Detect Outliers Using Standard Deviation
When the data, or certain features in the dataset, follow a normal distribution, you can use the standard deviation of the data, or the
equivalent z-score to detect outliers.
In statistics, standard deviation measures the spread of data around the mean, and in essence, it captures how far away from the mean
the data points are.
For data that is normally distributed, around 68.2% of the data will lie within one standard deviation from the mean. Close to 95.4%
and 99.7% of the data lie within two and three standard deviations from the mean, respectively.
Let’s denote the standard deviation of the distribution by σ, and the mean by μ.
One approach to outlier detection is to set the lower limit to three standard deviations below the mean (μ - 3*σ), and the upper limit to
three standard deviations above the mean (μ + 3*σ). Any data point that falls outside this range is detected as an outlier.
As 99.7% of the data typically lies within three standard deviations, the number of outliers will be close to 0.3% of the size of the
dataset.
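A small NumPy sketch of this three-sigma rule on synthetic data (the injected value 25.0 is an assumed anomaly used only for illustration):

import numpy as np

# Synthetic feature: 200 normally distributed values plus one abnormal value
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(loc=10, scale=0.3, size=200), [25.0]])

mu, sigma = data.mean(), data.std()
lower, upper = mu - 3 * sigma, mu + 3 * sigma

# Any value outside [mu - 3*sigma, mu + 3*sigma] is flagged as an outlier
outliers = data[(data < lower) | (data > upper)]
print(outliers)   # the injected value 25.0 should be flagged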
5.2 Outliers Detection Methods
Outlier detection methods can be categorized in two orthogonal ways: based on the availability of domain expert-labeled data and
based on assumptions about normal objects versus outliers. Let's summarize each category and provide an example for each model:
Categorization based on the availability of labeled data:
a. Supervised Methods: If expert-labeled examples of normal and/or outlier objects are available, supervised methods can be used.
These methods model data normality and abnormality as a classification problem. For example, a domain expert labels a sample of
data as either normal or outlier, and a classifier is trained to recognize outliers based on these labels.
Example: In credit card fraud detection, historical transactions are labeled as normal or fraudulent by domain experts. A supervised
method can be trained to classify new transactions as either normal or fraudulent based on the labeled training data.
b. Unsupervised Methods: When labeled examples are not available, unsupervised methods are used. These methods assume that
normal objects follow a pattern more frequently than outliers. The goal is to identify objects that deviate significantly from the
expected patterns.
Example: In network intrusion detection, normal network traffic is expected to follow certain patterns. Unsupervised methods can
analyze network traffic data and identify deviations from these patterns as potential intrusions.
c. Semi-Supervised Methods: In some cases, a small set of labeled data is available, but most of the data is unlabeled. Semi-supervised
methods combine the use of labeled and unlabeled data to build outlier detection models.
Example: In anomaly detection for manufacturing processes, some labeled normal samples may be available, along with a large
amount of unlabeled data. Semi-supervised methods can leverage the labeled samples and neighboring unlabeled data to detect
anomalies in the manufacturing process.
Categorization based on assumptions about normal objects versus outliers:
a. Statistical Methods: Statistical or model-based methods assume that normal data objects are generated by a statistical model, and
objects not following the model are considered outliers. These methods estimate the likelihood of an object being generated by the
model.
Example: Using a Gaussian distribution as a statistical model, objects falling into regions with low probability density can be
considered outliers.
b. Proximity-Based Methods: Proximity-based methods identify outliers based on the proximity of an object to its neighbors in feature
space. If an object's neighbors are significantly different from the neighbors of most other objects, it can be classified as an outlier.
Example: By considering the nearest neighbors of an object, if its proximity to these neighbors deviates significantly from the
proximity of other objects, it can be identified as an outlier.
c. Clustering-Based Methods: Clustering-based methods assume that normal data objects belong to large and dense clusters, while
outliers belong to small or sparse clusters or do not belong to any cluster at all.
Example: If a clustering algorithm identifies a small cluster or data points that do not fit into any cluster, these points can be
considered outliers.
These are general categories of outlier detection methods, and there are numerous specific algorithms and techniques within each
category. The examples provided illustrate the application of each method in different domains, highlighting their utilization in outlier
detection.
5.2.1 Mining Contextual and Collective Outliers
Contextual outlier
The main concept of contextual outlier detection is to identify objects in a dataset that deviate significantly within a specific context. Contextual
attributes, such as spatial attributes, time, network locations, and structured attributes, define the context in which outliers are evaluated.
Behavioral attributes, on the other hand, determine the characteristics of an object and are used to assess its outlier status within its context.
There are two categories of methods for contextual outlier detection based on the identification of contexts:
1. Transforming Contextual Outlier Detection to Conventional Outlier Detection: In situations where contexts can be clearly identified, this
method involves transforming the contextual outlier detection problem into a standard outlier detection problem. The process involves
two steps. First, the context of an object is identified using contextual attributes. Then, the outlier score for the object within its context
is calculated using a conventional outlier detection method.
For example, in customer-relationship management, outlier customers can be detected within the context of customer groups. By grouping
customers based on contextual attributes like age group and postal code, comparisons can be made within the same group using conventional
outlier detection techniques.
2. Modeling Normal Behavior with Respect to Contexts: In applications where it is difficult to explicitly partition the data into contexts, this
method focuses on modeling the normal behavior of objects with respect to their contexts. A training dataset is used to train a model
that predicts the expected behavior attribute values based on contextual attribute values. To identify contextual outliers, the model is
applied to the contextual attributes of an object, and if its behavior attribute values significantly deviate from the predicted values, it is
considered an outlier.
For instance, in an online store recording customer browsing behavior, the goal may be to detect contextual outliers when a customer purchases
a product unrelated to their recent browsing history. A prediction model can be trained to link the browsing context with the expected behavior,
and deviations from the predicted behavior can indicate contextual outliers.
In summary, contextual outlier detection expands upon conventional outlier detection by considering the context in which objects are evaluated.
By incorporating contextual information, outliers that cannot be detected otherwise can be identified, and false alarms can be reduced.
Contextual attributes play a crucial role in defining the context, and various methods, such as transforming the problem or modeling normal
behavior, can be employed based on the availability and identification of contexts.
Collective outlier:
Collective outlier detection aims to identify groups of data objects that, as a whole, deviate significantly from the entire dataset. It involves
examining the structure of the dataset and relationships between multiple data objects. There are two main approaches for collective outlier
detection:
Reducing to Conventional Outlier Detection: This approach identifies structure units within the data, such as subsequences, local areas, or
subgraphs. Each structure unit is treated as a data object, and features are extracted from them. The problem is then transformed into outlier
detection on the set of structured objects. A structure unit is considered a collective outlier if it deviates significantly from the expected trend in
the extracted features.
Modeling Expected Behavior of Structure Units: This approach directly models the expected behavior of structure units. For example, a Markov
model can be learned from temporal sequences. Subsequences that deviate significantly from the model are considered collective outliers.
The structures in collective outlier detection are often not explicitly defined and need to be discovered during the outlier detection process. This
makes it more challenging than conventional and contextual outlier detection. The exploration of data structures relies on heuristics and can be
application-dependent.
Overall, collective outlier detection is a complex task that requires further research and development.
Outlier detection process:
The outlier detection process involves identifying and flagging data objects that deviate significantly from the expected patterns or
behaviors of the majority of the dataset. While the specific steps may vary depending on the approach or algorithm used, the general
outlier detection process can be outlined as follows:
1. Data Preparation: This step involves collecting and preparing the dataset for outlier detection. It may include data cleaning,
normalization, and handling missing values or outliers that are already known.
2. Feature Selection/Extraction: Relevant features or attributes are selected or extracted from the dataset to represent the
characteristics of the data objects. Choosing appropriate features is crucial for effective outlier detection.
3. Define the Expected Normal Behavior: The next step is to establish the expected normal behavior or patterns of the data
objects. This can be done through statistical analysis, domain knowledge, or learning from a training dataset.
4. Outlier Detection Algorithm/Application: An outlier detection algorithm or technique is applied to the dataset to identify
potential outliers. There are various approaches available, including statistical methods (e.g., z-score, boxplot), distance-based
methods (e.g., k-nearest neighbors), density-based methods (e.g., DBSCAN), and machine learning-based methods (e.g.,
isolation forest, one-class SVM).
5. Outlier Score/Threshold: Each data object is assigned an outlier score or measure indicating its degree of deviation from the
expected behavior. The outlier score can be based on distance, density, probability, or other statistical measures. A threshold
value is set to determine which objects are considered outliers based on their scores.
6. Outlier Identification/Visualization: Objects with outlier scores exceeding the threshold are identified as outliers. They can be
flagged or labeled for further analysis. Visualization techniques, such as scatter plots, heatmaps, or anomaly maps, can be used
to visually explore and interpret the detected outliers.
7. Validation/Evaluation: The detected outliers should be validated and evaluated to assess their significance and impact. This
may involve domain experts reviewing the flagged outliers, conducting further analysis, or performing outlier impact analysis
on the overall system or process.
8. Iteration and Refinement: The outlier detection process may require iterations and refinements based on feedback, domain
knowledge, or additional data. This allows for continuous improvement and adaptation to changing data patterns or
requirements.
It's important to note that outlier detection is a complex task, and the effectiveness of the process depends on the quality of the data,
the selection of appropriate features, and the choice of an appropriate outlier detection method for the specific dataset and application.
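As a concrete illustration of steps 4 to 6, here is a hedged sketch using scikit-learn's IsolationForest on synthetic data; the contamination fraction is an assumption made for this example, not a general recommendation:

import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic 2-D data: a dense normal cluster plus a few injected anomalies
rng = np.random.default_rng(42)
normal = rng.normal(loc=0.0, scale=1.0, size=(300, 2))
anomalies = rng.uniform(low=6, high=9, size=(5, 2))
X = np.vstack([normal, anomalies])

# Fit the detector; contamination is the assumed fraction of outliers
iso = IsolationForest(contamination=0.02, random_state=0).fit(X)

scores = iso.decision_function(X)   # lower scores = more anomalous
labels = iso.predict(X)             # -1 = outlier, 1 = inlier

print("Number flagged as outliers:", (labels == -1).sum())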
Why outlier can be more important than normal data?
Let's consider a dataset representing monthly sales revenue for a retail company over the course of a year. The dataset contains 12 data
points, one for each month.
Normal Data: January: $10,000 February: $12,000 March: $11,000 April: $10,500 May: $10,200 June: $10,300 July: $10,400 August:
$10,100 September: $10,200 October: $10,100 November: $10,000 December: $10,300
Outlier Data: January: $10,000 February: $12,000 March: $11,000 April: $10,500 May: $10,200 June: $10,300 July: $10,400 August:
$10,100 September: $10,200 October: $10,100 November: $10,000 December: $100,000
In this example, the normal data represents the regular monthly sales revenue for the retail company. These values are consistent,
within a certain range, and can be considered as the expected or typical sales figures.
However, in December, there is an outlier data point where the sales revenue is $100,000, which is significantly higher than the other
months. This can be due to a variety of reasons, such as a one-time large order from a major client, a seasonal spike in sales, or an
error in data entry.
Now, let's analyze why this outlier can be more important than the normal data:
1. Financial Impact: The outlier value of $100,000 represents a substantial increase in sales revenue compared to the normal
monthly figures. This can have a significant positive impact on the company's financial performance for the year, contributing
to higher profits, improved cash flow, and potentially influencing important financial decisions.
2. Decision Making: The outlier value can influence strategic decisions within the company. For example, if the company is
considering expansion, investment in marketing campaigns, or allocating resources for the upcoming year, the exceptional
sales revenue in December would carry more weight and influence these decisions.
3. Performance Evaluation: In performance evaluations and goal setting, the outlier value can significantly affect assessments and
targets. For instance, if the company's sales team has monthly targets based on the normal data, the outlier value in December
might lead to adjustments in expectations, incentives, or bonuses.
4. Anomaly Detection: Identifying and understanding the cause of the outlier value is crucial. It could be indicative of underlying
factors that need attention, such as unusual market conditions, customer behavior, or operational inefficiencies. Addressing and
managing these factors can help maintain or replicate the outlier's positive impact in future periods.
5. Industry Comparison: Outliers can also be important when comparing performance against industry benchmarks or
competitors. In this case, if the company's December sales revenue significantly exceeds industry norms, it could signify a
competitive advantage, market dominance, or differentiation in the industry.
In summary, the outlier value of $100,000 in December represents a significant deviation from the normal monthly sales revenue. It
can have a substantial impact on financial performance, decision making, goal setting, anomaly detection, and industry positioning.
Understanding and leveraging the insights from this outlier value can be crucial for the company's success and growth.
OR,
Outliers can be more important than normal data in certain contexts due to the following reasons:
Anomalies and Errors: Outliers often represent unusual or unexpected events, errors, or anomalies in the data. They can indicate data
quality issues, measurement errors, or abnormal behavior that needs attention. Identifying and addressing these outliers can improve
data integrity and the overall quality of analysis or decision-making.
Critical Events: Outliers may correspond to critical events or situations that have a significant impact on the system or process being
analyzed. These events could include rare occurrences, extreme values, or exceptional behavior that require special attention. By
identifying and understanding these outliers, appropriate actions can be taken to mitigate risks or leverage opportunities associated
with such events.
Fraud and Security: Outliers can be indicative of fraudulent activities, security breaches, or malicious behavior. Detecting outliers in
financial transactions, network traffic, or user behavior can help identify potential fraud, intrusion attempts, or other security threats.
Early detection and intervention can minimize the impact of these incidents and protect the integrity of systems or processes.
Unexplored Insights: Outliers often represent data points that do not conform to typical patterns or expected behaviors. Exploring and
analyzing these outliers can provide valuable insights and uncover hidden relationships, trends, or new knowledge. Outliers may lead
to the discovery of new market segments, innovative ideas, or scientific breakthroughs that were previously unknown or unexplored.
Decision-making and Optimization: Outliers can influence decision-making processes and optimization strategies. In certain cases,
outliers may represent exceptional or unique situations that require specific treatment or tailored approaches. Ignoring outliers or
treating them as noise can lead to suboptimal decisions or missed opportunities for improvement.
Monitoring and Control: Outliers can serve as signals for monitoring and control systems. By continuously monitoring data streams or
processes for outliers, it becomes possible to detect deviations from expected norms and trigger timely interventions or adjustments.
This proactive approach helps maintain system performance, prevent failures, and ensure operational efficiency.
It is important to note that not all outliers are equally important or require immediate action. The significance of an outlier depends on
the specific domain, problem context, and desired outcomes. Careful analysis, domain knowledge, and expert judgment are necessary
to determine the importance of outliers and decide on appropriate actions based on their impact and relevance to the problem at hand.
6 From class lecture:
6.1 K-Nearest Neighbor (KNN) Algorithm for Machine Learning
o K-Nearest Neighbour is one of the simplest Machine Learning algorithms, based on the Supervised Learning technique.
o The K-NN algorithm assumes similarity between the new case/data and the available cases and puts the new case into the category that is most similar to the available categories.
o The K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can be easily classified into a well-suited category using the K-NN algorithm.
o The K-NN algorithm can be used for Regression as well as for Classification, but it is mostly used for Classification problems.
o K-NN is a non-parametric algorithm, which means it does not make any assumption about the underlying data.
o It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead, it stores the dataset and performs an action on it at classification time.
o The KNN algorithm at the training phase just stores the dataset, and when it gets new data, it classifies that data into a category that is most similar to the new data.
o Example: Suppose we have an image of a creature that looks similar to both a cat and a dog, but we want to know whether it is a cat or a dog. For this identification, we can use the KNN algorithm, as it works on a similarity measure. Our KNN model will find the features of the new data that are most similar to the cat and dog images and, based on the most similar features, will put it in either the cat or the dog category.
o Why do we need a K-NN Algorithm?
Suppose there are two categories, Category A and Category B, and we have a new data point x1: to which of these categories does this data point belong? To solve this type of problem, we need a K-NN algorithm. With the help of K-NN, we can easily identify the category or class of a particular data point. Consider the below diagram:
o How does K-NN work?
The K-NN working can be explained on the basis of the below algorithm:
o Step-1: Select the number K of the neighbors.
o Step-2: Calculate the Euclidean distance of K number of neighbors.
o Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
o Step-4: Among these K neighbors, count the number of the data points in each category.
o Step-5: Assign the new data point to the category for which the number of neighbors is maximum.
o Step-6: Our model is ready.
Suppose we have a new data point and we need to put it in the required category. Consider the below image:
o Firstly, we will choose the number of neighbors, so we will choose k = 5.
o Next, we will calculate the Euclidean distance between the data points. The Euclidean distance is the distance between two points, which we have already studied in geometry. It can be calculated as d = √((x2 − x1)² + (y2 − y1)²).
o By calculating the Euclidean distance we get the nearest neighbors: three nearest neighbors in category A and two nearest neighbors in category B. Consider the below image:
o As we can see, the 3 nearest neighbors are from category A; hence this new data point must belong to category A.
o How to select the value of K in the K-NN Algorithm?
Below are some points to remember while selecting the value of K in the K-NN algorithm:
o There is no particular way to determine the best value for "K", so we need to try some values to find the best out of them. The most preferred value for K is 5.
o A very low value for K, such as K=1 or K=2, can be noisy and lead to the effects of outliers in the model.
o Large values for K smooth out the decision, but they can blur the boundary between categories and make the prediction less sensitive to local patterns.
o Advantages of KNN Algorithm:
o It is simple to implement.
o It is robust to noisy training data.
o It can be more effective if the training data is large.
o Disadvantages of KNN Algorithm:
o The value of K always needs to be determined, which can be complex at times.
o The computation cost is high because the distance between the new point and all training samples must be calculated.
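A minimal scikit-learn sketch of the K-NN procedure above, using a synthetic two-class dataset in place of "Category A" and "Category B", and k = 5 as in the example:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Toy two-class dataset standing in for Category A and Category B
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# k = 5 neighbours, Euclidean distance (the default metric)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

print("Test accuracy:", knn.score(X_test, y_test))
print("Predicted class of a new point:", knn.predict([[0.5, -0.2]]))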
6.2 Apriori vs FP-Growth in Market Basket Analysis – A Comparative Guide
Apriori is a join-based algorithm and FP-Growth is a tree-based algorithm for frequent itemset mining (frequent pattern mining) in market basket analysis.
Various machine learning concepts are used to make things easier and more profitable. When it comes to marketing strategies, it becomes very important to learn the behaviour of different customers regarding different products and services. Whatever the product or service, the provider needs to satisfy the customers to make more profit. Machine learning algorithms are now capable of making inferences about consumer behaviour, and using these inferences a provider can indirectly influence a customer to buy more than they originally intended.
From arranging items in a supermarket to recommending related products on e-commerce platforms, such arrangements can affect the profit level for providers and the satisfaction level for consumers. This arrangement can be done mathematically or using some algorithms. In this article we are going to discuss the two most basic algorithms of market basket analysis: one is Apriori and the other is FP-Growth. The major points to be discussed in this article are listed below.
Table of contents
Association Rule Learning
Frequent Itemset Mining(FIM)
Apriori
FP-Growth
Comparing Apriori and FP-Growth
Let us understand these concepts in detail.
Association Rule Learning
In machine learning, association rule learning is a method of finding interesting relationships between the variables in a large dataset.
This concept is mainly used by supermarkets and multipurpose e-commerce websites, where it is used for defining the patterns of
selling different products. More formally, we can say it is useful for extracting strong rules from a large database using some measure of
interestingness.
In supermarkets, association rules are used for discovering regularities between products where transactions happen on a large scale.
For example, the rule {comb, hair oil} → {mirror} represents that if a customer buys a comb and hair oil together, then there is a higher
chance that they will also buy a mirror. Such rules can play a major role in marketing strategies.
Let’s go through an example wherein in a small database we have 5 transactions and 5 products like the following.
transaction   product1   product2   product3   product4   product5
1                 1          1          0          0          0
2                 0          0          1          0          0
3                 0          0          0          1          1
4                 1          1          1          0          0
5                 0          1          0          0          0
Here in the database, we can see that every transaction has its own transaction id; 1 represents that the product is included in the
transaction, and 0 represents that the product is not included. Now in transaction 4, we can see that it includes product 1, product 2,
and product 3. By analyzing this we can derive the rule {product 2, product 3} → {product 1}, which indicates that customers who buy
product 2 and product 3 together are likely to buy product 1 as well.
To extract a set of rules from the database we have various measures of significance and interest. Some of the best-known measures
are minimum thresholds on support and confidence.
Support
Support is a measure that indicates how frequently an itemset appears in the database. Let X be an itemset and T the set of
transactions; then the support of X with respect to T is
supp(X) = |{t ∈ T : X ⊆ t}| / |T|,
i.e., the proportion of transactions in T that contain the itemset X. From the table above, the support for the itemset
{product 1, product 2} is 2/5, or 0.4, because the itemset appears in only 2 transactions and the total count of transactions is 5.
Confidence
Confidence is a measure that indicates how often a rule turns out to be true. The confidence of a rule X ⇒ Y with respect to a set of
transactions T is the proportion of the transactions containing X that also contain Y, where X and Y are itemsets. In terms of support,
the confidence of a rule can be defined as
conf(X⇒Y) = supp(X U Y)/supp(X).
For example, in the given table confidence of rule {product 2, product 3} ⇒ {product 1} is 0.2/0.2 = 1.0 in this database. This means
100% of the time the customer buys product 2 and product 3 together, product 1 bought as well.
So here we have seen the two best-known measures of interestingness. Besides these, there are further measures such as lift,
conviction, all-confidence, collective strength, and leverage, each with its own meaning and importance. This article is mainly
intended to give an overview of the techniques of frequent itemset mining, which we will discuss in the next part.
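To make the two measures concrete, here is a small plain-Python sketch that recomputes them for the 5-transaction table above; the helper functions are illustrative, not part of any library:

# Transactions from the 5-row table above, written as sets of product names
transactions = [
    {"product1", "product2"},
    {"product3"},
    {"product4", "product5"},
    {"product1", "product2", "product3"},
    {"product2"},
]

def support(itemset, transactions):
    # Fraction of transactions containing every item in `itemset`
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(lhs, rhs, transactions):
    # conf(lhs => rhs) = supp(lhs U rhs) / supp(lhs)
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)

print(support({"product1", "product2"}, transactions))                   # 0.4
print(confidence({"product2", "product3"}, {"product1"}, transactions))  # 1.0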
Frequent Itemset Mining(FIM)
Frequent Itemset Mining is a method that comes under market basket analysis. Above, we had an overview of association rules, which
showed how rules, judged by measures of interestingness, matter for market basket analysis. In this part we will see an introduction to
Frequent Itemset Mining, which aims at finding regularities in the transactions performed by consumers; in supermarket terms,
regularities in the shopping behaviour of customers.
Basically, frequent itemset mining is a procedure that helps in finding the sets of products that are frequently bought together. The
discovered frequent itemsets can be applied in recommendation systems and fraud detection, or used to improve the arrangement of
products on the shelves.
The algorithms for Frequent Itemset Mining can be classified roughly into three categories.
Join-Based algorithm
Tree-Based algorithms
Pattern Growth algorithms
Join-based algorithms expand the item list into larger itemsets up to a minimum support threshold defined by the user; tree-based
algorithms use a lexicographic tree that allows mining of itemsets in a variety of ways, such as depth-first order; and pattern growth
algorithms build itemsets from the currently identified frequent patterns and expand them.
Specific frequent itemset mining algorithms can be classified under these three categories (the taxonomy figure from the original article is not reproduced here).
Next in the article will have an overview of a classical Apriori Algorithm and FP Growth Algorithm.
Apriori
The Apriori algorithm was proposed by Agrawal and Srikant in 1994. It is designed to work on a database that consists of transaction
details. The algorithm finds frequent (k + 1)-itemsets from frequent k-itemsets using an iterative, level-wise search technique. For
example, let's take a look at the table of transaction details of the 5 items.
Transaction ID    List of items
T100              I1, I2, I5
T200              I2, I4
T300              I2, I3
T400              I1, I2, I4
T500              I1, I3
T600              I2, I3
T700              I1, I3
T800              I1, I2, I3, I5
T900              I1, I2, I3
In the process of Frequent Itemset Mining, the Apriori algorithm first considers every single item as an itemset, counts its support from
its frequency in the database, and then keeps those itemsets whose support is equal to or greater than the minimum support threshold.
Extracting each level of frequent itemsets requires scanning the entire database, and the algorithm continues until no more itemsets
with at least the minimum threshold support are left.
(The step-by-step illustration from the original article is not reproduced here.) In that illustration, the minimum support threshold is 2,
so in the very first step only items with support at least 2 are considered for the further steps of the algorithm, and in the later steps as
well only itemsets with a minimum support count of 2 are passed on for further processing.
Let's see how we can implement this algorithm using Python. For the implementation, I am making a dataset of 11 products and using
the mlxtend library to mine frequent itemsets with the Apriori algorithm.
dataset = [['product7', 'product9', 'product8', 'product6', 'product4', 'product11'],
['product3', 'product9', 'product8', 'product6', 'product4', 'product11'],
['product7', 'product1', 'product6', 'product4'],
['product7', 'product10', 'product2', 'product6', 'product11'],
['product2', 'product9', 'product9', 'product6', 'product5', 'product4']]
Importing the libraries.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
Making the right format of the dataset for using the apriori algorithm.
con = TransactionEncoder()
con_arr = con.fit(dataset).transform(dataset)
df = pd.DataFrame(con_arr, columns = con.columns_)
df
Output:
Next, I will be making itemsets with at least 60% support.
from mlxtend.frequent_patterns import apriori
apriori(df, min_support=0.6, use_colnames=True)
Output:
Here we can see the itemsets with a minimum support of 60%, together with the column indices. These frequent itemsets can be used
for downstream tasks such as designing marketing strategies, for example offering discounts on combinations of products that are
bought together. Now let's have a look at the FP-Growth algorithm.
Frequent Pattern Growth Algorithm
As we saw, the Apriori algorithm generates candidates for building the itemsets. The FP-Growth algorithm instead represents the data
in a tree structure, a lexicographic tree that we call the FP-tree, which is responsible for maintaining the association information
between the frequent items.
After building the FP-tree, it is segregated into a set of conditional FP-trees, one for every frequent item, and each conditional FP-tree
can then be mined and measured separately. For example, for a database similar to the dataset we used for the Apriori algorithm, the
table of conditional FP-trees looks like the following.
item | Conditional pattern base | Conditional FP-Tree | Frequent patterns generated
I5   | {{I2, I1: 1}, {I2, I1, I3: 1}} | {I2: 2, I1: 2} | {I2, I5: 2}, {I1, I5: 2}, {I2, I1, I5: 2}
I4   | {{I2, I1: 1}, {I2: 1}} | {I2: 2} | {I2, I4: 2}
I3   | {{I2, I1: 2}, {I2: 2}, {I1: 2}} | {I2: 4, I1: 2}, {I1: 2} | {I2, I3: 4}, {I1, I3: 4}, {I2, I1, I3: 2}
I1   | {{I2: 4}} | {I2: 4} | {I2, I1: 4}
Based on the above table, the original article shows a figure of the FP-tree built from the frequent items (not reproduced here). In that
tree, the support of I2 is seven: it appears together with I1 four times, directly with I3 two times, and with I4 one time. Whereas the
Apriori algorithm scans the tables again and again to generate the frequent sets, here a one-time scan is sufficient for generating the
itemsets.
The conditional FP-tree for I3 is likewise shown as a figure in the original article.
Let’s see how we can implement it using python. As we did above, again we will use the mlxtend library for the implementation of
FP_growth. I am using similar data to perform this.
from mlxtend.frequent_patterns import fpgrowth
frequent_itemsets = fpgrowth(df, min_support=0.6, use_colnames=True)
frequent_itemsets
Output:
Here we can see that, in comparison with Apriori, where the frequent itemsets appeared in the same order as the columns of the input
data frame, the FP-Growth result lists the itemsets in descending order of support value.
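Once the frequent itemsets are available (from either apriori or fpgrowth above), mlxtend can also derive association rules from them. A short sketch; the 0.7 confidence threshold is an arbitrary choice for illustration:

from mlxtend.frequent_patterns import association_rules

# Derive rules from the frequent itemsets mined above; keep rules with confidence >= 0.7
# Note: some recent mlxtend versions also expect a num_itemsets argument here.
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence"]])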
Comparing Apriori and FP-Growth Algorithm
One of the most important requirements for any frequent itemset mining algorithm is that it should need as little time and memory as
possible. With this in mind, many FIM algorithms have been proposed; Apriori and FP-Growth are the two most basic ones, and other
algorithms in this field are improvements of them. There are some basic differences between these algorithms; let's take a look.
Apriori | FP-Growth
Apriori generates the frequent patterns by making the itemsets using pairing, such as single itemsets, double itemsets, triple itemsets. | FP-Growth generates an FP-tree for making frequent patterns.
Apriori uses candidate generation, where frequent subsets are extended one item at a time. | FP-Growth generates a conditional FP-tree for every item in the data.
Since Apriori scans the database in each of its steps, it becomes time-consuming for data where the number of items is larger. | The FP-tree requires only one scan of the database in its beginning steps, so it consumes less time.
A converted version of the database is saved in memory. | A set of conditional FP-trees for every item is saved in memory.
It uses breadth-first search. | It uses depth-first search.
In the above table, we can see the differences between the Apriori and FP-Growth algorithms.
Advantages Of FP Growth Algorithm
1. This algorithm needs to scan the database only twice when compared to Apriori which scans the transactions for each iteration.
2. The pairing of items is not done in this algorithm and this makes it faster.
3. The database is stored in a compact version in memory.
4. It is efficient and scalable for mining both long and short frequent patterns.
Disadvantages Of FP-Growth Algorithm
1. The FP-tree is more cumbersome and difficult to build than the structures used in Apriori.
2. Building and storing the FP-tree may be expensive.
3. When the database is large, the FP-tree may not fit in the available memory.
Advantages of Apriori Algorithm
1. Easy to understand algorithm
2. Join and Prune steps are easy to implement on large itemsets in large databases
Disadvantages Apriori Algorithm
1. It requires high computation if the itemsets are very large and the minimum support is kept very low.
2. The entire database needs to be scanned.
6.3 Random Forest Algorithm
Random Forest is a popular machine learning algorithm that belongs to the supervised learning technique. It can be used for both
Classification and Regression problems in ML. It is based on the concept of ensemble learning, which is a process of combining
multiple classifiers to solve a complex problem and to improve the performance of the model.
As the name suggests, "Random Forest is a classifier that contains a number of decision trees on various subsets of the given
dataset and takes the average to improve the predictive accuracy of that dataset." Instead of relying on one decision tree, the
random forest takes the prediction from each tree and, based on the majority vote of those predictions, outputs the final prediction.
A greater number of trees in the forest generally leads to higher accuracy and helps prevent the problem of overfitting.
The below diagram explains the working of the Random Forest algorithm:
o Note: To better understand the Random Forest Algorithm, you should have knowledge of the Decision Tree Algorithm.
o Assumptions for Random Forest
Since the random forest combines multiple trees to predict the class of the dataset, it is possible that some decision trees may
predict the correct output while others may not. But together, all the trees predict the correct output. Therefore, below are two
assumptions for a better Random Forest classifier:
o There should be some actual values in the feature variables of the dataset so that the classifier can predict accurate results rather
than a guessed result.
o The predictions from each tree must have very low correlations.
o Why use Random Forest?
Below are some points that explain why we should use the Random Forest algorithm:
o It takes less training time as compared to other algorithms.
o It predicts output with high accuracy, and even for a large dataset it runs efficiently.
o It can also maintain accuracy when a large proportion of the data is missing.
How does Random Forest algorithm work?
Random Forest works in two-phase first is to create the random forest by combining N decision tree, and second is to make
predictions for each tree created in the first phase.
The Working process can be explained in the below steps and diagram:
Step-1: Select random K data points from the training set.
Step-2: Build the decision trees associated with the selected data points (Subsets).
Step-3: Choose the number N for decision trees that you want to build.
Step-4: Repeat Step 1 & 2.
Step-5: For new data points, find the predictions of each decision tree, and assign the new data points to the category that wins the
majority votes.
The working of the algorithm can be better understood by the below example:
Example: Suppose there is a dataset that contains multiple fruit images. So, this dataset is given to the Random forest classifier.
The dataset is divided into subsets and given to each decision tree. During the training phase, each decision tree produces a
prediction result, and when a new data point occurs, then based on the majority of results, the Random Forest classifier predicts the
final decision. Consider the below image:
o Applications of Random Forest
There are mainly four sectors where Random Forest is mostly used:
1. Banking: The banking sector mostly uses this algorithm for the identification of loan risk.
2. Medicine: With the help of this algorithm, disease trends and risks of the disease can be identified.
3. Land Use: We can identify areas of similar land use with this algorithm.
4. Marketing: Marketing trends can be identified using this algorithm.
o Advantages of Random Forest
o Random Forest is capable of performing both Classification and Regression tasks.
o It is capable of handling large datasets with high dimensionality.
o It enhances the accuracy of the model and reduces the overfitting issue.
o Disadvantages of Random Forest
o Although random forest can be used for both classification and regression tasks, it is less suitable for regression tasks.
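A hedged scikit-learn sketch of the working described above, using the built-in Iris dataset; the number of trees and the train/test split are illustrative choices:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load a small labelled dataset and split it
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# N = 100 trees; each tree sees a bootstrap sample and a random subset of features
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)

# The forest predicts by majority vote over its trees
print("Test accuracy:", rf.score(X_test, y_test))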
6.4 Regression Analysis in Machine learning
Regression analysis is a statistical method for modelling the relationship between a dependent (target) variable and one or more
independent (predictor) variables. More specifically, regression analysis helps us understand how the value of the dependent variable
changes with respect to one independent variable while the other independent variables are held fixed. It predicts continuous/real
values such as temperature, age, salary, price, etc.
We can understand the concept of regression analysis using the below example:
Example: Suppose there is a marketing company A that runs various advertisements every year and gets sales from them. The original
article shows the advertising spend made by the company in the last 5 years and the corresponding sales (the table is not reproduced here).
Now, the company wants to spend $200 on advertisement in the year 2019 and wants to know the predicted sales for this year. To solve
such prediction problems in machine learning, we need regression analysis.
Regression is a supervised learning technique which helps in finding the correlation between variables and enables us to
predict the continuous output variable based on the one or more predictor variables. It is mainly used for prediction,
forecasting, time series modeling, and determining the causal-effect relationship between variables.
In regression, we plot a graph between the variables that best fits the given data points; using this plot, the machine learning model can
make predictions about the data. In simple words, "Regression shows a line or curve that passes through the data points on the
target-predictor graph in such a way that the vertical distance between the data points and the regression line is minimized." This
distance between the data points and the line indicates whether the model has captured a strong relationship or not.
Some examples of regression are:
o Prediction of rain using temperature and other factors
o Determining market trends
o Prediction of road accidents due to rash driving
What Is Simple Linear Regression?
Simple linear regression is a statistical method for establishing the relationship between two variables using a straight line. The line is
drawn by finding the slope and intercept, which define the line and minimize regression errors.
The simplest form of simple linear regression has only one x variable and one y variable. The x variable is the independent variable,
because its value does not depend on anything else in the model; the y variable is the dependent variable, because its value depends on
x and is what you try to predict.
y = β0 + β1x + ε is the formula used for simple linear regression, where:
y is the predicted value of the dependent variable for any given value of the independent variable x.
β0 is the intercept, the predicted value of y when x is 0.
β1 is the regression coefficient: how much we expect y to change as x increases by one unit.
x is the independent variable (the variable we expect is influencing y).
ε is the error of the estimate, i.e., how much variation there is around the fitted line.
Simple linear regression establishes a line that fits your data, but it does not guarantee that the line fits well. For example, if your data
points are widely scattered around the trend, the fitted line may describe the data poorly even though it is the best straight line
available.
Simple Linear Regression vs. Multiple Linear Regression
When predicting a complex process's outcome, it's best to use multiple linear regression instead of simple linear regression. But it is
not necessary to use complex algorithms for simple problems.
A simple linear regression can accurately capture the relationship between two variables in simple relationships. But when dealing
with more complex interactions that require more thought, you need to switch from simple to multiple regression.
A multiple regression model uses more than one independent variable. It does not suffer from the same limitations as the simple
regression equation and, with suitably transformed predictors, it can also capture curved and non-linear relationships.
https://www.simplilearn.com/what-is-simple-linear-regression-in-machine-learning-article
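A minimal sketch of fitting y = β0 + β1x with scikit-learn; the advertising and sales numbers below are made up for illustration (the original table is not reproduced in these notes):

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical advertising spend (x) and sales (y); values are illustrative only
x = np.array([90, 120, 150, 100, 130]).reshape(-1, 1)
y = np.array([1000, 1300, 1800, 1200, 1380])

model = LinearRegression().fit(x, y)
print("Intercept (b0):", model.intercept_)
print("Slope (b1):", model.coef_[0])

# Predict sales for an advertising spend of $200
print("Predicted sales for $200:", model.predict([[200]])[0])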
What Is Multiple Linear Regression (MLR)?
One of the most common types of predictive analysis is multiple linear regression. This type of analysis allows you to understand the
relationship between a continuous dependent variable and two or more independent variables.
The independent variables can be either continuous (like age and height) or categorical (like gender and occupation). It's important to
note that categorical independent variables should be dummy coded before running the analysis.
Formula and Calculation of Multiple Linear Regression
Several circumstances that influence the dependent variable simultaneously can be controlled through multiple regression
analysis. Regression analysis is a method of analyzing the relationship between independent variables and dependent variables.
Let k represent the number of variables denoted by x1, x2, x3, ……, xk.
For this method, we assume that we have k independent variables x1, . . . , xk that we can set, then they probabilistically determine an
outcome Y.
Furthermore, we assume that Y is linearly dependent on the factors according to
Y = β0 + β1x1 + β2x2 + · · · + βkxk + ε
where:
• Y is the dependent (predicted) variable.
• β0 is the intercept: the value of Y when all of the independent variables are zero.
• β1, ..., βk are the regression coefficients: βj represents the change in Y that results from a one-unit change in xj, holding the other variables fixed.
• ε is the random error (residual) term of the model.
The error term ε plays the same role as in simple linear regression, except that k does not have to be 1.
We have n observations, with n typically much larger than k. For the i-th observation the independent variables take the values xi1, xi2, ..., xik and we measure a value yi of the random variable Yi. Thus, the model can be described by the equations
Yi = β0 + β1xi1 + β2xi2 + · · · + βkxik + εi, for i = 1, 2, ..., n,
where the errors εi are independent random variables, each with mean 0 and the same unknown variance σ².
Altogether the model for multiple linear regression has k + 2 unknown parameters: β0, β1, ..., βk, and σ².
When k was equal to 1, we found the least-squares line y = β̂0 + β̂1x, a line in the plane R². Now, with k ≥ 1 in general, we have a least-squares hyperplane
y = β̂0 + β̂1x1 + β̂2x2 + · · · + β̂kxk in R^(k+1).
The way to find the estimators β̂0, β̂1, ..., β̂k is the same: take the partial derivatives of the squared error
Q = Σ_{i=1}^{n} ( yi − (β0 + β1xi1 + β2xi2 + · · · + βkxik) )²
with respect to each coefficient and set them to zero. When that system is solved, we have the fitted values
ŷi = β̂0 + β̂1xi1 + β̂2xi2 + · · · + β̂kxik, for i = 1, ..., n,
which should be close to the actual values yi.
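The least-squares fit described above can be reproduced numerically; a small NumPy sketch with two made-up predictors (k = 2):

import numpy as np

# Made-up data: n = 6 observations, k = 2 predictors
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0], [6.0, 4.0]])
y = np.array([5.1, 5.9, 10.2, 10.8, 14.9, 15.1])

# Add a column of ones so beta[0] plays the role of the intercept β0
X_design = np.column_stack([np.ones(len(X)), X])

# Solve the least-squares problem: minimize ||y - X_design @ beta||^2
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print("Estimated coefficients (b0, b1, b2):", beta)

# Fitted values yhat_i = b0 + b1*xi1 + b2*xi2
print("Fitted values:", X_design @ beta)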
Polynomial Regression
Polynomial Regression is a regression algorithm that models the relationship between a dependent(y) and independent variable(x) as
nth degree polynomial. The Polynomial Regression equation is given below:
y = b0 + b1x1 + b2x1² + b3x1³ + ...... + bnx1ⁿ
It is also called the special case of Multiple Linear Regression in ML. Because we add some polynomial terms to the Multiple Linear
regression equation to convert it into Polynomial Regression.
It is a linear model with some modification in order to increase the accuracy.
The dataset used in Polynomial regression for training is of non-linear nature.
It makes use of a linear regression model to fit the complicated and non-linear functions and datasets.
Hence, "In Polynomial regression, the original features are converted into Polynomial features of required degree (2,3,..,n) and
then modeled using a linear model."
The need for Polynomial Regression in ML can be understood from the points below:
If we apply a linear model to a linear dataset, it gives a good result, as we have seen in Simple Linear Regression. But if we apply the
same model, without any modification, to a non-linear dataset, the fit will be poor: the loss function will increase, the error rate will be
high, and accuracy will decrease.
So for such cases, where data points are arranged in a non-linear fashion, we need the Polynomial Regression model. We can
understand it in a better way using the below comparison diagram of a linear dataset and a non-linear dataset.
In the above image, we have a dataset that is arranged non-linearly. If we try to cover it with a linear model, we can clearly see that it
hardly covers any data points, whereas a curve, which is what the Polynomial model produces, covers most of the data points.
Hence, if the datasets are arranged in a non-linear fashion, then we should use the Polynomial Regression model instead of Simple
Linear Regression.
Note: A Polynomial Regression algorithm is also called Polynomial Linear Regression because the model is linear in the coefficients,
even though it is non-linear in the variables.
Equation of the Polynomial Regression Model:
Simple Linear Regression equation:     y = b0 + b1x                                  .........(a)
Multiple Linear Regression equation:   y = b0 + b1x1 + b2x2 + b3x3 + .... + bnxn     .........(b)
Polynomial Regression equation:        y = b0 + b1x + b2x² + b3x³ + .... + bnxⁿ      ..........(c)
When we compare the above three equations, we can clearly see that all three are polynomial-style equations that differ in the degree of
their variables. The Simple and Multiple Linear equations are polynomial equations of degree one, and the Polynomial regression
equation is an equation of degree n that is still linear in its coefficients. So if we raise the degree of our linear equation, it is converted
into a Polynomial Linear equation.
https://www.javatpoint.com/machine-learning-polynomial-regression
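A short scikit-learn sketch of the idea: expand x into polynomial features of the chosen degree and then fit an ordinary linear model; the data and the degree below are illustrative assumptions:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Made-up non-linear data: y follows a quadratic curve in x
x = np.linspace(-3, 3, 20).reshape(-1, 1)
y = 1.0 + 2.0 * x.ravel() + 0.5 * x.ravel() ** 2

# Convert x into the features [x, x^2] and fit a linear model on them
poly = PolynomialFeatures(degree=2, include_bias=False)
x_poly = poly.fit_transform(x)
model = LinearRegression().fit(x_poly, y)

print("Recovered coefficients:", model.intercept_, model.coef_)
print("Prediction at x = 4:", model.predict(poly.transform([[4.0]]))[0])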
6.5 Confusion Matrix in Machine Learning
The confusion matrix is a matrix used to determine the performance of the classification models for a given set of test data. It can only
be determined if the true values for test data are known. The matrix itself can be easily understood, but the related terminologies may
be confusing. Since it shows the errors in the model performance in the form of a matrix, hence also known as an error matrix. Some
features of Confusion matrix are given below:
o For 2 prediction classes of a classifier, the matrix is a 2*2 table; for 3 classes, it is a 3*3 table, and so on.
o The matrix is divided into two dimensions, predicted values and actual values, along with the total number of predictions.
o Predicted values are the values predicted by the model, and actual values are the true values for the given observations.
o It looks like the below table:

                  Actual: Yes        Actual: No
Predicted: Yes    True Positive      False Positive
Predicted: No     False Negative     True Negative
The above table has the following cases:
o True Negative: The model has predicted No, and the real or actual value was also No.
o True Positive: The model has predicted Yes, and the actual value was also Yes.
o False Negative: The model has predicted No, but the actual value was Yes; it is also called a Type-II error.
o False Positive: The model has predicted Yes, but the actual value was No; it is also called a Type-I error.
o Need for Confusion Matrix in Machine learning
o It evaluates the performance of the classification models when they make predictions on test data, and tells us how good our
classification model is.
o It tells us not only the errors made by the classifier but also the type of error, i.e., whether it is a type-I or a type-II error.
o With the help of the confusion matrix, we can calculate different parameters for the model, such as accuracy, precision, etc.
Example: We can understand the confusion matrix using an example.
Suppose we are trying to create a model that predicts whether or not a person has a particular disease. The confusion matrix for this is
given as (the cell values follow from the totals listed below):

                  Actual: Yes   Actual: No
Predicted: Yes        24             8
Predicted: No          3            65
From the above example, we can conclude that:
o The table is given for a two-class classifier, which has two predictions, "Yes" and "No." Here, Yes means the patient has the
disease, and No means the patient does not have the disease.
o The classifier has made a total of 100 predictions. Out of 100 predictions, 89 are correct predictions and 11 are incorrect
predictions.
o The model has given the prediction "Yes" 32 times and "No" 68 times, whereas the actual "Yes" occurred 27 times and the actual
"No" 73 times.
o Calculations using Confusion Matrix:
We can perform various calculations for the model, such as the model's accuracy, using this matrix. These calculations are given
below:
o Classification Accuracy: It is one of the important parameters to determine the accuracy of classification problems. It defines how
often the model predicts the correct output. It is calculated as the ratio of the number of correct predictions made by the classifier
to the total number of predictions:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
o Misclassification rate: It is also termed the Error rate, and it defines how often the model gives wrong predictions. It is calculated
as the number of incorrect predictions divided by the total number of predictions made by the classifier:
Error rate = (FP + FN) / (TP + TN + FP + FN)
o Precision: It is the number of correct positive outputs provided by the model, i.e., out of all the cases the model predicted as
positive, how many were actually positive:
Precision = TP / (TP + FP)
o Recall: It is defined as, out of all the actual positive cases, how many our model predicted correctly. The recall should be as high
as possible:
Recall = TP / (TP + FN)
o F-measure: If two models have low precision and high recall or vice versa, it is difficult to compare them. For this purpose, we
can use the F-score, which evaluates recall and precision at the same time. The F-score is maximum when recall equals precision:
F-measure = (2 × Precision × Recall) / (Precision + Recall)
Other important terms used in Confusion Matrix:
o Null Error rate: It defines how often our model would be incorrect if it always predicted the majority class. As per the accuracy
paradox, it is said that "the best classifier has a higher error rate than the null error rate."
o ROC Curve: The ROC is a graph displaying a classifier's performance for all possible thresholds. The graph is plotted between the
true positive rate (on the Y-axis) and the false positive rate (on the X-axis).
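The quantities above can be computed directly with scikit-learn; a small sketch with made-up true labels and predictions:

from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Made-up ground-truth labels and model predictions (1 = "Yes", 0 = "No")
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

# sklearn's binary confusion matrix unravels as TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TN, FP, FN, TP:", tn, fp, fn, tp)

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1-score :", f1_score(y_true, y_pred))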
7 Exercise:
Which of the following schema is best suitable for data warehouse?
Snowflake schemas are good for data warehouses, while star schemas are better suited to data marts with simple
relationships. On one hand, star schemas are simpler, run queries faster, and are easier to set up. On the other hand,
snowflake schemas are less prone to data integrity issues, are easier to maintain, and use less space. The star schema is
more popular than the snowflake schema: it is easier to design and implement, and it can be more efficient to query
because there are fewer JOINs between tables. However, a star schema can require more storage space than a snowflake
schema because of the denormalized data.
What is class imbalance problem in machine learning? How do measure the performance of a class-imbalance
algorithm with example.
The class imbalance problem typically occurs when there are many more instances of some classes than others. In
such cases, standard classifiers tend to be overwhelmed by the large classes and ignore the small ones. Class
imbalance is normal and expected in typical ML applications. For example: in credit card fraud detection, most
transactions are legitimate, and only a small fraction are fraudulent.
Measuring the performance of a class-imbalance algorithm, with an example:
Measuring the performance of a class-imbalance algorithm requires using appropriate evaluation metrics that
consider the imbalance in the dataset. Here are some commonly used metrics:
Confusion Matrix: A confusion matrix is a table that summarizes the performance of a classifier. It lists the true
positive (TP), true negative (TN), false positive (FP), and false negative (FN) predictions for each class.
Precision: Precision measures the proportion of true positive predictions out of all positive predictions. It is calculated
as TP / (TP + FP).
Recall: Recall measures the proportion of true positive predictions out of all actual positive cases. It is calculated as
TP / (TP + FN).
F1-score: The F1-score is the harmonic mean of precision and recall. It provides a balanced measure of both metrics
and is often used when the classes are imbalanced.
Area Under the Receiver Operating Characteristic (ROC) curve (AUC-ROC): The ROC curve is a graphical
representation of the classifier's performance at different thresholds. It plots the true positive rate (sensitivity) against
the false positive rate (1-specificity) for different threshold values. The AUC-ROC is the area under the ROC curve.
It provides a single number that summarizes the overall performance of the classifier.
Here is an example to illustrate the performance evaluation of a class-imbalance algorithm:
Suppose we have a dataset with 1,000 instances, where 900 instances belong to the negative class and 100 instances
belong to the positive class. We want to classify whether an instance belongs to the positive or negative class. We
train a classifier using this dataset and evaluate its performance using the confusion matrix, precision, recall, F1-score, and AUC-ROC metrics.
Suppose the classifier predicts 20 instances as positive and 980 instances as negative. The confusion matrix is as
follows:
                    Predicted Negative    Predicted Positive
Actual Negative            880                    20
Actual Positive              0                   100
From the confusion matrix, we can calculate the precision, recall, and F1-score as follows (the AUC-ROC is computed separately from the classifier's predicted scores):
Precision = TP / (TP + FP) = 100 / (100 + 20) = 0.833
Recall = TP / (TP + FN) = 100 / (100 + 0) = 1.000
F1-score = 2 * (Precision * Recall) / (Precision + Recall) = 2 * (0.833 * 1.000) / (0.833 + 1.000) = 0.909
AUC-ROC = 0.940
In this example, the classifier achieved high precision and recall values, indicating that it performed well in detecting
the positive class. The F1-score indicates that the classifier achieved a good balance between precision and recall.
The high AUC-ROC value suggests that the classifier has a good overall performance.
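A minimal Python sketch that reproduces the precision, recall, and F1 numbers above directly from the confusion-matrix counts (standard scikit-learn functions such as classification_report would give the same values):

# Counts taken from the example confusion matrix above.
TP, FP, FN, TN = 100, 20, 0, 880

precision = TP / (TP + FP)                                   # 100 / 120 = 0.833
recall    = TP / (TP + FN)                                   # 100 / 100 = 1.000
f1        = 2 * precision * recall / (precision + recall)    # 0.909

print(f"Precision={precision:.3f}  Recall={recall:.3f}  F1={f1:.3f}")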
How to compare the performance of two classifiers in machine learning.
The performance of two classifiers can be compared using evaluation metrics such as accuracy, precision, recall, F1-score, and AUC-ROC. The classifiers should be trained on the same dataset with the same features, and the evaluation metrics should be computed using cross-validation so that both models are scored on the same folds. The results are then compared to determine which classifier performs better, taking into consideration the domain-specific requirements and objectives. For example:
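A minimal sketch of such a comparison, assuming scikit-learn is available and using a synthetic imbalanced dataset made up for illustration; both classifiers are scored on the same cross-validation folds with the F1 metric:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced dataset (about 90% negative, 10% positive), for illustration only.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)   # same folds for both models
for name, model in [("LogisticRegression", LogisticRegression(max_iter=1000)),
                    ("DecisionTree", DecisionTreeClassifier(random_state=0))]:
    scores = cross_val_score(model, X, y, cv=cv, scoring="f1")
    print(f"{name}: mean F1 = {scores.mean():.3f} (+/- {scores.std():.3f})")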
What is cross-validation in machine learning?
Cross-validation is a technique for evaluating ML models by training several models on subsets of the available input data and evaluating them on the complementary subsets. Cross-validation is used to detect overfitting, i.e., failing to generalize a pattern. It is a statistical method of evaluating and comparing learning algorithms by dividing the data into two segments: one used to learn or train a model and the other used to validate the model.
Extra - [For example, setting k = 2 results in 2-fold cross-validation. In 2-fold cross-validation, we randomly shuffle
the dataset into two sets d0 and d1, so that both sets are equal size (this is usually implemented by shuffling the data
array and then splitting it in two).]
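A minimal sketch of the 2-fold example above, assuming scikit-learn is available; KFold shuffles the data and produces the two complementary train/validation splits:

import numpy as np
from sklearn.model_selection import KFold

# Toy data: 10 samples, 1 feature, made up for illustration.
X = np.arange(10).reshape(-1, 1)
y = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 0])

# 2-fold CV: shuffle, split into d0 and d1, train on one half and validate on the other.
kf = KFold(n_splits=2, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(kf.split(X)):
    print(f"Fold {fold}: train on {train_idx}, validate on {val_idx}")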
A support vector machine always tries to maintain the maximum margin for classification. Explain it with a proper sketch.
Among all hyperplanes that separate the two classes, an SVM chooses the one with the maximum margin: the largest possible distance between the hyperplane and the closest training points of either class. Those closest points are the support vectors, and they alone determine the position of the hyperplane; maximizing the margin gives the best expected generalization to unseen data.
https://vitalflux.com/svm-algorithm-maximum-margin-classifier/
Consider the Agora super shop. Based on "Agora", develop and draw a sample data cube.
A data cube is a multidimensional representation of data that allows for multidimensional analysis. It consists of dimensions,
hierarchies, and measures. Each dimension represents a different aspect or attribute of the data, and hierarchies define the levels of
detail within each dimension. Measures are the numeric values that can be analyzed or aggregated.
Let's consider a sample data cube for the Agora super shop:
Dimensions:
1. Time: Date, Month, Year
2. Product: Product ID, Product Category, Product Subcategory
3. Store: Store ID, Store Location, Store Size
Hierarchies within Dimensions:
1. Time: Year > Month > Date
2. Product: Product Category > Product Subcategory
3. Store: Store Location > Store Size
Measures:
1. Sales Quantity
2. Sales Revenue
3. Profit
4. Discount Amount
Visual representation of the data cube: a 3-D cube with Time, Product, and Store as its axes, where each cell holds the measures (the figure is not reproduced here).
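A small pandas sketch of the same idea (the sales records below are made up for illustration): each group-by over a subset of the dimensions corresponds to one cuboid of the Agora data cube, and grouping over fewer dimension levels is a roll-up.

import pandas as pd

# Hypothetical Agora sales records (one row per transaction line).
sales = pd.DataFrame({
    "Year":            [2023, 2023, 2023, 2024],
    "Month":           ["Jan", "Jan", "Feb", "Jan"],
    "ProductCategory": ["Grocery", "Beverage", "Grocery", "Beverage"],
    "StoreLocation":   ["Dhaka", "Dhaka", "Chittagong", "Dhaka"],
    "SalesQuantity":   [10, 4, 7, 5],
    "SalesRevenue":    [500, 120, 350, 150],
})

# One cuboid of the cube: Time x Product x Store, aggregating the measures.
cube = sales.groupby(["Year", "Month", "ProductCategory", "StoreLocation"])[
    ["SalesQuantity", "SalesRevenue"]].sum()

# Roll-up along the Time hierarchy (Month -> Year).
rollup_year = sales.groupby(["Year", "ProductCategory"])[["SalesRevenue"]].sum()
print(cube, rollup_year, sep="\n\n")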
Write short notes on (any four):
Elbow method, Fact-Constellation Schema, Decision tree, Information entropy, Reinforcement learning
Elbow Method: The elbow method is a popular technique used to determine the optimal number of clusters in a
clustering algorithm. It works by plotting the within-cluster sum of squares (WCSS) against the number of clusters,
and identifying the "elbow" point in the plot where the change in WCSS starts to level off. This point represents the
optimal number of clusters, as adding more clusters beyond this point does not lead to a significant reduction in
WCSS.
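A minimal sketch of the elbow method, assuming scikit-learn and matplotlib are available and using a synthetic dataset made up for illustration:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)   # toy data

# WCSS (inertia_) for k = 1..9; the "elbow" in this curve suggests the number of clusters.
wcss = []
for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)

plt.plot(range(1, 10), wcss, marker="o")
plt.xlabel("Number of clusters k")
plt.ylabel("WCSS (inertia)")
plt.show()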
Fact-Constellation Schema: A fact-constellation (galaxy) schema is a data-warehouse modeling technique in which multiple fact tables share a set of dimension tables. The fact tables contain the measures of interest (such as sales revenue), and the shared dimension tables provide context for those measures (such as time, geography, and product). The schema can represent complex, multi-dimensional data and enables efficient querying and analysis across multiple dimensions.
Decision Tree: A decision tree is a popular machine learning algorithm that can be used for both classification and
regression tasks. It works by recursively splitting the data into subsets based on the features that are most informative
for the task at hand, until a stopping criterion is met (such as a maximum depth or minimum number of samples per
leaf). The resulting tree can be used to make predictions for new data by traversing the tree from the root to a leaf
node based on the values of its features.
Tasnim (C191267)
39
Information Entropy: Information entropy is a measure of the amount of uncertainty or randomness in a system. In
machine learning, it is commonly used to measure the impurity of a node in a decision tree, where a higher entropy
implies greater uncertainty about the class labels of the data points in that node. The entropy of a node is calculated
as the sum of the negative log probabilities of each class label, weighted by their relative frequencies in the node.
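For reference, the formula described above, written in the notation used later in this document (a standard definition):
Entropy(S) = - Σ p_i log2(p_i)
where p_i is the relative frequency of class i among the tuples in the node. For example, a node with 9 "yes" and 5 "no" tuples has Entropy = -(9/14) log2(9/14) - (5/14) log2(5/14) ≈ 0.94.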
Reinforcement Learning: Reinforcement learning is a type of machine learning where an agent learns to interact with
an environment to maximize a reward signal. The agent takes actions in the environment based on its current state,
receives feedback in the form of rewards or penalties, and updates its behavior to improve its future performance.
Reinforcement learning has been successfully applied in a wide range of domains, including game playing, robotics,
and control systems.
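A minimal sketch of the core idea, using the tabular Q-learning update with a made-up state, action, and reward (the environment and numbers are placeholders for illustration only):

import numpy as np

n_states, n_actions = 2, 2
Q = np.zeros((n_states, n_actions))          # action-value table
alpha, gamma = 0.1, 0.9                      # learning rate, discount factor

# One made-up experience tuple (state, action, reward, next_state).
s, a, r, s_next = 0, 1, 1.0, 1

# Q-learning update: move Q(s, a) toward reward + discounted best future value.
Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
print(Q)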
Types of Neural Networks and Definition of Neural Network
The nine types of neural networks are:
1. Perceptron
2. Feed Forward Neural Network
3. Multilayer Perceptron
4. Convolutional Neural Network
5. Radial Basis Functional Neural Network
6. Recurrent Neural Network
7. LSTM – Long Short-Term Memory
8. Sequence to Sequence Models
9. Modular Neural Network
Here is a brief description of each of the nine types of neural networks, along with an example, advantages, disadvantages, and applications:
1. Perceptron:
• Description: The perceptron is the simplest form of a neural network, with a single layer of artificial neurons.
• Example: Binary classification tasks, such as predicting whether an email is spam or not.
• Advantages: Simple and easy to understand, computationally efficient.
• Disadvantages: Limited to linearly separable problems, cannot learn complex patterns.
• Applications: Pattern recognition, binary classification.
2. Feed Forward Neural Network:
• Description: Also known as a multilayer perceptron (MLP), it consists of multiple layers of interconnected neurons.
• Example: Handwritten digit recognition, image classification.
• Advantages: Can learn complex patterns, universal approximation capability.
• Disadvantages: Requires a large amount of training data, prone to overfitting.
• Applications: Image recognition, speech recognition, natural language processing.
3. Multilayer Perceptron:
• Description: Similar to a feed-forward neural network, it consists of multiple layers of interconnected neurons.
• Example: Credit scoring, sentiment analysis.
• Advantages: Can handle nonlinear problems, good generalization capability.
• Disadvantages: Requires careful tuning of parameters, prone to overfitting.
• Applications: Pattern recognition, regression, classification.
4. Convolutional Neural Network (CNN):
• Description: Designed for processing structured grid-like data, such as images, by applying convolution operations.
• Example: Image classification, object detection.
• Advantages: Can automatically learn hierarchical features, translation-invariant.
• Disadvantages: Requires a large amount of training data, computationally expensive.
• Applications: Computer vision, image recognition, autonomous driving.
5. Radial Basis Functional Neural Network (RBFNN):
• Description: Utilizes radial basis functions to model the hidden layer and a linear combination in the output layer.
• Example: Time series prediction, function approximation.
• Advantages: Fast learning, good generalization ability.
• Disadvantages: Sensitivity to network architecture, may overfit with insufficient data.
• Applications: Pattern recognition, time series analysis, control systems.
6. Recurrent Neural Network (RNN):
• Description: Processes sequential data by using feedback connections, allowing information to persist.
• Example: Speech recognition, language translation.
• Advantages: Handles sequential data well, captures temporal dependencies.
• Disadvantages: Gradient vanishing/exploding problem, computationally expensive.
• Applications: Natural language processing, speech recognition, time series prediction.
7. LSTM – Long Short-Term Memory:
• Description: A type of recurrent neural network designed to mitigate the vanishing gradient problem by using memory cells.
• Example: Sentiment analysis, text generation.
• Advantages: Captures long-term dependencies, handles variable-length sequences.
• Disadvantages: Requires a large amount of training data, computationally expensive.
• Applications: Natural language processing, speech recognition, time series analysis.
8. Sequence to Sequence Models:
• Description: Utilizes an encoder-decoder architecture to transform one sequence into another.
• Example: Machine translation, chatbots.
• Advantages: Handles variable-length sequences, preserves semantic information.
• Disadvantages: Requires large amounts of training data, complex architecture.
• Applications: Machine translation, speech recognition, text summarization.
9. Modular Neural Network:
• Description: Consists of multiple independent neural network modules that work together to solve a complex problem.
• Example: Autonomous robots.
https://www.mygreatlearning.com/blog/types-of-neural-networks/
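As a concrete illustration of type 1 above, here is a minimal perceptron training loop (NumPy assumed; the logical-AND dataset is made up for illustration, chosen because it is linearly separable):

import numpy as np

# Perceptron learning on the logical AND problem.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])

w, b, lr = np.zeros(2), 0.0, 0.1

for _ in range(20):                              # a few passes over the data
    for xi, ti in zip(X, y):
        out = 1 if (w @ xi + b) > 0 else 0       # threshold activation
        w += lr * (ti - out) * xi                # perceptron update rule
        b += lr * (ti - out)

print("weights:", w, "bias:", b)
print("predictions:", [(1 if (w @ xi + b) > 0 else 0) for xi in X])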
7.1.1.1 K means:
Cluster the following eight points (with (x, y) representing locations) into three clusters:
A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2), A8(4, 9)
Initial cluster centers are: A1(2, 10), A4(5, 8) and A7(1, 2).
The distance function between two points a = (x1, y1) and b = (x2, y2) is defined as Ρ(a, b) = |x2 – x1| + |y2 – y1|.
Use the K-Means Algorithm to find the three cluster centers after the second iteration.
Solution-
We follow the above discussed K-Means Clustering Algorithm-
Iteration-01:
We calculate the distance of each point from each of the center of the three clusters.
The distance is calculated by using the given distance function.
The following illustration shows the calculation of distance between point A1(2, 10) and each of the center of the three clusters-
Calculating Distance Between A1(2, 10) and C1(2, 10)-
Ρ(A1, C1)
= |x2 – x1| + |y2 – y1|
= |2 – 2| + |10 – 10|
=0
Calculating Distance Between A1(2, 10) and C2(5, 8)-
Ρ(A1, C2)
= |x2 – x1| + |y2 – y1|
= |5 – 2| + |8 – 10|
=3+2
=5
Calculating Distance Between A1(2, 10) and C3(1, 2)-
Ρ(A1, C3)
= |x2 – x1| + |y2 – y1|
= |1 – 2| + |2 – 10|
=1+8
=9
In a similar manner, we calculate the distance of the other points from each of the three cluster centers.
Next,
We draw a table showing all the results.
Using the table, we decide which point belongs to which cluster.
The given point belongs to that cluster whose center is nearest to it.
Given Points   Distance from         Distance from        Distance from        Point belongs
               center (2, 10) of     center (5, 8) of     center (1, 2) of     to Cluster
               Cluster-01            Cluster-02           Cluster-03
A1(2, 10)      0                     5                    9                    C1
A2(2, 5)       5                     6                    4                    C3
A3(8, 4)       12                    7                    9                    C2
A4(5, 8)       5                     0                    10                   C2
A5(7, 5)       10                    5                    9                    C2
A6(6, 4)       10                    5                    7                    C2
A7(1, 2)       9                     10                   0                    C3
A8(4, 9)       3                     2                    10                   C2
From here, the new clusters are-
Cluster-01:
First cluster contains points-
A1(2, 10)
Cluster-02:
Second cluster contains points-
A3(8, 4)
A4(5, 8)
A5(7, 5)
A6(6, 4)
A8(4, 9)
Cluster-03:
Third cluster contains points-
A2(2, 5)
A7(1, 2)
Now,
We re-compute the new cluster centers.
The new cluster center is computed by taking the mean of all the points contained in that cluster.
For Cluster-01:
We have only one point A1(2, 10) in Cluster-01.
So, cluster center remains the same.
For Cluster-02:
Center of Cluster-02
= ((8 + 5 + 7 + 6 + 4)/5, (4 + 8 + 5 + 4 + 9)/5)
= (6, 6)
For Cluster-03:
Center of Cluster-03
= ((2 + 1)/2, (5 + 2)/2)
= (1.5, 3.5)
This completes Iteration-01.
Iteration-02:
We calculate the distance of each point from each of the center of the three clusters.
The distance is calculated by using the given distance function.
The following illustration shows the calculation of distance between point A1(2, 10) and each of the center of the three clusters-
Calculating Distance Between A1(2, 10) and C1(2, 10)-
Ρ(A1, C1)
= |x2 – x1| + |y2 – y1|
= |2 – 2| + |10 – 10|
=0
Calculating Distance Between A1(2, 10) and C2(6, 6)-
Ρ(A1, C2)
= |x2 – x1| + |y2 – y1|
= |6 – 2| + |6 – 10|
=4+4
=8
Calculating Distance Between A1(2, 10) and C3(1.5, 3.5)-
Ρ(A1, C3)
= |x2 – x1| + |y2 – y1|
= |1.5 – 2| + |3.5 – 10|
= 0.5 + 6.5
=7
In a similar manner, we calculate the distance of the other points from each of the three cluster centers.
Next,
We draw a table showing all the results.
Using the table, we decide which point belongs to which cluster.
The given point belongs to that cluster whose center is nearest to it.
Given Points   Distance from         Distance from        Distance from          Point belongs
               center (2, 10) of     center (6, 6) of     center (1.5, 3.5) of   to Cluster
               Cluster-01            Cluster-02           Cluster-03
A1(2, 10)      0                     8                    7                      C1
A2(2, 5)       5                     5                    2                      C3
A3(8, 4)       12                    4                    7                      C2
A4(5, 8)       5                     3                    8                      C2
A5(7, 5)       10                    2                    7                      C2
A6(6, 4)       10                    2                    5                      C2
A7(1, 2)       9                     9                    2                      C3
A8(4, 9)       3                     5                    8                      C1
From here, the new clusters are-
Cluster-01:
First cluster contains points-
A1(2, 10)
A8(4, 9)
Cluster-02:
Second cluster contains points-
A3(8, 4)
A4(5, 8)
A5(7, 5)
A6(6, 4)
Cluster-03:
Third cluster contains points-
A2(2, 5)
A7(1, 2)
Now,
We re-compute the new cluster centers.
The new cluster center is computed by taking the mean of all the points contained in that cluster.
For Cluster-01:
Center of Cluster-01
= ((2 + 4)/2, (10 + 9)/2)
= (3, 9.5)
For Cluster-02:
Center of Cluster-02
= ((8 + 5 + 7 + 6)/4, (4 + 8 + 5 + 4)/4)
= (6.5, 5.25)
For Cluster-03:
Center of Cluster-03
= ((2 + 1)/2, (5 + 2)/2)
= (1.5, 3.5)
This completes Iteration-02.
After the second iteration, the centers of the three clusters are-
C1(3, 9.5)
C2(6.5, 5.25)
C3(1.5, 3.5)
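A short NumPy sketch that reproduces the two iterations above, following the same procedure as the exercise (Manhattan distance for assignment, arithmetic mean for the new centers):

import numpy as np

points  = np.array([[2,10],[2,5],[8,4],[5,8],[7,5],[6,4],[1,2],[4,9]], dtype=float)
centers = np.array([[2,10],[5,8],[1,2]], dtype=float)       # A1, A4, A7

for it in range(2):                                          # two iterations, as above
    # Manhattan distance of every point to every center.
    dist = np.abs(points[:, None, :] - centers[None, :, :]).sum(axis=2)
    labels = dist.argmin(axis=1)                             # nearest center wins
    # New center = mean of the points assigned to it.
    centers = np.array([points[labels == k].mean(axis=0) for k in range(3)])
    print(f"After iteration {it+1}: centers =\n{centers}")
    # Expected output: (2,10),(6,6),(1.5,3.5) then (3,9.5),(6.5,5.25),(1.5,3.5)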
7.1.1.2 Confusion Matrix
(The worked confusion-matrix examples for this section appeared as images in the original document and are not reproduced here.)
7.1.1.3 FP Growth Algorithm
The Frequent Pattern (FP) Growth algorithm is an association-rule mining algorithm that overcomes the disadvantages of the Apriori algorithm by storing all the transactions in a trie (FP-tree) data structure. Consider the following data (shown as a table in the original):
The given data is a hypothetical dataset of transactions, with each letter representing an item. The frequency of each individual item is computed.
Let the minimum support be 3. A Frequent Pattern set is built which will contain all the elements whose frequency is greater than or equal to the minimum support. These elements are stored in descending order of their respective frequencies. After insertion of the relevant items, the set L looks like this:
L = {K: 5, E: 4, M: 3, O: 3, Y: 3}
Now, for each transaction, the respective Ordered-Item set is built. It is done by iterating the Frequent Pattern set and checking if the current
item is contained in the transaction in question. If the current item is contained, the item is inserted in the Ordered-Item set for the current
transaction. The following table is built for all the transactions:
Now, all the Ordered-Item sets are inserted into a Trie Data Structure.
a) Inserting the set {K, E, M, O, Y}:
Here, all the items are simply linked one after the other in the order of occurrence in the set, and the support count for each item is initialized as 1.
b) Inserting the set {K, E, O, Y}:
Up to the insertion of the elements K and E, the support counts are simply increased by 1. On inserting O we can see that there is no direct
link between E and O, therefore a new node for the item O is initialized with the support count as 1 and item E is linked to this new
node. On inserting Y, we first initialize a new node for the item Y with support count as 1 and link the new node of O with the new
node of Y.
c) Inserting the set {K, E, M}:
Here simply the support count of each element is increased by 1.
d) Inserting the set {K, M, Y}:
Similar to step b), first the support count of K is increased, then new nodes for M and Y are initialized and linked accordingly.
e) Inserting the set {K, E, O}:
Here simply the support counts of the respective elements are increased. Note that the support count of the new node of item O is
increased.
Now, for each item, the Conditional Pattern Base is computed, which is the set of path labels of all the paths that lead to any node of the given item in the frequent-pattern tree. Note that the items in the table below are arranged in ascending order of their frequencies.
Now, for each item, the Conditional Frequent Pattern Tree is built. It is done by taking the set of elements that is common to all the paths in the Conditional Pattern Base of that item and calculating its support count by summing the support counts of all the paths in the Conditional Pattern Base.
From the Conditional Frequent Pattern Tree, the frequent-pattern rules are generated by pairing each item of the Conditional Frequent Pattern Tree set with the corresponding item, as given in the table below.
For each row, two types of association rules can be inferred; for example, from the first row the rules K -> Y and Y -> K can be inferred. To determine the valid rule, the confidence of both rules is calculated, and the one with confidence greater than or equal to the minimum confidence value is retained.
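The transaction table for this example appeared as an image, so the transactions in the sketch below are placeholders reconstructed to be consistent with the frequencies quoted above (K:5, E:4, M:3, O:3, Y:3). The sketch shows the first two FP-Growth steps in Python: counting frequent items and building the ordered-item sets that get inserted into the FP-tree.

from collections import Counter

# Placeholder transactions (the original table was an image); items are single letters.
transactions = [
    {"E", "K", "M", "N", "O", "Y"},
    {"D", "E", "K", "N", "O", "Y"},
    {"A", "E", "K", "M"},
    {"C", "K", "M", "U", "Y"},
    {"C", "E", "I", "K", "O"},
]
min_support = 3

# Step 1: count item frequencies and keep the frequent ones, sorted by descending frequency.
counts = Counter(item for t in transactions for item in t)
frequent = {i: c for i, c in counts.items() if c >= min_support}
order = sorted(frequent, key=lambda i: (-frequent[i], i))        # ['K', 'E', 'M', 'O', 'Y']

# Step 2: build the Ordered-Item set for each transaction (frequent items only,
# in descending order of global frequency).
ordered_item_sets = [[i for i in order if i in t] for t in transactions]
print(frequent)
print(ordered_item_sets)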
7.1.1.4 Apriori Algorithm
Step 1: Start from the transactions in the database.
Step 2: Calculate the support/frequency of all items.
Step 3: Discard the items whose support is below the minimum support threshold.
Step 4: Combine the surviving items into two-item candidates.
Step 5: Calculate the support/frequency of these itemsets.
Step 6: Discard the itemsets whose support is below the minimum support threshold.
Step 7: Combine three items, calculate their support, and again discard the itemsets below the minimum support threshold.
Example 1
Let's see an example of the Apriori Algorithm.
Find the frequent itemsets and generate association rules for this dataset. Assume a minimum support threshold (s = 33.33%) and a minimum confidence threshold (c = 60%).
Let’s start,
There is only one itemset with minimum support 2. So only one itemset is frequent.
Frequent Itemset (I) = {Hot Dogs, Coke, Chips}
Association rules,
[Hot Dogs^Coke]=>[Chips] //confidence = sup(Hot Dogs^Coke^Chips)/sup(Hot Dogs^Coke) = 2/2*100=100% //Selected
[Hot Dogs^Chips]=>[Coke] //confidence = sup(Hot Dogs^Coke^Chips)/sup(Hot Dogs^Chips) = 2/2*100=100% //Selected
[Coke^Chips]=>[Hot Dogs] //confidence = sup(Hot Dogs^Coke^Chips)/sup(Coke^Chips) = 2/3*100=66.67% //Selected
[Hot Dogs]=>[Coke^Chips] //confidence = sup(Hot Dogs^Coke^Chips)/sup(Hot Dogs) = 2/4*100=50% //Rejected
[Coke]=>[Hot Dogs^Chips] //confidence = sup(Hot Dogs^Coke^Chips)/sup(Coke) = 2/3*100=66.67% //Selected
[Chips]=>[Hot Dogs^Coke] //confidence = sup(Hot Dogs^Coke^Chips)/sup(Chips) = 2/4*100=50% //Rejected
There are four strong results (minimum confidence greater than 60%)
Example 2
Let's see another example of the Apriori Algorithm.
Find the frequent itemsets for this dataset. Assume a minimum support of s = 3.
There is only one itemset with minimum support 3. So only one itemset is frequent.
Frequent Itemset (I) = {Coke, Chips}
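A small Python sketch that reproduces the confidence calculations of Example 1 above, using the support counts quoted in the text (the transaction table itself was an image):

# Support counts quoted in Example 1 above.
support = {
    ("Hot Dogs",): 4, ("Coke",): 3, ("Chips",): 4,
    ("Hot Dogs", "Coke"): 2, ("Hot Dogs", "Chips"): 2, ("Coke", "Chips"): 3,
    ("Hot Dogs", "Coke", "Chips"): 2,
}
min_conf = 0.60
full = ("Hot Dogs", "Coke", "Chips")

# confidence(A => B) = support(A u B) / support(A); keep rules meeting min_conf.
for antecedent in [("Hot Dogs", "Coke"), ("Hot Dogs", "Chips"), ("Coke", "Chips"),
                   ("Hot Dogs",), ("Coke",), ("Chips",)]:
    conf = support[full] / support[antecedent]
    consequent = tuple(i for i in full if i not in antecedent)
    verdict = "Selected" if conf >= min_conf else "Rejected"
    print(f"{antecedent} => {consequent}: confidence = {conf:.2%}  ({verdict})")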
7.1.1.5 SVM
Suppose we are given the following positively labeled data points in 2D space: {(3, 1), (3, -1), (6, 1), (6, -1)} and the following negatively labeled data points in 2D space: {(1, 0), (0, 1), (0, -1), (-1, 0)}. In addition, the three support vectors are {(1, 0), (3, 1), (3, -1)}. Apply a Support Vector Machine to classify the data object (2, 1).
Positively labeled data points: {(3, 1), (3, -1), (6, 1), (6, -1)}
Negatively labeled data points: {(1, 0), (0, 1), (0, -1), (-1, 0)}
Support vectors: {(1, 0), (3, 1), (3, -1)}
Here, S1 = (1, 0), S2 = (3, 1), S3 = (3, -1).
Each support vector is augmented with a bias input of 1:
S1~ = (1, 0, 1), S2~ = (3, 1, 1), S3~ = (3, -1, 1)
Since S1 is from the negative class and S2, S3 are from the positive class, the alphas must satisfy:
α1 S1~·S1~ + α2 S2~·S1~ + α3 S3~·S1~ = -1 ______________________(i)
α1 S1~·S2~ + α2 S2~·S2~ + α3 S3~·S2~ = +1 ______________________(ii)
α1 S1~·S3~ + α2 S2~·S3~ + α3 S3~·S3~ = +1 ______________________(iii)
From equation (i):
α1 (1+0+1) + α2 (3+0+1) + α3 (3-0+1) = -1
=> 2α1 + 4α2 + 4α3 = -1 -------------------------(iv)
From equation (ii):
α1 (3+0+1) + α2 (9+1+1) + α3 (9-1+1) = +1
=> 4α1 + 11α2 + 9α3 = +1 -------------------------(v)
From equation (iii):
α1 (3-0+1) + α2 (9-1+1) + α3 (9+1+1) = +1
=> 4α1 + 9α2 + 11α3 = +1 -------------------------(vi)
Solving equations (iv), (v), (vi):
α1 = -3.5
α2 = 0.75
α3 = 0.75
Weight vector, w~ = Σi αi Si~
= α1 S1~ + α2 S2~ + α3 S3~
= -3.5 (1, 0, 1) + 0.75 (3, 1, 1) + 0.75 (3, -1, 1)
= (1, 0, -2)
Separating the bias from the augmented vector gives w = (1, 0) and b = -2.
Hyperplane equation: w·x + b = 0, i.e. x1 - 2 = 0.
Since w = (1, 0), the decision line is parallel to the y-axis and passes through x = 2.
For the data object (2, 1): w·(2, 1) + b = (1)(2) + (0)(1) - 2 = 0, so the point lies exactly on the decision boundary x = 2; with the usual convention that f(x) ≥ 0 is classified as positive, the data object (2, 1) is assigned to the positive class.
https://www.youtube.com/watch?v=ivPoCcYfFAw
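A quick cross-check of the result above, assuming scikit-learn is installed: with a very large C a linear SVM approximates the hard-margin solution, so the learned w and b should be close to (1, 0) and -2.

import numpy as np
from sklearn.svm import SVC

# The eight training points from the exercise above.
X = np.array([[3, 1], [3, -1], [6, 1], [6, -1],      # positive class
              [1, 0], [0, 1], [0, -1], [-1, 0]])     # negative class
y = np.array([1, 1, 1, 1, -1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)          # very large C ~ hard margin

print("w =", clf.coef_[0])                           # expected to be close to [1, 0]
print("b =", clf.intercept_[0])                      # expected to be close to -2
print("support vectors:", clf.support_vectors_)      # expected: (1,0), (3,1), (3,-1)
print("f(2, 1) =", clf.decision_function([[2, 1]])[0])   # ~0: (2, 1) lies on the boundary x = 2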
7.1.1.6 Decision Tree
RID   age           Income    student   credit rating   Class: buys computer
1     youth         High      no        fair            no
2     youth         High      no        excellent       no
3     middle aged   High      no        fair            yes
4     senior        Medium    no        fair            yes
5     senior        Low       yes       fair            yes
6     senior        Low       yes       excellent       no
7     middle aged   Low       yes       excellent       yes
8     youth         Medium    no        fair            no
9     youth         Low       yes       fair            yes
10    senior        Medium    yes       fair            yes
11    youth         Medium    yes       excellent       yes
12    middle aged   Medium    no        excellent       yes
13    middle aged   High      yes       fair            yes
14    senior        Medium    no        excellent       no
Build a Decision Tree for the given data using information gain.
Entropy before partition:
E(S) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.94
Now calculate the entropy for each attribute.
Attribute (Age):
              youth   middle aged   senior
buys=yes      2       4             3
buys=no       3       0             2
E(youth) = -(2/5) log2(2/5) - (3/5) log2(3/5) = 0.97
E(middle aged) = -(4/4) log2(4/4) - (0/4) log2(0/4) = 0
E(senior) = -(3/5) log2(3/5) - (2/5) log2(2/5) = 0.97
Information gain (Age):
Using the formula: G(S, A) = Entropy(S) - Σ (|Sv|/|S|) * Entropy(Sv), summed over v ∈ values(A)
G(S, Age) = 0.94 - ((5/14)*0.97 + (4/14)*0 + (5/14)*0.97) = 0.24
Attribute (Income):
              High   Medium   Low
buys=yes      2      4        3
buys=no       2      2        1
E(High) = -(2/4) log2(2/4) - (2/4) log2(2/4) = 1
E(Medium) = -(4/6) log2(4/6) - (2/6) log2(2/6) = 0.92
E(Low) = -(3/4) log2(3/4) - (1/4) log2(1/4) = 0.81
Information gain (Income):
Using the formula: G(S, A) = Entropy(S) - Σ (|Sv|/|S|) * Entropy(Sv)
G(S, Income) = 0.94 - ((4/14)*1 + (6/14)*0.92 + (4/14)*0.81) = 0.028
Attribute (Student):
              student=yes   student=no
buys=yes      6             3
buys=no       1             4
E(yes) = -(6/7) log2(6/7) - (1/7) log2(1/7) = 0.59
E(no) = -(3/7) log2(3/7) - (4/7) log2(4/7) = 0.98
Information gain (Student):
Using the formula: G(S, A) = Entropy(S) - Σ (|Sv|/|S|) * Entropy(Sv)
G(S, Student) = 0.94 - ((7/14)*0.59 + (7/14)*0.98) = 0.15
Attribute (Credit rating):
              fair   excellent
buys=yes      6      3
buys=no       2      3
E(fair) = -(6/8) log2(6/8) - (2/8) log2(2/8) = 0.81
E(excellent) = -(3/6) log2(3/6) - (3/6) log2(3/6) = 1
Information gain (Credit rating):
Using the formula: G(S, A) = Entropy(S) - Σ (|Sv|/|S|) * Entropy(Sv)
G(S, Credit rating) = 0.94 - ((8/14)*0.81 + (6/14)*1) = 0.048
Information gain for all attributes:
Attribute        Information gain
Age              0.24
Income           0.028
Student          0.15
Credit Rating    0.048
Age has the highest information gain, so age is selected as the splitting (root) attribute. The middle-aged branch is pure (all four tuples are "yes"), so it becomes a leaf; the youth and senior branches are partitioned further below.
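To cross-check the hand calculation, here is a small pure-Python sketch that recomputes the information gain of each attribute from the 14-tuple table above:

from math import log2
from collections import Counter

# The 14 training tuples from the table above: (age, income, student, credit_rating, buys).
data = [
    ("youth","high","no","fair","no"), ("youth","high","no","excellent","no"),
    ("middle","high","no","fair","yes"), ("senior","medium","no","fair","yes"),
    ("senior","low","yes","fair","yes"), ("senior","low","yes","excellent","no"),
    ("middle","low","yes","excellent","yes"), ("youth","medium","no","fair","no"),
    ("youth","low","yes","fair","yes"), ("senior","medium","yes","fair","yes"),
    ("youth","medium","yes","excellent","yes"), ("middle","medium","no","excellent","yes"),
    ("middle","high","yes","fair","yes"), ("senior","medium","no","excellent","no"),
]

def entropy(rows):
    counts = Counter(r[-1] for r in rows)
    total = len(rows)
    return -sum(c / total * log2(c / total) for c in counts.values())

def gain(rows, attr_index):
    values = {r[attr_index] for r in rows}
    remainder = 0.0
    for v in values:
        subset = [r for r in rows if r[attr_index] == v]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(rows) - remainder

for i, name in enumerate(["age", "income", "student", "credit_rating"]):
    # Prints approximately 0.247, 0.029, 0.152, 0.048; compare with the rounded values above.
    print(f"Gain({name}) = {gain(data, i):.3f}")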
Second iteration: consider only the tuples with age = youth (the age attribute is omitted).
ID    Income    Student   Credit rating   Class: Buy Computer
1     high      no        fair            no
2     high      no        excellent       no
8     medium    no        fair            no
9     low       yes       fair            yes
11    medium    yes       excellent       yes
E(S) = -(2/5) log2(2/5) - (3/5) log2(3/5) = 0.97
Attribute (Income):
              high   medium   low
buys=yes      0      1        1
buys=no       2      1        0
E(high) = -(0/2) log2(0/2) - (2/2) log2(2/2) = 0
E(medium) = -(1/2) log2(1/2) - (1/2) log2(1/2) = 1
E(low) = -(1/1) log2(1/1) - (0/1) log2(0/1) = 0
Information gain (Income):
Using the formula: G(S, A) = Entropy(S) - Σ (|Sv|/|S|) * Entropy(Sv)
G(S, Income) = 0.97 - ((2/5)*0 + (2/5)*1 + (1/5)*0) = 0.57
Attribute (Student):
              student=yes   student=no
buys=yes      2             0
buys=no       0             3
E(yes) = -(2/2) log2(2/2) - (0/2) log2(0/2) = 0
E(no) = -(0/3) log2(0/3) - (3/3) log2(3/3) = 0
Information gain (Student):
Using the formula: G(S, A) = Entropy(S) - Σ (|Sv|/|S|) * Entropy(Sv)
G(S, Student) = 0.97 - ((2/5)*0 + (3/5)*0) = 0.97
Attribute (Credit rating):
              fair   excellent
buys=yes      1      1
buys=no       2      1
E(fair) = -(1/3) log2(1/3) - (2/3) log2(2/3) = 0.92
E(excellent) = -(1/2) log2(1/2) - (1/2) log2(1/2) = 1
Information gain (Credit rating):
Using the formula: G(S, A) = Entropy(S) - Σ (|Sv|/|S|) * Entropy(Sv)
G(S, Credit rating) = 0.97 - ((3/5)*0.92 + (2/5)*1) = 0.018
Attribute        Information gain
Income           0.57
Student          0.97
Credit Rating    0.018
Student has the highest information gain, so the youth branch is split on student (student = yes → buys computer, student = no → does not).
Second iteration: consider only the tuples with age = senior (the age attribute is omitted).
ID    Income    Student   Credit rating   Class: Buy Computer
4     medium    no        fair            yes
5     low       yes       fair            yes
6     low       yes       excellent       no
10    medium    yes       fair            yes
14    medium    no        excellent       no
E(S) = -(3/5) log2(3/5) - (2/5) log2(2/5) = 0.97
Attribute (Income):
              medium   low
buys=yes      2        1
buys=no       1        1
E(medium) = -(2/3) log2(2/3) - (1/3) log2(1/3) = 0.92
E(low) = -(1/2) log2(1/2) - (1/2) log2(1/2) = 1
Information gain (Income):
Using the formula: G(S, A) = Entropy(S) - Σ (|Sv|/|S|) * Entropy(Sv)
G(S, Income) = 0.97 - ((3/5)*0.92 + (2/5)*1) = 0.018
Attribute (Student):
              student=yes   student=no
buys=yes      2             1
buys=no       1             1
E(yes) = -(2/3) log2(2/3) - (1/3) log2(1/3) = 0.92
E(no) = -(1/2) log2(1/2) - (1/2) log2(1/2) = 1
Information gain (Student):
Using the formula: G(S, A) = Entropy(S) - Σ (|Sv|/|S|) * Entropy(Sv)
G(S, Student) = 0.97 - ((3/5)*0.92 + (2/5)*1) = 0.018
Attribute (Credit rating):
              fair   excellent
buys=yes      3      0
buys=no       0      2
E(fair) = -(3/3) log2(3/3) - (0/3) log2(0/3) = 0
E(excellent) = -(0/2) log2(0/2) - (2/2) log2(2/2) = 0
Information gain (Credit rating):
Using the formula: G(S, A) = Entropy(S) - Σ (|Sv|/|S|) * Entropy(Sv)
G(S, Credit rating) = 0.97 - ((3/5)*0 + (2/5)*0) = 0.97
Attribute        Information gain
Income           0.018
Student          0.018
Credit Rating    0.97
Credit rating has the highest information gain, so the senior branch is split on credit rating (fair → buys computer, excellent → does not). The final tree therefore splits on age at the root, on student for the youth branch, and on credit rating for the senior branch, with middle aged as a pure "yes" leaf.
7.1.1.7 Naïve Bayes
https://www.youtube.com/watch?v=XzSlEA4ck2I
7.1.1.8 Delta rule:
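This section has no worked text in the source, so the following is only a minimal sketch of the delta (Widrow-Hoff/LMS) rule for a single linear unit, trained on made-up data: each weight is nudged in proportion to the error (t - y) times its input.

import numpy as np

# Delta rule for a single linear unit: w <- w + eta * (t - y) * x
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])   # made-up inputs
t = np.array([0.0, 1.0, 1.0, 2.0])                               # made-up targets (t = x1 + x2)

w, b, eta = np.zeros(2), 0.0, 0.1
for _ in range(200):                          # repeated passes; gradient descent on squared error
    for xi, ti in zip(X, t):
        y = w @ xi + b                        # linear output (no threshold, unlike the perceptron)
        w += eta * (ti - y) * xi
        b += eta * (ti - y)

print("learned w:", w, "b:", b)               # should approach w = [1, 1], b = 0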
7.1.1.9 Back propagation
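This section is likewise only a heading in the source; below is a small illustrative sketch (not the document's own material) of training a two-layer network with hand-coded backpropagation on the XOR problem.

import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# XOR is not linearly separable, so a hidden layer is needed.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)     # input -> hidden (4 units)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)     # hidden -> output
lr = 0.5

for _ in range(10000):
    # Forward pass.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Backward pass (squared-error loss, sigmoid derivatives).
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    # Gradient-descent updates.
    W2 -= lr * h.T @ d_out
    b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h
    b1 -= lr * d_h.sum(axis=0)

print(np.round(out, 2).ravel())   # typically approaches [0, 1, 1, 0]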