
Rubrics

Solution : 1
Visualizing the histograms of each feature of the data.
Feature transformations (taking the log or not).
Feature normalization (z-scoring).
Clustering the data using K-means.
Projection of the data using PCA.
Correct order: E, D, A, C, B
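Not part of the original rubric: the following is a minimal Python sketch of such a pipeline, assuming scikit-learn and matplotlib are available and using a random placeholder matrix X in place of the assignment's data.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

X = np.abs(np.random.randn(200, 5)) + 0.1            # placeholder positive-valued data

# Visualize the histogram of each feature.
for j in range(X.shape[1]):
    plt.hist(X[:, j], bins=30, alpha=0.5)
plt.show()

X_log = np.log(X)                                     # feature transformation (log or not)
X_std = StandardScaler().fit_transform(X_log)         # feature normalization (z-scoring)
labels = KMeans(n_clusters=3, n_init=10).fit_predict(X_std)   # K-means clustering
X_2d = PCA(n_components=2).fit_transform(X_std)       # PCA projection (e.g. to plot the clusters)
```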
Solution : 2
Complexity increases by:
a) Increasing the dimension.
b) Using a full covariance matrix / non-equal diagonal.
c) Increasing K (the number of clusters).
d) Decreasing sigma.
e) Increasing the number of Gaussians.
Solution-3
a) PCA.
b) Single-linkage agglomerative clustering.
c) MDS.
d) Histogram.
e) Parzen window.
f) Self-Organizing Map (SOM).
g) Spectral clustering.
h) Itemset mining.
i) Hierarchical clustering.
j) Mixture of Gaussians (MoG).
Similarity & Difference
(a) PCA vs. SOM

Similarity: Both can be thought of as projection and dimensionality-reduction algorithms. PCA does this by defining new principal components that are orthogonal to each other and are linear combinations of all the existing dimensions. SOM projects the entire data onto a lower-dimensional space (usually a 2D grid), where each observation is a point in the new space and similar observations lie closer to each other.

Difference: SOM is a visualization tool in which the entire data is projected onto a 2D grid so that similar observations end up close to each other; hence it effectively clusters the data. PCA does not produce any direct clustering. Also, PCA works only on numeric data, while SOM can work on numeric as well as text data.
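As a rough illustration (not from the original solution), PCA can be run with scikit-learn and a SOM with the third-party minisom package; the 10x10 grid, iteration count, and Iris data are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from minisom import MiniSom   # third-party package, assumed installed (pip install minisom)

X = StandardScaler().fit_transform(load_iris().data)

# PCA: linear projection onto orthogonal principal components.
X_pca = PCA(n_components=2).fit_transform(X)

# SOM: projection onto a 2D grid; similar observations land in nearby grid cells.
som = MiniSom(10, 10, X.shape[1], sigma=1.0, learning_rate=0.5, random_seed=0)
som.train_random(X, 1000)
cells = np.array([som.winner(x) for x in X])   # (row, col) grid cell per observation

print(X_pca[:3], cells[:3])
```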
(b) PCA vs. MDS

Similarity: Both are ways to do projections. While PCA projects the data onto orthogonal principal components, MDS projects the data onto an X-Y plane such that the proximity between points is preserved. Both algorithms help reduce dimensionality.

Difference: While PCA works on multi-dimensional numeric data, MDS can be used in cases where only pairwise distances are given or can be calculated.
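A brief scikit-learn sketch of this difference in inputs, for illustration only: PCA takes the raw feature matrix, while MDS here is fed a precomputed pairwise-distance matrix.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.manifold import MDS
from sklearn.metrics import pairwise_distances

X = load_iris().data

# PCA needs the original numeric feature vectors.
X_pca = PCA(n_components=2).fit_transform(X)

# MDS only needs a matrix of pairwise distances (computed here, but it could be given directly).
D = pairwise_distances(X)
X_mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0).fit_transform(D)
```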
(c) SOM vs. MDS

Similarity: Again, both of these are projection methods. While SOM projects the data onto a grid, MDS projects the data on the basis of the distances between observations. Both thus result in dimensionality reduction.

Difference: The main purpose of SOM is clustering, i.e. grouping similar observations together. The main purpose of MDS, however, is to create a plot of the points from the distance data while preserving the proximity between them. So SOM works on the original vector data, while MDS works on pairwise distance data.
(d) Spectral Clustering vs. Partitional Clustering

Similarity: They are similar in their underlying principle: both try to divide the data into different clusters, and the number of clusters must be decided before starting the clustering process, i.e. the number of clusters is an input to the algorithm.

Difference: While spectral clustering is applied to graph data, or to cases where pairwise distance data is given, partitional clustering is done on vector data.
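For illustration (not from the original solution), a scikit-learn sketch of the two inputs: both algorithms take the number of clusters up front, but SpectralClustering is run on a precomputed affinity matrix while KMeans runs on the raw vectors; the RBF scale of 0.1 is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.metrics import pairwise_distances

X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

# Partitional clustering: runs directly on the vector data; K is an input.
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Spectral clustering: runs on a precomputed affinity (similarity) matrix; K is again an input.
affinity = np.exp(-pairwise_distances(X) ** 2 / 0.1)
sp_labels = SpectralClustering(n_clusters=2, affinity="precomputed", random_state=0).fit_predict(affinity)
```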
(e) K-means clustering vs. Mixture of Gaussians (MOG)

Similarity: Both algorithms are used for dividing the data into groups, which are called clusters in k-means and mixture components in MoG. In a sense, k-means is a more specialized version of a Mixture of Gaussians.

Difference: While k-means only looks at the mean vectors, a Mixture of Gaussians looks at both the mean vectors and the covariance matrices. As a result, a Mixture of Gaussians allows elliptical Gaussians with different radii, whereas k-means allows only circular Gaussians of fixed radius.
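A minimal scikit-learn comparison, for illustration: KMeans estimates only the K mean vectors, while GaussianMixture with full covariances also fits one covariance matrix per component, allowing elliptical clusters.

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_.shape)                 # (3, 2): only mean vectors

gm = GaussianMixture(n_components=3, covariance_type="full", random_state=0).fit(X)
print(gm.means_.shape, gm.covariances_.shape)    # (3, 2) and (3, 2, 2): means plus covariances
```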
(f) Parzen Window density estimation vs. Mixture-of-Gaussians (MOG) density estimation

Similarity: Both are density estimation algorithms, i.e. they estimate the probability of a point given the existing data. In a way, the Parzen window is a special form of a Mixture of Gaussians in which the number of mixture components equals the number of observations.

Difference: The Parzen window is non-parametric and thus requires the entire data set to calculate the density at any point. MoG is a parametric form of density estimation, wherein we estimate the parameters of the assumed probability density function and then calculate the density of a point using that density function. The Parzen window is not suited for real-time analysis, while MoG can be used for real-time density estimation and hence for real-time outlier/fraud detection.
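A small scikit-learn sketch of the two density estimates, for illustration: KernelDensity keeps every training point (one kernel per observation), while GaussianMixture summarizes the data with a few parametric components; the bandwidth and component count are arbitrary choices.

```python
import numpy as np
from sklearn.neighbors import KernelDensity
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0, 1, 300), rng.normal(6, 0.5, 100)]).reshape(-1, 1)
query = np.array([[0.0], [3.0], [6.0]])

# Parzen-window estimate: non-parametric, needs all N points at query time.
kde = KernelDensity(kernel="gaussian", bandwidth=0.4).fit(X)
print(np.exp(kde.score_samples(query)))

# MoG estimate: parametric, only the fitted means/covariances/weights are needed at query time.
gm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(np.exp(gm.score_samples(query)))
```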
(h) Support of an Item-Set vs. Coherence of an Item-set

Similarity: Both of these are measures over a group of products that help us understand whether the products of the group are bought together or not. Both should have a high value in their respective algorithms.

Difference: They are used in different contexts. Support is used in itemset mining and represents the number of times the group of products (the itemset) was bought together. Coherence is used when doing co-occurrence analysis, in which we try to find the soft maximal clique, that is, a clique which has more coherence than both its up-neighbors and down-neighbors.
(g) Frequent Itemset Mining vs. Logical Itemset Mining

Similarity: They are similar in the purpose they want to achieve, i.e. to understand the latent intent of the customer and to find patterns in the purchase of items, and then use these patterns for product promotion, inventory planning, or store layout optimization.

Difference: While frequent itemset mining is used to find large and frequent itemsets from the data of products purchased together (retail data), it fails to capture the entire latent intent of the customer. This is because one purchase (or a group of purchases) can represent either a subset of the complete latent intent or a mixture of two latent intents. Logical itemset mining overcomes these deficiencies by looking at pairwise co-occurrences across customers to build a logical itemset, which represents the complete latent intent and is able to capture rare events as well.
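A toy illustration (not the course's algorithm) of the underlying difference: the support of the full itemset can be low even when every pair of its items co-occurs often across transactions; the grocery items and baskets are made up.

```python
from collections import Counter
from itertools import combinations

transactions = [{"milk", "bread"}, {"milk", "bread", "diapers"},
                {"bread", "diapers"}, {"milk", "diapers"}]

# Frequent-itemset view: support of the exact itemset.
target = {"milk", "bread", "diapers"}
support = sum(target <= t for t in transactions)
print(support)            # only 1, although the three items are clearly related

# Pairwise co-occurrence view: counts pooled across all transactions.
pair_counts = Counter(pair for t in transactions
                      for pair in combinations(sorted(t), 2))
print(pair_counts)        # every pair among the three items co-occurs twice
```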
(i) Frequent Item-set Mining vs. N-gram patterns mining

Similarity: They are similar in the sense that in both algorithms we are trying to find things that occur together. In itemset mining it is items that are bought together, while in n-gram mining it is words that are used together.

Difference: While in itemset mining the order does not matter, it matters in n-gram pattern mining. Also, in n-gram pattern mining you need more than one consistency matrix. For example, if you are trying to find an n-gram of length 4, you will need to check consistency for 1 step ahead, 2 steps ahead, and 3 steps ahead, and all of these should be high for an n-gram of length 4.
(j) K-means clustering vs. Spherical K-means clustering

Similarity: Both are partitional clustering algorithms and are similar in the sense that they do clustering using an EM-style algorithm with two iterative steps: assigning each data point to a cluster center and then recalculating the cluster centers. Also, both require the number of clusters as an input, i.e. k is a hyperparameter that needs to be decided before starting the algorithm.

Difference: While k-means is applied to vector data, spherical k-means is used for clustering text data. As a result, two additional steps are required in spherical k-means. First, the text data is normalized so that document length does not matter and only the relative proportions of words matter; we use TF-IDF for this purpose. Second, this normalization is applied again when the new cluster centers are computed at the end of each iteration, so that the length of each new mean vector is the same as that of the data points.
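A rough scikit-learn sketch, for illustration only: TF-IDF vectors are length-normalized and then clustered with ordinary KMeans, which approximates spherical k-means in one shot; re-normalizing the centroids at every iteration, as described above, would require a custom loop. The toy documents are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans

docs = ["the cat sat on the mat", "dogs and cats", "stock markets fell",
        "the market rallied today", "my dog chased the cat"]

# Spherical k-means approximation: TF-IDF + unit-length documents, then KMeans.
X = normalize(TfidfVectorizer().fit_transform(docs))   # TF-IDF is already L2-normalized; normalize() is a safeguard
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)
```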
Solution-5
Parameters (with their dimensions), hyperparameters, and their definitions:
a) PCA projections [K x D]; hyperparameter: K = no. of projections.
b) The K mean vectors [K x D]; hyperparameter: K = no. of clusters.
c) The projected points [K x N]; hyperparameter: K = no. of MDS dimensions.
d) The SOM grid clusters [K x K x D]; hyperparameter: K = size of the grid.
e) The mean and the covariance of each Gaussian [K x D + K x D]; hyperparameters: K = no. of Gaussians, plus the fact that the covariance matrix is diagonal.
f) The means of the top-level and next-level clusters [K1 x D + K1 x K2 x D]; hyperparameters: K1, K2 = no. of clusters at the two levels.
Solution-6
With M_d = (d+1)^2:

a) M_1 \times M_2 \times \cdots \times M_D - 1
   = \prod_{d=1}^{D} (d+1)^2 - 1
   = \left[ (D+1)! \right]^2 - 1

b) (M_1 - 1) + (M_2 - 1) + \cdots + (M_D - 1)
   = \sum_{d=1}^{D} \left[ (d+1)^2 - 1 \right]
   = \sum_{d=1}^{D} (d^2 + 2d + 1) - D
   = \frac{D(D+1)(2D+1)}{6} + D(D+1)

c) (M_1 - 1) + (M_2 - 1) M_1 + (M_3 - 1) M_2 + \cdots + (M_D - 1) M_{D-1}
   = (M_1 - 1) + (M_1 M_2 - M_1) + (M_2 M_3 - M_2) + \cdots + (M_{D-1} M_D - M_{D-1})
   = \sum_{d=2}^{D} M_{d-1} M_d - \sum_{d=2}^{D-1} M_d - 1
   = \sum_{d=2}^{D} d^2 (d+1)^2 - \sum_{d=2}^{D-1} (d+1)^2 - 1
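As a quick sanity check (not part of the original solution), the following Python snippet compares the direct counts with the closed forms above for a few values of D, taking M_d = (d+1)^2.

```python
import math

def M(d):
    # Number of values of the d-th variable, as used in the solution above.
    return (d + 1) ** 2

for D in range(2, 8):   # start at D = 2: the closed form for c) assumes at least two variables
    Ms = [M(d) for d in range(1, D + 1)]

    # a) Full joint distribution: prod(M_d) - 1 parameters.
    a_direct = math.prod(Ms) - 1
    a_closed = math.factorial(D + 1) ** 2 - 1

    # b) Fully independent variables: sum(M_d - 1) parameters.
    b_direct = sum(m - 1 for m in Ms)
    b_closed = D * (D + 1) * (2 * D + 1) // 6 + D * (D + 1)

    # c) First-order Markov chain: (M_1 - 1) + sum_{d>=2} (M_d - 1) * M_{d-1}.
    c_direct = (Ms[0] - 1) + sum((Ms[d] - 1) * Ms[d - 1] for d in range(1, D))
    c_closed = (sum(d ** 2 * (d + 1) ** 2 for d in range(2, D + 1))
                - sum((d + 1) ** 2 for d in range(2, D))
                - 1)

    assert a_direct == a_closed and b_direct == b_closed and c_direct == c_closed
    print(D, a_direct, b_direct, c_direct)
```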
Solution-7
a) PARAMETERS: (u_j, v_j) for j = 1, \dots, M [2 x M parameters]
b) LATENT PARAMETERS: \delta_{i,j}, indicating whether apartment i is associated with library j [N x M parameters]
c) OPTIMIZATION FUNCTION:

\text{minimize: } J(\mathbf{U}, \mathbf{V} \mid \boldsymbol{\Delta})
  = \sum_{i=1}^{N} k_i \sum_{j=1}^{M} \delta_{i,j} \, C(i,j)
  = \sum_{i=1}^{N} \sum_{j=1}^{M} k_i \, \delta_{i,j} \left[ (u_j - x_i)^2 + (v_j - y_i)^2 \right]

d) OPTIMIZATION SOLUTION:

\frac{\partial J(\mathbf{U}, \mathbf{V} \mid \boldsymbol{\Delta})}{\partial u_j}
  = \sum_{i=1}^{N} \delta_{i,j} k_i \frac{\partial C(i,j)}{\partial u_j}
  = \sum_{i=1}^{N} \delta_{i,j} k_i \frac{\partial \left[ (u_j - x_i)^2 + (v_j - y_i)^2 \right]}{\partial u_j}
  = 2 \sum_{i=1}^{N} \delta_{i,j} k_i (u_j - x_i) = 0

\Rightarrow \sum_{i=1}^{N} \delta_{i,j} k_i u_j = \sum_{i=1}^{N} \delta_{i,j} k_i x_i
\Rightarrow u_j \sum_{i=1}^{N} \delta_{i,j} k_i = \sum_{i=1}^{N} \delta_{i,j} k_i x_i,
\quad \text{so: } \hat{u}_j = \frac{\sum_{i=1}^{N} \delta_{i,j} k_i x_i}{\sum_{i=1}^{N} \delta_{i,j} k_i}
Solution-7 [continues]

\frac{\partial J(\mathbf{U}, \mathbf{V} \mid \boldsymbol{\Delta})}{\partial v_j}
  = \sum_{i=1}^{N} \delta_{i,j} k_i \frac{\partial C(i,j)}{\partial v_j}
  = \sum_{i=1}^{N} \delta_{i,j} k_i \frac{\partial \left[ (u_j - x_i)^2 + (v_j - y_i)^2 \right]}{\partial v_j}
  = 2 \sum_{i=1}^{N} \delta_{i,j} k_i (v_j - y_i) = 0

\Rightarrow \sum_{i=1}^{N} \delta_{i,j} k_i v_j = \sum_{i=1}^{N} \delta_{i,j} k_i y_i
\Rightarrow v_j \sum_{i=1}^{N} \delta_{i,j} k_i = \sum_{i=1}^{N} \delta_{i,j} k_i y_i,
\quad \text{so: } \hat{v}_j = \frac{\sum_{i=1}^{N} \delta_{i,j} k_i y_i}{\sum_{i=1}^{N} \delta_{i,j} k_i}
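For illustration (not part of the original solution), a small NumPy sketch of the resulting update: u_hat and v_hat are the weighted centroids of the apartments assigned to each library. The arrays x, y, k, and delta are randomly generated stand-ins for the problem's data.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 100, 3                                        # apartments, libraries (illustrative sizes)
x, y = rng.uniform(0, 10, N), rng.uniform(0, 10, N)  # apartment coordinates (x_i, y_i)
k = rng.integers(1, 5, N).astype(float)              # per-apartment weights k_i
delta = np.eye(M)[rng.integers(0, M, N)]             # delta[i, j] = 1 if apartment i is associated with library j

# Closed-form update derived above: weighted centroid of the assigned apartments.
# (Assumes every library has at least one assigned apartment, so the denominator is non-zero.)
w = delta * k[:, None]                               # w[i, j] = delta_{i,j} * k_i
u_hat = (w * x[:, None]).sum(axis=0) / w.sum(axis=0)
v_hat = (w * y[:, None]).sum(axis=0) / w.sum(axis=0)
print(np.c_[u_hat, v_hat])                           # optimal (u_j, v_j) for each library j
```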
Solution-8

What are the parameters? How many parameters are we estimating?
PARAMETER: p (1 parameter)

What is the observed data and knowledge that is given to us?
DATA: Coin toss observations x_1, x_2, \dots, x_{2N}
KNOWLEDGE: w_n = k \times n

Pose this as an optimization problem and simplify to make it more solvable.
[HINT: Make sure you take the weights into account.]
OBJECTIVE: Maximize the likelihood of seeing the data:

J(p) = \prod_{n=1}^{2N} P(x_n)^{w_n} = \prod_{n=1}^{2N} \left[ p^{x_n} (1-p)^{1-x_n} \right]^{w_n}

Taking the log to maximize the log-likelihood:

J(p) = \sum_{n=1}^{2N} w_n \log P(x_n)
     = \sum_{n=1}^{2N} w_n \log \left[ p^{x_n} (1-p)^{1-x_n} \right]
     = \sum_{n=1}^{2N} w_n \left[ x_n \log p + (1 - x_n) \log (1-p) \right]
Solution-8 [continues]

Estimate the parameter in terms of the data and weights.

\frac{\partial J(p)}{\partial p}
  = \sum_{n=1}^{2N} w_n \left[ \frac{x_n}{p} - \frac{1 - x_n}{1 - p} \right]
  = \frac{1}{p} \sum_{n=1}^{2N} w_n x_n - \frac{1}{1-p} \sum_{n=1}^{2N} w_n (1 - x_n) = 0

\frac{1}{p} \sum_{n=1}^{2N} w_n x_n = \frac{1}{1-p} \sum_{n=1}^{2N} w_n (1 - x_n)

(1 - p) \sum_{n=1}^{2N} w_n x_n = p \sum_{n=1}^{2N} w_n (1 - x_n)

\sum_{n=1}^{2N} w_n x_n = p \left[ \sum_{n=1}^{2N} w_n x_n + \sum_{n=1}^{2N} w_n (1 - x_n) \right] = p \sum_{n=1}^{2N} w_n

\hat{p} = \frac{\sum_{n=1}^{2N} w_n x_n}{\sum_{n=1}^{2N} w_n}
        = \frac{\sum_{n=1}^{2N} k \, n \, x_n}{\sum_{n=1}^{2N} k \, n}
        = \frac{\sum_{n=1}^{2N} n \, x_n}{\sum_{n=1}^{2N} n}
Solution-8 [continues]
If every even numbered coin toss is a heads and every odd numbered coin toss is a
tail, estimate the value of the parameter 𝑝 in terms of 𝑁

Odd numbered coin tosses are tails: 𝑥1 = 𝑥3 = 𝑥5 = ⋯ = 0

Even numbered coin tosses are heads: 𝑥2 = 𝑥4 = 𝑥6 = ⋯ = 1
σ2𝑁
2 + 4 + 6 + ⋯ + 2𝑁 2 × 1 + 2 + ⋯ + 𝑁
𝑛=1 𝑛 × 𝑥𝑛
=
=
=
σ2𝑁
1
+
2
+
3
+
⋯
+
2𝑁
1 + 2 + ⋯ + 2𝑁
𝑛
𝑛=1
Numerator:
2× 1+ 2+ ⋯+ 𝑁 = 2×
Denominator:
𝑁 𝑁+1
= 𝑁(𝑁 + 1)
2
2𝑁 2𝑁 + 1
= 𝑁 2𝑁 + 1
2
𝑵× 𝑵+𝟏
𝑵+𝟏
𝒑=
=
𝑵 × 𝟐𝑵 + 𝟏
𝟐𝑵 + 𝟏
1 + 2 + ⋯ + 2𝑁 =
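A small numerical check (not part of the original solution) of this result: it builds the alternating tails/heads sequence, applies the weighted estimate p_hat = sum(n * x_n) / sum(n), and compares it with (N+1)/(2N+1).

```python
import numpy as np

for N in (1, 5, 50):
    n = np.arange(1, 2 * N + 1)           # toss indices 1..2N
    x = (n % 2 == 0).astype(float)        # even-numbered tosses are heads (1), odd are tails (0)
    w = n                                 # weights w_n = k * n (k cancels, so take k = 1)
    p_hat = (w * x).sum() / w.sum()       # weighted MLE derived above
    assert np.isclose(p_hat, (N + 1) / (2 * N + 1))
    print(N, p_hat)
```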
Solution-9

We are given the following facts about this data:
Confidence({a,b,c} → {d}) = 1/3
Confidence({a,b} → {c,d}) = 1/6
Confidence({b,d} → {a}) = 2/5
Confidence({a} → {c}) = 7/10

[5 points] Given the above four confidence values, find the relationships among the counts U, V, W, X, and Y. [Hint: You can write each Confidence in terms of U, V, W, X, Y and simplify.]

Confidence({a,b,c} → {d}) = Support({a,b,c,d}) / Support({a,b,c}) = U / (U + V) = 1/3 ⇒ V = 2U
Confidence({a,b} → {c,d}) = Support({a,b,c,d}) / Support({a,b}) = U / (U + V + W) = 1/6 ⇒ W = 3U
Confidence({b,d} → {a}) = Support({a,b,d}) / Support({b,d}) = (U + W) / (U + W + Y) = 2/5 ⇒ Y = 6U
Confidence({a} → {c}) = Support({a,c}) / Support({a}) = (U + V + X) / (U + V + W + X) = 7/10 ⇒ X = 4U

[2 points] Compute Confidence({a} → {b,c,d}) given (a):
Confidence({a} → {b,c,d}) = Support({a,b,c,d}) / Support({a}) = U / (U + V + W + X) = U / (U + 2U + 3U + 4U) = 1/10
Solution-9 [continues]

Compute Confidence({b,d} → {a}) given (a):
Confidence({b,d} → {a}) = Support({a,b,d}) / Support({b,d}) = (U + W) / (U + W + Y) = (U + 3U) / (U + 3U + 6U) = 2/5

Compute Confidence({c} → {a,d}) given (a):
Confidence({c} → {a,d}) = Support({a,c,d}) / Support({c}) = (U + X) / (U + V + X + Y) = (U + 4U) / (U + 2U + 4U + 6U) = 5/13

If Support({a,c}) = 49, what is Support({b,d}) given (a)?
Support({a,c}) = U + V + X = U + 2U + 4U = 7U = 49 ⇒ U = 7
Support({b,d}) = U + W + Y = U + 3U + 6U = 10U = 70
Solution-9 [continues]

What is the minimum support threshold that will give {a,b,c,d} as a candidate?
For {a,b,c,d} to be a candidate, all of its subsets of size 3 should be above the threshold, i.e.
Support({a,b,c}) = U + V = U + 2U = 3U = 21
Support({a,b,d}) = U + W = U + 3U = 4U = 28
Support({a,c,d}) = U + X = U + 4U = 5U = 35
Support({b,c,d}) = U + Y = U + 6U = 7U = 49
In order for all of these supports to be at or above the threshold, the minimum support threshold can be at most 21 (i.e. 3U).
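A quick check (not part of the original solution) of these numbers, taking U = 7 as derived above together with the relations V = 2U, W = 3U, X = 4U, Y = 6U.

```python
U = 7
V, W, X, Y = 2 * U, 3 * U, 4 * U, 6 * U

support = {
    "abcd": U,
    "abc": U + V,        # = 21
    "abd": U + W,        # = 28
    "acd": U + X,        # = 35
    "bcd": U + Y,        # = 49
    "ab": U + V + W,
    "bd": U + W + Y,     # = 70
    "ac": U + V + X,     # = 49
    "a": U + V + W + X,
    "c": U + V + X + Y,
}

# Reproduce the given and computed confidences.
print(support["abcd"] / support["abc"])   # Confidence({a,b,c} -> {d}) = 1/3
print(support["abcd"] / support["ab"])    # Confidence({a,b} -> {c,d}) = 1/6
print(support["abd"] / support["bd"])     # Confidence({b,d} -> {a}) = 2/5
print(support["ac"] / support["a"])       # Confidence({a} -> {c}) = 7/10
print(support["abcd"] / support["a"])     # Confidence({a} -> {b,c,d}) = 1/10
print(support["acd"] / support["c"])      # Confidence({c} -> {a,d}) = 5/13
```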
Solution-10
What would be the MAP step for the three examples above, and what would be the REDUCE step?
MAP:
For each pair of items in the LHS of the input, we emit both the itemset's total count (n) and the unique count (1).
INPUT: 𝑎, 𝑏, 𝑐 → 𝑛1
MAP OUTPUT:
𝑎, 𝑏 → 𝑛1
𝑎, 𝑐 → 𝑛1
𝑏, 𝑐 → 𝑛1
INPUT: 𝑏, 𝑐, 𝑒, 𝑓 → 𝑛2
MAP OUTPUT:
𝑏, 𝑐 → 𝑛2
𝑏, 𝑒 → 𝑛2
𝑏, 𝑓 → 𝑛2
𝑐, 𝑒 → 𝑛2
𝑐, 𝑓 → 𝑛2
𝑒, 𝑓 → 𝑛2
Solution-10 [continues]
INPUT: 𝑏, 𝑐, 𝑔 → 𝑛3
MAP OUTPUT:
𝑏, 𝑐 → 𝑛3
𝑏, 𝑔 → 𝑛3
𝑐, 𝑔 → 𝑛3
REDUCE:
For each key, take the sum of the values and divide by the number of values.
Example:
b, c → n1
b, c → n2
b, c → n3
Output: Sum(values) / Count(values) = (n1 + n2 + n3) / 3
Solution-10 [continues]
b) You are given a weighted undirected graph as the following adjacency list:
a → {<b,1>, <c,3>, <d,5>}, b → {<d,4>}, c → {<d,3>}
Here the edge (a, b) is the same as (b, a) and has a weight of 1, and (c, d) has a weight of 3.
(Note that since the graph is undirected, the data only contains edges whose neighbour indices are higher than the node index, i.e. b → {<a,1>} is not given since it is already there in a → {<b,1>, ...}.)
We want to write a MAP/REDUCE job to generate the AVERAGE of all the edge weights associated with each node. So, the output we want here is as follows:
a → [w(a,b) + w(a,c) + w(a,d)] / 3   (a is connected to three nodes)
b → [w(a,b) + w(b,d)] / 2            (b is connected to two nodes)
c → [w(a,c) + w(c,d)] / 2            (c is connected to only two nodes)
d → [w(a,d) + w(b,d) + w(c,d)] / 3   (d is connected to three nodes)
What would be the MAP and REDUCE?
MAP:
INPUT: a → {<b,1>, <c,3>, <d,5>}
MAP OUTPUT:
a → 1
a → 3
a → 5
b → 1
c → 3
d → 5
INPUT: b → {<d,4>}
MAP OUTPUT:
b → 4
d → 4
INPUT: c → {<d,3>}
MAP OUTPUT:
c → 3
d → 3
REDUCE:
For each key, output sum(values) / count(values).
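For illustration (not the original pseudo-code), a plain-Python simulation of both MAP/REDUCE jobs: pairwise averages from itemset counts, and the average edge weight per node from the adjacency list; the counts n1, n2, n3 are replaced by made-up numbers.

```python
from collections import defaultdict
from itertools import combinations

# --- Job (a): average count per item pair --------------------------------
itemsets = {("a", "b", "c"): 10, ("b", "c", "e", "f"): 4, ("b", "c", "g"): 6}  # made-up counts for n1, n2, n3

def map_pairs(itemset, n):
    # MAP: emit (pair -> n) for every pair of items inside the itemset.
    for pair in combinations(sorted(itemset), 2):
        yield pair, n

def reduce_mean(pairs):
    # REDUCE: for each key, sum(values) / count(values).
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: sum(vals) / len(vals) for key, vals in grouped.items()}

print(reduce_mean(kv for itemset, n in itemsets.items() for kv in map_pairs(itemset, n)))

# --- Job (b): average edge weight per node --------------------------------
adjacency = {"a": [("b", 1), ("c", 3), ("d", 5)], "b": [("d", 4)], "c": [("d", 3)]}

def map_edges(node, neighbours):
    # MAP: emit the weight under both endpoints of each (undirected) edge.
    for neighbour, weight in neighbours:
        yield node, weight
        yield neighbour, weight

print(reduce_mean(kv for node, nbrs in adjacency.items() for kv in map_edges(node, nbrs)))
```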