CMPT 354 Assignment 3 Model Answer
Due: 11:59 pm, December 10, 2021
100 points in total
This model answer provides only one set of possible answers.
Question 1 (25 points)
Consider a database schema with a relation Emp whose attributes are as shown below, with types
specified for multivalued attributes.
Emp = (ename, ChildrenSet multiset(Children), SkillSet multiset(Skills))
Children = (name, birthday)
Skills = (type, ExamSet setof(Exams))
Exams = (year, city)
Redesign the database as a relational database in first normal form and in fourth normal form.
List any functional or multivalued dependencies that you assume. Also list all
referential-integrity constraints that should be present in the first and fourth normal form
schemas.
Answer.
To put the schema into first normal form, we flatten all the attributes into a single relation
schema.
Employee-details = (ename, cname, bday, bmonth, byear, stype, xyear, xcity)
We rename the attributes for the sake of clarity: cname is Children.name; bday, bmonth, and
byear are the day, month, and year of Children.birthday; stype is Skills.type; and xyear and
xcity are the Exams attributes. The FDs and multivalued dependencies we assume are
ename, cname → bday, bmonth, byear
ename →→ cname, bday, bmonth, byear
ename, stype →→ xyear, xcity
The FD captures the fact that a child has a unique birthday, under the assumption that one
employee cannot have two children of the same name. The MVDs capture the fact that there is no
relationship between the children of an employee and his or her skills information.
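The redesigned schema in fourth normal form is:
Employee = (ename)
Child = (ename, cname, bday, bmonth, byear)
Skill = (ename, stype, xyear, xcity)
The referential-integrity constraints in the fourth normal form schema are that ename in Child
and ename in Skill each reference Employee (ename). The first normal form schema is a single
relation, so it has no referential-integrity constraints.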
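The decomposition can be tried out in a small in-memory database. The following is a minimal sketch, not part of the model answer: it assumes that ename in Child and Skill references Employee, that (ename, cname) is the key of Child, and that all attributes together form the key of Skill. The column types, the function name build_schema, and the sample values in the usage note are hypothetical.

```python
import sqlite3


def build_schema() -> sqlite3.Connection:
    """Create the three 4NF relations in an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(
        """
        CREATE TABLE Employee (ename TEXT PRIMARY KEY);

        -- FD assumed in the answer: ename, cname -> bday, bmonth, byear
        CREATE TABLE Child (
            ename  TEXT NOT NULL REFERENCES Employee(ename),
            cname  TEXT NOT NULL,
            bday   INTEGER,
            bmonth INTEGER,
            byear  INTEGER,
            PRIMARY KEY (ename, cname)
        );

        -- No non-trivial FD is assumed here, so the key is all attributes.
        CREATE TABLE Skill (
            ename TEXT NOT NULL REFERENCES Employee(ename),
            stype TEXT NOT NULL,
            xyear INTEGER NOT NULL,
            xcity TEXT NOT NULL,
            PRIMARY KEY (ename, stype, xyear, xcity)
        );
        """
    )
    return conn
```

With this schema, inserting a Child or Skill row whose ename does not appear in Employee (for example a skill for an unknown employee 'Zed') raises sqlite3.IntegrityError, which is the referential-integrity check in action.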
Question 2 (15 points)
The Google search engine provides a feature whereby web sites can display advertisements
supplied by Google. The advertisements supplied are based on the contents of the page. Suggest
how Google might choose which advertisements to supply for a page, given the page contents.
Can the similarity measures discussed in this course, TF-IDF and cosine similarity, be useful here?
Answer. Google might use the concepts in similarity-based retrieval. Here, they can give the
system a document A and the set of advertisements B, and ask the system to retrieve
advertisements that are similar to A. One approach is to find k terms in A with the highest values
of TF(A,t) * IDF(t), and to use these k terms as a query to rank the advertisements by relevance.
The cosine similarity metric can also be used to determine which advertisements to supply for a
page, given the page contents.
Google may also take the user's profile into account when deciding what advertisements to
display, and could potentially even choose to ignore the web page content when deciding what
advertisement to show.
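The top-k TF-IDF query and the cosine ranking described in this answer can be sketched as follows. This is an illustration under stated assumptions, not Google's method: TF is taken as the relative term frequency and IDF as a smoothed logarithmic inverse document frequency (the course may define them differently), tokenization is a plain whitespace split, and the names tf_idf, cosine, and rank_ads are hypothetical.

```python
import math
from collections import Counter


def tf_idf(tokens, corpus):
    """Return {term: TF * IDF} for one tokenized document.

    corpus is a list of token sets, one per document (page and ads).
    """
    counts = Counter(tokens)
    n_docs = len(corpus)
    weights = {}
    for term, count in counts.items():
        df = sum(1 for doc in corpus if term in doc)
        # Assumed smoothed IDF; rarer terms get a larger weight.
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        weights[term] = (count / len(tokens)) * idf
    return weights


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_ads(page_text, ads, k=5):
    """Rank ad ids by similarity to the page's k highest-weighted terms."""
    page_tokens = page_text.lower().split()
    ad_tokens = {ad_id: text.lower().split() for ad_id, text in ads.items()}
    corpus = [set(page_tokens)] + [set(t) for t in ad_tokens.values()]
    page_vec = tf_idf(page_tokens, corpus)
    # Keep only the k terms with the highest TF * IDF as the query.
    top = sorted(page_vec.items(), key=lambda kv: kv[1], reverse=True)[:k]
    query = dict(top)
    scores = {
        ad_id: cosine(query, tf_idf(tokens, corpus))
        for ad_id, tokens in ad_tokens.items()
    }
    return sorted(scores, key=scores.get, reverse=True)
```

An advertisement that shares none of the page's top-k terms scores 0 and is ranked last; a user-profile signal, as mentioned above, would have to be combined with this score separately.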
Question 3 (20 points)
Consider tables S (A, B, C) and T (B, C, D) and SQL query
select A, S.B, S.C, D
from S, T
where S.B = T.B and S.C = T.C
Design a MapReduce program to compute the join efficiently. Please provide the pseudocode.
Answer. With the map function, output records from both the input relations, using the join
attribute value as the key. The reduce function gets records from both relations with matching
join attribute values and outputs all matching pairs.
Map (record) {
    If record is a tuple (a, b, c) of S then emit ((b, c), (“S”, a))
    Else emit ((b, c), (“T”, d)); // record is a tuple (b, c, d) of T
}
Reduce (tuple key (b, c), list value_list) {
    // Lists rather than sets, so that duplicate tuples are preserved as in SQL.
    S_list ← empty list;
    T_list ← empty list;
    For each value (tag, x) in value_list {
        If tag == “S” then append x to S_list
        Else append x to T_list;
    }
    For each x ∈ S_list
        For each y ∈ T_list
            Output (x, b, c, y)
}
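The join can be simulated on a single machine to check the logic. The following is a minimal sketch that assumes each record arrives with a tag naming its relation; the shuffle phase is imitated with a dictionary, and the names map_record, reduce_group, and mapreduce_join are hypothetical rather than part of any MapReduce framework.

```python
from collections import defaultdict


def map_record(relation, record):
    """Emit ((B, C), (tag, remaining attribute)) for one input tuple."""
    if relation == "S":
        a, b, c = record
        yield (b, c), ("S", a)
    else:  # relation == "T"
        b, c, d = record
        yield (b, c), ("T", d)


def reduce_group(key, values):
    """Pair every S value with every T value that shares the key (b, c)."""
    b, c = key
    s_values = [x for tag, x in values if tag == "S"]
    t_values = [y for tag, y in values if tag == "T"]
    for a in s_values:
        for d in t_values:
            yield (a, b, c, d)


def mapreduce_join(s_rows, t_rows):
    """Run map, an in-memory shuffle, and reduce over both relations."""
    groups = defaultdict(list)  # stands in for the framework's shuffle
    for relation, rows in (("S", s_rows), ("T", t_rows)):
        for row in rows:
            for key, value in map_record(relation, row):
                groups[key].append(value)
    result = []
    for key, values in groups.items():
        result.extend(reduce_group(key, values))
    return result
```

Because the key is the pair (B, C), only tuples that agree on both join attributes reach the same reducer, so no cross-key comparisons are needed.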
Question 4 (20 points)
The map-reduce framework is quite useful for creating inverted indices on a set of documents.
An inverted index stores for each word a list of all document IDs that it appears in (offsets in the
documents are also normally stored, but we shall ignore them in this question).
For example, if the input document IDs and contents are as follows:
1: data clean
2: data base
3: clean base
then the inverted lists would be
data: 1, 2
clean: 1, 3
base: 2, 3
Give pseudocode for map and reduce functions to create inverted indices on a given set of files
(each file is a document). Assume the document ID is available using a function
context.getDocumentID(), and the map function is invoked once per line of the document. The
output inverted list for each word should be a list of document IDs separated by commas. The
document IDs are normally sorted, but for the purpose of this question you do not need to bother
to sort them.
Answer.
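Map (string s) {
    For each unique word w in s
        Emit (w, context.getDocumentID())
}
Reduce (key w, list value_list) {
    s ← w + “: ”;
    first ← true;
    // Map runs once per line, so the same document ID may arrive more than once.
    For each distinct value d in value_list {
        If not first then s.append(“, ”);
        s.append(d);
        first ← false;
    }
    output(s);
}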
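The same map and reduce steps can be checked locally on the example documents. This is a minimal sketch under stated assumptions: the document ID is passed in explicitly in place of context.getDocumentID(), the shuffle is imitated with a dictionary, and the names map_line, reduce_word, and build_inverted_index are hypothetical.

```python
from collections import defaultdict


def map_line(doc_id, line):
    """Emit (word, doc_id) once for each distinct word on the line."""
    for word in set(line.split()):
        yield word, doc_id


def reduce_word(word, doc_ids):
    """Build 'word: id, id, ...' while dropping repeated document IDs."""
    distinct = []
    for doc_id in doc_ids:
        if doc_id not in distinct:
            distinct.append(doc_id)
    return f"{word}: " + ", ".join(str(d) for d in distinct)


def build_inverted_index(docs):
    """docs maps a document ID to the list of lines in that document."""
    groups = defaultdict(list)  # stands in for the framework's shuffle
    for doc_id, lines in docs.items():
        for line in lines:  # map is invoked once per line
            for word, emitted_id in map_line(doc_id, line):
                groups[word].append(emitted_id)
    return {word: reduce_word(word, ids) for word, ids in groups.items()}
```

For the three example documents above, called as build_inverted_index({1: ["data clean"], 2: ["data base"], 3: ["clean base"]}), the entry for data is the string "data: 1, 2".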
Question 5 (20 points)
Consider the table sales as follows.
City       Season  Product  Amount
Vancouver  Spring  GoPro 9  100,000
Vancouver  Fall    GoPro 9   80,000
Toronto    Fall    GoPro 8   12,000
Victoria   Spring  GoPro 8   60,000
Suppose we only consider the two-level hierarchy, city – province, in dimension city, and no
hierarchy in other dimensions. List all tuples in the relational representation of the data cube on
the sales table, that is, the complete relational representation of all cross-tabs.
Answer.
City/Province  Season  Product  Amount
Vancouver      Spring  GoPro 9  100,000
Vancouver      Fall    GoPro 9   80,000
Toronto        Fall    GoPro 8   12,000
Victoria       Spring  GoPro 8   60,000
Vancouver      Spring  All      100,000
Vancouver      Fall    All       80,000
Vancouver      All     GoPro 9  180,000
Vancouver      All     All      180,000
Toronto        Fall    All       12,000
Toronto        All     GoPro 8   12,000
Toronto        All     All       12,000
Victoria       Spring  All       60,000
Victoria       All     GoPro 8   60,000
Victoria       All     All       60,000
BC             Spring  GoPro 9  100,000
BC             Spring  GoPro 8   60,000
BC             Fall    GoPro 9   80,000
BC             Spring  All      160,000
BC             Fall    All       80,000
BC             All     GoPro 9  180,000
BC             All     GoPro 8   60,000
BC             All     All      240,000
ON             Fall    GoPro 8   12,000
ON             All     GoPro 8   12,000
ON             Fall    All       12,000
ON             All     All       12,000
All            Spring  GoPro 9  100,000
All            Spring  GoPro 8   60,000
All            Spring  All      160,000
All            Fall    GoPro 9   80,000
All            Fall    GoPro 8   12,000
All            All     GoPro 9  180,000
All            All     Fall... 
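The cube can be cross-checked by generating it from the four base tuples. The following is a minimal sketch, assuming Vancouver and Victoria roll up to BC and Toronto rolls up to ON (as the BC and ON totals in the answer imply) and that only non-empty cells are listed; the names PROVINCE, SALES, and data_cube are hypothetical.

```python
from collections import defaultdict
from itertools import product

# Assumed city-to-province hierarchy for the city dimension.
PROVINCE = {"Vancouver": "BC", "Victoria": "BC", "Toronto": "ON"}

SALES = [
    ("Vancouver", "Spring", "GoPro 9", 100_000),
    ("Vancouver", "Fall", "GoPro 9", 80_000),
    ("Toronto", "Fall", "GoPro 8", 12_000),
    ("Victoria", "Spring", "GoPro 8", 60_000),
]


def data_cube(sales, province):
    """Return {(location, season, product): total amount} for every
    non-empty cell, where each dimension may be generalized to 'All'
    and the city may also be generalized to its province."""
    cube = defaultdict(int)
    for city, season, item, amount in sales:
        locations = (city, province[city], "All")
        seasons = (season, "All")
        items = (item, "All")
        # Each base tuple contributes to 3 * 2 * 2 = 12 cells.
        for cell in product(locations, seasons, items):
            cube[cell] += amount
    return dict(cube)
```

Running data_cube(SALES, PROVINCE) yields 35 cells, matching the number of tuples listed in the answer, with the (All, All, All) cell equal to 252,000.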