CMPT 354 Assignment 3 Model Answer
Due: 11:59 pm, December 10, 2021
100 points in total

This model answer provides only one set of possible answers.

Question 1 (25 points)

Consider a database schema with a relation Emp whose attributes are as shown below, with types specified for multivalued attributes.

    Emp = (ename, ChildrenSet multiset(Children), SkillSet multiset(Skills))
    Children = (name, birthday)
    Skills = (type, ExamSet setof(Exams))
    Exams = (year, city)

Redesign the database into a relational database in first normal form and in fourth normal form. List any functional or multivalued dependencies that you assume. Also list all referential-integrity constraints that should be present in the first and fourth normal form schemas.

Answer. To put the schema into first normal form, we flatten all the attributes into a single relation schema.

    Employee-details = (ename, cname, bday, bmonth, byear, stype, xyear, xcity)

The attributes are renamed for clarity: cname is Children.name; bday, bmonth, and byear are the day, month, and year components of Children.birthday; stype is Skills.type; xyear and xcity are Exams.year and Exams.city.

The functional and multivalued dependencies we assume are:

    ename, cname → bday, bmonth, byear
    ename →→ cname, bday, bmonth, byear
    ename, stype →→ xyear, xcity

The FD captures the fact that a child has a unique birthday, under the assumption that one employee cannot have two children of the same name. The MVDs capture the fact that there is no relationship between an employee's children and his or her skills information.

The redesigned schema in fourth normal form is:

    Employee = (ename)
    Child = (ename, cname, bday, bmonth, byear)
    Skill = (ename, stype, xyear, xcity)

The first normal form schema is a single relation, so it has no referential-integrity constraints. In the fourth normal form schema, ename in Child and ename in Skill are foreign keys referencing Employee.

Question 2 (15 points)

The Google search engine provides a feature whereby web sites can display advertisements supplied by Google. The advertisements supplied are based on the contents of the page.
Suggest how Google might choose which advertisements to supply for a page, given the page contents. Can the similarity measures discussed in this course, TF-IDF and cosine similarity, be useful here?

Answer. Google might use the concepts of similarity-based retrieval. It can give the system a document A (the page) and the set of advertisements B, and ask the system to retrieve the advertisements in B that are most similar to A. One approach is to find the k terms in A with the highest values of TF(A, t) * IDF(t), and to use these k terms as a query to measure the relevance of other documents, here the advertisements. The cosine similarity metric can also be used to determine which advertisements to supply for a page, given the page contents. Google may also take the user's profile into account when deciding which advertisements to display, and could potentially even choose to ignore the web page content altogether.

Question 3 (20 points)

Consider tables S (A, B, C) and T (B, C, D) and the SQL query

    select A, B, C, D
    from S, T
    where S.B = T.B and S.C = T.C

Design a MapReduce program to compute the join efficiently. Please provide the pseudocode.

Answer. The map function outputs the records of both input relations, using the join attribute values (B, C) as the key and tagging each value with the relation it came from. The reduce function receives the records from both relations that share the same join attribute values and outputs all matching pairs.

    Map(record) {
        If record is a tuple (a, b, c) of S then
            emit((b, c), ("S", a))
        Else  // record is a tuple (b, c, d) of T
            emit((b, c), ("T", d))
    }

    Reduce(tuple key (b, c), list value_list) {
        S_set ← ∅; T_set ← ∅
        For each value (tag, x) in value_list {
            If tag == "S" then S_set = S_set ∪ {x}
            Else T_set = T_set ∪ {x}
        }
        For each x ∈ S_set
            For each y ∈ T_set
                Output(x, b, c, y)
    }

Question 4 (20 points)

The map-reduce framework is quite useful for creating inverted indices on a set of documents.
An inverted index stores, for each word, a list of the IDs of all documents in which it appears (offsets in the documents are also normally stored, but we shall ignore them in this question). For example, if the input document IDs and contents are as follows:

    1: data clean
    2: data base
    3: clean base

then the inverted lists would be

    data: 1, 2
    clean: 1, 3
    base: 2, 3

Give pseudocode for map and reduce functions to create inverted indices on a given set of files (each file is a document). Assume the document ID is available using a function context.getDocumentID(), and the map function is invoked once per line of the document. The output inverted list for each word should be a list of document IDs separated by commas. The document IDs are normally sorted, but for the purpose of this question you do not need to bother to sort them.

Answer.

    Map(string s) {
        For each unique word w in s
            Emit(w, context.getDocumentID())
    }

    Reduce(key w, list value_list) {
        s ← w + ": "
        // A word on several lines of one document arrives once per line, so skip repeated IDs.
        For each distinct value d in value_list
            s.append(d, ", ")  // omit the separator after the last ID
        output(s)
    }

Question 5 (20 points)

Consider the table sales as follows.

    City       Season  Product  Amount
    Vancouver  Spring  GoPro 9  100,000
    Vancouver  Fall    GoPro 9   80,000
    Toronto    Fall    GoPro 8   12,000
    Victoria   Spring  GoPro 8   60,000

Suppose we only consider the two-level hierarchy, city – province, in dimension city, and no hierarchy in other dimensions. List all tuples in the relational representation of the data cube on the sales table, that is, the complete relational representation of all cross-tabs.

Answer.
    City/Province  Season  Product  Amount
    Vancouver      Spring  GoPro 9  100,000
    Vancouver      Fall    GoPro 9   80,000
    Toronto        Fall    GoPro 8   12,000
    Victoria       Spring  GoPro 8   60,000
    Vancouver      Spring  All      100,000
    Vancouver      Fall    All       80,000
    Vancouver      All     GoPro 9  180,000
    Vancouver      All     All      180,000
    Toronto        Fall    All       12,000
    Toronto        All     GoPro 8   12,000
    Toronto        All     All       12,000
    Victoria       Spring  All       60,000
    Victoria       All     GoPro 8   60,000
    Victoria       All     All       60,000
    BC             Spring  GoPro 9  100,000
    BC             Spring  GoPro 8   60,000
    BC             Fall    GoPro 9   80,000
    BC             Spring  All      160,000
    BC             Fall    All       80,000
    BC             All     GoPro 9  180,000
    BC             All     GoPro 8   60,000
    BC             All     All      240,000
    ON             Fall    GoPro 8   12,000
    ON             All     GoPro 8   12,000
    ON             Fall    All       12,000
    ON             All     All       12,000
    All            Spring  GoPro 9  100,000
    All            Spring  GoPro 8   60,000
    All            Spring  All      160,000
    All            Fall    GoPro 9   80,000
    All            Fall    GoPro 8   12,000
    All            All     GoPro 9  180,000
    All            Fall    All       92,000
    All            All     GoPro 8   72,000
    All            All     All      252,000
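
The cube above can be cross-checked mechanically. The following Python sketch is only an illustration, not part of the model answer: it rolls every sales tuple up to each combination of (city, province, or All) x (season or All) x (product or All) and sums the amounts. The names SALES, PROVINCE, and data_cube are hypothetical, and the city-to-province mapping is an assumption read off the answer (Vancouver and Victoria in BC, Toronto in ON).

```python
from collections import defaultdict
from itertools import product

# The four base tuples of the sales table.
SALES = [
    ("Vancouver", "Spring", "GoPro 9", 100_000),
    ("Vancouver", "Fall", "GoPro 9", 80_000),
    ("Toronto", "Fall", "GoPro 8", 12_000),
    ("Victoria", "Spring", "GoPro 8", 60_000),
]

# Assumed city -> province hierarchy (hypothetical helper for this sketch).
PROVINCE = {"Vancouver": "BC", "Victoria": "BC", "Toronto": "ON"}


def data_cube(rows):
    """Return {(location, season, product): total amount} for all cross-tabs."""
    cube = defaultdict(int)
    for city, season, prod, amount in rows:
        locations = (city, PROVINCE[city], "All")  # city, province, grand total
        seasons = (season, "All")
        products = (prod, "All")
        # Each base tuple contributes to every generalization of itself.
        for key in product(locations, seasons, products):
            cube[key] += amount
    return dict(cube)


cube = data_cube(SALES)
for (location, season, prod), amount in sorted(cube.items()):
    print(f"{location:<10} {season:<7} {prod:<8} {amount:>8,}")
```

Each base tuple generalizes to 3 x 2 x 2 = 12 cells, and overlapping cells are merged by the sum, which is why four base tuples yield fewer than 48 cube tuples.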