CS9100
THE UNIVERSITY OF WARWICK
MSc Examinations: Summer 2015
CS910: Foundations of Data Analytics

Time allowed: 2 hours.
Answer SEVEN questions only: ALL FOUR from Section A and THREE from Section B.
Read carefully the instructions on the answer book and make sure that the particulars required are entered on each answer book.
Only calculators that are approved by the Department of Computer Science are allowed to be used during the examination.

Question:  1   2   3   4   5   6   7   8   9   Total
Points:    6   7   6   6   25  25  25  25  25  150

Section A
Answer ALL questions

1. For each of the following problems, state whether it would be best addressed by link prediction, a recommender system, or time series prediction: [6]
(a) A library that wants to suggest which book a borrower should borrow next
(b) A railway station trying to anticipate how many customers they will have next week
(c) A phone company wanting to predict who their customers will call next month
(d) A social network wanting to suggest which news sources its users should “follow”
(e) A social network wanting to suggest which users might be “friends”
(f) A social network wanting to know how many photographs will be shared tomorrow

Solution: Comprehension – requires student to show understanding of concepts
(a) recommender system
(b) time series prediction
(c) link prediction
(d) recommender system
(e) link prediction
(f) time series prediction

2.
Given n paired observations of two numeric random variables X and Y, give the definition of the following quantities: [7]
(a) The (empirical) probability that Y takes on a particular value y
(b) The expectation of X, E[X]
(c) The variance of X, Var[X]
(d) The expectation of XY, E[XY]
(e) The covariance of X and Y, Cov(X, Y)
(f) The Pearson Product-Moment Correlation Coefficient between X and Y
(g) The regression coefficient of determination between X and Y, $R^2$

Solution: Bookwork – primarily requires recollection of taught concepts
(a) $\Pr[Y = y] = \sum_i \mathbf{1}[Y_i = y] / n$ = fraction of examples equal to y
(b) $\mathrm{E}[X] = \sum_i X_i / n$
(c) $\mathrm{Var}[X] = \sum_i X_i^2 / n - (\mathrm{E}[X])^2$
(d) $\mathrm{E}[XY] = \sum_i X_i Y_i / n$
(e) $\mathrm{Cov}(X, Y) = \mathrm{E}[XY] - \mathrm{E}[X]\,\mathrm{E}[Y] = \mathrm{E}[(X - \mathrm{E}[X])(Y - \mathrm{E}[Y])]$
(f) $\mathrm{PMCC} = \mathrm{Cov}(X, Y) / \sqrt{\mathrm{Var}[X]\,\mathrm{Var}[Y]}$
(g) $R^2 = \mathrm{PMCC}^2$

3. For the following datasets, state what kind of system would be best suited to storing the data for the intended purpose, amongst text files/spreadsheet, database, data warehouse, noSQL system: [6]
(a) Current orders for an e-commerce website for billing
(b) List of results of a scientific experiment, to plot results
(c) Collection of billions of phone records for clustering in a MapReduce system
(d) Historical customer orders of a large mail-order business for fitting a regression model
(e) Financial transactions of a small business with two offices
(f) Marks for an exam for computing average score

Solution: Comprehension – requires student to show understanding of concepts
(a) database (DBMS)
(b) text files/spreadsheet
(c) noSQL system
(d) data warehouse
(e) database (DBMS)
(f) text files/spreadsheet

4.
For each of the following applications, state which distance function you would use out of Hamming Distance, Euclidean Distance, Dynamic Time Warping: [6]
(a) Clustering locations based on the shortest distance between them
(b) Clustering data with ten categorical attributes
(c) Clustering data formed of trajectories of different lengths
(d) Nearest neighbour classification of a month of stock market opening prices
(e) Nearest neighbour classification of leaf shapes described by length and width
(f) Nearest neighbour classification of books described by five binary attributes

Solution: Comprehension – requires student to show understanding of concepts
(a) Euclidean distance
(b) Hamming distance
(c) Dynamic Time Warping (DTW)
(d) DTW
(e) Euclidean distance
(f) Hamming distance

Section B
Choose THREE questions.

5. (a) Give three reasons why a data item might have missing values for one of its attributes. [3]

Solution: Bookwork – primarily requires recollection of taught concepts
Data may have been lost or transcribed incorrectly. The data value may never have been measured. The notion may not apply to that example: e.g., someone may not have a national insurance number if they are too young or not native. The data subject may have withheld the information for privacy reasons.

(b) Why can it be important to ensure there are no missing values in data? [2]

Solution: Bookwork – primarily requires recollection of taught concepts
Some data analysis processes require that there are no missing values in the data, and so we need to ensure that there are none present so that the analysis can proceed. For example, we may have a process that expects numeric points, and is not able to handle missing values encoded as “?” or similar.

(c) Describe three ways that a missing value might be replaced in data that was collected by carrying out a survey of people, and contrast their strengths and weaknesses.
[6]

Solution: Comprehension – requires student to show understanding of concepts
Some possibilities include:
- Drop all records with missing values. Simple to enact, but may delete a large number of examples, and may skew the distribution of data if there is a correlation between data with missing values and some other attribute.
- Fill in missing values with some plausible value, e.g. median or mean value in that attribute. Also easy to enact, but possibly simplistic and misleading.
- Fill in missing values using a model (classifier/regression). More complex to perform, and it is still not clear whether the filled-in values are suitable.
- Treat missing values as “unknown”: straightforward to do initially, but may not be compatible with downstream processing, and may lead to false patterns discovered based on clusters of missing values.
- Go back to the individuals who were surveyed and try to get their missing values. Might be time consuming and they may not want to reveal their values.

(d) A dataset contains information on patients’ heights and weights. Describe two constraints that you could use to check for outliers in this data. For each, explain whether it is feasible that a valid example could violate the given rule. [4]

Solution: Application – student needs to apply techniques they have learned
- Height/weight is not negative: cannot be violated (can’t have negative weights)
- Height less than 7 feet / 2 metres: could possibly be violated by a giant
- Height more than 4 feet: could possibly be violated by a child
- Weight less than 200 pounds / 150 kilos: could possibly be violated
- Weight at least 50 pounds: could be violated by a child
- Height and weight combined to give a body mass index of at least 10: very unlikely to be violated

(e) The following data values record the average pulse rate for each of a set of patients:
70 72 74 75 75 77 77 77 78 80 70
i. Give the median pulse rate and the variance of this quantity.
ii.
Describe how you could normalize numeric data such as the pulse data to allow them to be compared.
iii. Describe a test based on statistics to determine whether a pulse rate should be considered an outlier. [6]

Solution: Application – student needs to apply techniques they have learned
Comprehension – requires student to show understanding of concepts
[2 marks] Median: 75. Mean: 825/11 = 75. Variance: $\mathrm{E}[(X - \mathrm{E}[X])^2] = (25 + 25 + 9 + 1 + 0 + 0 + 4 + 4 + 4 + 9 + 25)/11 = 106/11 \approx 9.64$, taking the values in sorted order.
[2 marks] Could normalize by subtracting the minimum and dividing by (max − min). Or by subtracting the mean and dividing by the standard deviation, for each value.
[2 marks] Consider a value an outlier if it is more than some number of standard deviations away from the mean. Or, fit a distribution (e.g. normal) and compute the probability of seeing the observed value: if it is small, consider it an outlier.

(f) Consider the following two time-series of pulse rates over time:

            Week 1  Week 2  Week 3  Week 4  Week 5
Patient 1   70      74      76      72
Patient 2   72      72      75      72      71

Find the distance between these two time-series that would be found by dynamic time warping when the distance between pairs of values is given by the absolute difference. Show the mapping between the two series that corresponds to the distance. [4]

Solution: Application – student needs to apply techniques they have learned
The solution found makes the following mappings: (70, 72), (74, 72), (76, 75), (72, 72), (72, 71). The corresponding distance is 2 + 2 + 1 + 0 + 1 = 6.

6. An online university allows its students to sign up for different modules, where lectures are viewed as recorded videos. Any student can sign up for any module. There are now over a thousand modules, and students are having a hard time picking which modules to sign up for next, based on their interests.
(a) Describe how the university can design a recommender system which will help suggest possible modules to students. What information needs to be collected for the system to be initialized?
Discuss how the system you describe would suggest modules for a particular student. Be sure to explain any notation you use in your answer. [9]

Solution: Bookwork – primarily requires recollection of taught concepts
Comprehension – requires student to show understanding of concepts
The recommender system can predict, for each student, a “score” for each module, say from 1 for little or no interest to 5 for high interest. To do this, we need the ratings from current students for existing modules. These could be collected via online surveys. We can then model the ratings as a matrix between students and modules, and attempt to “fill in” the missing ratings. Two possible approaches were discussed in lectures:

Neighbour based methods: for each user u, find a set K of users who are similar to user u. Assign a weight $w_{u,v}$ for the similarity, based, e.g., on the correlation coefficient between the two users. Then use this to make a weighted prediction, e.g.

$$p_{u,i} = \bar{r}_u + \frac{\sum_{v \in K} (r_{v,i} - \bar{r}_v)\, w_{u,v}}{\sum_{v \in K} w_{u,v}}$$

where $p_{u,i}$ is the predicted score for user u for module i; $\bar{r}_u$ is the average rating given by user u; and $r_{v,i}$ is the rating given by user v to item i.

Latent factor analysis: assume each module and each student is represented by a vector of features, and attempt to learn these features from the data given. That is, assume each user u has a length k vector $w_u$, and each module i has a length k vector $q_i$, and predict the rating $p_{u,i} = w_u \cdot q_i$ for student u and module i. The vectors can be found by solving the minimization problem

$$\min_{q, w} \sum_{(u,i) \in R} (r_{u,i} - q_i \cdot w_u)^2$$

where R is the set of rated pairs.

Then, for each module i, find the predicted score, and suggest the modules with the highest predicted ratings to the student. Ideally, these should not include modules that the student has already taken!
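As an illustration of the neighbour based prediction formula above, here is a minimal sketch in Python. The rating data, the names `ratings`, `pearson` and `predict`, and the choice to keep only positively correlated neighbours are assumptions made for this example; they are not part of the model solution.

```python
from math import sqrt

# Hypothetical ratings on a 1-5 scale: student -> {module: rating}.
ratings = {
    "ann":   {"m1": 5, "m2": 3, "m3": 4},
    "bob":   {"m1": 5, "m2": 3, "m3": 4, "m4": 5},
    "carol": {"m1": 1, "m2": 5, "m3": 2, "m4": 1},
}

def mean_rating(user):
    """Average rating given by a user (r-bar in the formula)."""
    values = list(ratings[user].values())
    return sum(values) / len(values)

def pearson(u, v):
    """Correlation between two users over the modules both have rated."""
    common = [i for i in ratings[u] if i in ratings[v]]
    if len(common) < 2:
        return 0.0
    mu = sum(ratings[u][i] for i in common) / len(common)
    mv = sum(ratings[v][i] for i in common) / len(common)
    num = sum((ratings[u][i] - mu) * (ratings[v][i] - mv) for i in common)
    den = sqrt(sum((ratings[u][i] - mu) ** 2 for i in common)
               * sum((ratings[v][i] - mv) ** 2 for i in common))
    return num / den if den > 0 else 0.0

def predict(u, item, k=2):
    """p_{u,i}: user mean plus weighted mean-centred ratings of neighbours."""
    candidates = [(pearson(u, v), v) for v in ratings
                  if v != u and item in ratings[v]]
    # Assumption: keep the k most similar users with positive weight.
    neighbours = sorted((c for c in candidates if c[0] > 0), reverse=True)[:k]
    total_weight = sum(w for w, _ in neighbours)
    if total_weight == 0:
        return mean_rating(u)
    offset = sum(w * (ratings[v][item] - mean_rating(v)) for w, v in neighbours)
    return mean_rating(u) + offset / total_weight

print(predict("ann", "m4"))  # 4.75
```

Here bob agrees perfectly with ann on the shared modules, so his above-average rating of m4 lifts ann's prediction above her own mean of 4, while carol's negative correlation excludes her from the neighbour set.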
Marks scheme:
2 marks: explain that some initial rating data from students for modules is needed, and describe the scale on which it is drawn.
3 marks: give a good outline description of a suitable recommender system approach (either neighbour based or latent vector based).
3 marks: make it clear how a recommendation for user u and item i is found, using suitable mathematical notation, and how the input data is used.
1 mark: explain how these recommendations can be used to suggest the best predicted modules for the student.

(b) Students sometimes want to know why a particular module was recommended to them. Discuss whether such explanations can be provided for your proposed system. [2]

Solution: Comprehension – requires student to show understanding of concepts
The neighbour based methods have a partial explanation: the system can point to other users who also liked that combination of modules (but may not be able to name them for privacy reasons). For latent factor methods, it is not clear whether the factors found lend themselves to simple explanations, but possibly the data can be plotted on the first 2 factors to see if there is a meaningful clustering of modules and users.

(c) Explain how your solution could be extended to include prerequisites: some modules cannot be taken until their prerequisite modules have been taken. [2]

Solution: Application – student needs to apply techniques they have learned
A simple solution is to not allow a module to be suggested until its prerequisites have been completed. This can be checked when the recommendations are generated, and the list can be filtered accordingly. One could instead try to include this in the model, e.g. by penalizing items where the user does not have the prerequisites, but this becomes more complex.

(d) Describe how the quality of the developed system can be evaluated by the university.
[3]

Solution: Comprehension – requires student to show understanding of concepts
The university can hold back some rating data as “test data”, and use the rest as “training data” on which to build the system. It can then evaluate the error on the predictions for the test data, for example by computing the Root-Mean-Square-Error (RMSE) of the difference between the predicted and observed ratings. An RMSE of less than 1 for data rated 1-5, for example, might be considered acceptable. More directly, the university could simply ask the students how much they like the system.

(e) What problems with the recommender system will emerge when a new module is introduced? What can be done to overcome this? [5]

Solution: Application – student needs to apply techniques they have learned
(2 marks) When a new module is introduced, no students have rated it, and so the system has no meaningful data about it to use. Consequently it will not be suggested for anyone to study, and so it may not reach any students.
(3 marks) This can be handled by trying to elicit ratings for it, say by suggesting it to some users amongst their other suggestions. To do this in a more principled way, properties of the module can be identified (such as being a computer science module), and it can be suggested to those who have already taken several similar modules. Other features of the module (difficulty, mathematical content etc.) could be extracted and used to help make meaningful recommendations.

(f) When new students join the university, how should an initial set of modules be suggested to them? [4]

Solution: Application – student needs to apply techniques they have learned
The students can express some interest in particular areas of study, such as maths or computer science, and some modules from that area can be suggested. These could be the most popular modules in those areas, or they could be some special “introduction” or “foundations” modules.
They could also rate some other objects, like their A-levels. If many other users have rated these, they can be used to learn properties of the user.

7. (a) Explain why clustering is considered an “unsupervised” learning method. [3]

Solution: Bookwork – primarily requires recollection of taught concepts
No explicit label is associated with the data points, and there is no notion of training versus test set in clustering. Instead, the object is to identify meaningful groups of points within the data set. This is in contrast to classification and regression, where there are labels on training data, and the task is considered to be “supervised”.

(b) The hierarchical agglomerative clustering method proceeds by considering each input point as an initial cluster, and repeatedly merging together the closest pair of clusters until a single cluster remains. Describe how this can be used to find a clustering into k clusters. [2]

Solution: Bookwork – primarily requires recollection of taught concepts
Simply terminate the process when only k clusters remain, or backtrack from the final state and “undo” the last k − 1 cluster merges.

(c) Define the k-centre objective. Give an example that shows a furthest point clustering that achieves a 2-approximation to the k-centre objective. Explain your example. [6]

Solution: Bookwork – primarily requires recollection of taught concepts
Application – student needs to apply techniques they have learned
[2 marks] The k-centre objective is to find a clustering that minimizes the maximum cluster diameter, where the diameter of a cluster is defined as the maximum distance between any two points placed in the same cluster.
[4 marks] The simplest example is for 1-centre clustering in Euclidean space, as in the example shown below:

[Figure: three points x, y, z drawn in a row]

If point x is chosen as the (arbitrary first) cluster centre, then the diameter is twice the size of that if y is chosen.
This can be generalized to higher values of k by adding k − 1 additional points far from this gadget. Then if x is chosen as the first centre, and the k − 1 points are chosen as the next k − 1 centres, we still have that the diameter is twice the optimal.

(d) For the DBSCAN algorithm, define the notion of density-reachable given parameters ε and MinPts. Show an example where p is density reachable from q but it is not the case that q is density reachable from p. Explain your example, and make it clear what are the values of MinPts and ε. [6]

Solution: Bookwork – primarily requires recollection of taught concepts
Application – student needs to apply techniques they have learned
Define the neighbourhood of a point q as $N_\varepsilon(q) = \{p \in \text{data} \mid d(p, q) \le \varepsilon\}$.
A point p is directly density reachable from q if $p \in N_\varepsilon(q)$ and $|N_\varepsilon(q)| \ge \text{MinPts}$.
A point p is density-reachable from point q (under ε, MinPts) if there is a chain of points $p_1, \ldots, p_n$, where $p_1 = q$, $p_n = p$ and for all i, $p_{i+1}$ is directly density-reachable from $p_i$.

[Figure: points A and B with circles of radius ε drawn around them]

In the example, point A is not density reachable from B (with MinPts = 3), but B is density reachable from A. The radius of the circles is ε.

(e) Use a small example to show why the k-means algorithm may not find the global optimum, that is, may not find a clustering that minimizes the sum of squared distances within a cluster. [5]

Solution: Comprehension – requires student to show understanding of concepts
This picture shows a possible suboptimal solution with k = 3:

[Figure: a suboptimal k-means solution with k = 3]

A specific example is as follows:

[Figure: a specific example of a suboptimal k-means solution]

The algorithm has converged on a solution that places two cluster centres within a single optimal cluster, leaving only one remaining centre to cover two optimal clusters. The solution is a local optimum, meaning that repeated application of the k-means algorithm will not alter the solution or improve the cost, but it is not the global optimum. This can happen due to the random initial allocation of cluster centres.
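The local optimum described above can be reproduced with a small numeric sketch. The one-dimensional points, the two sets of starting centres and the function name `kmeans_1d` are assumptions chosen for this illustration; the update rule is the standard assign-then-average iteration rather than any particular library implementation.

```python
def kmeans_1d(points, centres, max_iter=100):
    """Run k-means on 1-D points from the given starting centres.

    Returns the final centres and the sum of squared distances of each
    point to its nearest centre.
    """
    centres = list(centres)
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centre.
        clusters = [[] for _ in centres]
        for p in points:
            nearest = min(range(len(centres)),
                          key=lambda j: (p - centres[j]) ** 2)
            clusters[nearest].append(p)
        # Update step: move each centre to the mean of its cluster.
        new_centres = [sum(c) / len(c) if c else centres[j]
                       for j, c in enumerate(clusters)]
        if new_centres == centres:
            break  # converged: a further iteration would change nothing
        centres = new_centres
    cost = sum(min((p - c) ** 2 for c in centres) for p in points)
    return centres, cost

# Three well separated pairs, so the optimal 3-clustering is obvious.
points = [0, 1, 10, 11, 20, 21]

# Two centres start inside the first pair, one between the other two pairs.
bad = kmeans_1d(points, [0, 1, 15])
# One centre starts inside each pair.
good = kmeans_1d(points, [0, 10, 20])

print(bad)   # ([0.0, 1.0, 15.5], 101.0)
print(good)  # ([0.5, 10.5, 20.5], 1.5)
```

Both runs have converged, but the first is stuck with two centres sharing one natural cluster and a single centre covering the other two, at a cost of 101 against the optimal 1.5.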
(f) A particular data set is clustered multiple times with the k-means algorithm, and different cluster centres are found each time. Why is this? [3]

Solution: Comprehension – requires student to show understanding of concepts
The k-means algorithm is not deterministic: its result depends on which points are selected as starting points. If different starting points are chosen, then a different clustering may be found. The implementation of the algorithm must be choosing a different random selection each time.

8. Consider the following social network represented as a graph, where the nodes represent individuals, and edges represent a declared (symmetric) “friendship” relation between pairs.

[Figure: friendship graph on the nodes a, b, c, d, e, f]

(a) State the degree of node b; the shortest-path distance between nodes b and f; and the diameter of the graph. [3]

Solution: Application – student needs to apply techniques they have learned
degree = 3
distance = 3
diameter = 4 (realized by a and f)

(b) Define the concept of clustering coefficient, and explain why it is a relevant measure for social networks. [4]

Solution: Bookwork – primarily requires recollection of taught concepts
Clustering coefficient is the number of triangles (embedded cliques) divided by the number of pairs of common neighbours of nodes, giving a fraction between 0 (no triangles) and 1 (all possible triangles are present). It gives a measure of how much commonality of connections there is in the network. In the social network example, it is the average fraction of pairs of friends of a user who are themselves friends.

(c) Which node(s) in the graph are the most central, using the notion of eccentricity? Explain your answer. [2]

Solution: Application – student needs to apply techniques they have learned
d is the only central node by this definition: it has the smallest maximum shortest path distance to any other, of 2.
No node is more than 2 hops away from d, whereas for all other nodes some node is at distance at least 3.

(d) What is the “betweenness” of node e? [3]

Solution: Application – student needs to apply techniques they have learned
Betweenness of a node is the number (or fraction) of shortest paths passing through it. All shortest paths between f and all other nodes pass through e (5), as do those between e and all other nodes, not counting f (4). This is 9 out of the total of 15 shortest paths (6 * 5 / 2).

(e) Explain the Page Rank method for measuring importance of nodes within a graph and how it assigns ranks to nodes. [6]

Solution: Bookwork – primarily requires recollection of taught concepts
• Basic idea: each directed edge in the graph represents a “vote” from the source of the edge that the destination is “good”. So score a page by the sum of the votes it receives.
• Model the adjacency table of a graph as a matrix M. The vector of page rank scores is an eigenvector of M.
• Define page rank as the principal eigenvector of M.
• Can compute either directly via linear algebra, or by power iteration.
• Start with a random / arbitrary vector of scores, and repeatedly multiply by matrix M until convergence/stopping criterion.
• Modify M by adding a smoothing factor to ensure a unique solution.

(f) A social network has information on which pairs of users are friends. Some users have expressed interest in some topics, such as their favourite sports teams, bands, and pastimes. Discuss how the social network could use its data to infer possible interests of other users who have not declared them explicitly. What assumptions would you rely on? [7]

Solution: Application – student needs to apply techniques they have learned
This can be set up as an inference problem: classification or semi-supervised learning. E.g. could try to learn whether a user likes a particular football club or not, or more broadly if they enjoy football.
Could simply use traditional classification: based on properties of the user (age, location, other demographics) build a classifier and train with the training data. Could try to use the link structure of the graph: assume that friends share similar interests (homophily), and fill in the missing labels accordingly. E.g. could use local voting to spread the labels (look at what labels are popular in the neighbourhood of the user). Or look for a similar node based on its pattern of neighbours, and copy the labels that are seen on that node, based on co-citation regularity. Any solution based on a good formalization of the problem and proposed solution will be acceptable.

9. Consider the following data set of 12 items with 3 attributes:

Item   X    Y    Z
1      8    7    8
2      13   18   10
3      8    7    7
4      8    6    7
5      9    12   8
6      8    4    6
7      9    9    8
8      4    3    3
9      8    7    6
10     5    3    6
11     6    4    5
12     10   16   10

(a) Compute the variance of the X values and the covariance between the Y and Z values. Show your calculations clearly. [6]

Solution: Application – student needs to apply techniques they have learned
Var[X] (2 marks):
E[X] = (8 + 13 + 8 + 8 + 9 + 8 + 9 + 4 + 8 + 5 + 6 + 10)/12 = 96/12 = 8.
$\mathrm{E}[(X - \mathrm{E}[X])^2]$ = (0 + 25 + 0 + 0 + 1 + 0 + 1 + 16 + 0 + 9 + 4 + 4)/12 = 60/12 = 5.
Cov[Y, Z] (4 marks):
E[YZ] = (56 + 180 + 49 + 42 + 96 + 24 + 72 + 9 + 42 + 18 + 20 + 160)/12 = 768/12 = 64.
E[Z] = (8 + 10 + 7 + 7 + 8 + 6 + 8 + 3 + 6 + 6 + 5 + 10)/12 = 84/12 = 7.
E[Y] = (7 + 18 + 7 + 6 + 12 + 4 + 9 + 3 + 7 + 3 + 4 + 16)/12 = 96/12 = 8.
Cov[Y, Z] = E[YZ] − E[Y]E[Z] = 64 − (8 ∗ 7) = 8.

It is further calculated that Var[Y] = 45/2, Var[Z] = 11/3, Cov[X, Y] = 19/2, and Cov[X, Z] = 23/6.

(b) Find the $R^2$ value between Y and Z, and give the corresponding model Z = aY + c. [6]

Solution: Application – student needs to apply techniques they have learned
[2 marks] $R^2 = (\mathrm{Cov}[Y, Z])^2 / (\mathrm{Var}[Y]\,\mathrm{Var}[Z])$ = 64/(45 ∗ 11/6) = 384/495 = 0.776
[2 marks] a = Cov[Y, Z]/Var[Y] = 8/22.5 = 16/45 = 0.356.
[2 marks] c = E[Z] − aE[Y] = 7 − (8/22.5) ∗ 8 = 7 − 128/45 = 4.16
So the model is z = 0.356y + 4.16

(c) Some other regression models are computed over the data: A linear model Z = 0.767X + 0.867 is found with $R^2$ value 0.802. A multilinear model Z = 0.161Y + 0.461X + 2.03 with $R^2$ value of 0.833. Discuss the quality of these two models and that found in the previous part for modeling the data. [5]

Solution: Comprehension – requires student to show understanding of concepts
Application – student needs to apply techniques they have learned
All models achieve a good $R^2$ value, indicating that there is a good fit of the model to the data. They all show that Z increases as X and Y both increase. Between the models in a single variable, X explains the behaviour of Z better than Y. The model with both X and Y has an appreciably better fit, and is the model of choice. The models show a different dependence on the variables and the constant, indicating that none is a perfect fit for the data.

(d) Compute the prediction of Z from the three models for the point X = 12, Y = 3, and comment on what you find. [4]

Solution: Application – student needs to apply techniques they have learned
Comprehension – requires student to show understanding of concepts
[2 marks] The first model (in X only) predicts Z = 10.071. The second (in Y only) predicts Z = 5.228. The third (in both X and Y) predicts Z = 8.045.
[2 marks] These are quite different. The more powerful multilinear model predicts Z approximately 8, so we might lean towards this value. However, note that the target point is quite different from the training data: the X and Y values in the training data are quite similar, whereas this point is quite far from parity. We might conclude that it is sufficiently different from the training data that we would not trust the model for this point.

(e) Comment on the suitability of using the different models to predict the Z value for the point X = 21, Y = 25.
[4]

Solution: Comprehension – requires student to show understanding of concepts
The first model (in X only) predicts Z = 16.974, the second (in Y only) predicts Z = 13.06 and the third (in both X and Y) predicts Z = 15.736. These are all quite different values. This should not be surprising: the target point falls quite far from the range of the rest of the data, and so there is a large amount of extrapolation going on. There is no evidence for how Z actually depends on X and Y in this range.
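The calculations for question 9 can be checked with a short Python sketch. The data are the twelve items from the question and the model coefficients are copied from parts (b) and (c); the helper names (`mean`, `cov`, `var`, `predict_x`, `predict_y`, `predict_xy`, `inside_training_range`) are hypothetical and introduced only for this illustration.

```python
# Data set from question 9.
X = [8, 13, 8, 8, 9, 8, 9, 4, 8, 5, 6, 10]
Y = [7, 18, 7, 6, 12, 4, 9, 3, 7, 3, 4, 16]
Z = [8, 10, 7, 7, 8, 6, 8, 3, 6, 6, 5, 10]

def mean(values):
    return sum(values) / len(values)

def cov(a, b):
    """Cov[A, B] = E[AB] - E[A]E[B], dividing by n as in the solutions."""
    return mean([p * q for p, q in zip(a, b)]) - mean(a) * mean(b)

def var(a):
    return cov(a, a)

# Part (b): simple linear regression of Z on Y.
a = cov(Y, Z) / var(Y)                      # slope, 16/45
c = mean(Z) - a * mean(Y)                   # intercept
r2 = cov(Y, Z) ** 2 / (var(Y) * var(Z))     # coefficient of determination

# Models with the rounded coefficients quoted in the question.
def predict_x(x):
    return 0.767 * x + 0.867

def predict_y(y):
    return 0.356 * y + 4.16

def predict_xy(x, y):
    return 0.161 * y + 0.461 * x + 2.03

def inside_training_range(x, y):
    """True if both coordinates lie within the ranges seen in the data."""
    return min(X) <= x <= max(X) and min(Y) <= y <= max(Y)

print(var(X), cov(Y, Z))                      # 5.0 8.0
print(round(a, 3), round(c, 2), round(r2, 3)) # 0.356 4.16 0.776
for x, y in [(12, 3), (21, 25)]:
    print((x, y), inside_training_range(x, y),
          round(predict_x(x), 3), round(predict_y(y), 3),
          round(predict_xy(x, y), 3))
```

The range check is only a crude test: the point (12, 3) lies inside the range of each variable separately, yet it is an unusual combination for this data, which is the concern raised in part (d); the point (21, 25) lies outside both ranges, which is the extrapolation concern in part (e).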