CS9100 THE UNIVERSITY OF WARWICK MSc Examinations: Summer 2015

CS910: Foundations of Data Analytics
Time allowed: 2 hours.
Answer SEVEN questions only: ALL FOUR from Section A and THREE from Section B.
Read carefully the instructions on the answer book and make sure that the particulars required
are entered on each answer book.
Only calculators that are approved by the Department of Computer Science are allowed to be
used during the examination.
Question:   1   2   3   4   5    6    7    8    9    Total
Points:     6   7   6   6   25   25   25   25   25   150
Section A
Answer ALL questions
1. For each of the following problems, state whether it would be best addressed by link prediction, a recommender system, or time series prediction:
[6]
(a) A library that wants to suggest which book a borrower should borrow next
(b) A railway station trying to anticipate how many customers they will have next week
(c) A phone company wanting to predict who their customers will call next month
(d) A social network wanting to suggest which news sources its users should “follow”
(e) A social network wanting to suggest which users might be “friends”
(f) A social network wanting to know how many photographs will be shared tomorrow
Solution: Comprehension – requires student to show understanding of concepts
(a) recommender system
(b) time series prediction
(c) link prediction
(d) recommender system
(e) link prediction
(f) time series prediction
2. Given n paired observations of two numeric random variables X and Y, give the definition of the following quantities:
[7]
(a) The (empirical) probability that Y takes on a particular value y
(b) The expectation of X, E[X]
(c) The variance of X, Var[X]
(d) The expectation of XY, E[XY]
(e) The covariance of X and Y, Cov(X, Y)
(f) The Pearson Product-Moment Correlation Coefficient between X and Y
(g) The regression coefficient of determination between X and Y, $R^2$
Solution: Bookwork – primarily requires recollection of taught concepts
(a) $\Pr[Y = y] = \sum_i \mathbf{1}[Y_i = y] / n$ = fraction of examples equal to y
(b) $E[X] = \sum_i X_i / n$
(c) $\mathrm{Var}[X] = \sum_i X_i^2 / n - E[X]^2$
(d) $E[XY] = \sum_i X_i Y_i / n$
(e) $\mathrm{Cov}(X, Y) = E[XY] - E[X]E[Y] = E[(X - E[X])(Y - E[Y])]$
(f) $\mathrm{PMCC} = \mathrm{Cov}(X, Y) / \sqrt{\mathrm{Var}[X]\,\mathrm{Var}[Y]}$
(g) $R^2 = \mathrm{PMCC}^2$
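The definitions in Question 2 translate directly into code. The following Python sketch computes each quantity from paired samples, averaging over n as in the solution. The sample values `xs` and `ys` are made up for illustration and are not part of the exam.

```python
from math import sqrt

def mean(vals):
    # E[X] = sum_i X_i / n
    return sum(vals) / len(vals)

def variance(xs):
    # Var[X] = E[X^2] - E[X]^2
    return mean([x * x for x in xs]) - mean(xs) ** 2

def covariance(xs, ys):
    # Cov(X, Y) = E[XY] - E[X]E[Y]
    return mean([x * y for x, y in zip(xs, ys)]) - mean(xs) * mean(ys)

def pmcc(xs, ys):
    # Pearson correlation: Cov(X, Y) / sqrt(Var[X] Var[Y])
    return covariance(xs, ys) / sqrt(variance(xs) * variance(ys))

def empirical_prob(ys, y):
    # Fraction of observations equal to y
    return sum(1 for v in ys if v == y) / len(ys)

# Hypothetical paired observations (illustration only)
xs = [1, 2, 3, 4]
ys = [2, 4, 6, 8]

r = pmcc(xs, ys)
r_squared = r ** 2
```

Because `ys` is an exact linear function of `xs` in this toy data, the correlation and the coefficient of determination both come out as 1.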
3. For the following datasets, state what kind of system would be best suited to storing the
data for the intended purpose, amongst text files/spreadsheet, database, data warehouse,
noSQL system:
[6]
(a) Current orders for an e-commerce website for billing
(b) List of results of a scientific experiment, to plot results
(c) Collection of billions of phone records for clustering in a MapReduce system
(d) Historical customer orders of a large mail-order business for fitting a regression model
(e) Financial transactions of a small business with two offices
(f) Marks for an exam for computing average score
Solution: Comprehension – requires student to show understanding of concepts
(a) DBMS
(b) text/spreadsheet
(c) noSQL
(d) data warehouse
(e) DBMS
(f) text/spreadsheet
4. For each of the following applications, state which distance function you would use out of
Hamming Distance, Euclidean Distance, Dynamic Time Warping:
(a) Clustering locations based on the shortest distance between them
(b) Clustering data with ten categorical attributes
(c) Clustering data formed of trajectories of different lengths
(d) Nearest neighbour classification of a month of stock market opening prices
(e) Nearest neighbour classification of leaf shapes described by length and width
(f) Nearest neighbour classification of books described by five binary attributes
Solution: Comprehension – requires student to show understanding of concepts
(a) Euclidean distance
(b) Hamming distance
(c) DTW
(d) DTW
(e) Euclidean
(f) Hamming
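For reference, two of the distance functions named in this question can be sketched in a few lines of Python. The example inputs (five binary attributes, and a pair of length/width points) are assumptions chosen to mirror parts (f) and (e).

```python
from math import sqrt

def hamming(a, b):
    # Number of positions at which two equal-length attribute vectors differ
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length inputs")
    return sum(1 for x, y in zip(a, b) if x != y)

def euclidean(a, b):
    # Straight-line distance between two numeric points
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical books described by five binary attributes
book_distance = hamming([1, 0, 1, 1, 0], [1, 1, 1, 0, 0])

# Hypothetical leaves described by (length, width)
leaf_distance = euclidean((0.0, 0.0), (3.0, 4.0))
```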
[6]
Section B
Choose THREE questions.
5. (a) Give three reasons why a data item might have missing values for one of its attributes.
[3]
Solution: Bookwork – primarily requires recollection of taught concepts
Data may have been lost or transcribed incorrectly.
The data value may never have been measured.
The notion may not apply to that example: e.g., someone may not have a national
insurance number if they are too young or not native.
The data subject may have withheld the information for privacy reasons.
(b) Why can it be important to ensure there are no missing values in data?
[2]
Solution: Bookwork – primarily requires recollection of taught concepts
Some data analysis processes require that there are no missing values in the data,
and so we need to ensure that there are none present so that the analysis can proceed.
For example, we may have a process that expects numeric points, and is not able to
handle missing values encoded as “?” or similar.
(c) Describe three ways that a missing value might be replaced in data that was collected
by carrying out a survey of people, and contrast their strengths and weaknesses.
[6]
Solution: Comprehension – requires student to show understanding of concepts
Some possibilities include:
- Drop all records with missing values. Simple to enact, but may delete a large
number of examples, and may skew the distribution of data if there is a correlation
between data with missing values and some other attribute.
- Fill in missing values with some plausible value, e.g. median or mean value in that
attribute. Also easy to enact, but possibly simplistic and misleading.
- Fill in missing values using a model (classifier/regression). More complex to perform, still not clear if suitable.
- Treat missing values as “unknown”: straightforward to do initially, but may not be
compatible with downstream processing, and may lead to false patterns discovered
based on clusters of missing values.
- Go back to the individuals who were surveyed and try to get their missing values.
Might be time consuming and they may not want to reveal their values.
(d) A dataset contains information on patients’ heights and weights. Describe two constraints that you could use to check for outliers in this data. For each, explain whether it is feasible that a valid example could violate the given rule.
[4]
Solution: Application – student needs to apply techniques they have learned
Height/weight is not negative: cannot be violated (can’t have negative weights)
Height less than 7 feet / 2 metres: could possibly be violated by a giant
Height more than 4 feet: could possibly be violated by a child
Weight less than 200 pounds / 150 kilos: could possibly be violated
Weight at least 50 pounds: could be violated by a child.
Height and weight combined to give a body mass index of at least 10: very unlikely to be violated.
(e) The following data values record the average pulse rate for each of a set of patients:
70 72 74 75 75 77 77 77 78 80 70
i. Give the median pulse rate and the variance of this quantity.
ii. Describe how you could normalize numeric data such as the pulse data to allow
them to be compared.
iii. Describe a test based on statistics to determine whether a pulse rate should be
considered an outlier.
[6]
Solution: Application – student needs to apply techniques they have learned
Comprehension – requires student to show understanding of concepts
[2 marks] Median: 75. Mean: 75. Variance: $E[(X - E[X])^2] = (25 + 25 + 9 + 1 + 4 + 4 + 4 + 9 + 25)/11 = 106/11 \approx 9.64$
[2 marks] Could normalize by subtracting min, dividing by (max-min). Or by subtracting mean, divide by standard deviation for each value.
[2 marks] If it is more than some number of standard deviations away from its mean
value. Or, fit a distribution (e.g. normal) and compute probability of seeing the
observed value: if it is small, consider it an outlier.
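The three parts of this answer can be checked with Python's standard `statistics` module. This is a sketch only: the threshold of 3 standard deviations in `is_outlier` is an assumed choice, since the solution only says "some number" of standard deviations.

```python
from statistics import mean, median, pstdev, pvariance

pulse = [70, 72, 74, 75, 75, 77, 77, 77, 78, 80, 70]

med = median(pulse)      # middle value of the sorted data
mu = mean(pulse)         # arithmetic mean
var = pvariance(pulse)   # population variance (divide by n), i.e. 106/11
sd = pstdev(pulse)       # population standard deviation

# Min-max normalization: map values into [0, 1]
lo, hi = min(pulse), max(pulse)
min_max = [(x - lo) / (hi - lo) for x in pulse]

# Z-score normalization: subtract the mean, divide by the standard deviation
z_scores = [(x - mu) / sd for x in pulse]

def is_outlier(x, threshold=3.0):
    # Flag values more than `threshold` standard deviations from the mean.
    # The default threshold is an assumption, not given in the exam.
    return abs(x - mu) / sd > threshold
```

With these values, a pulse of 80 is about 1.6 standard deviations from the mean and is not flagged, while a reading of 120 would be.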
(f) Consider the following two time-series of pulse rates over time:
            Week 1  Week 2  Week 3  Week 4  Week 5
Patient 1     70      74      76      72
Patient 2     72      72      75      72      71
Find the distance between these two time-series that would be found by dynamic time warping when the distance between pairs of values is given by the absolute difference.
Show the mapping between the two series that corresponds to the distance.
[4]
Solution: Application – student needs to apply techniques they have learned
The solution found makes the following mappings: (70, 72), (74, 72), (76, 75), (72, 72), (72, 71). The corresponding distance is 2 + 2 + 1 + 0 + 1 = 6.
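The distance and mapping can be reproduced with the standard dynamic-programming formulation of dynamic time warping. The sketch below assumes the usual step pattern (match, or advance in either series) and breaks ties in favour of the diagonal step when backtracking; the function name `dtw` is illustrative.

```python
def dtw(a, b):
    """Return (distance, mapping) for two series under absolute-difference cost."""
    n, m = len(a), len(b)
    inf = float("inf")
    # D[i][j] = cost of the best warping of a[:i] onto b[:j]
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j - 1], D[i - 1][j], D[i][j - 1])

    # Backtrack from the end to recover which values were paired
    mapping = []
    i, j = n, m
    while i > 0 and j > 0:
        mapping.append((a[i - 1], b[j - 1]))
        steps = [
            (D[i - 1][j - 1], i - 1, j - 1),  # diagonal
            (D[i - 1][j], i - 1, j),          # advance in a only
            (D[i][j - 1], i, j - 1),          # advance in b only
        ]
        _, i, j = min(steps, key=lambda s: s[0])
    mapping.reverse()
    return D[n][m], mapping

patient1 = [70, 74, 76, 72]
patient2 = [72, 72, 75, 72, 71]
distance, mapping = dtw(patient1, patient2)
```

The final value 72 of Patient 1 is matched to both 72 and 71 in Patient 2, which is how the shorter series is stretched to cover the longer one.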
6. An online university allows its students to sign up for different modules, where lectures are
viewed as recorded videos. Any student can sign up for any module. There are now over a
thousand modules, and students are having a hard time picking which modules to sign up
for next, based on their interests.
(a) Describe how the university can design a recommender system which will help suggest possible modules to students. What information needs to be collected for the system to be initialized? Discuss how the system you describe would suggest modules for a particular student. Be sure to explain any notation you use in your answer.
[9]
Solution: Bookwork – primarily requires recollection of taught concepts
Comprehension – requires student to show understanding of concepts
The recommender system can make a prediction for each student for their predicted “score” for each module, say from 1 for little/no interest to 5 for high interest. To do this, we need the ratings from current students for existing modules. This could be collected via some online surveys.
We can then model the ratings as a matrix between students and modules, and attempt to “fill in” the missing ratings.
Two possible approaches discussed in lectures:
Neighbour based methods: for each user u, find a set K of users who are similar to user u. Assign a weight $w_{u,v}$ for the similarity, based, e.g. on the correlation coefficient between the two users. Then use this to make a weighted prediction, e.g.
$$p_{u,i} = \bar{r}_u + \frac{\sum_{v \in K} (r_{v,i} - \bar{r}_v)\, w_{u,v}}{\sum_{v \in K} w_{u,v}}$$
where $p_{u,i}$ is the predicted score for user u for module i; $\bar{r}_u$ is the average rating given by user u; and $r_{v,i}$ is the rating given by user v to item i.
Latent factor analysis: assume each module and each student is represented by a vector of features, and attempt to learn these features from the data given. That is, assume each user u has a length k vector $w_u$, and each module i has a length k vector $q_i$, and predict the rating $p_{u,i} = w_u \cdot q_i$ for student u and module i. The vectors can be found by solving the minimization problem $\min_{q,w} \sum_{(u,i) \in R} (r_{u,i} - q_i \cdot w_u)^2$, where R is the set of rated pairs.
Then, for each module i, find the predicted score, and suggest the highest (predicted)
rated modules for the student. Ideally, these should not include modules that the
student has already taken!
Marks scheme:
2 marks: explain that some initial rating data from students for modules is needed,
and describe the scale on which it is drawn.
3 marks: give a good outline description of a suitable recommender system approach
(either neighbour based or latent vector based)
3 marks: make it clear how a recommendation for user u and item i is found, using
suitable mathematical notation, and how the input data is used.
1 mark: explain how these recommendations can be used to suggest the best predicted modules for the student.
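The neighbour based prediction formula from the solution can be sketched as follows. Everything concrete here is an assumption for illustration: the student names, module names and ratings are invented, the weight is the Pearson correlation over co-rated modules, and the neighbour set K is simplified to "all users with positive weight who rated the module".

```python
from math import sqrt

# Hypothetical ratings on a 1-5 scale: student -> {module: rating}
ratings = {
    "ann": {"ml": 5, "db": 3, "stats": 4},
    "bob": {"ml": 4, "db": 2, "stats": 3, "nets": 4},
    "cat": {"ml": 2, "db": 5, "nets": 1},
}

def avg(user):
    # Average rating given by this user
    vals = ratings[user].values()
    return sum(vals) / len(vals)

def similarity(u, v):
    # Pearson correlation over the modules both users have rated
    common = sorted(set(ratings[u]) & set(ratings[v]))
    if len(common) < 2:
        return 0.0
    ru = [ratings[u][i] for i in common]
    rv = [ratings[v][i] for i in common]
    mu, mv = sum(ru) / len(ru), sum(rv) / len(rv)
    num = sum((a - mu) * (b - mv) for a, b in zip(ru, rv))
    den = sqrt(sum((a - mu) ** 2 for a in ru) * sum((b - mv) ** 2 for b in rv))
    return num / den if den else 0.0

def predict(u, item):
    # p_{u,i} = avg(u) + sum_v (r_{v,i} - avg(v)) w_{u,v} / sum_v w_{u,v}
    neighbours = [(v, similarity(u, v)) for v in ratings
                  if v != u and item in ratings[v]]
    neighbours = [(v, w) for v, w in neighbours if w > 0]
    total = sum(w for _, w in neighbours)
    if total == 0:
        return avg(u)  # no usable neighbours: fall back to the user's average
    offset = sum((ratings[v][item] - avg(v)) * w for v, w in neighbours)
    return avg(u) + offset / total

prediction = predict("ann", "nets")
```

In this toy data only "bob" has a positive weight for "ann", so the prediction is ann's average rating shifted by how far bob's rating of "nets" sits above bob's own average.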
(b) Students sometimes want to know why a particular module was recommended to
them. Discuss whether such explanations can be provided for your proposed system.
[2]
Solution: Comprehension – requires student to show understanding of concepts
The neighbour based methods have a partial explanation: the system can point to other users who also liked that combination of modules (but may not be able to name them for privacy reasons).
For latent factor methods, it is not clear whether the factors found lend themselves to simple explanations, but possibly the data can be plotted on the first 2 factors to see if there is a meaningful clustering of modules and users.
(c) Explain how your solution could be extended to include prerequisites: some modules
cannot be taken until their prerequisite modules have been taken.
[2]
Solution: Application – student needs to apply techniques they have learned
A simple solution is to not allow a module to be suggested until its prerequisites have been completed. This can be checked when the recommendations are generated, and the list can be filtered accordingly. One could instead try to
include this in the model, e.g. by penalizing items where the user does not have the
prerequisites, but this becomes more complex.
(d) Describe how the quality of the developed system can be evaluated by the university.
[3]
Solution: Comprehension – requires student to show understanding of concepts
The university can hold back some rating data as “test data”, and use the rest as
“training data” on which to build the system. It can then evaluate the error on the
predictions for the test data, for example by computing the Root-Mean-Square-Error,
applied to the difference between the predicted and observed ratings. An RMSE of
less than 1 for data rated 1-5, for example, might be considered acceptable. More
directly, the university could simply ask the students how much they like the system.
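A minimal sketch of the RMSE evaluation described above, in Python. The held-out ratings and predictions are invented numbers used only to show the calculation.

```python
from math import sqrt

def rmse(predicted, observed):
    # Root-mean-square error between predicted and observed ratings
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("need two non-empty lists of equal length")
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return sqrt(sum(squared) / len(squared))

# Hypothetical held-back test ratings and the system's predictions for them
observed = [4, 2, 5, 3]
predicted = [3.5, 2.5, 4.0, 3.0]

error = rmse(predicted, observed)
```

Here the error is about 0.61, which would fall under the "less than 1" guideline mentioned in the solution for ratings on a 1-5 scale.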
(e) What problems with the recommender system will emerge when a new module is
introduced? What can be done to overcome this?
[5]
Solution: Application – student needs to apply techniques they have learned
(2 marks) When a new module is introduced, no students have rated it, and so it will
not have any meaningful data about it to use for the system. Consequently it will not
be suggested for anyone to study, and so it may not reach any students.
(3 marks) This can be handled by trying to elicit ratings for it – say, by suggesting
it to some users amongst their other suggestions. To do this in a more principled
way, properties of the module can be identified (such as being a computer science
module), and it can be suggested to those who have already taken several similar
modules. Other features of the module (difficulty, mathematical content etc.) could
be extracted and used to help make meaningful recommendations.
(f) When new students join the university, how should an initial set of modules be suggested to them?
[4]
Solution: Application – student needs to apply techniques they have learned
The students can express some interest in particular areas of study, such as maths
or computer science, and some modules from that area can be suggested. These
could be the most popular modules in those areas, or they could be some special
“introduction” or “foundations” modules. They could also rate some other objects,
like their A-levels. If many other users have rated these, they can be used to learn
properties of the user.
7. (a) Explain why clustering is considered an “unsupervised” learning method.
[3]
Solution: Bookwork – primarily requires recollection of taught concepts
No explicit label is associated with the data points, and there is no notion of training
versus test set in clustering. Instead, the object is to identify meaningful groups of
points within the data set. This is in contrast to classification and regression, where
there are labels on training data, and the task is considered to be “supervised”.
(b) The hierarchical agglomerative clustering method proceeds by considering each input
point as an initial cluster, and repeatedly merging together the closest pair of clusters
until a single cluster remains. Describe how this can be used to find a clustering into
k clusters.
[2]
Solution: Bookwork – primarily requires recollection of taught concepts
Simply terminate the process when only k clusters remain, or backtrack from the final state and “undo” the last k − 1 cluster merges.
(c) Define the k-centre objective. Give an example that shows a furthest point clustering
that achieves a 2-approximation to the k-centre objective. Explain your example.
[6]
Solution: Bookwork – primarily requires recollection of taught concepts
Application – student needs to apply techniques they have learned
[2 marks] The k-centre objective is to find a clustering that minimizes the maximum
cluster diameter, where the diameter of a cluster is defined as the maximum distance
between any two points placed in the same cluster.
[4 marks] The simplest example is for 1-centre clustering in Euclidean space, as in
the example shown below:
[Figure: three points x, y, z on a line, in that order]
If point x is chosen as the (arbitrary first) cluster centre, then the diameter is twice
the size of that if y is chosen.
This can be generalized to higher values of k by adding k − 1 additional points far
from this gadget. Then if x is chosen as the first centre, and the k − 1 points are
chosen as the next k − 1 centres, we still have that the diameter is twice the optimal.
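The factor of two in this example can be checked numerically. The sketch below assumes the three points sit at coordinates 0, 1 and 2 on a line (the coordinates are not given in the exam), and measures cost as the largest distance from any point to its nearest centre, which is the quantity the example compares for x and y.

```python
def furthest_point_centres(points, k, first=0):
    # Furthest point clustering: start from an arbitrary point, then
    # repeatedly add the point furthest from all centres chosen so far.
    centres = [points[first]]
    while len(centres) < k:
        nxt = max(points, key=lambda p: min(abs(p - c) for c in centres))
        centres.append(nxt)
    return centres

def cost(points, centres):
    # Largest distance from any point to its nearest centre
    return max(min(abs(p - c) for c in centres) for p in points)

points = [0, 1, 2]  # assumed positions of x, y, z

greedy_cost = cost(points, furthest_point_centres(points, k=1, first=0))
best_cost = cost(points, [1])  # choosing y as the single centre
```

Starting from x gives a cost of 2, while choosing y gives a cost of 1, so the arbitrary first choice is a factor of two worse than the best choice on this input.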
(d) For the DBSCAN algorithm, define the notion of density-reachable given parameters ε and MinPts. Show an example where p is density reachable from q but it is not the case that q is density reachable from p. Explain your example, and make it clear what are the values of MinPts and ε.
[6]
Solution: Bookwork – primarily requires recollection of taught concepts
Application – student needs to apply techniques they have learned
Define the ε-neighbourhood of a point q as
$N_\varepsilon(q) = \{p \in \text{data} \mid d(p, q) \le \varepsilon\}$
A point p is directly density-reachable from q if $p \in N_\varepsilon(q)$ and $|N_\varepsilon(q)| \ge \text{MinPts}$.
A point p is density-reachable from point q (under ε, MinPts) if there is a chain of points $p_1, \ldots, p_n$, where $p_1 = q$, $p_n = p$ and for all i, $p_{i+1}$ is directly density-reachable from $p_i$.
In the example, point A is not density reachable from B (with MinPts = 3), but B is density reachable from A. The radius of the circles is ε.
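The asymmetry of density-reachability can be demonstrated on a tiny one-dimensional data set. This is a sketch under assumed values: the points 0, 1, 2, 3, with ε = 1 and MinPts = 3, are chosen for illustration and are not the example from the original figure.

```python
from collections import deque

def neighbourhood(data, q, eps):
    # All points within distance eps of q (q itself included)
    return [p for p in data if abs(p - q) <= eps]

def is_core(data, q, eps, min_pts):
    return len(neighbourhood(data, q, eps)) >= min_pts

def density_reachable(data, p, q, eps, min_pts):
    """True if p is density-reachable from q."""
    seen = {q}
    queue = deque([q])
    while queue:
        cur = queue.popleft()
        # Only core points can directly reach other points
        if not is_core(data, cur, eps, min_pts):
            continue
        for nb in neighbourhood(data, cur, eps):
            if nb == p:
                return True
            if nb not in seen:
                seen.add(nb)
                queue.append(nb)
    return False

data = [0, 1, 2, 3]   # assumed example points
EPS, MIN_PTS = 1, 3   # assumed parameter values

edge_from_core = density_reachable(data, p=0, q=1, eps=EPS, min_pts=MIN_PTS)
core_from_edge = density_reachable(data, p=1, q=0, eps=EPS, min_pts=MIN_PTS)
```

Point 1 has three points in its neighbourhood, so it is a core point and reaches point 0. Point 0 has only two, so nothing is reachable from it, including point 1.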
(e) Use a small example to show why the k-means algorithm may not find the global optimum, that is, may not find a clustering that minimizes the sum of squared distances within a cluster.
[5]
Solution: Comprehension – requires student to show understanding of concepts
This picture shows a possible suboptimal solution with k = 3
A specific example is as follows:
The algorithm has converged on a solution that places two cluster centres within
a single optimal cluster, leaving only one remaining centre to cover two optimal
clusters. The solution is a local optimum, meaning that repeated application of the
k-means algorithm will not alter the solution or improve the cost, but it is not the
global optimum. This can happen due to the random initial allocation of cluster
centres.
(f) A particular data set is clustered multiple times with the k-means algorithm, and different cluster centres are found each time. Why is this?
[3]
Solution: Comprehension – requires student to show understanding of concepts
The k-means algorithm is not deterministic: it depends on which points are selected
as starting points. If different starting points are chosen, then a different clustering
will be found. The implementation of the algorithm must be choosing a different
random selection each time.
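The dependence on starting points can be seen with a small one-dimensional run of the k-means iteration. The data and both sets of initial centres below are assumptions picked so that one start converges to a poor local optimum; the starting centres are passed in explicitly instead of being drawn at random, so the result is repeatable.

```python
def kmeans_1d(points, centres, max_iter=100):
    """Run Lloyd's iteration from the given centres; return (centres, cost)."""
    centres = list(centres)
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centre
        clusters = [[] for _ in centres]
        for p in points:
            nearest = min(range(len(centres)), key=lambda c: abs(p - centres[c]))
            clusters[nearest].append(p)
        # Update step: move each centre to the mean of its cluster
        new_centres = [sum(c) / len(c) if c else centres[i]
                       for i, c in enumerate(clusters)]
        if new_centres == centres:
            break  # converged: another pass would change nothing
        centres = new_centres
    cost = sum(min((p - c) ** 2 for c in centres) for p in points)
    return centres, cost

points = [0, 1, 10, 11, 20, 21]  # three natural groups (assumed data)

bad_centres, bad_cost = kmeans_1d(points, [0, 1, 15])
good_centres, good_cost = kmeans_1d(points, [0, 10, 20])
```

The first start leaves two centres inside the group {0, 1} and one centre covering the other two groups, at a cost of 101. The second start finds one centre per group, at a cost of 1.5. Both are fixed points of the iteration.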
8. Consider the following social network represented as a graph, where the nodes represent individuals, and edges represent a declared (symmetric) “friendship” relation between pairs.
[Figure: friendship graph on the six nodes a, b, c, d, e, f]
(a) State the degree of node b; the shortest-path distance between nodes b and f; and the diameter of the graph.
[3]
Solution: Application – student needs to apply techniques they have learned
degree = 3
distance = 3
diameter = 4 (realized by a and f)
(b) Define the concept of clustering coefficient, and explain why it is a relevant measure
for social networks.
[4]
Solution: Bookwork – primarily requires recollection of taught concepts
Clustering coefficient is the number of triangles (embedded cliques) divided by the
number of pairs of common neighbours of nodes, giving a fraction between 0 (no
triangles) and 1 (all possible triangles are present). It gives a measure of how much
commonality of connections there is in the network. In the social network example,
it means what is the average fraction of pairs of friends of a user who are themselves
friends.
(c) Which node(s) in the graph are the most central, using the notion of eccentricity?
Explain your answer.
[2]
Solution: Application – student needs to apply techniques they have learned
d is the only central node by this definition: it has the smallest maximum shortest
path distance to any other, of 2. No node is more than 2 hops away from d, whereas
for all other nodes some node is at distance at least 3.
(d) What is the “betweenness” of node e?
[3]
Solution: Application – student needs to apply techniques they have learned
Betweenness is the number (or fraction) of shortest paths passing through it. All
shortest paths between f and all other nodes pass through e (5) and between e and
all other nodes, not counting f (4). This is out of the total of 15 shortest paths (6 * 5
/ 2).
(e) Explain the Page Rank method for measuring importance of nodes within a graph and
how it assigns ranks to nodes.
[6]
Solution: Bookwork – primarily requires recollection of taught concepts
• Basic idea: each directed edge in the graph represents a “vote” from the source
of the edge that the destination is “good”. So score a page by the sum of the
votes it receives
• Model adjacency table of a graph as a matrix M . The vector of page rank
scores is an eigenvector of M .
• Define page rank as the principal eigenvector of M
• Can compute either directly via linear algebra, or by power iteration.
• Start with a random / arbitrary vector of scores, and repeatedly multiply by
matrix M until convergence/stopping criterion.
• Modify M by adding a smoothing factor to ensure unique solution.
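The power iteration outlined in the points above can be sketched in Python using adjacency lists in place of an explicit matrix. The three-node graph and the damping value 0.85 (playing the role of the smoothing factor) are assumptions for illustration.

```python
def pagerank(out_links, damping=0.85, tol=1e-12, max_iter=1000):
    """Power iteration for PageRank on a dict: node -> list of nodes it links to."""
    nodes = sorted(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}  # arbitrary starting vector
    for _ in range(max_iter):
        # Smoothing: every node receives a small uniform share
        new = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            # A node with no out-links spreads its score over all nodes
            targets = out_links[v] or nodes
            share = damping * rank[v] / len(targets)
            for t in targets:
                new[t] += share  # each edge is a "vote" for its destination
        delta = sum(abs(new[v] - rank[v]) for v in nodes)
        rank = new
        if delta < tol:
            break  # stopping criterion: scores have stopped changing
    return rank

# Hypothetical directed graph
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
top = max(ranks, key=ranks.get)
```

Node "c" receives votes from both other nodes and ends up with the highest score, narrowly ahead of "a", which receives all of c's vote.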
(f) A social network has information on which pairs of users are friends. Some users
have expressed interest in some topics, such as their favourite sports teams, bands,
and pastimes. Discuss how the social network could use its data to infer possible
interests of other users who have not declared them explicitly. What assumptions would you rely on?
[7]
Solution: Application – student needs to apply techniques they have learned
This can be set up as an inference problem: classification or semi-supervised learning. E.g. could try to learn whether a user likes a particular football club or not, or
more broadly if they enjoy football.
Could simply use traditional classification: based on properties of the user (age,
location, other demographics) build a classifier and train with the training data.
Could try to use the link structure of the graph: assume that friends share similar
interests (homophily), fill in the learned function.
E.g. could use local voting to spread the labels (look at what labels are popular
in the neighbourhood of the user). Or look for a similar node based on its pattern
of neighbours, and copy the labels that are seen on that node, based on co-citation
regularity.
Any solution based on a good formalization of the problem and proposed solution
will be acceptable.
9. Consider the following data set of 12 items with 3 attributes:
Item   X    Y    Z
1      8    7    8
2      13   18   10
3      8    7    7
4      8    6    7
5      9    12   8
6      8    4    6
7      9    9    8
8      4    3    3
9      8    7    6
10     5    3    6
11     6    4    5
12     10   16   10
(a) Compute the variance of the X values and the covariance between the Y and Z values. Show your calculations clearly.
[6]
Solution: Application – student needs to apply techniques they have learned
Var[X] (2 marks):
E[X] = (8 + 13 + 8 + 8 + 9 + 8 + 9 + 4 + 8 + 5 + 6 + 10)/12 = 96/12 = 8.
Var[X] = $E[(X - E[X])^2]$ = (0 + 25 + 0 + 0 + 1 + 0 + 1 + 16 + 0 + 9 + 4 + 4)/12 = 60/12 = 5.
Cov[Y, Z] (4 marks):
E[YZ] = (56 + 180 + 49 + 42 + 96 + 24 + 72 + 9 + 42 + 18 + 20 + 160)/12 = 768/12 = 64.
E[Z] = (8 + 10 + 7 + 7 + 8 + 6 + 8 + 3 + 6 + 6 + 5 + 10)/12 = 84/12 = 7.
E[Y] = (7 + 18 + 7 + 6 + 12 + 4 + 9 + 3 + 7 + 3 + 4 + 16)/12 = 96/12 = 8.
Cov[Y, Z] = E[YZ] − E[Y]E[Z] = 64 − (8 × 7) = 8.
It is further calculated that Var[Y] = 45/2, Var[Z] = 11/3, Cov[X, Y] = 19/2, and Cov[X, Z] = 23/6.
(b) Find the $R^2$ value between Y and Z, and give the corresponding model Z = aY + c.
[6]
Solution: Application – student needs to apply techniques they have learned
[2 marks] $R^2 = \mathrm{Cov}[Y, Z]^2 / (\mathrm{Var}[Y]\,\mathrm{Var}[Z]) = 64/(45 \cdot 11/6) = 384/495 \approx 0.776$
[2 marks] $a = \mathrm{Cov}[Y, Z] / \mathrm{Var}[Y] = 8/22.5 = 16/45 \approx 0.356$
[2 marks] $c = E[Z] - a\,E[Y] = 7 - (8/22.5) \cdot 8 = 7 - 128/45 \approx 4.16$
So the model is Z = 0.356Y + 4.16
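The arithmetic for the variance, covariance and the Y-only regression can be checked exactly with Python's `fractions` module. This is a verification sketch using the twelve data items from the question; the helper names `mean` and `cov` are illustrative.

```python
from fractions import Fraction

# The data set from the question, one list per attribute
X = [8, 13, 8, 8, 9, 8, 9, 4, 8, 5, 6, 10]
Y = [7, 18, 7, 6, 12, 4, 9, 3, 7, 3, 4, 16]
Z = [8, 10, 7, 7, 8, 6, 8, 3, 6, 6, 5, 10]

def mean(vals):
    return Fraction(sum(vals), len(vals))

def cov(a, b):
    # Cov[A, B] = E[AB] - E[A]E[B]; cov(a, a) is the variance of a
    return mean([x * y for x, y in zip(a, b)]) - mean(a) * mean(b)

var_x = cov(X, X)
cov_yz = cov(Y, Z)

# Regression of Z on Y: Z = a*Y + c
r_squared = cov_yz ** 2 / (cov(Y, Y) * cov(Z, Z))
a = cov_yz / cov(Y, Y)
c = mean(Z) - a * mean(Y)
```

Working in exact fractions avoids rounding: the results are Var[X] = 5, Cov[Y, Z] = 8, R² = 128/165 (about 0.776), a = 16/45 and c = 187/45 (about 4.16).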
(c) Some other regression models are computed over the data:
A linear model Z = 0.767X + 0.867 is found with $R^2$ value 0.802.
A multilinear model Z = 0.161Y + 0.461X + 2.03 with $R^2$ value of 0.833.
Discuss the quality of these two models and that found in the previous part for modeling the data.
[5]
Solution: Comprehension – requires student to show understanding of concepts
Application – student needs to apply techniques they have learned
All models achieve a good $R^2$ value, indicating that there is a good fit of the model
to the data. They all show that Z increases as X and Y both increase.
Between the models in a single variable, X explains the behaviour of Z better than
Y.
The model with both X and Y has an appreciably better fit, and is the model of
choice.
The models show a different dependence on the variables and the constant, indicating
that none is a perfect fit for the data.
(d) Compute the prediction of Z from the three models for the point X = 12, Y = 3, and
comment on what you find.
[4]
Solution: Application – student needs to apply techniques they have learned
Comprehension – requires student to show understanding of concepts
[2 marks] The first model (in X only) predicts Z = 10.071. The second (in Y only)
predicts Z = 5.228. The third (in both X and Y ) predicts Z = 8.045.
[2 marks] These are quite different. The more powerful multilinear model predicts Z
approximately 8, so we might lean towards this value. However, note that the target
point is quite different from the training data: the X and Y values in the training data
are quite similar, whereas this point is quite far from parity. We might conclude that
it is sufficiently different from the training data that we would not trust the model for
this point.
(e) Comment on the suitability of using the different models to predict the Z value for the point X = 21, Y = 25.
[4]
Solution: Comprehension – requires student to show understanding of concepts
The first model (in X only) predicts Z = 16.974, the second (in Y only) predicts Z = 13.06 and the third (in both X and Y) predicts Z = 15.736. These are all quite different values. This should not be surprising – the target value falls quite far from the range of the rest of the data, and so there is a large amount of extrapolation going on. There is no evidence for how the dependency of Z on X and Y in this range is actually behaving.
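The predictions for both query points can be reproduced from the three fitted models, together with a simple check of whether a point lies inside the range of the training data. This is a sketch: the dictionary keys and the helper names are illustrative, and the ranges are read off the data table (X from 4 to 13, Y from 3 to 18).

```python
# The three fitted models, with coefficients as stated in the question
MODELS = {
    "X only": lambda x, y: 0.767 * x + 0.867,
    "Y only": lambda x, y: 0.356 * y + 4.16,
    "X and Y": lambda x, y: 0.161 * y + 0.461 * x + 2.03,
}

# Observed ranges of the training data
X_RANGE = (4, 13)
Y_RANGE = (3, 18)

def predict_all(x, y):
    # Prediction of Z from each model, rounded to three decimals
    return {name: round(f(x, y), 3) for name, f in MODELS.items()}

def in_training_range(x, y):
    # False means the models are being asked to extrapolate
    return X_RANGE[0] <= x <= X_RANGE[1] and Y_RANGE[0] <= y <= Y_RANGE[1]

part_d = predict_all(12, 3)
part_e = predict_all(21, 25)
```

The point (12, 3) lies inside both ranges individually, although no training item combines a large X with a small Y. The point (21, 25) lies outside both ranges, so every prediction for it is an extrapolation.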
End