CSCI 4390/6390 Database Mining
Project 1
Instructor: Wei Liu
Problem 1. Linear Regression
You are given a fully labeled training dataset of 4,649 samples, each with 256 dimensions. Each sample
carries a discrete label from ‘1’, ‘2’, ..., ‘10’. You are required to build a regularized linear regression
model that maps the 256-dimensional data to the discrete labels.
The task is to predict the labels for a test dataset of 4,649 samples. Note that the test data are completely
disjoint from the training data.
All data are packaged in a Matlab file, ‘linear regression.mat’. The same data are also provided as separate
text files: ‘trainX.txt’ (training data samples, one sample per row), ‘trainY.txt’ (training data labels, in the
same order as the training samples), and ‘testX.txt’ (test data samples, one sample per row). Please write your
predicted labels (in 1 to 10) to a ‘testY.txt’ file, in the same order as the test samples.
(Hint: multiple linear regressors are needed, one for each discrete label. The regularization parameter needs to be tuned on the training dataset. Two-fold cross-validation can be tried.)
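One possible reading of the hint is sketched below in Python with NumPy: fit one ridge regressor per label against +1/-1 indicator targets, predict the label whose regressor scores highest, and pick the regularization parameter by two-fold cross-validation. This is only a sketch under assumptions, not the required solution: the function names, the +1/-1 target encoding, the unpenalized bias term, and the candidate grid for the regularization parameter are all illustrative choices that the handout does not specify.

```python
import numpy as np

def fit_ridge_one_vs_rest(X, y, lam, n_classes=10):
    """Fit one ridge regressor per label; returns a (d + 1, n_classes) weight matrix."""
    n, d = X.shape
    Xb = np.hstack([X, np.ones((n, 1))])  # append a bias column
    # Column k of T is +1 where the label equals k + 1 and -1 elsewhere (assumed encoding).
    T = np.where(y[:, None] == np.arange(1, n_classes + 1)[None, :], 1.0, -1.0)
    reg = lam * np.eye(d + 1)
    reg[-1, -1] = 0.0  # assumption: the bias weight is not penalized
    # Closed-form ridge solution, solved for all labels at once.
    return np.linalg.solve(Xb.T @ Xb + reg, Xb.T @ T)

def predict(W, X):
    """Predict the label (1..n_classes) whose regressor gives the largest output."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return np.argmax(Xb @ W, axis=1) + 1

def two_fold_cv_accuracy(X, y, lam, seed=0):
    """Average validation accuracy over a random two-fold split."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(y))
    half = len(y) // 2
    folds = (perm[:half], perm[half:])
    accs = []
    for i in (0, 1):
        val, trn = folds[i], folds[1 - i]
        W = fit_ridge_one_vs_rest(X[trn], y[trn], lam)
        accs.append(np.mean(predict(W, X[val]) == y[val]))
    return float(np.mean(accs))

def select_lambda(X, y, grid=(1e-3, 1e-2, 1e-1, 1.0, 10.0, 100.0)):
    """Pick the grid value with the best two-fold accuracy (the grid is hypothetical)."""
    return max(grid, key=lambda lam: two_fold_cv_accuracy(X, y, lam))
```

With the provided text files, and assuming they are whitespace-delimited, the data could be read with `np.loadtxt("trainX.txt")`, `np.loadtxt("trainY.txt").astype(int)` and `np.loadtxt("testX.txt")`, and the predictions written with `np.savetxt("testY.txt", predict(W, testX), fmt="%d")`.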
Problem 2. Graph Node Labeling
You are given an undirected graph of 8,030 nodes, in which each node denotes a document and each edge links
a pair of semantically similar documents. 280 of the documents are known to belong to 28 different topics
(topic 1, topic 2, ..., topic 28). You are required to run topic-sensitive PageRank biased to each of the 28 topics.
The task is to predict the topics (in 1 to 28) for the remaining 7,750 documents.
All data are packaged in a Matlab file, ‘document graph.mat’. The same data are also provided as separate
text files: ‘W.txt’ (the symmetric graph adjacency matrix, in which the ith row or column corresponds to the ith
node) and ‘labels.txt’ (the 280 labeled nodes and their topic labels: the first row of this file gives the IDs of
the labeled nodes, and the second row gives the corresponding topic labels in 1 to 28).
Please write your predicted topic labels (in 1 to 28) for the remaining 7,750 graph nodes to a ‘predictlabels.txt’
file, where the first row gives the IDs of the 7,750 nodes and the second row gives the corresponding topic
labels.
(Hint: for each topic in 1 to 28, compute one topic-sensitive PageRank vector with uniform node weights.
The teleporting probability parameter c needs to be tuned on the labeled graph nodes in the range [0.001, 0.5].
Two-fold cross-validation can be tried.)
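A minimal Python/NumPy sketch of this procedure follows. It rests on several assumptions that the handout does not state: "uniform node weights" is read as teleporting uniformly to the labeled nodes of one topic, the PageRank vector for each topic is obtained by power iteration on r = (1 - c) P r + c v with P the column-normalized adjacency matrix, each node is assigned the topic with the largest score, and node IDs have already been converted to 0-based indices. The function names and the candidate grid for c are illustrative.

```python
import numpy as np

def topic_pagerank_scores(W, seed_ids, seed_topics, c, n_topics, max_iter=500, tol=1e-10):
    """Return an (n_nodes, n_topics) matrix; column t holds the PageRank vector
    that teleports uniformly to the seed nodes of topic t + 1."""
    n = W.shape[0]
    deg = W.sum(axis=0).astype(float)
    deg[deg == 0] = 1.0  # assumption: isolated nodes simply keep an all-zero column
    P = W / deg          # divide column j by its degree: column-stochastic transitions
    V = np.zeros((n, n_topics))
    V[seed_ids, seed_topics - 1] = 1.0
    V /= np.maximum(V.sum(axis=0), 1.0)  # uniform teleport weights within each topic
    R = V.copy()
    for _ in range(max_iter):  # note: small c converges slowly
        R_new = (1.0 - c) * (P @ R) + c * V
        done = np.abs(R_new - R).max() < tol
        R = R_new
        if done:
            break
    return R

def predict_topics(R):
    """Assign each node the topic (1..n_topics) with the largest PageRank score."""
    return np.argmax(R, axis=1) + 1

def cv_accuracy(W, ids, topics, c, n_topics, seed=0):
    """Two-fold cross-validation on the labeled nodes: seed with one half, score the other."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(ids))
    half = len(ids) // 2
    folds = (perm[:half], perm[half:])
    accs = []
    for i in (0, 1):
        held, used = folds[i], folds[1 - i]
        R = topic_pagerank_scores(W, ids[used], topics[used], c, n_topics)
        accs.append(np.mean(predict_topics(R)[ids[held]] == topics[held]))
    return float(np.mean(accs))

def select_c(W, ids, topics, n_topics, grid=(0.001, 0.01, 0.05, 0.1, 0.2, 0.5)):
    """Pick the teleporting probability with the best two-fold accuracy (hypothetical grid)."""
    return max(grid, key=lambda c: cv_accuracy(W, ids, topics, c, n_topics))
```

If the node IDs in ‘labels.txt’ are 1-based, as the "ith node" wording suggests, subtract 1 before indexing and add 1 back when writing the output file. For a graph of this size a dense matrix works but is slow over many iterations; a sparse matrix type would be a reasonable substitute.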