
1.
Explain the following terms and give one possible application for each of them:
a) Web content mining
b) Web structure mining
c) Web usage mining
d) Outlier detection
2.
Fig. 1 shows the PageRank Algorithm with random teleports and the web link
structure:
a) Construct the column stochastic matrix M and A.
b) Calculate the PageRank with random teleports (β = 0.8) for three iterations.
c) Which, among the three nodes {Yahoo, Amazon, M’soft}, is the most
important node?
PageRank (Random Teleports)
1. Construct the $N \times N$ matrix $M$ based on the web link structure
2. Construct the matrix $A$ as follows: $A_{ij} = \beta M_{ij} + (1-\beta)/N$
3. Initialize $r_0 = [1/N, \ldots, 1/N]^T$
4. Compute $r_{k+1} = A r_k$ iteratively
5. Output $r$
(a)
[Figure: directed link graph over the nodes Yahoo, Amazon, and M'soft]
(b)
Fig. 1 (a) The PageRank algorithm. (b) The web link structure.
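The five steps in Fig. 1 (a) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the teleport parameter is written `beta`, every page has at least one outgoing link, and the edges in `links` are hypothetical, since the edges of Fig. 1 (b) are not available in the text.

```python
def pagerank_teleport(links, beta=0.8, iterations=3):
    """Power iteration on A = beta * M + (1 - beta) / N.

    `links` maps each page to the list of pages it links to.
    Assumes every page has at least one outgoing link (no dead ends).
    """
    nodes = sorted(links)
    n = len(nodes)
    index = {name: i for i, name in enumerate(nodes)}

    # Step 1: column-stochastic M, where M[i][j] = 1/outdegree(j) if j links to i.
    M = [[0.0] * n for _ in range(n)]
    for src, targets in links.items():
        for dst in targets:
            M[index[dst]][index[src]] = 1.0 / len(targets)

    # Step 2: A_ij = beta * M_ij + (1 - beta) / N.
    A = [[beta * M[i][j] + (1.0 - beta) / n for j in range(n)] for i in range(n)]

    # Step 3: uniform start vector r_0.
    r = [1.0 / n] * n

    # Step 4: r_{k+1} = A r_k.
    for _ in range(iterations):
        r = [sum(A[i][j] * r[j] for j in range(n)) for i in range(n)]

    # Step 5: output r, keyed by page name.
    return dict(zip(nodes, r))


# Hypothetical edges, chosen only for illustration (not taken from Fig. 1 (b)).
links = {
    "Yahoo": ["Yahoo", "Amazon"],
    "Amazon": ["Yahoo", "M'soft"],
    "M'soft": ["Amazon"],
}
ranks = pagerank_teleport(links, beta=0.8, iterations=3)
most_important = max(ranks, key=ranks.get)
```

Because both M and A are column stochastic, the entries of r keep summing to 1 after every iteration, which is a quick sanity check on a hand calculation.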
3.
As shown in Fig. 2, document frequency thresholding is an important step
toward feature extraction in text mining.
․N: The number of documents in the training document collection
․γ : The thresholding parameter
[Flowchart: training documents D → naive terms → calculate term weight w → set threshold γ → remove all terms with w < γ → feature terms]
Fig. 2 Feature Extraction and Dimension Reduction.
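The thresholding pipeline of Fig. 2 can be sketched as follows. The weight formula is an assumption: the text names inverse document frequency weighting but gives neither the formula nor the log base, so this sketch uses w = log2(N / DocFreq). The term names and counts in `doc_freq` are hypothetical.

```python
import math


def idf_weights(doc_freq, n_docs):
    """Assumed weighting: w_j = log2(N / DocFreq_j)."""
    return {term: math.log2(n_docs / df) for term, df in doc_freq.items()}


def select_features(weights, gamma):
    """Remove all terms with w < gamma; keep the rest as feature terms."""
    return sorted(term for term, w in weights.items() if w >= gamma)


# Hypothetical document frequencies over a collection of N = 8 documents.
doc_freq = {"web": 8, "mining": 4, "outlier": 1}
weights = idf_weights(doc_freq, n_docs=8)
features = select_features(weights, gamma=1.0)
```

A term that occurs in every document gets weight 0 under this formula and is dropped for any positive threshold, which is the intended dimension reduction. A different log base rescales all weights, so the same γ can select a different set of terms.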
D1: ABRTSAQWAXAO
D2: RTABBAXAQSAK
(a)

document-feature matrix (Freq_ij)
        A    B    K    O    Q    R    S    T    W    X
D1
D2
(b)

document frequency (DocFreq_j) and entropy weight
               A    B    K    O    Q    R    S    T    W    X
DocFreq_j      10   5    3    3    5    2    1    5    3    5
Entropy(w_j)   0.4  0.1  0.1  0.1  0.3  0.4  0.4  0.4  0.3  0.1
(c)
Fig. 3
a) Given the content of documents D1 and D2 shown in Fig. 3 (a), fill in the
document-feature matrix in Fig. 3 (b).
b) Given N=10, γ=1.5, and DocFreqj in Fig. 3 (c), what are the feature terms
extracted from D1 by using inverse document frequency weighting?
c) Given N=10, γ=1.0, and Entropy(wj) in Fig. 3 (c), what are the feature terms
extracted from D2 by using entropy weighting?
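The counting behind a document-feature matrix such as the one in part a) can be done mechanically. The sketch below assumes that each letter in a document string is one occurrence of a single-letter term, and that the columns follow the term order A B K O Q R S T W X; the function name is illustrative.

```python
from collections import Counter


def document_feature_matrix(docs, vocabulary):
    """Return {document name: [term frequency for each vocabulary term]}."""
    matrix = {}
    for name, text in docs.items():
        counts = Counter(text)  # assumes one character = one term occurrence
        matrix[name] = [counts[term] for term in vocabulary]
    return matrix


docs = {"D1": "ABRTSAQWAXAO", "D2": "RTABBAXAQSAK"}
vocabulary = list("ABKOQRSTWX")
freq = document_feature_matrix(docs, vocabulary)

for name, row in freq.items():
    print(name, row)
```

Each row sums to the length of its document, which gives a simple check that no occurrence was missed.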
4.
Data Preprocessing is essential for web usage mining.
a) Explain the four steps of data preprocessing.
b) Given the web page linkage shown in Fig. 4 (c), refine the user sessions
shown in Fig. 4 (a).
c) Given the web page linkage shown in Fig. 4 (c), complete the paths in
Fig. 4 (b).
Two user sessions:
- A-B-L-F-R-O-G-A-D
- A-B-C-J
(a)
Four user sessions:
- A-B-F-O-G
- A-D
- L-R
- A-B-C-J
(b)
[Figure: web page linkage]
(c)
Fig. 4
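Path completion can be sketched as follows. This is one common heuristic, not necessarily the one intended here: when the next page in a session is not linked from the current page, assume the user pressed the back button (cached views that never reached the server log) until reaching the most recently visited page that does link to it. The link graph and session below are hypothetical, since the linkage of Fig. 4 (c) is not available in the text.

```python
def complete_path(session, links):
    """Insert assumed back-button visits so that each step follows a link.

    `session` is a list of pages in visit order; `links` maps a page to the
    pages it links to. If no earlier page links to the next page, the page
    is appended unchanged.
    """
    path = [session[0]]
    stack = [session[0]]  # pages on the current forward path
    for page in session[1:]:
        # Most recent page on the forward path that links to `page`, if any.
        depth = next(
            (i for i in range(len(stack) - 1, -1, -1)
             if page in links.get(stack[i], ())),
            None,
        )
        if depth is not None:
            # Replay the back-button moves down to that page.
            while len(stack) - 1 > depth:
                stack.pop()
                path.append(stack[-1])
        stack.append(page)
        path.append(page)
    return path


# Hypothetical linkage and session, for illustration only.
links = {"A": ["B", "D"], "B": ["C"], "C": []}
completed = complete_path(["A", "B", "C", "D"], links)
print("-".join(completed))
```

With these hypothetical links, D is reachable only from A, so the sketch inserts the back moves C to B and B to A before D.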