1. Explain the following terms and give one possible application for each of them:
   a) Web content mining
   b) Web structure mining
   c) Web usage mining
   d) Outlier detection

2. Fig. 1 shows the PageRank algorithm with random teleports and the web link structure.
   a) Construct the column-stochastic matrix M and the matrix A.
   b) Calculate the PageRank with random teleports ($\beta = 0.8$) for three iterations.
   c) Which of the three nodes {Yahoo, Amazon, M'soft} is the most important?

   PageRank (random teleports):
   1. Construct the $N \times N$ matrix $M$ based on the web link structure.
   2. Construct the matrix $A$ as follows: $A_{ij} = \beta M_{ij} + (1 - \beta)/N$.
   3. Initialize $r_0 = [1/N, \ldots, 1/N]^T$.
   4. Compute $r_{k+1} = A r_k$ iteratively.
   5. Output $r$.
   (a)

   [Figure: web link structure over the nodes Yahoo, Amazon, and M'soft]
   (b)

   Fig. 1 (a) The PageRank algorithm. (b) The web link structure.

3. As shown in Fig. 2, document frequency thresholding is an important step toward feature extraction in text mining.
   N: the number of documents in the training document collection
   γ: the thresholding parameter

   [Figure: flowchart: training documents D, naive terms, calculate term weight w, set threshold, remove all terms with w below the threshold, feature terms]

   Fig. 2 Feature extraction and dimension reduction.

   (a) Document contents:
   D1: ABRTSAQWAXAO
   D2: RTABBAXAQSAK

   (b) Document-feature matrix (Freq_ij), to be filled in:
         A    B    K    O    Q    R    S    T    W    X
   D1
   D2

   (c) Document frequency (DocFreq_j) and entropy (Entropy(w_j)) of each term:
                   A    B    K    O    Q    R    S    T    W    X
   DocFreq_j      10    5    3    3    5    2    1    5    3    5
   Entropy(w_j)   0.4  0.1  0.1  0.1  0.3  0.4  0.4  0.4  0.3  0.1

   Fig. 3

   a) Given the contents of documents D1 and D2 shown in Fig. 3 (a), fill in the document-feature matrix in Fig. 3 (b).
   b) Given N = 10, γ = 1.5, and DocFreq_j in Fig. 3 (c), what are the feature terms extracted from D1 by using inverse document frequency weighting?
   c) Given N = 10, γ = 1.0, and Entropy(w_j) in Fig. 3 (c), what are the feature terms extracted from D2 by using entropy weighting?

4. Data preprocessing is essential for web usage mining.
   a) Explain the four steps of data preprocessing.
   b) Given the web page linkage shown in Fig. 4 (c), refine the user sessions shown in Fig. 4 (a).
   c) Given the web page linkage shown in Fig. 4 (c), complete the paths in Fig. 4 (b).

   (a) Two user sessions:
   - A-B-L-F-R-O-G-A-D
   - A-B-C-J

   (b) Four user sessions:
   - A-B-F-O-G
   - A-D
   - L-R
   - A-B-C-J

   (c) [Figure: web page linkage]

   Fig. 4
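
For reference, the five PageRank steps listed in Fig. 1 (a) can be sketched in plain Python as follows. This is only an illustrative sketch: the edges of the link structure in Fig. 1 (b) are not reproduced in this text, so the `links` dictionary below is an assumed, hypothetical graph over the three named nodes, and the function names (`column_stochastic`, `teleport_matrix`, `pagerank`) are chosen here for illustration. Replace `links` with the actual edges from the figure before relying on any result.

```python
def column_stochastic(links, nodes):
    """Step 1: M[i][j] = 1/outdeg(j) if page j links to page i, else 0.

    Assumes every page has at least one out-link and lists each target once.
    """
    n = len(nodes)
    index = {name: k for k, name in enumerate(nodes)}
    M = [[0.0] * n for _ in range(n)]
    for src, targets in links.items():
        for dst in targets:
            M[index[dst]][index[src]] = 1.0 / len(targets)
    return M


def teleport_matrix(M, beta):
    """Step 2: A[i][j] = beta * M[i][j] + (1 - beta) / N."""
    n = len(M)
    return [[beta * M[i][j] + (1.0 - beta) / n for j in range(n)]
            for i in range(n)]


def pagerank(A, iterations):
    """Steps 3 to 5: start from the uniform vector and apply r <- A r."""
    n = len(A)
    r = [1.0 / n] * n
    for _ in range(iterations):
        r = [sum(A[i][j] * r[j] for j in range(n)) for i in range(n)]
    return r


nodes = ["Yahoo", "Amazon", "M'soft"]

# ASSUMPTION: hypothetical edges, not taken from Fig. 1 (b).
links = {
    "Yahoo": ["Yahoo", "Amazon"],
    "Amazon": ["Yahoo", "M'soft"],
    "M'soft": ["Amazon"],
}

M = column_stochastic(links, nodes)
A = teleport_matrix(M, beta=0.8)
r = pagerank(A, iterations=3)

for name, score in zip(nodes, r):
    print(f"{name}: {score:.4f}")
```

Because every column of M sums to 1, every column of A also sums to 1, so the entries of r keep summing to 1 after each iteration. The node with the largest entry of r after the final iteration is the most important one under this model.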