Machine Learning
Final Exam
Student ID:
Name:
102/6
I. True/False Questions (24%)
( ) 1. Principal components analysis (PCA) and linear discriminant analysis (LDA) are both supervised dimensionality reduction methods.
( ) 2. The k-means clustering algorithm is used to solve the supervised learning problem.
( ) 3. In nonparametric estimation, all we assume is that similar inputs have similar outputs.
( ) 4. Rule induction works similarly to tree induction, except that rule induction does a breadth-first search, whereas tree induction goes depth-first.
( ) 5. A decision tree is a hierarchical model using a divide-and-conquer strategy.
( ) 6. The Expectation-Maximization (EM) algorithm can be used to solve the unsupervised learning problem.
( ) 7. Entropy in information theory specifies the maximum number of bits needed to encode the classification accuracy of an instance.
( ) 8. To remove subtrees in a decision tree, postpruning is faster and prepruning is more accurate.
( ) 9. In semiparametric estimation, the density is written as a disjunction of a small number of parametric models.
( ) 10. Gradient descent is a simple and global method. When online training is used, it does not need to store the training set and can adapt as the task to be learned changes. However, gradient descent converges slowly.
( ) 11. When classes are Gaussian with a shared covariance matrix, the optimal discriminant is linear.
( ) 12. The locally linear embedding method can recover global nonlinear structure from locally linear fits.
II. Short Answer Questions
1. (4%) Can you explain what Isomap is? What is geodesic distance?
2. (3%) What is the difference between feature selection methods and feature extraction methods?
3. (3%) Draw two-class, two-dimensional data such that PCA and LDA find totally different directions.
4. (4%) Please explain the meaning of the nonparametric density estimation methods. What are their assumptions?
5. (4%) What are the differences between the parametric density estimation methods and the
semiparametric density estimation methods?
6. (2%) In the running mean smoother, we can fit a constant, a line, or a higher-degree polynomial
at a test point. How can we choose between them?
7. (2%) Please finish the following Expectation-Maximization (EM) algorithm.
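For reference only: the partially blanked listing from the original exam is not reproduced here. Below is a minimal, hedged sketch of EM for a one-dimensional Gaussian mixture (variable names and initialization are hypothetical, not the exam's listing).

# Minimal EM sketch for a 1-D Gaussian mixture (hypothetical example).
# E-step: compute responsibilities; M-step: re-estimate priors, means, variances.
import numpy as np

def em_gmm(x, k, n_iter=100, seed=0):
    x = np.asarray(x, float)
    rng = np.random.default_rng(seed)
    n = len(x)
    pi = np.full(k, 1.0 / k)                  # mixture priors
    mu = rng.choice(x, k, replace=False)      # initialize means from the data
    var = np.full(k, np.var(x))               # initialize variances
    for _ in range(n_iter):
        # E-step: responsibility h[t, i] = P(G_i | x^t)
        dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        h = pi * dens
        h /= h.sum(axis=1, keepdims=True)
        # M-step: update parameters from the soft assignments
        nk = h.sum(axis=0)
        pi = nk / n
        mu = (h * x[:, None]).sum(axis=0) / nk
        var = (h * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return pi, mu, var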
8. (4%) Please finish the following k-means clustering algorithm.
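For reference only: a hedged sketch of standard k-means (not the exam's blanked listing; names and the stopping test are hypothetical). It alternates between assigning points to the nearest center and recomputing each center as the mean of its assigned points.

# Minimal k-means sketch (hypothetical example).
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]   # random initial centers
    for _ in range(n_iter):
        # assignment step: label each point with its closest center
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # update step: move each center to the mean of its cluster
        new_centers = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        if np.allclose(new_centers, centers):
            break                                        # converged
        centers = new_centers
    return centers, labels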
9. (2%) The Condensed Nearest Neighbor algorithm is used to find a subset Z of X that is small and accurate in classifying X. Please finish the following Condensed Nearest Neighbor algorithm.
10. (3%) Please show the properties that an impurity measure of a classification tree should satisfy.
11. (3%) Given a two-dimensional dataset as follows, please show the dendrogram of the complete-link clustering result. The complete-link distance between two groups $G_i$ and $G_j$ is
$d(G_i, G_j) = \max_{x^r \in G_i,\, x^s \in G_j} d(x^r, x^s)$, where $d(x^r, x^s) = \sum_{j=1}^{d} \left| x^r_j - x^s_j \right|$.
12. (3%) Please estimate the density function with h = 0.5 by the histogram estimator
$\hat{p}(x) = \dfrac{\#\{x^t \text{ in the same bin as } x\}}{Nh}$.
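For reference only: a minimal sketch of the histogram estimator above with hypothetical 1-D data; the bin origin is assumed to be 0.

# Histogram density estimator sketch (hypothetical example).
import numpy as np

def hist_estimate(x, data, h=0.5, origin=0.0):
    data = np.asarray(data, float)
    N = len(data)
    bin_idx = np.floor((x - origin) / h)                 # index of the bin containing x
    in_bin = np.floor((data - origin) / h) == bin_idx    # which x^t share that bin
    return in_bin.sum() / (N * h)                        # #{x^t in same bin} / (N h)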
13. (3%) In nonparametric regression, given a running mean smoother as follows, please finish the graph with h = 3.
$\hat{g}(x) = \dfrac{\sum_{t=1}^{N} b(x, x^t)\, r^t}{\sum_{t=1}^{N} b(x, x^t)}$, where $b(x, x^t) = \begin{cases} 1 & \text{if } x^t \text{ is in the same bin with } x \\ 0 & \text{otherwise} \end{cases}$
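For reference only: a sketch of the smoother defined above, interpreting "$x^t$ in the same bin with $x$" as a window of half-width h centered at the query point (an assumption); data and names are hypothetical.

# Running mean smoother sketch (hypothetical example).
import numpy as np

def running_mean(x, xs, rs, h=3.0):
    xs, rs = np.asarray(xs, float), np.asarray(rs, float)
    b = np.abs(xs - x) <= h        # b(x, x^t): 1 inside the window, 0 otherwise
    if not b.any():
        return np.nan              # no training point falls in the window
    return rs[b].mean()            # average of r^t over the window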
14. (6%) Given the regression tree as follows: (1) please draw its corresponding regression result; (2) show one rule extracted from this regression tree.
15. (2%) What is a multivariate tree?
16. (4%) In the pairwise separation example as follows, $H_{ij}$ indicates the hyperplane that separates the examples of $C_i$ from the examples of $C_j$:
$g_{ij}(x \mid w_{ij}, w_{ij0}) = w_{ij}^T x + w_{ij0}$
$g_{ij}(x) \begin{cases} > 0 & \text{if } x \in C_i \\ \le 0 & \text{if } x \in C_j \\ \text{don't care} & \text{otherwise} \end{cases}$
Choose $C_i$ if $\forall j \ne i,\ g_{ij}(x) > 0$.
Please decide which class each region belongs to.
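For reference only: a sketch of the pairwise decision rule above. With K classes and one linear discriminant per pair, x is assigned to $C_i$ when $g_{ij}(x) > 0$ for all $j \ne i$; the parameter arrays W and w0 are hypothetical.

# Pairwise-separation decision sketch (hypothetical example).
import numpy as np

def choose_class(x, W, w0):
    # W has shape (K, K, d) and w0 has shape (K, K), with g_ji = -g_ij assumed.
    K = W.shape[0]
    for i in range(K):
        g = [W[i, j] @ x + w0[i, j] for j in range(K) if j != i]
        if all(v > 0 for v in g):
            return i               # x is on the C_i side of every hyperplane H_ij
    return None                    # x falls in a "don't care" / rejected region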
17. (4%) Given a classification tree construction algorithm as follows, where $\mathcal{I}_m = -\sum_{i=1}^{K} p_m^i \log_2 p_m^i$ (eq. 9.3) and $\mathcal{I}'_m = -\sum_{j=1}^{n} \dfrac{N_{mj}}{N_m} \sum_{i=1}^{K} p_{mj}^i \log_2 p_{mj}^i$ (eq. 9.8).
Can you explain what the function “SplitAttribute” does?
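For reference only: a hedged sketch of what a SplitAttribute-style routine does (the exam's listing is not reproduced, and discrete attributes are assumed): for each attribute, compute the post-split impurity of eq. 9.8 and return the attribute whose split gives the lowest impurity.

# SplitAttribute sketch using entropy impurity (hypothetical example).
import numpy as np

def entropy(labels):
    # I_m = - sum_i p_m^i log2 p_m^i  (eq. 9.3)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def split_attribute(X, y):
    best_attr, best_imp = None, np.inf
    for j in range(X.shape[1]):                     # try each (discrete) attribute
        imp = 0.0
        for v in np.unique(X[:, j]):                # each branch of the split
            mask = X[:, j] == v
            imp += mask.mean() * entropy(y[mask])   # (N_mj / N_m) * branch entropy (eq. 9.8)
        if imp < best_imp:
            best_attr, best_imp = j, imp
    return best_attr, best_imp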
III. Calculation and Proof Questions
1. (10%) Given a sample of two classes, $X = \{x^t, r^t\}_t$, where $r^t = 1$ if $x^t \in C_1$ and $r^t = 0$ if $x^t \in C_2$. In logistic discrimination, assuming that the log likelihood ratio is linear in the two-class case, the estimator of $P(C_1 \mid x)$ is the sigmoid function
$y = \hat{P}(C_1 \mid x) = \dfrac{1}{1 + \exp\left[-(w^T x + w_0)\right]}$
We assume $r^t$, given $x^t$, is Bernoulli distributed. Then the sample likelihood is
$l(w, w_0 \mid X) = \prod_t (y^t)^{r^t} (1 - y^t)^{1 - r^t}$,
and the cross-entropy is
$E(w, w_0 \mid X) = -\sum_t \left[ r^t \log y^t + (1 - r^t) \log(1 - y^t) \right]$.
Please find the update equations of $\Delta w_j$ and $\Delta w_0$, where $\Delta w_j = -\eta \dfrac{\partial E}{\partial w_j}$ and $\Delta w_0 = -\eta \dfrac{\partial E}{\partial w_0}$, $j = 1, \ldots, d$.
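For reference only: a sketch of gradient descent for this two-class logistic discrimination model, using the standard gradient of the cross-entropy above, $\partial E / \partial w_j = -\sum_t (r^t - y^t) x_j^t$; the learning rate, iteration count, and data are hypothetical.

# Gradient-descent sketch for logistic discrimination (hypothetical example).
import numpy as np

def train_logistic(X, r, eta=0.1, n_iter=1000):
    N, d = X.shape
    w, w0 = np.zeros(d), 0.0
    for _ in range(n_iter):
        y = 1.0 / (1.0 + np.exp(-(X @ w + w0)))   # sigmoid estimate of P(C1 | x)
        err = r - y                               # r^t - y^t
        w += eta * X.T @ err                      # delta w_j = eta * sum_t (r^t - y^t) x_j^t
        w0 += eta * err.sum()                     # delta w_0 = eta * sum_t (r^t - y^t)
    return w, w0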
2. (10%) Using principal components analysis, we can find a low-dimensional space such that when x is projected onto it, information loss is minimized. Let the projection of x on the direction of w be z = wTx. PCA finds w such that Var(z) is maximized,
Var(z) = wT ∑ w,
where Var(x) = E[(x – μ)(x – μ)T] = ∑.
If z1 = w1Tx with Cov(x) = ∑, then Var(z1) = w1T ∑ w1, and we maximize Var(z1) subject to ||w1|| = 1.
Please show that the first principal component is the eigenvector of the covariance matrix of the input sample with the largest eigenvalue.
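For reference only: a numerical sketch of the claim, with hypothetical data. The leading eigenvector of the sample covariance matrix is the unit direction whose projection has maximal variance, so Var(z1) equals the largest eigenvalue.

# Numerical check that the first principal component is the top eigenvector.
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[3, 1], [1, 1]], size=1000)   # hypothetical sample

S = np.cov(X, rowvar=False)                 # sample covariance of x
eigvals, eigvecs = np.linalg.eigh(S)        # eigh returns eigenvalues in ascending order
w1 = eigvecs[:, -1]                         # eigenvector with the largest eigenvalue
z1 = X @ w1                                 # projection z1 = w1^T x
print(np.var(z1, ddof=1), eigvals[-1])      # Var(z1) equals the largest eigenvalue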