Convolutional Neural Networks for
Sentence Classification
Model
Related Work

Use pre-trained word vectors for sentence-level classification tasks

Show that a simple CNN with little hyperparameter tuning and static
vectors achieves excellent results on multiple benchmarks.

Propose a simple modification to the architecture to allow for the
use of both task-specific and static vectors.
Word Embedding & Filter

Each token's word embedding typically has length d, so the size of the CNN
input matrix is not fixed: it depends on m, the number of tokens, i.e., on
the length of the sentence.

The convolution layer is essentially a feature-extraction layer; a
hyperparameter F can be set to specify how many feature extractors
(filters) to use.

For a given filter, imagine a k*d sliding window that starts at the first
token of the input matrix and keeps moving forward, where k is the filter's
window size and d is the word-embedding length.
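A minimal numpy sketch of the two points above, with toy dimensions: a sentence of m tokens becomes an m×d input matrix, and one filter slides a k×d window over it, yielding one feature value per position. All variable names and sizes here are illustrative, not taken from the slides.

```python
import numpy as np

m, d, k = 7, 5, 3              # sentence length, embedding size, filter window
X = np.random.randn(m, d)      # toy input matrix: one row per token embedding
W = np.random.randn(k, d)      # one filter: a k*d weight window
b = 0.0                        # filter bias

# Slide the window over token positions 0 .. m-k, one feature value per position.
features = np.array([np.sum(W * X[i:i + k]) + b for i in range(m - k + 1)])
features = np.maximum(features, 0)   # ReLU nonlinearity, as used in the paper
print(features.shape)                # (m - k + 1,) = (5,)
```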

Pooling

Max-pooling over time: of the feature values a given filter extracts, keep
only the highest-scoring one as the value retained by the pooling layer and
discard all the others; taking the maximum means keeping only the strongest
of these features and dropping the weaker ones of the same kind.

K-max pooling: keep the values whose scores are in the top K among all
feature values, preserving their original order; this positional information
is only the relative order among the features, not their absolute positions.

Chunk-max pooling: split the feature values produced by a given filter's
convolution layer into several segments, then take one maximum feature value
within each segment.
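A small numpy sketch of the three pooling variants applied to the feature values one filter produces; the toy values, K, and the segment count for chunk-max pooling are arbitrary choices for illustration.

```python
import numpy as np

features = np.array([0.2, 1.7, 0.3, 0.9, 1.1])   # feature values from one filter

# Max-pooling over time: keep only the single strongest feature value.
max_over_time = features.max()                    # 1.7

# K-max pooling: top-K values, kept in their original (relative) order.
K = 2
topk_idx = np.sort(np.argsort(features)[-K:])     # top-K indices, re-sorted by position
k_max = features[topk_idx]                        # [1.7, 1.1]

# Chunk-max pooling: split into segments and take one maximum per segment.
chunks = np.array_split(features, 2)              # two segments, chosen arbitrarily
chunk_max = np.array([c.max() for c in chunks])   # [1.7, 1.1]

print(max_over_time, k_max, chunk_max)
```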
Regularization
Employ dropout on the penultimate layer with a constraint on the
l2-norms of the weight vectors (Hinton et al., 2012).

Dropout prevents co-adaptation of hidden units by randomly dropping out
(i.e., setting to zero) a proportion of the hidden units during training.
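A minimal PyTorch sketch of both devices, assuming the constraint is applied as a max-norm rescaling after each gradient step (a common reading of the paper's l2 constraint); the layer sizes are placeholders.

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)            # randomly zeroes hidden units during training
fc = nn.Linear(300, 2)              # penultimate -> output layer (sizes are placeholders)
s = 3.0                             # l2-norm constraint on the weight vectors

h = torch.randn(50, 300)            # a batch of penultimate-layer activations
logits = fc(drop(h))

# After a gradient step: rescale each weight vector whose l2-norm exceeds s.
with torch.no_grad():
    norms = fc.weight.norm(p=2, dim=1, keepdim=True)
    fc.weight *= (s / norms).clamp(max=1.0)   # vectors already within s are untouched
```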
Datasets and Experimental Setup
Hyperparameters & Pre-trained Word Vectors

For all datasets we use: rectified linear units, filter
windows (h) of 3, 4, 5 with 100 feature maps
each, dropout rate (p) of 0.5, l2 constraint (s) of
3, and mini-batch size of 50.
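One way the stated setup could be wired in PyTorch; a sketch only, where the module name SentenceCNN, the vocabulary size, and the class count are placeholders, and training details (optimizer, early stopping) are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SentenceCNN(nn.Module):
    def __init__(self, vocab_size=10000, d=300, num_classes=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d)
        # Filter windows h = 3, 4, 5 with 100 feature maps each.
        self.convs = nn.ModuleList(
            [nn.Conv1d(d, 100, kernel_size=h) for h in (3, 4, 5)]
        )
        self.drop = nn.Dropout(p=0.5)            # dropout rate p = 0.5
        self.fc = nn.Linear(3 * 100, num_classes)

    def forward(self, tokens):                   # tokens: (batch, m) token ids
        x = self.emb(tokens).transpose(1, 2)     # (batch, d, m) for Conv1d
        # ReLU, then max-pooling over time for each filter window size.
        pooled = [F.relu(c(x)).max(dim=2).values for c in self.convs]
        return self.fc(self.drop(torch.cat(pooled, dim=1)))

model = SentenceCNN()
logits = model(torch.randint(0, 10000, (50, 40)))   # mini-batch size of 50
```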

Initializing word vectors with those obtained from
an unsupervised neural language model is a
popular method to improve performance in the
absence of a large supervised training set.
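A hedged sketch of this initialization using gensim; the word2vec file path and the tiny vocabulary are placeholders, and the uniform range for words missing from word2vec is a common implementation choice, not stated in the slides.

```python
import numpy as np
from gensim.models import KeyedVectors

# Placeholder path: the paper uses the 300-dim Google News word2vec vectors.
w2v = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)

vocab = ["movie", "was", "great"]                 # toy vocabulary
emb = np.random.uniform(-0.25, 0.25, (len(vocab), 300))
for i, word in enumerate(vocab):
    if word in w2v:                               # keep random init for unknown words
        emb[i] = w2v[word]
```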
Model Variations

CNN-rand: all word vectors are randomly initialized and then trained.

CNN-static: word vectors are pre-trained with word2vec and kept fixed
during training.

CNN-non-static: pre-trained word vectors are fine-tuned for each task.

CNN-multichannel: two embedding channels, both initialized from word2vec;
gradients are back-propagated through only one of them.
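A sketch of how the four variants differ only in the embedding setup, assuming PyTorch; `emb_matrix` stands in for the pre-trained word2vec matrix from the loading sketch above.

```python
import torch
import torch.nn as nn

emb_matrix = torch.randn(10000, 300)   # stand-in for pre-trained word2vec vectors

# CNN-rand: embeddings randomly initialized and trained (the nn.Embedding default).
rand = nn.Embedding(10000, 300)

# CNN-static: pre-trained vectors, kept fixed during training.
static = nn.Embedding.from_pretrained(emb_matrix, freeze=True)

# CNN-non-static: pre-trained vectors, fine-tuned for each task.
non_static = nn.Embedding.from_pretrained(emb_matrix, freeze=False)

# CNN-multichannel: both channels start from word2vec; gradients flow only
# through the non-static channel, and the model combines the two channels.
channels = [static, non_static]
```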
Results and Discussion

These results suggest that the pre-trained vectors are good,
'universal' feature extractors and can be utilized across datasets.
Fine-tuning the pre-trained vectors for each task gives still further
improvements (CNN-non-static).

Like the single-channel non-static model, the multichannel model is
able to fine-tune the non-static channel to make it more specific to
the task at hand.

We had hoped that the multichannel architecture would prevent overfitting
and thus work better than the single-channel model. The results, however,
are mixed.
Conclusion

We described a series of experiments with convolutional neural
networks built on top of word2vec. Despite little
tuning of hyperparameters, a simple CNN with
one layer of convolution performs remarkably well.

The results add to the well-established evidence that
unsupervised pre-training of word vectors is an
important ingredient in deep learning for NLP.