Convolutional Neural Networks for Sentence Classification

Related Work
- Uses pre-trained word vectors for sentence-level classification tasks and shows that a simple CNN with little hyperparameter tuning and static vectors achieves excellent results on multiple benchmarks.
- Proposes a simple modification to the architecture to allow the use of both task-specific and static vectors.

Model

Word Embedding & Filter
- Each word embedding has length d, so the size of the CNN's input matrix is not fixed: it depends on the sentence length m.
- The convolutional layer is essentially a feature-extraction layer; a hyperparameter F specifies how many feature extractors (filters) to use.
- For a given filter, imagine a k x d sliding window moving from the first word of the input matrix onward, where k is the window size specified by the filter and d is the word-embedding length. (A sketch of the full architecture appears after these notes.)

Pooling (variant sketches follow the notes below)
- MaxPooling Over Time: of the feature values a filter extracts, keep only the single highest-scoring one as the pooling layer's output and discard all the rest; keeping the maximum retains only the strongest occurrence of each feature while dropping the weaker ones.
- K-Max Pooling: keep the top-k scoring feature values, preserving their original order; this positional information is only the relative order among features, not their absolute positions.
- Chunk-MaxPooling: split the feature map of a filter's convolution layer into several segments, then take one maximum feature value from each segment.

Regularization
- Employs dropout on the penultimate layer with a constraint on the l2 norms of the weight vectors (Hinton et al., 2012). Dropout prevents co-adaptation of hidden units by randomly dropping (zeroing) a proportion of them during training. (A sketch of the norm constraint follows the notes below.)

Datasets and Experimental Setup

Hyperparameters & Pre-trained Word Vectors
- For all datasets: rectified linear units, filter windows (h) of 3, 4, 5 with 100 feature maps each, dropout rate (p) of 0.5, l2 constraint (s) of 3, and mini-batch size of 50.
- Initializing word vectors with those obtained from an unsupervised neural language model is a popular way to improve performance in the absence of a large supervised training set.

Model Variations
- CNN-rand: all word vectors randomly initialized and trained from scratch.
- CNN-static: pre-trained word2vec vectors, kept fixed during training.
- CNN-non-static: pre-trained vectors fine-tuned for each task.
- CNN-multichannel: two sets of word vectors; one channel stays static while the other is fine-tuned.

Results and Discussion
- These results suggest that the pre-trained vectors are good, 'universal' feature extractors and can be utilized across datasets. Fine-tuning the pre-trained vectors for each task gives still further improvements (CNN-non-static).
- Like the single-channel non-static model, the multichannel model is able to fine-tune its non-static channel to make it more specific to the task at hand. One might hope the multichannel architecture would prevent overfitting and thus work better than the single-channel model; the results, however, are mixed.

Conclusion
- A series of experiments with convolutional neural networks built on top of word2vec. Despite little tuning of hyperparameters, a simple CNN with one layer of convolution performs remarkably well. The results add to the well-established evidence that unsupervised pre-training of word vectors is an important ingredient in deep learning for NLP.
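
A minimal sketch of the one-layer CNN described above, in PyTorch. The paper does not prescribe an implementation; names like SentenceCNN are illustrative, and the hyperparameters mirror the settings listed in the notes (windows 3/4/5, 100 feature maps each, ReLU, dropout 0.5).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SentenceCNN(nn.Module):
    def __init__(self, vocab_size, embed_dim=300, num_classes=2,
                 window_sizes=(3, 4, 5), num_filters=100, dropout=0.5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # One Conv1d per window size h: each filter spans h words x embed_dim dims,
        # i.e., the k x d sliding window from the notes.
        self.convs = nn.ModuleList([
            nn.Conv1d(embed_dim, num_filters, kernel_size=h)
            for h in window_sizes
        ])
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(num_filters * len(window_sizes), num_classes)

    def forward(self, token_ids):                      # (batch, seq_len)
        x = self.embedding(token_ids)                  # (batch, seq_len, embed_dim)
        x = x.transpose(1, 2)                          # (batch, embed_dim, seq_len)
        # ReLU, then max-over-time pooling: one strongest response per filter.
        pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        z = torch.cat(pooled, dim=1)                   # penultimate layer, (batch, 300)
        return self.fc(self.dropout(z))                # dropout applied at train time

model = SentenceCNN(vocab_size=10_000)
logits = model(torch.randint(0, 10_000, (50, 37)))     # mini-batch of 50 sentences
```

Concatenating one maximum per filter yields a fixed-length penultimate vector regardless of sentence length, which is why the variable input-matrix size noted above poses no problem.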
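
The three pooling variants from the notes, as a sketch. Here `feats` is assumed to be one convolutional feature map of shape (batch, num_filters, seq_positions); the function names are illustrative.

```python
import torch

def max_over_time(feats):
    # Keep only the single strongest value per filter; position is discarded.
    return feats.max(dim=2).values

def k_max_pooling(feats, k):
    # Top-k values per filter, restored to their original temporal order
    # (relative order only, not absolute positions).
    top_idx = feats.topk(k, dim=2).indices
    return feats.gather(2, top_idx.sort(dim=2).values)

def chunk_max_pooling(feats, num_chunks):
    # Split the time axis into segments and take one max per segment.
    chunks = feats.chunk(num_chunks, dim=2)
    return torch.cat([c.max(dim=2, keepdim=True).values for c in chunks], dim=2)
```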
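
A sketch of the l2-norm constraint on the weight vectors described under Regularization: after each gradient step, any weight row whose l2 norm exceeds s is rescaled back onto the ball of radius s. The function name and the layer argument are illustrative; s=3 matches the paper's setting.

```python
import torch

def rescale_weights(fc, s=3.0):
    with torch.no_grad():
        norms = fc.weight.norm(p=2, dim=1, keepdim=True)
        # Scale down only the rows whose l2 norm exceeds s.
        fc.weight.mul_(torch.clamp(s / norms, max=1.0))
        # (torch.renorm offers the same operation as a one-liner.)
```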