Multiple Instance Learning Based on Support Vector Machines
Zhang Liyang
October 2014

Multiple Instance Learning

Most drug molecules take effect by binding to a larger molecule, such as a protein, and the strength of the effect is determined by how tightly they bind. A molecule with a low-energy shape that binds tightly to the desired binding region is suitable for making a drug. Any given molecule may have hundreds of low-energy shapes, and if even one of these shapes is suitable, the molecule is suitable for drug design. To address this problem, T. G. Dietterich et al. treated each molecule as a bag and each low-energy shape of the molecule as an instance in that bag, thereby introducing the concept of multiple instance learning.

In this learning setting, the training set consists of a number of concept-labeled bags, each containing a number of unlabeled instances. A bag is labeled positive if it contains at least one positive instance, and labeled negative if all of its instances are negative. By learning from the training bags, the learning system should predict the concept labels of bags outside the training set as accurately as possible.

Support Vector Machines

Given a sample set containing positive and negative examples, a support vector machine seeks a hyperplane that separates the positive examples from the negative ones. The goal is not merely to separate them, but to do so with the largest possible margin between the two classes: we maximize the distance from the hyperplane to the points closest to it.

Multiple Instance Learning for Sparse Positive Bags

SVM algorithms for MIL [SIL-SVM]

The Single Instance Learning approach to MIL transforms the MIL dataset into a standard supervised representation by applying the bag's label to all instances in the bag. A normal SVM is then trained on the resulting dataset.

SVM algorithms for MIL [NSK]

In the Normalized Set Kernel of Gartner et al. (2002), a bag is represented as the sum of all its instances, normalized by its 1- or 2-norm. The resulting representation is then used to train a traditional SVM (Figure 2).

SVM algorithms for MIL

By definition, all instances from negative bags are true negative instances. Therefore, a constraint can be created for every instance from a negative bag, leading to the tighter NSK formulation in Figure 3.

Transductive SVMs

All unlabeled examples might be classified as belonging to only one of the classes with a very large margin, especially in high dimensions and with little training data.
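The SIL transformation and the NSK bag representation described above are simple enough to sketch directly. The following is a minimal illustration with NumPy; the function names, example bags, and labels are my own and are not taken from the slides or the underlying paper:

```python
import numpy as np

def sil_transform(bags, bag_labels):
    """Single Instance Learning: copy each bag's label onto all of its instances,
    yielding a standard supervised (X, y) dataset for a normal SVM."""
    X = np.vstack(bags)
    y = np.concatenate([np.full(len(b), lbl) for b, lbl in zip(bags, bag_labels)])
    return X, y

def nsk_representation(bag, norm=2):
    """Normalized Set Kernel feature map: the sum of a bag's instances,
    divided by its 1- or 2-norm."""
    s = np.asarray(bag).sum(axis=0)
    return s / np.linalg.norm(s, ord=norm)

# Hypothetical toy data: a positive bag (at least one positive instance)
# and a negative bag (all instances negative).
bags = [np.array([[1.0, 0.0], [0.0, 1.0]]),  # positive bag, 2 instances
        np.array([[-1.0, 0.0]])]             # negative bag, 1 instance
labels = [+1, -1]

X, y = sil_transform(bags, labels)
print(X.shape, y)                    # (3, 2) [ 1  1 -1]
print(nsk_representation(bags[0]))   # unit-norm sum of the first bag's instances
```

Either representation can then be fed to an off-the-shelf SVM trainer; the difference is that SIL produces one training point per instance, while NSK produces one per bag.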
To ensure that unlabeled examples are assigned to both classes, they further constrained the solution by introducing a balancing constraint. If L is the labeled training data, U is the unlabeled dataset, and y(x) = ±1 denotes the label of x, the balancing constraint has the form shown in Equation 1.

Multiple Instance Learning for Sparse Positive Bags

Replacing the inequality constraint from Figure 3 with the new balancing constraint (derived from Equation 3 by summing up the hidden labels) leads to the optimization problem in Figure 4 (sMIL).

A transductive SVM approach to sparse MIL

Even though the balancing constraint in the sMIL formulation comes closer to expressing the requirement that at least one instance from a positive bag is positive, there may be cases in which all instances of a bag have negative scores and yet the bag satisfies the balancing constraint. This can happen, for instance, when the negative scores are very close to 0. On the other hand, if all negative instances inside a bag X were constrained to have scores of at most −1 + ξ_X, then the balancing constraint

    w·φ(X) + b·|X| ≥ (2 − |X|)(1 − ξ_X)

would guarantee that at least one instance x has a score w·φ(x) + b ≥ 1 − ξ_X.

Thank you!
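As a closing supplement, the guarantee stated in the sparse-MIL argument above can be checked algebraically. Write $\phi(X) = \sum_{x \in X} \phi(x)$, let $m = \max_{x \in X} \big(w^\top\phi(x) + b\big)$, and suppose every instance other than the highest-scoring one is constrained as a negative, i.e. has score at most $-(1 - \xi_X)$. Then:

$$
w^\top\phi(X) + b\,|X| \;=\; \sum_{x \in X} \big(w^\top\phi(x) + b\big)
\;\le\; m - (|X| - 1)(1 - \xi_X).
$$

Combining this with the balancing constraint $w^\top\phi(X) + b\,|X| \ge (2 - |X|)(1 - \xi_X)$ gives

$$
m \;\ge\; (2 - |X|)(1 - \xi_X) + (|X| - 1)(1 - \xi_X) \;=\; 1 - \xi_X,
$$

so at least one instance must have a score of at least $1 - \xi_X$, as claimed.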