Appendix

Contents
Datasets
Detailed Methods
Relation to previous work
Figures
References

Datasets

All four synopses were formed using a seeding PubMed search (Appendix Figure 1) that retrieved publications up to each database's inception date; thereafter, a maintenance search strategy was used. The difference between the seeding and maintenance searches is that the former is more specific whereas the latter is more sensitive. This is for practical reasons: using the maintenance search strategy at database inception would retrieve an unmanageably large number of articles to screen. When continuously monitoring the literature, however, it is feasible to use a very sensitive strategy. The maintenance search strategy has remained constant across all four databases.

Detailed Methods

We view the citation screening problem as a specific instance of the text classification problem, in which the aim is to induce a model capable of automatically categorizing text documents into one of k categories. E-mail spam filters are a familiar example of text classifiers: they automatically designate incoming messages as spam or not.

To use a text classification model, documents must first be encoded into a vector space so that they are intelligible to the model. We make use of the standard Bag-of-Words (BoW) encoding scheme, in which each document is mapped to a vector whose ith entry is 1 if word i is present in the document and 0 otherwise. We map the titles, abstracts and MeSH terms of a given document into separate BoW representations (the latter might be called a Bag-of-MeSH-terms representation). These mappings are referred to as feature spaces.

For our base classifier, we use the Support Vector Machine (SVM).1 SVMs have been shown empirically to be particularly adept at text classification.2 Briefly, SVMs work by finding a hyperplane (i.e., a high-dimensional generalization of a line) that separates instances (documents) from the respective classes with the maximum possible margin. The intuition behind the SVM approach is illustrated in Appendix Figure 2, which shows a simplified, fictitious 2-dimensional classification problem comprising two classes: the plusses (+) and the minuses (-). Infinitely many lines separate the two classes; one of these is shown on the left-hand side of the figure. The SVM selects a separating line in a principled way: it chooses the line that maximizes the distance between itself and the nearest members of the respective classes, a distance referred to as the margin. Given this criterion, the separating line the SVM would select in our example is shown on the right-hand side of the figure. In practice, this line is found by solving an objective function that expresses the max-margin principle.
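As a concrete illustration of the encoding and base classifier just described, the following is a minimal sketch in Python, assuming scikit-learn. The toy citations, variable names, and the choice of LinearSVC are illustrative assumptions, not a specification of the system's actual implementation.

```python
# Sketch (assuming scikit-learn; toy data) of the binary Bag-of-Words encoding
# described above. Titles, abstracts and MeSH terms each get their own feature space.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Toy citations; in practice these come from the PubMed searches.
titles    = ["Aspirin for primary prevention of cardiovascular events",
             "Total knee arthroplasty outcomes at five years"]
abstracts = ["We compared aspirin with placebo ...",
             "We report functional outcomes after arthroplasty ..."]
mesh      = ["Aspirin; Cardiovascular Diseases; Primary Prevention",
             "Arthroplasty, Replacement, Knee; Treatment Outcome"]
labels    = [1, 0]  # 1 = relevant, 0 = irrelevant (toy screening decisions)

# binary=True yields the 0/1 word-presence vectors described above.
title_vec, abstract_vec, mesh_vec = (CountVectorizer(binary=True) for _ in range(3))
X_title    = title_vec.fit_transform(titles)
X_abstract = abstract_vec.fit_transform(abstracts)
X_mesh     = mesh_vec.fit_transform(mesh)

# A single linear (max-margin) SVM trained on one feature space; the next
# sketch shows how the three feature spaces are combined into an ensemble.
clf_title = LinearSVC().fit(X_title, labels)
print(clf_title.predict(title_vec.transform(["Aspirin and stroke prevention"])))
```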
A few properties make the citation classification problem distinctive from a data mining/machine learning perspective. First, there is severe class imbalance, i.e., there are far fewer relevant than irrelevant citations; this can pose problems for machine learning algorithms. Second, false negatives (relevant citations misclassified as irrelevant) are costlier than false positives (irrelevant citations misclassified as relevant), and we therefore want to emphasize sensitivity (recall) over specificity. Accordingly, we have tuned our model to this end. In particular, we first build an ensemble of three classifiers, one for each of the three aforementioned feature spaces. Each of these classifiers is trained with a modified SVM objective function that emphasizes sensitivity by penalizing the model less heavily for misclassifying negative (irrelevant) examples during training. When a prediction for a given document is made, we follow a simple disjunction rule to aggregate predictions: if any of the three classifiers predicts relevant, then we predict that the document is relevant; only when there is unanimous agreement that the document is irrelevant do we exclude it.

To further increase sensitivity, and to reduce the variance introduced by the sampling scheme described next, we build a committee of these ensembles. More specifically, we undersample the majority class of irrelevant instances before training the classifiers, i.e., we discard citations designated as irrelevant by the reviewer at random until the number of irrelevant documents in the training set equals the number of relevant documents. This simple strategy for dealing with class imbalance has been shown empirically to work well, particularly in terms of improving the induced classifiers' sensitivity.4-7 Because this introduces randomness (the particular majority instances that are discarded are selected at random), we build a committee of these classifiers and take a majority vote to make the final prediction. This ensemble strategy is known as bagging7,8 and is an extension of bootstrapping to the case of predictive models.8 Bagging reduces the variance of predictors. We have found that bagging classifiers induced over balanced bootstrap samples works well in the case of class imbalance, consistent with previous evidence that this strategy is effective,4,5 and we have proposed an explanation based on a probabilistic argument.10

Specifically, for our task we induce an ensemble of 11 base classifiers over corresponding balanced (i.e., undersampled) bootstrap samples drawn from the original training data. The final classification decision is taken as a majority vote over this committee. We chose an odd number (n=11) to break ties; the exact number is arbitrary, but has worked well in previous work.4,5 Each base classifier is itself an aggregate prediction (a Boolean OR; i.e., each base classifier predicts that a document is relevant if any of its members does) over three separate classifiers induced, respectively, over the different feature-space representations (titles, abstracts, MeSH terms) of the documents included in an independently drawn balanced bootstrap sample. For a schematic depiction of this, see the Figure in the main manuscript.
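To make the pipeline concrete, the following is a minimal sketch of the committee just described, again assuming scikit-learn and reusing the X_title, X_abstract and X_mesh matrices from the previous sketch. The class_weight setting stands in for the modified, sensitivity-favoring SVM objective, and the 10:1 weighting is an illustrative assumption rather than a value used in the actual system.

```python
# Sketch of the committee: 11 base classifiers, each a Boolean OR over three
# class-weighted linear SVMs (titles, abstracts, MeSH terms) trained on an
# independently drawn balanced (undersampled) sample of the training data.
import numpy as np
from sklearn.svm import LinearSVC

def balanced_sample(y, rng):
    """All relevant documents plus an equal-sized random subset of the
    irrelevant (majority) documents, mirroring the undersampling above."""
    pos = np.where(y == 1)[0]
    neg = np.where(y == 0)[0]
    return np.concatenate([pos, rng.choice(neg, size=len(pos), replace=False)])

def train_committee(X_by_space, y, n_members=11, seed=0):
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    committee = []
    for _ in range(n_members):
        idx = balanced_sample(y, rng)
        # One sensitivity-favoring SVM per feature space: errors on relevant
        # (positive) documents are penalized more heavily than errors on
        # irrelevant ones. The 10:1 weighting is an assumption.
        member = [LinearSVC(class_weight={1: 10.0, 0: 1.0}).fit(X[idx], y[idx])
                  for X in X_by_space]
        committee.append(member)
    return committee

def predict(committee, X_by_space):
    votes = []
    for member in committee:
        # Disjunction rule: a member flags a document as relevant if ANY of
        # its three feature-space classifiers does.
        preds = [clf.predict(X) for clf, X in zip(member, X_by_space)]
        votes.append(np.maximum.reduce(preds))
    # Majority vote over the 11 members (an odd number, so no ties).
    return (np.mean(votes, axis=0) > 0.5).astype(int)

# Example usage with the matrices from the previous sketch:
# committee = train_committee([X_title, X_abstract, X_mesh], labels)
# decisions = predict(committee, [X_title, X_abstract, X_mesh])
```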
Relation to previous work

There has been a good deal of research in the machine learning and medical informatics communities investigating techniques for semi-automating citation screening.4-7,11-18 These works have largely been 'proof-of-concept' endeavors that have explored the feasibility of automatic classification for the citation screening task (with promising results). By contrast, the present work applies our existing classification methodology prospectively, to demonstrate its utility in reducing the burden on reviewers updating existing systematic reviews.

Most similar to our work here, Cohen et al. conducted a prospective evaluation of a classification system for supporting the systematic review process.14 Rather than semi-automating the screening process, the authors advocated using data mining for work prioritization. More specifically, they induced a model to rank the retrieved set of potentially relevant citations in order of their likelihood of being included in the review. In this way, reviewers would screen the citations most likely to be included first, thereby discovering the relevant literature earlier in the review process than if they had screened the citations in a random order. Note that in this scenario, reviewers still ultimately screen all of the retrieved citations. This differs from our aim here, as we prospectively evaluate a system that automatically excludes irrelevant literature, i.e., one that reduces the number of abstracts reviewers must screen for a systematic review.

References

1. Vapnik VN. The Nature of Statistical Learning Theory. Springer; 2000.
2. Joachims T. Text categorization with support vector machines: learning with many relevant features. Proc European Conference on Machine Learning (ECML). Springer; 1998:137-142.
3. Van Hulse J, Khoshgoftaar TM, Napolitano A. Experimental perspectives on learning from imbalanced data. ACM; 2007:935-942.
4. Small KM, Wallace BC, Brodley CE, Trikalinos TA. The constrained weight space SVM: learning with labeled features. Proc International Conference on Machine Learning (ICML); 2011.
5. Wallace BC, Small KM, Brodley CE, Trikalinos TA. Active learning for biomedical citation screening. Proc Knowledge Discovery and Data Mining (KDD); 2010.
6. Wallace BC, Small KM, Brodley CE, Trikalinos TA. Modeling annotation time to reduce workload in comparative effectiveness reviews. Proc ACM International Health Informatics Symposium (IHI); 2010.
7. Wallace BC, Small KM, Brodley CE, Trikalinos TA. Who should label what? Instance allocation in multiple expert active learning. Proc SIAM International Conference on Data Mining; 2011.
8. Breiman L. Bagging predictors. Machine Learning. 1996;24(2):123-140.
9. Kang P, Cho S. EUS SVMs: ensemble of under-sampled SVMs for data imbalance problems. Neural Information Processing; 2006:837-846.
10. Wallace BC, Small K, Brodley CE, Trikalinos TA. Class imbalance, redux. Proc International Conference on Data Mining (ICDM); 2011.
11. Bekhuis T, Demner-Fushman D. Towards automating the initial screening phase of a systematic review. Stud Health Technol Inform. 2010;160(Pt 1):146-150.
12. Cohen AM, Hersh WR, Peterson K, Yen PY. Reducing workload in systematic review preparation using automated citation classification. J Am Med Inform Assoc. 2006;13:206-219.
13. Cohen AM, Ambert K, McDonagh M. Cross-topic learning for work prioritization in systematic review creation and update. J Am Med Inform Assoc. 2009. Erratum in: J Am Med Inform Assoc. 2009;16(6):898.
14. Cohen AM, Ambert K, McDonagh M. A prospective evaluation of an automated classification system to support evidence-based medicine and systematic review. AMIA Annu Symp Proc. 2010.
15. Frunza O, Inkpen D, Matwin S, Klement W, O'Blenis P. Exploiting the systematic review protocol for classification of medical abstracts. Artif Intell Med. 2011;51(1):17-25.
16. Matwin S, Kouznetsov A, Inkpen D, Frunza O, O'Blenis P. A new algorithm for reducing the workload of experts in performing systematic reviews. J Am Med Inform Assoc. 2010;17(4):446-453.
17. Polavarapu N, Navathe SB, Ramnarayanan R, ul Haque A, Sahay S, Liu Y. Investigation into biomedical literature classification using support vector machines. Proc IEEE Comput Syst Bioinform Conf. 2005:366-374.
18. Yu W, Clyne M, Dolan SM, Yesupriya A, Wulf A, Liu T, Khoury MJ, Gwinn M. GAPscreener: an automatic tool for screening human genetic association literature in PubMed using the support vector machine technique. BMC Bioinformatics. 2008;9:205.