Cohen AM et al., Automated Confidence Ranked Classification of Randomized Controlled Trial Articles: An Aid to Evidence-Based Medicine

Supplemental Appendix

1. Details of the Manual Annotation Process and Inter-rater Agreement with the MEDLINE Randomized Controlled Trial (RCT) Publication Type

The following instructions were provided to the manual annotators:

The MEDLINE definition of an RCT (http://www.ncbi.nlm.nih.gov/mesh/68016449) is as follows: Work consisting of a clinical trial that involves at least one test treatment and one control treatment, concurrent enrollment and follow-up of the test- and control-treated groups, and in which the treatments to be administered are selected by a random process, such as the use of a random-numbers table.

For ambiguous cases, the following guidance may be helpful: An RCT article is a publication that is the primary description of a study that, at least in one of its components, is a randomized controlled trial, or a specific description of one or more of a set of related randomized controlled trials. For a study to be a randomized controlled trial, the assignment of subjects to the various interventions must be randomized, there must be a control intervention to contrast with the studied intervention, and the study must be a prospective trial of a well-defined intervention or interventions.

Review each article in order in the spreadsheet and, using the drop-down for the judgment field, select one of: IS_RCT, NOT_RCT, UNCERTAIN.

IS_RCT – the article meets the definition of a randomized controlled trial given above.
NOT_RCT – the article does not meet the definition of a randomized controlled trial given above.
UNCERTAIN – based on the title and abstract, the article may meet the definition of a randomized controlled trial, but it is not possible to confirm this because the full text is not available or the full text provides insufficient detail.

When making the judgments, you may review the article's title, abstract, authors, journal, publication date, and full article text. Do not look at the MeSH terms or assigned MEDLINE publication types.

To evaluate the reproducibility of the RCT manual review process, a random subset of 100 previously unseen articles published in the year 2014 was manually reviewed by an independent human assessor and annotated using the definition above. The independent annotator was experienced with evidence-based medicine and familiar with the work of our team, but had no part in developing the machine learning model. These articles were chosen so that half had been assigned the MEDLINE RCT_PT and half had not. The Cohen's kappa inter-rater agreement between the PubMed Randomized Controlled Trial publication type and the assessment of the independent annotator was 0.72, which represents a substantial level of agreement. Hooper's measure of indexing consistency was 72%, comparable to the best performance obtained in other inter-rater agreement studies on MEDLINE annotation.[1] The annotator and the publication type agreed perfectly that all 50 of the articles not annotated with the MEDLINE RCT_PT were indeed not RCTs. Of the 50 articles having the MEDLINE RCT_PT, the independent annotator differed from the MEDLINE RCT_PT on 14 articles, designating these as not RCTs. Reasons given by the independent annotator for designating these articles as not RCTs included: "not randomized", "not controlled", "treatment allocation blocked by clinic not patient", and "describes a proposed RCT".
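The reported kappa of 0.72 can be reproduced directly from the agreement counts given in this section (36 articles judged RCT by both the annotator and the publication type, 14 judged RCT only by the publication type, and 50 judged not-RCT by both). A minimal sketch; the function name is ours:

```python
# Cohen's kappa for the 2x2 agreement table between the independent
# annotator and the MEDLINE RCT publication type. Counts come from the
# text: full agreement on the 50 non-RCT_PT articles, and 14 of the 50
# RCT_PT articles judged not-RCT by the annotator.

def cohens_kappa(a, b, c, d):
    """a: both positive, b: rater1 pos / rater2 neg,
    c: rater1 neg / rater2 pos, d: both negative."""
    n = a + b + c + d
    p_observed = (a + d) / n
    # Chance agreement is computed from the marginal frequencies.
    p1 = (a + b) / n  # rater 1 positive rate
    p2 = (a + c) / n  # rater 2 positive rate
    p_expected = p1 * p2 + (1 - p1) * (1 - p2)
    return (p_observed - p_expected) / (1 - p_expected)

# Annotator vs. publication type: a=36, b=0, c=14, d=50.
print(round(cohens_kappa(36, 0, 14, 50), 2))  # 0.72
```

Observed agreement is 0.86 and chance agreement is 0.50, giving (0.86 - 0.50) / (1 - 0.50) = 0.72, matching the value reported above.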
2. Human Related Articles PubMed Query and Dataset Details

The full query used to construct our data set is shown in Figure S1 below. The query uses both subject-related terms, such as Human, and study-design-related terms, such as case-control studies and double-blind method, in order to identify as many human-related articles as possible. We only included articles that were published in English-language journals, were annotated with MeSH terms, and had an available abstract.

The dataset was split into a training set, comprising articles from 1987-2010, and two evaluation/testing sets, one corresponding to each of the years 2011 and 2012. The complete dataset included 6,316,826 articles: 5,482,943 articles for training, 406,558 articles in the 2011 evaluation set, and 427,325 articles in the 2012 evaluation set. The percentage of articles indexed as having the RCT_PT was approximately 4.4%, 4.5%, and 4.6% in the training, 2011, and 2012 data subsets, respectively.

Figure S1. PubMed query used to build the document data set used for this research. The first portion of the query (before the initial AND) defines the date range. The second portion (up to the first NOT) captures clinical research articles. The third portion (after the first NOT) excludes studies about animals and not humans. The last portion of the query (starting with "hasabstract") requires the articles to have abstracts and be published in English.
("1987/01/01"[PDAT] : "2012/12/31"[PDAT]) AND (hasabstract[text] AND English[lang]) AND ("humans"[MeSH Terms] OR "meta-analysis"[Publication Type] OR "Clinical Trial"[Publication Type] OR "Clinical Trial, Phase I"[Publication Type] OR "Clinical Trial, Phase II"[Publication Type] OR "Clinical Trial, Phase III"[Publication Type] OR "Clinical Trial, Phase IV"[Publication Type] OR "Controlled Clinical Trial"[Publication Type] OR "Randomized Controlled Trial"[Publication Type] OR "case reports"[Publication Type] OR "Multicenter Study"[Publication Type] OR "Comparative Study"[Publication Type] OR "case-control studies"[MeSH Terms] OR "cross-sectional studies"[MeSH Terms] OR "cross-over studies"[MeSH Terms] OR "double-blind method"[MeSH Terms] OR "Genome-Wide Association Study"[MeSH Terms] OR "single-blind method"[MeSH Terms] OR "random allocation"[MeSH Terms] OR "focus groups"[MeSH Terms] OR "cohort studies"[MeSH Terms] OR "prospective studies"[MeSH Terms] OR "follow-up studies"[MeSH Terms] OR "Multicenter Studies as Topic"[MeSH Terms] OR "Randomized Controlled Trials as Topic"[MeSH Terms] OR "Clinical Trials as Topic"[MeSH Terms]) NOT ("animals"[MeSH Terms] NOT "humans"[MeSH Terms])

3. Per-Feature Statistical Significance Testing

As part of the feature set pre-processing, we evaluated the statistical significance of every feature in the training set, across all feature types, for predicting whether an article was an RCT. We counted the number of articles that each feature occurred in, as well as the number of articles with each feature that were annotated with the MEDLINE publication type Randomized Controlled Trial.
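These per-feature counts feed the 2x2 chi-squared filter detailed in the next paragraph. A minimal sketch; this is an illustration rather than the authors' code, and the feature counts used in the example are hypothetical:

```python
# Per-feature 2x2 chi-squared significance filter. Rows of the table:
# article contains the feature or not; columns: article tagged with the
# RCT publication type or not.

MIN_ARTICLES = 500   # features occurring in fewer articles are dropped
CHI2_CRITICAL = 3.841  # chi-squared critical value, alpha = 0.05, 1 d.f.

def chi2_2x2(a, b, c, d):
    """Pearson chi-squared statistic for the 2x2 table [[a, b], [c, d]]."""
    table = [[a, b], [c, d]]
    n = a + b + c + d
    row_tot = [a + b, c + d]
    col_tot = [a + c, b + d]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_tot[i] * col_tot[j] / n
            stat += (table[i][j] - expected) ** 2 / expected
    return stat

def is_significant(n_with_feature, n_with_feature_rct, n_total, n_total_rct):
    if n_with_feature < MIN_ARTICLES:
        return False
    stat = chi2_2x2(n_with_feature_rct,
                    n_with_feature - n_with_feature_rct,
                    n_total_rct - n_with_feature_rct,
                    (n_total - n_with_feature) - (n_total_rct - n_with_feature_rct))
    return stat > CHI2_CRITICAL

# Hypothetical counts: a token occurring in 200,000 of the 5,482,943
# training articles, 120,000 of them tagged RCT_PT (base rate ~4.4%).
print(is_significant(200_000, 120_000, 5_482_943, 241_249))  # True
```

A feature whose RCT rate matches the ~4.4% base rate would yield a statistic near zero and be filtered out, while a feature far above or below the base rate passes the 0.05 threshold.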
Features occurring in fewer than 500 articles (< 0.01% of articles) were dropped, and a 2x2 chi-squared test was computed for the remaining features, with the rows of the 2x2 table representing citations including the feature or not, and the columns representing citations annotated with the MEDLINE publication type Randomized Controlled Trial or not. We did not apply any alpha correction, such as Bonferroni, when performing these statistical tests. With over 45 million total features to be tested, the required cut-off value would have been exceedingly small, leaving very few features. Instead we chose to apply the commonly used alpha cut-off value of 0.05 to select significant features, which left us with a large but manageable set of 113,177 statistically significant features. This collection of features will be referred to as the statistically significant feature set. Predictive machine learning models built with this feature set will be compared with models built without first filtering the features by significance testing.

4. Description of the Recursive Partitioning Algorithm Used on the Author Count Feature

We used a method called recursive partitioning to split these count-based features into ranges and treat each range as a separate binary feature. Recursive partitioning (RP) uses a minimum description length (MDL) approach. RP works by starting with the full range of author count values and determining whether there is a place to split the range, in between values, that reduces the total number of bits necessary to encode the dependent variable for each sample (in this case RCT status), including the overhead of representing the partitioning value.
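A single MDL-tested split of the kind just described can be sketched as follows. This follows the Fayyad-Irani MDLP stopping criterion for one binary cut and is an illustration, not the authors' exact implementation:

```python
from math import log2

# One MDL-tested binary split over an integer-valued feature such as
# author count. `values` are feature values; `labels` are booleans
# (RCT or not). Returns the cut point, or None if the MDL cost of
# describing the split exceeds the information gained.

def entropy(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

def best_mdl_split(values, labels):
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    base = entropy([l for _, l in pairs])
    best = None
    for i in range(1, n):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # only cut between distinct feature values
        left = [l for _, l in pairs[:i]]
        right = [l for _, l in pairs[i:]]
        w = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2
        if best is None or w < best[0]:
            best = (w, cut, entropy(left), entropy(right),
                    len(set(left)), len(set(right)))
    if best is None:
        return None
    w, cut, e1, e2, k1, k2 = best
    gain = base - w
    k = len(set(labels))
    # Bits needed to describe the split itself (Fayyad & Irani, 1993).
    delta = log2(3 ** k - 2) - (k * base - k1 * e1 - k2 * e2)
    threshold = (log2(n - 1) + delta) / n
    return cut if gain > threshold else None

# Toy example: low author counts non-RCT, high counts RCT.
print(best_mdl_split([1, 1, 1, 2, 2, 8, 9, 10, 12],
                     [False] * 5 + [True] * 4))  # 5.0
```

Applying the same test recursively to each accepted sub-range, until no further split pays for its own encoding cost, yields the partition boundaries described below.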
This approach is then recursively applied to each sub-range, and the process is repeated until splitting is no longer effective at reducing the number of bits necessary for encoding. The result is a set of partition boundaries that define the ranges of individual binary features. In our implementation this is handled automatically within each run before the sample feature vectors are passed to the SVM classifier. In all cases, including cross-validation, the partitions are determined based on the training data being used for that run, and then that partitioning is applied to the evaluation data for that run.

Classification using only recursive partitioning of the author count resulted in a mean AUC of 0.631. For author count, MDL optimal partition boundaries on the entire training set were found at 0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 9.5, and 11.5. Therefore partitioned author count features appear to be worth including in the predictive models. Forward selection confirmed these observations.

5. Feature Set Creation

Feature sets included in the final models are shown in Table S1. Text-based features were created by tokenizing the text at delimiters and then filtering out stop word tokens before creating n-grams. We used the stop word list created by Andrew McCallum (http://www.cs.cmu.edu/~mccallum/bow/).

Table S1. Feature sets, types, and whether they were included in each of the final predictive models by the forward selection process.

FEATURE SET                             TYPE    INCLUDED IN CITATION_ONLY MODEL?  INCLUDED IN CITATION_PLUS_MESH MODEL?
Title unigrams                          binary  ✓                                 ✓
Title bigrams                           binary  ✓                                 ✓
Abstract unigrams                       binary  ✓                                 ✓
Abstract bigrams                        binary  ✓                                 ✓
Abstract trigrams                       binary  ✓                                 ✓
Author count                            scalar  ✓                                 ✓
Journal name                            binary  ✓                                 ✓
MeSH article terms                      binary                                    ✓
MeSH major article terms                binary                                    ✓
MeSH article term qualifiers            binary                                    ✓
Title trigrams                          binary
MeSH major terms with qualifiers        binary
Authors last names with first initial   binary
Languages (en, fr, etc.)                binary
Databank accession numbers              binary
Title acronym count                     scalar

6. Forward Selection Method for Feature Sets

We built our models using a forward selection process on each feature set. We started by building individual predictive models for each feature set and evaluating each feature set using 5x2 cross-validation on the training data. From these runs, we chose the best performing feature set. Then we performed evaluation runs adding each not-yet-included feature set to the previously determined best feature set. We again chose the best run from these, added to the current best collection of feature sets each remaining feature set not yet included, and evaluated these using cross-validation. We repeated this iterative process until no performance gain was achieved.

For forward selection we chose AUC and average precision as the primary measures to select the best feature set or combination of feature sets. We chose these measures because they are both sensitive to ranking, and the ability to accurately rank articles as RCTs was one of our primary goals. If a run had the best performance on both of these measures, or on one of these measures and no worse performance on the other, that run was chosen as the best for that iteration.
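The iterative procedure described above can be sketched as a greedy loop. This is a schematic illustration: `evaluate` stands in for the 5x2 cross-validation, and the tie-breaking rule is simplified to a lexicographic comparison of (AUC, average precision) rounded to three decimal places:

```python
# Greedy forward selection over feature sets. `evaluate` takes a list
# of feature set names and returns (mean_auc, mean_average_precision).

def forward_select(feature_sets, evaluate):
    selected, best_score = [], None
    remaining = list(feature_sets)
    while remaining:
        candidates = []
        for fs in remaining:
            auc, ap = evaluate(selected + [fs])
            candidates.append(((round(auc, 3), round(ap, 3)), fs))
        score, fs = max(candidates)
        # Stop when no candidate improves on the current best score
        # (compared to three decimal places, as in the text).
        if best_score is not None and score <= best_score:
            break
        selected.append(fs)
        remaining.remove(fs)
        best_score = score
    return selected

# Toy illustration with additive, hypothetical per-set gains.
def toy_eval(sets):
    gains = {"abstract_unigrams": 0.941, "abstract_bigrams": 0.014, "noise": 0.0}
    auc = sum(gains[s] for s in sets)
    return auc, auc  # same value for both measures in this toy example

print(forward_select(["abstract_unigrams", "abstract_bigrams", "noise"], toy_eval))
```

In the toy run, "abstract_unigrams" is selected first, "abstract_bigrams" second, and "noise" is rejected because it adds nothing at three-decimal precision.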
Feature set selection ended when no cross-validation performance improvement could be obtained by adding an unused feature set to the current best feature set collection. We compared performance to three decimal places. This decision was based on the fact that the width of the AUC 95% confidence interval for the individual cross-validation estimates is approximately +/- 0.002 according to the conservative Hanley/McNeil method, and we wanted to ensure we included any feature set that could lead to a statistically significant improvement. Furthermore, at the high level of performance we were targeting, a performance improvement of 0.001 represents 2.5% of the maximum possible improvement, which could be important in an actual system.

The feature sets added at each stage of the forward selection decision process for the CITATION_PLUS_MESH model are shown in Table S2, with the same information for the CITATION_ONLY model shown in Table S3. Listed are the mean values of the AUC and average precision measures (across all 5 iterations of 2-way cross-validation runs) used to make decisions on which feature set most improved the model. Beyond the last model in each table, no feature set improved the performance on AUC or average precision.

Table S2. Results of the forward selection process feature set inclusion in the CITATION_PLUS_MESH model. Values shown are the mean performance measures of 5 iterations of 2-way cross-validation on the training dataset (1987-2010), randomly down-sampled at 7.5% for each iteration.
FEATURE SETS            MEAN AUC  MEAN AVERAGE PRECISION
abstract unigrams       0.941     0.741
add mesh terms          0.959     0.801
add abstract bigrams    0.970     0.856
add title unigrams      0.971     0.865
add abstract trigrams   0.973     0.869
add author count        0.975     0.873
add title bigrams       0.975     0.875
add mesh qualifiers     0.976     0.875
add journal name        0.976     0.876
add mesh major terms    0.976     0.877

Table S3. Results of the forward selection process feature set inclusion in the CITATION_ONLY model. Values shown are the mean performance measures of 5 iterations of 2-way cross-validation on the training dataset (1987-2010), randomly down-sampled at 7.5% for each iteration.

FEATURE SETS            MEAN AUC  MEAN AVERAGE PRECISION
abstract unigrams       0.941     0.741
add abstract bigrams    0.955     0.804
add author count        0.965     0.837
add abstract trigrams   0.968     0.844
add title bigrams       0.969     0.849
add journal name        0.969     0.852
add title unigrams      0.969     0.855

7. Extensions to the Method of Rüping for SVM Confidence Estimation

The signed margin distance is essentially the distance in front of or behind the separating hyperplane that the SVM algorithm creates as the optimal solution to the classification problem. Samples with positive signed margin values are predicted to be drawn from the positive class, and samples with zero or negative signed margin values are predicted to be from the negative class. Samples can be ranked using the signed margin values – more positive values are considered more strongly positive predictions. However, a signed margin value on its own does not have any direct probabilistic interpretation. The method of Rüping converts the signed margin values into confidence predictions using a simple algorithm, based on the idea that most of the information about confidence is in the margin value interval between +1.0 and -1.0.
This is due to the way the SVM mathematical optimization problem is formulated. Treating all predictions as positive with a confidence between 0.0 and 1.0, the method estimates that the minimum confidence occurs at a signed margin value <= -1.0, and is equal to the number of positive training samples with a signed margin value <= -1.0 divided by the total number of training samples with a signed margin value <= -1.0. Similarly, the maximum confidence occurs at a signed margin value >= +1.0, and is equal to the number of positive training samples with a signed margin value >= +1.0 divided by the total number of training samples with a signed margin value >= +1.0. Between the margin values of -1.0 and +1.0, Rüping's method simply linearly interpolates the confidence between these minimum and maximum values.

The method works well but has a couple of shortcomings. First, while margin values > +1.0 and < -1.0 certainly carry less information about confidence than those in the interval between -1.0 and +1.0, it is still reasonable to expect that a signed margin value of +100.0 would represent a higher confidence prediction than a value of +1.5. Furthermore, "flat lining" the confidence values results in poor AUC and average precision performance measures, because all of the ranking information for the very high and very low confidence predictions is thrown away. Secondly, using the margin values from the training samples used to create the model in the first place is a biased approach, which will likely result in the high and low confidence points being somewhat more extreme than they really are. We extended the method of Rüping in two ways to overcome these deficiencies.
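The base Rüping mapping, together with one way of attaching monotone arctan tails of the kind we describe, can be sketched as follows. Here `conf_lo` and `conf_hi` stand for the empirically estimated minimum and maximum confidences, and the exact tail scaling is one illustrative choice, not the authors' exact implementation:

```python
from math import atan, pi

def ruping_confidence(margin, conf_lo, conf_hi):
    """Base method: linear interpolation on (-1, +1), flat outside."""
    if margin <= -1.0:
        return conf_lo
    if margin >= 1.0:
        return conf_hi
    t = (margin + 1.0) / 2.0
    return conf_lo + t * (conf_hi - conf_lo)

def arctan_extended_confidence(margin, conf_lo, conf_hi):
    """Arctan tails keep the mapping strictly increasing outside
    [-1, +1]: conf_hi at +1 rising toward 1.0 at +infinity, and
    conf_lo at -1 falling toward 0.0 at -infinity."""
    if margin > 1.0:
        return conf_hi + (1.0 - conf_hi) * (2.0 / pi) * atan(margin - 1.0)
    if margin < -1.0:
        return conf_lo - conf_lo * (2.0 / pi) * atan(-1.0 - margin)
    return ruping_confidence(margin, conf_lo, conf_hi)
```

Because atan is strictly increasing, the tails preserve the rank ordering of the samples: a margin of +100.0 maps to a higher confidence than +1.5, instead of both being clamped to `conf_hi`.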
First, instead of estimating the margin distances and the high and low confidence points from the trained model, we estimate these quantities using 10-fold cross-validation on the training samples used to build the model. Therefore, the margin distances and the resulting high and low confidence ratios are estimated not with the model trained on all the data, but instead with a set of very similar models that do not include the data sample being estimated.

The second extension is in how we handle the margin intervals below -1.0 and above +1.0. Instead of clamping the confidence at the maximum/minimum value computed, we extend the linear portion of the margin vs. confidence graph with arctan "splines" attached to the top and bottom. The positive arctan segment is scaled so that the value at a signed margin of +1.0 is equal to the maximum computed confidence and the value at a margin of positive infinity is equal to 1.0. Similarly, the negative arctan segment is scaled so that the value at a signed margin of -1.0 is equal to the minimum computed confidence and the value at minus infinity is equal to 0.0. These arctan extensions preserve the rank ordering of the samples, and slightly improve the accuracy of the confidence predictions.

8. Cross-validation Performance of Title and Abstract Features by Data Set Subsampling Rate

Figure S2 shows the cross-validation performance for a number of evaluation measures at varying fractions of training data samples. It is clear that performance is essentially flat across all measures between 5% and 10% sampling of the training data. This represents between 300,000 and 600,000 individual article citations.

Figure S2. Plot of a number of performance measures obtained by 5x2 cross-validation while varying the fraction of the training dataset used to train the classifier.
Classification performance begins to flatten out at 2.5% of the dataset and is essentially maximized at 5% for all measures.

[Figure: "Citation Features Cross-Validation RCT Confidence Prediction vs. Training Data Fraction"; y-axis: Metric Value (0.500-1.000); x-axis: Fraction of the Human-Related Article Universe Training Dataset Used for Classifier Training (0.000-0.100); series: Mean AUC, Mean Average Precision, Mean F1, Mean MCC, Mean Precision, Mean Recall.]

9. Human Related Topics PubMed Queries

The Cochrane review search topics and full queries used in the four topic evaluations are shown in Table S4 below.

Table S4. Search topics included in manual review of classifier confidence predictions.

CD001867 – Opioid antagonists for alcohol dependence [2]
Topic-specific search terms: Alcohol-Related Disorders[mesh] OR ((alcohol) AND (dependen* or disorder* or drink* or misuse or abuse* or consumption)) OR alcoholism[mesh] OR alcohol* OR "drinking behaviour"

CD005440 – Early versus delayed laparoscopic cholecystectomy for people with acute cholecystitis [3]
Topic-specific search terms: ((laparoscop* OR celioscop* OR coelioscop* OR abdominoscop* OR peritoneoscop*) AND (cholecystecto* OR colecystecto*)) OR (cholecystitis OR colecystitis OR colecistitis*) OR (CHOLECYSTECTOMY LAPAROSCOPIC (MeSH) OR CHOLECYSTITIS ACUTE (MeSH))

CD008189 – Preconception lifestyle advice for people with subfertility [4]
Topic-specific search terms: (infertility OR reproductive techniques) AND (preconception care OR counselling OR body weight OR body mass index OR folic acid OR vitamin A OR vitamin D OR iodine OR alcohol OR ethanol OR caffeine OR smoking OR nonprescription drug OR environmental pollutants OR infection OR immunization OR vaccination OR prescription drug OR disease OR genetic counselling OR genetic screening)

CD000978 – Interventions for preventing oral mucositis for patients with cancer receiving treatment [5]
Topic-specific search terms: ((neoplasm* OR leukemia OR leukaemia OR lymphoma* OR plasmacytoma OR "histiocytosis malignant" OR reticuloendotheliosis OR "sarcoma mast cell" OR "Letterer Siwe disease" OR "immunoproliferative small intestine disease" OR "Hodgkin disease" OR "bone marrow transplant*" OR cancer* OR tumor* OR tumour* OR malignan* OR neutropeni* OR carcino* OR adenocarcinoma* OR radioth* OR radiat* OR radiochemo* OR irradiat* OR chemotherap*) AND (stomatitis OR "Stevens Johnson syndrome" OR "candidiasis oral" OR mucositis OR (oral AND (candid* OR mucos* OR fung*)) OR mycosis OR mycotic OR thrush))

10. The Most Strongly Weighted Positive and Negative Model Features

The most strongly positively and negatively weighted features are shown for the CITATION_ONLY model and the CITATION_PLUS_MESH model in Tables S5-S8 below. (We have removed features that occur in >90% of all articles and therefore contain a large amount of negative bias weighting.) When examining these model weights it should be kept in mind that these feature weights only contribute to an article's score when the corresponding feature is present for that article. Therefore these feature weights should not be strictly interpreted as how important they are to the model in a probabilistic sense, but rather as how much influence a feature has, when present, compared to other features.

As expected, the top positively weighted features in both models include unigrams and bigrams containing variations of the word randomized. However, there are a number of interesting top features in each of the models. The CITATION_ONLY model includes hope as a top title term.
The presence of this word in the title is associated with an increased likelihood of an article being an RCT. There are also several study-design-related text features that are associated with RCTs, including: placebo, enrolling, crossover, and double-blind. Certain text bigrams occurring in the title are associated with RCTs, including --^study and --^trials. The double dash is an actual text string occurring in some article titles. The model also includes some disease topic-specific terms, which would only apply to papers within the given topic domain, that are associated with RCTs, including: icc (abbreviation for Interstitial cell of Cajal) and antiretroviral^combination.

Most of the highly negatively weighted features for the CITATION_ONLY model are journal names that are associated with a lower likelihood of an article being an RCT. There are also several disease topic-specific title and abstract terms associated with a lower likelihood of an article being an RCT, such as lipids, aids, and 3tc/azt.

The CITATION_PLUS_MESH model includes many of the positively weighted features that the CITATION_ONLY model included. Combined with these are MeSH terms for various study designs that tend to be associated with RCTs, such as Single-Blind Method, Double-Blind Method, Cross-Over Studies, and Placebos. There are also several medical topic-based MeSH terms included here, such as Carnitine and Adenine. In addition, several medical topic-based title bigrams make it to the top of the list, including nelfinavir^protease and antiviral^effect. The top negatively weighted features in the CITATION_PLUS_MESH model include some of the text and journal name negative features from the CITATION_ONLY model, but add many negatively weighted MeSH terms.
Some of these are general study design terms, such as Retrospective Studies and Case-Control Studies. Others, such as Nevirapine (a reverse transcriptase inhibitor used to treat HIV), are specific to a particular medical condition. Of particular note are all of the negatively weighted terms that treat study design as the topic of the article itself, such as Randomized Controlled Trials as Topic, Clinical Trials, Phase II as Topic, and Clinical Trials, Phase III as Topic. Also of interest are the negatively weighted title and abstract terms such as randomized^trials, randomized^studies, meta-analysis, and trials (note the plural forms), which could indicate that an article is a review or synthesis of several studies. For these features, text pre-processing such as stemming would have eliminated their usefulness. This shows that even for highly predictive text features, the context of the term can add substantial predictive information (e.g., randomized is given a weight of 0.99602987, but when occurring as randomized^trials it incurs a substantial negative penalty of -0.35509579).

Table S5. The most strongly positively weighted features for the CITATION_ONLY model.
TOP MOST POSITIVELY WEIGHTED FEATURES, CITATION_ONLY MODEL
RANK  FEATURE SET       FEATURE NAME                 WEIGHT
1     Abstract Unigram  randomized                   0.97411508
2     Abstract Unigram  randomised                   0.82340813
3     Abstract Unigram  randomly                     0.63069745
4     Title Unigram     hope                         0.61080829
5     Title Unigram     randomized                   0.46978424
6     Abstract Bigram   assigned^randomly            0.46964635
7     Title Bigram      antiviral^finds              0.41717700
8     Abstract Unigram  randomization                0.41659960
9     Abstract Bigram   order^random                 0.39865617
10    Title Bigram      antiviral^effect             0.38119962
11    Title Unigram     enrolling                    0.37362438
12    Title Bigram      --^study                     0.37128836
13    Title Bigram      enrolling^study              0.37128836
14    Abstract Unigram  placebo                      0.33636001
15    Abstract Bigram   allocated^randomly           0.33516724
16    Title Unigram     randomised                   0.31382431
17    Title Bigram      icc^trials                   0.30972921
18    Abstract Unigram  random                       0.30738346
19    Title Unigram     icc                          0.30669612
20    Title Bigram      --^trials                    0.29650661
21    Abstract Unigram  crossover                    0.29052612
22    Abstract Unigram  double-blind                 0.27706781
23    Abstract Bigram   randomized^receive           0.27099984
24    Abstract Bigram   divided^randomly             0.26311613
25    Title Bigram      antiretroviral^combination   0.26080109

Table S6. The most strongly negatively weighted features for the CITATION_ONLY model.
TOP MOST NEGATIVELY WEIGHTED FEATURES, CITATION_ONLY MODEL
RANK  FEATURE SET       FEATURE NAME                                        WEIGHT
1     Journal Name      aids policy & law                                   -0.85491611
2     Journal Name      aids alert                                          -0.69202547
3     Journal Name      project inform perspective                          -0.67780928
4     Journal Name      positively aware                                    -0.67168294
5     Journal Name      treatment review                                    -0.61031358
6     Journal Name      gmhc treatment issues                               -0.60816330
7     Journal Name      pi perspective                                      -0.57353228
8     Journal Name      prescrire international                             -0.51756177
9     Journal Name      notes from the underground (new york, ny)           -0.50087075
10    Journal Name      beta bulletin of experimental treatments for aids   -0.48566774
11    Title Unigram     aids                                                -0.47393402
12    Journal Name      focus (san francisco, calif)                        -0.44763457
13    Journal Name      newsline (people with aids coalition of new york)   -0.44190878
14    Abstract Bigram   randomized^trials                                   -0.43005443
15    Journal Name      research initiative, treatment action               -0.38856852
16    Journal Name      critical path aids project                          -0.36833073
17    Journal Name      aids treatment news                                 -0.34408838
18    Title Unigram     roxithromycin                                       -0.33627865
19    Journal Name      the medical letter on drugs and therapeutics        -0.32816788
20    Abstract Bigram   randomized^studies                                  -0.30929305
21    Journal Name      the body positive                                   -0.30388445
22    Journal Name      healthcare benchmarks and quality improvement       -0.30267050
23    Title Unigram     lipids                                              -0.29794727
24    Title Unigram     3tc/azt                                             -0.29678165
25    Journal Name      positive health news                                -0.29649262

Table S7. The most strongly positively weighted features for the CITATION_PLUS_MESH model.
TOP MOST POSITIVELY WEIGHTED FEATURES, CITATION_PLUS_MESH MODEL
RANK  FEATURE SET       FEATURE NAME          WEIGHT
1     Abstract Unigram  randomized            0.99602987
2     Mesh Term         Double-Blind Method   0.79326574
3     Abstract Unigram  randomised            0.78130029
4     Abstract Unigram  randomly              0.65631655
5     Mesh Term         Cross-Over Studies    0.45511840
6     Title Unigram     randomized            0.43826829
7     Abstract Bigram   assigned^randomly     0.40937356
8     Title Unigram     hope                  0.38882575
9     Mesh Term         Carnitine             0.36993776
10    Abstract Unigram  randomization         0.36062942
11    Abstract Bigram   order^random          0.35642897
12    Mesh Term         Adenine               0.31228530
13    Abstract Bigram   allocated^randomly    0.31187248
14    Abstract Unigram  random                0.30945023
15    Mesh Term         Single-Blind Method   0.30899233
16    Title Bigram      nelfinavir^protease   0.29964456
17    Title Bigram      inhibitor^study       0.29964456
18    Title Bigram      recruiting^study      0.29964456
19    Title Unigram     recruiting            0.29193907
20    Title Bigram      antiviral^finds       0.28440239
21    Mesh Term         Placebos              0.28103408
22    Title Unigram     randomised            0.27128954
23    Title Bigram      antiviral^effect      0.27090553
24    Title Unigram     nelfinavir            0.24685398
25    Abstract Bigram   divided^randomly      0.24408650

Table S8. The most strongly negatively weighted features for the CITATION_PLUS_MESH model.
TOP MOST NEGATIVELY WEIGHTED FEATURES, CITATION_PLUS_MESH MODEL
RANK  FEATURE SET       FEATURE NAME                              WEIGHT
1     Abstract Bigram   randomized^trials                         -0.35509579
2     Mesh Term         Randomized Controlled Trials as Topic     -0.26933683
3     Abstract Bigram   randomized^studies                        -0.24622715
4     Mesh Term         Clinical Trials, Phase II as Topic        -0.21006912
5     Abstract Bigram   randomly^selected                         -0.20452695
6     Mesh Major Term   Randomized Controlled Trials as Topic     -0.19471967
7     Abstract Bigram   controlled^trials                         -0.19351891
8     Journal Name      gmhc treatment issues                     -0.19087672
9     Abstract Trigram  clinical^randomized^trials                -0.18854145
10    Journal Name      pi perspective                            -0.18082856
11    Mesh Term         Clinical Trials as Topic                  -0.18043824
12    Mesh Term         Clinical Trials, Phase III as Topic       -0.17958616
13    Title Unigram     trials                                    -0.17327485
14    Mesh Term         Research                                  -0.17223032
15    Mesh Term         Retrospective Studies                     -0.17061191
16    Journal Name      aids alert                                -0.16451899
17    Abstract Unigram  observational                             -0.15703558
18    Journal Name      aids policy & law                         -0.15690963
19    Mesh Term         Nevirapine                                -0.15676741
20    Abstract Unigram  meta-analysis                             -0.15140712
21    Mesh Term         Acquired Immunodeficiency Syndrome        -0.15088611
22    Mesh Term         Case-Control Studies                      -0.14666599
23    Title Unigram     meta-analysis                             -0.14554765
24    Mesh Term         Reverse Transcriptase Inhibitors          -0.14105520
25    Journal Name      prescrire international                   -0.13791475

11. Distribution of Confidence Predictions for Articles Tagged with the MEDLINE Randomized Controlled Trial Publication Type vs. Articles Not Tagged

Figure S3 below shows the distribution of RCT confidence prediction values using the CITATION_PLUS_MESH model on the 2011 held-out data set. The figure clearly shows the strong difference in confidence prediction distributions between the two subsets.
The articles without the Randomized Controlled Trial publication type are assigned low confidence values very close to zero. The articles tagged with the MEDLINE Randomized Controlled Trial publication type are mostly assigned high confidences near 1.0, but also have a long tail of confidences at lower values, with a bump around zero. Based on our manual review, these near-zero tagged articles include quite a few articles that likely do not meet the definition of an RCT, and it may therefore be worthwhile to have the RCT_PT status of these articles re-reviewed by MEDLINE annotators.

Figure S3. Histogram distribution of RCT confidence prediction values using the CITATION_PLUS_MESH model on the 2011 held out data set for articles tagged with the MEDLINE Randomized Controlled Trial publication type vs. articles not tagged with this publication type. [Paired frequency histograms over confidence values from 0.0 to 1.0; x-axis: Confidence, y-axis: Frequency; panels: Has RCT Pubtype, No RCT Pubtype.]

12. Single Token Match Classifier Results

To serve as a baseline comparison, we evaluated single tokens as binary classifiers of whether an article is an RCT. This was performed on the 2011 held out dataset for comparison with our other results. Results are shown in Table S9. No single-token classifier achieves an F1 measure above 0.72, and only two tokens (randomized and trial) have F1 measures greater than 0.50. Precision and recall are poor across all of these single tokens.
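The single-token baseline can be sketched as follows. This is a minimal illustration, not the authors' code: the evaluate_token helper and the toy corpus are invented for this example. An article is predicted to be an RCT whenever the token appears in its abstract, and the prediction is scored with precision, recall, F1, and mean squared error, the four measures reported in Table S9.

```python
def evaluate_token(token, corpus):
    """Score a single-token RCT classifier.

    corpus: list of (abstract_text, is_rct) pairs.
    Predicts RCT iff `token` occurs in the abstract, then returns
    (precision, recall, F1, MSE) over the corpus.
    """
    tp = fp = fn = 0
    squared_error = 0.0
    for abstract, is_rct in corpus:
        predicted = token in abstract.lower().split()
        if predicted and is_rct:
            tp += 1
        elif predicted and not is_rct:
            fp += 1
        elif not predicted and is_rct:
            fn += 1
        # MSE treats the binary prediction as a 0/1 confidence estimate.
        squared_error += (int(predicted) - int(is_rct)) ** 2
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1, squared_error / len(corpus)

# Toy corpus (invented): (abstract, gold-standard RCT label).
corpus = [
    ("patients were randomized to placebo or drug", True),
    ("we randomized participants into two groups", True),
    ("a retrospective review of clinical records", False),
    ("observational cohort followed for five years", False),
    ("randomized trials were pooled in this meta-analysis", False),
]
p, r, f1, mse = evaluate_token("randomized", corpus)
```

The last toy abstract illustrates why precision suffers: secondary literature such as meta-analyses mentions "randomized" without itself being an RCT, which is exactly the pattern the negatively weighted bigram features in Table S8 capture.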
To achieve higher performance, many additional features are needed. There is no obvious non-computational method to combine these features into a simple single model; a machine learning based approach is a useful way to combine many features into a single predictor model.

Table S9. The 20 best performing single token article RCT binary classifiers on the 2011 held out dataset, ranked by F1 performance.

TOKEN               PRECISION  RECALL  F1      MSE
randomized          0.6721     0.7691  0.7173  0.0274
trial               0.5617     0.6945  0.6211  0.0383
controlled          0.3575     0.3434  0.3503  0.0576
randomly            0.5684     0.2429  0.3403  0.0426
double-blind        0.8643     0.1910  0.3128  0.0380
placebo             0.6643     0.1993  0.3066  0.0408
assigned            0.6055     0.1881  0.2870  0.0423
baseline            0.2543     0.2626  0.2584  0.0682
randomised          0.6226     0.1560  0.2495  0.0425
group               0.1639     0.5136  0.2485  0.1406
groups              0.1689     0.4143  0.2400  0.1187
placebo-controlled  0.7761     0.1344  0.2292  0.0409
intervention        0.2112     0.2457  0.2271  0.0757
efficacy            0.1957     0.2500  0.2195  0.0804
receive             0.4009     0.1438  0.2117  0.0485
mg                  0.3814     0.1449  0.2101  0.0493
weeks               0.2190     0.2011  0.2097  0.0686
versus              0.1995     0.2071  0.2032  0.0735
received            0.1842     0.1948  0.1894  0.0755
=                   0.1161     0.3848  0.1783  0.1604

13. Database and Computer System Details

The pre-processed article features were stored in a ~300GB MongoDB (www.mongodb.com) database to facilitate machine learning training and subsequent evaluation. We used a Mac Pro with 32GB of RAM running OS X 10.6 Snow Leopard as the database server to conduct our experiments.

References

1. Funk ME, Reid CA. Indexing consistency in MEDLINE. Bull Med Libr Assoc. 1983 Apr;71(2):176–83.
2. Rösner S, Hackl-Herrwerth A, Leucht S, Vecchi S, Srisurapanont M, Soyka M. Opioid antagonists for alcohol dependence. Cochrane Database Syst Rev. 2010;(12):CD001867.
3. Gurusamy KS, Davidson C, Gluud C, Davidson BR.
Early versus delayed laparoscopic cholecystectomy for people with acute cholecystitis. Cochrane Database Syst Rev. 2013;(6):CD005440.
4. Anderson K, Norman RJ, Middleton P. Preconception lifestyle advice for people with subfertility. Cochrane Database Syst Rev. 2010;(4):CD008189.
5. Worthington HV, Clarkson JE, Bryan G, Furness S, Glenny A-M, Littlewood A, et al. Interventions for preventing oral mucositis for patients with cancer receiving treatment. Cochrane Database Syst Rev. 2011;(4):CD000978.