Cohen AM et al., Automated Confidence Ranked Classification of Randomized Controlled Trial
Articles: An Aid to Evidence-Based Medicine, Page 1 of 22
Automated Confidence Ranked Classification of Randomized
Controlled Trial Articles: An Aid to Evidence-Based Medicine
Supplemental Appendix
1. Details of the Manual Annotation Process and Inter-rater agreement with the
MEDLINE Randomized Controlled Trial (RCT) publication type
The following instructions were provided to the manual annotators:
The Medline definition of an RCT
(http://www.ncbi.nlm.nih.gov/mesh/68016449) is as follows:
Work consisting of a clinical trial that involves at least one test treatment
and one control treatment, concurrent enrollment and follow-up of the
test- and control-treated groups, and in which the treatments to be
administered are selected by a random process, such as the use of a
random-numbers table.
For ambiguous cases, the following guidance may be helpful:
An RCT article is a publication that is the primary description of a study
that, at least in one of its components, is a randomized controlled trial, or
a specific description of one or more of a set of related randomized
controlled trials. For a study to be a randomized controlled trial the
selection of subjects receiving various interventions must be randomized,
there must be a control intervention to contrast with the studied
intervention, and the study must be a prospective trial of a well-defined
intervention or interventions.
Review each article in order in the spreadsheet and, using the drop-down for the
judgment field, select one of: IS_RCT, NOT_RCT, UNCERTAIN.
IS_RCT – the article meets the definition of a randomized controlled trial given.
NOT_RCT – the article does not meet the definition of a randomized controlled
trial given.
UNCERTAIN – based on the title and abstract, the article may meet the
definition of a randomized controlled trial, but it is not possible to confirm this
because the full text is not available or the full text provides insufficient detail.
When making the judgments, you may review the article’s title, abstract, author,
journal, publication date, and full article text. Do not look at the MeSH terms or
assigned MEDLINE publication types.
To evaluate the reproducibility of the RCT manual review process, a random subset of 100
previously unseen articles published in 2014 was manually reviewed by an independent
human assessor and annotated using the definition above. The independent annotator was experienced
with evidence-based medicine and familiar with the work of our team, but had no part in
developing the machine learning model. These articles were chosen so that half had been
assigned the MEDLINE RCT_PT, and half had not.
The Cohen’s Kappa inter-rater agreement between the PubMed Randomized Controlled Trial
publication type and the assessment of the independent annotator was 0.72. This represents a
substantial level of agreement. Hooper’s measure of indexing consistency was 72%, which is
comparable to the best performance obtained in other inter-rater agreement studies on
MEDLINE annotation.[1] The annotator and the publication type agreed perfectly that all 50 of
the articles that were not annotated with the MEDLINE RCT_PT were indeed not RCTs. Of the
50 articles having the MEDLINE RCT_PT, the independent annotator differed from the
MEDLINE RCT_PT on 14 articles, designating these as not RCTs. Reasons given by the
independent annotator for designating the articles as not RCTs included: “not randomized”, “not
controlled”, “treatment allocation blocked by clinic not patient”, and “describes a proposed
RCT”.
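The reported kappa of 0.72 can be reproduced from the counts given in this section. A minimal sketch follows; the 2x2 table is inferred from the figures above (36 articles judged RCT by both the publication type and the annotator, 14 by the publication type only, none by the annotator only, and 50 by neither):

```python
# Reproduce Cohen's kappa from the 2x2 agreement table implied above.
def cohens_kappa(a, b, c, d):
    """a = both raters positive, b = PT+/annotator-, c = PT-/annotator+,
    d = both raters negative."""
    n = a + b + c + d
    p_observed = (a + d) / n
    # Expected chance agreement under independence of the two raters.
    p_expected = ((a + b) * (a + c) + (c + d) * (b + d)) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected)

kappa = cohens_kappa(36, 14, 0, 50)
print(round(kappa, 2))  # 0.72
```

With these counts the observed agreement is 0.86 and the chance agreement is 0.50, giving exactly the kappa of 0.72 reported above.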
2. Human Related Articles PubMed Query and Dataset Details
The full query used to construct our data set is shown in Figure S1 below. The query uses both
subject-related terms, such as Human, and study design related terms, such as case-control
studies and double-blind method in order to identify as many human-related articles as possible.
We included only articles that were published in English-language journals, were annotated with
MeSH terms, and had an available abstract.
The dataset was split into a training set, comprising articles from 1987-2010, and two
evaluation/testing sets, one corresponding to each of the years 2011 and 2012. The complete
dataset included 6,316,826 articles: 5,482,943 articles for training, 406,558 articles in the 2011
evaluation set, and 427,325 articles in the 2012 evaluation set. The percentage of articles that
were indexed as having RCT_PT was approximately 4.4%, 4.5%, and 4.6%, in the training,
2011, and 2012 data subsets, respectively.
Figure S1. PubMed query used to build the document data set used for this research. The first portion of the query
(before the initial AND) defines the date range. The second portion (up to the first NOT) captures clinical research
articles. The third portion (after the first NOT) excludes studies about animals and not humans. The last portion of
the query (starting with “hasabstract”) requires the articles to have abstracts and be published in English.
("1987/01/01"[PDAT] : "2012/12/31"[PDAT])
AND (hasabstract[text] AND English[lang])
AND ("humans"[MeSH Terms] OR "meta-analysis"[Publication Type] OR "Clinical
Trial"[Publication Type] OR "Clinical Trial, Phase I"[Publication Type] OR "Clinical
Trial, Phase II"[Publication Type] OR "Clinical Trial, Phase III"[Publication Type] OR
"Clinical Trial, Phase IV"[Publication Type] OR "Controlled Clinical Trial"[Publication
Type] OR "Randomized Controlled Trial"[Publication Type] OR "case
reports"[Publication Type] OR "Multicenter Study"[Publication Type] OR
"Comparative Study"[Publication Type] OR "case-control studies"[MeSH Terms] OR
"cross-sectional studies"[MeSH Terms] OR "cross-over studies"[MeSH Terms] OR
"double-blind method"[MeSH Terms] OR "Genome-Wide Association Study"[MeSH
Terms] OR "single-blind method"[MeSH Terms] OR "random allocation"[MeSH
Terms] OR "focus groups"[MeSH Terms] OR "cohort studies"[MeSH Terms] OR
"prospective studies"[MeSH Terms] OR "follow-up studies"[MeSH Terms] OR
"Multicenter Studies as Topic"[MeSH Terms] OR "Randomized Controlled Trials as
Topic"[MeSH Terms] OR "Clinical Trials as Topic"[MeSH Terms])
NOT ("animals"[MeSH Terms] NOT "humans"[MeSH Terms])
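For readers who maintain such queries programmatically, the four parts described in the Figure S1 caption can be assembled from separate strings. This is an illustrative sketch only; the clinical-terms clause is abbreviated here, and the paper gives the query as a single literal:

```python
# Assemble the Figure S1 query from its four parts so each part can be
# maintained separately (illustrative; the clinical-terms clause below is
# abbreviated -- Figure S1 lists the full set of study-design terms).
date_range = '("1987/01/01"[PDAT] : "2012/12/31"[PDAT])'
abstract_and_language = '(hasabstract[text] AND English[lang])'
clinical_terms = ('("humans"[MeSH Terms] OR '
                  '"Randomized Controlled Trial"[Publication Type])')
animal_exclusion = '("animals"[MeSH Terms] NOT "humans"[MeSH Terms])'

query = (f"{date_range} AND {abstract_and_language} "
         f"AND {clinical_terms} NOT {animal_exclusion}")

# Structural sanity check: parentheses must balance before submission.
assert query.count("(") == query.count(")")
print(query)
```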
3. Per-Feature Statistical Significance Testing
As part of the feature set pre-processing, we evaluated the statistical significance of every feature
in the training set, across all feature types, for predicting whether an article was an RCT. We
counted the number of articles that each feature occurred in, as well as the number of articles
with each feature that were annotated with the MEDLINE publication type Randomized
Controlled Trial. Features occurring in fewer than 500 articles (< 0.01% of articles) were dropped,
and a 2x2 chi-squared test was computed for the remaining features, with the rows of the 2x2
table indicating whether a citation included the feature and the columns indicating whether the citation was
annotated with the MEDLINE publication type Randomized Controlled Trial or not. We did not
do any alpha correction, such as Bonferroni, when doing these statistical tests. With over 45
million total features to be tested, the required cut off value would have been exceedingly small,
leaving very few features. Instead we chose to apply the commonly used alpha cut-off value of
0.05 to select significant features, which left us with a large but manageable set of 113,177
statistically significant features. This collection of features will be referred to as the statistically
significant feature set. Predictive machine learning models built with this feature set will be
compared with models built without first filtering the features by significance testing.
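The per-feature filter described above can be sketched as follows. This is illustrative, not the original code: the chi-squared statistic is computed directly for the 2x2 table and compared against the critical value for alpha = 0.05 at one degree of freedom, and the example counts are invented.

```python
# Sketch of the per-feature significance filter: build the 2x2 table
# (feature present/absent vs. RCT_PT yes/no), drop rare features, and keep
# features whose chi-squared statistic exceeds the alpha = 0.05 critical
# value for 1 degree of freedom.
CHI2_CRIT_05_DF1 = 3.841  # chi-squared critical value, alpha = 0.05, 1 d.o.f.
MIN_ARTICLES = 500        # features in fewer than 500 articles are dropped

def chi2_2x2(a, b, c, d):
    """Pearson chi-squared statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    num = n * (a * d - b * c) ** 2
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return num / den if den else 0.0

def keep_feature(n_with, n_with_rct, n_total, n_rct):
    """n_with: articles containing the feature; n_with_rct: those that are RCTs;
    n_total / n_rct: corpus-wide article and RCT counts."""
    if n_with < MIN_ARTICLES:
        return False
    a = n_with_rct                  # feature present, RCT
    b = n_with - n_with_rct         # feature present, not RCT
    c = n_rct - n_with_rct          # feature absent, RCT
    d = (n_total - n_with) - c      # feature absent, not RCT
    return chi2_2x2(a, b, c, d) > CHI2_CRIT_05_DF1

# Hypothetical feature: in 10,000 of 1,000,000 articles, 2,000 of them RCTs,
# against a ~4.4% base rate (44,000 RCTs overall) -- strongly enriched.
print(keep_feature(10_000, 2_000, 1_000_000, 44_000))  # True
```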
4. Description of the Recursive Partitioning Algorithm Used on the Author Count Feature
We used a method called recursive partitioning to split these count-based features into ranges
and treat each range as a separate binary feature. Recursive partitioning (RP) uses a minimum
description length (MDL) approach. RP works by starting with the full range of author count
values and determining whether there is a place to split the range in between values that reduces
the total number of bits necessary to encode the dependent variable for each sample (in this case
RCT status), including the overhead of representing the partitioning value. This approach is then
recursively applied to each sub-range, and the process is repeated until splitting is no longer
effective at reducing the number of bits necessary for encoding. The result is a set of partition
boundaries that define the ranges of individual binary features. In our implementation this is
handled automatically within each run before the sample feature vectors are passed to the SVM
classifier. In all cases, including cross-validation, the partitions are determined based on the
training data being used for that run, and then that partitioning is applied to the evaluation data
for that run.
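One standard formulation of this kind of MDL-based splitting is the Fayyad-Irani criterion; the sketch below uses it, though the paper does not specify which exact MDL variant was implemented:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def mdl_splits(values, labels):
    """Recursively find cut points on a scalar feature using the
    Fayyad-Irani MDL stopping criterion (a standard formulation; the
    paper's exact MDL variant is not given)."""
    pairs = sorted(zip(values, labels))
    vs = [v for v, _ in pairs]
    ys = [y for _, y in pairs]
    cuts = []

    def recurse(lo, hi):
        n = hi - lo
        base = entropy(ys[lo:hi])
        best = None
        for i in range(lo + 1, hi):
            if vs[i] == vs[i - 1]:
                continue  # can only cut between distinct values
            e1, e2 = entropy(ys[lo:i]), entropy(ys[i:hi])
            cost = ((i - lo) * e1 + (hi - i) * e2) / n
            if best is None or cost < best[0]:
                best = (cost, i, e1, e2)
        if best is None:
            return
        cost, i, e1, e2 = best
        gain = base - cost
        k = len(set(ys[lo:hi]))
        k1, k2 = len(set(ys[lo:i])), len(set(ys[i:hi]))
        # MDL threshold: split only if the bits saved encoding the labels
        # exceed the bits needed to describe the split itself.
        delta = math.log2(3 ** k - 2) - (k * base - k1 * e1 - k2 * e2)
        if gain > (math.log2(n - 1) + delta) / n:
            cuts.append((vs[i - 1] + vs[i]) / 2)
            recurse(lo, i)
            recurse(i, hi)

    recurse(0, len(ys))
    return sorted(cuts)

# Toy data: articles with 1 author are never RCTs, 2 authors always are,
# so the method finds a single boundary at 1.5.
print(mdl_splits([1] * 20 + [2] * 20, [0] * 20 + [1] * 20))  # [1.5]
```

Each returned boundary would then define one range of the resulting binary author-count features.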
Classification using only recursive partitioning of the author count resulted in a mean AUC of
0.631. For author count, MDL optimal partition boundaries on the entire training set were found
at 0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 9.5, and 11.5. Therefore partitioned author count features appear to
be worth including in the predictive models. Forward selection confirmed these observations.
5. Feature Set Creation
Feature sets included in the final models are shown in Table S1. Text-based features were
created by tokenizing the text at delimiters and then filtering out stop word tokens before
creating n-grams. We used the stop word list created by Andrew McCallum
(http://www.cs.cmu.edu/~mccallum/bow/).
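The text-feature construction can be sketched as follows. The separator ^ matches the bigram features listed in Tables S5-S8; the three-word stop list and the alphabetical ordering of tokens within an n-gram (suggested by pairs such as order^random and assigned^randomly in those tables) are assumptions:

```python
import re

# Minimal sketch of the text-feature pipeline: tokenize at delimiters,
# drop stop words, then emit n-grams joined with "^". The tiny stop list
# here stands in for the full McCallum list; sorting tokens within an
# n-gram is inferred from the feature tables and may not match the
# original implementation.
STOP_WORDS = {"the", "of", "a"}

def ngram_features(text, n):
    tokens = [t for t in re.split(r"[^a-z0-9-]+", text.lower()) if t]
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return {"^".join(sorted(tokens[i:i + n])) for i in range(len(tokens) - n + 1)}

title = "A randomized trial of the drug"
print(sorted(ngram_features(title, 1)))  # ['drug', 'randomized', 'trial']
print(sorted(ngram_features(title, 2)))  # ['drug^trial', 'randomized^trial']
```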
Table S1. Feature sets, types, and whether they were included in each of the final predictive models by the forward
selection process.
FEATURE SET                             TYPE    CITATION_ONLY?  CITATION_PLUS_MESH?
Title unigrams                          binary  ✓               ✓
Title bigrams                           binary  ✓               ✓
Abstract unigrams                       binary  ✓               ✓
Abstract bigrams                        binary  ✓               ✓
Abstract trigrams                       binary  ✓               ✓
Author count                            scalar  ✓               ✓
Journal name                            binary  ✓               ✓
MeSH article terms                      binary                  ✓
MeSH major article terms                binary                  ✓
MeSH article term qualifiers            binary                  ✓
Title trigrams                          binary
MeSH major terms with qualifiers        binary
Authors last names with first initial   binary
Languages (en, fr, etc.)                binary
Databank accession numbers              binary
Title acronym count                     scalar
6. Forward Selection Method for Feature Sets
We built our models using a forward selection process on each feature set. We started out
building individual predictive models for each feature set and evaluating the feature set using
5x2 cross-validation on the training data. From these runs, we chose the best-performing feature
set. We then performed evaluation runs adding each not-yet-included feature set to the previously
selected best feature set, again chose the best run, and added each remaining feature set in turn
to the current best collection, evaluating each combination with cross-validation. We repeated
this iterative process until no performance gain was achieved.
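The greedy loop above can be sketched as follows. The sketch uses a single scalar score for simplicity, whereas the actual selection compared both AUC and average precision; evaluate() is a placeholder for a full 5x2 cross-validation run, and the toy gains are invented:

```python
# Schematic of the greedy forward-selection loop described above.
def forward_select(feature_sets, evaluate):
    """feature_sets: iterable of names; evaluate: list of names -> score."""
    selected, best_score = [], evaluate([])
    while True:
        candidates = [f for f in feature_sets if f not in selected]
        if not candidates:
            break
        scored = [(evaluate(selected + [f]), f) for f in candidates]
        score, winner = max(scored)
        if score <= best_score:
            break  # no remaining feature set improves performance
        selected.append(winner)
        best_score = score
    return selected

# Toy scoring function with diminishing returns; "journal name" adds nothing,
# so selection stops before including it.
gains = {"abstract unigrams": 0.941, "abstract bigrams": 0.014,
         "author count": 0.010, "journal name": 0.0}
score = lambda sel: sum(gains[f] for f in sel)
print(forward_select(list(gains), score))
# ['abstract unigrams', 'abstract bigrams', 'author count']
```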
For forward selection we chose AUC and average precision as the primary measures to select the
best feature or combination of features. We chose these measures since they are both sensitive to
ranking and the ability to accurately rank articles as RCTs was one of our primary goals. If a run
had the best performance on both of these measures, or on one of these measures and no worse
performance on the other, that run was chosen as the best for that iteration. Feature set selection
ended when no cross-validation performance improvement could be obtained by adding an
unused feature set to the current best feature set collection. We compared performance to three
decimal places. The decision was based on the fact that the width of the AUC 95% confidence
interval for the individual cross-validation estimates is approximately +/- 0.002 according to the
conservative Hanley/McNeil method, and we wanted to ensure we included any feature set that
could lead to statistically significant improvement. Furthermore, at the high level of performance
at which we were targeting our models, a performance improvement of 0.001 represents an
improvement of 2.5% of the maximum possible improvement, which could be important in an
actual system.
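The quoted +/- 0.002 half-width can be checked against the Hanley-McNeil formula, using approximate fold sizes derived from figures given elsewhere in this appendix (7.5% down-sampling of the ~5.48M-article training set, halved for 2-way cross-validation, with ~4.4% RCT-positive articles):

```python
import math

def hanley_mcneil_se(auc, n_pos, n_neg):
    """Standard error of an AUC estimate (Hanley & McNeil, 1982)."""
    q1 = auc / (2 - auc)
    q2 = 2 * auc * auc / (1 + auc)
    var = (auc * (1 - auc)
           + (n_pos - 1) * (q1 - auc * auc)
           + (n_neg - 1) * (q2 - auc * auc)) / (n_pos * n_neg)
    return math.sqrt(var)

# Rough evaluation-fold sizes: 7.5% of ~5.48M training articles, halved
# for 2-way cross-validation, ~4.4% RCT-positive (figures from earlier
# sections of this appendix).
n_eval = int(5_482_943 * 0.075 / 2)
n_pos = int(n_eval * 0.044)
n_neg = n_eval - n_pos
half_width = 1.96 * hanley_mcneil_se(0.975, n_pos, n_neg)
print(round(half_width, 3))  # 0.002
```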
The feature sets added at each stage of the forward selection decision process for the
CITATION_PLUS_MESH model are shown in Table S2, with the same information for the
CITATION_ONLY model shown in Table S3. Listed are the mean values of the AUC and
average precision measures (across all 5 iterations of 2-way cross-validation runs) used to make
decisions on which feature set most improved the model. Beyond the last model in each table, no
feature set improved the performance on AUC or average precision.
Table S2. Results of the forward selection process feature set inclusion in the CITATION_PLUS_MESH model.
Values shown are the mean performance measures of 5 iterations of 2-way cross-validation on training Dataset
(1987-2010), randomly down-sampled at 7.5% for each iteration.
Feature Sets            Mean AUC   Mean Average Precision
abstract unigrams       0.941      0.741
add mesh terms          0.959      0.801
add abstract bigrams    0.970      0.856
add title unigrams      0.971      0.865
add abstract trigrams   0.973      0.869
add author count        0.975      0.873
add title bigrams       0.975      0.875
add mesh qualifiers     0.976      0.875
add journal name        0.976      0.876
add mesh major terms    0.976      0.877
Table S3. Results of the forward selection process feature set inclusion in the CITATION_ONLY model. Values
shown are the mean performance measures of 5 iterations of 2-way cross-validation on the training dataset
(1987-2010), randomly down-sampled at 7.5% for each iteration.
Feature Sets            Mean AUC   Mean Average Precision
abstract unigrams       0.941      0.741
add abstract bigrams    0.955      0.804
add author count        0.965      0.837
add abstract trigrams   0.968      0.844
add title bigrams       0.969      0.849
add journal name        0.969      0.852
add title unigrams      0.969      0.855
7. Extensions to the Method of Rüping for SVM Confidence Estimation
The signed margin distance is essentially the distance in front of or behind the separating
hyperplane that the SVM algorithm creates as the optimal solution to the classification problem.
Samples with positive signed margin values are predicted to be drawn from the positive class,
and zero or negative signed margin values are predicted to be from the negative class. Samples
can be ranked using the signed margin values – more positive values are considered more
strongly positive predictions. However, a signed margin value on its own does not have any
direct probabilistic interpretation.
The method of Rüping converts the signed margin values into confidence predictions using a
simple algorithm, based on the idea that most of the information about confidence is in the
margin value interval between +1.0 and -1.0. This is due to the way the SVM mathematical
optimization problem is formulated. Treating all predictions as positive with a confidence
between 0.0 and 1.0, the method estimates that the minimum confidence occurs at a signed
margin value <= -1.0, and is equal to the ratio of the number of positive training samples with a
signed margin value <= -1.0 divided by the total number of training samples with a signed
margin value <= -1.0. Similarly, the maximum confidence occurs at a signed margin value >=
+1.0, and is equal to the ratio of the number of positive training samples with a signed margin
value >= +1.0 divided by the total number of training samples with a signed margin value >=
+1.0. Between the margin values of -1.0 and +1.0, Rüping’s method simply linearly interpolates
the confidence between these minimum and maximum values.
The method works well but has a couple of shortcomings. First, while margin values > +1.0 and
< -1.0 certainly carry less information about confidence than those in the interval between -1.0
and +1.0, it is still reasonable to expect that a signed margin value of +100.0 would represent a higher
confidence prediction than a value of +1.5. Furthermore, “flat lining” the confidence values
results in poor AUC and average precision performance measures because all of the ranking
information for the very high and very low confidence predictions is thrown away. Secondly,
using the margin values from the training samples used to create the model in the first place is a
biased approach, which will likely result in the high and low confidence points being somewhat
more extreme than they really are.
We extended the method of Rüping in two ways to overcome these deficiencies. First, instead of
estimating the margin distances and the high and low confidence points from the trained model,
we estimate these quantities using 10-fold cross-validation on the training samples used to build
the model. Therefore, the margin distances and the resulting high and low confidence ratios are
estimated not with the model trained on all the data, but instead with a set of very similar models
that do not include the data sample being estimated.
The second extension is in how we handle the margin intervals below -1.0 and above +1.0.
Instead of clamping the confidence at the maximum/minimum value computed we extend the
linear portion of the margin vs. confidence graph with arctan “splines” attached to the top and
bottom. The positive arctan segment is scaled so that the value at a signed margin of +1.0 is
equal to the maximum computed confidence and the value at a margin of positive infinity is
equal to 1.0. Similarly, the negative arctan segment is scaled so that the value at a signed margin
of -1.0 is equal to the minimum computed confidence and the value at minus infinity is equal to
0.0. These arctan extensions preserve the rank ordering of the samples, and slightly improve the
accuracy of the confidence predictions.
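The resulting margin-to-confidence mapping can be sketched as below. The exact arctan scaling used in the paper is not given, so this is one natural construction that satisfies the stated endpoint conditions; conf_lo and conf_hi stand for the cross-validated minimum and maximum confidence points:

```python
import math

# Sketch of the extended margin-to-confidence mapping: linear interpolation
# between the low and high confidence points on [-1, +1], with scaled
# arctan tails outside that interval so extreme margins map monotonically
# toward 0.0 and 1.0. In the actual method conf_lo and conf_hi come from
# 10-fold cross-validation on the training data.
def margin_to_confidence(m, conf_lo, conf_hi):
    if -1.0 <= m <= 1.0:
        # Linear segment: conf_lo at m = -1 rising to conf_hi at m = +1.
        return conf_lo + (conf_hi - conf_lo) * (m + 1.0) / 2.0
    if m > 1.0:
        # Arctan tail from conf_hi at m = +1 toward 1.0 at +infinity.
        return conf_hi + (1.0 - conf_hi) * (2 / math.pi) * math.atan(m - 1.0)
    # Arctan tail from conf_lo at m = -1 toward 0.0 at -infinity.
    return conf_lo - conf_lo * (2 / math.pi) * math.atan(-(m + 1.0))

lo, hi = 0.05, 0.95
print(round(margin_to_confidence(0.0, lo, hi), 6))  # 0.5 (midpoint of linear segment)
# Rank ordering is preserved beyond the +/-1 interval:
print(margin_to_confidence(1.5, lo, hi) < margin_to_confidence(100.0, lo, hi))  # True
```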
8. Cross-validation performance of title and abstract features by data set subsampling rate
Figure S2 shows the cross-validation performance for a number of evaluation measures at varying
fractions of the training data. Performance is essentially flat across all measures
between 5% and 10% sampling of the training data. This represents between 300,000 and
600,000 individual article citations.
Figure S2. Plot of a number of performance measures obtained by 5x 2-way cross-validation while varying the
fraction of the training dataset used to train the classifier. Classification performance begins to flatten out at 2.5% of
the dataset and is essentially maximized at 5% for all measures.
[Line chart: “Citation Features Cross-Validation RCT Confidence Prediction vs. Training Data Fraction.” The y-axis (“Metric Value”) runs from 0.500 to 1.000; the x-axis (“Fraction of the Human-Related Article Universe Training Dataset Used for Classifier Training”) runs from 0.000 to 0.100. Series plotted: Mean AUC, Mean Average Precision, Mean F1, Mean MCC, Mean Precision, Mean Recall.]
9. Human Related Topics PubMed Queries
The Cochrane review search topics and full queries used in the four topic evaluations are shown in Table S4 below.
Table S4. Search topics included in manual review of classifier confidence predictions.
COCHRANE ID: REVIEW TOPIC
    TOPIC-SPECIFIC SEARCH TERMS

CD001867: Opioid antagonists for alcohol dependence [2]
    Alcohol-Related Disorders[mesh] OR ((alcohol) AND (dependen* or disorder* or
    drink* or misuse or abuse* or consumption)) OR alcoholism[mesh] OR alcohol* OR
    "drinking behaviour"

CD005440: Early versus delayed laparoscopic cholecystectomy for people with acute
cholecystitis [3]
    ((laparoscop* OR celioscop* OR coelioscop* OR abdominoscop* OR peritoneoscop*)
    AND (cholecystecto* OR colecystecto*)) OR (cholecystitis OR colecystitis OR
    colecistitis*) OR (CHOLECYSTECTOMY LAPAROSCOPIC (MeSH) OR CHOLECYSTITIS ACUTE
    (MeSH))

CD008189: Preconception lifestyle advice for people with subfertility [4]
    (infertility OR reproductive techniques) AND (preconception care OR counselling
    OR body weight OR body mass index OR folic acid OR vitamin A OR vitamin D OR
    iodine OR alcohol OR ethanol OR caffeine OR smoking OR nonprescription drug OR
    environmental pollutants OR infection OR immunization OR vaccination OR
    prescription drug OR disease OR genetic counselling OR genetic screening)

CD000978: Interventions for preventing oral mucositis for patients with cancer
receiving treatment [5]
    ((neoplasm* OR leukemia OR leukaemia OR leukaemia OR lymphoma* OR plasmacytoma
    OR "histiocytosis malignant" OR reticuloendotheliosis OR "sarcoma mast cell" OR
    "Letterer Siwe disease" OR "immunoproliferative small intestine disease" OR
    "Hodgkin disease" OR "histiocytosis malignant" OR "bone marrow transplant*" OR
    cancer* OR tumor* OR tumour* OR malignan* OR neutropeni* OR carcino* OR
    adenocarcinoma* OR radioth* OR radiat* OR radiochemo* OR irradiat* OR
    chemotherap*) AND (stomatitis OR "Stevens Johnson syndrome" OR "candidiasis
    oral" OR mucositis OR (oral AND (candid* OR mucos* OR fung*)) OR mycosis OR
    mycotic OR thrush))
10. The Most Strongly Weighted Positive and Negative Model Features
The most strongly positively and negatively weighted features are shown for the
CITATION_ONLY_MODEL and the CITATION_PLUS_MESH_MODEL in Tables S5-S8
below. (We have removed features that occur in >90% of all articles and therefore contain a
large amount of negative bias weighting.) When examining these model weights it should be
kept in mind that these feature weights only contribute to an article’s score when the
corresponding feature is present for an article. Therefore these feature weights should not be
strictly interpreted as measures of importance to the model in a probabilistic sense, but rather
as how much influence a feature has, when present, relative to other features.
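As a concrete illustration of this point (using two weights taken from Tables S7 and S8; the bias term and all other features are omitted):

```python
# A linear SVM's score for an article is the sum of the weights of the
# features actually present (plus a bias term, omitted here). The two
# weights below are taken from Tables S7 and S8.
WEIGHTS = {
    "randomized": 0.99602987,          # abstract unigram (Table S7)
    "randomized^trials": -0.35509579,  # abstract bigram (Table S8)
}

def score(present_features, weights):
    return sum(weights.get(f, 0.0) for f in present_features)

# An abstract mentioning "randomized trials" triggers both features, so the
# bigram's penalty partially cancels the unigram's positive weight.
print(round(score({"randomized"}, WEIGHTS), 8))                       # 0.99602987
print(round(score({"randomized", "randomized^trials"}, WEIGHTS), 8))  # 0.64093408
```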
As expected, the top positively weighted features in both models include unigrams and bigrams
containing variations of the word randomized. However, there are a number of interesting top
features in each of the models.
The CITATION_ONLY model includes hope as a top title term. The presence of this word in the
title is associated with increased likelihood of an article being an RCT. There are also several
study-design related text features that are associated with RCTs including: placebo, enrolling,
crossover, and double-blind. Certain text bigrams occurring in the title are associated with RCTs
including --^study and --^trials. The double-dash is an actual text string occurring in some
article titles. The model also includes some disease-topic-specific terms that apply only to
papers within the given topic domain and are associated with RCTs, including icc
(abbreviation for interstitial cell of Cajal) and antiretroviral^combination.
Most of the highly negatively weighted features for the CITATION_ONLY model are journal
names that are associated with a lower likelihood of an article being an RCT. There are also several
disease-topic-specific title and abstract terms that are associated with a lower likelihood of an
article being an RCT, such as lipids, aids, and 3tc/azt.
The CITATION_PLUS_MESH model includes many of the positively weighted features that
the CITATION_ONLY model included. Combined with these are MeSH terms for various
study designs that tend to be associated with RCTs such as Single-Blind Method, Double-Blind
Method, Cross-Over Studies, and Placebos. There are also several medical topic-based MeSH
terms included here such as Carnitine and Adenine. In addition several medical topic-based title
bigrams make it to the top of the list including: nelfinavir^protease and antiviral^effect.
The top negatively weighted features in the CITATION_PLUS_MESH model include some of
the text and journal name negative features from the CITATION_ONLY model, but add many
negatively weighted MeSH terms. Some of these are general study design terms, such as
Retrospective Studies and Case-Control Studies. Others such as Nevirapine (a reverse
transcriptase inhibitor used to treat HIV) are specific to a particular medical condition. Of
particular note are all of the negatively weighted terms that treat study design as the topic of the
article itself, such as Randomized Controlled Trials as Topic, Clinical Trials, Phase II as Topic,
and Clinical Trials, Phase III as Topic. Also of interest are the negatively weighted title and
abstract terms such as randomized^trials, randomized^studies, meta-analysis and trials (note the
plural forms), that could indicate an article is a review or synthesis of several studies. For these
features, text pre-processing such as stemming would have eliminated their usefulness. This
shows that even for highly predictive text features, the context of the term can add substantial
predictive information (e.g., randomized is given a weight of 0.99602987, but when
occurring as randomized^trials it incurs a substantial negative penalty of -0.35509579).
Table S5. The most strongly positively weighted features for the CITATION_ONLY model.
TOP MOST POSITIVELY WEIGHTED FEATURES, CITATION_ONLY MODEL
RANK  FEATURE SET       FEATURE NAME                 WEIGHT
 1    Abstract Unigram  randomized                   0.97411508
 2    Abstract Unigram  randomised                   0.82340813
 3    Abstract Unigram  randomly                     0.63069745
 4    Title Unigram     hope                         0.61080829
 5    Title Unigram     randomized                   0.46978424
 6    Abstract Bigram   assigned^randomly            0.46964635
 7    Title Bigram      antiviral^finds              0.41717700
 8    Abstract Unigram  randomization                0.41659960
 9    Abstract Bigram   order^random                 0.39865617
10    Title Bigram      antiviral^effect             0.38119962
11    Title Unigram     enrolling                    0.37362438
12    Title Bigram      --^study                     0.37128836
13    Title Bigram      enrolling^study              0.37128836
14    Abstract Unigram  placebo                      0.33636001
15    Abstract Bigram   allocated^randomly           0.33516724
16    Title Unigram     randomised                   0.31382431
17    Title Bigram      icc^trials                   0.30972921
18    Abstract Unigram  random                       0.30738346
19    Title Unigram     icc                          0.30669612
20    Title Bigram      --^trials                    0.29650661
21    Abstract Unigram  crossover                    0.29052612
22    Abstract Unigram  double-blind                 0.27706781
23    Abstract Bigram   randomized^receive           0.27099984
24    Abstract Bigram   divided^randomly             0.26311613
25    Title Bigram      antiretroviral^combination   0.26080109
Table S6. The most strongly negatively weighted features for the CITATION_ONLY model.
TOP MOST NEGATIVELY WEIGHTED FEATURES, CITATION_ONLY MODEL
RANK  FEATURE SET      FEATURE NAME                                        WEIGHT
 1    Journal Name     aids policy & law                                  -0.85491611
 2    Journal Name     aids alert                                         -0.69202547
 3    Journal Name     project inform perspective                         -0.67780928
 4    Journal Name     positively aware                                   -0.67168294
 5    Journal Name     treatment review                                   -0.61031358
 6    Journal Name     gmhc treatment issues                              -0.60816330
 7    Journal Name     pi perspective                                     -0.57353228
 8    Journal Name     prescrire international                            -0.51756177
 9    Journal Name     notes from the underground (new york, ny)          -0.50087075
10    Journal Name     beta bulletin of experimental treatments for aids  -0.48566774
11    Title Unigram    aids                                               -0.47393402
12    Journal Name     focus (san francisco, calif)                       -0.44763457
13    Journal Name     newsline (people with aids coalition of new york)  -0.44190878
14    Abstract Bigram  randomized^trials                                  -0.43005443
15    Journal Name     research initiative, treatment action              -0.38856852
16    Journal Name     critical path aids project                         -0.36833073
17    Journal Name     aids treatment news                                -0.34408838
18    Title Unigram    roxithromycin                                      -0.33627865
19    Journal Name     the medical letter on drugs and therapeutics       -0.32816788
20    Abstract Bigram  randomized^studies                                 -0.30929305
21    Journal Name     the body positive                                  -0.30388445
22    Journal Name     healthcare benchmarks and quality improvement      -0.30267050
23    Title Unigram    lipids                                             -0.29794727
24    Title Unigram    3tc/azt                                            -0.29678165
25    Journal Name     positive health news                               -0.29649262
Table S7. The most strongly positively weighted features for the CITATION_PLUS_MESH model.
TOP MOST POSITIVELY WEIGHTED FEATURES, CITATION_PLUS_MESH MODEL
RANK  FEATURE SET       FEATURE NAME          WEIGHT
 1    Abstract Unigram  randomized            0.99602987
 2    Mesh Term         Double-Blind Method   0.79326574
 3    Abstract Unigram  randomised            0.78130029
 4    Abstract Unigram  randomly              0.65631655
 5    Mesh Term         Cross-Over Studies    0.45511840
 6    Title Unigram     randomized            0.43826829
 7    Abstract Bigram   assigned^randomly     0.40937356
 8    Title Unigram     hope                  0.38882575
 9    Mesh Term         Carnitine             0.36993776
10    Abstract Unigram  randomization         0.36062942
11    Abstract Bigram   order^random          0.35642897
12    Mesh Term         Adenine               0.31228530
13    Abstract Bigram   allocated^randomly    0.31187248
14    Abstract Unigram  random                0.30945023
15    Mesh Term         Single-Blind Method   0.30899233
16    Title Bigram      nelfinavir^protease   0.29964456
17    Title Bigram      inhibitor^study       0.29964456
18    Title Bigram      recruiting^study      0.29964456
19    Title Unigram     recruiting            0.29193907
20    Title Bigram      antiviral^finds       0.28440239
21    Mesh Term         Placebos              0.28103408
22    Title Unigram     randomised            0.27128954
23    Title Bigram      antiviral^effect      0.27090553
24    Title Unigram     nelfinavir            0.24685398
25    Abstract Bigram   divided^randomly      0.24408650
Table S8. The most strongly negatively weighted features for the CITATION_PLUS_MESH model.

TOP MOST NEGATIVELY WEIGHTED FEATURES, CITATION_PLUS_MESH MODEL
RANK | FEATURE_SET | FEATURE_NAME | WEIGHT
1 | Abstract Bigram | randomized^trials | -0.35509579
2 | Mesh Term | Randomized Controlled Trials as Topic | -0.26933683
3 | Abstract Bigram | randomized^studies | -0.24622715
4 | Mesh Term | Clinical Trials, Phase II as Topic | -0.21006912
5 | Abstract Bigram | randomly^selected | -0.20452695
6 | Mesh Major Term | Randomized Controlled Trials as Topic | -0.19471967
7 | Abstract Bigram | controlled^trials | -0.19351891
8 | Journal Name | gmhc treatment issues | -0.19087672
9 | Abstract Trigram | clinical^randomized^trials | -0.18854145
10 | Journal Name | pi perspective | -0.18082856
11 | Mesh Term | Clinical Trials as Topic | -0.18043824
12 | Mesh Term | Clinical Trials, Phase III as Topic | -0.17958616
13 | Title Unigram | trials | -0.17327485
14 | Mesh Term | Research | -0.17223032
15 | Mesh Term | Retrospective Studies | -0.17061191
16 | Journal Name | aids alert | -0.16451899
17 | Abstract Unigram | observational | -0.15703558
18 | Journal Name | aids policy & law | -0.15690963
19 | Mesh Term | Nevirapine | -0.15676741
20 | Abstract Unigram | meta-analysis | -0.15140712
21 | Mesh Term | Acquired Immunodeficiency Syndrome | -0.15088611
22 | Mesh Term | Case-Control Studies | -0.14666599
23 | Title Unigram | meta-analysis | -0.14554765
24 | Mesh Term | Reverse Transcriptase Inhibitors | -0.14105520
25 | Journal Name | prescrire international | -0.13791475
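To illustrate how signed weights like those in Tables S7 and S8 combine into a single confidence value, here is a minimal sketch assuming binary feature indicators, a handful of weights copied from the tables, and a logistic squashing step. The bias term and the logistic function are assumptions for illustration; the paper's actual model, training procedure, and confidence calibration may differ.

```python
import math

# Illustrative subset of feature weights copied from Tables S7 and S8.
# The full model contains many more features than shown here.
WEIGHTS = {
    ("Abstract Unigram", "randomized"): 0.99602987,
    ("Mesh Term", "Double-Blind Method"): 0.79326574,
    ("Abstract Bigram", "randomized^trials"): -0.35509579,
    ("Mesh Term", "Retrospective Studies"): -0.17061191,
}

def rct_confidence(features, weights=WEIGHTS, bias=0.0):
    """Score an article as the weighted sum of its binary features,
    squashed into (0, 1) with a logistic function."""
    score = bias + sum(weights.get(f, 0.0) for f in features)
    return 1.0 / (1.0 + math.exp(-score))

# An article whose abstract contains "randomized" and which carries the
# Double-Blind Method MeSH term gets a high confidence.
article = [("Abstract Unigram", "randomized"),
           ("Mesh Term", "Double-Blind Method")]
print(rct_confidence(article))
```

Features with strongly negative weights (e.g., the bigram "randomized^trials", typical of reviews *about* trials) push the score below 0.5 in the same way.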
11. Distribution of Confidence Predictions for Articles Tagged with the MEDLINE
Randomized Controlled Trial Publication Type vs. Articles Not Tagged
Figure S3 below shows the distribution of RCT confidence prediction values produced by the CITATION_PLUS_MESH model on the 2011 held-out data set. The figure clearly shows the strong difference in confidence prediction distributions between the two subsets. Articles without the Randomized Controlled Trial publication type are assigned confidence values very close to zero. Articles tagged with the MEDLINE Randomized Controlled Trial publication type are mostly assigned high confidences near 1.0, but also have a long tail of lower confidences, with a bump around zero. Based on our manual review, these near-zero tagged articles include quite a few that likely do not meet the definition of an RCT, and it may therefore be worthwhile to have the RCT publication type status of these articles re-reviewed by MEDLINE annotators.
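The per-group distributions shown in Figure S3 can be summarized with simple fixed-width binning of the confidence values. The sketch below uses made-up confidence values, not the paper's data; the 0.1 bin width is an assumption for illustration.

```python
from collections import Counter

def histogram(confidences, bin_width=0.1):
    """Bin confidence values in [0, 1] into fixed-width bins,
    keyed by the lower edge of each bin."""
    n_bins = int(1.0 / bin_width)
    bins = Counter()
    for c in confidences:
        # Clamp exact 1.0 values into the top bin.
        idx = min(int(c / bin_width), n_bins - 1)
        bins[round(idx * bin_width, 2)] += 1
    return dict(sorted(bins.items()))

# Toy confidences, echoing the shape seen in Figure S3: tagged articles
# cluster near 1.0 with a small bump near zero; untagged cluster near zero.
tagged   = [0.97, 0.99, 0.88, 0.05, 0.92]
untagged = [0.01, 0.03, 0.02, 0.15, 0.04]
print(histogram(tagged))
print(histogram(untagged))
```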
Figure S3. Histogram distribution of RCT confidence prediction values using the CITATION_PLUS_MESH model on the 2011 held-out data set, for articles tagged with the MEDLINE Randomized Controlled Trial publication type vs. articles not tagged with this publication type.
[Figure S3 image: two histograms titled "Histogram Distribution of CITATION_PLUS_MESH Model RCT Confidence in MEDLINE Year 2011 Articles", one panel for articles with the RCT publication type and one for articles without it; x-axis: Confidence (0.0 to 1.0), y-axis: Frequency.]
12. Single Token Match Classifier Results
To serve as a baseline comparison, we conducted a study using single tokens as binary classifiers of whether an article is an RCT. This was performed on the 2011 held-out dataset for comparison with our other results. Results are shown in Table S9. No single-token classifier achieves an F1 measure greater than 0.72, and only two tokens (randomized and trial) have F1 measures greater than 0.50. No single token achieves both high precision and high recall. Achieving higher performance requires many additional features, and there is no obvious non-computational method to combine these features into a simple single model. A machine-learning-based approach is a useful way to combine many features into a single predictor model.
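As a sketch of how the per-token metrics in Table S9 could be computed, assume a classifier that predicts RCT exactly when the token appears in the article's title or abstract. The function and toy counts below are illustrative, not the paper's code.

```python
def token_classifier_metrics(predictions, labels):
    """Precision, recall, F1, and mean squared error for a binary
    classifier. `predictions` and `labels` are parallel 0/1 sequences:
    prediction 1 means the token is present, label 1 means true RCT."""
    tp = sum(1 for p, y in zip(predictions, labels) if p and y)
    fp = sum(1 for p, y in zip(predictions, labels) if p and not y)
    fn = sum(1 for p, y in zip(predictions, labels) if not p and y)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # For 0/1 predictions the MSE is simply the misclassification rate.
    mse = sum((p - y) ** 2 for p, y in zip(predictions, labels)) / len(labels)
    return precision, recall, f1, mse

# Toy example: the token appears in 4 of 8 articles, 3 of which are RCTs.
preds  = [1, 1, 1, 1, 0, 0, 0, 0]
labels = [1, 1, 1, 0, 1, 0, 0, 0]
p, r, f1, mse = token_classifier_metrics(preds, labels)
```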
Table S9. The 20 best performing single token article RCT binary classifiers on the 2011 held out dataset, ranked by F1 performance.

TOKEN | PRECISION | RECALL | F1 | MSE
randomized | 0.6721 | 0.7691 | 0.7173 | 0.0274
trial | 0.5617 | 0.6945 | 0.6211 | 0.0383
controlled | 0.3575 | 0.3434 | 0.3503 | 0.0576
randomly | 0.5684 | 0.2429 | 0.3403 | 0.0426
double-blind | 0.8643 | 0.1910 | 0.3128 | 0.0380
placebo | 0.6643 | 0.1993 | 0.3066 | 0.0408
assigned | 0.6055 | 0.1881 | 0.2870 | 0.0423
baseline | 0.2543 | 0.2626 | 0.2584 | 0.0682
randomised | 0.6226 | 0.1560 | 0.2495 | 0.0425
group | 0.1639 | 0.5136 | 0.2485 | 0.1406
groups | 0.1689 | 0.4143 | 0.2400 | 0.1187
placebo-controlled | 0.7761 | 0.1344 | 0.2292 | 0.0409
intervention | 0.2112 | 0.2457 | 0.2271 | 0.0757
efficacy | 0.1957 | 0.2500 | 0.2195 | 0.0804
receive | 0.4009 | 0.1438 | 0.2117 | 0.0485
mg | 0.3814 | 0.1449 | 0.2101 | 0.0493
weeks | 0.2190 | 0.2011 | 0.2097 | 0.0686
versus | 0.1995 | 0.2071 | 0.2032 | 0.0735
received | 0.1842 | 0.1948 | 0.1894 | 0.0755
= | 0.1161 | 0.3848 | 0.1783 | 0.1604
13. Database and Computer System Details
The pre-processed article features were stored in a ~300 GB MongoDB (www.mongodb.com) database to facilitate machine learning training and subsequent evaluation. We used a Mac Pro with 32 GB of RAM running OS X 10.6 Snow Leopard as the database server to conduct our experiments.
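A minimal sketch of how per-article features might be organized as MongoDB documents follows. The field names (pmid, feature_set, name, value) are assumptions for illustration; the paper does not specify its schema. The in-memory `has_feature` helper merely mimics the kind of `$elemMatch` query one would run against such a collection.

```python
# Hypothetical document layout: one document per article, with its
# extracted features stored as an array of sub-documents.
article_doc = {
    "pmid": 12345678,
    "features": [
        {"feature_set": "Abstract Unigram", "name": "randomized", "value": 1},
        {"feature_set": "Mesh Term", "name": "Double-Blind Method", "value": 1},
    ],
}

def has_feature(doc, feature_set, name):
    """In-memory stand-in for a MongoDB query such as:
    db.articles.find({"features": {"$elemMatch":
        {"feature_set": feature_set, "name": name}}})"""
    return any(f["feature_set"] == feature_set and f["name"] == name
               for f in doc["features"])
```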