Conducting a Rapid Evidence Assessment – method and lessons learned
Introduction
Last summer the effectiveness of treatment for drug-using offenders within criminal
justice settings was highlighted as a key evidence gap. In particular, there was a need to
establish which drug treatment interventions for offenders in the CJS are effective, and
why, to help inform the roll out of the Criminal Justice Interventions Programme (CJIP).
In an attempt to address this evidence gap, the Drugs and Alcohol Research unit in RDS
decided to carry out a rapid assessment of existing evidence in this field. This
assessment had three main purposes:
• To provide an addition to the evidence base informing the further development of interventions aimed at drug-using offenders.
• To inform the planned CJIP evaluations.
• To identify areas for further research.
A paper has been produced setting out the key findings from the assessment as well as
implications for policy makers.
This was the first time an attempt had been made to conduct a rapid evidence
assessment of this type. The original intention was that 3 members of DAR, with
assistance from HO library and advice from Cabinet Office colleagues, would be able to
complete the exercise within 8 weeks. For future reference, this note sets out the
process that was adopted and the time and resource actually devoted to the project. It
also presents a number of lessons learned along the way.
Method
Preparation
Having identified the evidence gap in broad terms, three research questions were posed:
• How effective is treatment (tiers 1 through to 4) for individuals in the criminal justice system in terms of reducing their drug misuse and reducing their drug-related offending?
• What evidence is there for the effectiveness of coercive treatment for drug misuse (in terms of reducing drug misuse and drug-related offending) compared to voluntary treatment or no treatment at all?
• How effective is a case management approach within the Criminal Justice System in terms of criminal justice outcomes? (Case management is defined as the monitoring/supervision and assistance given to an offender from entering through to leaving the criminal justice system by someone working within that system.)
As the primary aim of the assessment was to determine the effectiveness of drug
treatment, it was agreed that the assessment should only consider evidence from
studies conducted using robust quasi-experimental designs. Consequently a systematic
review-style approach was adopted. Time and resource constraints prevented a full systematic review being conducted, so, in consultation with Cabinet Office colleagues, a plan was drawn up to conduct a rapid evidence assessment (REA). Based on the format used in Health Technology Assessments (HTA), rapid evidence assessments are intended to be quasi-systematic reviews carried out in a much shorter timescale than traditional systematic reviews (typically 8-12 weeks compared with 6-12 months) [1].
The main differences between the REA and a full systematic review were that:
a) only published studies were considered (i.e. no ‘grey’ literature was included), and
b) no hand searching of journals took place to ensure that all possible studies had been
included.
Protected time
Before going into detail on the method, it is worth highlighting one consistent theme that
ran throughout the project – the need for protected time to be set aside in order to
complete the project in the timescales suggested for a rapid evidence assessment.
Almost everything took twice as long as originally planned, because initial calculations
were based on an assumption that all the time required could be dedicated to the project
in one block. In reality, the assessment had to fit around other work priorities. A two-week abstract sift took four weeks; a four-week assessment process took eight weeks; and a four-week write-up took eight weeks.
[1] Cabinet Office (2003) The Magenta Book: Guidance Notes for Policy Evaluation and Analysis.
The search
The following key terms were selected from the research questions: ‘treatment’,
‘offending’, ‘drug’, ‘effectiveness’, ‘coercion’ and ‘criminal justice system’. Alternatives
for, and derivatives of, those terms were produced, and were then included in the full list
of search terms. This list can be found at Annex A.
The search terms were refined and tested in co-operation with Home Office library staff.
They advised that the collection of terms was too extensive and extremely difficult for the databases to cope with. They also advised that it would not be possible, even
with the most sophisticated use of Boolean logic, to devise a search strategy that would
combine all of them and give sensible results. The final search strings used for each of
the databases interrogated are also included at Annex A.
HO library staff conducted the search on databases to which the HO library subscribed at the time [2]. For each database, two separate searches were conducted. One string was constructed to address the questions on treatment effectiveness and coercion, on the assumption that, given the way the search terms were constructed, coercive forms of treatment would be captured in the general search for treatment. A separate string was
constructed to address the case management question.
The following databases were searched:
• National Criminal Justice Reference Service (NCJRS)
• Criminal Justice Abstracts (CJA)
• Medline
• PsycINFO
• Applied Social Sciences Index and Abstracts (ASSIA)
• Social Science Citations Index (SSCI)
• Public Affairs Information Service (PAIS)
There was no direct cost to the project for the CJA and ASSIA searches, as the
databases were covered under general HO library subscriptions. All other databases
were searched through the Dialog system at a total cost of £1,300. HO library staff spent approximately 30 hours conducting the searches [3].
To cut down on the time and resource required for the abstract sift and for the
assessment itself, a decision was taken to restrict analysis to post-1980 studies from all
databases.
In total, 2,987 abstracts (including duplicates) were retrieved from the databases
above. The treatment search string produced 2,312 abstracts, while the case
management search string produced 675. The majority of abstracts came from NCJRS
(1,256 and 405 respectively for the treatment and case management search strings).
[2] Since this assessment took place the HO Library has fundamentally revised the way in which literature searches are managed. Desktop access to the abstract databases has been provided to all researchers in RDS to enable them to conduct their own searches. The HO library now undertakes an advisory role in database searching, rather than conducting the searches itself.
[3] This Dialog cost is no longer relevant due to the revision mentioned above. All database searches are covered under the desktop access subscriptions managed by the Home Office. Also, HO library staff would no longer carry out the searching.
All searching was carried out over a two-week period between 19th August and 11th
September 2003.
The sift
The abstracts were sifted on the basis of a) relevance to the research questions outlined
above and b) whether the paper was based on a primary study examining the
effectiveness of an intervention. If it was not clear whether the paper was a report of a
primary study or was a policy piece, the paper was requested.
The sifting process was much more time-consuming than originally envisaged. With
almost 3,000 abstracts to sift (and record where appropriate), the three members of DAR
originally assigned to the project had difficulty coping with the sheer volume. The sifting
process itself took an estimated total of 20 working days. In reality, the sift took much
longer than the two calendar weeks originally planned for, as it had to be fitted around
other work priorities. It took a month to complete the sift. The first batch of abstracts was
sent to DAR on 28th August. The last request for papers was sent on 24th September.
Once a paper had been identified, basic bibliographical details were recorded on an
Excel spreadsheet. This spreadsheet was used to manage the process of requesting
papers from the HO library. In most instances the HO library did not hold a copy of the
paper requested, but most were easily and quickly obtainable via inter-library loan (ILL)
from the British Library. In order to ease the burden on HO library staff, requests were
made in batches on a weekly basis.
Several lessons for future assessments were learned during the sifting process and are
set out below. It has to be acknowledged that some of these issues can be addressed
internally, but wider problems, with the format and quality of abstracts in particular, cannot be rectified solely by the Home Office.
Tightening up search criteria
With hindsight, the search criteria could have been tightened to exclude a number of
spurious abstracts. For example, the search produced many abstracts for studies into treatment for drink drivers (which were not relevant) and offenders with co-morbidity (which we had chosen to exclude from this assessment, as these conditions require very different types of intervention). Current guidance on conducting systematic reviews suggests that NOT Boolean terms should not be used, to avoid the risk of excluding potentially relevant studies. For this project the HO library was advised not to include NOT terms in the search strings. However, a more pragmatic approach for an REA, using NOT terms, might have helped reduce the burden of the abstract sift.
Duplication of abstracts
Abstracts for most studies were found in more than one database. If duplicates had been
eliminated a great deal of time and effort would have been saved during the sifting
process. (It subsequently came to light that duplicates could be removed in Dialog. As a
result of miscommunication between DAR and the HO Library, duplicates for the
databases searched using Dialog were included).
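As a belt-and-braces measure, duplicates can also be stripped after download with a short script. The sketch below is illustrative only (the record structure and the sample titles are hypothetical) and simply keeps the first record seen for each normalised title; in practice, de-duplication within Dialog itself would have been the simpler option.

```python
# A minimal sketch of cross-database de-duplication by normalised title.
# The record structure and sample data are hypothetical, not the actual
# output of the searches described above.
import re

def normalise(title):
    """Lower-case a title and strip punctuation for matching."""
    return re.sub(r"[^a-z0-9 ]", "", title.lower()).strip()

def deduplicate(records):
    """Keep the first record seen for each normalised title."""
    seen = set()
    unique = []
    for record in records:
        key = normalise(record["title"])
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique

abstracts = [
    {"title": "Drug courts: an evaluation", "database": "NCJRS"},
    {"title": "Drug Courts: An Evaluation.", "database": "PsycINFO"},
]
print(len(deduplicate(abstracts)))  # 1 - the second copy is treated as a duplicate
```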
Both points above highlight the need to ensure that a clear understanding of the search strategy is reached before the search starts, if searches are being carried out by a third
party (i.e. someone not involved in the assessment itself).
"Hard to reach" papers
In this assessment we sifted abstracts from the most comprehensive database last.
Papers found on NCJRS were often harder to get hold of than traditional journal articles. For example, a significant number of reports on drug court evaluations that had been prepared for the individual jurisdictions operating the drug courts were publicly available but were not published in peer-reviewed journals and/or not available via
ILL. It is perhaps advisable to start with NCJRS for two reasons. Firstly, it produced by
far the greatest number of abstracts (and most studies found on other databases were
also found on NCJRS). Secondly, because a higher proportion of the papers found on
NCJRS were harder to obtain, the longer the time available to find them the better.
Managing the process - systematic recording of studies
All studies requested were recorded on an Excel spreadsheet, but this became very unwieldy and difficult to manage. A few papers were requested and received twice, suggesting that the process was not foolproof. The HO library has suggested that in-built mechanisms to avoid requesting duplicate papers may have failed because more than one member of the assessment team was making requests for papers and more than one member of HO library staff was making ILL requests. A software package such as Endnote or RefWorks to manage the process would have been an extremely useful tool [4].
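Even a very simple check before each weekly batch went to the library would have caught repeat requests. The sketch below is purely illustrative; the file name, the reference fields and the example reference are hypothetical, and (as noted above) a reference manager such as Endnote or RefWorks is the more sensible tool for the job.

```python
# Illustrative sketch of a "request once" check for the weekly ILL batch.
# The file name and reference fields are hypothetical.
import csv
from pathlib import Path

LOG = Path("requested_papers.csv")  # hypothetical tracking file

def already_requested(author, year, title):
    """Return True if an identical reference has already been logged."""
    if not LOG.exists():
        return False
    with LOG.open(newline="") as f:
        return any(
            (row["author"], row["year"], row["title"].lower()) == (author, year, title.lower())
            for row in csv.DictReader(f)
        )

def log_request(author, year, title):
    """Append a reference to the log, writing a header if the file is new."""
    is_new = not LOG.exists()
    with LOG.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["author", "year", "title"])
        if is_new:
            writer.writeheader()
        writer.writerow({"author": author, "year": year, "title": title})

# Hypothetical reference, for illustration only.
if not already_requested("Smith", "1995", "Treatment outcomes for drug-using offenders"):
    log_request("Smith", "1995", "Treatment outcomes for drug-using offenders")
```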
Format of abstracts
Most abstracts from the library arrived in unwieldy rich text files, which had to be printed off and read. It was not easy to transfer the necessary information from the abstracts to the recording spreadsheet (see above) electronically. A more user-friendly format for abstracts would speed up the abstract sift greatly [5].
Quality of abstracts
The primary basis for inclusion in the assessment was methodology. However, the
relatively poor quality of many of the abstracts made it difficult to determine whether a
study was suitable for inclusion. In many cases, no details were given on methodology.
We adopted a cautious approach and consequently requested many papers on studies
that were not based on a suitably rigorous methodology. This added considerable time
and expense to the subsequent assessment process.
Document delivery costs for this project amounted to £1,000.
The assessment
A total of 238 papers were identified and requested during the sift stage. In the time
available (an arbitrary cut-off point of October 30th was set), 198 papers were received.
[4] At the time of writing, HO library is in the process of purchasing a RefWorks licence, which will facilitate desktop access for all members of RDS.
[5] This was mainly due to the way in which Dialog produced output. Again, now that desktop access to the abstract databases has been organised and RefWorks has been purchased, this problem has to a large extent been resolved.
The remainder could not be obtained via inter-library loan from the British Library, or the
source of the paper could not be identified.
As mentioned above, it was not always possible to tell from an abstract whether a paper was a policy piece or whether it was a report of a primary study. Of the 198 papers
received, 120 were reports of primary studies. The remaining papers were primarily
discursive pieces or literature reviews. (For the literature reviews, it must be noted that in
most cases, they had been identified as such by the abstract. These papers were
acquired to provide further background information for the assessment and to gauge
whether the studies we had found were broadly representative of the literature in this
field. They were not requested because poor abstracts failed to identify the nature of the
paper, as was the case with most of the discursive pieces.)
The 120 primary studies were then reviewed to determine whether they were a) relevant
to the research questions and b) methodologically sound. Once a study had been
acknowledged as relevant, an initial assessment of the methodology was carried out
based on the “Maryland Scale”, devised by Sherman et al (1998) at the University of
Maryland for a systematic review into what works in crime prevention. The Scale can be found in Annex B. Only those studies with a robust comparison group design (level 3
on the “Maryland Scale”) were considered to be sufficiently robust for inclusion in the
assessment. Sherman et al argue that only those studies using a robust comparison
group can provide strong evidence of causality (and hence effectiveness). A total of 64
relevant studies rating at least 3 on the Scale were identified for further assessment.
This further assessment was carried out using an ad hoc “Quality Assessment Tool”
(QAT) devised specifically for this project. The QAT was based on a combination of
more detailed criteria prepared for the systematic review by the University of Maryland
Team mentioned above and criteria established by the Crime and Policing Group in RDS
for a recently conducted systematic review of studies on HR management and police
performance.
By this stage in the process it had become clear that extra resource would be required
for the assessment stage in order to speed up the process. In total 7 members of DAR
undertook the assessments. To develop and refine the QAT, the 6 original reviewers (a
7th reviewer joined at a later date) each reviewed the same 3 studies. The individual
assessments were then compared and a consensus was reached over any
discrepancies between scores. This process had the dual effect of refining the QAT
guidance and ensuring that there was an element of consistency between the reviewers
when assessing the papers.
Each study was marked according to its methodology in four key areas: sample
selection; bias; data collection and data analysis. Each element was rated as either 1
(good), 2 (average), 3 (weak) or 5 (unable to determine from the paper). The scores for
each component were then added together to provide an overall rating for the study.
Those studies with the lowest scores were considered the most methodologically robust.
The minimum score available was 4, with a maximum score of 20.
A potential flaw in this scoring system was the use of a ‘5’ score when a particular
criterion could not be marked due to a lack of evidence within a paper. This potentially resulted in a number of sound studies being excluded as a result of scant or poor reporting of methodology, rather than poor methodology per se. In defence of this approach, however, studies that were inadequately reported had to be excluded on the basis that the reviewers could not be sufficiently confident in the methodology based on the available evidence. Had more time been available, those studies where it was felt that scant reporting resulted in exclusion might have been followed up. This was simply not feasible for this evidence assessment.
Two reviewers assessed each study. Where “major” discrepancies between the QAT
scores existed (greater than 2 points), the study was assessed independently by a third
reviewer. Where the third reviewer agreed with one of the original scores that score was
chosen. If the third reviewer produced a different score to both of the original scores, the
third reviewer took the final decision on the score. It has been suggested that this is a
rather arbitrary process. It was rightly pointed out that there is no real justification for the
third reviewer’s judgement to supersede the previous judgements. However this
approach was necessitated by the timescale. If time were available, a more systematic
approach would have been adopted whereby the two original reviewers came to a
consensus on the score.
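The arbitration rule can be summarised in a few lines of logic. The sketch below is a restatement of the process for illustration only; in particular, this note does not say how minor discrepancies were resolved, so the averaging step is an assumption.

```python
# Sketch of the scoring rule described above: two reviewers per study, with a
# third reviewer only where the first two scores differ by more than 2 points.

def final_score(first, second, third=None):
    """Return the final QAT score for a study under the arbitration rule."""
    if abs(first - second) <= 2:
        # Minor or no discrepancy: averaging is assumed here purely for
        # illustration (the note does not state how these were resolved).
        return round((first + second) / 2)
    if third is None:
        raise ValueError("Major discrepancy: an independent third review is required")
    # Where the third reviewer agrees with one of the original scores, that score
    # is chosen; otherwise the third reviewer takes the final decision. Either
    # way, the third reviewer's figure is returned.
    return third

print(final_score(6, 8))     # minor discrepancy -> 7
print(final_score(5, 9, 8))  # major discrepancy, third reviewer decides -> 8
```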
In total, there was a “major” discrepancy between the scores in 14 of the 64 studies
assessed. This suggests that the QAT may have benefited from further guidance to
ensure greater consistency between reviewers.
Once a final score had been established, each study was rated as strong, average, weak (but eligible for consideration) or poor (and not eligible for consideration). Those studies with a score of 4 to 6 were considered strong (8 studies); studies with a score of 7 or 8 were average (21 studies); while studies with a score of 9 or 10 were considered weak but eligible for consideration (21 studies). Studies with a
score of greater than 10 were excluded from further consideration as they were
considered to be so poor methodologically that the results could not be relied upon (14
in total).
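Putting the scoring and banding rules together, the thresholds above can be expressed as a short function. This is a sketch for illustration only; it assumes whole-number component scores as described in the main text (the averaging of individual criteria set out in Annex B can produce fractional component scores).

```python
# Sketch of the QAT scoring bands described above. The four component scores
# (1 good, 2 average, 3 weak, 5 unable to determine) are summed to give a total
# between 4 and 20.

def qat_band(sample, bias, data_collection, data_analysis):
    """Band a study according to its total QAT score."""
    total = sample + bias + data_collection + data_analysis
    if total <= 6:
        return "strong"
    if total <= 8:
        return "average"
    if total <= 10:
        return "weak (but eligible for consideration)"
    return "poor (excluded from further consideration)"

print(qat_band(1, 1, 2, 2))  # total 6  -> strong
print(qat_band(2, 2, 2, 2))  # total 8  -> average
print(qat_band(3, 2, 3, 3))  # total 11 -> poor (excluded from further consideration)
```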
Further details on the criteria used in the QAT can be found in Annex B, along with the
guidance notes prepared for reviewers.
As with the sifting stage, several lessons for the future were learned during the
assessment stage.
“Singing from the same songsheet”
If an evidence assessment is to be undertaken by a team (and resource constraints are
likely to dictate that it is a team exercise), it is essential that all the reviewers in the team
interpret the scoring guidelines in the same manner. Developing the QAT completion
guidelines was a much more difficult process than originally envisaged. The refinement
exercise proved to be invaluable in terms of ensuring that people were interpreting the
criteria in the same manner. However, once the assessment stage was underway, it
became clear that interpretations still varied across the reviewers. This makes the
‘systematic’ approach very difficult to manage. Almost one in four of the studies
assessed produced significantly different scores between the two reviewers assigned
the studies. Arbitration by a third reviewer was required in these instances, but there
remains a question mark as to how systematic the arbitration process was, given that it
relied on one further reviewer.
Managing the process - systematic recording of assessments
All assessments were recorded and aggregated on an Excel spreadsheet. As with the abstract sifting spreadsheet, this became very unwieldy and difficult to manage. A great deal of effort had to be put into keeping track of who had reviewed which paper, which papers were due for a second review, and so on. Again, a software package such as Endnote to manage the process would have been an extremely useful tool.
Being realistic about the time and effort required to assess papers
As with the sifting process, possibly unrealistic expectations were held at the start of the
process about how quickly the papers could be assessed. Rating the methodology of
studies can be an extremely laborious process, particularly given the poor quality of
reporting in many papers. It is also extremely difficult to do for extended periods –
devoting whole days to assessing papers was a draining experience! A better approach
would be to intersperse assessments with other work. Assessing papers in small
batches is likely to lead to a better quality of assessment overall. However, this clearly
has implications for the timescale in which an assessment can be carried out.
Methodological knowledge
It is important that those reviewing papers for an evidence assessment have sufficient
knowledge and experience in research methods to carry out the assessment. Some of
the discrepancies in QAT scores may not have been the result of different interpretations
of the criteria, but a misunderstanding of the methodology as set out in the papers. In
turn, this may in part be due to poor reporting, but in some instances a greater
knowledge of quasi-experimental research methods may have improved the accuracy
and consistency of the assessment. Training on ‘what makes a good study’ would be
extremely helpful.
Annex A – Search terms
Search terms derived
Effective: effective*, useful*, practica*, able, work*, outcome, impact, result*

Treatment: treat*, screen*, refer*, assess*, harm, specialist, service*, care, community, residential, ‘in*patient’, support, aftercare, throughcare, advice, advize, advis*, ‘social worker*’, teacher*, probation, housing, GP*, nurs*, information, program*, ‘needle exchange’, ‘care plan’, prescrib*, motivation*, ‘drop-in’, outreach, detached, peripatetic, domiciliary, ‘arrest referral’, CARATS, ‘brief interventions’, intervention*, psychotherap*, ‘cognitive behavio*r*’, counselling, methadone, detox*, rehabilitat*, ‘drug testing’, ‘DTTO’, ‘drug treatment and testing order’, ‘stabili*’, abstinence, therap*, help, ‘self-help’, ‘health promotion’

Criminal Justice System: crim*, justice, law, enforce*, correction*, court, arrest, apprehension, prosecution, defense, defence, sentenc*, prison, jail, incarceration, probation, judic*, criminal, offence, offense, custody, bail, trial, pre-trial, parole

Drugs: drug*, substance*, misuse, abuse, illegal, illicit, narcotic, opiates, opioid, tranquillisers, hallucinogen, addict*, heroin, cocaine, crack, methadone, amphetamine*, cannabis, ecstasy, diazepam, temazepam, inject*, overdos*, stimulant*

Offending: offense, offence, offend*, crim*, shoplift*, burglar*, robbery, theft, supply, handling, deception, violen*

Coercion: coerc*, compel*, involuntary, compulsory, mandatory, obligatory, forc*, necessary
Actual search strings used
ASSIA
Case mgmt: Query: KW=((offender* or burglar*) OR (criminal* or reoffend*) OR (theft or
shoplift*)) AND KW=((drug* or narcotic*) OR (substance or heroin) OR (crack or
methadone)) AND KW=((case management or case worker) OR (throughcare or through
care or aftercare) OR (link worker or care coordinator))
Treatment: Query: KW=((offender* or burglar*) OR (criminal* or reoffend*) OR (theft or
shoplift*)) AND KW=((drug* or narcotic*) OR (substance or heroin) OR (crack or
methadone)) AND KW=((treatment or therapy) OR (drop in or rehabilitat*) OR (needle
exchange))
CJA
Case mgmt: (offender* or criminal* or burglar* or theft or shoplift*) and (drug* or
narcotic* or heroin or methadone or crack) and (case management or case worker or
throughcare or through care or after care or link worker)
Treatment: (offender* or criminal* or burglar* or theft or shoplift*) and ((drug* or narcotic*
or heroin or crack or methadone) and (illicit or banned or illegal or abuse)) and
(treatment or therapy or needle exchange or rehabilitat*)
MEDLINE
(offender or offenders or criminal or theft or shoplifting or burglar or burglars) AND (drug
or drugs or heroin or substance or crack or methadone or narcotic) AND (illicit or illegal
or abuse or banned) AND (treatment or therapy or drop in or rehabilitation or needle
exchange or CARATS or DTTO) AND (effective or success or relapse)
Dialog (for NCJRS, PsycINFO, SSCI and PAIS)
Case mgmt: (offender? OR burglar? OR criminal? OR reoffend? OR theft OR shoplift?)
AND (drug? OR narcotic? OR abuse OR substance OR heroin OR crack OR
methadone) AND (case(w)management OR throughcare OR through(w)care OR
case(w)worker OR link(w)worker OR care(w)coordinator OR aftercare)
Treatment: (offender? OR burglar? OR criminal? OR reoffend? OR theft OR shoplift?)
AND (drug? OR narcotic? OR abuse OR substance OR heroin OR crack OR
methadone) AND (treatment OR therapy OR drop(w)in OR rehabilitat? OR
needle(w)exchange) AND (illegal OR illicit OR abuse OR banned) AND (effective? OR
success? OR relaps?)
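For reference, Dialog-style strings of the kind shown above can be assembled mechanically from lists of terms. The sketch below is purely illustrative; the strings used for this assessment were constructed by hand with HO library staff, and the (abbreviated) term lists here are copied from the case management string above.

```python
# Illustrative sketch: build a Dialog-style Boolean string from grouped terms.
# '?' is the Dialog truncation character and '(w)' the adjacency operator,
# as in the strings above.

def or_group(terms):
    """Join a list of alternative terms into a bracketed OR group."""
    return "(" + " OR ".join(terms) + ")"

offender_terms = ["offender?", "burglar?", "criminal?", "reoffend?", "theft", "shoplift?"]
drug_terms = ["drug?", "narcotic?", "abuse", "substance", "heroin", "crack", "methadone"]
case_mgmt_terms = ["case(w)management", "throughcare", "through(w)care",
                   "case(w)worker", "link(w)worker", "care(w)coordinator", "aftercare"]

case_mgmt_query = " AND ".join(
    [or_group(offender_terms), or_group(drug_terms), or_group(case_mgmt_terms)]
)
print(case_mgmt_query)
```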
Breakdown of abstracts produced

Database    Treatment search string    Case mgmt search string    Total
NCJRS       1,256                      405                        1,661
CJA         183                        42                         225
ASSIA       293                        17                         310
PAIS        5                          13                         18
SSCI        183                        62                         245
PsycINFO    204                        89                         293
MEDLINE     188                        47                         235
Total       2,312                      675                        2,987
Annex B – Assessment guidance
1) An initial test of the methodology is carried out by ranking it on the Maryland Scale of
Scientific Methods. To be included in the assessment the methodology must achieve at
least 3 on the Maryland Scale. This is a “comparison between two or more comparable
units of analysis, one with and one without the programme”.
2) If the study qualifies for inclusion according to its Maryland Scale ranking, then a more
detailed assessment of its methodology will be conducted using a customized quality
assessment tool (QAT). The quality criteria on which the methodology for each study
should be assessed are set out in the table below, along with detailed guidance notes on
how to rate the studies against each criterion. There are four components in the QAT.
The first three components (sample, bias and data collection) are assessed on three
separate criteria. The fourth component (data analysis) is assessed on a single
criterion. For each criterion a score of 1 to 3 will be assigned based on the methodology
used. A score of 5 should be recorded where there is insufficient detail to assess the
study against that particular criterion. Where a particular criterion is not applicable, a
score of 1 should be assigned, but it should be noted that this score was only given
because the criterion was inapplicable.
The scores for individual criteria will then be averaged to provide an overall score for
each component of the QAT (with the exception of the data analysis component). In turn,
the scores for each component will be aggregated to provide an overall score for each
study.
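As a worked illustration of the aggregation just described, the sketch below averages the criterion scores within each of the first three components, takes the single data analysis score as it stands, and sums the four component scores (taking "aggregated" to mean summed, as in the main text). The example scores are hypothetical.

```python
# Sketch of the Annex B aggregation: criterion scores are averaged within each
# of the first three components, the single data analysis score is taken as-is,
# and the component scores are then summed. The example scores are hypothetical.

def component_score(criterion_scores):
    """Average the criterion scores within a component."""
    return sum(criterion_scores) / len(criterion_scores)

study = {
    "sample": [1, 2, 2],           # a) size, b) method, c) selection
    "bias": [2, 1, 2],             # a) response/refusal, b) attrition, c) performance
    "data_collection": [2, 2, 3],  # a) method, b) timing, c) validation
    "data_analysis": [2],          # single criterion
}

overall = sum(component_score(scores) for scores in study.values())
print(round(overall, 2))  # 1.67 + 1.67 + 2.33 + 2.0 = 7.67
```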
The Maryland Scale of Scientific Methods
Level 1: Correlation between a crime prevention programme and a measure of
crime or crime risk factors at a point in time.
For example, a single, post-treatment survey of clients who have received treatment to
compare treatment outcomes with criminal justice outcomes.
Level 2: Temporal sequence between the programme and the crime or risk
outcome clearly observed, or the presence of a comparison group without
demonstrated comparability to the treatment group.
For the purpose of this exercise, all comparison group studies should be given a rating
of at least 3, with the exception of:
a) those studies where the only comparison is between completers and non- (or
partial) completers of a particular treatment
b) those studies where the comparability of the comparison groups is seriously
compromised and no attempt has been made to control for this (please state
reasons).
Level 3: A comparison between two or more comparable units of analysis, one
with and one without the programme.
Level 4: A comparison between multiple units with and without the programme, or
using comparison groups that evidence only minor differences.
For studies using just one treatment group and one non-treatment comparison group, if those groups evidence major differences prior to the intervention that have to be controlled for statistically, the study should be treated as level 3. (For example, where the dependent variable was recidivism, any significant differences in the level of pre-intervention offending behaviour between comparison groups would need to be controlled for.) Only if it has been clearly demonstrated that, prior to the intervention, there is very little difference between the comparison groups should the study be rated as level 4.
Level 5: Random assignment and analysis of comparable units to programme and
comparison groups. Differences between groups are not greater than expected
by chance. Units for random assignment match units for analysis.
Sometimes random assignment takes place at a different level than the analysis, e.g.
prisons are randomly assigned, but prisoners are the unit of analysis. These cases
should not be treated as random assignments.
Further details on the Maryland Scale can be found in:
Sherman, LW et al (1998) Preventing Crime: What Works, What Doesn’t, What’s
Promising, Washington, NIJ (http://www.preventingcrime.org/report/index.htm)
Quality Assessment Tool
1. Sample
a) size
   Grade 1: Whole population or 100+ participants in both treatment and control groups
   Grade 2: 70% of population or 50-100 participants in both treatment and control groups
   Grade 3: Less than 50 participants in both treatment and control groups
   Grade 5: Not reported
b) method
   Grade 1: Whole population or random samples
   Grade 2: Purposive samples with potential impact adequately controlled for statistically
   Grade 3: Purposive samples with potential impact not adequately controlled for statistically, or not controlled for at all
   Grade 5: Not reported
c) selection
   Grade 1: Control and experimental groups comparable
   Grade 2: Control and experimental groups not comparable, but differences adequately controlled for statistically
   Grade 3: Control and experimental groups not comparable, and differences not adequately controlled for statistically, or not controlled for at all
   Grade 5: Not reported

2. Bias
a) response/refusal bias
   Grade 1: No bias
   Grade 2: Some bias but adequately controlled for statistically
   Grade 3: Some bias and not adequately controlled for statistically, or not controlled for at all
   Grade 5: Not reported
b) attrition bias
   Grade 1: No/very little (< 10%) attrition
   Grade 2: Some attrition but adequately controlled for statistically
   Grade 3: Some attrition and not adequately controlled for statistically, or not controlled for at all
   Grade 5: Not reported
c) performance bias
   Grade 1: Control and experimental groups treated equally
   Grade 2: Control and experimental groups not treated equally – minor effect
   Grade 3: Control and experimental groups not treated equally – major effect
   Grade 5: Not reported

3. Data collection
a) method
   Grade 1: Very appropriate
   Grade 2: Appropriate
   Grade 3: Not appropriate
   Grade 5: Not reported
b) timing
   Grade 1: Very appropriate
   Grade 2: Appropriate
   Grade 3: Not appropriate
   Grade 5: Not reported
c) validation
   Grade 1: Very appropriate
   Grade 2: Appropriate
   Grade 3: Not appropriate
   Grade 5: Not reported

4. Data analysis
a) appropriate techniques/reporting
   Grade 1: Very appropriate
   Grade 2: Appropriate
   Grade 3: Not appropriate
   Grade 5: Not reported
Guidance notes for completion of Quality Assessment Tool
The overriding guide for the Quality Assessment – assume the worst! If insufficient
information is available to assess a particular criterion, mark that as 5. The first two
criteria are relatively straightforward and self-explanatory. Further definitions are below:
1c) Sample selection
Were the experimental and control groups somehow selected differently, or were they not comparable for some reason? For example, did the groups demonstrate very different
patterns of offending prior to entering treatment and control groups? This score relates
to the ‘recruitment’ phase only, i.e. before any treatment takes place or is even offered.
2a) Response/refusal bias
This score relates to any bias that may have been introduced once the samples had
been selected. Two examples of potential response/refusal bias:
If a study relied on voluntary take-up of treatment/intervention once the treatment
sample had been selected, were those that volunteered to participate comparable to all
those chosen to participate in the treatment group? See Harrell & Roman (2001) for an
example.
If a study relied on self-reported data among treatment and control groups (once those
groups had been selected), were those in the treatment and control groups who
completed the self-report questionnaire/interview comparable to the total populations of
the treatment and control groups?
2b) Attrition bias
Were all the participants in the experimental and the control samples accounted for?
Were there differences between the study participants (in both treatment and control
groups) at the pre- and post-test stages? Were there more “lost-to-follow-ups” in one
treatment group compared to the control group (or vice versa)? Is attrition evident but
no adequate discussion found in the study, or is it discussed but not controlled for
adequately?
2c) Performance bias
Were the experimental and control groups dealt with differently in any way other than the intervention under inquiry? Could any other differences in the way in which the groups were treated have
any major impact on the outcomes? For example, performance bias will exist where
different drug testing regimes were in place for control and treatment groups, if the
testing regime itself was not the intervention being investigated. Also, if appropriate,
were the participants and interviewers blinded?
3a) Method of data collection
What data collection methods were employed, e.g. self-completion questionnaire,
structured interview, analysis of administrative data (crime records)? Were these
appropriate in terms of supplying the required data to be able to answer the research
question(s) posed?
Studies that rely on the retrospective collection of self-reported pre- and post-intervention data only should be given a maximum score of 2 (given likely recall issues).
Studies relying on a single data collection method should be given a maximum score of
2.
3b) Timing of data collection
Was the timing of data collection from the control and comparison groups before and
after the treatment appropriate? Was a sufficient length of time left after treatment when
collecting recidivism data to adequately determine outcome in terms of reduced
offending?
24+ month follow-ups should be rated as 1, 12-24 month follow-ups should be rated as 2
and under-12 month follow-ups should be rated as 3. Those studies where no baseline data are collected should be marked as 3.
For longitudinal studies, were the data collected at appropriate intervals? Was a rationale given for the timing of the data collection, and was it appropriate?
3c) Validation of data
If appropriate, were different sources of data used? Was any triangulation carried out?
For example, was self-reported criminality matched to official records?
Studies relying on a single data source should be given a maximum score of 2. Studies
that rely on a single measure of recidivism should be given a maximum score of 2.
Data collection – general
Where multiple methods are used, the reviewer must make a judgment regarding the
overall standard of the data collection, concentrating on those data deemed most
appropriate to answering the research questions.
4a) Appropriate statistics and techniques used
Were appropriate statistics used (e.g. Chi-square, t-test, ANOVA, regression) and
reported? Were standard deviations reported as well as differences of means? Were
lower and upper quartiles reported (or the range) as well as medians? Were confidence
intervals reported as well as odds ratios? Were significance levels reported?
Were repeated measures reported, i.e. were baseline data and post-treatment data
reported? If post treatment data only are reported, the maximum score given should be
2.
Annex C – Project resources
Planned and actual timetables for the assessment
[Gantt-style chart, weeks 1-25 (week commencing 11 August 2003 to week commencing 26 January 2004). Planned elements: devise strategy; database searches; abstract sifting; papers received; develop assessment tool; assessment; report write-up. The actual timetable adds a 'refine assessment tool' stage, with grey blocks showing the over-run on each element and the Christmas holiday period marked.]
Estimated time spent on the project and total costs
Activity (No of people): Total hours; Cost (excl. staff costs)
- Devise strategy (6: 3 x DAR, 1 x Cab Off, 2 x HO Lib): 20 hours
- Database searches (1: 1 x HO Lib): 30 hours searching + 12 hours other admin = 42 hours; £1,300
- Abstract sifting (3: 3 x DAR): 20 days x 7.5 hours = 150 hours
- Requesting papers (2: 2 x HO Lib): 36 hours; £1,000
- Develop/refine assessment tool (6: 6 x DAR): 20 hours
- Assessment (7: 7 x DAR): 10 per day x 200 reviews = 20 days x 7.5 hours = 150 hours
- Report write up (2: 2 x DAR): 10 days x 7.5 hours = 75 hours
- Total (7 x DAR, 4 x HO Lib, 1 x Cab Off): 493 hours; £2,300