Kaveh G. Shojania, MD; Margaret Sampson, MLIS; Mohammed T. Ansari, MBBS, MMedSc, MPhil; Jun Ji, MD, MHA; Steve Doucette, MSc; David Moher, PhD
Disclaimer: The authors of this report are responsible for its content. Statements in the report should not be construed as endorsement by the Agency for Healthcare Research and Quality or the U.S. Department of Health and Human Services.
Acknowledgments: The authors thank Keith O'Rourke for statistical advice, Jessie McGowan and Tamara Rader for assistance with searches, and Alison Jennings for assistance with development of the meta-analytic worksheet. They also gratefully acknowledge Dr. David Atkins and the members of the technical advisory panel for the project funded by the Agency for Healthcare Research and Quality from which this work derives: Drs. Paul Shekelle, Evelyn Whitlock, Cynthia Mulrow, Doug Altman, Martin Eccles, and P.J. Devereaux.
Grant Support: By the Agency for Healthcare Research and Quality, U.S. Department of Health and Human Services (contract no. 290-02-0021). Dr. Shojania received additional salary support from the Government of Canada Research Chairs program. Dr. Moher is the recipient of a University of Ottawa Research Chair.
Potential Financial Conflicts of Interest: None disclosed.
Reproducible Research Statement: The data set is available to interested readers by contacting Dr. Shojania (e-mail, email@example.com); statistical code can be obtained from Mr. Doucette (e-mail, firstname.lastname@example.org).
Requests for Single Reprints: Kaveh G. Shojania, MD, The Ottawa Hospital–Civic Campus, 1053 Carling Avenue, Room C403, Box 693, Ottawa, Ontario K1Y 4E9, Canada; e-mail, email@example.com.
Current Author Addresses: Drs. Shojania and Ji: Ottawa Health Research Institute, 1053 Carling Avenue, Ottawa, Ontario K1Y 4E9, Canada.
Ms. Sampson, Mr. Ansari, and Dr. Moher: Children's Hospital of Eastern Ontario Research Institute, 401 Smyth Road, Ottawa, Ontario K1H 8L1, Canada.
Mr. Doucette: The Ottawa Hospital, 501 Smyth Road, Ottawa, Ontario K1H 8L6, Canada.
Author Contributions: Conception and design: K.G. Shojania, M. Sampson, M.T. Ansari, D. Moher.
Analysis and interpretation of the data: K.G. Shojania, M. Sampson, M.T. Ansari, J. Ji, S. Doucette, D. Moher.
Drafting of the article: K.G. Shojania.
Critical revision of the article for important intellectual content: M. Sampson, M.T. Ansari, S. Doucette, D. Moher.
Final approval of the article: K.G. Shojania, M. Sampson, M.T. Ansari, D. Moher.
Provision of study materials or patients: M. Sampson.
Statistical expertise: K.G. Shojania, S. Doucette.
Obtaining of funding: M. Sampson, D. Moher.
Administrative, technical, or logistic support: D. Moher.
Collection and assembly of data: M. Sampson, M.T. Ansari, J. Ji.
Shojania KG, Sampson M, Ansari MT, Ji J, Doucette S, Moher D. How Quickly Do Systematic Reviews Go Out of Date? A Survival Analysis. Ann Intern Med. 2007;147(4):224-233. doi:10.7326/0003-4819-147-4-200708210-00179
Clinicians rely on systematic reviews for current, evidence-based information.
This survival analysis of 100 meta-analyses indexed in ACP Journal Club from 1995 to 2005 found that new evidence that substantively changed conclusions about the effectiveness or harms of therapies arose frequently and within relatively short time periods. The median survival time without substantive new evidence for the meta-analyses was 5.5 years. Significant new evidence was already available for 7% of the reviews at the time of publication and became available for 23% within 2 years.
Clinically important evidence that alters conclusions about the effectiveness and harms of treatments can accumulate rapidly.
Systematic reviews have become increasingly common in recent years (1) and are recommended by many as the best sources of evidence to guide both clinical decisions (2) and health care policy (3). For systematic reviews to fulfill these roles, their findings must remain relatively stable for at least several years or effective mechanisms must exist for alerting end users to important changes in evidence. Yet, surprisingly little research has assessed the extent to which systematic reviews become out of date or the rate at which this occurs (4–7). Some organizations, such as the Cochrane Collaboration, recommend updating systematic reviews every 2 years, but few empirical data guide this or other recommendations about updating.
We sought to determine how quickly systematic reviews meet explicitly defined criteria for changes in evidence of sufficient importance to warrant updating. We also sought to identify predictors of “survival time,” the time to such important changes in evidence. Survival time might vary depending on many factors, including the type of question posed by the original review (for example, therapeutic or diagnostic), the types of studies included (for example, randomized trials or observational studies), and whether the systematic review provided quantitative synthesis. To limit such variation, we focused on systematic reviews of randomized, controlled trials that evaluated therapeutic benefit or harm by providing quantitative synthesis (meta-analysis) for at least 1 outcome.
We used a quasi-random process (alphabetical sort order by author) to select 100 systematic reviews that were indexed in ACP Journal Club with an accompanying commentary between January 1995 and December 2005 (with a search date no later than 31 December 2004 to ensure at least 1 full year for new evidence to appear). We chose this sampling frame because ACP Journal Club selects systematic reviews that meet explicit quality standards and are deemed directly relevant to clinical practice (8). We regarded the sample size of 100 as sufficiently large to achieve suitably narrow confidence intervals and to permit evaluation of up to 5 potential predictors of survival.
Eligible reviews evaluated the benefit or harm of a specific drug, class of drug, device, or procedure (invasive procedure or surgery) and included randomized or quasi-randomized, controlled trials. We excluded evaluations of alternative and complementary medicines because the stability of reviews of such therapies might differ substantially from reviews of conventional therapies.
We required that reviews provide a point estimate and 95% confidence interval for at least 1 outcome in the form of a relative risk, odds ratio, or absolute risk difference for binary outcomes and weighted mean differences for continuous outcomes. We excluded meta-analyses of individual-patient data, meta-regressions, and indirect meta-analyses because of the difficulty of determining whether new data would alter previous quantitative results. Two team members independently assessed eligibility, with disagreements resolved by consensus involving a third reviewer. When more than 1 review on the same topic was identified, only the earliest was included.
For each review, searches for new trials included identifying new systematic reviews on the same topic, submitting relevant content terms to the Clinical Queries function in Ovid, applying the Related Articles function in PubMed to the 3 largest and the 3 most recent trials in the original review (up to 6 trials in total), and using Scopus (http://www.scopus.com/scopus/home.url) to identify new randomized trials that cited the original review. When these search strategies yielded no eligible new trials, we conducted more comprehensive electronic searches and reviewed relevant chapters in such sources as Clinical Evidence and UpToDate to ensure that we had not missed new trials.
Team members who had backgrounds in both medicine and clinical research screened citations retrieved by the preceding methods to identify trials that would have met the inclusion criteria in the original review. Retrieved articles were screened in chronological order to ascertain quantitative or qualitative signals for the need for updating. The review protocol stopped when any criteria for updating were met. Each systematic review was discussed in detail, with the final status—signal for updating was or was not detected—adjudicated by consensus (Figure 1).
Includes the search protocols to identify candidate new trials, application of criteria from the original review to identify eligible new trials, meta-analytic pooling of new results with previous meta-analytic results, and identification of new systematic reviews on the same topic or “pivotal trials” (published in 1 of the 5 highest-impact general medical journals or more than 3 times the sample size of the previous largest trial) that met any of our criteria for qualitative signals for updating. An individual reviewer reached a tentative conclusion about the presence of quantitative and qualitative signals for updating, but each review was discussed in detail by the project team to reach a final consensus decision. For reviews that did not have any signals for updating, the group also decided whether the searches had been adequate or whether more comprehensive searching for new trials might be required, including more detailed electronic searching and hand-searching for new trials relevant to the original review.
In designing criteria for comparing new findings with those in a previous review, we adapted methods used by other investigators to address similar problems with comparing 2 sets of results relating to the same question (9–13), such as randomized and nonrandomized studies of the same intervention. These investigators identified conflicting findings among different publications using a combination of quantitative thresholds for differences in effect magnitude and qualitative judgments about the language used to describe the results. We have similarly conceptualized quantitative and qualitative signals of potential changes in evidence that are sufficiently important to warrant updating previous systematic reviews.
Quantitative signals consisted of a change in statistical significance or relative change in effect magnitude of at least 50%. We restricted these changes to those involving 1 of the primary outcomes of the original review or any mortality outcome. We also ignored trivial changes in statistical significance—when the original and updated meta-analytic results both had P values between 0.04 and 0.06—so that quantitative signals of changes in evidence would represent robust indicators of the need to update previous reviews. Quantitative signals were detected by combining data from eligible new trials with the previous results using a fixed-effects approach. Use of fixed-effects models allowed pooling of the new trials with the previous meta-analytic result, as opposed to having to obtain original data from all of the included trials in each of 100 systematic reviews. Although random-effects models are usually preferred to avoid spurious precision in the face of heterogeneity, our goal was to detect potential changes in evidence that would warrant a formal update, not produce exact estimates of the updated results.
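Under a normal approximation on the log scale, this pooling and signal-detection step can be sketched as follows. The relative risks shown are illustrative, not taken from the study, and because the paper does not specify the scale on which the 50% threshold is applied, the sketch applies it to the log effect—one plausible reading among several.

```python
import math

def ci_to_log_effect(point, lo, hi):
    """Convert a ratio estimate (e.g., a relative risk) with its 95% CI
    to a log-scale point estimate and standard error."""
    return math.log(point), (math.log(hi) - math.log(lo)) / (2 * 1.96)

def fixed_effects_pool(estimates):
    """Inverse-variance fixed-effects pooling of (log_effect, se) pairs."""
    weights = [1 / se ** 2 for _, se in estimates]
    pooled = sum(w * e for (e, _), w in zip(estimates, weights)) / sum(weights)
    return pooled, math.sqrt(1 / sum(weights))

def two_sided_p(log_effect, se):
    """Two-sided P value from the normal approximation."""
    z = abs(log_effect) / se
    return 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))

# Previous meta-analytic result (illustrative): RR 0.80 (95% CI, 0.65 to 0.98).
old = ci_to_log_effect(0.80, 0.65, 0.98)
# One hypothetical eligible new trial: RR 1.05 (95% CI, 0.85 to 1.30).
new_trial = ci_to_log_effect(1.05, 0.85, 1.30)

updated = fixed_effects_pool([old, new_trial])
p_old, p_new = two_sided_p(*old), two_sided_p(*updated)

# Signal 1: change in statistical significance, ignoring the trivial case
# in which both P values lie between 0.04 and 0.06.
trivial = 0.04 <= p_old <= 0.06 and 0.04 <= p_new <= 0.06
significance_signal = (p_old < 0.05) != (p_new < 0.05) and not trivial
# Signal 2: relative change of at least 50% in effect magnitude
# (computed here on the log scale, by assumption).
magnitude_signal = abs(updated[0] - old[0]) >= 0.5 * abs(old[0])
```

In this hypothetical case the previously significant benefit loses significance after pooling, so both quantitative signals fire; pooling the new trial against the prior summary estimate, rather than re-extracting all original trial data, is exactly the shortcut the fixed-effects approach permits.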
Qualitative signals included new information about harm sufficient to affect clinical decision making, important caveats to the original results, emergence of a superior alternate therapy, and important changes in certainty or direction of effect. Qualitative signals were detected by using explicit criteria for comparing the language in the original review with descriptions of findings in new systematic reviews that addressed the same topic, pivotal trials, clinical practice guidelines, or recent editions of major textbooks (for example, UpToDate). Pivotal trials were defined as trials that had a sample size at least 3 times larger than that of the previous largest trial or were published in 1 of the 5 highest-impact general medical journals (The New England Journal of Medicine, Lancet, Journal of the American Medical Association, Annals of Internal Medicine, and the British Medical Journal).
We defined 2 levels of importance for qualitative signals: “potentially invalidating changes in evidence,” which would make one no longer want clinicians or policymakers to base decisions on the original findings (such as a pivotal trial that characterized treatment effectiveness in terms opposite of those in the original systematic review), and “major changes in evidence,” which would affect clinical decision making in important ways without invalidating the previous results (such as the identification of patient populations for whom treatment is more or less beneficial). Major changes also included differing characterizations of effectiveness that were less extreme than those for potentially invalidating signals but that would still affect clinical decision making (for example, a change from “possibly beneficial” to “definitely beneficial”). Of importance, such characterizations as “possibly effective,” “probably effective,” and “promising” were all categorized as “possibly effective.” Thus, qualitative signals for changes in evidence captured substantive differences in the characterization of treatment effects, not merely semantic differences. Full definitions for each of the specific signals can be found at http://www.ohri.ca/UpdatingSystRevs.
For each review, we characterized the clinical content area, eligibility criteria for included trials, definitions of reported outcomes, number of included trials and participants, meta-analytic result for each outcome, identification of statistical heterogeneity, and excerpted quotations of the authors' characterizations of the main results. We also abstracted whether a given outcome was explicitly identified as 1 of the “primary” or “main” outcomes. We discounted identification of more than 3 such outcomes as inconsistent with the concept of a primary outcome.
For each systematic review, we defined “birth” as publication date and “death” as the occurrence of a qualitative or quantitative signal for updating. Observations were censored on 1 September 2006, the midpoint of the 4-month period during which searches were done for the entire cohort.
We fit nonparametric Kaplan–Meier curves and used multivariable Cox proportional hazards models to examine the association between survival and various features of the systematic reviews, including clinical content area, number of included trials, identification of heterogeneity, and “activity in the field”—defined as present if the review included at least 1 trial published within the last year of its search period or if the review identified ongoing trials eligible for inclusion. We also assessed a potential predictor known only by reviewing the literature published after publication of the original review: the magnitude of the increase in the number of eligible new trials. In addition to the proportional hazards analysis to estimate predictors of survival, we used logistic regression to identify predictors of survival for less than 2 years. All analyses were done with SAS, version 9.0 (SAS Institute, Cary, North Carolina).
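As a sketch of the survival computation (not the authors' SAS code), a minimal Kaplan–Meier product-limit estimator can be written over (time, event) pairs, where “birth” is the review's publication date, an event is a signal for updating, and event-free reviews are censored at the common censoring date. The observations below are illustrative, not the study data.

```python
def kaplan_meier(observations):
    """Product-limit survival estimate from (time_in_years, event) pairs.
    event=True means a signal for updating occurred at that time;
    event=False means the review was censored without a signal."""
    survival, at_risk = 1.0, len(observations)
    curve = []
    for t in sorted({t for t, _ in observations}):
        deaths = sum(1 for ti, e in observations if ti == t and e)
        censored = sum(1 for ti, e in observations if ti == t and not e)
        if deaths:
            survival *= (at_risk - deaths) / at_risk  # step down at each event
        curve.append((t, survival))
        at_risk -= deaths + censored  # censored reviews leave the risk set
    return curve

# Illustrative cohort: signals at 0 years (already present at publication),
# 1.5, 3.0, and 5.5 years; two reviews censored without a signal.
obs = [(0.0, True), (1.5, True), (3.0, True), (5.5, True),
       (6.0, False), (8.0, False)]
curve = kaplan_meier(obs)
```

A signal at time zero produces an immediate drop in the curve, which is how reviews whose signals had already occurred at publication appear in this kind of analysis; estimating covariate effects (the Cox model step) would require a fitting routine beyond this sketch.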
This work was done under contract with the Agency for Healthcare Research and Quality. The funding source did not have a role in the study design; data collection, analysis, or interpretation; or the decision to submit the manuscript for publication.
A search of the Ovid database for ACP Journal Club retrieved 651 potential systematic reviews. Achieving the target cohort size of 100 reviews necessitated that we screen the first 325 of these records (Appendix Figure 1).
The 100 systematic reviews included a median of 13 studies (interquartile range, 8 to 21) and 2663 participants (interquartile range, 1281 to 8371) (Table 1). Most reviews evaluated drug therapies; the most common clinical content area was cardiovascular medicine (Table 1). We were able to identify at least 1 new eligible trial for 85 systematic reviews, with a median of 4 new trials (interquartile range, 1 to 7) and 1160 patients (interquartile range, 170 to 3689) per review.
A quantitative signal for updating occurred for 20 of the 100 systematic reviews. Qualitative signals occurred for 54 reviews, including 8 that met criteria for potentially invalidating changes in evidence. Qualitative signals were derived from new systematic reviews in 23 cases and from pivotal trials in 25 cases. The primary event of interest, a quantitative or qualitative signal for updating, occurred for 57% of reviews (95% CI, 47% to 67%) in the cohort.
Table 2 (9–22) presents examples of signals for updating. The 3 reviews (9, 11, 13) that had a qualitative signal for “opposing findings” are self-explanatory. For example, in the first case, the original review reported that “for every 20 critically ill patients treated with albumin there is one additional death” (9). A subsequent trial (10) with almost 5 times the sample size of previous trials combined showed no such increase. Of the 2 reviews with important differences in characterization short of “opposing findings,” 1 of them borders on opposing findings (15). For the prevention of stroke in high-risk patients, the original review (15) mentioned that the addition of dipyridamole to aspirin was associated with a nonsignificant 6% reduction in serious vascular events, but it concluded that the “addition of dipyridamole to aspirin produced no significant further reduction in vascular events compared with aspirin alone.” Consistent with our efforts to avoid overcalling changes in evidence, we characterized the original review as consistent with “possible benefit.” Thus, the change from this characterization to the definite benefit reported in a subsequent large trial (16) fell short of opposing (and potentially invalidating) the previous findings but still met our criteria for a major change in the characterization of effectiveness. All 3 examples of reviews with qualitative signals for “opposing findings” and the 2 examples of reviews with important differences in characterization short of “opposing findings” also generated at least 1 quantitative signal.
Table 2 also shows an example of a clinically significant caveat (lack of sustained benefit reported from allergen immunotherapy for asthma) (20) and an example of expansion of evidence to a new patient population (secondary prevention for patients with recent stroke) (22). Some may not consider expansion of benefit for statins from the indications established in the original review (21) to secondary prevention in patients with recent stroke as a major change in evidence. However, as emphasized in the new trial itself (22), the editorial that accompanied it (23), and the commentary in ACP Journal Club (24), this trial was the first to evaluate the effects of statins on patients who had cerebrovascular disease but not known coronary artery disease. The commentaries also characterized this trial as providing evidence for the increasingly widespread practice of adding statins to the standard treatment for patients with acute stroke, recommendations that had previously been derived from analogies with the treatment for cardiac ischemia. Thus, we regarded this new trial as meeting our criterion of expanding the evidence from the original review in a manner that would be expected to affect practice. More detailed explanations and additional examples of signals for updating can be found at http://www.ohri.ca/UpdatingSystRevs.
Median survival free of a signal for updating was 5.5 years (CI, 4.6 to 7.6 years) (Figure 2). For the 57 reviews with signals for updating, median time to event was 3.0 years (interquartile range, 0.9 to 5.1 years). However, a signal for updating occurred within 2 years for 23% of reviews (CI, 15% to 33%) and within 1 year for 15% (CI, 9% to 24%). For 7% of reviews (CI, 3% to 14%), a signal had already occurred at the time of publication. Even with restriction only to quantitative signals or “potentially invalidating changes in evidence,” signals for updating occurred within 2 years for 15% of reviews. Restricting the analysis solely to quantitative signals, 12% of reviews had signals for updating within 2 years and 7% within 1 year, including 4 reviews for which the quantitative signal had already occurred at publication.
The immediate decrease in survival at time zero reflects the 7 systematic reviews for which signals for updating had already occurred at the time of publication. The low number of reviews at risk after 10 years reflects the fact that the sample spanned 1995 to 2005 and censoring occurred on 1 September 2006. Thus, only reviews published before September 1996 and having no signals for updating could have more than 10 years of observation.
In univariate analyses (Table 3), shorter survival was associated with a clinical content area of cardiovascular medicine (hazard ratio, 2.58 [CI, 1.39 to 4.78]) (Appendix Figure 2) and an increase in the total number of patients by a factor of 2 or more (hazard ratio, 1.79 [CI, 1.03 to 3.10]) (Appendix Figure 3). Multivariate analysis produced 3 noteworthy changes to these results: heterogeneity in the original review became a statistically significant predictor for a signal for updating (hazard ratio, 2.15 [CI, 1.12 to 4.11]), an increase in the total number of patients by a factor of 2 or more lost statistical significance as a predictor, and including more than the median of 13 trials became a borderline statistically significant predictor of increased survival (hazard ratio, 0.56 [CI, 0.30 to 1.03]; P = 0.06).
The 5 variables shown in Table 3 represent those we had considered a priori as the most plausible potential predictors. Other potential predictors that were tested in secondary analyses included the source of the systematic review (Cochrane vs. non-Cochrane), number of participants greater than the median of 2663, detection or suspicion of publication bias in the original review, and several variables related to increases in the number of trials or participants in the literature since the original review. None of these features statistically significantly predicted signals for updating.
No variable statistically significantly predicted a signal for updating within 2 years. However, cardiovascular topics showed a nonsignificant increase in the odds of a signal for updating within 2 years (odds ratio, 2.67 [CI, 0.88 to 8.10]; P = 0.08), as did an increase in the total number of patients by a factor of 2 or more (odds ratio, 2.29 [CI, 0.84 to 6.25]; P = 0.11). Sensitivity analyses involving different time frames, such as occurrence of a signal within 3 years, yielded similar results.
The median time between the end of the search period and the publication date for a systematic review was 1.1 years (interquartile range, 0.8 to 1.7 years). Time to publication did not differ substantially between Cochrane and journal reviews, nor did it decrease statistically significantly from 1995 to 2005.
When survival analyses were repeated by using the end of the search period as “birth,” median survival was 6.9 years (CI, 6.1 to 9.0 years). A signal for updating occurred within 3 years of the search for 20% of reviews (CI, 13% to 29%), within 2 years for 11% (CI, 6% to 19%), and within 1 year for 4% (CI, 1% to 11%). Predictors of survival did not differ from those identified in the analysis that used publication date as “birth.”
In a cohort of high-quality systematic reviews directly relevant to clinical practice, median survival was 5.5 years. However, signals for updating occurred within 2 years for 23% of reviews, within 1 year for 15%, and before publication for 7%. We found several statistically significant predictors of signals for updating, but no features predicted which reviews would require attention within 2 years.
Our results indicated a far greater need for updating than the only other such evaluation, a comparison of Cochrane reviews from 1998 with their updates in 2002 that reported important changes in conclusions in just 9% of reviews (4). Of importance, that study relied exclusively on interpretations of new evidence by authors of the original review, who might be disinclined to find new evidence or report important changes. Also, only 70% of Cochrane reviews had updates. It is possible that reviewers were less likely to update reviews that involved large increases in the number of new trials or major changes in conclusions, given the greater work involved. Finally, Cochrane reviews differ in important respects, such as clinical topic coverage, from other peer-reviewed systematic reviews (25).
We restricted our cohort to systematic reviews of randomized trials of conventional drugs, devices, or procedures that reported meta-analytic results for at least 1 dichotomous outcome. Our exclusion of qualitative reviews, reviews of nontherapeutic topics, meta-analyses of individual-patient data, and meta-regressions reflected our concern that rates of change in evidence might differ across these different types of reviews. Thus, we acknowledge that our results may not generalize to all reviews. That said, as shown in Appendix Figure 1, excluding the records retrieved by our initial electronic search that were not systematic reviews, 139 of the first 287 systematic reviews (48%) were eligible for inclusion. Thus, although our cohort may seem highly selected, approximately half of the reviews indexed in ACP Journal Club were eligible for inclusion in our cohort. Granted, ACP Journal Club itself represents a nonrandom sample of all systematic reviews insofar as it selects reviews that meet certain quality standards and have high potential to affect clinical practice. However, these biases strengthen our results because such reviews represent those one would hope to have the greatest stability.
The main limitation of our findings is that the assessments of the need to update previous reviews did not involve input from experts in the relevant content areas. However, our approach of having investigators apply explicit qualitative and quantitative criteria to compare 2 sets of results addressing the same question of interest represents the norm in methodological research of this type (26–32). The notable exception was an evaluation of the average shelf life of clinical practice guidelines (7). By choosing a few guidelines (17 in total) produced by a single agency, the investigators were able to ask the authors of the original guidelines to assess changes in evidence. Using such an approach was not feasible for our analysis of a much larger sample of 100 systematic reviews. However, we chose quantitative signals of changes in evidence that few would question as important and used explicit criteria for comparing the language of new findings with those of the original review. Moreover, we used expert sources, such as editorials and textbooks, to confirm our assessments wherever possible.
It is also important to note that our judgments concerned signals of the need to update previous systematic reviews, not definitive judgments about actual changes in evidence. If a previous review concluded that a treatment was effective and a trial in a high-impact journal concluded that the treatment had no benefit, we would count the new result as a signal for updating. We regard such a signal as reasonable for 2 reasons. First, a formal update that incorporated the new evidence might in fact yield conclusions that differ substantially from those of the previous review. Second, even if a formal update would not change the conclusions, the publication of a new trial in a high-impact journal would raise important questions for clinicians about the previous review. In fact, they might preferentially act on the trial's conclusions precisely because it appeared in a high-impact journal. Thus, it would be important to reassert the findings of the original review in an update that explicitly addressed the new evidence.
Ideally, readily discernible features of systematic reviews would indicate whether major changes in evidence were likely to appear within short time frames. Although several features statistically significantly predicted survival, no features adequately distinguished reviews that would require updating within 2 years from those that would not. Our modest sample size of 100 reviews limited our ability to test predictors of survival. However, it is unlikely that we would miss associations of the magnitude required to identify reviews that will probably require updating within short time frames with useful positive and negative predictive values.
Our results have important implications for those who produce, publish, and use systematic reviews. Publishers probably cannot reduce the time for the peer review and publication processes for systematic reviews beyond the benchmarks already attempted for submissions of all types. However, authors might consider submitting their work to the journals that are most likely to accept a given review to avoid delays because of multiple iterations of the peer review process. When the process of submission and rejection from other journals has resulted in the passage of more than 1 year from the date of the previous search, authors should update the search before resubmission, as we found that only 4% of reviews had signals for updating within 1 year of the previous search date. In fact, journals might consider requiring that authors update searches more than 1 year old before submitting systematic reviews. Finally, users of systematic reviews need to recognize that changes in evidence relevant to clinical decision making can occur within relatively short time frames. Once the search date is older than even 1 year, users should check for more recent trials on the same topic to see whether new evidence has altered the findings of a given systematic review. In some cases, such changes will already have occurred at the time of publication.