Lisa Hartling, MSc; Finlay A. McAlister, MD, MSc; Brian H. Rowe, MD, MSc; Justin Ezekowitz, MB, BCh, MSc; Carol Friesen, MA, MLIS; Terry P. Klassen, MD, MSc
Acknowledgments: The authors thank John Russell and Michelle Tubman for administrative and technical support.
Grant Support: This paper was produced by the University of Alberta Evidence-based Practice Center under contract to the Agency for Healthcare Research and Quality, Rockville, Maryland. Dr. McAlister is supported by the Alberta Heritage Foundation for Medical Research and is the University of Alberta/Merck Frosst/Aventis Chair in Patient Health Management; Drs. Ezekowitz and McAlister are supported by the Canadian Institutes of Health Research (CIHR); and Dr. Rowe is supported by a Canada Research Chair from the CIHR.
Potential Financial Conflicts of Interest: Authors of this paper have received funding for Evidence-based Practice Center reports.
Requests for Single Reprints: Finlay A. McAlister, MD, MSc, 2E3.24 Walter Mackenzie Health Sciences Centre, University of Alberta, 8440 112 Street, Edmonton, Alberta T6G 2R7, Canada; e-mail, Finlay.McAlister@ualberta.ca.
Current Author Addresses: Ms. Hartling: Aberhart Centre One, Room 9424, 11402 University Avenue, Edmonton, Alberta T6G 2J3, Canada.
Dr. McAlister: 2E3.24 Walter Mackenzie Health Sciences Centre, University of Alberta, 8440 112 Street, Edmonton, Alberta T6G 2R7, Canada.
Dr. Rowe: 1G1.43 Walter Mackenzie Health Sciences Centre, University of Alberta Hospital, 8440 112th Street, Edmonton, Alberta T6G 2B7, Canada.
Dr. Ezekowitz: 2-51 Medical Sciences Building, University of Alberta, Edmonton, Alberta T6G 2H7, Canada.
Ms. Friesen: Aberhart Centre One, Room 9420, 11402 University Avenue, Edmonton, Alberta T6G 2J3, Canada.
Dr. Klassen: 2C3.00 Walter Mackenzie Health Sciences Centre, University of Alberta, 8440 112 Street, Edmonton, Alberta T6G 2B7, Canada.
Hartling L, McAlister FA, Rowe BH, Ezekowitz J, Friesen C, Klassen TP. Challenges in Systematic Reviews of Therapeutic Devices and Procedures. Ann Intern Med. 2005;142:1100-1111. doi: 10.7326/0003-4819-142-12_Part_2-200506211-00010
The authors discuss 3 challenges in conducting and interpreting any systematic review that are particularly relevant for systematic reviews of therapeutic devices or surgical procedures: 1) inclusion or exclusion of grey literature, 2) the role of nonrandomized studies, and 3) issues in applying the results to clinical care that are unique to the surgical and therapeutic device literature. The authors also discuss empirical evidence related to these topics and illustrate how reviewers in the Agency for Healthcare Research and Quality's Evidence-based Practice Center program have dealt with these challenges in developing evidence reports for decision makers and clinicians about therapeutic devices or surgical procedures.
Therapeutic devices and surgical procedures are often evaluated in nonrandomized studies or small single-center trials. While some may question whether systematic reviews in this area should be performed given the limitations of such studies, we believe that these reviews should play a key role in helping to inform decisions about the implementation of new technologies or procedures. By summarizing the available evidence, systematic reviews can highlight gaps in the evidence base that clinicians and policymakers require to make informed decisions.
We highlight 3 challenges that may arise in conducting and interpreting any systematic review that evaluates the efficacy or effectiveness of therapeutic devices or surgery. We also review the empirical evidence relevant to these methodologic issues in general and present the evidence specific to devices or procedures where it is available. We do not focus on the importance of assessing study quality because another article in this supplement addresses this issue (1).
In this article, we use the U.S. Food and Drug Administration (FDA) definition of a therapeutic device (2) as “an instrument, apparatus, implement, machine, contrivance, implant, in vitro reagent, or other similar or related article, including a component part, or accessory which is . . . intended to affect the structure or any function of the body . . . and which does not achieve any of its primary intended purposes through chemical action within or on the body . . . and which is not dependent upon being metabolized for the achievement of any of its primary intended purposes.”
Therapeutic devices and surgical procedures obviously vary greatly in complexity and costs; however, the issues we discuss here are universal irrespective of the type of device or procedure. They include the following: 1) Should grey literature be included in systematic reviews of devices or procedures? 2) Should nonrandomized studies be included in systematic reviews of devices or procedures? and 3) What applicability issues are unique to studies of devices or procedures?
In discussing these issues, we use examples to illustrate how reviewers in the Evidence-based Practice Center (EPC) program and other authors dealt with them.
Grey literature generally refers to reports that are difficult to locate or retrieve by using the electronic databases commonly employed to identify studies for inclusion in systematic reviews (for example, MEDLINE, EMBASE, and CINAHL). A common misconception is that the grey literature is a homogeneous collection of works. Rather, it includes many different types of documents that can vary substantially in design, quality, and extent of peer review (including internal company reports, documents submitted to the FDA, theses and other dissertations, conference abstracts, book chapters, personal correspondence, and even personal Web pages or blogs) (3, 4). Table 1 highlights several online resources for grey literature databases, such as SIGLE (System for Information on Grey Literature in Europe) and Web sites from North America and Europe. While abstracts are the most common type of grey literature included in systematic reviews (5-7), the type most relevant to systematic reviews of devices is reports from manufacturers or the FDA. The FDA enforces regulations to ensure the effectiveness and safety of a medical device before granting marketing clearance (2). The FDA reviews of the safety and efficacy data collected through this process for approved devices are publicly available (except for data considered proprietary or confidential) and generally contain far more detail than is typically presented in journal publications.
A second common misconception is that grey literature is static. Given the substantial time lags between completion of studies and their publication in medical journals (4.2 to 4.8 years for studies with significant results and 6.4 to 8.0 years for those with nonsignificant results), it is not surprising that some studies initially identified as “grey literature” have been published by the time a systematic review is published or read (8, 9).
While the goal of a systematic review should be to compile the evidence in an unbiased manner, opinions on the appropriateness of including grey literature in a review differ. For example, while 78% of meta-analysts stated that unpublished data should definitely or probably be included in systematic reviews, only 47% of journal editors agreed; 30% of editors reported that they would not publish a review that included unpublished data (10). Of the first 988 systematic reviews published in the Cochrane Library, 56% included grey literature; in most cases, however, the unpublished information merely provided data that supplemented published studies (11). Among the 27 evidence reports on devices and surgery produced through the EPC program in 2004, only 9 included any grey literature (Appendix Table).
Proponents of searching for and including grey literature in all systematic reviews argue that excluding grey literature may result in biased estimates of the effectiveness of an intervention (7, 12, 13). While the existence of publication bias (that is, significant results are more likely to be published, and more likely to be published in English, than nonsignificant findings) has been well documented (5, 13), the question should really be framed in terms of whether including grey literature removes the potential for bias related to sample size (which includes publication bias as well as bias arising from systematic differences in methodologic quality). Indeed, it could be argued that including grey literature in a systematic review may introduce bias if the search for grey literature is not systematic, if only some of the grey literature is uncovered, or if only low-quality trials are uncovered (5, 10). Thus, it is not surprising that even systematic reviews that include unpublished trials can still demonstrate substantially asymmetric funnel plots (a graphical indication that sample size–related bias may exist) (5).
If we accept that including the grey literature does not remove the potential for sample size–related bias in a systematic review, a second (and perhaps more relevant) question is whether inclusion of grey literature substantially affects the results of systematic reviews. For example, of 159 systematic reviews reporting comprehensive searches for grey literature, only 38% found any unpublished trials and only 9% of the 1635 trials eventually included in these reviews were unpublished (indeed, despite the comprehensive search strategies, only 10% “were published in a journal not indexed in MEDLINE”) (5). Although the amount and importance of grey literature will vary by topic area (Table 2), 3 empirical studies comparing systematic reviews across a wide variety of topic areas that did include grey literature versus those that did not found little difference between the effect estimates derived from published trials and those derived from published and unpublished trials (5, 14, 15). Furthermore, in stratified analyses of 159 systematic review comparisons, nondrug interventions showed less difference between published and unpublished trials than drug interventions (5). Of course, these are only 3 analyses. Given the heterogeneity between the systematic reviews within each of these studies, it is appropriate to acknowledge that in some areas of health care, particularly those in which there is little published evidence or the intervention is new or changing (as is frequently the case for devices or surgical procedures), discrepancies between published and “grey” trials may be sufficient to justify devoting resources to systematically searching for grey literature.
In addition to weighing the often substantial opportunity costs of devoting time and resources to searching for grey literature, 3 other concerns drive us not to recommend routinely including unpublished data in systematic reviews. First, results presented in abstract form may be inaccurate or unhelpful. For example, only 33% of abstracts presented at a pediatric surgical meeting contained the same data as the subsequent publication, and the conclusions were similar in only 70% of the abstract–manuscript pairs (the conclusions were consistently weaker in the full manuscript than in the abstract) (16). Moreover, there may be a “window of opportunity” during which abstracts are potentially useful: Early abstracts may assist in identifying randomized, controlled trials (RCTs) but may have only preliminary, incomplete, or inaccurate data, while late abstracts may not provide any data in addition to those already published (17).
Second, it is sometimes difficult to judge the quality of reports in the grey literature, an essential part of any systematic review since low-quality trials are associated with overestimates of treatment effects (18). Indeed, assessment of trial quality is especially pertinent for surgical trials, which are often small and difficult to blind (19, 20). For example, only 2% of abstracts of randomized trials presented at the American Society of Clinical Oncology conferences reported the method of allocation concealment, and only 14% reported the method of blinding (21). Indeed, reflecting our skepticism about data presented in abstract form, only 3 of the 27 EPC reports on devices and surgical procedures accepted abstracts for inclusion. While some may argue that this problem is unique to abstracts, it has been shown that even FDA reports are less likely to appropriately describe methods of randomization, blinding, and allocation concealment than published journal articles (15).
Third, we believe that readers should be skeptical about unpublished data if the data are provided directly by the manufacturer of the device without the opportunity for peer review (22). It has already been well documented that industry-funded research is less often published or presented (23), takes longer to be published when it is accepted for publication (23), and is almost 4 times more likely to report outcomes favoring the sponsor than studies without industry funding (24).
In sum, while in an ideal world reviewers would attempt to identify all relevant unpublished literature (and would be successful in doing so), time and resource constraints often compromise the identification and inclusion of grey literature. The search for unpublished trials should become easier as prospective trial registries (for example, Current Controlled Trials [http://www.controlled-trials.com]) become fully functional; however, until these registries attain their goal of capturing 100% of trials, reviewers must continue to carefully consider whether grey literature is likely to be common and influential for their topic of interest. A particularly useful source of information related to devices is available through unpublished FDA reports. We believe that grey literature, including FDA data, should be sought when little evidence for a topic has been published (15) and when the intervention is new or changing, but that exhaustive searches of the grey literature are less necessary when large trials have already been published (15, 25). Regardless of the approach taken, researchers undertaking a systematic review should explicitly state whether they sought or included grey literature (26), and they should conduct sensitivity analyses to assess the impact of grey literature on treatment effect when they include unpublished studies (10). If reviewers choose to search for grey literature, they should use a systematic approach that targets known sources of grey literature rather than relying on select references to which the reviewer has been alerted in an ad hoc manner. Reviewers should also evaluate the findings with respect to the quality of reports, be they published or unpublished (10), and to the sponsoring agency (24). Regardless of whether a review contains grey literature, reviewers should evaluate and discuss the possibility of sample size–related bias and present the results and recommendations in light of the potential for bias (26).
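As a minimal sketch of the sensitivity analysis recommended above, the pooled estimate can be computed twice with standard inverse-variance fixed-effect weighting, once on the published trials alone and once after adding the grey-literature trials. All trial data below are hypothetical, chosen only to illustrate how smaller, less favorable unpublished trials can shift a pooled estimate toward the null:

```python
import math

def pool_fixed_effect(trials):
    """Inverse-variance fixed-effect pooling of log odds ratios.

    Each trial is a (log_odds_ratio, standard_error) pair."""
    weights = [1.0 / se**2 for _, se in trials]
    pooled = sum(w * lor for (lor, _), w in zip(trials, weights)) / sum(weights)
    se_pooled = math.sqrt(1.0 / sum(weights))
    return pooled, se_pooled

# Hypothetical trials: (log odds ratio, standard error)
published = [(-0.40, 0.15), (-0.35, 0.20), (-0.50, 0.25)]
grey      = [(-0.05, 0.30), (0.10, 0.35)]  # smaller, less favorable unpublished trials

lor_pub, se_pub = pool_fixed_effect(published)
lor_all, se_all = pool_fixed_effect(published + grey)

# Report pooled odds ratios with 95% CIs for both analyses
for label, lor, se in [("published only", lor_pub, se_pub),
                       ("published + grey", lor_all, se_all)]:
    lo, hi = lor - 1.96 * se, lor + 1.96 * se
    print(f"{label}: OR {math.exp(lor):.2f} "
          f"(95% CI {math.exp(lo):.2f} to {math.exp(hi):.2f})")
```

With these invented numbers, adding the grey-literature trials moves the pooled odds ratio closer to 1.0 and narrows the confidence interval; reporting both estimates side by side is one concrete way to convey how much the unpublished studies influence the treatment effect.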
Nonrandomized studies include experimental studies (such as quasi-randomized trials) and observational studies with controls (such as controlled before–after studies, concurrent cohort studies, and case–control studies) or without concurrent controls (such as before–after studies, cross-sectional studies, and case series) (27). The fact that most published articles (68% to 87% of feature articles and brief communications in Annals of Internal Medicine, BMJ, and The New England Journal of Medicine) are nonrandomized studies underscores the value placed on this type of research by clinicians (28). Historically, there has been little randomized-trial evidence in the areas of devices and procedures, and it is estimated that RCTs account for less than 10% of the evidence base for surgical interventions (19). In fact, of 7295 trials indexed as RCTs in MEDLINE from 1990 to 1996, only 1% had device or devices as key words (29); of 9373 references in MEDLINE for pediatric surgery, only 0.3% were RCTs (30).
The inclusion of study designs other than RCTs in systematic reviews of therapy has been discouraged for decades given concerns about the various and well-described biases inherent in studies with nonrandomized designs (27, 31, 32). The extent of bias associated with different nonrandomized study designs, however, can vary tremendously and is at times unpredictable in direction and magnitude (33). A comprehensive review compared results obtained from randomized and nonrandomized studies for 82 clinical topics (27). Among the 8 meta-epidemiologic studies reviewed, 2 studies found close agreement in their estimates of treatment effect (34, 35), while the other 6 found that randomized and nonrandomized studies produced different results; for 5 of these studies, however, the differences were not consistent in direction (33, 36-40). The findings are likewise inconsistent when restricted to surgical interventions: Randomized and nonrandomized studies do not consistently differ in their estimates, and the design that produces the more extreme result varies. Little evidence is available to compare results from randomized and nonrandomized studies for devices; therefore, no specific conclusions can be drawn. On the basis of the existing evidence, Deeks and colleagues (27) could not draw firm conclusions regarding the value of randomization because of the conflicting results and limitations in the studies reviewed. In their analyses, Deeks and colleagues used a resampling procedure to assess the effect of comparing patients within 2 multicenter trials (1 of which was a large trial of endarterectomy) with nonrandom concurrent or historical controls. They found that the biases associated with these 2 designs can significantly affect the results of a systematic review and that the effects are sensitive to differences in case mix.
In the surgical example, historical controlled studies overestimated the benefits of endarterectomy, while the concurrent controlled studies produced results similar to those of the RCTs (Figures 1 and 2). Although multivariate models can be used to adjust for differences in covariates and case mix between comparison groups in nonrandomized studies, even advanced statistical techniques, such as instrumental variable analyses, can never completely remove concerns about confounding by indication (41).
The distribution of results indicates systematic bias (average odds ratios in the randomized, controlled trials and historical controlled studies were 1.23 and 1.06, respectively). Adapted with permission from reference 27: Deeks JJ, Dinnes J, D'Amico R, Sowden AJ, Sakarovitch C, Song F, et al. Evaluating non-randomised intervention studies. Health Technol Assess. 2003;7:iii-x, 1-173.
The distribution of results revealed that 9% of studies within each design had statistically significant findings. Adapted with permission from reference 27: Deeks JJ, Dinnes J, D'Amico R, Sowden AJ, Sakarovitch C, Song F, et al. Evaluating non-randomised intervention studies. Health Technol Assess. 2003;7:iii-x, 1-173.
While RCTs are usually cited as the highest level of evidence for judging the efficacy of therapeutic interventions (42), “randomization should not be seen as a reliable proxy for overall quality” (43). Indeed, well-conducted nonrandomized studies may be more valid than poorly conducted RCTs (36, 44). In some situations, moreover, RCTs are unethical or impractical, and clinicians and policymakers must rely on lower levels of evidence (45). Certainly, nonrandomized studies may provide evidence that complements RCTs (particularly concerning issues of effectiveness in clinical practice versus efficacy in the trial setting) (33, 44, 46). Furthermore, for questions on patient safety of a new intervention, observational studies may in fact be a better source of evidence than RCTs (which are almost never sufficiently powered to detect rare adverse events because of inadequate sample size or duration of follow-up) (47, 48). These RCTs also tend to enroll patients who are younger and healthier than those usually encountered in clinical practice, which may lead to underestimates of adverse event rates when the intervention is applied in real-world settings. Indeed, 14% of trials published in 7 general medical journals in 1997 did not refer to adverse effects at all, 38% did not provide information on adverse effects by treatment group, and 46% provided no details on the severity of adverse effects (49). Thus, it is not surprising that only 27% of 2467 published systematic reviews evaluated safety as a secondary outcome (and only 4% included it as a primary outcome) (47). Nonrandomized studies were included for safety outcomes in all 18 EPC reports on devices and surgery that evaluated safety (Appendix Table).
Randomized, controlled trials are also often insufficient to assess safety outcomes because of inadequate or differing definitions of adverse events and severity (45, 49-54); variable or inadequate methods of monitoring or detection within trials (51, 52, 55, 56); or poor reporting of the numerators and denominators in safety data (45, 49-51, 53, 54, 57). For example, an analysis of 82 studies reported that these studies used 41 different definitions and 13 different grading scales for surgical wound infection (58). Indeed, definitions of “surgical death” often vary among studies, and deaths after discharge from the hospital are not often reported (59). Recognizing these flaws, trial reporting guidelines now contain specific recommendations for documenting information on adverse events and effects (60).
Many possible sources other than RCTs provide evidence on safety. Although large observational studies are necessary to evaluate rare, serious adverse effects (61, 62), we believe that studies with noncomparative designs should not be included in systematic reviews because of their inability to establish causality (45). Thus, while postmarketing surveillance studies can be useful (47, 50), information from this source should be used cautiously because reporting may be nonsystematic (50), rare adverse effects may be underreported (61), and the lack of a control group may lead to overestimates of the incidence of common, less serious adverse effects that are not directly related to the specific therapy (63).
Evidence from nonrandomized studies may be particularly relevant in areas lacking RCT evidence: Seven percent of systematic reviews in the Cochrane Database of Systematic Reviews (those published in The Cochrane Library, Issue 2, 2004) found no randomized trials and concluded that there was no evidence upon which to make informed decisions. Randomized evidence is especially lacking for devices (29) and surgical procedures (19, 30, 64, 65)—for example, nearly one quarter of systematic reviews of orthopedic surgery reported no RCTs in their specific topic area (66).
The lack of RCT evidence for medical devices or surgical procedures arises from several factors (29). First, the FDA does not require RCTs for new devices (67), in contrast to the general requirement of 2 RCTs for new-drug applications (29). Second, unlike drug trials, trials of devices or procedures require clinicians with specific training or skills (29). Once a surgical procedure has been developed, refined, and standardized to the point at which it would be appropriate to be evaluated in a randomized trial, collective clinical equipoise may no longer be present (68, 69). Furthermore, despite examples of the usefulness of sham surgeries in demonstrating the lack of efficacy of certain surgical procedures (such as arthroscopic knee surgery, internal mammary artery ligation [71, 72], and intracerebral fetal tissue grafting in older patients with Parkinson disease), they are controversial because they expose “control” participants to surgical risks without any potential for benefit (74, 75). Finally, 2 facets of the device industry limit rigorous evaluation. In general, the device industry consists of small enterprises that may lack economic resources to do research of any kind (29, 67). Furthermore, the constant and rapid evolution of devices makes it difficult to determine the optimal time for evaluation (67). In the early stages of development, there may still be uncertainties over application of the technology (for example, which patients are most likely to benefit and how the device will interact with other evolving technologies) (69) and less experience and familiarity with the technical skills necessary for optimal outcomes (76). If the evaluation comes too late, the device may have already been adopted into clinical practice and consumer expectations (69).
In 21 EPC reports, reviewers included nonrandomized evidence for evaluating efficacy: 20 of these included studies without concurrent controls. Moreover, 7 EPC reports relied solely on nonrandomized evidence for evaluation of some surgical interventions: liver transplantation (77); total knee replacements (78); and procedures for managing uterine fibroids (79), chronic central neuropathic pain after spinal cord injury (80), breast abnormalities (81), incidental adrenal masses (82), and cataracts or glaucoma (83).
Reviewers should consider several practical issues when deciding whether to include nonrandomized studies in systematic reviews. First, because nonrandomized studies make up most of the medical literature and the indexing of designs other than RCTs is less precise and reliable (84), the number of studies identified by an open literature search is dramatically greater than the number identified by a search restricted to RCTs. Second, controversy exists over the appropriateness of performing a meta-analysis of nonrandomized data and the concern that such analyses may produce spurious results (85-87). Third, the assessment of methodologic quality, an essential part of any systematic review, is more problematic for nonrandomized studies than for RCTs. For example, 1 report identified 194 quality assessment tools for nonrandomized studies but found that almost all were flawed (27). Indeed, only 2 of the 25 EPC evidence reports evaluating devices or surgery that included nonrandomized studies used validated tools to assess methodologic quality. In most cases, the EPC reviewers generated their own list of quality criteria. Finally, inclusion of nonrandomized studies has implications for data extraction (for example, different forms may be required for different designs) (88). Despite these challenges, the development of reporting guidelines for meta-analyses of observational studies is an important evolution in the area of systematic reviews, and increasing numbers of meta-analyses of nonrandomized studies are being produced (87).
In sum, when enough RCTs are available to examine the efficacy of a given intervention, these should form the evidence base for decision making (27). Nonrandomized studies should be used to complement RCT evidence when information on long-term effects and safety outcomes is required. When RCTs are unavailable, reviewers should identify reasons for lack of RCT evidence and review the evidence from nonrandomized studies. Reviewers should define the designs to be included a priori and document them in the review protocol. Reviewers should also present nonrandomized evidence in the context of potential biases and discuss the likely influence of these biases on treatment effect estimates. When nonrandomized studies are included, methodologically stronger studies (89) should be considered first—in particular, we believe that inclusion of a control group (preferably concurrent) is essential to allow valid conclusions to be drawn from nonrandomized studies. All evidence, be it from randomized trials or not, should be graded for methodologic quality by using components that are informed by empirical evidence and, where possible, by using validated methods. Results from systematic reviews based on nonrandomized studies need to be interpreted on a case-by-case basis and should consider the magnitude and consistency of the observed effects, as well as the biases and limitations inherent in different study designs. Existing guidelines should be followed in reporting the results from systematic reviews of nonrandomized studies (87).
After deciding that a research study (whether a nonrandomized study, an RCT, or a systematic review) is internally valid, clinicians or policymakers must then decide whether the evidence applies to their patients or situation, respectively. This is no easy task, even when the evidence is not hampered by the all-too-common problems of imprecision due to small numbers, brief follow-up periods, surrogate outcomes of limited clinical relevance, and restricted enrollment of highly select subsets of patients. These threats to applicability are common to studies of drugs and medical interventions, as well as those of devices and surgical interventions; other articles discuss these problems fully (90, 91). In this section, we focus on threats to applicability that are unique to studies evaluating devices or procedures.
First and foremost, patient eligibility criteria are of particular importance in interpreting studies (and systematic reviews) of devices and surgical procedures. This encompasses 3 key points. First, devices or procedures should be considered only for patients similar to those in whom they have been tested. For example, an EPC report demonstrated that cardiac resynchronization therapy was efficacious in patients with heart failure who have underlying bundle-branch block and low ejection fraction. Whether this intervention benefits patients with heart failure who do not exhibit these features is unknown (indeed, benefit is doubtful since the device addresses the hemodynamic problems resulting from bundle-branch block).
Second, and perhaps less apparent, the assumption that “overall trial results apply to most patients with that condition” (91) does not hold for studies of surgical or device interventions. Thus, while the relative benefits of drugs generally do not differ across patient subgroups, at least across the usual spectrum of underlying risks (92-94), the relative reductions in mortality associated with surgical procedures such as carotid endarterectomy and coronary artery bypass grafting, and with devices (such as those providing cardiac resynchronization therapy), are all greater in patients at higher baseline risk (95-97). This paradox arises because, for surgical or device interventions, the potential long-term positive effects on the outcome of interest are balanced against short-term negative effects on that same outcome: Since periprocedural risks are absolute and similar irrespective of the patient's long-term baseline risk for the outcome, the long-term relative benefits are greater in patients who are more likely to develop the outcome without intervention. For example, the risk for death with implantation of a cardiac resynchronization device is the same—0.4% (95% CI, 0.2% to 0.7%)—for patients who have a 1% 1-year mortality risk without cardiac resynchronization therapy and for patients with a 10% 1-year mortality risk (97).
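The arithmetic behind this paradox can be sketched directly. In the example below, the 0.4% periprocedural mortality and the 1% and 10% baseline risks come from the cardiac resynchronization example above, while the 25% relative risk reduction is purely a hypothetical illustration:

```python
def net_mortality_benefit(baseline_risk, relative_risk_reduction, procedural_risk):
    """Absolute mortality reduction over 1 year, net of up-front procedural deaths.

    baseline_risk: 1-year mortality without the device
    relative_risk_reduction: proportional reduction conferred by the device
    procedural_risk: absolute periprocedural mortality (same for all patients)
    """
    long_term_gain = baseline_risk * relative_risk_reduction
    return long_term_gain - procedural_risk

PROCEDURAL_RISK = 0.004          # 0.4% implantation mortality (see text)
RRR = 0.25                       # hypothetical 25% relative risk reduction

for baseline in (0.01, 0.10):    # 1% vs 10% 1-year mortality without the device
    net = net_mortality_benefit(baseline, RRR, PROCEDURAL_RISK)
    print(f"baseline risk {baseline:.0%}: net absolute benefit {net:+.2%}")
```

With these hypothetical numbers, the low-risk patient experiences net harm (about -0.15 percentage points) while the high-risk patient gains about 2.1 percentage points, even though the relative benefit and the procedural risk are identical for both.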
Third, while we can often safely assume that drugs beneficial in young patients (such as antihypertensive agents or statins) are also beneficial in older patients with the same target conditions, this assumption does not hold for surgical procedures or devices. For example, while the perioperative mortality rate in 2 trials that proved that carotid endarterectomy prevents stroke in patients with high-grade carotid stenosis was only 0.1% to 0.6% (98, 99), both trials restricted enrollment to younger patients, and population-based studies have shown that average perioperative mortality rates in older patients are substantially higher than in younger patients (1.9% to 3.6%) (100, 101). However, a recent analysis of data from acute care hospitals in 7 states showed that carotid endarterectomy procedure rates increased more in older patients than in younger patients after publication of these trials (100).
In addition to considering the patient eligibility criteria in device or surgical trials, the reviewer should also focus on the eligibility criteria for providers and institutions, because abundant literature has shown clear relationships between hospital and physician volume and outcomes (102-104). For example, both of the carotid endarterectomy trials cited earlier were conducted in large-volume hospitals by surgical teams with low perioperative complication rates (98, 99). Indeed, the benefits in both trials were highly sensitive to perioperative complication rates: it is estimated that the relative risk reduction for disabling stroke with carotid endarterectomy decreases by 20% for every 2-percentage-point increase in the absolute rate of perioperative stroke (105). Although both groups of trialists explicitly cautioned that “readers not apply our conclusions too broadly . . . the study surgeons were selected only after audits . . . confirmed a high level of expertise” (98, 99), subsequent analyses of carotid endarterectomy procedures in the United States have shown that most endarterectomies are now performed by surgical teams whose complication rates and operative volumes would have rendered them ineligible for the trials (100, 106). Not surprisingly, in-hospital mortality rates after carotid endarterectomy are almost 10-fold higher in the “real-world” setting than in the trials included in the systematic review.
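The sensitivity estimate cited above (105) can be read as a linear extrapolation from the trial setting to higher-complication settings. The sketch below is illustrative only: the anchor values (a trial-level RRR of 65% at a 2% perioperative stroke rate) are hypothetical, not figures reported in the trials or the review.

```python
# Illustrative linear extrapolation of the sensitivity estimate cited above (105):
# the RRR for disabling stroke falls by 20 percentage points for each
# 2-percentage-point rise in the perioperative stroke rate. The anchor values
# (a trial RRR of 65% at a 2% perioperative stroke rate) are hypothetical.

TRIAL_RRR = 0.65
TRIAL_COMPLICATION_RATE = 0.02

def extrapolated_rrr(complication_rate):
    """Project the RRR at a given perioperative stroke rate (floored at zero)."""
    steps = (complication_rate - TRIAL_COMPLICATION_RATE) / 0.02
    return max(TRIAL_RRR - 0.20 * steps, 0.0)

for rate in (0.02, 0.04, 0.06, 0.08):
    print(f"perioperative stroke rate {rate:.0%}: projected RRR = {extrapolated_rrr(rate):.0%}")
```

Under these assumptions the projected benefit erodes quickly: tripling or quadrupling the trial complication rate leaves only a small fraction of the original relative risk reduction, which is why provider and institutional eligibility criteria matter so much when extrapolating.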
Some trials of devices randomly assign patients only after the device is implanted. For example, 8 of the 9 trials in an EPC report on cardiac resynchronization therapy used this approach: only patients in whom the device was successfully implanted (approximately 90% of those who underwent the procedure) were randomly assigned to have the device turned on or off (97). This design, similar to the run-in period used in some pharmaceutical trials, does not affect the internal validity of the trials, since the randomly assigned groups should still be balanced for unmeasured confounders. However, it does affect the tests of statistical significance (leading to narrower CIs and a greater chance of type I error) and may lead to overestimates of treatment benefits and underestimates of adverse effects, since these studies do not include patients who could not tolerate the procedure or those in whom implantation was unsuccessful (107). Unfortunately, there are no accepted methods for adjusting results for the effects of the “run-in period” before randomization. While some authors advocate recalculating effect estimates as if the run-in had not been used (thus including prerandomization events in the relevant treatment group) (107), we suggest that the effect estimates be derived from the postrandomization data but that the conclusions prominently state that the reported effect estimates are a “best-case” scenario and probably represent the ceiling of what may be expected from a device or surgical procedure when used in clinical practice.
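To make the competing approaches concrete, the sketch below uses entirely invented numbers to show the direction of the recalculation some authors advocate (107): prerandomization implantation failures and periprocedural deaths are folded back into the device strategy, with the further simplifying assumption that patients with failed implantation subsequently die at the device-off rate.

```python
# All numbers invented; this illustrates only the direction of the
# recalculation described in the text, not results from any trial.

# Post-randomization results (hypothetical): 450 patients per arm.
on_deaths, on_n = 45, 450
off_deaths, off_n = 68, 450
naive_rr = (on_deaths / on_n) / (off_deaths / off_n)

# Run-in (hypothetical): 1000 implantations attempted, 100 unsuccessful,
# 4 periprocedural deaths; these events all precede randomization.
failed, periprocedural_deaths = 100, 4
# Simplifying assumption: failed implantations die at the device-off rate.
failed_deaths = round(failed * off_deaths / off_n)

# Fold the prerandomization events into the device strategy.
adj_deaths = on_deaths + periprocedural_deaths + failed_deaths
adj_n = on_n + failed + periprocedural_deaths
adjusted_rr = (adj_deaths / adj_n) / (off_deaths / off_n)

print(f"post-randomization RR {naive_rr:.2f}; recalculated RR {adjusted_rr:.2f}")
```

On these invented numbers the recalculated estimate shows less benefit than the postrandomization estimate, which is the sense in which postrandomization figures represent a best-case scenario; the recalculation itself depends on untestable assumptions about outcomes in the excluded patients, which is why no adjustment method is generally accepted.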
The effects of most technologies, such as devices and surgical procedures, tend to change over time. The benefits of technological innovations should theoretically improve over time, since procedural complications should decline as providers gain experience with the techniques and the selection of patients most likely to benefit should improve. However, as outlined above, this trend is often countered by the tendency of innovations to diffuse nonselectively beyond the settings in which they were shown to be beneficial, thus increasing complication rates and reducing, if not negating, potential benefits. The uncertain effects of a device or procedure over time are compounded when the design of the device or the features of the procedure have evolved rapidly; earlier studies may therefore show different outcomes than later studies. Furthermore, our ability to extrapolate from published studies to clinical practice for devices or procedures may be limited by any imprecision in the description of the device or procedure in the literature.
Given periprocedural complication rates, almost all interventions that involve a surgical procedure (including those to implant a device) have survival curves that cross at some point; that is, patients in the procedure group will have worse short-term outcomes but better long-term outcomes (if the procedure is beneficial). Thus, the benefits of the procedure appear smaller the closer to the procedural date one looks. This has implications both for the pooling of data in a systematic review (effect estimates from different follow-up periods should not be pooled indiscriminately) and for applying the results of the systematic review to make projections about long-term patient outcomes.
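A minimal sketch with invented hazards shows how such crossing curves arise: an upfront procedural mortality, followed by a lower long-term hazard, eventually overtakes the no-procedure strategy.

```python
# Hypothetical hazards chosen only to illustrate crossing survival curves.
PROCEDURAL_MORTALITY = 0.03   # one-time operative risk (invented)
SURGICAL_HAZARD = 0.02        # annual mortality after the procedure (invented)
MEDICAL_HAZARD = 0.05         # annual mortality without the procedure (invented)

def survival(years, upfront_risk, annual_hazard):
    """Probability of surviving to a given year under a constant annual hazard."""
    return (1 - upfront_risk) * (1 - annual_hazard) ** years

for year in range(6):
    s = survival(year, PROCEDURAL_MORTALITY, SURGICAL_HAZARD)
    m = survival(year, 0.0, MEDICAL_HAZARD)
    leader = "surgery" if s > m else "medical"
    print(f"year {year}: surgery {s:.3f} vs medical {m:.3f} ({leader} ahead)")
```

The short-term deficit and long-term advantage in this toy model are why an effect estimate is meaningful only for a stated follow-up duration, and why estimates from different follow-up periods should not be pooled indiscriminately.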
In sum, in conducting systematic reviews of devices or surgical procedures or drawing conclusions from them, it should not be assumed that the efficacy and safety seen in clinical trials conducted in highly select subsets of patients, cared for by highly select providers at highly select institutions, will translate into similar safety and effectiveness when the intervention is applied in usual practice, particularly over time as devices and surgical techniques evolve. Thus, the systematic reviewer and the reader must pay particular attention to the patient, provider, and institutional eligibility criteria; the type of device or procedure; and the length of follow-up. Furthermore, for any device trials that randomly assigned patients after the procedure had been performed, the reviewer should highlight in the report that the reported effect estimates probably represent best-case estimates for the efficacy of the device when used in clinical practice.
While systematic reviews of RCTs are considered the gold standard for evidence, they are not infallible; indeed, in many instances, large RCTs have disproved the results of previous systematic reviews (108). While advances in the methods of systematic reviews over the past decade should result in more valid conclusions, many of these advances are most relevant to reviews of pharmacologic interventions, and many issues specific to devices and surgery require further evaluation. The merits and drawbacks of including grey literature and nonrandomized studies need to be weighed on a case-by-case basis for each clinical topic, and reviewers need to carefully consider the external validity of their findings and comment on issues of applicability for decision makers (Table 3). Unless the issues we have raised in this manuscript are explicitly addressed in a systematic review of a therapeutic device or procedure, we believe that clinicians faced with extrapolating from the evidence to clinical practice and policymakers faced with deciding whether to support implementation of that device or procedure should be cautious.