Linda L. Humphrey, MD, MPH; Mark Helfand, MD, MS; Benjamin K.S. Chan, MS; Steven H. Woolf, MD, MPH
Note: This manuscript is based on a longer systematic evidence review that was reviewed by outside experts and representatives of professional societies. A complete list of peer reviewers is available online at http://www.ahrq.gov/clinic/uspstfix.htm.
Disclaimer: The authors of this article are responsible for its contents, including any clinical or treatment recommendations. No statement in this article should be construed as an official position of the Agency for Healthcare Research and Quality or the U.S. Department of Health and Human Services.
Acknowledgments: The authors thank Stephanie Detlefsen, MD, for her contribution to this evidence review and David Atkins, MD, MPH, from the Agency for Healthcare Research and Quality and members of the U.S. Preventive Services Task Force for their comments on earlier versions of this review. They also thank Kathryn Pyle Krages, AMLS, MA, Susan Carson, MPH, Patty Davies, MS, Susan Wingenfeld, and Jim Wallace for their help with preparation of the manuscript and the full systematic evidence review.
Grant Support: This study was conducted by the Oregon Health & Science University Evidence-based Practice Center under contract to the Agency for Healthcare Research and Quality (contract no. 290-97-0018, task order no. 2), Rockville, Maryland.
Requests for Single Reprints: Reprints are available from the Agency for Healthcare Research and Quality Web site (http://www.preventiveservices.ahrq.gov) or the Agency for Healthcare Research and Quality Publications Clearinghouse (800-358-9295).
Current Author Addresses: Drs. Humphrey and Helfand and Mr. Chan: Oregon Health & Science University, Mailcode BICC, 3181 SW Sam Jackson Park Road, Portland, OR 97201-3098.
Dr. Woolf: Virginia Commonwealth University, 3712 Charles Stewart Drive, Fairfax, VA 22033.
Humphrey LL, Helfand M, Chan BK, Woolf SH. Breast Cancer Screening: A Summary of the Evidence for the U.S. Preventive Services Task Force. Ann Intern Med. 2002;137:347-360. doi: 10.7326/0003-4819-137-5_Part_1-200209030-00012
Published: Ann Intern Med. 2002;137(5_Part_1):347-360.
Breast cancer is the second leading cause of cancer death among North American women. Approximately 1 in 8.2 women will receive a diagnosis of breast cancer during her lifetime, and 1 in 30 will die of the disease (1). Breast cancer incidence increases with age (1), and although significant progress has been made in identifying risk factors and genetic markers, more than 50% of cases occur in women without known major predictors (2-5).
This review was commissioned to assist the current U.S. Preventive Services Task Force (USPSTF) in updating its recommendations on breast cancer screening. We focus on information that was not available in 1996, when the second USPSTF examined the issue (6). Our goal was to critically appraise and synthesize evidence about the overall effectiveness of breast cancer screening, as well as its effectiveness among women younger than 50 years of age.
The analytic framework, literature search, and data extraction are described in detail in the Appendix. Briefly, we searched the Cochrane Controlled Trials Registry, MEDLINE, PREMEDLINE, and reference lists (6-8) for randomized, controlled trials of screening with death from breast cancer as an outcome. In all, we reviewed 154 publications from eight eligible randomized trials of screening mammography and two trials of breast self-examination (BSE). We abstracted details about patient population, design, quality, data analysis, and published results at each reported length of follow-up. We also evaluated previous meta-analyses of these trials and of screening test characteristics and studies evaluating the harms associated with false-positive test results.
We used predefined criteria developed by the current USPSTF to assess the internal validity of the trials (9). Two authors rated the internal validity of each study as “good,” “fair,” or “poor.” Disagreements were resolved by further review and discussion. In the USPSTF system, a study that meets all the criteria for internal validity is rated as good quality (9). The rating reflects a judgment that the results of the study are very likely to be correct. The fair-quality rating is used for studies that have important but not major flaws and implies that the findings are probably valid. A study that has a major flaw in design or execution—one that is serious enough to invalidate the results of the study—is rated as poor quality. We based our quality ratings on the entire set of publications from a trial rather than on individual articles.
The USPSTF criteria for internal validity are listed in Appendix Table 1. All of the mammography trials met the first three criteria: They clearly defined interventions, measured important outcomes, and used intention-to-treat analysis. Therefore, our quality ratings reflect differences among the studies on the remaining criteria: 1) initial assembly of comparable groups; 2) maintenance of comparable groups and minimization of differential loss to follow-up or overall loss to follow-up; and 3) use of outcome measurements that were equal, reliable, and valid. The Appendix describes our approach to applying these criteria in more detail.
We conducted new meta-analyses to incorporate new information about the quality of the trials and longer follow-up results. Breast cancer is known for its biological heterogeneity (10) as well as for late recurrences (10). Thus, longer follow-up is relevant in evaluating mortality rates, particularly in younger women. In addition, for several of the trials, the most recent analyses correct flaws in earlier reports.
Six of the eight mammography trials were designed to assess the effectiveness of mammography over a broad age range, rather than its comparative effectiveness in various age subgroups. One trial specifically examined women 40 to 49 years of age because the earliest trial seemed to show no benefit in this subgroup. The USPSTF posed these questions for the meta-analysis: 1) Does mammography reduce breast cancer mortality rates among women over a broad range of ages when compared with usual care? and 2) If so, does mammography reduce breast cancer mortality rates among women 40 to 49 years of age when compared with usual care?
We answered each question in two parts. First, using WinBUGS software (MRC Biostatistics Unit, Cambridge, United Kingdom), we constructed a two-level Bayesian random-effects model to estimate the effect size from multiple data points for each study and to derive a pooled estimate of relative risk reduction and credible intervals (CrIs) for a given length of follow-up (11). Second, we pooled the most recent results of each trial to calculate the absolute and relative risk reduction, using the results of the first analysis to estimate the mean length of follow-up.
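As a rough illustration of how trial-level relative risks can be pooled under a random-effects assumption, the sketch below uses the classical DerSimonian-Laird estimator. This is a swapped-in, non-Bayesian technique, not the two-level model fit in WinBUGS, and it yields confidence intervals rather than credible intervals. The function name and the trial counts are hypothetical and are not taken from any trial in this review.

```python
import math

def pooled_relative_risk(trials):
    """Pool relative risks with a DerSimonian-Laird random-effects model.

    trials: list of (deaths_screened, n_screened, deaths_control, n_control).
    Returns (pooled RR, lower 95% limit, upper 95% limit).
    """
    log_rr, var = [], []
    for a, n1, c, n0 in trials:
        log_rr.append(math.log((a / n1) / (c / n0)))
        # Approximate variance of the log relative risk
        var.append(1 / a - 1 / n1 + 1 / c - 1 / n0)

    # Fixed-effect (inverse-variance) estimate, needed for Cochran's Q
    w = [1 / v for v in var]
    fixed = sum(wi * y for wi, y in zip(w, log_rr)) / sum(w)
    q = sum(wi * (y - fixed) ** 2 for wi, y in zip(w, log_rr))

    # Between-trial variance, truncated at zero
    scale = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(trials) - 1)) / scale) if scale > 0 else 0.0

    # Random-effects weights add tau2 to each within-trial variance
    w_star = [1 / (v + tau2) for v in var]
    mu = sum(wi * y for wi, y in zip(w_star, log_rr)) / sum(w_star)
    se = math.sqrt(1 / sum(w_star))
    return math.exp(mu), math.exp(mu - 1.96 * se), math.exp(mu + 1.96 * se)

# Hypothetical counts for two trials, each with a relative risk of 0.8
hypothetical_trials = [(80, 10000, 100, 10000), (160, 20000, 200, 20000)]
rr, lower, upper = pooled_relative_risk(hypothetical_trials)
```

Because both hypothetical trials share the same relative risk, the estimated between-trial variance is zero and the pooled estimate equals that common value. With heterogeneous trials, the random-effects weights give smaller trials relatively more influence than fixed-effect weights would.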
To avoid bias that could result from excluding any data from valid studies, we included the results of all trials of fair quality or better in the base-case analysis. The disadvantage of this approach is that it combines results from two distinct types of studies.
The six population-based trials randomly assigned women to an invitation-to-screening group or to a control group that received “usual care” and was followed passively. In these trials, women who were invited to screening but chose not to be screened were included in the analysis of the “screened” group. Two trials from Canada, the Canadian National Breast Cancer Screening Study-1 (CNBSS-1) and the Canadian National Breast Cancer Screening Study-2 (CNBSS-2), differed from the other six trials. First, the Canadian trials used mass media to recruit a sample of volunteers, and all women randomly assigned to mammography had mammography at least once (12-13). Second, in CNBSS-2, the control group was screened periodically with clinical breast examination (CBE). To estimate the relative risk reduction and the number needed to invite to screening to prevent one breast cancer death compared with usual care, we reanalyzed the data excluding the results of the Canadian studies.
This study was funded by the U.S. Agency for Healthcare Research and Quality. Agency staff and members of the USPSTF reviewed and made substantive recommendations about the analyses and final manuscript. Agency approval was required before the manuscript could be submitted for publication.
The eight randomized trials of mammography identified in our review (12-23) varied in recruitment of participants, mammography protocol, control groups, and size (Table 1). Six trials examined the effectiveness of screening among women between 40 and 74 years of age; one trial enrolled women in their 40s, and one enrolled only women in their 50s. Four trials from Sweden tested mammography only (14-17, 23-26), and the other four, from Canada, New York, and Edinburgh, Scotland, tested mammography and CBE (12, 13, 18-22, 27).
We found important methodologic limitations in all of the trials and rated all but one as fair, using USPSTF criteria. Table 1 lists the flaws of each trial and indicates how they influenced the overall ratings. The two reviewers rated the Swedish and Canadian trials as fair. Their initial ratings for the Edinburgh study and for the Health Insurance Plan of Greater New York (HIP) study differed. After extensive peer review, and detailed review of these trials' associated publications, the reviewers reached a consensus that the HIP study should be rated as fair and the Edinburgh study should be rated as poor.
The HIP trial (conducted from 1963 to 1966) was the first trial of breast cancer screening. It is difficult to critically appraise because publications that describe it differ in detail from more recent publications. We found several limitations of this trial, including inadequate description of allocation concealment and poor reporting of intervention and control group numbers. In addition, we found better ascertainment of clinical variables (including previous mastectomy) among the invitation-to-screening cohort than among the passively followed control group. However, we viewed this as an expected consequence of a study design in which a control group receives usual care and is not contacted. The screening and control groups differed from each other slightly in education, menopausal status, and previous breast lumps; however, the differences were not systematic and did not favor one group over the other. The strengths of the trial included intention-to-treat analysis, little contamination, and blind review of deaths. We did not find the faults severe enough to rate the study as poor quality and rated it as fair, which signifies that the results were probably valid at the time the study was conducted.
The Canadian trials met all of the USPSTF criteria for a rating of good quality, except for adequacy of allocation concealment. They differed from the other trials because all participants had a history and physical examination before randomization. This design permitted exclusion of patients who had a history of breast cancer and extensive examination of the baseline differences between groups.
The Swedish trials all had limitations that resulted in a rating of fair rather than good. The Stockholm and Malmö trials, which were individually randomized, did not report whether allocation was concealed. The Gothenburg trial and Swedish Two-County Trial, which were cluster-randomized trials, had small differences in mean age between the invited and control groups. Such differences are expected to occur in a cluster-randomized trial, do not indicate failure of randomization or a problem in the trial execution, and can be adjusted for in statistical analyses (28). Both the Gothenburg trial and the Swedish Two-County Trial provided insufficient data to determine whether randomization distributed other important confounders equally among the groups, but comparison of overall mortality rates in the invited and control groups does not suggest that a major imbalance occurred (29).
As originally conducted, the Swedish trials had important flaws related to measurement of the primary outcome measure, death from breast cancer. In the Swedish Two-County Trial and the Gothenburg and Stockholm trials, review of deaths was unblinded and criteria for the assignment of cause of death were unclear. Another concern about the Swedish trials as a group related to screening of the control groups. Originally, the Swedish trials used the “evaluation” method of analysis, in which mortality rates in the screened population were calculated only for cancer diagnosed between the time of randomization and the last mammographic examination. When the evaluation method of analysis is used, control group screening can introduce bias unless it is performed concurrently with the final instance of mammography in the screened group (30-31). This method is inferior to the “follow-up” method of analysis, in which all deaths that occur after randomization are included in the analysis. The follow-up method of analysis dilutes relative benefit over time, particularly in studies that offered screening to the control group and in areas where widespread screening is adopted.
We considered these flaws to be adequately corrected in subsequent analyses by the trialists. In a 1993 overview of the trials, an independent end point committee used an explicit protocol to perform blind assessment of cause of death (32). Participants were linked to an external cancer registry and were excluded from the analysis if breast cancer had been diagnosed before the trial began. For the Swedish trials as a whole, death from every cause except breast cancer was similar in the compared groups (33). In the Swedish Two-County Trial, the reduction in rates of advanced breast cancer (34), which are not related to judgments about the causes of death, was similar to the reduction in breast cancer mortality rates (35). The overview also reanalyzed the data by using the follow-up method of analysis and found very little difference between the recalculated and original relative risk values. A recent review (8) critical of the Swedish studies raised concern about bias in postrandomization exclusions, as evidenced by variation in the reported number of participants. This concern was effectively addressed in a recent update of these trials, which explained that this variation was due to the use of different methods for estimating the number of women in each birth cohort rather than to manipulation after randomization (23). The update also reported more recent results of the Swedish trials by using both the follow-up and evaluation methods of analysis.
We rated the Edinburgh study as poor quality because of a serious imbalance between the control and screened groups. General practitioners' practices were randomized in clusters without matching for socioeconomic factors. As a result, socioeconomic status, a predictor of stage at diagnosis as well as death from breast cancer, was significantly lower in the control group than in the mammography group. All-cause mortality was dramatically higher in the control group than in the screened group (20.1 more deaths per 10 000 person-years [95% CI, 13.3 to 26.9]) (29). This difference is close to 25 times larger than the difference in breast cancer deaths between the groups and confirms our assessment that the trial was severely flawed.
Since no gold standard can be applied to the entire screened population, the denominator used for estimating sensitivity is the total number of breast cancer cases diagnosed in a given interval. The results of recent, good-quality systematic reviews of the accuracy of mammography in the screening trials are summarized in Table 2 (36-37). The overall sensitivity for all rounds of screening was lowest in the HIP trial. Otherwise, one study was not clearly better or worse than another. For a 1-year screening interval, the sensitivity of first mammography ranged from 71% to 96%. Sensitivity was substantially lower for women in their 40s than for older women.
The data in Table 2 cannot be applied to individual patients because they are not adjusted for several factors that are known to affect sensitivity. These include patient factors (use of hormone replacement therapy, mammographic breast density), technical factors (the quality of mammography, the number of mammographic views), and provider factors (the experience of radiologists and their propensity to label the results of an examination abnormal, the choice of follow-up evaluation for abnormal mammograms) (36, 38-42).
In the randomized trials, the specificity of a single mammographic examination was 94% to 97% (36, 43-44). This indicates that 3% to 6% of women who did not have cancer underwent further diagnostic evaluation, typically a clinical examination, more mammographic views, or ultrasonography. The positive predictive value of one-time mammography ranged from 2% to 22% for abnormal results requiring further evaluation and from 12% to 78% for abnormal results requiring biopsy (36, 45-46) (Table 3). Estimates from community settings suggest a graded, continuous increase in predictive value with age. For example, among 31 814 average-risk women screened in California from 1985 to 1992, the positive predictive value for further evaluation was 1% to 4% among those 40 to 49 years of age, 4% to 9% among those 50 to 59 years of age, 10% to 19% among those 60 to 69 years of age, and 18% to 20% among those 70 years of age and older (47).
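The dependence of positive predictive value on how common cancer is in the screened group follows from Bayes' rule. The sketch below is illustrative only: the function name and both prevalence values are assumptions, and the sensitivity and specificity are simply chosen from within the ranges reported for the trials.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Probability of cancer given an abnormal result (Bayes' rule)."""
    true_positives = sensitivity * prevalence
    false_positives = (1.0 - specificity) * (1.0 - prevalence)
    return true_positives / (true_positives + false_positives)

# Assumed values: sensitivity and specificity lie within the reported ranges;
# the two prevalence values are hypothetical stand-ins for a younger and an
# older screened group.
ppv_lower_prevalence = positive_predictive_value(0.85, 0.95, 0.002)
ppv_higher_prevalence = positive_predictive_value(0.85, 0.95, 0.010)
```

With test accuracy held fixed, a higher prevalence of cancer yields a higher predictive value, which is one reason predictive value would be expected to rise with age.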
Table 4 summarizes the most recent results from trials that included at least some participants older than 50 years of age. The four Swedish trials that compared two to six rounds of mammography with usual care (23, 26) reported 9% to 32% reductions in the risk for death from breast cancer. The results of the trials have changed little over time (Figure). The reduction was statistically significant in only one of these trials (the Swedish Two-County Trial) (relative risk, 0.68 [CI, 0.59 to 0.80]) (26). The number of times mammography was performed and the frequency of screening did not seem to explain the variation among the Swedish studies. A previous meta-analysis found little change when the individual trial results were adjusted for type of randomization and degree of adherence (48).
Of the four studies that evaluated the combination of mammography and CBE (Table 4), three were of at least fair quality (12, 13, 18, 27, 49). The HIP trial reported a relative risk reduction that began 5 years after randomization and remained below 1 after 16 or more years of follow-up (relative risk, 0.79). The CNBSS-2, which compared annual mammography and CBE with annual CBE among women 50 to 59 years of age, showed no benefit 13 years after the study began (12, 20). The CNBSS-1, which compared annual mammography and CBE with usual care in women 40 to 49 years of age, also showed no benefit.
In our meta-analysis of results from all age groups combined, we excluded the Edinburgh trial (which we rated as poor) and used the results from both Canadian trials. The summary relative risk was 0.84 (95% CrI, 0.77 to 0.91), equivalent to a number needed to screen of 1224 (CrI, 665 to 2564) an average of 14 years after study entry. To estimate the effectiveness of an invitation to screen compared with usual care, we also excluded the Canadian trials, which recruited volunteers. The summary relative risk was 0.81 (CrI, 0.73 to 0.89), and the number needed to invite to screening was 1008 (CrI, 531 to 2128). The relative risks by year of observation (including trial plus follow-up time) are shown in the Figure, which suggests a gradual decrease in benefit with longer observation time.
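The number needed to screen is the reciprocal of the absolute risk reduction, which is the control-group risk multiplied by one minus the relative risk. The sketch below is a simple point calculation, not the Bayesian computation that produced the credible intervals. The control-group risk is not stated here, so the value used is back-calculated from the reported relative risk of 0.84 and number needed to screen of 1224 and should be read as an assumption. Both function names are hypothetical.

```python
def number_needed_to_screen(control_risk, relative_risk):
    """Reciprocal of the absolute risk reduction."""
    absolute_risk_reduction = control_risk * (1.0 - relative_risk)
    if absolute_risk_reduction <= 0:
        raise ValueError("no risk reduction; number needed to screen is undefined")
    return 1.0 / absolute_risk_reduction

def implied_control_risk(nns, relative_risk):
    """Control-group risk implied by a reported NNS and relative risk."""
    return 1.0 / (nns * (1.0 - relative_risk))

# Back-calculated (assumed) control-group risk of breast cancer death:
# roughly 5 per 1000 women over the observation period
control_risk = implied_control_risk(1224, 0.84)

# Round trip: recovers the reported number needed to screen
nns = number_needed_to_screen(control_risk, 0.84)
```

Because the number needed to screen scales inversely with the control-group risk, the same relative risk translates into a larger number in groups with lower baseline mortality, such as younger women.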
Since 1963, seven randomized, controlled trials have included women 40 to 49 years of age, approximately 200 000 participants. With the exception of one of the Canadian studies, none of the trials was planned to evaluate breast cancer screening in this age group and none had sufficient power. Two trials, the Stockholm trial and CNBSS-1, showed no benefit for this age group even with longer follow-up (Table 5). The other five trials suggest a benefit (risk reduction, 13% to 42%), and one (the Gothenburg trial) observed a statistically significant risk reduction since 1996. These findings reflect results after 11 to 19 years of observation; the median period of active screening was 6 years (range, 4 to 15 years).
In our meta-analysis, excluding the Edinburgh trial, the summary relative risk was 0.85 (CrI, 0.73 to 0.99) after 14 years of observation, with a number needed to screen of 1792 (CrI, 764 to 10 540) to prevent one death from breast cancer. Some might argue that the Canadian study should be excluded in calculating the number needed to invite to screening because its participants were prescreened volunteers who may have differed from the general population. When the Canadian study was excluded, the summary relative risk was 0.80 (CrI, 0.67 to 0.96) and the number needed to invite to screening was 1385 (CrI, 659 to 6060). The Figure shows an increasing screening benefit among this age group with a longer period of observation.
Among women 50 years of age or older, the summary relative risk was 0.78 (CrI, 0.70 to 0.87) after 14 years of observation, with a number needed to screen of 838 (CrI, 494 to 1676) to prevent one death from breast cancer. As shown in the Figure, the benefit has decreased with longer duration of follow-up.
We found seven meta-analyses of the effectiveness of mammography in women 40 to 49 years of age (Table 6) (8, 30, 32, 48, 50-58). Our results, which reflect exclusion of one flawed trial, longer follow-up in six of the trials, and corrected results for the Swedish trials, were consistent with those of most previous meta-analyses. Two meta-analyses (8, 51), including one from the Cochrane Collaboration, produced results that differed substantially from ours. The Cochrane review reported a summary relative risk of 1.03 (CI, 0.77 to 1.38) but based this on only two trials.
Direct evidence of effectiveness among older women is limited to two trials that included women older than 65 years of age. Both of these trials reported relative risk reductions among women 65 to 74 years of age (relative risk, 0.68 [CI, 0.51 to 0.89] [25] and 0.79 [59] among women 70 to 74 years of age). In the recent Swedish overview, the summary relative risk among women 65 to 74 years of age was 0.78 (CI, 0.62 to 0.99) (23, 60).
The test characteristics of CBE, based on data from trials designed specifically for breast cancer screening, were recently reviewed (61). Sensitivity ranged from 40% to 69%, specificity from 88% to 99%, and positive predictive value from 4% to 50% when mammography and interval cancer were used as the criterion standard. One community study showed that over 10 years of biennial screening, 13.4% of women had false-positive results on CBE at least once; risk for such results was higher among women younger than 50 years of age (62).
No trial has compared CBE alone with no screening. However, two randomized, controlled trials involving the use of mammography and CBE had mortality reductions of 29% and 14% (18, 27, 63). A controlled, nonrandomized United Kingdom trial of CBE and mammography showed a nonsignificant mortality reduction of 14% (relative risk, 0.86 [CI, 0.73 to 1.01]) (64).
What is the contribution of CBE to these reductions in mortality rate? Among studies showing a benefit of screening, mortality reductions in trials of CBE with mammography are similar to those in trials including mammography only. In the CNBSS-2, in which women 50 to 59 years of age were randomly assigned to annual CBE and mammography or to annual CBE (65), the relative risk for death was 0.97 (CI, 0.62 to 1.52) (13). This suggests that mammography has little additive benefit in the setting of a careful, detailed CBE.
Because neither CBE nor mammography is 100% sensitive, BSE has been advised as an important screening method among women older than 20 years of age. However, its effectiveness in decreasing death from breast cancer has been controversial because evidence from clinical trials is limited. Observational studies evaluating BSE and breast cancer stage at diagnosis or death have had mixed results (45, 66).
In two randomized, controlled trials with 5 to 10 years of follow-up, both conducted outside the United States, breast cancer mortality rates were similar in women instructed in BSE and in noninstructed controls (67-69). Both studies involved large numbers of women who were meticulously trained with proper technique and had numerous reinforcement sessions; mammography was not part of routine screening in the countries involved. In both trials, physician visits and biopsy for benign breast lesions increased among those educated in BSE. To date, no studies have evaluated other potential adverse outcomes of BSE, such as anxiety and subsequent screening behavior.
The most frequently discussed adverse effects of mammography are the anxiety, discomfort, and cost associated with positive test results, many of which are false positive, and the diagnostic procedures they generate. For a woman undergoing regular mammography, cumulative specificity may be more relevant than the specificity of a single examination. In one community setting involving 2400 women 40 to 69 years of age, 6.5% of mammography results requiring further evaluation were false positive (specificity, 93.5%). When evaluated on an individual basis, however, approximately 23% of women had at least one false-positive result on mammography requiring further work-up during 10 years of biennial screening (average of 4 mammograms per woman), indicating a 10-year cumulative specificity of 76.2%. For every $100 spent on screening, $33 was spent on the evaluation of false-positive results (62).
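The relationship between single-examination and cumulative specificity can be approximated by assuming that false-positive results on successive examinations are independent. That assumption is ours for illustration and may not hold in practice, since some women are more prone to repeated false-positive results; the function names are hypothetical.

```python
def cumulative_specificity(single_exam_specificity, n_exams):
    """Probability of no false-positive result across n independent exams."""
    return single_exam_specificity ** n_exams

def cumulative_false_positive_risk(single_exam_specificity, n_exams):
    """Probability of at least one false-positive result across n exams."""
    return 1.0 - cumulative_specificity(single_exam_specificity, n_exams)

# Reported single-exam specificity of 93.5% and an average of 4 mammograms
spec_10_year = cumulative_specificity(0.935, 4)          # about 0.764
risk_10_year = cumulative_false_positive_risk(0.935, 4)  # about 0.236
```

Under this simplification the result is close to the reported 10-year cumulative specificity of 76.2%, which shows how a high per-examination specificity still leaves a substantial cumulative chance of a false-positive result.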
Anxiety over an abnormal mammogram is documented in some (70-74) but not all (71, 75) studies. These studies generally suggest that anxiety dissipates after cancer is ruled out, but some studies suggest that some women worry persistently (72, 74-76). The anxiety associated with an abnormal mammogram does not seem to dissuade women from undergoing further screening (77) and may even be associated with improved adherence to recommended screening intervals (70, 78-79). Many women are willing to accept the risk for false-positive results. In one survey, 99% of women understood that false-positive examination results occur with screening, although they underestimated the likelihood. Of importance, 63% stated that they would accept 500 instances of false-positive examination results to save one life (80).
Some view diagnosis and treatment of ductal carcinoma in situ (DCIS) as potential adverse consequences of mammography. There is incomplete evidence regarding the natural history of DCIS, the need for treatment, and treatment efficacy, and some women may receive treatment of DCIS that poses little threat to their health. In a 1992 study, 44% of women with DCIS were treated with mastectomy and 23% to 30% were treated with lumpectomy or radiation (81-82). In one survey, only 6% of women were aware that mammography might detect nonprogressive breast cancer (80).
Radiation exposure is also a potential risk associated with mammography (83). Using risk estimates provided by the Biological Effects of Ionizing Radiation report of the U.S. National Academy of Sciences, and assuming a 4-mGy mean glandular dose from each two-views-per-breast bilateral mammography, Feig and Hendrick estimated that annual mammography of 100 000 women for 10 years beginning at 40 years of age would induce no more than eight deaths from breast cancer (84). Women with an inherited susceptibility to ionizing radiation damage have higher risk for radiogenic breast cancer (10, 85), although this has not been documented in association with mammography.
Fair-quality, relatively consistent evidence suggests that mammography screening reduces breast cancer death among women 40 to 74 years of age. We found no evidence that inclusion of CBE conferred greater benefit than mammography alone. We also found no evidence supporting the role of BSE in reducing breast cancer mortality.
Over the three decades in which mammography trial data have been available, critical reviewers and the investigators themselves have discussed limitations and irregularities in data reporting. One highly publicized review by the Cochrane Collaboration criticized the trials in regard to randomization, postrandomization exclusions, and determination of deaths from breast cancer (8). It found all but two of the trials, the Malmö trial and the Canadian trials, severely flawed or of poor quality and prompted some official bodies to question their support for screening mammography.
We identified many of the same design problems highlighted in the Cochrane review but reached different conclusions about their bearing on the validity of the findings. With the exception of the Edinburgh trial, we found inadequate evidence to conclude that the specific flaws identified introduced biases of sufficient magnitude or direction to invalidate the findings or to cause us to reject the inference that screening mammography reduces breast cancer mortality rates.
The effectiveness of screening in women 40 to 49 years of age is a longstanding controversy. In early years, it centered on the lack of evidence that observed risk reductions were statistically significant (6, 52, 86). That argument has dissipated over time as more evidence has shown a significant separation in survival curves with longer follow-up. The delay in the separation of those curves, however, has prompted some to question whether the observed benefits are due to the detection of cancer after 50 years of age, suggesting little incremental benefit from initiating screening at 40 years of age and exposing women to the harms of screening for an extra decade (87-88). We found little evidence to convincingly address this concern and some evidence that some benefit from screening women 40 to 49 years of age would be sacrificed if screening began at age 50 years (27, 89).
The use of 50 years of age as a threshold is somewhat arbitrary (except that it approximates the age of menopause). The risks for developing and dying of breast cancer are continuous variables that increase with age, and the greatest increase in incidence actually occurs before menopause (90-91). We found that the relative risk reduction achieved with mammography screening does not differ substantially by age, although the time required to obtain the benefit is longer for younger women. On the other hand, younger women have more potential years of life to gain by screening. Thus, the variable most affected by age is absolute risk reduction, which increases as a continuum with age while the number needed to screen decreases. The age of 50 years has no special bearing on this pattern, and some question the scientific rationale for treating women 40 to 49 years of age as a special entity (92).
What emerges as a more important concern, across all age groups, is whether the magnitude of benefit is sufficient to outweigh the harms. The risk for false-positive results and their consequences decreases with age. Thus, although mammography at any age poses a tradeoff of benefits and harms, the balance between increasing absolute risk reduction and decreasing harms grows more favorable over time. The age at which this tradeoff becomes acceptable is a subjective judgment that cannot be answered on scientific grounds, since early evidence suggests that women will tolerate a high risk for false-positive results. As noted earlier, 63% of women in one study stated that they would accept 500 instances of false-positive results to save one life (80). On the basis of the results of our meta-analysis, we calculated that over 10 years of biennial screening among 40-year-old women invited to be screened, approximately 400 women would have false-positive results on mammography and 100 women would undergo biopsy or fine-needle aspiration for each death from breast cancer prevented.
A limitation of our meta-analysis is that we combined studies that used different methods of analysis. In the most recent report from the Swedish trials (23), Nyström and colleagues did not report individual study-level data using the follow-up method. The pooled follow-up analysis reported by Nyström and colleagues in 2002 suggests that the use of the follow-up method would have resulted in a smaller estimate of relative risk reduction.
Women older than 70 years of age have the highest incidence of breast cancer, and test performance in these women is likely to be similar to that in women 50 to 70 years of age. Therefore, theoretically, mammography should be at least as effective for women older than 70 years of age as it is for younger women. Offsetting this potential benefit, however, is the greater comorbidity observed in elderly persons. The potential benefit of early detection is unlikely to be realized in women who have other diseases that diminish life expectancy, in those who would not tolerate evaluation or treatment, and in those with impaired quality of life (for example, dementia) (93). In addition, no data from randomized, controlled trials provide information about the morbidity associated with screening, follow-up, and treatment among women older than 74 years of age. Finally, a major concern in elderly women is the diagnosis and treatment of DCIS, since mortality rates from DCIS are low (1% to 2% at 10 years) and 99% of DCIS is treated surgically (94).
The interval at which mammography was performed in the screening trials varied between 12 and 33 months, but annual mammography was no more effective than biennial mammography. Data from the Swedish Two-County Trial indicate that the period in which breast cancer can be detected before it presents clinically is shorter for women 40 to 49 years of age (95-97). Annual screening may be more important in this age group than in older women, but we found no direct proof for this hypothesis in the controlled trials that have been completed so far.
We found no evidence that CBE or BSE reduces breast cancer mortality. Whether the BSE trials are generalizable to the United States, where the use of CBE and mammography and the incidence of breast cancer are higher, is uncertain. It is also uncertain whether BSE might be beneficial to women who are not in the age ranges at which mammography is recommended or do not avail themselves of mammography. In the setting of CBE and mammography, the probability of finding a significant decrease in mortality rates is likely to be small.
In summary, when judged as population-based trials of cancer screening, most mammography trials are of fair quality. Their flaws reflect tradeoffs in planning that make the trial results widely generalizable but decrease internal validity. In absolute terms, the mortality benefit of mammography screening is small enough that biases in the trials could erase or create it. However, we found that although these trials were flawed in design or execution, there is insufficient evidence to conclude that most were seriously biased and consequently invalid.
Future research should be directed toward developing new screening methods as well as methods of improving the sensitivity and specificity of mammography. Methods of reducing surgical biopsy rates and complications of treatment should also be studied, as should communication of the risks and benefits associated with screening to patients. Finally, efforts to identify breast cancer risk factors with high attributable risk, as well as appropriate prevention strategies, should continue. Even in the best screening settings, most deaths from breast cancer are not currently prevented.
Because of the availability of population-based, randomized trials, mammography has the most direct type of evidence of any cancer screening program (98). Nevertheless, mammography has been controversial since it was first proposed in the 1960s. To understand why, it is helpful to consider the assumptions underlying the steps in the causal chain from screening test to health outcomes. In the analytic framework (Appendix Figure 1), this evidence is shown by the overarching arc connecting screening with the outcomes, reduced morbidity and mortality. Mammography is aimed at early detection of invasive cancer, which is treated by major surgery (mastectomy or tumorectomy). This differs from screening for colorectal cancer and cervical cancer, which is aimed at detecting and removing precancerous lesions to prevent invasive cancer and to preserve the involved organ (colon or uterine cervix). This is one reason why, although it may be reasonable to endorse one cancer screening test (Papanicolaou smear) based on observational, indirect evidence, it may also be reasonable to require experimental evidence before endorsing another (mammography or prostate cancer screening).
It is important to note that the mammography trials do not necessarily provide the highest level of evidence about the efficacy of early treatment. While there is no doubt that screening results in earlier diagnosis of invasive breast cancer, the efficacy of earlier treatment of invasive cancer has not been established independently of the trials (99). That is, there is no direct evidence from trials of surgical therapy (versus watchful waiting) that earlier treatment of invasive cancer reduces mortality. The mammography trials do not attempt to link specific treatments, such as radical mastectomy or adjuvant radiation, to improved outcomes.
The reliance on a theory of treatment rather than on evidence about the efficacy of treatment increases the burden of proof placed on the trials of mammography. It also distinguishes cancer screening from other screening services considered by the USPSTF, such as chlamydia, depression, or osteoporosis screening, for which randomized, placebo-controlled trials of treatment have been done.
The threshold for sufficient evidence about efficacy also depends on the balance of benefits and harms. Because mammography technology, the timing and type of information provided to patients, and treatment approaches have changed over time, the adverse consequences of screening in current practice might be very different from those in the trials. Other sources of data must be used to estimate these consequences.
We identified controlled trials and meta-analyses by searching the Cochrane Controlled Trials Registry (all dates), as well as searching for recent publications in MEDLINE (January 1994 to December 2001). Other sources were a PREMEDLINE search (December 2001 through February 2002); the reference lists of previous reviews, commentaries, and meta-analyses (5, 8, 27, 32, 50, 53, 55, 56, 60, 87, 100-103); the results of a broader search conducted for the systematic evidence review on which this article is based (46); and suggestions from experts.
In the electronic searches, the terms breast neoplasms and breast cancer were combined with the terms mammography and mass screening and with terms for controlled or randomized trials to yield 954 citations. Titles and abstracts were reviewed to identify publications that were randomized, controlled trials of breast cancer screening and had a relevant clinical outcome (advanced breast cancer, breast cancer mortality, or all-cause mortality). In all, the searches identified 146 controlled trials, of which 132 were excluded at the title and abstract phase because they concerned promoting screening rather than the efficacy of mammography (Appendix Figure 2). Four of the remaining 14 trials were excluded. Two were randomized trials of screening with mammography that have not yet presented outcomes of mortality or advanced breast cancer (104-105). The third was a controlled trial that reported a reduction in breast cancer mortality but was not randomized (106-107). The fourth, the Malmö Prevention Study, was apparently a randomized trial of a variety of preventive interventions, including mammography (108). It reported significantly fewer deaths from cancer among women younger than 40 years of age at study entry but provided no information about the mammography protocol, referring readers to another randomized trial, the Malmö Mammographic Screening Program, for further information. We believe that the two trials were in fact separate and that the results of the Malmö Mammographic Screening Program probably do not include results for the 8000 women who participated in the Malmö Prevention Study.
The remaining eight randomized trials of mammography were conducted between 1963 and 1994. Four of these were Swedish studies: the Malmö, Kopparberg, Östergötland, Stockholm, and Gothenburg studies. (Kopparberg and Östergötland together are known as the Swedish Two-County Trial and are counted as one trial.) The remaining studies were the Edinburgh study, the HIP study, and the two Canadian National Breast Screening Studies (CNBSS-1 and CNBSS-2). Using the electronic searches and other sources, we retrieved the full text of 157 publications about these trials (these are listed in the bibliography accompanying the full systematic evidence review [46]). We also identified 10 previous systematic reviews of the trials. Seven of these concerned breast cancer mortality, and three addressed test performance (36, 37, 45). The searches identified three nonrandomized, controlled trials (109-111) that are not included in the meta-analysis but are discussed in the larger report (46). Two randomized trials of BSE were identified and reviewed.
Two of the authors abstracted information about each randomized, controlled trial. We compiled an appendix consisting of detailed information about the patient population, design, potential flaws, missing information, and analysis conducted in each trial. For the primary end point of breast cancer mortality, we abstracted results for each reported length of follow-up. Whenever possible, we abstracted data separately for participants by decade of age.
The randomized trials of screening provide little information about morbidity or the adverse effects of screening or treatment. A systematic review of adverse effects was beyond the scope of our review. In examining titles and abstracts, we obtained the full text of and reviewed recent articles reporting the frequency of false-positive results on screening mammography in the community and surveys of women's reactions to positive results on screening tests.
We used predefined criteria developed by the third USPSTF to assess the internal validity of each study (Appendix Table 1) (9). Two authors rated each study as “good,” “fair,” or “poor,” resolving disagreements by discussion among the authors after review of the data and of comments by 12 peer reviewers of earlier drafts of the report. We tried to apply the same standards to the mammography trials as we have applied to other prevention topics. We based our quality ratings on the entire set of publications from a trial rather than on individual articles.
Appendix Table 1.
The USPSTF criteria were designed to be adaptable to the circumstances of different clinical questions. Like other current systems to assess the quality of trials, the criteria are based as much as possible on empirical evidence of bias in relation to study characteristics. However, although the body of such evidence is growing, it does not permit a high degree of certainty about the importance of specific quality criteria in judging the mammography trials. This is because nearly all empirical evidence of the impact of bias on effect size examined drug treatment or other therapies, rather than screening (112-113). Generalization of these findings to large, population-based trials of screening is not straightforward. In recognition of this fact, cancer screening literature from the 1970s emphasizes that design standards for conventional trials of treatment should not always be applied to cancer screening trials (114).
The quality of reporting of trials limits precision in critical appraisal (115). This is a particular issue in the mammography screening trials, many of which were conducted in the 1960s and 1970s and whose methods were poorly described. Although some reviewers have promoted extensive query of trial authors to fill in gaps in published articles, the reliability of such data, as well as the appropriate interpretation of query data that contradict what has been published in multiauthored, peer-reviewed papers, is uncertain. Moreover, authors are often unable to provide clarifying information (116).
All of the trials clearly defined interventions and co-interventions (CBE and BSE), all considered mortality outcomes, and all used intention-to-screen analysis. For this reason, the following received particular emphasis in judging the quality of the mammography trials: 1) initial assembly of comparable groups, 2) maintenance of comparable groups and minimization of differential or overall loss to follow-up, and 3) use of outcome measurements that were equal, reliable, and valid. As described below, we used a systematic approach to assess the flaws of the trials in each of these areas.
In the mammography trials, randomization was done individually or by clusters. Randomization of individuals is preferable because it is less likely to result in baseline differences among compared groups. In individually randomized trials, we classified allocation concealment as adequate, inadequate, or poorly described, according to the criteria used by Schulz and colleagues (115). In a cluster-randomized trial, it is impossible to conceal the assignment of individual patients, and the importance of concealing the allocation of clusters is unclear. Accordingly, we placed more importance on concealment in individually randomized trials.
We rated the way in which each trial compared participants in the screened and control groups. To obtain the highest rating in this category, a trial had to obtain baseline data on possible covariates before randomization, and the distribution of these covariates had to be similar in screening and control groups. In a large, individually randomized trial, baseline differences in sociodemographic variables would suggest that randomization failed, especially if there were opportunities for subversion (that is, if allocation was not concealed).
This standard applies only if baseline data can be reliably collected in all patients in both groups. In several of the mammography screening trials, participants in the usual care group were followed passively, and there was no opportunity to collect baseline data from all of them. The decision not to contact each individual in the control group has logistic advantages and probably reduced contamination, but it limits comparison between the screened and control groups. Moreover, when clusters are used, some baseline differences in the compared groups are almost inevitable.
We evaluated whether the method of identifying clusters (for example, geographic areas, month or year of birth) was likely to result in bias and whether measures such as matching were used to reduce it. If bias in assigning clusters to intervention or control groups seemed likely, we considered this a major flaw that was enough to invalidate the findings and rated the study as “poor.” However, in contrast to individually randomized trials, we did not take small differences in the mean age of compared groups as an indicator that randomization failed to distribute more important confounders equally among the groups.
Several of the trials measured mortality rates from causes other than breast cancer to establish the comparability of the mammography and control groups. We recorded this information when it was available. Although comparable total mortality supports balanced randomization, it does not assure it. However, if there were dramatic differences in death from other causes, we considered it to be evidence that randomization failed.
Exclusions after randomization are considered to be a serious flaw in the execution of randomized trials, although empirical evidence of this bias is inconsistent (112-113). Postrandomization exclusions were poorly described in several of the mammography trials and could have resulted in bias if the exclusions resulted in different levels of risk for death from breast cancer between the groups. In most of the mammography trials, however, exclusion of participants after randomization was an expected consequence of the protocol; some exclusion criteria, such as previous mastectomy, could not be applied to all participants before randomization because participants were not individually contacted. We examined the number of, reasons for, and methods for exclusion of participants after randomization. We based our rating on whether the methods used to ascertain patients were objective and consistent, not on the numbers of exclusions in the compared groups. Since ascertainment of clinical variables that might result in exclusion of a participant will be greater among intervention participants and is an expected consequence of the study design, we did not consider unequal numbers of excluded participants in the treatment and control groups after randomization to be definitive evidence of bias.
Over the duration of most of the trials, death from breast cancer (the primary end point) occurred in 2 to 9 per 1000 participants. The relatively low number of events means that misclassification or biased exclusion of a few deaths could change the direction and statistical significance of the trial results. For this reason, selection of cases for review of cause of death on broad criteria, use of reliable sources of information to ascertain vital status (death certificates, medical records, autopsies, registries), and use of independent blinded review of the cause of death are important measures to prevent bias. We considered blinded review of deaths a requirement for a quality rating of fair or better.
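A brief sketch shows how sensitive a result can be when events are this rare. The counts below are hypothetical (a control-group death rate of 5 per 1000, within the range noted above) and do not come from any trial. The interval uses the standard large-sample formula for the standard error of a log relative risk.

```python
import math


def rr_ci(deaths_screened, n_screened, deaths_control, n_control, z=1.96):
    """Relative risk with an approximate 95% confidence interval (log scale)."""
    rr = (deaths_screened / n_screened) / (deaths_control / n_control)
    se = math.sqrt(
        1 / deaths_screened - 1 / n_screened
        + 1 / deaths_control - 1 / n_control
    )
    lower = math.exp(math.log(rr) - z * se)
    upper = math.exp(math.log(rr) + z * se)
    return rr, lower, upper


# Hypothetical trial: 20 000 women per group (assumed numbers, for illustration only).
as_reported = rr_ci(72, 20_000, 100, 20_000)
# Same trial after 5 deaths in the screened group are reclassified as breast cancer deaths.
reclassified = rr_ci(77, 20_000, 100, 20_000)

for label, (rr, lower, upper) in [("as reported", as_reported), ("reclassified", reclassified)]:
    print(f"{label}: RR = {rr:.2f} (95% CI, {lower:.2f} to {upper:.2f})")
```

In this constructed example, reclassifying five deaths among 20 000 screened women moves the upper confidence limit from below 1 to above 1, which is why blinded, consistent ascertainment of cause of death carries so much weight.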
The mammography trials have been criticized for decades (99, 117-119), and the trialists have responded by conducting additional analyses intended to address these criticisms. In our assessment of quality, we took into account the results of these supplemental analyses. For example, the cluster-randomized trials have been criticized because they analyzed results using statistical methods appropriate only to individually randomized trials. However, an independent reanalysis using the correct statistical method found that the results were unchanged (48). The Canadian trialists addressed criticisms that women who had palpable nodes might have been enrolled preferentially in the mammography group (120) by reanalyzing their data and showing that the exclusion of these participants did not affect the results (22).
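The concern about analyzing cluster-randomized trials as if individuals had been randomized can be illustrated with the design effect, 1 + (m − 1)ρ, a textbook approximation for equal-sized clusters. This is a generic sketch and not the method used in the reanalysis cited above (48). The cluster size and intracluster correlation are hypothetical values chosen for illustration.

```python
import math


def design_effect(cluster_size, icc):
    """Variance inflation for equal-sized clusters: 1 + (m - 1) * rho."""
    return 1.0 + (cluster_size - 1) * icc


def effective_sample_size(n_total, cluster_size, icc):
    """Number of independent observations the clustered sample is worth."""
    return n_total / design_effect(cluster_size, icc)


# Hypothetical inputs (assumptions, not taken from any trial).
m, rho, n = 1_000, 0.001, 50_000
deff = design_effect(m, rho)
print(f"design effect = {deff:.3f}")
print(f"effective sample size = {effective_sample_size(n, m, rho):.0f}")
print(f"standard errors inflate by a factor of {math.sqrt(deff):.3f}")
```

When the intracluster correlation is very small, the design effect stays close to 1 and a corrected analysis gives nearly the same answer, which is consistent with a reanalysis leaving results unchanged.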
Four of the trials compared mammography alone with usual care, and four compared mammography plus CBE with usual care. Because of lack of certainty that CBE is effective, and in consultation with USPSTF members, we decided that these trials were qualitatively homogeneous. The homogeneity of the trials was also assessed by using the standard chi-square test. The P value was greater than 0.1, indicating the effect sizes estimated by the studies are homogeneous.
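The exact form of the chi-square test is not spelled out here. One common choice is Cochran's Q, which the sketch below assumes. The log relative risks and standard errors are hypothetical inputs, and the tail probability is computed with a standard series for the incomplete gamma function so that the example needs only the standard library.

```python
import math


def cochran_q(log_rr, se):
    """Return (Q, degrees of freedom, inverse-variance pooled log RR)."""
    weights = [1.0 / s ** 2 for s in se]
    pooled = sum(w * y for w, y in zip(weights, log_rr)) / sum(weights)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, log_rr))
    return q, len(log_rr) - 1, pooled


def chi2_sf(x, df):
    """Upper-tail chi-square probability via the lower incomplete gamma series."""
    if x <= 0:
        return 1.0
    a, half_x = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    denom = a
    for _ in range(1000):
        denom += 1.0
        term *= half_x / denom
        total += term
        if term < 1e-15 * total:
            break
    lower = total * math.exp(-half_x + a * math.log(half_x) - math.lgamma(a))
    return max(0.0, 1.0 - lower)


# Hypothetical study-level results (assumed values, not the trial data).
log_rr = [math.log(0.80), math.log(0.75), math.log(0.90), math.log(0.85)]
se = [0.10, 0.15, 0.12, 0.20]

q, df, pooled = cochran_q(log_rr, se)
print(f"Q = {q:.2f} on {df} df, P = {chi2_sf(q, df):.2f}")
```

A P value above 0.1 means the test did not detect heterogeneity. With only a handful of trials the test has limited power, so such a result is compatible with homogeneity rather than proof of it.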
We conducted two meta-analyses to address two key questions posed by the USPSTF: 1) Does mammography reduce breast cancer mortality rates among women over a broad range of ages when compared with usual care? and 2) If so, does mammography reduce breast cancer mortality rates among women 40 to 49 years of age when compared with usual care? In the first analysis, we included all data from the seven fair-quality trials, treating the two Canadian studies as one trial in participants 40 to 59 years of age. In the second analysis, we included the six fair-quality trials that reported results for women younger than 50 years of age.
We conducted each meta-analysis in two parts. First, using WinBUGS software, we constructed a two-level Bayesian random-effects model to estimate the effect size from multiple data points for each study and to derive a pooled estimate of relative risk reduction and credible interval for a given length of follow-up (11). The purpose of this analysis was to use repeated measures of the effect over time to estimate the relationship between length of follow-up and effect size. Appendix Table 2 shows the data we used in this analysis. Second, we pooled the most recent results of each trial to calculate the absolute and relative risk reduction, using the results of the first analysis to estimate the mean length of observation. Risks were modeled on the logit scale.
Appendix Table 2.
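To model the relationship between length of follow-up and relative risk, a two-level hierarchical model was used. The first level was the result of a trial at a given average or median follow-up time, $x_{ij}$, where $i$ indexes the trial and $j$ indexes the data point within a trial. The second level was the trial itself. The model allows for within-trial and between-trial variability. Specifically, the model was:
$$
\begin{aligned}
\alpha^{*} &\sim \mathrm{Normal}(\cdot, \cdot) \\
\beta^{*} &\sim \mathrm{Normal}(\cdot, \cdot) \\
\alpha_{i} &\sim \mathrm{Normal}(\alpha^{*}, \sigma^{2}_{\alpha}) \\
\beta_{i} &\sim \mathrm{Normal}(\beta^{*}, \sigma^{2}_{\beta}) \\
\mu_{ij} &= \alpha_{i} + \beta_{i} x_{ij} + \tau \cdot z_{ij} \\
\tau &\sim \Gamma(\cdot, \cdot) \\
\log RR_{ij} &\sim \mathrm{Normal}(\mu_{ij}, s^{2})
\end{aligned}
$$
A global regression curve was estimated as $\log RR = \alpha^{*} + \beta^{*} x$. The random effect was $\tau \cdot z_{ij}$. The model to estimate summary risk was
$$
\begin{aligned}
\#\,\mathrm{deaths}_{\mathrm{control},i} &\sim \mathrm{Binomial}(\pi_{\mathrm{control},i}, n_{\mathrm{control},i}) \\
\#\,\mathrm{deaths}_{\mathrm{intervention},i} &\sim \mathrm{Binomial}(\pi_{\mathrm{intervention},i}, n_{\mathrm{intervention},i}) \\
\mathrm{logit}(\pi_{\mathrm{control},i}) &= \alpha + \tau \cdot z_{i} \\
\mathrm{logit}(\pi_{\mathrm{intervention},i}) &= \alpha + \beta + \tau \cdot z_{i} \\
\alpha &\sim \mathrm{Normal}(\cdot, \cdot)
\end{aligned}
$$
Absolute risk difference was calculated as $\pi_{\mathrm{control},i} - \pi_{\mathrm{intervention},i}$. Relative risk was calculated as $\exp(\beta)$.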
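The step from the logit-scale parameters to the reported quantities can be sketched as follows. The values of α and β are hypothetical and are not estimates from the meta-analysis; the function names are illustrative. On the logit scale, exp(β) is strictly an odds ratio. When the outcome is as rare as breast cancer death in these trials, it is numerically very close to the ratio of risks, which the sketch also computes for comparison.

```python
import math


def inv_logit(x):
    """Map a log-odds value back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def summarize(alpha, beta, random_effect=0.0):
    """Risk difference and ratio measures implied by the logit-scale model."""
    p_control = inv_logit(alpha + random_effect)
    p_intervention = inv_logit(alpha + beta + random_effect)
    return {
        "p_control": p_control,
        "p_intervention": p_intervention,
        "risk_difference": p_control - p_intervention,
        "exp_beta": math.exp(beta),
        "risk_ratio": p_intervention / p_control,
    }


# Hypothetical parameters: control-group risk of 5 per 1000 and exp(beta) = 0.84.
alpha = math.log(0.005 / 0.995)
beta = math.log(0.84)

for name, value in summarize(alpha, beta).items():
    print(f"{name}: {value:.5f}")
```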
The models were estimated by using a Bayesian data analytic framework (121). The data were analyzed by using WinBUGS (11), which uses Gibbs sampling to simulate posterior probability distributions. Noninformative (proper) prior probability distributions were used: $\mathrm{Normal}(0, 10^{6})$ and $\Gamma(0.001, 0.001)$. Five separate Markov chains with overdispersed initial values were used to generate draws from posterior distributions. Point estimates (mean) and 95% credible intervals (2.5 and 97.5 percentiles) were derived from the subsequent 5 × 10 000 draws after reasonable convergence of the five chains was attained. The code to model the data in WinBUGS is available from the authors on request.
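The final summarization step (pooling draws across chains, then taking the mean and the 2.5 and 97.5 percentiles) can be sketched in a few lines. This is not the WinBUGS code and does not perform Gibbs sampling. The draws below are synthetic stand-ins generated from a normal distribution with assumed parameters, used only so the example runs.

```python
import random
import statistics


def summarize_draws(chains):
    """Pool draws from all chains; return (mean, 2.5th, 97.5th percentile)."""
    pooled = [draw for chain in chains for draw in chain]
    # n=40 yields cut points every 2.5%; the first and last bound the central 95%.
    cuts = statistics.quantiles(pooled, n=40, method="inclusive")
    return statistics.fmean(pooled), cuts[0], cuts[-1]


random.seed(0)
# Five synthetic "chains" of 10 000 draws each (assumed values, not real posterior output).
fake_chains = [
    [random.gauss(-0.17, 0.05) for _ in range(10_000)] for _ in range(5)
]

mean_log_rr, lower, upper = summarize_draws(fake_chains)
print(f"posterior mean log RR = {mean_log_rr:.3f}")
print(f"95% credible interval = ({lower:.3f}, {upper:.3f})")
```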
Our review was begun early in 2000. A first draft was presented to the USPSTF in December 2000. Throughout 2001, the manuscript underwent extensive critical review by a broad range of experts. Subsequent versions were reviewed by the USPSTF in September 2001 and in January 2002.
Copyright © 2016 American College of Physicians. All Rights Reserved.
Print ISSN: 0003-4819 | Online ISSN: 1539-3704