The Effects of Pay-for-Performance Programs on Health, Health Care Use, and Processes of Care: A Systematic Review

Background: The benefits of pay-for-performance (P4P) programs are uncertain. examining the effects of level on process-of-care and patient outcomes in ambulatory and inpatient settings. population and program characteristics and incentive targets. Conclusion: Pay-for-performance programs may be associated with improved processes of care in ambulatory settings, but consistently positive associations with improved health outcomes have not been demonstrated in any setting.

Pay-for-performance (P4P) programs provide financial rewards or penalties to individual health care providers, groups of providers, or institutions according to their performance on measures of quality. In theory, if properly targeted and designed, P4P programs would help drive the behavior of providers and health care systems to improve the quality of care delivered, reduce unnecessary use of expensive health care services, and improve patient health outcomes (1). The idea is particularly relevant in the United States, where serious and broad gaps in health care quality have been tied in part to the long-standing fee-for-service system, which may provide incentives for service volume rather than quality (2).
Despite their intuitive appeal, the promise of P4P programs in improving outcomes has not been empirically realized in past studies (3-6). The most recent systematic review examining the effectiveness of P4P programs in the United States found mixed evidence that P4P was associated with modest improvements in process-of-care outcomes but had little effect on patient outcomes (7). However, the literature has grown considerably since this review (which searched through 2012), and other countries, such as the United Kingdom, have gained considerable experience with large P4P initiatives that may provide information relevant to the United States. The purpose of the current review is to update and expand the prior systematic review in order to summarize current understanding of the effects of P4P programs targeted at physicians, groups, and institutions on process-of-care and patient outcomes in ambulatory and inpatient settings in and outside the United States.

METHODS
This review was conducted according to a protocol that was developed using established reporting standards and posted to a public Web site (8) before the study was initiated (Appendix 1 of the Supplement, available at Annals.org). We used an analytic framework based on work by Damberg and colleagues (7) (Appendix 2 of the Supplement). 29 February 2016). We also performed targeted Google and PubMed searches aimed at well-known P4P demonstrations. We obtained additional articles from reference lists of pertinent studies, reviews, editorials, and expert recommendations. The search strategies are detailed in Appendix 3 of the Supplement.

Study Selection
Investigators reviewed titles and abstracts identified from literature searches. Two investigators independently assessed each potentially relevant article for inclusion using preestablished criteria (Appendices 4 and 5 of the Supplement). We included English-language studies of adult patients that evaluated ambulatory care- or hospital-based P4P programs targeting health care providers at the individual, group, managerial, or institutional level and that reported any process-of-care, utilization, health, or intermediate health (clinical measures, such as a laboratory value or blood pressure) outcome. We included studies from other countries that have health systems similar to portions of the U.S. health care system. Studies examining only patient-targeted financial incentives, as well as payment models other than direct P4P, such as managed care, capitation, bundled payments, and accountable care organizations, were excluded. We also excluded studies that were not conducted in hospital or ambulatory settings, such as studies in long-term care facilities or nursing homes.
We included clinical or cluster randomized, controlled trials (RCTs) of any size. We used a best-evidence approach, which is a method of specifying minimum inclusion criteria for nonrandomized studies (9). Inclusion of observational studies was limited to those with a comparison group, interrupted time series (ITS) studies, or large (n > 10 000) cross-sectional or uncontrolled before-after studies. We excluded smaller uncontrolled studies because we had identified a large number of potentially relevant studies during a preliminary search and because the smaller uncontrolled studies were less likely to provide broadly applicable information given their limited scope and inherent methodological deficiencies.
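As an illustration only, the study-design criteria described above can be written as a small decision rule. The sketch below is not part of the review's methods: the design labels, the function name, and the example studies are hypothetical, and it covers only the design criteria, not the population, setting, or outcome criteria.

```python
# Hypothetical design labels; the review itself does not define a coding scheme.
RANDOMIZED = {"rct", "cluster_rct"}
LARGE_UNCONTROLLED = {"cross_sectional", "uncontrolled_before_after"}
MIN_LARGE_N = 10_000  # threshold stated in the text (n > 10 000)


def meets_design_criteria(design, n=0, has_comparison_group=False):
    """Return True if a study design passes the minimum design criteria."""
    if design in RANDOMIZED:
        return True  # randomized trials of any size
    if design == "its":
        return True  # interrupted time series studies
    if has_comparison_group:
        return True  # observational studies with a comparison group
    if design in LARGE_UNCONTROLLED:
        return n > MIN_LARGE_N  # only large uncontrolled studies
    return False


# Made-up example studies, for illustration only.
examples = [
    ("cluster_rct", 40, False),
    ("controlled_before_after", 800, True),
    ("uncontrolled_before_after", 500, False),
    ("uncontrolled_before_after", 25_000, False),
]
for design, n, has_control in examples:
    print(design, n, meets_design_criteria(design, n, has_control))
```

Under this rule, a small uncontrolled before-after study is excluded, whereas the same design with more than 10 000 participants is retained.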

Data Extraction and Quality Assessment
One investigator abstracted data elements from each included study, which were reviewed for accuracy by at least 1 additional investigator. We abstracted information on study design, sample size, country, program description, incentive structure (size and timing), target of the incentive, comparator, and outcomes (grouped as health, intermediate health, process-of-care, and utilization measures). Appendices 6 and 7 of the Supplement report these data. We classified studies according to 4 broad groupings: RCTs, ITS studies, controlled before-after studies, and uncontrolled before-after studies. Two investigators independently assessed study quality using the Newcastle-Ottawa Scale (10) for observational studies and the Cochrane Risk-of-Bias tool (11) for RCTs (Appendix 8 of the Supplement). Disagreements were resolved by consensus.

Data Synthesis and Analysis
We qualitatively synthesized the results of ambulatory and hospital studies separately and report process-of-care and patient outcomes for each setting. We synthesized results for specific P4P programs whenever possible. The review team evaluated the strength of the evidence according to guidance from the Agency for Healthcare Research and Quality (12). We did not perform meta-analysis because of the marked clinical heterogeneity across studies and the large number of observational studies.

Role of the Funding Source
The U.S. Department of Veterans Affairs Quality Enhancement Research Initiative supported this review but had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; or decision to submit the manuscript for publication.

Search Results
We reviewed 3418 titles and abstracts, identified 586 potentially eligible full-text articles, and ultimately included 69 studies (Figure). Fifty-eight studies were in   (36,83). Taiwan's DM-P4P program, implemented in 2001, allows physicians to voluntarily enroll in the program, and they in turn are given freedom to choose which patients to enroll (51).

Process-of-Care Outcomes
We found 9 studies from the United States evaluating the effects of P4P on process-of-care outcomes (14, 16-20, 22-24). Most of these studies examined outcomes over 4 years and had an average follow-up of 2.5 years; very few studies reported longer-term data. One RCT found that individual incentives increased appropriate response to high blood pressure but not use of guideline-recommended antihypertensive medication (14). Of the 6 studies that reported positive results (16, 18, 19, 22-24), 1 did not have a control group (24), and selection bias was a serious concern in 3 others because of the way the control group was chosen (18,22,23). Two methodologically sound controlled before-after studies found no improvements in processes of care (17,20).
In general, there was evidence across 17 studies in the United Kingdom (26-31, 33, 36-38, 41-47) that the QOF was associated with improvements in process-of-care measures, although the evidence was mixed among the more methodologically rigorous studies. There were 6 ITS studies. One showed substantial improvements in the prescription of long-acting reversible contraceptives (26), and another showed modest improvement in the initiation of diabetes medications (27). Another study found increased rates of depression screening and diagnoses, but antidepressant prescribing remained unchanged (31). In the other 3 studies, improvements had begun well before QOF implementation, and postintervention trends did not show substantial improvement and, in fact, showed slower or decreased improvement over time (28-30).
Although many studies of Taiwan's DM-P4P program showed improvement in process-of-care measures, selection bias was a major concern (51-54, 58). Physicians voluntarily enrolled and were given discretion over which patients to enroll. Because the program lacked risk adjustment and, initially, a mechanism to disenroll patients, physicians had a strong incentive to enroll healthier patients (51). Indeed, enrolled patients were much healthier than nonenrolled patients. Moreover, at participating institutions, the pool of nonenrolled patients became sicker over time, indicating that healthier patients were being removed to participate in DM-P4P. Though many studies attempted to adjust for differences in the 2 groups by using propensity score matching, residual confounding was still an important potential issue given the many unmeasured factors that were likely to be related to enrollment decision making.
We found 13 non-U.S. studies that were not part of a larger P4P evaluation. Two of these studies were methodologically sound observational studies from Canada that reported contradictory results on screening and preventive measures (66,67). An ITS study found modest increases in colorectal cancer screening but no effects on cervical and breast cancer screening (66). However, a controlled before-after study found modest increases for colorectal cancer screening, mammography, flu shots, and Papanicolaou smears (67). It was difficult to draw strong conclusions from the other 11 studies because of disparities in the programs' targets and designs and the study settings, as well as the low quality of the study designs (49, 50, 61-65, 68-71).

Patient Outcomes
Health Outcomes. Ten studies evaluated health outcomes in ambulatory settings (39, 44, 51, 52, 55-57, 59-61). Eight of the studies (most of which found positive results) were conducted in Taiwan and should be interpreted with caution due to selection bias, as described earlier (51, 52, 55-57, 59-61). Two large uncontrolled before-after studies of QOF reported no improvements in health outcomes (39,44). One assessed the correlation among regional QOF performance, all-cause mortality, and condition-specific mortality (39). It found that better performance on both the aggregate of QOF quality indicators and a subset of intermediate outcome indicators did not correlate with reduced mortality. Another study found that chronic obstructive pulmonary disease (COPD) prevalence actually increased from 1.27% to 1.45% after QOF implementation (44). Given the time needed to develop COPD and that most QOF indicators focused on managing COPD rather than preventing it, the implications of these findings are unclear. Studies with high risk of bias generally found positive effects associated with DM-P4P (51, 52, 55-57, 59) and the similarly structured tuberculosis P4P program (60,61). However, given the limitations already highlighted, such results are difficult to interpret.
Utilization Outcomes. The 6 studies from the United States reported mixed findings on the effects of P4P on utilization, although studies with the strongest designs showed no effect. One rigorously controlled study examined a P4P intervention that provided bonuses to practices that achieved advanced medical home status and found no effect on all-cause hospitalizations, all-cause emergency department (ED) visits, or ambulatory care-sensitive ED visits (17). Ambulatory care-sensitive hospitalizations actually increased in the second year of the intervention. Another controlled before-after study examined P4P in 3 state Medicaid programs and found no changes in any of the states for ED visits and inconsistent findings on inpatient utilization (20). A study examining a P4P program in medical homes targeting improved diabetes screenings and care found reductions in ED use and primary care visits but not in 6 other utilization measures (21). One study of a Medicare Advantage plan that rewarded physicians for providing evidence-based care to patients with heart failure found no effect on acute admissions or ED visits (16). Two studies lacking appropriate control groups showed improvement in ED use (22,24).
Studies in Taiwan generally found reductions in hospital use associated with P4P (52-54, 58, 60). Again, due to the high likelihood of selection bias, these studies should be interpreted with caution.
A QOF study found a sustained reduction in ambulatory care-sensitive ED admissions (34).
Intermediate Health Outcomes. Twelve studies reported 1 or more intermediate health outcomes (13,14,25,32,35,37,38,40,41,43,47,48). There were 2 RCTs with low risk of bias conducted in the United States. One RCT (n = 1503) evaluated the effect of a P4P program on low-density lipoprotein cholesterol levels (13). Physicians were given monthly patient progress reports and were eligible for comparatively large P4P bonuses ($256 quarterly per patient) that were separated from other funding sources to highlight their relevance. Physicians received average total incentive payments of $3246. The difference in low-density lipoprotein cholesterol level between patients seen by physicians in the P4P and control groups was not significant (2.8 mg/dL [95% CI, −1.7 to 7.4 mg/dL]; P = 0.66).
The other RCT (14) was included in the prior review by Damberg and colleagues, but a substudy was recently published (15). The original trial compared the effect of financial incentives earned for controlled blood pressure or response to uncontrolled blood pressure across 4 groups: incentives directed to individual physicians, practices, or both, or no incentives (14). The study included 77 physicians; payments and performance feedback were delivered to physicians at the end of each 4-month performance period. The average total payment for physicians completing the entire program was $2744. A higher proportion of patients achieved one or both measures in the individual physician incentive group than the control group (difference, 8.36% [CI, 2.4% to 13.0%]; P = 0.005), although the differences were not significant in the other 2 intervention groups. The recently published substudy found that the proportion of patients achieving control was not significantly higher in the incentive group (15).
Ten observational studies examining QOF reported mixed findings on intermediate outcomes (25,32,35,37,38,40,41,43,47,48), but methodologically stronger studies suggested that QOF had little effect. Uncontrolled studies suggested large improvements in blood pressure control, cholesterol levels, and hemoglobin A1c (HbA1c) control. However, higher-quality studies that accounted for time trends failed to replicate these findings (25,32). One short-term ITS study found that blood pressure control and cholesterol levels improved but HbA1c control worsened relative to the underlying trend (32). A longer-term ITS study found that although mean cholesterol and HbA1c levels and blood pressure control had been improving before QOF implementation, only systolic blood pressure continued to improve afterward. Diastolic blood pressure, mean cholesterol levels, and HbA1c levels actually worsened relative to the pre-QOF trend (25).

Process-of-Care Outcomes
Eight studies examined process-of-care measures in the hospital setting (74-77, 79-82). Controlled before-after studies from the United States and Canada generally failed to find improvements in care processes (74,75), although 1 study from Canada did report modest reductions in ED wait times (80). One controlled study from Taiwan found that P4P-enrolled patients with breast cancer received better-quality care than nonenrolled patients (79). Uncontrolled studies reported larger improvements (76,77,81,82).

Patient Outcomes
Health Outcomes. Pay-for-performance programs generally did not decrease mortality or improve patient experience in 5 studies in hospital settings (73,74,78,79,82). High-quality studies examining the U.K. Hospital Quality Incentive demonstration and the U.S. Hospital Value-Based Purchasing (HVBP) programs did not find an association between the programs and mortality for targeted conditions (73,78). One short-term controlled before-after study found no immediate change in patient experience associated with the HVBP program (74). One uncontrolled study found that mortality related to hemorrhagic strokes did not decrease after implementation of P4P (82). A study from Taiwan indicated that P4P patients had improved breast cancer survival (79).
Utilization Outcomes. One ITS study reported utilization outcomes (72) and found that hospital readmissions among Medicare fee-for-service patients decreased sharply for approximately 2 years after implementation of the Hospital Readmissions Reduction Program; improvements continued thereafter but at a substantially lower rate. Although readmission reductions were seen for various conditions, they decreased more among the measures that were specifically targeted by the program than those that were not.

DISCUSSION
This systematic review of 69 studies updated and expanded on a previous review that had focused on U.S. programs and reported similar findings (7). The strength of the evidence and key results are summarized in Table 3. Overall, in the ambulatory setting, we found low-strength evidence that P4P programs may improve process-of-care outcomes over the short term (2 to 3 years). Evidence on the longer-term effects of P4P programs was limited. Many of the studies reporting positive findings were conducted in the United Kingdom, where incentives were much larger than any P4P programs in the United States. The largest improvements were seen in areas where baseline performance was poor. We found low-strength evidence that P4P had little to no effect on intermediate health outcomes (changes in laboratory measures), though there were inconsistencies among study results. The evidence examining patient health outcomes was insufficient because few methodologically rigorous studies reported these outcomes. In the hospital setting, low-strength evidence showed that P4P had a neutral effect on patient health outcomes and a positive effect on reducing hospital readmissions. Although many studies found positive effects associated with P4P programs, the results were inconsistent across studies, the magnitude of effect was often small, and it was difficult to confidently ascribe observed changes in outcomes to the intervention itself because of the observational nature of most studies and their specific methodological flaws. To better characterize the breadth of programs that have been evaluated, we included large uncontrolled studies reporting outcomes before and after program implementation. However, in all of these studies, the 2 measurements potentially reflect the peak and average of normally expected measurement variation (a phenomenon known as regression to the mean). The controlled before-after studies do not have this same issue, but the choice of control group was problematic in many studies because either the patients who qualified for a P4P program differed systematically from those who did not, or the participating providers or practices differed substantially from those that did not participate.
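The regression-to-the-mean concern for uncontrolled before-after comparisons can be illustrated with a small simulation. The sketch below is hypothetical and uses no data from the included studies: every practice has a stable underlying performance score, each measurement adds random noise, and no intervention occurs. All distributions, parameter values, and names are assumptions chosen for illustration.

```python
import random


def simulate_before_after(n_practices=5000, seed=0):
    """Simulate two noisy measurements per practice with no intervention."""
    rng = random.Random(seed)
    baseline, follow_up = [], []
    for _ in range(n_practices):
        true_score = rng.gauss(70.0, 5.0)  # stable underlying performance
        baseline.append(true_score + rng.gauss(0.0, 10.0))   # noisy 1st measure
        follow_up.append(true_score + rng.gauss(0.0, 10.0))  # noisy 2nd measure

    # Select the practices that looked worst at baseline (lowest decile).
    order = sorted(range(n_practices), key=lambda i: baseline[i])
    worst = order[: n_practices // 10]

    def mean(values):
        return sum(values) / len(values)

    return {
        "selected_baseline": mean([baseline[i] for i in worst]),
        "selected_follow_up": mean([follow_up[i] for i in worst]),
        "all_baseline": mean(baseline),
        "all_follow_up": mean(follow_up),
    }


result = simulate_before_after()
apparent_gain = result["selected_follow_up"] - result["selected_baseline"]
overall_change = result["all_follow_up"] - result["all_baseline"]
print(f"Apparent gain among worst baseline performers: {apparent_gain:.1f}")
print(f"Change across all practices: {overall_change:.1f}")
```

In this setup, practices selected for an extreme baseline value appear to improve at follow-up even though nothing changed, while the average across all practices stays roughly flat. A comparison group or a pre-intervention trend is needed to separate this artifact from a program effect.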
The ITS studies were useful because they accounted for trends in outcomes before the intervention. Indeed, several of these studies showed that improvements in outcomes had begun before P4P implementation. It is unclear whether these reflected secular trends in health care or practice changes in anticipation of intervention implementation.
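A minimal sketch of how an ITS analysis separates a preexisting trend from a program effect is given below. It assumes a simple segmented linear regression with one level-change term and one slope-change term, fitted by ordinary least squares in plain Python; the included studies may have used different models. The series is made up for illustration, and the function names are hypothetical.

```python
def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_its(y, start):
    """Least-squares fit of y_t = b0 + b1*t + b2*post + b3*(t - start)*post.

    b1 is the pre-intervention slope, b2 the level change at `start`,
    and b3 the change in slope after `start`.
    """
    rows = []
    for t in range(len(y)):
        post = 1.0 if t >= start else 0.0
        rows.append([1.0, float(t), post, (t - start) * post])
    k = 4
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yt for r, yt in zip(rows, y)) for i in range(k)]
    return solve_linear(xtx, xty)


# Made-up series: performance rises 2 points per period before the program
# (periods 0-5) and only 1 point per period afterward (periods 6-11).
start = 6
series = [60 + 2 * t - (t - start if t >= start else 0) for t in range(12)]

b0, b1, b2, b3 = fit_its(series, start)
naive_change = sum(series[start:]) / 6 - sum(series[:start]) / 6  # 9.5 here

print(f"Naive before-after change: {naive_change:+.1f}")
print(f"Pre-trend slope: {b1:+.2f}, level change: {b2:+.2f}, slope change: {b3:+.2f}")
```

In this constructed example, a naive before-after comparison suggests a gain, whereas the segmented fit attributes it to the preexisting upward trend and shows that improvement slowed after implementation. The example has no noise and no autocorrelation, both of which a real ITS analysis would have to address.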
Our findings complement and add to prior reviews, which have also generally found that P4P programs have not been consistently effective in improving patient outcomes (3-7). There are several reasons why this might be the case. First, especially in the era of modern health reform, P4P programs have been implemented and assessed in settings where other effective quality improvement interventions (such as public reporting, audit and feedback, and electronic decision-support tools) may have been deployed (84). The incremental benefit of P4P may therefore have been more difficult to demonstrate.
Second, it is possible that P4P programs have not tested the "best" incentive structures and payment mechanisms. Experts have suggested the importance of designing P4P programs using the principles of behavioral economics, in which such factors as payment size, timing, and frequency are believed to have important influences on individual behavior (85). In health care, we have not found strong empirical data to help determine the most successful incentive structure (86). It is interesting to consider the United Kingdom's QOF program, which accounted for nearly 40% of the included studies in our review, alongside U.S. efforts. Studies of QOF found that incentivized process-of-care measures can lead to improvements, especially in the early years of program implementation, but the rate of improvement slowed over time and there was no clear evidence that QOF improved patient outcomes. Whereas the P4P programs in the United States tended to be implemented within health systems or payers and involve relatively small incentives, QOF is the largest P4P program ever attempted in health care. It was implemented nationally with a single payer that includes virtually all general practitioners and provides practices with up to 30% of their annual income.
Finally, P4P programs are very complex health system interventions that have been implemented in various ways. In a related article, we examined the implementation factors that might mediate the potentially beneficial and harmful effects of P4P programs (86). We systematically reviewed studies of implementation factors and also conducted interviews with experts in the field of P4P. Although direct evidence was inadequate to draw strong conclusions, we found that provider buy-in and alignment of measures with organizational goals were likely to be important in sustaining effective programs.
We found that measures that were transparently developed from the evidence base and that were focused on improving clinical processes and patient outcomes rather than measures of efficiency were more likely to be effective. We also found that the overall number of incentives in place at any one time needs to be carefully considered. Given the evidence that the most substantial gains were consistently seen in areas of poor baseline performance, we suggested that organizations use incentives in the most-needed areas, review measures regularly, and discontinue them after achieving sustained improvements.
Our review has several important limitations. The evidence is limited by methodological flaws, variation in program and population characteristics, and limited reporting on secular trends in health care. We chose to include studies from other countries because the breadth of experience with P4P might be informative for some U.S. health systems, but we acknowledge that there are also limitations in applying findings from other countries broadly in the United States. Our review expands on a prior review, so it is possible we did not include some individual studies that are informative, though these probably would not have altered our summary findings.
The policy implications of our findings are open to interpretation. In the absence of strong evidence of benefit, it may be particularly important to consider the potential harms and costs associated with P4P. We recently published a systematic review of the unintended consequences of P4P: There was very limited evidence assessing the extent of gaming, no consistent evidence of a negative effect on health disparities, and a small amount of evidence suggesting the potential for both positive and negative effects on unincentivized measures (87). The costs and burden of documentation and reporting requirements associated with P4P programs are also important to consider but have not been studied extensively. Qualitative studies have found that providers perceive P4P programs as imposing a considerable burden and threatening clinical autonomy (88-90). A recent survey study found that U.S. health care providers self-report spending about 15 hours per week reporting and interpreting data for measures, which translates into billions of dollars in opportunity cost (91). Indeed, the United Kingdom decided to scale back its QOF program after 10 years of experience, in part because of provider concerns and the inconsistency of data demonstrating long-term benefit (92).
On the other hand, P4P programs have likely been effective in some areas, most notably in improving processes of care. The lack of evidence on patient outcomes may reflect deficiencies in the methods that have been used to study these effects and the likelihood that it takes a long time for process-of-care improvements to translate into large-scale patient outcome improvements (93).
In summary, we found low-strength, contradictory evidence that P4P programs could improve processes of care, but we found no clear evidence to suggest that they improve patient outcomes. Value-based purchasing is a cornerstone of the coming Medicare reform known as the Medicare Access and CHIP Reauthorization Act, so P4P will remain a fixture in U.S. health care for the foreseeable future (94). Whether the inconsistency of positive findings suggests that P4P, broadly speaking, is unlikely to have large effects or is related to the marked differences in program design, patient population, and incentive target is unclear.