We compared mean scores for each of the 3 methods to determine how well vignettes and chart abstraction measured actual quality relative to the standardized patient benchmark. We disaggregated these comparisons by disease, site, case complexity, and physician training level. We evaluated the statistical significance of the differences in mean scores among the 3 methods with the F test from an analysis of variance (ANOVA) model that accounted for the matching of vignette, standardized patient, and chart abstraction scores for each physician for each case. Specifically, the 3-way crossed, 1-way nested model included factors for site, physician training level, quality measurement method, and physician (nested within site), plus a site-by-method interaction. Where differences among the means for the 3 methods were statistically significant, we used the Tukey–Kramer multiple comparison procedure to evaluate comparisons between pairs of methods at a global 5% significance level. We also added other interaction terms (method by disease, method by case complexity, and method by physician training level) to the ANOVA model to assess the consistency of the results across these factors. We estimated the 95% CIs by using adjusted errors to account for the nested study design.
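As a rough illustration of how matching enters the F test, the sketch below computes the F statistic for the method effect in a simple blocked layout, where each row holds the 3 scores for one physician-case pair. This is a deliberately simplified stand-in, not the model described above: it omits site, training level, the nesting of physicians within sites, and all interaction terms. The function name `blocked_anova_f`, the result keys, and the example scores are hypothetical and are not study data.

```python
def blocked_anova_f(scores):
    """F statistic for the method effect, treating each row as one matched block.

    scores: list of rows, one row per physician-case pair, one column per
    measurement method (here: standardized patient, vignette, chart abstraction).
    """
    n_blocks = len(scores)
    n_methods = len(scores[0])
    all_values = [x for row in scores for x in row]
    grand_mean = sum(all_values) / len(all_values)

    # Mean score per method (column) and per matched block (row).
    method_means = [sum(row[j] for row in scores) / n_blocks
                    for j in range(n_methods)]
    block_means = [sum(row) / n_methods for row in scores]

    # Partition the total sum of squares into method, block, and residual parts.
    ss_total = sum((x - grand_mean) ** 2 for x in all_values)
    ss_method = n_blocks * sum((m - grand_mean) ** 2 for m in method_means)
    ss_block = n_methods * sum((b - grand_mean) ** 2 for b in block_means)
    ss_error = ss_total - ss_method - ss_block

    df_method = n_methods - 1
    df_error = (n_methods - 1) * (n_blocks - 1)
    f_stat = (ss_method / df_method) / (ss_error / df_error)

    return {
        "method_means": method_means,
        "df_method": df_method,
        "df_error": df_error,
        "f_stat": f_stat,
    }


# Hypothetical percentage-correct scores, made up for illustration only.
# Columns: standardized patient, vignette, chart abstraction.
example_scores = [
    [80, 75, 60],
    [70, 68, 55],
    [90, 82, 70],
    [60, 59, 47],
]

result = blocked_anova_f(example_scores)
print(result)
```

Removing the between-block variation from the error term is what lets the matched design detect method differences that an unmatched comparison could miss. Turning the F statistic into a P value, and running the Tukey–Kramer pairwise comparisons, would require the F and studentized range distributions from a statistics library, which this sketch leaves out.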