Karel G.M. Moons, PhD; Douglas G. Altman, DSc; Johannes B. Reitsma, MD, PhD; John P.A. Ioannidis, MD, DSc; Petra Macaskill, PhD; Ewout W. Steyerberg, PhD; Andrew J. Vickers, PhD; David F. Ransohoff, MD; Gary S. Collins, PhD
Disclosures: Disclosures can be viewed at www.acponline.org/authors/icmje/ConflictOfInterestForms.do?msNum=M14-0698.
Grant Support: There was no explicit funding for the development of this checklist and guidance document. The consensus meeting in June 2011 was partially funded by a National Institute for Health Research Senior Investigator Award held by Dr. Altman, Cancer Research UK (grant C5529), and the Netherlands Organization for Scientific Research (ZONMW 918.10.615 and 91208004). Drs. Collins and Altman are funded in part by the Medical Research Council (grant G1100513). Dr. Altman is a member of the Medical Research Council Prognosis Research Strategy (PROGRESS) Partnership (G0902393/99558).
Requests for Single Reprints: Karel G.M. Moons, PhD, Julius Centre for Health Sciences and Primary Care, UMC Utrecht, PO Box 85500, 3508 GA Utrecht, the Netherlands; e-mail, K.G.M.Moons@umcutrecht.nl.
Current Author Addresses: Drs. Moons and Reitsma: Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, PO Box 85500, 3508 GA Utrecht, the Netherlands.
Drs. Altman and Collins: Centre for Statistics in Medicine, Nuffield Department of Orthopaedics, Rheumatology and Musculoskeletal Sciences, Botnar Research Centre, University of Oxford, Oxford OX3 7LD, United Kingdom.
Dr. Ioannidis: Stanford Prevention Research Center, School of Medicine, Stanford University, 291 Campus Drive, Room LK3C02, Li Ka Shing Building, 3rd Floor, Stanford, CA 94305-5101.
Dr. Macaskill: Screening and Test Evaluation Program (STEP), School of Public Health, Edward Ford Building (A27), Sydney Medical School, University of Sydney, Sydney, NSW 2006, Australia.
Dr. Steyerberg: Department of Public Health, Erasmus MC–University Medical Center Rotterdam, PO Box 2040, 3000 CA, Rotterdam, the Netherlands.
Dr. Vickers: Department of Epidemiology and Biostatistics, Memorial Sloan Kettering Cancer Center, 307 East 63rd Street, 2nd Floor, Box 44, New York, NY 10065.
Dr. Ransohoff: Departments of Medicine and Epidemiology, University of North Carolina at Chapel Hill, 4103 Bioinformatics, CB 7080, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599-7080.
Author Contributions: Conception and design: K.G.M. Moons, D.G. Altman, J.B. Reitsma, P. Macaskill, G.S. Collins.
Analysis and interpretation of the data: K.G.M. Moons, D.G. Altman, J.B. Reitsma, J.P.A. Ioannidis, P. Macaskill, E.W. Steyerberg, A.J. Vickers, D.F. Ransohoff, G.S. Collins.
Drafting of the article: K.G.M. Moons, D.G. Altman, J.B. Reitsma, G.S. Collins.
Critical revision of the article for important intellectual content: K.G.M. Moons, D.G. Altman, J.B. Reitsma, J.P.A. Ioannidis, P. Macaskill, E.W. Steyerberg, A.J. Vickers, D.F. Ransohoff, G.S. Collins.
Final approval of the article: K.G.M. Moons, D.G. Altman, J.B. Reitsma, J.P.A. Ioannidis, P. Macaskill, E.W. Steyerberg, A.J. Vickers, D.F. Ransohoff, G.S. Collins.
Provision of study materials or patients: K.G.M. Moons, D.G. Altman, J.B. Reitsma, G.S. Collins.
Statistical expertise: K.G.M. Moons, D.G. Altman, J.B. Reitsma, P. Macaskill, E.W. Steyerberg, A.J. Vickers, G.S. Collins.
Obtaining of funding: K.G.M. Moons, D.G. Altman, G.S. Collins.
Administrative, technical, or logistic support: K.G.M. Moons, G.S. Collins.
Collection and assembly of data: K.G.M. Moons, D.G. Altman, G.S. Collins.
The TRIPOD (Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis) Statement includes a 22-item checklist, which aims to improve the reporting of studies developing, validating, or updating a prediction model, whether for diagnostic or prognostic purposes. The TRIPOD Statement aims to improve the transparency of the reporting of a prediction model study regardless of the study methods used. This explanation and elaboration document describes the rationale; clarifies the meaning of each item; and discusses why transparent reporting is important, with a view to assessing risk of bias and clinical usefulness of the prediction model. Each checklist item of the TRIPOD Statement is explained in detail and accompanied by published examples of good reporting. The document also provides a valuable reference of issues to consider when designing, conducting, and analyzing prediction model studies. To aid the editorial process and help peer reviewers and, ultimately, readers and systematic reviewers of prediction model studies, it is recommended that authors include a completed checklist in their submission. The TRIPOD checklist can also be downloaded from www.tripod-statement.org.
Schematic representation of diagnostic and prognostic prediction modeling studies.
The nature of the prediction in diagnosis is estimating the probability that a specific outcome or disease is present (or absent) within an individual, at this point in time—that is, the moment of prediction (T = 0). In prognosis, the prediction is about whether an individual will experience a specific event or outcome within a certain time period. In other words, in diagnostic prediction the interest is in principle a cross-sectional relationship, whereas prognostic prediction involves a longitudinal relationship. Nevertheless, in diagnostic modeling studies, for logistical reasons, a time window between predictor (index test) measurement and the reference standard is often necessary. Ideally, this interval should be as short as possible without starting any treatment within this period.
Similarities and differences between diagnostic and prognostic prediction models.
Types of prediction model studies.
Types of prediction model studies covered by the TRIPOD statement.
D = development data; V = validation data.
Table 1. Checklist of Items to Include When Reporting a Study Developing or Validating a Multivariable Prediction Model for Diagnosis or Prognosis
Development and validation of a clinical score to estimate the probability of coronary artery disease in men and women presenting with suspected coronary disease (115). [Diagnosis; Development; Validation]
Development and external validation of prognostic model for 2-year survival of non-small cell lung cancer patients treated with chemoradiotherapy (116). [Prognosis; Development; Validation]
Predicting the 10-year risk of cardiovascular disease in the United Kingdom: independent and external validation of an updated version of QRISK2 (117). [Prognosis; Validation]
Development of a prediction model for 10-year risk of hepatocellular carcinoma in middle-aged Japanese: the Japan Public Health Center-based Prospective Study Cohort II (118). [Prognosis; Development]
Development and validation of a logistic regression derived algorithm for estimating the incremental probability of coronary artery disease before and after exercise testing (119). [Diagnosis; Development; Validation]
Validation of the Framingham coronary heart disease prediction scores: results of a multiple ethnic groups investigation (120). [Prognosis; Validation]
External validation of the SAPS II, APACHE II, and APACHE III prognostic models in South England: a multicentre study (121). [Prognosis; Validation]
OBJECTIVE: To develop and validate a prognostic model for early death in patients with traumatic bleeding.
DESIGN: Multivariable logistic regression of a large international cohort of trauma patients.
SETTING: 274 hospitals in 40 high, medium, and low income countries.
PARTICIPANTS: Prognostic model development: 20,127 trauma patients with, or at risk of, significant bleeding, within 8 hours of injury in the Clinical Randomisation of an Antifibrinolytic in Significant Haemorrhage (CRASH-2) trial. External validation: 14,220 selected trauma patients from the Trauma Audit and Research Network (TARN), which included mainly patients from the UK.
OUTCOMES: In-hospital death within 4 weeks of injury.
RESULTS: 3076 (15%) patients died in the CRASH-2 trial and 1765 (12%) in the TARN dataset. Glasgow coma score, age, and systolic blood pressure were the strongest predictors of mortality. Other predictors included in the final model were geographical region (low, middle, or high income country), heart rate, time since injury, and type of injury. Discrimination and calibration were satisfactory, with C statistics above 0.80 in both CRASH-2 and TARN. A simple chart was constructed to readily provide the probability of death at the point of care, and a web-based calculator is available for a more detailed risk assessment (http://crash2.lshtm.ac.uk).
CONCLUSIONS: This prognostic model can be used to obtain valid predictions of mortality in patients with traumatic bleeding, assisting in triage and potentially shortening the time to diagnostic and lifesaving procedures (such as imaging, surgery, and tranexamic acid). Age is an important prognostic factor, and this is of particular relevance in high income countries with an aging trauma population (123). [Prognosis; Development]
OBJECTIVE: To validate and refine previously derived clinical decision rules that aid the efficient use of radiography in acute ankle injuries.
DESIGN: Survey prospectively administered in two stages: validation and refinement of the original rules (first stage) and validation of the refined rules (second stage).
SETTING: Emergency departments of two university hospitals.
PATIENTS: Convenience sample of adults with acute ankle injuries: 1032 of 1130 eligible patients in the first stage and 453 of 530 eligible patients in the second stage.
MAIN OUTCOME MEASURES: Attending emergency physicians assessed each patient for standardized clinical variables and classified the need for radiography according to the original (first stage) and the refined (second stage) decision rules. The decision rules were assessed for their ability to correctly identify the criterion standard of fractures on ankle and foot radiographic series. The original decision rules were refined by univariate and recursive partitioning analyses.
MAIN RESULTS: In the first stage, the original decision rules were found to have sensitivities of 1.0 (95% confidence interval [CI], 0.97 to 1.0) for detecting 121 malleolar zone fractures, and 0.98 (95% CI, 0.88 to 1.0) for detecting 49 midfoot zone fractures. For interpretation of the rules in 116 patients, kappa values were 0.56 for the ankle series rule and 0.69 for the foot series rule. Recursive partitioning of 20 predictor variables yielded refined decision rules for ankle and foot radiographic series. In the second stage, the refined rules proved to have sensitivities of 1.0 (95% CI, 0.93 to 1.0) for 50 malleolar zone fractures, and 1.0 (95% CI, 0.83 to 1.0) for 19 midfoot zone fractures. The potential reduction in radiography is estimated to be 34% for the ankle series and 30% for the foot series. The probability of fracture, if the corresponding decision rule were “negative,” is estimated to be 0% (95% CI, 0% to 0.8%) in the ankle series, and 0% (95% CI, 0% to 0.4%) in the foot series.
CONCLUSION: Refinement and validation have shown the Ottawa ankle rules to be 100% sensitive for fractures, to be reliable, and to have the potential to allow physicians to safely reduce the number of radiographs ordered in patients with ankle injuries by one third. Field trials will assess the feasibility of implementing these rules into clinical practice (124). [Diagnosis; Validation; Updating]
Confronted with acute infectious conjunctivitis, most general practitioners feel unable to discriminate between a bacterial and a viral cause. In practice, more than 80% of such patients receive antibiotics. Hence, in cases of acute infectious conjunctivitis, many unnecessary ocular antibiotics are prescribed. … To select those patients who might benefit most from antibiotic treatment, the general practitioner needs an informative diagnostic tool to determine a bacterial cause. With such a tool, antibiotic prescriptions may be reduced and better targeted. Most general practitioners make the distinction between a bacterial cause and another cause on the basis of signs and symptoms. Additional diagnostic investigations, such as a culture of the conjunctiva, are seldom done, mostly because of the resulting diagnostic delay. Can general practitioners actually differentiate between bacterial and viral conjunctivitis on the basis of signs and symptoms alone? … A recently published systematic literature search summed up the signs and symptoms and found no evidence for these assertions. This paper presents what seems to be the first empirical study on the diagnostic informativeness of signs and symptoms in acute infectious conjunctivitis (130). [Diagnosis; Development]
In the search for a practical prognostic system for patients with parotid carcinoma, we previously constructed a prognostic index based on a Cox proportional hazards analysis in a source population of 151 patients with parotid carcinoma from the Netherlands Cancer Institute. [The] Table … shows the pretreatment prognostic index PS1, which combines information available before surgery, and the post treatment prognostic index PS2, which incorporates information from the surgical specimen. For each patient, the index sums the properly weighted contributions of the important clinicopathologic characteristics into a number corresponding to an estimated possibility of tumor recurrence. These indices showed good discrimination in the source population and in an independent nationwide database of Dutch patients with parotid carcinoma. According to Justice et al, the next level of validation is to go on an international level. … For this purpose, an international database was constructed from patients who were treated in Leuven and Brussels (Belgium) and in Cologne (Germany), where the prognostic variables needed to calculate the indices were recorded, and predictions were compared with outcomes. In this way, we tried to achieve further clinical and statistical validation (131). [Prognosis; Validation]
Any revisions and updates to a risk prediction model should be subject to continual evaluation (validation) to show that its usefulness for routine clinical practice has not deteriorated, or indeed to show that its performance has improved owing to refinements to the model. We describe the results from an independent evaluation assessing the performance of QRISK2 2011 on a large dataset of general practice records in the United Kingdom, comparing its performance with earlier versions of QRISK and the NICE adjusted version of the Framingham risk prediction model (117). [Prognosis; Validation]
The aim of this study was to develop and validate a clinical prediction rule in women presenting with breast symptoms, so that a more evidence based approach to referral—which would include urgent referral under the 2 week rule—could be implemented as part of clinical practice guidance (142). [Diagnosis; Development; Validation]
In this paper, we report on the estimation and external validation of a new UK based parametric prognostic model for predicting long term recurrence free survival for early breast cancer patients. The model's performance is compared with that of Nottingham Prognostic Index and Adjuvant Online, and a scoring algorithm and downloadable program to facilitate its use are presented (143). [Prognosis; Development; Validation]
Even though it is widely accepted that no prediction model should be applied in practice before being formally validated on its predictive accuracy in new patients, no study has previously performed a formal quantitative (external) validation of these prediction models in an independent patient population. Therefore, we first conducted a systematic review to identify all existing prediction models for prolonged ICU length of stay (PICULOS) after cardiac surgery. Subsequently, we validated the performance of the identified models in a large independent cohort of cardiac surgery patients (46). [Prognosis; Validation]
The population based sample used for this report included 2489 men and 2856 women 30 to 74 years old at the time of their Framingham Heart Study examination in 1971 to 1974. Participants attended either the 11th examination of the original Framingham cohort or the initial examination of the Framingham Offspring Study. Similar research protocols were used in each study, and persons with overt coronary heart disease at the baseline examination were excluded (144). [Prognosis; Development]
Data from the multicentre, worldwide, clinical trial (Action in Diabetes and Vascular disease: preterax and diamicron MR controlled evaluation) (ADVANCE) permit the derivation of new equations for cardiovascular risk prediction in people with diabetes. … ADVANCE was a factorial randomized controlled trial of blood pressure (perindopril indapamide versus placebo) and glucose control (gliclazide MR based intensive intervention versus standard care) on the incidence of microvascular and macrovascular events among 11,140 high risk individuals with type 2 diabetes … DIABHYCAR (The non insulin dependent diabetes, hypertension, microalbuminuria or proteinuria, cardiovascular events, and ramipril study) was a clinical trial of ramipril among individuals with type 2 diabetes conducted in 16 countries between 1995 and 2001. Of the 4912 randomized participants, 3711 … were suitable for use in validation. Definitions of cardiovascular disease in DIABHYCAR were similar to those in ADVANCE. … Predictors considered were age at diagnosis of diabetes, duration of diagnosed diabetes, sex, … and randomized treatments (blood pressure lowering and glucose control regimens) (145). [Prognosis; Development; Validation]
We did a multicentre prospective validation study in adults and an observational study in children who presented with acute elbow injury to five emergency departments in southwest England, UK. As the diagnostic accuracy of the test had not been assessed in children, we did not think that an interventional study was justified in this group (146). [Diagnosis; Validation]
We conducted such large scale international validation of the ADO index to determine how well it predicts mortality for individual subjects with chronic obstructive pulmonary disease from diverse settings, and updated the index as needed. Investigators from 10 chronic obstructive pulmonary disease and population based cohort studies in Europe and the Americas agreed to collaborate in the International chronic obstructive pulmonary disease Cohorts Collaboration Working Group (147). [Prognosis; Validation; Updating]
Selection of predictors in a study of the development of a multivariable prediction model.
This prospective temporal validation study included all patients who were consecutively treated from March 2007 to June 2007 in 19 phase I trials at the Drug Development Unit, Royal Marsden Hospital (RMH), Sutton, United Kingdom. … [A]ll patients were prospectively observed until May 31, 2008 (177). [Prognosis; Validation]
All consecutive patients presenting with anterior chest pain (as a main or minor medical complaint) over a three to nine week period (median length, five weeks) from March to May 2001 were included. … Between October 2005 and July 2006, all attending patients with anterior chest pain (aged 35 years and over; n = 1249) were consecutively recruited to this study by 74 participating GPs in the state of Hesse, Germany. The recruitment period lasted 12 weeks for each practice (178). [Diagnosis; Development; Validation]
The derivation cohort was 397 consecutive patients aged 18 years or over of both sexes who were admitted to any of four internal medicine wards at Donostia Hospital between 1 May and 30 June 2008 and we used no other exclusion criteria. The following year between 1 May and 30 June 2009 we recruited the validation cohort on the same basis: 302 consecutive patients aged 18 or over of both sexes who were admitted to any of the same four internal medicine wards at the hospital (179). [Prognosis; Development]
We built on our previous risk prediction algorithm (QRISK1) to develop a revised algorithm … QRISK2. We conducted a prospective cohort study in a large UK primary care population using a similar method to our original analysis. We used version 19 of the QRESEARCH database (www.qresearch.org). This is a large validated primary care electronic database containing the health records of 11 million patients registered from 551 general practices (139). [Prognosis; Development; Validation]
Table 2. Example Table: Reporting Key Study Characteristics [Diagnosis; Development; Validation]
Table 3. Overview of Different Approaches for Updating an Existing Prediction Model
One hundred and ninety two patients with cutaneous lymphomas were evaluated at the Departments of Dermatology at the UMC Mannheim and the UMC Benjamin Franklin Berlin from 1987 to 2002. Eighty six patients were diagnosed as having cutaneous T cell lymphoma (CTCL) as defined by the European Organisation for Research and Treatment of Cancer classification of cutaneous lymphomas, including mycosis fungoides, Sezary Syndrome and rare variants. … Patients with the rare variants of CTCL, parapsoriasis, cutaneous pseudolymphomas and cutaneous B cell lymphomas were excluded from the study. … Staging classification was done by the TNM scheme of the mycosis fungoides Cooperative Group. A diagnosis of Sezary Syndrome was made in patients with erythroderma and >1000 Sezary cells/mm³ in the peripheral blood according to the criteria of the International Society for Cutaneous Lymphomas (ISCL) (193). [Prognosis; Development]
Inclusion criteria were age 12 years and above, and injury sustained within 7 days or fewer. The authors selected 12 as the cutoff age because the emergency department receives, in the main, patients 12 years and above while younger patients were seen at a neighboring children's hospital about half a mile down the road from our hospital. In this, we differed from the original work by Stiell, who excluded patients less than 18 years of age. Exclusion criteria were: pregnancy, altered mental state at the time of consultation, patients who had been referred with an x ray study, revisits, multiply traumatized patients, and patients with isolated skin injuries such as burns, abrasions, lacerations, and puncture wounds (194). [Diagnosis; Validation]
Data from the multi-centre, worldwide, clinical trial (Action in Diabetes and Vascular disease: preterax and diamicron-MR controlled evaluation) (ADVANCE) permit the derivation of new equations for cardiovascular risk prediction in people with diabetes. … ADVANCE was a factorial randomized controlled trial of blood pressure (perindopril indapamide versus placebo) and glucose control (gliclazide MR based intensive intervention versus standard care) on the incidence of microvascular and macrovascular events among 11,140 high risk individuals with type 2 diabetes, recruited from 215 centres across 20 countries in Asia, Australasia, Europe and Canada. … Predictors considered were age at diagnosis of diabetes, duration of diagnosed diabetes, sex, systolic blood pressure, diastolic blood pressure, mean arterial blood pressure, pulse pressure, total cholesterol, high-density lipoprotein and non high-density lipoprotein and triglycerides, body mass index, waist circumference, waist to hip ratio, blood pressure lowering medication (i.e. treated hypertension), statin use, current smoking, retinopathy, atrial fibrillation (past or present), logarithmically transformed urinary albumin/creatinine ratio (ACR) and serum creatinine (Scr), haemoglobin A1c (HbA1c), fasting blood glucose and randomized treatments (blood pressure lowering and glucose control regimens) (145). [Prognosis; Development; Validation]
Outcomes of interest were any death, coronary heart disease related death, and coronary heart disease events. To identify these outcomes, cohort participants were followed over time using a variety of methods, including annual telephone interviews, triennial field center examinations, surveillance at ARIC community hospitals, review of death certificates, physician questionnaires, coroner/medical examiner reports, and informant interviews. Follow up began at enrollment (1987 to 1989) and continued through December 31, 2000. Fatal coronary heart disease included hospitalized and nonhospitalized deaths associated with coronary heart disease. A coronary heart disease event was defined as hospitalized definite or probable myocardial infarction, fatal coronary heart disease, cardiac procedure (coronary artery bypass graft, coronary angioplasty), or the presence of serial electrocardiographic changes across triennial cohort examinations. Event classification has been described in detail elsewhere [ref] (210). [Prognosis; Development]
Definite urinary tract infection was defined as ≥10⁸ colony forming units (cfu) per litre of a single type of organism in a voided sample, ≥10⁷ cfu/L of a single organism in a catheter sample, or any growth of a single organism in a suprapubic bladder tap sample. Probable urinary tract infection was defined as ≥10⁷ cfu/L of a single organism in a voided sample, ≥10⁶ cfu/L of a single organism in a catheter sample, ≥10⁸ cfu/L of two organisms in a voided sample, or ≥10⁷ cfu/L of two organisms from a catheter sample (211). [Diagnosis; Development; Validation]
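The outcome definition in this example is a small decision rule, and writing it out as code shows how little ambiguity a well-reported definition leaves. The sketch below is illustrative only: it reads the thresholds as powers of ten (10⁸, 10⁷, and 10⁶ cfu/L), and the function name, argument names, and the "neither" label are hypothetical and do not come from the cited study.

```python
def classify_uti(sample_type, cfu_per_litre, n_organisms):
    """Classify a urine culture as 'definite', 'probable', or 'neither'.

    Thresholds follow the quoted definition, read as powers of ten.
    The 'neither' label is an assumption for cultures meeting no criterion.
    """
    if sample_type not in {"voided", "catheter", "suprapubic"}:
        raise ValueError(f"unknown sample type: {sample_type!r}")

    if n_organisms == 1:
        if sample_type == "suprapubic":
            # Any growth of a single organism counts as definite.
            return "definite" if cfu_per_litre > 0 else "neither"
        # Voided samples need a tenfold higher count than catheter samples.
        definite_cutoff = 10**8 if sample_type == "voided" else 10**7
        if cfu_per_litre >= definite_cutoff:
            return "definite"
        if cfu_per_litre >= definite_cutoff // 10:
            return "probable"
    elif n_organisms == 2:
        # Two organisms can qualify only as probable, never as definite.
        probable_cutoff = {"voided": 10**8, "catheter": 10**7}.get(sample_type)
        if probable_cutoff is not None and cfu_per_litre >= probable_cutoff:
            return "probable"
    return "neither"


print(classify_uti("voided", 5 * 10**7, 1))
print(classify_uti("catheter", 10**7, 2))
```

A reader who can translate a definition into a rule like this without guessing has been given enough detail to reproduce the outcome assessment.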
Patient charts and physician records were reviewed to determine clinical outcome. Patients generally were seen postoperatively at least every 3–4 months for the first year, semi annually for the second and third years, and annually thereafter. Follow up examinations included radiological imaging with computed tomography in all patients. In addition to physical examination with laboratory testing, intravenous pyelography, cystoscopy, urine cytology, urethral washings and bone scintigraphy were carried out if indicated. Local recurrence was defined as recurrence in the surgical bed, distant as recurrence at distant organs. Clinical outcomes were measured from the date of cystectomy to the date of first documented recurrence at computed tomography, the date of death, or the date of last follow up when the patient had not experienced disease recurrence (212). [Prognosis; Development]
Breast Cancer Ascertainment: Incident diagnoses of breast cancer were ascertained by self-report on biennial follow up questionnaires from 1997 to 2005. We learned of deaths from family members, the US Postal Service, and the National Death Index. We identified 1084 incident breast cancers, and 1007 (93%) were confirmed by medical record or by cancer registry data from 24 states in which 96% of participants resided at baseline (213). [Prognosis; Validation]
All probable cases of serious bacterial infection were reviewed by a final diagnosis committee composed of two specialist paediatricians (with experience in paediatric infectious disease and respiratory medicine) and, in cases of pneumonia, a radiologist. The presence or absence of bacterial infection [outcome] was decided blinded to clinical information [predictors under study] and based on consensus (211). [Diagnosis; Development; Validation]
Liver biopsies were obtained with an 18 gauge or larger needle with a minimum of 5 portal tracts and were routinely stained with hematoxylin-eosin and trichrome stains. Biopsies were interpreted according to the scoring schema developed by the METAVIR group by 2 expert liver pathologists … who were blinded to patient clinical characteristics and serum measurements. Thirty biopsies were scored by both pathologists, and interobserver agreement was calculated by use of κ statistics (223). [Diagnosis; Development; Validation]
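Interobserver agreement of the kind reported in this example is commonly summarized with Cohen's κ, which corrects the observed agreement for the agreement expected by chance. The sketch below computes the unweighted κ for two raters. The fibrosis stages are invented for illustration, and the cited study may have used a weighted variant, so this is an assumption about the general technique and not a reconstruction of their analysis.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Unweighted Cohen's kappa for two raters scoring the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(ratings_a)
    # Observed proportion of items on which the raters agree.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement from each rater's marginal category frequencies.
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical fibrosis stages (0 to 4) assigned to 10 biopsies by 2 pathologists.
pathologist_1 = [0, 1, 1, 2, 2, 3, 3, 4, 4, 0]
pathologist_2 = [0, 1, 2, 2, 2, 3, 4, 4, 4, 0]
kappa = cohen_kappa(pathologist_1, pathologist_2)
print(f"kappa = {kappa:.2f}")
```

For ordered scores such as fibrosis stages, a weighted κ that penalizes near misses less than distant disagreements is often preferred; reports should state which version was used.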
The primary outcome [acute myocardial infarction, coronary revascularization, or death of cardiac or unknown cause within 30 days] was ascertained by investigators blinded to the predictor variables. If a diagnosis could not be assigned, a cardiologist … reviewed all the clinical data and assigned an adjudicated outcome diagnosis. All positive and 10% of randomly selected negative outcomes were confirmed by a second coinvestigator blinded to the standardized data collection forms. Disagreements were resolved by consensus (224). [Prognosis; Development]
The following data were extracted for each patient: gender, aspartate aminotransferase in IU/L, alanine aminotransferase in IU/L, aspartate aminotransferase/alanine aminotransferase ratio, total bilirubin (mg/dl), albumin (g/dl), transferrin saturation (%), mean corpuscular volume (μm³), platelet count (×10³/mm³), and prothrombin time (s). … All laboratory tests were performed within 90 days before liver biopsy. In the case of repeated test, the results closest to the time of the biopsy were used. No data obtained after the biopsy were used (228). [Diagnosis; Development]
Forty three potential candidate variables in addition to age and gender were considered for inclusion in the AMI [acute myocardial infarction] mortality prediction rules. … These candidate variables were taken from a list of risk factors used to develop previous report cards in the California Hospital Outcomes Project and Pennsylvania Health Care Cost Containment Council AMI “report card” projects. Each of these comorbidities was created using appropriate ICD 9 codes from the 15 secondary diagnosis fields in OMID. The Ontario discharge data are based on ICD 9 codes rather than ICD 9 CM codes used in the U.S., so the U.S. codes were truncated. Some risk factors used in these two projects do not have an ICD 9 coding analog (e.g., infarct subtype, race) and therefore were not included in our analysis. The frequency of each of these 43 comorbidities was calculated, and any comorbidity with a prevalence of <1% was excluded from further analysis. Comorbidities that the authors felt were not clinically plausible predictors of AMI mortality were also excluded (185). [Prognosis; Development; Validation]
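The prevalence screen described in this example (dropping any candidate comorbidity present in fewer than 1% of patients) can be sketched in a few lines. The record layout, variable names, and counts below are hypothetical and serve only to show the filtering step; the further exclusion on clinical plausibility is a judgment call that code cannot reproduce.

```python
def candidate_prevalences(records, names):
    """Proportion of records in which each candidate predictor is present."""
    n = len(records)
    return {name: sum(1 for r in records if r.get(name)) / n for name in names}


def keep_common(prevalences, threshold=0.01):
    """Keep candidates whose prevalence is at least the threshold (1% here)."""
    return [name for name, p in prevalences.items() if p >= threshold]


# Hypothetical cohort of 1000 patients with two candidate comorbidities.
records = [{"diabetes": i < 150, "rare_condition": i < 5} for i in range(1000)]
prevalences = candidate_prevalences(records, ["diabetes", "rare_condition"])
retained = keep_common(prevalences)
print(prevalences)
print(retained)
```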
Each screening round consisted of two visits to an outpatient department separated by approximately 3 weeks. Participants filled out a questionnaire on demographics, cardiovascular and renal disease history, smoking status, and the use of oral antidiabetic, antihypertensive, and lipid lowering drugs. Information on drug use was completed with data from community pharmacies, including information on class of antihypertensive medication. … On the first and second visits, blood pressure was measured in the right arm every minute for 10 and 8 minutes, respectively, by an automatic Dinamap XL Model 9300 series device (Johnson & Johnson Medical Inc., Tampa, FL). For systolic and diastolic BP, the mean of the last two recordings from each of the 2 visit days of a screening round was used. Anthropometrical measurements were performed, and fasting blood samples were taken. Concentrations of total cholesterol and plasma glucose were measured using standard methods. Serum creatinine was measured by dry chemistry (Eastman Kodak, Rochester, NY), with intra assay coefficient of variation of 0.9% and interassay coefficient of variation of 2.9%. eGFR [estimated glomerular filtration rate] was estimated using the Modification of Diet in Renal Disease (MDRD) study equation, taking into account gender, age, race, and serum creatinine. In addition, participants collected urine for two consecutive periods of 24 hours. Urinary albumin concentration was determined by nephelometry (Dade Behring Diagnostic, Marburg, Germany), and UAE [urinary albumin excretion] was given as the mean of the two 24 hour urinary excretions. As a proxy for dietary sodium and protein intake, we used the 24 hour urinary excretion of sodium and urea, respectively (229). [Prognosis; Development]
A single investigator blinded to clinical data and echocardiographic measurements performed the quantitative magnetic resonance image analyses. [The aim was to specifically quantify the incremental diagnostic value of magnetic resonance beyond clinical data to include or exclude heart failure] (236). [Diagnosis; Development; Incremental value]
Blinded to [other] predictor variables and patient outcome [a combination of nonfatal and fatal cardiovascular disease and overall mortality within 30 days of chest pain onset], 2 board certified emergency physicians … classified all electrocardiograms [one of the specific predictors under study] with a structured standardized format … (224). [Prognosis; Development]
Investigators, blinded to both predictor variables and patient outcome, reviewed and classified all electrocardiograms in a structured format according to current standardized reporting guidelines. Two investigators blinded to the standardized data collection forms ascertained outcomes. The investigators were provided the results of all laboratory values, radiographic imaging, cardiac stress testing, and cardiac catheterization findings, as well as information obtained during the 30 day follow up phone call (237). [Diagnosis; Validation]
We estimated the sample size according to the precision of the sensitivity of the derived decision rule. As with previous decision rule studies, we prespecified 120 outcome events to derive a rule that is 100% sensitive with a lower 95% confidence limit of 97.0%. To have the greatest utility for practicing emergency physicians, we aimed to include at least 120 outcome events occurring outside the emergency department (in hospital or after emergency department discharge). Review of quality data from the Ottawa hospital indicated that 10% of patients who presented to the emergency department with chest pain would meet outcome criteria within 30 days. We estimated that half of these events would occur after hospital admission or emergency department discharge. The a priori sample size was estimated to be 2400 patients (224). [Diagnosis; Development]
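The arithmetic behind this example can be retraced in a few lines. The sketch below is illustrative only: it assumes the 97.0% limit is a two-sided exact (Clopper-Pearson) lower bound for 120 of 120 events detected, which the quoted passage does not state, and the function names are hypothetical.

```python
import math
from fractions import Fraction


def exact_lower_limit_all_detected(n_events, alpha=0.05):
    """Exact (Clopper-Pearson) lower limit for sensitivity when all
    n_events are detected; in that case it reduces to (alpha/2)**(1/n)."""
    return (alpha / 2) ** (1 / n_events)


def patients_needed(target_events, event_rate, fraction_counted):
    """Patients required so that the expected number of qualifying events
    reaches target_events. Fractions avoid floating-point rounding."""
    rate = Fraction(str(event_rate)) * Fraction(str(fraction_counted))
    return math.ceil(Fraction(target_events) / rate)


lower = exact_lower_limit_all_detected(120)
total = patients_needed(120, event_rate=0.10, fraction_counted=0.5)
print(f"lower 95% limit: {lower:.3f}; patients needed: {total}")
```

Under these assumptions, 120 events give a lower limit of about 97.0%, and a 10% event rate with half of the events occurring outside the emergency department gives 2400 patients, in line with the quoted figures.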
Our sample size calculation is based on our primary objective (i.e., to determine if preoperative coronary computed tomography angiography has additional predictive value beyond clinical variables). Of our two objectives, this objective requires the largest number of patients to ensure the stability of the prediction model. … On the basis of the VISION Pilot Study and a previous non-invasive cardiac testing study that we undertook in a similar surgical population, we expect a 6% event rate for major perioperative cardiac events in this study. Table 2 presents the various sample sizes needed to test four variables in a multivariable analysis based upon various event rates and the required number of events per variable. As the table indicates, if our event rate is 6% we will need 1000 patients to achieve stable estimates. If our event rate is 4%, we may need up to 1500 patients. We are targeting a sample size of 1500 patients but this may change depending on our event rate at 1000 patients (242). [Prognosis; Development]
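An events-per-variable calculation of this kind can be sketched as follows. The value of 15 events per variable is an assumption: it is not stated in the quoted passage, but it is the value consistent with the quoted numbers (four variables, 1000 patients at a 6% event rate, 1500 patients at 4%). The function name is hypothetical.

```python
import math
from fractions import Fraction


def patients_for_epv(n_variables, events_per_variable, event_rate):
    """Patients needed to observe n_variables * events_per_variable events
    at the expected event rate (exact arithmetic via Fraction)."""
    events = n_variables * events_per_variable
    return math.ceil(Fraction(events) / Fraction(str(event_rate)))


# Assumed 15 events per variable for 4 variables, i.e. 60 events in total.
for rate in (0.06, 0.04):
    print(rate, patients_for_epv(4, 15, rate))
```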
All available data on the database were used to maximise the power and generalisability of the results (243). [Diagnosis; Development]
We did not calculate formal sample size calculations because all the cohort studies are ongoing studies. Also, there are no generally accepted approaches to estimate the sample size requirements for derivation and validation studies of risk prediction models. Some have suggested having at least 10 events per candidate variable for the derivation of a model and at least 100 events for validation studies. Since many studies to develop and validate prediction models are small, a potential solution is to have large scale collaborations such as ours to derive stable estimates from regression models that are likely to generalize to other populations. Our sample and the number of events far exceed all approaches for determining sample sizes and therefore are expected to provide estimates that are very robust (147). [Prognosis; Validation]
We calculated the study sample size needed to validate the clinical prediction rule according to a requirement of 100 patients with the outcome of interest (any intra-abdominal injury present), which is supported by statistical estimates described previously for external validation of clinical prediction rules. In accordance with our previous work, we estimated the enrolled sample would have a prevalence rate of intra-abdominal injury of 10%, and thus the total needed sample size was calculated at 1,000 patients (244). [Diagnosis; Validation]
We assumed missing data occurred at random depending on the clinical variables and the results of computed tomography based coronary angiography and performed multiple imputations using chained equations. Missing values were predicted on the basis of all other predictors considered, the results of computed tomography based coronary angiography, as well as the outcome. We created 20 datasets with identical known information but with differences in imputed values reflecting the uncertainty associated with imputations. In total 667 (2%) clinical data items were imputed. In our study only a minority of patients underwent catheter based coronary angiography. An analysis restricted to patients who underwent catheter based coronary angiography could have been influenced by verification bias. Therefore we imputed data for catheter based coronary angiography by using the computed tomography based procedure as an auxiliary variable in addition to all other predictors. Results for the two procedures correlate well together, especially for negative results of computed tomography based coronary angiography. This strong correlation was confirmed in the 1609 patients who underwent both procedures (Pearson r = 0.72). Since its data were used for imputation, the computed tomography based procedure was not included as a predictor in the prediction models. Our approach was similar to using the results of computed tomography based coronary angiography as the outcome variable when the catheter based procedure was not performed (which was explored in a sensitivity analysis). However, this approach is more sophisticated because it also takes into account other predictors and the uncertainty surrounding the imputed values. We imputed 3615 (64%) outcome values for catheter based coronary angiography. Multiple imputations were performed using Stata/SE 11 (StataCorp) (256). [Diagnosis; Development]
If an outcome was missing, the patient data were excluded from the analysis. Multiple imputation was used to address missingness in our nonoutcome data and was performed with SAS callable IVEware (Survey Methodology Program, Survey Research Center, Institute for Social Research, University of Michigan, Ann Arbor, MI). Multiple imputation has been shown to be a valid and effective way of handling missing data and minimizes bias that may often result from excluding such patients. Additionally, multiple imputation remains valid even if the proportion of missing data is large. The variables included in the multiple imputation model were the 4 outcomes, age, sex, ICD-9 E codes, emergency department Glasgow coma score, out of hospital Glasgow coma score, Injury Severity Score, mechanism of trauma, and trauma team notification. Ten imputed data sets were created as part of the multiple imputation, and all areas under the receiver operating characteristic curve were combined across the 10 imputed data sets with a standard approach. Although there is no reported conventional approach to combining receiver operating characteristic curves from imputed data sets, we averaged the individual sensitivity and specificity data across the 10 imputed data sets and then plotted these points to generate the curves in our results (257). [Prognosis; Validation]
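Combining an estimate across imputed data sets is commonly done with Rubin's rules. The sketch below assumes that this is what the "standard approach" refers to, which the quoted passage does not confirm. The function name and the example numbers are hypothetical.

```python
from statistics import mean, variance


def pool_rubin(estimates, within_variances):
    """Pool one estimate per imputed data set with Rubin's rules.

    Returns the pooled estimate and its total variance, i.e. the mean
    within-imputation variance plus (1 + 1/m) times the
    between-imputation variance.
    """
    m = len(estimates)
    pooled = mean(estimates)
    within = mean(within_variances)
    between = variance(estimates)  # sample variance, denominator m - 1
    return pooled, within + (1 + 1 / m) * between


# Hypothetical AUCs and squared standard errors from 3 imputed data sets.
pooled_auc, total_var = pool_rubin([0.80, 0.82, 0.84], [0.0004, 0.0004, 0.0004])
print(f"pooled AUC {pooled_auc:.2f}, standard error {total_var ** 0.5:.4f}")
```

In practice, bounded measures such as the AUC are sometimes transformed (for example to the logit scale) before pooling; the sketch pools on the original scale for simplicity.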
We split the data into development (training) and validation (test) data sets. The development data included all operations within the first 5 years; the validation data included the rest. To ensure reliability of data, we excluded patients who had missing information on key predictors: age, gender, operation sequence, and number and position of implanted heart valves. In addition, patients were excluded from the development data if they were missing information on >3 of the remaining predictors. Any predictor recorded for <50% of patients in the development data was not included in the modeling process, resulting in the exclusion of left ventricular end diastolic pressure, pulmonary artery wedge pressure, aortic valve gradient, and active endocarditis. Patients were excluded from the validation data if they had missing information on any of the predictors in the risk model. To investigate whether exclusions of patients as a result of missing data had introduced any bias, we compared the key preoperative characteristics of patients excluded from the study with those included. Any remaining missing predictor values in the development data were imputed by use of multiple imputation techniques. Five different imputed data sets were created (258). [Prognosis; Development; Validation]
Table 4. Key Information to Report About Missing Data
For the continuous predictors age, glucose, and Hb [hemoglobin], a linear relationship with outcome was found to be a good approximation after assessment of nonlinearity using restricted cubic splines (262). [Prognosis]
Fractional polynomials were used to explore presence of nonlinear relationships of the continuous predictors of age, BMI [body mass index], and year to outcome (258). [Prognosis]
The nonlinear relationships between these predictor variables and lung cancer risk were estimated using restricted cubic splines. Splines for age, pack-years smoked, quit-time and smoking duration were prepared with knot placement based on the percentile distributions of these variables in smokers only. Knots for age were at 55, 60, 64, and 72 years. Knots for pack-years were at 3.25, 23.25 and 63 pack-years. Knots for quit-time were at 0, 15, and 35 years. Knots for duration were at 8, 28, and 45 years (263). [Prognosis]
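To make the spline construction concrete, the following sketch builds a restricted cubic spline basis from the quoted age knots (55, 60, 64, and 72 years). It uses the usual truncated-power form that is constrained to be linear beyond the outer knots. The function name is hypothetical, and statistical packages may rescale the nonlinear terms, so the columns need not match any particular software output.

```python
def rcs_basis(x, knots):
    """Restricted cubic spline basis for a single value x.

    Returns k - 1 terms for k knots: x itself plus k - 2 nonlinear terms
    constructed so that the fitted curve is linear beyond the outer knots.
    """
    t_last, t_prev = knots[-1], knots[-2]
    width = t_last - t_prev

    def cube(u):
        return max(u, 0.0) ** 3  # truncated cubic: zero below the knot

    terms = [x]
    for t_j in knots[:-2]:
        terms.append(
            cube(x - t_j)
            - cube(x - t_prev) * (t_last - t_j) / width
            + cube(x - t_last) * (t_prev - t_j) / width
        )
    return terms


AGE_KNOTS = [55, 60, 64, 72]  # knot locations quoted in the example
print(rcs_basis(62, AGE_KNOTS))
```

The resulting columns would be entered into the regression model in place of the single linear age term.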
We used the Cox proportional hazards model in the derivation dataset to estimate the coefficients associated with each potential risk factor [predictor] for the first ever recorded diagnosis of cardiovascular disease for men and women separately (278). [Prognosis]
All clinical and laboratory predictors were included in a multivariable logistic regression model (outcome: bacterial pneumonia) (279). [Diagnosis]
We chose risk factors based on prior meta-analyses and review; their ease of use in primary care settings; and whether a given risk factor was deemed modifiable or reversible by changing habits (i.e., smoking) or through therapeutic intervention; however, we were limited to factors that already had been used in the two baseline cohorts that constituted EPISEM (282). [Prognosis]
Candidate variables included all demographic, disease-related factors and patterns of care from each data source that have been shown to be a risk factor for mortality following an intensive care episode previously. Variables were initially selected following a review of the literature and consensus opinion by an expert group comprising an intensivist, general physician, intensive care trained nurse, epidemiologists, and a statistician. The identified set was reviewed and endorsed by 5 intensivists and a biostatistician who are familiar with the ANZICS APD (283). [Prognosis]
We selected 12 predictor variables for inclusion in our prediction rule from the larger set according to clinical relevance and the results of baseline descriptive statistics in our cohort of emergency department patients with symptomatic atrial fibrillation. Specifically, we reviewed the baseline characteristics of the patients who did and did not experience a 30-day adverse event and selected the 12 predictors for inclusion in the model from these 50 candidate predictors according to apparent differences in predictor representation between the 2 groups, clinical relevance, and sensibility. … [T]o limit colinearity and ensure a parsimonious model, Spearman's correlations were calculated between the clinically sensible associations within our 12 predictor variables. Specifically, Spearman's correlations were calculated between the following clinically sensible associations: (1) history of hypertension status and β-blocker and diuretic use, and (2) history of heart failure and β-blocker home use, diuretic home use, peripheral edema on physical examination, and dyspnea in the emergency department (284). [Prognosis]
We used multivariable logistic regression with backward stepwise selection with a P value greater than 0.05 for removal of variables, but we forced variables [predictors] that we considered to have great clinical relevance back into the model. We assessed additional risk factors [predictors] from clinical guidelines for possible additional effects (286). [Diagnosis]
Clinically meaningful interactions were included in the model. Their significance was tested as a group to avoid inflating type I error. All interaction terms were removed as a group, and the model was refit if results were nonsignificant. Specifically, interactions between home use of β-blockers and diuretics and between edema on physical examination and a history of heart failure were tested (284). [Prognosis]
We assessed internal validity with a bootstrapping procedure for a realistic estimate of the performance of both prediction models in similar future patients. We repeated the entire modeling process including variable selection … in 200 samples drawn with replacement from the original sample. We determined the performances of the selected prediction model and the simple rule that were developed from each bootstrap sample in the original sample. Performance measures included the average area under the ROC curve, sensitivity and specificity for both outcome measures, and computed tomography reduction at 100% sensitivity for neurosurgical interventions within each bootstrap sample (286). [Diagnosis]
To evaluate the performance of each prostate cancer risk calculation, we obtained the predicted probability for any prostate cancer and for aggressive prostate cancer for each patient from the PRC [Prostate Cancer Prevention Trial risk calculator] (http://deb.uthscsa.edu/URORiskCalc/Pages/uroriskcalc.jsp) and from the SRC [Sunnybrook nomogram–based prostate cancer risk calculator] (www.prostaterisk.ca) to evaluate each prediction model performance (306). [Diagnosis]
To calculate the HSI [Hepatic Steatosis Index], we used the formula given by Lee et al [ref] to calculate the probability of having hepatic steatosis as follows:
with presence of diabetes mellitus (DM) = 1; and absence of DM = 0. ALT and AST indicate alanine aminotransferase and aspartate aminotransferase, respectively (307). [Diagnosis]
Open source code to calculate the QCancer (Colorectal) scores is available from www.qcancer.org/colorectal/ released under the GNU Lesser General Public Licence, version 3 (308). [Prognosis]
Assessing performance of a Cox regression model.
We assessed the predictive performance of the QRISK2-2011 risk score on the THIN cohort by examining measures of calibration and discrimination. Calibration refers to how closely the predicted 10 year cardiovascular risk agrees with the observed 10 year cardiovascular risk. This was assessed for each 10th of predicted risk, ensuring 10 equally sized groups and each five year age band, by calculating the ratio of predicted to observed cardiovascular risk separately for men and for women. Calibration of the risk score predictions was assessed by plotting observed proportions versus predicted probabilities and by calculating the calibration slope.
Discrimination is the ability of the risk score to differentiate between patients who do and do not experience an event during the study period. This measure is quantified by calculating the area under the receiver operating characteristic curve statistic; a value of 0.5 represents chance and 1 represents perfect discrimination. We also calculated the D statistic and R2 statistic, which are measures of discrimination and explained variation, respectively, and are tailored towards censored survival data. Higher values for the D statistic indicate greater discrimination, where an increase of 0.1 over other risk scores is a good indicator of improved prognostic separation (117). [Prognosis; Validation]
First, we compared the abilities of the clinical decision rule and the general practitioner judgement in discriminating patients with the disease from patients without the disease, using receiver operating characteristic (ROC) curve analysis. An area under the ROC curve (AUC) of 0.5 indicates no discrimination, whereas an AUC of 1.0 indicates perfect discrimination. Then, we constructed a calibration plot to separately examine the agreement between the predicted probabilities of the decision rule with the observed outcome acute coronary syndrome and we constructed a similar calibration plot for the predicted probabilities of the general practitioner. Perfect predictions should lie on the 45-degree line for agreement with the outcome in the calibration plot (318). [Diagnosis; Development]
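For a binary outcome, the area under the ROC curve equals the proportion of all (diseased, nondiseased) pairs in which the diseased patient receives the higher predicted probability, with ties counted as one half. A minimal sketch of that calculation follows; the function name and data are hypothetical.

```python
def c_statistic(predicted, outcome):
    """Area under the ROC curve computed as the concordance probability
    over all pairs of one patient with and one without the outcome."""
    with_outcome = [p for p, y in zip(predicted, outcome) if y == 1]
    without_outcome = [p for p, y in zip(predicted, outcome) if y == 0]
    concordant = sum(
        (p > q) + 0.5 * (p == q) for p in with_outcome for q in without_outcome
    )
    return concordant / (len(with_outcome) * len(without_outcome))


# Hypothetical predictions: 3 of the 4 possible pairs are correctly ordered.
print(c_statistic([0.10, 0.40, 0.35, 0.80], [0, 0, 1, 1]))
```

The pairwise form is quadratic in the number of patients, which is acceptable for illustration; rank-based formulas give the same value more efficiently in large data sets.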
The accuracy of [the] internally validated and adjusted model was tested on the data of the validation set. The regression formula from the developed model was applied to all bakery workers of the validation set. The agreement between the predicted probabilities and the observed frequencies for sensitization (calibration) was evaluated graphically by plotting the predicted probabilities (x-axis) by the observed frequencies (y-axis) of the outcome. The association between predicted probabilities and observed frequencies can be described by a line with an intercept and a slope. An intercept of zero and a slope of one indicate perfect calibration. … The discrimination was assessed with the ROC area (319). [Diagnosis; Development]
We assessed the incremental prognostic value of biomarkers when added to the GRACE score by the likelihood ratio test. We used 3 complementary measures of discrimination improvement to assess the magnitude of the increase in model performance when individual biomarkers were added to GRACE: change in AUC (ΔAUC), integrated discrimination improvement (IDI), and continuous and categorical net reclassification improvement (NRI). To get a sense of clinical usefulness, we calculated the NRI (>0.02), which considers 2% as the minimum threshold for a meaningful change in predicted risk. Moreover, 2 categorical NRIs were applied with prespecified risk thresholds of 6% and 14%, chosen in accord with a previous study, or 5% and 12%, chosen in accord with the observed event rate in the present study. Categorical NRIs define upward and downward reclassification only if predicted risks move from one category to another. Since the number of biomarkers added to GRACE remained small (maximum of 2), the degree of overoptimism was likely to be small. Still, we reran the ΔAUC and IDI analyses using bootstrap internal validation and confirmed our results (338). [Prognosis; Incremental Value]
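The categorical NRI described here can be illustrated with a short sketch. It sums the net proportion of patients with the outcome who move to a higher risk category and the net proportion of patients without the outcome who move to a lower one. The thresholds of 6% and 14% are taken from the quoted passage; the function name, the example data, and the convention that a risk exactly at a threshold falls in the higher category are assumptions.

```python
from bisect import bisect_right


def categorical_nri(old_risk, new_risk, outcome, thresholds):
    """Categorical net reclassification improvement for two sets of
    predicted risks evaluated on the same patients."""
    up = {0: 0, 1: 0}
    down = {0: 0, 1: 0}
    total = {0: 0, 1: 0}
    for old, new, y in zip(old_risk, new_risk, outcome):
        # Category index = number of thresholds at or below the risk.
        move = bisect_right(thresholds, new) - bisect_right(thresholds, old)
        total[y] += 1
        up[y] += move > 0
        down[y] += move < 0
    nri_events = (up[1] - down[1]) / total[1]
    nri_nonevents = (down[0] - up[0]) / total[0]
    return nri_events + nri_nonevents


# Hypothetical risks before and after adding a marker to a base model.
old = [0.05, 0.10, 0.20, 0.10]
new = [0.10, 0.10, 0.10, 0.03]
print(categorical_nri(old, new, [1, 1, 0, 0], thresholds=[0.06, 0.14]))
```

Because the event and nonevent components are simply added, it is usually more informative to report them separately alongside the total.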
We used decision curve analysis (accounting for censored observations) to describe and compare the clinical effects of QRISK2-2011 and the NICE Framingham equation. A model is considered to have clinical value if it has the highest net benefit across the range of thresholds for which an individual would be designated at high risk. Briefly, the net benefit of a model is the difference between the proportion of true positives and the proportion of false positives weighted by the odds of the selected threshold for high risk designation. At any given threshold, the model with the higher net benefit is the preferred model (117). [Prognosis; Validation]
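The net benefit definition in this example translates directly into code. The sketch below handles a binary outcome only and does not account for censored observations, unlike the analysis quoted above. The function names and data are hypothetical.

```python
def net_benefit(predicted, outcome, threshold):
    """Net benefit of treating everyone with predicted risk >= threshold:
    true positive proportion minus false positive proportion weighted by
    the odds of the threshold."""
    n = len(outcome)
    flagged = [(p >= threshold, y) for p, y in zip(predicted, outcome)]
    true_pos = sum(1 for hit, y in flagged if hit and y == 1)
    false_pos = sum(1 for hit, y in flagged if hit and y == 0)
    return true_pos / n - false_pos / n * threshold / (1 - threshold)


def net_benefit_treat_all(outcome, threshold):
    """Reference strategy: designate every individual as high risk."""
    prevalence = sum(outcome) / len(outcome)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


risks, events = [0.1, 0.3, 0.6, 0.9], [0, 1, 0, 1]
print(net_benefit(risks, events, 0.2), net_benefit_treat_all(events, 0.2))
```

Repeating the calculation over a range of thresholds and plotting the results against the threshold gives the decision curve; the "treat none" strategy has a net benefit of zero throughout.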
The coefficients of the [original diagnostic] expert model are likely subject to overfitting, as there were 25 diagnostic indicators originally under examination, but only 36 vignettes. To quantify the amount of overfitting, we determine [in our validation dataset] the shrinkage factor by studying the calibration slope b when fitting the logistic regression model … :
logit (P (Y = 1)) = a + b * logit (p)
where [Y = 1 indicates pneumonia (outcome) presence in our validation set and] p is the vector of predicted probabilities. The slope b of the linear predictor defines the shrinkage factor. Well calibrated models have b ≈ 1. Thus, we recalibrate the coefficients of the genuine expert model by multiplying them with the shrinkage factor (shrinkage after estimation) (368). [Diagnosis; Model Updating; Logistic]
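The calibration model logit(P(Y = 1)) = a + b * logit(p) is an ordinary logistic regression with the logit of the predicted probability as the only covariate. The sketch below fits it by Newton-Raphson in plain Python; the function names and data are hypothetical, and in practice any standard logistic regression routine would serve.

```python
import math


def logit(p):
    return math.log(p / (1 - p))


def calibration_intercept_slope(predicted, outcome, max_iter=50, tol=1e-10):
    """Maximum likelihood estimates of a and b in
    logit(P(Y = 1)) = a + b * logit(p), fitted by Newton-Raphson."""
    x = [logit(p) for p in predicted]
    a, b = 0.0, 1.0  # start at perfect calibration
    for _ in range(max_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, outcome):
            mu = 1 / (1 + math.exp(-(a + b * xi)))
            w = mu * (1 - mu)
            g0 += yi - mu             # score for the intercept
            g1 += (yi - mu) * xi      # score for the slope
            h00 += w                  # information matrix entries
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        step_a = (h11 * g0 - h01 * g1) / det
        step_b = (h00 * g1 - h01 * g0) / det
        a, b = a + step_a, b + step_b
        if abs(step_a) < tol and abs(step_b) < tol:
            break
    return a, b


# Hypothetical overfitted model: predicted risks of 10% and 90% where the
# observed proportions are 30% and 70%.
predicted = [0.1] * 10 + [0.9] * 10
observed = [1] * 3 + [0] * 7 + [1] * 7 + [0] * 3
print(calibration_intercept_slope(predicted, observed))
```

In this constructed example the slope is well below 1 (about 0.39), the pattern expected when the original predictions are too extreme; a slope near 1 with an intercept near 0 would indicate good calibration.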
In this study, we adopted the [model updating] approach of “validation by calibration” proposed by Van Houwelingen. For each risk category, a Weibull proportional hazards model was fitted using the overall survival values predicted by the [original] UISS prediction model. These expected curves were plotted against the observed Kaplan-Meier curves, and possible differences were assessed by a “calibration model,” which evaluated how much the original prognostic score was valid on the new data by testing 3 different parameters (α, β, and γ). If the joint null hypothesis on α = 0, β = −1, and γ = 1 was rejected (i.e., if discrepancies were found between observed and expected curves), estimates of the calibration model were used to recalibrate predicted probabilities. Note that recalibration does not affect the model's discrimination accuracy. Specific details of this approach are reported in the articles by Van Houwelingen and Miceli et al (369). [Prognosis; Model Updating; Survival]
Results of the external validation prompted us to update the models. We adjusted the intercept and regression coefficients of the prediction models to the Irish setting. The most important difference with the Dutch setting is the lower Hb cutoff level for donation, which affects the outcome and the breakpoint in the piecewise linear function for the predictor previous Hb level. Two methods were applied for updating: recalibration of the model and model revision. Recalibration included adjustment of the intercept and adjustment of the individual regression coefficients with the same factor, that is, the calibration slope. For the revised models, individual regression coefficients were separately adjusted. This was done by adding the predictors to the recalibrated model in a step forward manner and testing with a likelihood ratio test (p < 0.05) if they had added value. If so, the regression coefficient for that predictor was adjusted further (370). [Diagnosis; Model Updating; Logistic]
Once a final model was defined, patients were divided into risk groups in 2 ways: 3 groups according to low, medium, and high risk (placing cut points at the 25th and 75th percentiles of the model's risk score distribution); and 10 groups, using Cox's cut points. The latter minimize the loss of information for a given number of groups. Because the use of 3 risk groups is familiar in the clinical setting, the 3-group paradigm is used hereafter to characterize the model (374). [Prognosis; Development; Validation]
One of the goals of this model was to develop an easily accessible method for the clinician to stratify risk of patients preparing to undergo head and neck cancer surgery. To this end, we defined 3 categories of transfusion risk: low (≤15%), intermediate (15%-24%) and high (≥25%) (375). [Prognosis; Validation]
Patients were identified as high risk if their 10 year predicted cardiovascular disease risk was ≥20%, as per the guidelines set out by NICE (117). [Prognosis; Validation]
Three risk groups were identified on the basis of PI [prognostic index] distribution tertiles. The low-risk subgroup (first tertile, PI ≤ 8.97) had event-free survival (EFS) rates at 5 and 10 years of 100 and 89% (95% CI, 60–97%), respectively. The intermediate-risk subgroup (second tertile, 8.97 < PI ≤ 10.06) had EFS rates at 5 and 10 years of 95% (95% CI, 85–98%) and 83% (95% CI, 64–93%), respectively. The high-risk group (third tertile, PI > 10.06) had EFS rates at 5 and 10 years of 85% (95% CI, 72–92%) and 44% (95% CI, 24–63%), respectively (376). [Prognosis; Development]
Finally, a diagnostic rule was derived from the shrunken, rounded, multivariable coefficients to estimate the probability of heart failure presence, ranging from 0% to 100%. Score thresholds for ruling in and ruling out heart failure were introduced based on clinically acceptable probabilities of false-positive (20% and 30%) and false-negative (10% and 20%) diagnoses (377). [Diagnosis; Development; Validation]
… the summed GRACE risk score corresponds to an estimated probability of all-cause mortality from hospital discharge to 6 months. … [I]ts validity beyond 6 months has not been established. In this study, we examined whether this GRACE risk score calculated at hospital discharge would predict longer term (up to 4 years) mortality in a separate registry cohort … (379). [Prognosis; Different outcome]
The Wells rule was based on data obtained from referred patients suspected of having deep vein thrombosis who attended secondary care outpatient clinics. Although it is often argued that secondary care outpatients are similar to primary care patients, differences may exist because of the referral mechanism of primary care physicians. The true diagnostic or discriminative accuracy of the Wells rule has never been formally validated in primary care patients in whom DVT is suspected. A validation study is needed because the performance of any diagnostic or prognostic prediction rule tends to be lower than expected from data in the original study when it is applied to new patients, particularly when these patients are selected from other settings. We sought to quantify the diagnostic performance of the Wells rule in primary care patients and compare it with the results reported in the original studies by Wells and colleagues (188). [Diagnosis; Different setting]
When definitions of variables were not identical across the different studies (for example physical activity), we tried to use the best available variables to achieve reasonable consistency across databases. For example, in NHANES, we classified participants as “physically active” if they answered “more active” to the question, “Compare your activity with others of the same age.” Otherwise, we classified participants as “not physically active.” In ARIC, physical activity was assessed in a question with a response of “yes” or “no”, whereas in CHS, we dichotomized the physical activity question into “no” or “low” versus “moderate” or “high” (380). [Prognosis; Different predictors]
As the NWAHS did not collect data on use of antihypertensive medications, we assumed no participants were taking antihypertensive medications. Similarly, as the BMES did not collect data on a history of high blood glucose level, we assumed that no participants had such a history (381). [Prognosis; Different predictors]
Example figure: participant flow diagram.
Reprinted from reference 390, with permission from Elsevier.
Reproduced from reference 377 with permission. NT-proBNP = N-terminal pro-brain natriuretic peptide.
We calculated the 10 year estimated risk of cardiovascular disease for every patient in the THIN cohort using the QRISK2-2011 risk score … and 292 928 patients (14.1%) were followed up for 10 years or more (117). [Prognosis; Validation]
At time of analysis, 204 patients (66%) had died. The median follow-up for the surviving patients was 12 (range 1-84) months (391). [Prognosis; Development]
Median follow-up was computed according to the “reverse Kaplan Meier” method, which calculates potential follow-up in the same way as the Kaplan–Meier estimate of the survival function, but with the meaning of the status indicator reversed. Thus, death censors the true but unknown observation time of an individual, and censoring is an end-point (Schemper & Smith, 1996) (392). [Prognosis; Development]
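The reverse Kaplan-Meier calculation can be sketched in a few lines: the Kaplan-Meier estimator is applied with end of follow-up while alive treated as the event and death treated as censoring. The function name and data are hypothetical, and the sketch returns the first time at which the estimated curve reaches 0.5 or below.

```python
def reverse_km_median_followup(times, died):
    """Median potential follow-up by the reverse Kaplan-Meier method.

    times: observed follow-up time per patient.
    died:  1 if the patient died at that time, 0 if censored (alive).
    Returns None if the reversed curve never drops to 0.5.
    """
    observations = list(zip(times, died))
    curve = 1.0
    for t in sorted(set(times)):
        at_risk = sum(1 for ti, _ in observations if ti >= t)
        # Reversed status: being censored (still alive) counts as the event.
        ended_alive = sum(1 for ti, di in observations if ti == t and not di)
        curve *= 1 - ended_alive / at_risk
        if curve <= 0.5:
            return t
    return None


# Hypothetical follow-up times in months; deaths occurred at 2 and 6.
print(reverse_km_median_followup([2, 4, 6, 8, 10], [1, 0, 1, 0, 0]))
```

In this constructed example the simple median of follow-up among survivors is also 8 months, but the two measures generally differ because the reverse method also uses the time contributed by patients who died.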
Table 5. Example Table: Participant Characteristics
Table 6. Example Table: Participant Characteristics
Table 7. Example Table: Comparison of Participant Characteristics in Development and Validation Data [Development; Validation]
Table 8. Example Table: Comparison of Participant Characteristics in Development and Validation Data [Validation]
Table 9. Example Table: Reporting the Sample Size and Number of Events for Multiple Models
Table 10. Example Table: Reporting the Number of Events in Each Unadjusted Analysis
Table 11. Example Table: Unadjusted Association Between Each Predictor and Outcome
Table 12. Example Table: Presenting the Full Prognostic (Survival) Model, Including the Baseline Survival, for a Specific Time Point
Table 13. Example Table: Presenting the Full Diagnostic (Logistic) Model, Including the Intercept
Table 14. Example Table: Presenting Both the Original and Updated Prediction Model
Table 15. Example Table: Presenting a Full Model, Including Baseline Survival for a Specific Time Point Combined With a Hypothetical Individual to Illustrate How the Model Yields an Individualized Prediction
Table 16. Example Table: A Simple Scoring System From Which Individual Outcome Risks (Probabilities) Can Be Obtained*
Table 17. Example Table: Providing Full Detail to Calculate a Predicted Probability in an Individual
Example figure: a scoring system combined with a figure to obtain predicted probabilities for each score in an individual.
Reproduced from reference 408, with permission from BMJ Publishing Group.
Example figure: a graphical scoring system to obtain a predicted probability in an individual.
For ease of use at the point of care, we developed a simple prognostic model. For this model, we included the strongest predictors with the same quadratic and cubic terms as used in the full model, adjusting for tranexamic acid. We presented the prognostic model as a chart that cross tabulates these predictors with each of them recoded in several categories. We made the categories by considering clinical and statistical criteria. In each cell of the chart, we estimated the risk for a person with values of each predictor at the mid-point of the predictor's range for that cell. We then coloured the cells of the chart in four groups according to ranges of the probability of death: <6%, 6-20%, 21-50%, and >50%. We decided these cut-offs by considering feedback from the potential users of the simple prognostic model and by looking at previous publications. GCS = Glasgow Coma Scale. Reproduced from reference 123, with permission from BMJ Publishing Group.
Example figure: a nomogram, and how to use it to obtain a predicted probability in an individual.
Nomogram for prediction of positive lymph nodes among patients who underwent a standard pelvic lymph node dissection. Instructions: Locate the patient's pretreatment prostate-specific antigen (PSA) on the initial PSA (IPSA) axis. Draw a line straight upward to the Points axis to determine how many points toward the probability of positive lymph nodes the patient receives for his PSA. Repeat the process for each variable. Sum the points achieved for each of the predictors. Locate the final sum on the Total Points axis. Draw a line straight down to find the patient's probability of having positive lymph nodes. ECOG = Eastern Cooperative Oncology Group; CRP = C-reactive protein; Hb = hemoglobin; LDH = lactate dehydrogenase; PS = performance status. Reprinted from reference 410, with permission from Elsevier.
Example figure: a calibration plot with c-statistic and distribution of the predicted probabilities for individuals with and without the outcome (coronary artery disease).
Reproduced from reference 256, with permission from BMJ Publishing Group.
Example figure: a receiver-operating characteristic curve, with predicted risks labelled on the curve.
Receiver operating characteristic curve for risk of pneumonia … Sensitivity and specificity of several risk thresholds of the prediction model are plotted. Reproduced from reference 416, with permission from BMJ Publishing Group.
Example figure: a decision curve analysis.
The figure displays the net benefit curves for QRISK2-2011, QRISK2-2008, and the NICE Framingham equation for people aged between 35 and 74 years. At the traditional threshold of 20% used to designate an individual at high risk of developing cardiovascular disease, the net benefit of QRISK2-2011 for men is that the model identified five more cases per 1000 without increasing the number treated unnecessarily when compared with the NICE Framingham equation. For women the net benefit of using QRISK2-2011 at a 20% threshold identified two more cases per 1000 compared with not using any model (or the NICE Framingham equation). There seems to be no net benefit in using the 20% threshold for the NICE Framingham equation for identifying women who are at an increased risk of developing cardiovascular disease over the next 10 years. NICE = National Institute for Health and Care Excellence. Reproduced from reference 117, with permission from BMJ Publishing Group.
Table 18. Example of a Reclassification Table (With Net Reclassification Improvement and 95% CI) for a Basic and Extended Diagnostic Model Using a Single Probability Threshold*
For the recalibrated models, all regression coefficients were multiplied by the slope of the calibration model (0.65 for men and 0.63 for women). The intercept was adjusted by multiplying the original value by the calibration slope and adding the accompanying intercept of the calibration model (−0.66 for men and −0.36 for women). To derive the revised models, regression coefficients of predictors that had added value in the recalibrated model were further adjusted. For men, regression coefficients were further adjusted for the predictors deferral at the previous visit, time since the previous visit, delta Hb, and seasonality. For women, regression coefficients were further adjusted for deferral at the previous visit and delta Hb … available as supporting information in the online version of this paper, for the exact formulas of the recalibrated and revised models to calculate the risk of Hb deferral) (370). [Diagnostic; Model Updating; Logistic]
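The recalibration step described in this example can be sketched as follows. This is a minimal illustration, assuming a logistic model stored as an intercept plus a dictionary of coefficients. The calibration slope (0.65) and intercept (−0.66) are the values quoted for men. The original intercept, the coefficients, and the predictor names are hypothetical placeholders, not the published model.

```python
import math


def recalibrate(intercept, coefs, cal_intercept, cal_slope):
    """Multiply all coefficients by the calibration slope; adjust the intercept
    by multiplying it by the slope and adding the calibration intercept."""
    new_intercept = cal_slope * intercept + cal_intercept
    new_coefs = {name: cal_slope * beta for name, beta in coefs.items()}
    return new_intercept, new_coefs


def predict(intercept, coefs, x):
    """Predicted probability from a logistic model."""
    linear_predictor = intercept + sum(coefs[name] * x[name] for name in coefs)
    return 1.0 / (1.0 + math.exp(-linear_predictor))


# Hypothetical original model (illustrative values only).
orig_intercept = -2.0
orig_coefs = {"delta_hb": 0.8, "previous_deferral": 1.5}

# Calibration slope and intercept quoted for men in the example above.
new_intercept, new_coefs = recalibrate(orig_intercept, orig_coefs,
                                       cal_intercept=-0.66, cal_slope=0.65)

donor = {"delta_hb": 1.0, "previous_deferral": 1.0}  # hypothetical donor
print(predict(orig_intercept, orig_coefs, donor))
print(predict(new_intercept, new_coefs, donor))
```

The recalibrated linear predictor equals the calibration intercept plus the calibration slope times the original linear predictor, so the ranking of individuals (and hence discrimination) is unchanged. Only the revised models, in which selected coefficients are re-estimated, can alter the ranking.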
The mis-calibration of Approach 1 indicated the need for re-calibration and we obtained a uniform shrinkage factor when we fitted logit(P(Y = 1)) = a + b*logit(p) in Approach 2. We obtained the estimates a = −1.20 and b = 0.11, indicating heavy shrinkage (368). [Diagnostic; Model Updating; Logistic]
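To see what "heavy shrinkage" means in practice, the quoted recalibration equation can be applied to a few original predictions. The sketch below uses the reported estimates a = −1.20 and b = 0.11. The input probabilities are hypothetical and chosen only for illustration.

```python
import math


def logit(p):
    return math.log(p / (1.0 - p))


def expit(z):
    return 1.0 / (1.0 + math.exp(-z))


def recalibrated_probability(p, a=-1.20, b=0.11):
    """Apply logit(P(Y = 1)) = a + b * logit(p) to an original prediction p."""
    return expit(a + b * logit(p))


# Hypothetical original predicted probabilities.
for p in (0.10, 0.50, 0.90):
    print(p, round(recalibrated_probability(p), 3))
```

With a slope as small as 0.11, original predictions spanning 0.10 to 0.90 are compressed into a narrow band of roughly 0.19 to 0.28, which is why the authors describe the shrinkage as heavy.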
Results of the performance of the original clinical prediction model compared with that of different models extended with genetic variables selected by the lasso method are presented in Table 3. Likelihood ratio tests were performed to test the goodness of fit between the two models. The AUC of the original clinical model was 0.856. Addition of TLR4 SNPs [single-nucleotide polymorphisms] to the clinical model resulted in a slightly decreased AUC. Addition of TLR9-1237 to the clinical model slightly increased the AUC to 0.861, though this was not significant (p = 0.570). NOD2 SNPs did not improve the clinical model (423). [Prognostic; Model Updating; Logistic]
The most important limitation of the model for predicting a prolonged ICU stay is its complexity. We believe this complexity reflects the large number of factors that determine a prolonged ICU stay. This complexity essentially mandates the use of automated data collection and calculation. Currently, the infrequent availability of advanced health information technology in most hospitals represents a major barrier to the model's widespread use. As more institutions incorporate electronic medical records into their process flow, models such as the one described here can be of great value.
Our results have several additional limitations. First, the model's usefulness is probably limited to the U.S. because of international differences that impact ICU stay. These differences in ICU stay are also likely to adversely impact the use of ICU day 5 as a threshold for concern about a prolonged stay. Second, while capturing physiologic information on day 1 is too soon to account for the impact of complications and response to therapy, day 5 may still be too early to account for their effects. Previous studies indicate that more than half of the complications of ICU care occur after ICU day 5. Third, despite its complexity, the model fails to account for additional factors known to influence ICU stay. These include nosocomial infection, do not resuscitate orders, ICU physician staffing, ICU acquired paralysis, and ICU sedation practices. Fourth, the model's greatest inaccuracy is the under-prediction of remaining ICU stays of 2 days or less. We speculate that these findings might be explained by discharge delays aimed at avoiding night or weekend transfers or the frequency of complications on ICU days 6 to 8 (424). [Prognosis; Development; Validation]
This paper has several limitations. First, it represents assessments of resident performance at 1 program in a single specialty. In addition, our program only looks at a small range of the entire population of US medical students. The reproducibility of our findings in other settings and programs is unknown. Second, we used subjective, global assessments in conjunction with summative evaluations to assess resident performance. Although our interrater reliability was high, there is no gold standard for clinical assessment, and the best method of assessing clinical performance remains controversial. Lastly, r² = 0.22 for our regression analysis shows that much of the variance in mean performance ratings is unexplained. This may be due to limited information in residency applications in such critical areas as leadership skills, teamwork, and professionalism (425). [Prognosis; Development]
The ABCD2 score was a combined effort by teams led by Johnston and Rothwell, who merged two separate datasets to derive high-risk clinical findings for subsequent stroke. Rothwell's dataset was small, was derived from patients who had been referred by primary care physicians and used predictor variables scored by a neurologist one to three days later. Johnston's dataset was derived from a retrospective study involving patients in California who had a transient ischemic attack.
Subsequent studies evaluating the ABCD2 score have been either retrospective studies or studies using information from databases. Ong and colleagues found a sensitivity of 96.6% for stroke within seven days when using a score of more than two to determine high risk, yet 83.6% of patients were classified as high-risk. Fothergill and coworkers retrospectively analyzed a registry of 284 patients and found that a cutoff score of less than 4 missed 4 out of 36 strokes within 7 days. Asimos and colleagues retrospectively calculated the ABCD2 score from an existing database, but they were unable to calculate the score for 37% of patients, including 154 of the 373 patients who had subsequent strokes within 7 days. Sheehan and colleagues found that the ABCD2 score discriminated well between patients who had a transient ischemic attack or minor stroke versus patients with transient neurologic symptoms resulting from other conditions, but they did not assess the score's predictive accuracy for subsequent stroke. Tsivgoulis and coworkers supported using an ABCD2 score of more than 2 as the cutoff for high risk based on the results of a small prospective study of patients who had a transient ischemic attack and were admitted to hospital. The systematic review by Giles and Rothwell found a pooled AUC of 0.72 (95% CI 0.63–0.82) for all studies meeting their search criteria, and an AUC of 0.69 (95% CI 0.64–0.74) after excluding the original derivation studies. The AUC in our study is at the low end of the confidence band of these results, approaching 0.5 (434). [Prognosis]
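Two of the counts quoted above translate directly into proportions. The short sketch below shows the arithmetic, assuming that a "missed" stroke is one occurring in a patient scored below the cutoff. The variable names are illustrative.

```python
# Fothergill and coworkers: a cutoff score of less than 4 missed 4 of 36 strokes.
strokes_total = 36
strokes_missed = 4
sensitivity = (strokes_total - strokes_missed) / strokes_total

# Asimos and colleagues: the score could not be calculated for 154 of the
# 373 patients who had a subsequent stroke within 7 days.
strokes_in_database = 373
strokes_unscored = 154
fraction_unscored = strokes_unscored / strokes_in_database

print(f"sensitivity at cutoff: {sensitivity:.1%}")
print(f"strokes without a score: {fraction_unscored:.1%}")
```

Under that assumption, the cutoff identifies about 89% of strokes, and about 41% of the patients with stroke in the Asimos database could not be scored.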
Our models rely on demographic data and laboratory markers of CKD [chronic kidney disease] severity to predict the risk of future kidney failure. Similar to previous investigators from Kaiser Permanente and the RENAAL study group, we find that a lower estimated GFR [glomerular filtration rate], higher albuminuria, younger age, and male sex predict faster progression to kidney failure. In addition, a lower serum albumin, calcium, and bicarbonate, and a higher serum phosphate also predict a higher risk of kidney failure and add to the predictive ability of estimated GFR and albuminuria. These markers may enable a better estimate of measured GFR or they may reflect disorders of tubular function or underlying processes of inflammation or malnutrition.
Although these laboratory markers have also previously been associated with progression of CKD, our work integrates them all into a single risk equation (risk calculator and Table 5, and smartphone app, available at www.qxmd.com/Kidney-Failure-Risk-Equation). In addition, we demonstrate no improvement in model performance with the addition of variables obtained from the history (diabetes and hypertension status) and the physical examination (systolic blood pressure, diastolic blood pressure, and body weight). Although these other variables are clearly important for diagnosis and management of CKD, the lack of improvement in model performance may reflect the high prevalence of these conditions in CKD and imprecision with respect to disease severity after having already accounted for estimated GFR and albuminuria (261). [Prognosis; Development; Validation]
The likelihood of influenza depends on the baseline probability of influenza in the community, the results of the clinical examination, and, optionally, the results of point of care tests for influenza. We determined the probability of influenza during each season based on data from the Centers for Disease Control and Prevention. A recent systematic review found that point of care tests are approximately 72% sensitive and 96% accurate for seasonal influenza. Using these data for seasonal probability and test accuracy, the likelihood ratios for flu score 1, a no-test/test threshold of 10% and test/treat threshold of 50%, we have summarized a suggested approach to the evaluation of patients with suspected influenza in Table 5. Physicians wishing to limit use of anti-influenza drugs should consider rapid testing even in patients who are at high risk during peak flu season. Empiric therapy might be considered for patients at high risk of complications (181). [Diagnosis; Development; Validation; Implications for Clinical Use]
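The threshold approach in this example combines a pretest probability with a likelihood ratio and compares the result with the two decision thresholds. The sketch below is a minimal illustration of that logic. The 10% and 50% thresholds come from the quoted passage. The pretest probability and likelihood ratios are hypothetical, not the published values for the flu score.

```python
def posttest_probability(pretest, likelihood_ratio):
    """Convert probability to odds, multiply by the likelihood ratio, convert back."""
    odds = pretest / (1.0 - pretest) * likelihood_ratio
    return odds / (1.0 + odds)


def suggested_action(probability, test_threshold=0.10, treat_threshold=0.50):
    """Map a probability of influenza onto the no-test / test / treat zones."""
    if probability < test_threshold:
        return "no test, no treatment"
    if probability >= treat_threshold:
        return "treat empirically"
    return "perform point of care test"


# Hypothetical seasonal pretest probability and likelihood ratios for three
# risk groups defined by a clinical score.
pretest = 0.20
for label, lr in (("low risk", 0.25), ("moderate risk", 1.5), ("high risk", 5.0)):
    p = posttest_probability(pretest, lr)
    print(label, round(p, 3), suggested_action(p))
```

The same function can be applied a second time, using the likelihood ratio of the point of care test result, for patients who fall in the testing zone.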
To further appreciate these results, a few issues need to be addressed. First, although outpatients were included in the trial from which the data originated, for these analyses we deliberately restricted the study population to inpatients, because the PONV [postoperative nausea and vomiting] incidence in outpatients was substantially less frequent (34%) and because different types of surgery were performed (e.g. no abdominal surgery). Accordingly, our results should primarily be generalized to inpatients. It should be noted that, currently, no rules are available that were derived on both inpatients and outpatients. This is still a subject for future research, particularly given the increase of ambulatory surgery (437). [Prognosis; Incremental Value; Implications for Clinical Use]
Our study had several limitations that should be acknowledged. We combined data from 2 different populations with somewhat different inclusion criteria, although the resulting dataset has the advantage of greater generalizability because it includes patients from 2 countries during 2 different flu seasons and has an overall pretest probability typical of that for influenza season. Also, data collection was limited to adults, so it is not clear whether these findings would apply to younger patients. Although simple, the point scoring may be too complex to remember and would be aided by programming as an application for smart phones and/or the Internet (181). [Diagnosis; Development; Validation; Limitations; Implications for Research]
The design and methods of the RISK-PCI trial have been previously published [ref]. Briefly, the RISK-PCI is an observational, longitudinal, cohort, single-center trial specifically designed to generate and validate an accurate risk model to predict major adverse cardiac events after contemporary pPCI [primary percutaneous coronary intervention] in patients pretreated with 600 mg clopidogrel. Patients were recruited between February 2006 and December 2009. Informed consent was obtained from each patient. The study protocol conforms to the ethical guidelines of the Declaration of Helsinki. It was approved by a local research ethics committee and registered in the Current Controlled Trials Register—ISRCTN83474650—(www.controlled-trials.com/ISRCTN83474650) (443). [Prognosis; Development]
User-friendly calculators for the Reynolds Risk Scores for men and women can be freely accessed at www.reynoldsriskscore.org (444). [Prognosis; Incremental Value]