This chapter reviews various data sources available for observational research, examines their strengths and limitations, and identifies opportunities for further improving observational databases by integrating data collected from different sources. The focus is on observational research using existing databases.
Clinical registries have an important role in observational research. In this section, we first review an array of the available clinical registries that are closely or broadly relevant to cardiovascular disease (Table 11–1). We then summarize the strengths and limitations of clinical registries for observational study.
TABLE 11–1
Evolution of cardiovascular clinical registries
Get With The Guidelines® (GWTG) is a national cardiovascular disease registry and quality improvement program developed by the American Heart Association (AHA) and American Stroke Association. The GWTG registry includes three modules—one each for coronary artery disease, heart failure, and stroke (1)—and one outpatient program (2). GWTG uses a Web-based patient management tool (Outcomes, Cambridge, MA) to abstract detailed clinical data, including patient demographics, medical history, symptoms on arrival, in-hospital treatment and events, contraindications to medications, laboratory findings, discharge treatment and counseling, and patient disposition from more than 381,000 patients in the heart failure module and more than 1,000,000 patients in the stroke module. In 2008, the coronary disease module was merged with the American College of Cardiology (ACC) Acute Coronary Treatment and Intervention Outcomes Network (ACTION®) registry to form the National Cardiovascular Data Registry (NCDR) ACTION Registry–GWTG, which has become the most comprehensive data source for acute coronary syndromes in the United States (3). In 2011, the GWTG outpatient program was expanded to become The Guideline Advantage™, in collaboration with the American Cancer Society and American Diabetes Association (2).
The National Cardiovascular Data Registry (NCDR) is a comprehensive, outcomes-based quality improvement program operated by the ACC since 1997 (4). More than 2,200 hospitals across the United States participate in the NCDR. The NCDR includes the outpatient Practice INNovation And CLinical Excellence (PINNACLE) registry, with more than 1.5 million records, and various hospital-based registries, including ACTION Registry–GWTG, for patients with high-risk myocardial infarction; CathPCI, for cardiac catheterizations and percutaneous coronary intervention procedures; the ICD Registry, for tracking implantable cardioverter defibrillator procedures; the CARE Registry, for carotid artery stenting and endarterectomy; and the IMPACT registry, for pediatric and adult patients with congenital heart disease who are undergoing diagnostic catheterization and interventions.
Can Rapid risk stratification of Unstable patients with angina Suppress ADverse outcomes with Early implementation of the ACC/AHA guidelines? (CRUSADE) is a clinical registry of more than 200,000 patients with acute coronary syndromes at 600 hospitals in the United States (5). Data collected include patient risk factors, presenting symptoms, use of medications/invasive procedures, and in-hospital clinical outcomes focused on the ACC/AHA guidelines.
The Society of Thoracic Surgeons (STS) National Database is the largest cardiothoracic surgery quality improvement and outcomes database in the world (6). It provides clinical details and outcomes for more than 4.3 million coronary artery bypass graft, valvular, and other cardiac procedures from 1,014 participating sites in the United States.
The Acute Decompensated Heart failurE national REgistry (ADHERE) is an industry-sponsored, multicenter, national registry evaluating the characteristics, management, and clinical outcomes of patients hospitalized with acute decompensated heart failure (7). More than 150,000 patients from 300 community and academic centers in the United States were enrolled between 2001 and 2006.
The Organized Program To Initiate life-saving treatMent In hospitaliZEd patients with Heart Failure (OPTIMIZE-HF) is a national registry designed to evaluate and improve guideline adherence among hospitalized patients with heart failure (8). Unlike ADHERE and many other clinical registries that report only in-hospital events, OPTIMIZE-HF captures outcomes data at 60 and 90 days after discharge for a selected subset of patients (10%). The OPTIMIZE-HF project transitioned to GWTG-HF in 2005.
The Prospective Registry Evaluating Myocardial Infarction: Events and Recovery (PREMIER) (9) and Translational Research Investigating Underlying disparities in acute Myocardial infarction Patients’ Health status (TRIUMPH) (10) are prospective multicenter registries of 7,000 patients with acute myocardial infarction (2,500 in PREMIER and 4,500 in TRIUMPH). A distinct feature of PREMIER and TRIUMPH is that patient-centered outcome measures, including symptoms, functional status, and quality of life, are collected at 1, 6, and 12 months after the index hospitalization.
Among the most important strengths of clinical registries are broad population coverage, solid exposure and outcome measures, and rich clinical data available to adjust for confounding variables. Clinical registries tend to include a large number of patients; registry-based findings therefore can be more generalizable to a wide range of patients than those from randomized controlled trials (RCTs), which enroll more homogeneous populations under ideal conditions. Exposure/treatment and outcomes/effectiveness are better defined in a registry than in administrative data. Additionally, rich clinical information, such as risk factors and comorbid conditions, can be used to build precise risk-adjustment models. Finally, clinical registries provide an opportunity to evaluate the effectiveness of interventions in cases in which randomization is unethical or impractical (e.g., studies of surgical procedures).
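As a rough illustration of how registry risk factors can feed into risk adjustment, the sketch below uses indirect standardization: each patient is assigned the reference event rate of a risk stratum, and a hospital's observed events are compared with the count expected from its case mix. This is a simplified stand-in for the regression-based models usually built from registry data. The strata, rates, and field names (`risk_stratum`, `died`) are hypothetical and are not drawn from any registry described above.

```python
# Indirect standardization: a deliberately simple form of risk adjustment.
# All strata, rates, and field names here are hypothetical illustrations.

def expected_events(patients, reference_rates):
    """Sum each patient's reference event rate, chosen by risk stratum."""
    return sum(reference_rates[p["risk_stratum"]] for p in patients)


def observed_to_expected(patients, reference_rates):
    """Observed events divided by the count expected from the case mix."""
    observed = sum(1 for p in patients if p["died"])
    return observed / expected_events(patients, reference_rates)


# Hypothetical reference mortality rates by risk stratum.
REFERENCE_RATES = {"low": 0.02, "high": 0.10}

# A hypothetical hospital: 50 low-risk patients (1 death) and
# 50 high-risk patients (4 deaths).
hospital = (
    [{"risk_stratum": "low", "died": i < 1} for i in range(50)]
    + [{"risk_stratum": "high", "died": i < 4} for i in range(50)]
)

oe_ratio = observed_to_expected(hospital, REFERENCE_RATES)
print(f"O/E ratio: {oe_ratio:.2f}")
```

A ratio below 1 means fewer events occurred than the case mix would predict. With richer registry data, the stratum lookup would typically be replaced by patient-level predicted probabilities from a fitted model.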
Despite the clear strengths of clinical registries, several limitations should be noted. Unlike in clinical trials, in which patients are randomized to either a treatment or a control arm, outcomes assessment using registry data can be subject to selection bias. In addition, registries are often limited in terms of their targeted population. Unlike hospital administrative data, a clinical registry often focuses on certain diseases and conditions. For instance, it is unrealistic to use a stroke registry for cancer research. Another drawback of many clinical registries has been the lack of follow-up data beyond hospital discharge or the acute episode of care. Knowledge obtained in hospital may not be applicable to intermediate or long-term outcomes. Methods have been developed to address this limitation by linking clinical registry data with external data sources. These strategies are discussed later in this chapter.
Administrative data are typically generated and gathered for an administrative purpose, such as billing upon hospital discharge or the provision of other health services. These data often include patient demographics, disease diagnoses, and medications or procedures received, which allows their use for clinical research as well. Large administrative datasets are increasingly used for observational studies and comparative effectiveness research (CER). Among all data sources, administrative data have perhaps the broadest population coverage, an attribute especially appealing for CER. For example, Medicare (Parts A and B) alone was providing health care coverage to more than 43 million people as of July 2010, including more than 36 million aged 65 or older (11).
Medicare datasets contain data for institutional claims (inpatient, outpatient, skilled nursing facility, hospice, or home health agency) and noninstitutional claims (physician and durable medical equipment providers) (12). The 5% sample is created from the 100% Denominator file based on the beneficiary’s Health Insurance Claim (HIC) number. Medicare claims data do not contain laboratory results or medication information.
The Medicare Denominator File contains demographics that include beneficiary unique identifier, state, county, ZIP code, date of birth, sex, race, date of death, program eligibility, and enrollment information (13).
The Medicare Part D File contains information about prescriptions that beneficiaries had filled under Medicare Part D. As of 2008, nearly 25 million Medicare beneficiaries were enrolled in Medicare Part D plans. These data can be linked with Medicare physician and hospital claims under Parts A and B of the program and can serve as a rich data source to describe medication use patterns, health outcomes, and adverse events among aged and disabled Medicare beneficiaries (9).
The Medicare Provider Analysis and Review File (MedPAR) contains the accumulated claims for services provided to a beneficiary from the time of inpatient hospital admission to discharge home or to a skilled nursing facility. It allows tracking of inpatient hospitalizations and patterns of care over time (14).
The Medicare Current Beneficiary Survey (MCBS) is a nationally representative survey of the Medicare population regarding their health status and functioning, healthcare use and expenditures, health insurance coverage, and sociodemographic characteristics. It includes oversampled aged, disabled, and institutionalized beneficiaries (15).
The Medicaid Analytic eXtract (MAX), formerly known as the State Medicaid Research Files (SMRFs), is a set of person-level data files containing information on Medicaid eligibility, enrollment status, demographic data, utilization summary, inpatient care, long-term care, medication claims, and claims for all noninstitutional Medicaid services (16).
The VA Patient Treatment File contains records for all inpatient care received by all covered persons at VA care facilities. Information collected includes demographics, diagnoses, procedures, and summary information for each episode of care (17).
The VA Outpatient Clinic File contains records for all outpatient care received by all covered persons from 1980 to the present (17).
The VA Pharmacy Benefits Management Services Database contains all prescriptions dispensed within the VA health system from 1999 to the present (18).
VA Decision Support System (DSS) Data Extracts contain cost and use information on inpatient and outpatient laboratory testing, prescription medications, radiological procedures, and demographics for all VA patients (19). Unlike other administrative data, results for a specific list of laboratory tests are included in the DSS.
The VA Vital Status File (VSF) contains the date of death information for all veterans who received VA care after 1992, who were enrolled in the VA, or who received compensation or pension benefits from the VA after 2002 (20).
The Healthcare Cost and Utilization Project (HCUP) includes the largest collection of hospital care data in the United States (21). It includes the Nationwide Inpatient Sample (NIS) and State Inpatient Databases (SID). The NIS includes more than 100 clinical and nonclinical data elements for each inpatient stay from a stratified sample of about 20% of community hospitals across the country. Information includes patient demographics, admission and discharge status, principal and up to 30 secondary diagnosis codes, principal and secondary procedure codes, length of stay, and total charges. The SID contains similar data elements but includes 100% of inpatient discharge records from participating states. Both the NIS and SID contain hospital identifiers that permit linkage with the American Hospital Association Annual Survey data and county identifiers for linkage to the Area Resource File.
The Medical Expenditure Panel Survey (MEPS) is a set of nationally representative surveys of healthcare use, expenditures, sources of payment, health insurance coverage, and health status for noninstitutionalized Americans (22). A unique feature of MEPS is that it includes data on traditionally underrepresented populations such as poor, older, minority, and uninsured citizens.
The MarketScan Commercial Claims and Encounters database from Thomson Reuters (Healthcare) Inc. contains fully integrated, de-identified, person-specific data on enrollment, clinical use, and expenditures across inpatient, outpatient laboratory, and outpatient prescriptions from about 100 commercial, Medicare supplemental, and Medicaid payers (23). MarketScan can be linked to track detailed patient information across sites and types of providers and over time.
Kaiser Permanente is a large, integrated healthcare delivery system. Its electronic medical records, laboratory, and pharmacy databases include claims data for both inpatient and outpatient services (24).
The main strength of administrative data is their broad population coverage. Findings from administrative databases can therefore be readily applied in real-world settings. Because most administrative datasets are relatively large, it is possible to focus on groups traditionally underrepresented in other data sources, such as racial/ethnic minority, female, and older patients. Moreover, administrative data are relatively efficient to obtain because no additional primary data collection is needed. Importantly, many administrative datasets share common identifiers that make long-term follow-up possible.
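To make the point about common identifiers concrete, the sketch below follows patients across inpatient claims by a shared beneficiary identifier and flags anyone readmitted within a chosen window after a prior discharge. It is a minimal illustration under stated assumptions: the record layout (`bene_id`, `admit`, `discharge`), the 30-day window, and the rule that a same-day admission counts as a transfer rather than a readmission are all hypothetical choices, not the layout or definitions of any dataset described here.

```python
from datetime import date


def readmitted_within(stays, days=30):
    """Return the IDs of patients readmitted within `days` of a discharge.

    `stays` is a list of dicts with hypothetical keys `bene_id`, `admit`,
    and `discharge`. Same-day admissions are treated as transfers and are
    not counted (an assumption made for this sketch).
    """
    by_person = {}
    for stay in stays:
        by_person.setdefault(stay["bene_id"], []).append(stay)

    flagged = set()
    for bene_id, person_stays in by_person.items():
        ordered = sorted(person_stays, key=lambda s: s["admit"])
        # Compare each stay with the next one for the same person.
        for prev, nxt in zip(ordered, ordered[1:]):
            gap = (nxt["admit"] - prev["discharge"]).days
            if 0 < gap <= days:
                flagged.add(bene_id)
    return flagged


# Hypothetical claims: patient A returns after 15 days, patient B after 69.
stays = [
    {"bene_id": "A", "admit": date(2010, 1, 5), "discharge": date(2010, 1, 10)},
    {"bene_id": "A", "admit": date(2010, 1, 25), "discharge": date(2010, 1, 28)},
    {"bene_id": "B", "admit": date(2010, 1, 5), "discharge": date(2010, 1, 10)},
    {"bene_id": "B", "admit": date(2010, 3, 20), "discharge": date(2010, 3, 22)},
]

readmitted = readmitted_within(stays)
print(sorted(readmitted))
```

The same grouping-by-identifier pattern extends to other longitudinal outcomes, such as time from an index admission to a recorded date of death.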
Despite these strengths, administrative data can have major limitations. First and foremost, the data can be “dirty.” Because administrative data are generally gathered for medical billing, some data elements are likely to be more accurate than others. For instance, providers are not penalized for providing inaccurate information unless it is associated with medical payments. Second, compared with clinical trial and registry databases, important clinical details such as secondary diagnoses and procedures are often undercoded, making risk adjustment difficult (25). Third, except for a few sources, many administrative databases do not contain laboratory and pharmacy data. Consequently, secondary data analyses using administrative data can seldom give a complete clinical picture.
Discussions of data sources for observational studies often center on clinical registries and hospital administrative data. However, observational studies can also be conducted through secondary use of existing clinical trial databases. For instance, researchers can take data from either the treatment or the control arm and evaluate them for purposes other than those specified in the original RCT design.
Because RCTs are generally accepted as the gold standard for clinical evidence, many attributes of a reliable and robust database for clinical research have been built into such study designs. For example, RCT databases contain well-delineated exposure measures, objective outcome assessments, rich clinical information, and few confounding variables. Despite these advantages, however, secondary analyses of trial data have the same limitations as do primary analyses of RCT data. For instance, RCTs are designed to evaluate the efficacy and safety of a treatment or intervention under ideal circumstances. Thus, the population included in an RCT is often narrowly selected, making the results not necessarily generalizable to patients who will eventually use the medication or treatment in real-world settings.
Given the pros and cons of RCTs and registries, investigators have considered merging them, to create registry–trial hybrids for clinical research. The National Cardiovascular Research Infrastructure (NCRI) is a partnership between the American College of Cardiology Foundation (ACCF) and the Duke Clinical Research Institute (DCRI) to develop a multicenter translational research model based on the data-collection activities of the NCDR. The NCRI recruits registry sites as clinical trial participants, uses a standard registry data-collection system already in use by the hospitals, and adds additional data elements for the specific trials to the NCDR backbone. This creates an efficient clinical research platform for randomized clinical trials based on the existing registry.
The Study of Access site For Enhancement of Percutaneous Coronary Intervention (SAFE-PCI) for Women is a multicenter, randomized, open-label, active-control study comparing the efficacy and feasibility of the transradial approach versus the transfemoral approach to PCI in women (NCT01406236). It is one of the first proof-of-concept trials to use the NCRI. The study identifies sites within the NCDR system, and patients enrolled at participating sites are randomized to undergo either transfemoral or transradial PCI before diagnostic angiography. Data-collection efforts are further reduced by using the existing NCDR data-collection system. All these efforts improve operational efficiency in clinical trials.
Besides the integration of trials into ongoing registries, there is increasing interest in creating a linked claims–registry database, with rich clinical information as well as longitudinal outcome assessments, that takes advantage of the strengths of both databases and overcomes their limitations. The major challenge in linking different data sources is the availability of the identifying information in each dataset. Some databases contain standard personal identifiers such as patient name and Social Security number, making it possible to link a clinical registry directly with administrative data sources. However, many clinical registries do not collect or distribute unencrypted patient identifiers. Nevertheless, high-quality linkages can be made using various combinations of nonunique fields. Hammill et al. described a method of linking inpatient clinical registry data to Medicare claims data using indirect identifiers (26). They showed that by using a combination of indirect identifiers available in both datasets, such as admission date, discharge date, patient age or date of birth, and patient sex, a high-quality linked database can be created without requiring direct identifiers.
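The idea of linking on indirect identifiers can be sketched as a deterministic match on a combination of nonunique fields. The example below is a minimal sketch, not the published method: it matches registry and claims records on admission date, discharge date, date of birth, and sex, plus a hospital identifier that is an added assumption here, and it keeps a link only when the combination is unique on both sides. All field names and records are hypothetical.

```python
from collections import defaultdict
from datetime import date

# Indirect identifiers used as the composite key. `hospital_id` is an
# assumption added for this sketch; all field names are hypothetical.
KEY_FIELDS = ("hospital_id", "admit_date", "discharge_date", "birth_date", "sex")


def link_key(record):
    return tuple(record[field] for field in KEY_FIELDS)


def link_records(registry, claims):
    """Return (registry_id, claim_id) pairs with a unique one-to-one match.

    Keys that appear more than once in either source are ambiguous and
    are left unlinked rather than guessed.
    """
    registry_by_key = defaultdict(list)
    for rec in registry:
        registry_by_key[link_key(rec)].append(rec)
    claims_by_key = defaultdict(list)
    for rec in claims:
        claims_by_key[link_key(rec)].append(rec)

    links = []
    for key, regs in registry_by_key.items():
        matches = claims_by_key.get(key, [])
        if len(regs) == 1 and len(matches) == 1:
            links.append((regs[0]["registry_id"], matches[0]["claim_id"]))
    return links


def make(id_field, id_value, hospital, admit, discharge, birth, sex):
    """Build one hypothetical record with the shared key fields."""
    return {id_field: id_value, "hospital_id": hospital, "admit_date": admit,
            "discharge_date": discharge, "birth_date": birth, "sex": sex}


registry = [
    make("registry_id", "R1", "H1", date(2008, 3, 1), date(2008, 3, 5), date(1935, 6, 2), "F"),
    make("registry_id", "R2", "H1", date(2008, 3, 1), date(2008, 3, 4), date(1940, 1, 15), "M"),
    make("registry_id", "R3", "H2", date(2008, 4, 9), date(2008, 4, 12), date(1938, 9, 30), "F"),
]
claims = [
    make("claim_id", "C1", "H1", date(2008, 3, 1), date(2008, 3, 5), date(1935, 6, 2), "F"),
    # C2 and C3 share every key field, so R2 cannot be linked unambiguously.
    make("claim_id", "C2", "H1", date(2008, 3, 1), date(2008, 3, 4), date(1940, 1, 15), "M"),
    make("claim_id", "C3", "H1", date(2008, 3, 1), date(2008, 3, 4), date(1940, 1, 15), "M"),
    make("claim_id", "C4", "H3", date(2008, 5, 2), date(2008, 5, 6), date(1931, 2, 11), "M"),
]

links = link_records(registry, claims)
link_rate = len(links) / len(registry)
print(links, f"linked {link_rate:.0%} of registry records")
```

Discarding ambiguous keys trades completeness for accuracy: only one of the three registry records is linked in this toy example, which shows how linkage can shrink the analysis population.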
Linking clinical registries and administrative data overcomes the lack of clinical details in administrative databases and the absence of long-term assessments in registry data. Importantly, a linked dataset extends the applications of both data sources, provides a full picture of the care delivered, and approaches the ideal for assessing CER in real-world clinical practice. Linking clinical registry and administrative data also has limitations. Using indirect identifiers can cause inaccurate links between databases; thus, external validation is recommended if the data are available (27). In addition, linkage almost always limits the analysis population, which, in turn, raises concerns about the generalizability of the study findings.
In summary, various data sources, including clinical registries, hospital administrative data, and trial databases, are available for observational study. Researchers must evaluate the strengths versus limitations of each type to determine the most appropriate data source to answer their research questions.
1. Hong Y, LaBresh KA. Overview of the American Heart Association “Get with the Guidelines” programs: coronary heart disease, stroke, and heart failure. Crit Pathw Cardiol. 2006;5(4):179-186.
2. American Heart Association. The Guideline Advantage. 2011. http://www.guidelineadvantage.org/TGA/. Accessed May 7, 2012.
3. National Cardiovascular Data Registry. NCDR® ACTION Registry®-GWTG™ Home Page. http://www.ncdr.com/WebNCDR/Action/default.aspx. Accessed April 23, 2012.
4. American College of Cardiology. NCDR®. 2011. http://www.ncdr.com/webncdr/common/. Accessed May 7, 2012.
5. Hoekstra JW, et al. Improving the care of patients with non-ST-elevation acute coronary syndromes in the emergency department: the CRUSADE initiative. Acad Emerg Med. 2002;9(11):1146-1155.
6. Society of Thoracic Surgeons. STS National Database | STS. http://sts.org/national-database. Accessed May 7, 2012.
7. Fonarow GC. The Acute Decompensated Heart Failure National Registry (ADHERE): opportunities to improve care of patients hospitalized with acute decompensated heart failure. Rev Cardiovasc Med. 2003;4(suppl 7):S21-S30.
8. Outcome Sciences, Inc. OPTIMIZE-HF home page. 2005. http://optimize-hf.org/. Accessed May 7, 2012.
9. Spertus JA, et al. The Prospective Registry Evaluating Myocardial Infarction: Events and Recovery (PREMIER)—evaluating the impact of myocardial infarction on patient outcomes. Am Heart J. 2006;151(3):589-597.
10. Arnold SV, et al. Translational Research Investigating Underlying Disparities in Acute Myocardial Infarction Patients’ Health Status (TRIUMPH): design and rationale of a prospective multicenter registry. Circ Cardiovasc Qual Outcomes. 2011;4(4):467-476.
11. Centers for Medicare & Medicaid Services. Medicare Enrollment Reports. http://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/MedicareEnrpts/index.html. Accessed May 7, 2012.
12. Centers for Medicare & Medicaid Services. Carrier Line Items. 2012. https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/BSAPUFS/Carrier_Line_Items.html. Accessed May 7, 2012.
13. Centers for Medicare & Medicaid Services. Denominator LDS. 2012. https://www.cms.gov/Research-Statistics-Data-and-Systems/Files-for-Order/LimitedDataSets/DenominatorLDS.html. Accessed May 7, 2012.
14. Centers for Medicare & Medicaid Services. Medicare Provider Analysis and Review File. 2012. https://www.cms.gov/Research-Statistics-Data-and-Systems/Files-for-Order/IdentifiableDataFiles/MedicareProviderAnalysisandReviewFile.html. Accessed May 7, 2012.
15. Centers for Medicare & Medicaid Services. Medicare Current Beneficiary Survey (MCBS). 2012. http://www.cms.gov/Research-Statistics-Data-and-Systems/Research/MCBS/index.html?redirect=/mcbs. Accessed May 7, 2012.
16. Centers for Medicare & Medicaid Services. MAX General Information. 2012. https://www.cms.gov/Research-Statistics-Data-and-Systems/Computer-Data-and-Systems/MedicaidDataSourcesGenInfo/MAXGeneralInformation.html. Accessed May 7, 2012.
17. US Department of Veterans Affairs. VHA Medical SAS® Datasets— Description. http://www.virec.research.va.gov/DataSourcesName/Medical-SAS-Datasets/SAS.htm. Accessed May 7, 2012.
18. US Department of Veterans Affairs. Pharmacy Benefits Management Services (PBM)—Database. http://www.virec.research.va.gov/DataSourcesName/PBM/PBM.htm. Accessed May 7, 2012.
19. US Department of Veterans Affairs. VHA Decision Support System (DSS)— Introduction. http://www.virec.research.va.gov/DataSourcesName/DSS/DSSintro.htm. Accessed May 7, 2012.
20. US Department of Veterans Affairs. VHA Vital Status File. http://www.virec.research.va.gov/DataSourcesName/VitalStatus/VitalStatus.htm. Accessed May 7, 2012.
21. Agency for Healthcare Research and Quality. Health Care: Healthcare Cost and Utilization Project (HCUP) Subdirectory Page. http://www.ahrq.gov/data/hcup/. Accessed May 7, 2012.
22. Agency for Healthcare Research and Quality. Medical Expenditure Panel Survey Home. http://meps.ahrq.gov/mepsweb/. Accessed May 7, 2012.
23. Cecil G. Sheps Center for Health Services Research, University of North Carolina at Chapel Hill. MarketScan® Commercial Claims and Encounters and Medicare Supplemental Databases. 2012. http://www.shepscenter.unc.edu/marketscan/index.html. Accessed May 7, 2012.
24. Division of Research, Kaiser Permanente. Research Topics. http://www.dor.kaiser.org/external/DORExternal/research/index.aspx. Accessed May 7, 2012.
25. Quan H, Parsons GA, Ghali WA. Validity of information on comorbidity derived from ICD-9-CCM administrative data. Med Care. 2002;40(8):675-685.
26. Hammill BG, et al. Linking inpatient clinical registry data to Medicare claims data using indirect identifiers. Am Heart J. 2009;157(6):995-1000.
27. Méray N, Reitsma JB, Ravelli ACJ, Bonsel GJ. Probabilistic record linkage is a valid and transparent tool to combine databases without a patient identification number. J Clin Epidemiol. 2007;60(9):883-891.