Tribal Epidemiology Toolkit - Data Linkage

Record linkage is the process of comparing records across datasets to identify individuals who appear in both. Linkages can supplement or validate data across datasets. They can also identify duplicate records for the same individual within one dataset, a fundamental requirement for accurate and valid event counts in any disease registry. In Indian Country, a common example is linking a dataset that contains information on American Indian/Alaska Native (AIAN) ancestry with a second dataset to improve the quality of race coding in the second dataset. This type of linkage is important because, without accurate race data in health records, the true burden of disease is unknown. Other examples include:
  • Merging death information from a vital statistics file with cancer information from a central cancer registry.
  • Linking data from death certificates, inpatient hospitalizations, and law enforcement citations to generate crash and injury reports, as in the National Highway Traffic Safety Administration’s CODES program.
Why pursue record linkages?
“At Oregon Public Health Division, we think of linkages as part of regular, sound public health practice. We’ve done linkages between the Northwest Portland Area Indian Health Board’s Northwest Tribal Registry and our state cancer registry annually since 1999 to improve the quality of data for American Indians and Alaska Natives in our state. When we did the first linkage, we were floored: 59% of American Indians and Alaska Natives in our cancer registry had no indication of their native heritage in the database. In our two most recent cancer registry linkages, we still found that an average of 28% of new cases among American Indians and Alaska Natives were not correctly identified in our database.
We’ve also worked with the Health Board to pursue linkages of the state vital statistics and HIV/AIDS databases, among others. These linkages provide a more accurate picture of Indian health in Oregon. This helps both us and our partners in tribal and urban Indian health to recognize disparities and to better promote health among Oregon’s Native population.”
- Richard Leman, Medical Epidemiologist, Oregon Public Health Division
“Tribal health leaders have long recognized the necessity of having complete and accurate race data as a first step to addressing health disparities experienced by American Indians/Alaska Natives (AIAN) and other racial minority populations. Numerous studies have shown high prevalence of race misclassification for AIAN in data sources, such as vital statistics and cancer registries. This results in underestimated morbidity and mortality, hampering public health decision-making and the appropriate allocation of disease control resources.
Using the most complete listing of AIAN currently available – a roster of individuals who have registered at tribal, Indian Health Service, and urban Indian clinics in the Northwest – we perform record linkages with health data systems in Idaho, Oregon, and Washington. The prevalence of misclassified and missing race data in this region can range from 30-60%, which, if left uncorrected, would significantly underestimate the burden of adverse health outcomes for this population. Our work directly benefits both state partners and tribes by improving the accuracy of race data in state surveillance data systems, and providing more accurate and complete health status data to Northwest tribal communities. To date, linkages have been conducted with state cancer registries, death records, hospital discharge data, STD surveillance systems, and several tribe-specific projects. This work is widely supported by tribal health leaders and our state partners.”
- Megan Hoopes, Project Director, IDEA-NW/Registry, Northwest Portland Area Indian Health Board
“From the state health department perspective where data linkages have not occurred, improved racial and tribal data would help further identify disparities associated with our tribal communities and their members. In North Dakota, the quality of the data in the race demographic field is very poor for some of the infectious diseases, while others have better race information. Last year, almost 87% of the data for influenza and 40% of pertussis data was of unknown race, while diseases such as HIV and TB had complete race data, where no cases were classified as unknown or left blank.
The North Dakota Statewide Cancer Registry uses Link Plus to link hospital, vital records and out-of-state central cancer registry data files to our system. We also use Link Plus to link with the state Breast and Cervical Cancer Program. As required by the National Program of Cancer Registries and the North American Association of Central Cancer Registries, we participate in annual linkages with national Indian Health Service data to update race misclassification of American Indian/Alaska Native people. We do not change race codes in our cancer database as a result of these linkages but we do retain linkage-identified race information in a specific field, which allows us to account for these matches when we run statistics requiring AI/AN data.
In Chronic Disease, the health department relies heavily on vital records and Behavioral Risk Factor Surveillance System (BRFSS) data. There is a concern that race may not be accurately reported in death records. This may be more of a training issue for the people who complete death certificates than anything else. Since race is self-reported in BRFSS, I have no concerns with that data source. We also work with health claims such as Medicare and Blue Cross Blue Shield of North Dakota (BCBSND). The Centers for Medicare & Medicaid Services has made efforts to improve accuracy in race reporting in the Medicare system. BCBSND does not collect race information. Improving the accuracy of racial data is important to all disease programs because of the disparities and inequities related to race. These disparities have been persistent and in order for programs to address them, accurate racial data is crucial. Indicators at the national level and in our surrounding states indicate disparities among tribal populations. However, because some divisions in the health department do not get quality race data, it is difficult to identify the true impact to our tribal populations. The goal of linkages would be to not only improve our infectious disease data but to look at data quality across the whole health department.”
- Tracy Miller, State Epidemiologist, North Dakota Department of Health
Linkages fall into two main categories: deterministic and probabilistic. Deterministic linkage compares selected data fields (for example, name, date of birth, and social security number) and accepts a pair of records as a match only when those fields agree exactly. This process is fairly straightforward, but the strict matching requirements mean that coding errors or missing data can cause many true matches to be missed, which reduces the ability to detect and correct misidentification.
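As a rough illustration of deterministic linkage, the sketch below pairs records only when every key field agrees exactly. The field names (`last_name`, `first_name`, `dob`), the light normalization step, and the sample records are all assumptions made for this example; they do not come from any particular registry or software package.

```python
def normalize(value):
    """Trim and lowercase so trivial formatting differences do not block a match."""
    return (value or "").strip().lower()


def deterministic_link(file_a, file_b, keys=("last_name", "first_name", "dob")):
    """Return (record_a, record_b) pairs whose key fields all match exactly."""
    # Index file_b by its key so each record in file_a needs one lookup.
    index = {}
    for rec in file_b:
        key = tuple(normalize(rec.get(k)) for k in keys)
        index.setdefault(key, []).append(rec)

    pairs = []
    for rec in file_a:
        key = tuple(normalize(rec.get(k)) for k in keys)
        if "" in key:  # a missing field can never satisfy an exact match
            continue
        for match in index.get(key, []):
            pairs.append((rec, match))
    return pairs


# Fictitious records, for illustration only.
registry = [
    {"id": "A1", "last_name": "Smith", "first_name": "Katherine", "dob": "1970-03-12"},
    {"id": "A2", "last_name": "Hoopes", "first_name": "Dana", "dob": "1981-07-04"},
]
roster = [
    {"id": "B1", "last_name": "Smith", "first_name": "Kathy", "dob": "1970-03-12"},
    {"id": "B2", "last_name": "HOOPES", "first_name": "Dana ", "dob": "1981-07-04"},
]

linked_ids = [(a["id"], b["id"]) for a, b in deterministic_link(registry, roster)]
print(linked_ids)
```

Here only the second pair links. The first pair is plausibly the same person, but "Katherine" and "Kathy" do not agree exactly, which is the kind of missed match described above.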
Probabilistic linkage estimates the probability that two records are for the same person. This process allows for differences between the two files, such as the use of nicknames, middle initials versus full middle names, and transposed digits in a social security number or date of birth. Probabilistic matching also offers several other advantages, including the ability to:
  • Account for both the likelihood that two records represent the same person (sensitivity), and the likelihood that they do not (specificity).
  • Account for coding errors, missing data, reporting variations, and duplicate records.
  • Assign score weights depending on the frequency of a value (e.g., your dataset contains many “Smiths” but few “Hoopes,” so a match on “Hoopes” would be weighted higher).
  • Allow for phonetic name matching (e.g., NYSIIS and Soundex).
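The ideas in this list can be sketched in a few lines of code. The example below uses a simplified log-weight scoring scheme of the kind commonly associated with the Fellegi-Sunter model, which the text above does not name, so treat that choice as an assumption. The m and u probabilities, the field names, and the sample records are hypothetical, and this is not how Link Plus, The Link King, or LinkSolv are implemented. Surnames are compared with standard American Soundex codes, and agreement on a rare surname code earns more weight than agreement on a common one.

```python
import math
from collections import Counter

# Standard American Soundex letter groups.
_SOUNDEX = {ch: digit
            for digit, group in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                                 "4": "l", "5": "mn", "6": "r"}.items()
            for ch in group}


def soundex(name):
    """Return the four-character Soundex code for a name ('' if no letters)."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    code, prev = letters[0].upper(), _SOUNDEX.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":  # h and w do not separate letters with the same code
            continue
        digit = _SOUNDEX.get(ch, "")
        if digit and digit != prev:
            code += digit
        prev = digit  # vowels reset prev, so a repeated code is kept
    return (code + "000")[:4]


# Hypothetical values: m = P(field agrees | same person),
# u = P(field agrees | different people).
FIELDS = {"last_name": (0.95, 0.01), "first_name": (0.90, 0.02), "dob": (0.97, 0.003)}


def field_weight(agrees, m, u):
    """Positive weight for agreement, negative weight for disagreement."""
    return math.log2(m / u) if agrees else math.log2((1 - m) / (1 - u))


def score_pair(rec_a, rec_b, surname_counts, n_records):
    """Sum field weights; higher totals suggest the records are the same person."""
    total = 0.0
    for field, (m, u) in FIELDS.items():
        if not rec_a.get(field) or not rec_b.get(field):
            continue  # missing data adds no evidence either way
        if field == "last_name":
            key_a, key_b = soundex(rec_a[field]), soundex(rec_b[field])
            agrees = key_a == key_b
            if agrees:
                # Frequency-based u: common surnames agree by chance more often.
                u = max(surname_counts[key_b], 1) / n_records
        else:
            agrees = rec_a[field].strip().lower() == rec_b[field].strip().lower()
        total += field_weight(agrees, m, u)
    return total


# Fictitious roster, for illustration only.
roster = [
    {"last_name": "Smith", "first_name": "Kathy", "dob": "1970-03-12"},
    {"last_name": "Smith", "first_name": "John", "dob": "1985-11-02"},
    {"last_name": "Smyth", "first_name": "Ann", "dob": "1990-01-20"},
    {"last_name": "Hoopes", "first_name": "Dana", "dob": "1981-07-04"},
]
surname_counts = Counter(soundex(r["last_name"]) for r in roster)

case = {"last_name": "Smith", "first_name": "Katherine", "dob": "1970-03-12"}
for candidate in roster:
    print(candidate["first_name"], candidate["last_name"],
          round(score_pair(case, candidate, surname_counts, len(roster)), 2))
```

In this toy data, the "Kathy Smith" record scores well above the other candidates even though the first names differ, because the surname and date of birth agree. In practice, linkage software typically compares such scores against thresholds and sends borderline pairs for manual review.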
A presentation with more detailed information about linkage concepts is also available.
Linkage Software
There are several user-friendly software options available that require little programming knowledge. Three commonly used tools are listed below, but there are others available as well.
  • The Link King is free, public-domain probabilistic linkage and de-duplication software (a user manual is available).
  • Link Plus (a component of Registry Plus) is free, publicly available probabilistic linkage and de-duplication software designed by CDC for use by central cancer registries, but it can be used with any fixed-width or delimited data file. No user manual is available, but technical support is offered by phone and email, and there is an instructional PowerPoint presentation.
  • LinkSolv is commercial probabilistic linkage software sold by Strategic Matching. Training and technical support are available.
Sample Documents
The process of linking records varies across states, departments, and institutions, but here are some tools to get you started. First, it is important to contact the manager of the data source with which you wish to link to discuss the project and determine specific approval processes that the organization may have.
Often, a simple institutional review board (IRB) protocol is developed, which may qualify for expedited review, and a data-sharing agreement is negotiated with the linking agency. Confidentiality pledges can be used to specify the data handling and disclosure protocols required of staff with access to confidential data.
This sample data sharing agreement provides a template covering data sharing and the use and disclosure of client information. At a minimum, the agreement should specify the following:
  • Parties involved, including contact information.
  • Purpose or need for the data sharing agreement.
  • Nature of the data to be collected.
  • Access and confidentiality of data.
  • How the data are to be used.
  • How and in what situations the agreement can be severed by either party.
  • Relevant legal authorities (e.g., tribal, state, national).
This confidentiality pledge outlines the rules for internal access to a dataset containing direct personal identifiers, such as a patient registration list or tribal enrollment list, which may be used for record linkages. Technical details of data exchange between multiple parties should be specified separately in a data sharing/use/exchange agreement.
Link Methods Protocol is an example of an IRB protocol describing linkage methods using Link Plus Software.
Examples of Linkages to Improve AIAN Data
National Linkage
To minimize the effects of racial misclassification and provide improved estimates of mortality and disease among AIANs, the Indian Health Service (IHS) and CDC support annual national linkages between IHS registration records and cancer registries, as well as a one-time linkage with the National Death Index (NDI). Linkage results are captured in the IHS Link field (NAACCR item #192). The “improved” data has been used in publications such as the Annual Report to the Nation on the Status of Cancer, 1975-2004, featuring cancer in American Indians and Alaska Natives, and the Cancer Supplement: An Update on Cancer in American Indians and Alaska Natives, 1999-2004.
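One way to use a linkage result stored in its own field is to leave the originally recorded race untouched and derive an analytic race value at analysis time. The sketch below is a minimal illustration under stated assumptions: the field names `recorded_race` and `ihs_link`, the use of 1 to mean "linked to an IHS registration record", and the sample cases are all hypothetical and are not the actual NAACCR coding scheme.

```python
def analytic_race(record):
    """Classify as AIAN if either the recorded race or the linkage flag says so.

    The input record is not modified, so the original race code is preserved.
    """
    if record.get("ihs_link") == 1 or record.get("recorded_race") == "AIAN":
        return "AIAN"
    return record.get("recorded_race") or "Unknown"


# Fictitious cases, for illustration only.
cases = [
    {"id": 1, "recorded_race": "White", "ihs_link": 1},
    {"id": 2, "recorded_race": "AIAN", "ihs_link": 0},
    {"id": 3, "recorded_race": "White", "ihs_link": 0},
    {"id": 4, "recorded_race": None, "ihs_link": 1},
]

aian_before = sum(c["recorded_race"] == "AIAN" for c in cases)
aian_after = sum(analytic_race(c) == "AIAN" for c in cases)

# Share of AIAN cases that the recorded race field alone would have missed.
missed_share = (aian_after - aian_before) / aian_after
print(aian_before, aian_after, round(missed_share, 2))
```

Keeping the linkage flag separate makes it possible to report counts both with and without the linkage correction, and to quantify how much misclassification the linkage uncovered.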
Regional Linkage
The Improving Data & Enhancing Access — Northwest (IDEA-NW) Project aims to improve the quality of race data for AIANs through annual record linkages with a variety of health-related data systems in Idaho, Oregon, and Washington, and to disseminate health status data in ways that are locally meaningful for tribal health planning. IHS registration records are linked to state databases. A number of reports and publications have resulted from this project.
Tribal Linkage
To best serve their communities, tribes need their own tribe-specific incidence data. The Intertribal Council of Michigan has collaborated to link five tribal registries in Michigan to the state cancer database. Each tribe is provided with a de-identified database of the tribe’s cancer experience for analysis by the tribe, while the state registry retains the information needed to update the racial classification of linked patients within the state registry. Their data use agreement and publications provide additional details.
