The Korea National Health and Nutrition Examination Survey data linked Cause of Death data

The Korea National Health and Nutrition Examination Survey (KNHANES) is a national health survey conducted annually to assess the health and health-related behaviors of the Korean population. To enable the use of KNHANES data in studies of mortality risk factors, the Korea Disease Control and Prevention Agency (KDCA) constructed a database linking KNHANES data to the Cause of Death Statistics of Statistics Korea, which has been available to researchers since 2020. The KNHANES data were linked to the Cause of Death Statistics using resident registration numbers for participants aged 19 years or older who consented to data linkage. The linkage rate between the 2007-2015 KNHANES and the 2007-2019 Cause of Death Statistics was 97.1%. In the linked dataset, 6.6% of participants died during follow-up; neoplasms accounted for the largest share of deaths (32.1%), followed by diseases of the circulatory system (22.7%) and diseases of the respiratory system (11.5%). The linked dataset is available through the Research Data Center of the KDCA after review of a research proposal, and updated versions will be released periodically.


INTRODUCTION
The Korea Disease Control and Prevention Agency (KDCA) conducts the Korea National Health and Nutrition Examination Survey (KNHANES) annually, under article 16 of the National Health Promotion Act, to assess the health and health-related behaviors of the Korean population. The KNHANES serves as the basis for establishing and evaluating health policy [1,2]. In addition, KNHANES data are made publicly available as a data source for various epidemiologic studies. However, because the KNHANES is a cross-sectional survey, it is difficult to establish the direction of the relationships between health behaviors, nutritional status, and chronic diseases.
To complement the limitations of a cross-sectional design and enhance the utilization of survey data, other countries link national health survey data to healthcare service utilization and cause-of-death data and use the linked datasets in various studies. In the United States, the National Health Interview Survey and the National Health and Nutrition Examination Survey provide linked datasets on diseases, healthcare service utilization, and health examination information, together with data collected by researchers, upon request [7,8]. Statistics Korea (KOSTAT) also links the Cause of Death Statistics to researchers' own data and provides the linked data [9]. To increase the utilization of KNHANES data, the KDCA has, since 2007, obtained consent from KNHANES participants to link their data to the Cause of Death Statistics of KOSTAT, the Korea Central Cancer Registry of the National Cancer Center, and the healthcare service utilization data of the National Health Insurance Service (NHIS) and the Health Insurance Review and Assessment Service (HIRA). Based on these data, the KDCA established the KNHANES data linked to the Cause of Death Statistics, which has been provided to researchers since 2020 [10]. The present report introduces the composition of the linked dataset, its major outcomes, and the disclosure procedures.

DATA RESOURCES
The KNHANES is conducted to produce health statistics for the Korean population aged 1 year or older. The sample design, subjects, survey components, and survey methods of the KNHANES are described in the Guidebook for Korea National Health and Nutrition Examination Survey database and related publications [1,2,11]. The survey can be briefly summarized as follows (Table 1). The sampling frame is the most recent Population and Housing Census data available at the time of the design, and the sample is drawn using a stratified multistage probability sampling design with enumeration districts, households, etc. as sampling units. The target subjects are all household members aged 1 year or older in the selected primary sampling units and households, corresponding to about 10,000 individuals. The survey is composed of a health interview, a health examination survey, and a nutrition survey. In the health interview, household information (household type, household income, etc.) and personal health behaviors and conditions (smoking, alcohol use, physical activity, mental health, disease morbidity, etc.) are collected by face-to-face interview or self-report. The health examination survey consists of body measurements, blood pressure, laboratory tests, etc., with data collected through measurements and examinations. In the nutrition survey, dietary behavior, daily food and dietary intake, household food security, etc. are collected by face-to-face interview. The health interview and the health examination survey are conducted in mobile examination centers, while the nutrition survey is performed by visiting households.
In accordance with article 18 of the Statistics Act, the Cause of Death Statistics are produced to provide fundamental data not only for identifying the numbers and causes of death in the Korean population, but also for establishing healthcare policy. The Cause of Death Statistics are published in the following year based on reports received from January of the reference year to April of the following year, and consist of information on the date and time of death, cause of death, place of death, and residence area at the time of death. The cause of death is the underlying cause selected from those recorded on the death certificate according to the International Classification of Diseases of the World Health Organization. These causes of death are then further classified following the Korean Standard Classification of Diseases, 7th revision [9].
Raw data from the KNHANES are released in December of the year following the survey, and the Cause of Death Statistics are released during the first half of the year following data collection. Because the survey components of the KNHANES are similar within a survey cycle, the KNHANES portion of the linked data is updated once per survey cycle by combining the 3 years of data in that cycle.

POPULATION COVERAGE
The KNHANES and the Cause of Death Statistics were linked using the resident registration numbers collected from KNHANES participants. Among participants in the KNHANES health examination survey, those aged 19 years or older who consented to linkage with the Cause of Death Statistics and had valid resident registration numbers were included in the KNHANES data linked to the Cause of Death Statistics (hereafter, the linked dataset). A total of 53,101 people aged 19 years or older participated in the 2007-2015 KNHANES health examination survey (22,627 men and 30,474 women), of whom 98.9% (98.8% of men and 99.0% of women) consented to linkage with the Cause of Death Statistics. Of these individuals, 97.5% (97.9% of men and 97.2% of women) had valid resident registration numbers. In total, 51,575 participants who consented to linkage and had valid resident registration numbers were included in the linked dataset. The linkage rate was 97.1% (97.5% for men and 96.8% for women) of all participants in the health examination survey (Table 2).
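The overall linkage rate follows directly from the two counts reported in this section. The short Python sketch below recomputes it as a check; the constant and function names are illustrative only and do not correspond to variables in the KNHANES documentation.

```python
# Counts taken from the text above (2007-2015 KNHANES health examination survey).
EXAMINED_MEN = 22_627
EXAMINED_WOMEN = 30_474
EXAMINED_ADULTS = EXAMINED_MEN + EXAMINED_WOMEN  # participants aged 19 years or older
LINKED_ADULTS = 51_575  # consented and had a valid resident registration number


def linkage_rate(linked: int, examined: int) -> float:
    """Return the percentage of examined participants included in the linked dataset."""
    if examined <= 0:
        raise ValueError("examined must be positive")
    return 100.0 * linked / examined


if __name__ == "__main__":
    rate = linkage_rate(LINKED_ADULTS, EXAMINED_ADULTS)
    print(f"Examined adults: {EXAMINED_ADULTS}")
    print(f"Linkage rate: {rate:.1f}%")
```

The sex-specific rates cannot be recomputed in the same way from this section alone, because the numbers of linked men and women are reported only in Table 2.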

MEASURES
The linked dataset contains all variables provided in the KNHANES raw data. Because the age at death in the Cause of Death Statistics is calculated from the resident registration, the age based on the actual date of birth was removed from the linked dataset and the age based on the resident registration was added instead. The month of the health examination survey was also added so that follow-up periods can be calculated. Household and parental ID information, which cannot be analyzed in the linked dataset, and survey items for children and youths were excluded. Death-related information includes the cause of death and the year and month of death. The causes of death in the linked data are provided to researchers at the subcategory level of the Korean Standard Classification of Diseases (7th revision). However, because certain infectious and parasitic diseases (A00-B99), mental and behavioral disorders (F00-F99), and external causes of morbidity and mortality (V01-Y98) are sensitive information, the subcategories of these causes of death are provided only after review of the research proposal.
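As an illustration of the release rule described above, the following Python sketch flags cause-of-death codes that fall into the three sensitive ranges. It is a minimal sketch under stated assumptions: codes are assumed to begin with one letter followed by two digits, and the function name and input format are hypothetical rather than taken from the linked dataset.

```python
# Code ranges named in the text as sensitive (subcategories released only after review).
SENSITIVE_RANGES = (("A00", "B99"), ("F00", "F99"), ("V01", "Y98"))


def is_sensitive(code: str) -> bool:
    """Return True if the three-character category of `code` lies in a sensitive range.

    Assumes a code shaped like 'C34' or 'C34.1' (one letter, two digits, optional suffix).
    """
    category = code.strip().upper().replace(".", "")[:3]
    if len(category) < 3 or not category[0].isalpha() or not category[1:].isdigit():
        raise ValueError(f"unrecognized cause-of-death code: {code!r}")
    # Fixed-width letter+digit categories sort correctly as plain strings.
    return any(low <= category <= high for low, high in SENSITIVE_RANGES)


if __name__ == "__main__":
    for example in ("C34", "I21", "B20.1", "F10", "X60"):
        print(example, is_sensitive(example))
```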
The currently available linked dataset (linked dataset version 1.2) has the following characteristics. With December 31, 2019, as the cut-off date for the end of follow-up, the mean follow-up period of the 51,575 participants included in the linked dataset was 8.4 years, and the total sum of person-years (8.4 × 51,575) was 419,628 person-years. Of the 51,575 participants, 3,426 died between 2007 and 2019 (a death rate of 6.6%). The death rates of men and women were 8.8% and 5.0%, respectively, indicating higher mortality in men than in women. The death rate was higher among participants who were older at the time of the KNHANES and among those with lower income levels (Table 3). Because the linked dataset contains at least 97% of KNHANES participants, the characteristics of included and excluded participants were not compared.
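Because the linked dataset provides the month of the health examination and the year and month of death, follow-up time can be derived at monthly resolution. The Python sketch below shows one way this could be done, together with the crude death rate reported above. The `Participant` class, its field names, and the choice to count follow-up in whole months are assumptions made for illustration; they are not the variable names or the method used by the KDCA.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

CUTOFF: Tuple[int, int] = (2019, 12)  # last month of follow-up, as stated in the text


@dataclass
class Participant:
    exam_year: int   # hypothetical field: survey year of the health examination
    exam_month: int  # hypothetical field: month of the health examination
    death: Optional[Tuple[int, int]] = None  # (year, month) of death; None if no death record


def follow_up_years(p: Participant, cutoff: Tuple[int, int] = CUTOFF) -> float:
    """Years from examination to death, or to the cut-off if no death occurred before it."""
    died_in_window = p.death is not None and p.death <= cutoff
    end = p.death if died_in_window else cutoff
    months = (end[0] - p.exam_year) * 12 + (end[1] - p.exam_month)
    if months < 0:
        raise ValueError("end of follow-up precedes the examination date")
    return months / 12.0


def crude_death_rate(deaths: int, participants: int) -> float:
    """Percentage of participants who died during follow-up."""
    return 100.0 * deaths / participants


if __name__ == "__main__":
    survivor = Participant(exam_year=2010, exam_month=6)
    decedent = Participant(exam_year=2010, exam_month=6, death=(2015, 6))
    print(follow_up_years(survivor), follow_up_years(decedent))
    print(f"{crude_death_rate(3_426, 51_575):.1f}%")
```

Summing `follow_up_years` over all participants would give total person-years, which is the denominator needed for mortality rates per person-year in a prospective analysis.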
Of the main categories of causes of death, neoplasms accounted for the highest proportion of deaths (32.1%), followed by diseases of the circulatory system (22.7%), diseases of the respiratory system (11.5%), external causes of morbidity and mortality (9.6%), and symptoms and signs not elsewhere classified (7.6%) (Table 4). The leading causes of death were similar in men and women, but their order differed. The order in men matched that of the overall Korean population [12], whereas the leading causes of death in women were, in order, neoplasms, diseases of the circulatory system, symptoms and signs not elsewhere classified, diseases of the respiratory system, and external causes of morbidity and mortality. When the leading causes of death were examined in greater detail, malignant neoplasms of the trachea, bronchus, and lungs accounted for the highest proportion of deaths among neoplasms (7.5%), followed by malignant neoplasms of the liver and intrahepatic bile ducts (4.2%), malignant neoplasms of the stomach (3.9%), and malignant neoplasms of the colon, rectum, and anus (3.0%). Among diseases of the circulatory system, cerebrovascular diseases accounted for the highest proportion (8.3%), followed by ischemic heart diseases (6.0%) and other heart diseases (5.1%). Among diseases of the respiratory system, pneumonia accounted for the highest proportion (6.1%). Of the general death categories (56 items) of KOSTAT, the leading causes of death were malignant neoplasms (31.7%), heart diseases (11.1%), cerebrovascular diseases (8.3%), pneumonia (6.1%), and intentional self-harm (4.3%). These results are similar to those of the 2019 annual report on the Cause of Death Statistics, although the order of the causes of death differed (Figure 1) [12].

DATA RESOURCE UTILIZATION
Before the linked dataset was disclosed, pilot studies and a disclosure risk assessment were performed [13]. Based on their results, a disclosure procedure and a guidebook were prepared, and the data were disclosed for the first time during the first half of 2020. Specific measures for utilizing the linked dataset were developed through a total of 5 pilot studies, which led to 7 published research papers describing the risk of death in relation to various risk factors such as smoking, nutrient intake, blood pressure, sleep, work hours, and heavy metals [14-20]. From the first disclosure of the linked dataset in February 2020 to October 31, 2021, the dataset was provided in a total of 35 cases and has been used to analyze various research topics.

STRENGTHS AND WEAKNESSES
Linking the Cause of Death Statistics to the KNHANES makes it possible to use the KNHANES, a cross-sectional survey, as a prospective follow-up survey. The KNHANES is the most in-depth survey of national health in Korea and contains roughly 500 items related to socioeconomic status, health behaviors (smoking, alcohol use, physical activity, etc.), nutrition, and chronic disease status (obesity, hypertension, diabetes, pulmonary diseases, ocular diseases, etc.); thus, the linked dataset described herein can be used to analyze risk factors for various chronic diseases and deaths. The quality of KNHANES data is assured through data collection by well-trained full-time field staff and through internal and external (relevant academic societies) quality control of survey procedures. To compensate for missing death reports, the Cause of Death data additionally incorporate a supplementary survey on causes of death (a direct survey of medical institutions), infant cremation report data, and information about deceased persons without family or friends, which secures the inclusiveness of the collected data. The reported causes of death are also confirmed against administrative data and periodically checked for logical errors and consistency, which enhances data validity. Because it is built on these two datasets, the linked dataset can be considered consistent and accurate in terms of both quantity and quality. Additional strengths of the linked dataset are that the representativeness of the KNHANES sample is maintained owing to the high linkage rate between the two datasets (97.1%), and that data linkage is highly accurate because it is based on resident registration numbers. Lastly, both datasets are collected annually and disclosed to the public, enabling timely updates and provision of the updated linked dataset according to the update cycle.
The limitations of the linked dataset are as follows. First, while the main categories of causes of death can be analyzed, analyses of intermediate categories and subcategories are limited because the follow-up period of the linked dataset is not yet long enough. More in-depth analyses on various topics are expected to become possible as the follow-up period lengthens. Second, the KNHANES excludes those who are unable to move and those who reside in institutionalized settings such as hospitals, nursing homes, and care centers; therefore, the total number of deaths or death rates by cause of death might be either overestimated or underestimated. Third, survey data on socioeconomic status, health-related behaviors, chronic disease status, etc. were collected only at the time of the KNHANES and may have changed during the follow-up period. These limitations should be considered when interpreting results. Lastly, the linked dataset is provided to researchers only through the Research Data Center of the KDCA, which limits accessibility. To improve accessibility, a remote analysis system will be established and operated in the future.

DATA ACCESSIBILITY
The linked dataset is made available after review of the relevant research proposal. The procedure for providing the linked dataset is briefly as follows. A researcher first submits a research proposal to the KDCA for review and, once it has been reviewed, applies through the KOSTAT Microdata Integrated Service system for linkage to the Cause of Death data. The KDCA then sends the required KNHANES raw data to KOSTAT, and KOSTAT links them to the corresponding Cause of Death data. Thereafter, the KDCA uploads the linked dataset received from KOSTAT to the Research Data Center.
The researcher then visits the Research Data Center in the KDCA, performs the analysis, and submits a request to transfer the results out of the center. After reviewing the analysis results, the KDCA sends them to the researcher by e-mail. A detailed procedure is described in the 'Guidebook for Korea National Health and Nutrition Examination Survey Linked Cause of Death data' on the KNHANES homepage (https://knhanes.kdca.go.kr) [10].

CONCLUSION
To address the limitations of the KNHANES as a cross-sectional study, the KDCA has linked KNHANES data to the Cause of Death Statistics of KOSTAT and disclosed the linked dataset. This linked dataset will be used as a basis for health policies and has been made available for various areas of research, such as studies on risk factors that influence morbidity and mortality. The currently available linked dataset can be used to analyze some causes of death, such as neoplasms, diseases of the circulatory system, and diseases of the respiratory system, but its use is limited because the follow-up period is not yet sufficiently long. However, the Cause of Death data and the KNHANES will be updated every 1 year and every 3 years, respectively, which will enable more diverse and in-depth studies with longer follow-up periods.