Integrated dataset of the Korean Genome and Epidemiology Study cohort with estimated air pollution data

Public concern about the adverse health effects of air pollution has grown rapidly in Korea, and there has been increasing demand for research on ways to minimize the health effects of air pollution. Integrating large epidemiological data and air pollution exposure levels can provide a data infrastructure for studying ambient air pollution and its health effects. The Korean Genome and Epidemiology Study (KoGES), a large population-based study, has been used in many epidemiological studies of chronic diseases. Therefore, KoGES cohort data were linked to air pollution data as a national resource for air pollution studies. Air pollution data were produced using community multiscale air quality modeling with additional adjustment of monitoring data, satellite-derived aerosol optical depth, normalized difference vegetation index, and meteorological data to increase the accuracy and spatial resolution. The modeled air pollution data were linked to the KoGES cohort based on participants’ geocoded residential addresses in grids of 1 km (particulate matter) or 9 km (gaseous air pollutants and meteorological variables). As the integrated data become available to all researchers, this resource is expected to serve as a useful infrastructure for research on the health effects of air pollution.

INTRODUCTION
data because they often do not provide detailed addresses. The Korean Genome and Epidemiology Study (KoGES) and the Korea National Health and Nutrition Examination Survey, collected by the Korea Disease Control and Prevention Agency, are excellent in terms of representativeness, scale, diversity of disease-related variables, and standardization. Building a dataset by linking air pollutant exposure levels to the collected data can provide a data infrastructure for studying ambient air pollution and its health effects.
Accurate estimates of personal exposure to pollutants are important to measure the health risks of air pollution. Data from air pollution monitoring stations based on participants' residential addresses have been widely used because it is difficult to monitor individual exposure levels, especially in a large population group. Although the nationwide air pollution monitoring network has expanded rapidly in Korea, monitoring sites are still limited in areas other than large cities or metropolitan areas. Particulate matter (PM) acts as an indicator of air pollution, and smaller particle sizes can affect human health. Nationwide data on PM ≤ 2.5 microns in diameter (PM2.5) from monitoring stations have been available in Korea since 2015. Therefore, the Community Multiscale Air Quality (CMAQ) model, which includes meteorological and emission modeling, can be a good modeling method because it is useful for estimating air quality levels when data from monitoring stations are not available.
We therefore aimed to estimate nationwide air pollution levels and merge them with the KoGES cohort to establish a data infrastructure for the study of ambient air pollution and its health effects. Herein, we describe the procedure for estimating the KoGES participants' exposure to air pollution, the method used to merge the estimates with the KoGES cohort, and the resulting air pollution levels.

Korean Genome and Epidemiology Study cohort
The KoGES, which includes several large-scale population-based cohorts funded by the National Research Institute of Health, recruited people over 40 years of age to identify genetic and environmental factors of chronic diseases such as diabetes, hypertension, obesity, hyperlipidemia, metabolic syndrome, and cardiovascular disease. Baseline recruitment for the KoGES was conducted between 2001 and 2013 [9]. The KoGES consists of population-based and gene-environment model studies. Among the population-based studies, the KoGES Ansan and Ansung study, the KoGES Cardiovascular Disease Association Study (CAVAS), and the KoGES Health Examinees (HEXA) study were used as epidemiological data to combine with air pollution exposure levels.
For the KoGES Ansan and Ansung study, recruitment was carried out in 2 different regions, representative of industrialized communities (Ansan) and rural areas (Ansung). The baseline survey was conducted between 2001 and 2002, and participants were followed up every 2 years. For the KoGES CAVAS, participants were recruited from multiple rural communities. The study completed the fourth follow-up in 2016 after the baseline survey from 2005 to 2011. A total of 211,569 participants had consented to participate in the cohort study as of May 2022, after excluding those who withdrew from the study. The KoGES cohort data included general characteristics, medical history, smoking and drinking status, diet questionnaire, and clinical measurements, such as vital signs and blood and urine tests. New variables continue to be produced from the collected specimens. A more detailed description of the cohort has been published elsewhere [9], including a comprehensive list of data collection methods in the supplementary file. Genetic information on several platforms [10,11] that can be integrated with various KoGES data is also available. The up-to-date recruitment status of the KoGES population-based studies is presented in Table 1, and the number of participants in the 17 provinces of Korea at baseline is shown in Supplemental Material 1.

Geocoding of the cohort participants
The cohort participants' address information for each follow-up visit was converted into latitude and longitude coordinates using the GeoService-Xr geocoding software (Geoservice, Seoul, Korea). If there was no address in a specific follow-up survey, the address from the previous survey was used. When multiple addresses existed in the same survey, the last address entered was selected. When a detailed address was unavailable or out of date, the address was replaced with that of a nearby public (administrative) institution. In total, 353,646 addresses of the cohort participants from the baseline to follow-up surveys of the 3 population-based studies were geocoded according to the above rules, excluding 13,975 addresses with incorrect information. The geocoded addresses are presented in Figure 1.
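The first two address selection rules above can be illustrated with a minimal sketch. The function name and the input layout (one list of entered addresses per survey wave) are hypothetical and are not taken from the KoGES data system; replacement with a nearby public institution is not modeled.

```python
from typing import Optional


def select_addresses(surveys: list[list[str]]) -> list[Optional[str]]:
    """Pick one address per survey wave (hypothetical helper).

    surveys[i] holds the addresses entered at wave i, in entry order.
    An empty list means that no address was recorded at that wave.
    """
    selected: list[Optional[str]] = []
    previous: Optional[str] = None
    for entered in surveys:
        if entered:
            # Multiple addresses in one survey: keep the last one entered.
            previous = entered[-1]
        # No address in this survey: reuse the address of the previous survey.
        selected.append(previous)
    return selected
```

For example, a participant with one address at baseline, none at the first follow-up, and two at the second follow-up would be assigned the baseline address twice and then the later of the two follow-up entries.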

Geographic estimation of air pollution
Air pollution concentration was calculated in 9-km grid units using the CMAQ model, which uses meteorological information and emission rates. Thereafter, the data were assimilated using measured data from the monitoring station. For PM data, satellite-derived aerosol optical depth (AOD) was applied to increase the spatial resolution to a 1-km grid unit. Additionally, multiple linear regression was applied to PM and ozone (O3).
The CMAQ model includes a meteorological model, an emission model, and a chemical transport model. Three-dimensional meteorological data, such as the hourly wind, temperature, and humidity fields, were generated in each grid using the Weather Research and Forecasting model version 3.6.1. The generated values were used as input data for the emission and chemical transport models and as final meteorological variables.
The Sparse Matrix Operator Kernel Emissions system was used to process the emission data for input to the air quality model. The sources for the emission model were the Clean Air Policy Support System for Korean emissions, the Multi-Resolution Emission Inventory for China, and the Regional Emission Inventory in Asia for emissions from neighboring regions. The chemical transport model integrates the output from the meteorological and emission models to calculate the concentration of air pollutants.
Data assimilation, a method of combining the estimated value with the observed data, was applied by assigning a weight to each observation within the radius of influence. Data were generated in 9-km grid units. However, PM was calculated in 1-km grid units after applying AOD observations from the National Aeronautics and Space Administration Terra and Aqua satellites. Finally, PM and O3 were adjusted for the normalized difference vegetation index and meteorological data using multiple linear regression. More details on the modeling method can be found in a previous study [12]. Air pollution data by small-scale administrative division in Korea (dong [neighborhood] or si-gun-gu [city-county-district] levels) were also calculated by combining the concentrations of the grid cells covering each administrative division, weighted by the area of overlap. Administrative division data can be used to link epidemiological data that only include administrative division information without the geocoded addresses of participants. O3 is generally calculated as the daily maximum 8-hour mean concentration; however, the 24-hour mean O3 concentration was calculated in this study because of the characteristics of the model used.
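The administrative division values can be illustrated as an area-weighted average of the overlapping grid cells. This is a minimal sketch: the function name and input layout are hypothetical, and normalizing by the total overlap area is an assumption about how the weighted sum was scaled.

```python
def area_weighted_mean(cells: list[tuple[float, float]]) -> float:
    """Area-weighted mean concentration for one administrative division.

    cells: (concentration, overlap_area) pairs, one per grid cell that
    intersects the division. overlap_area is the part of the cell that lies
    inside the division, in any consistent unit.
    """
    total_area = sum(area for _, area in cells)
    if total_area <= 0:
        raise ValueError("division does not overlap any grid cell")
    # Assumption: the weighted sum is normalized by the total overlap area.
    return sum(conc * area for conc, area in cells) / total_area
```

For instance, a division covered by 1 km² of a cell at 20 μg/m³ and 3 km² of a cell at 30 μg/m³ would be assigned (20 × 1 + 30 × 3) / 4 = 27.5 μg/m³.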
The estimated air pollution data were compared with the measured data from monitoring stations using the average values over certain periods (daily, weekly, and annual) in each grid (Table 2). The concentration at the modeling grid point where the monitoring station was located was extracted and evaluated using the following statistical indices: correlation coefficient, the square of the correlation coefficient, index of agreement, and root mean square error. Although the degree of consistency between the measurement data and the modeling data was quite high, there were some differences depending on the type of air pollutant and the averaging period, which should be considered when interpreting the data.
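The four evaluation indices can be computed from paired observed and modeled values as in the following sketch. The function name is hypothetical, and the index of agreement is assumed to follow Willmott's original definition, d = 1 − Σ(M − O)² / Σ(|M − Ō| + |O − Ō|)², because the exact variant used is not stated here.

```python
import math


def evaluation_indices(observed: list[float], modeled: list[float]) -> dict[str, float]:
    """Return r, r squared, index of agreement, and RMSE for paired values."""
    n = len(observed)
    if n != len(modeled) or n < 2:
        raise ValueError("need at least 2 paired values of equal length")
    mean_o = sum(observed) / n
    mean_m = sum(modeled) / n
    pairs = list(zip(observed, modeled))

    # Pearson correlation coefficient (undefined if either series is constant).
    cov = sum((o - mean_o) * (m - mean_m) for o, m in pairs)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_m = sum((m - mean_m) ** 2 for m in modeled)
    r = cov / math.sqrt(var_o * var_m)

    # Root mean square error.
    sq_err = sum((m - o) ** 2 for o, m in pairs)
    rmse = math.sqrt(sq_err / n)

    # Index of agreement (assumed: Willmott's original form).
    potential = sum((abs(m - mean_o) + abs(o - mean_o)) ** 2 for o, m in pairs)
    ioa = 1.0 - sq_err / potential

    return {"r": r, "r2": r * r, "ioa": ioa, "rmse": rmse}
```

A constant bias illustrates why several indices are reported together: a model that is always exactly 1 unit too high still has a correlation of 1, whereas the root mean square error and the index of agreement both register the offset.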
The estimated air pollution levels in Korea from 2005 to 2017 are presented in Table 3. The mean concentrations of 6 air pollutants in Korea were calculated and compared with the air quality guideline levels of Korea and the World Health Organization (WHO). The annual mean air quality guideline levels of PM2.5 and PM ≤ 10 microns in diameter (PM10) are 15 μg/m³ and 50 μg/m³, respectively, according to the Enforcement Decree of the Framework Act on Environmental Policy in Korea, and 5 μg/m³ and 15 μg/m³, respectively, according to the WHO [2]. The concentrations of PM have been decreasing over the years, but they are still higher than both the Korean and WHO guideline levels. The levels of gaseous air pollutants were generally lower than recommended by the guidelines, except for nitrogen dioxide (NO2), which was at the WHO guideline level.

Merging the estimated air pollution data to the Korean Genome and Epidemiology Study cohort
The geocoded addresses of the KoGES participants were spatially matched to the air pollution data in 1-km or 9-km grid units using the ArcGIS program (Esri Inc., Redlands, CA, USA). Various exposure periods were calculated, including the day of the cohort survey. The exposure periods covered long-term and short-term exposure to air pollutants, according to previously published studies. Meteorological data were also included in the dataset because meteorological variables such as temperature and humidity are highly correlated with air pollution and have their own effects on health.
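The spatial matching was performed in ArcGIS; the following sketch only illustrates the underlying idea and is not the procedure used. It assumes a regular square grid and coordinates already projected to meters, with a hypothetical grid origin at the lower-left corner.

```python
import math


def grid_cell_index(
    x_m: float,
    y_m: float,
    origin_x_m: float,
    origin_y_m: float,
    cell_size_m: float,
) -> tuple[int, int]:
    """Return the (row, col) of the grid cell containing a projected point.

    Assumptions (not from the source): the grid is regular and square, the
    origin is its lower-left corner, and all coordinates are in meters.
    """
    if cell_size_m <= 0:
        raise ValueError("cell_size_m must be positive")
    col = math.floor((x_m - origin_x_m) / cell_size_m)
    row = math.floor((y_m - origin_y_m) / cell_size_m)
    return row, col
```

Under these assumptions, the same residential coordinate would be looked up once with a 1,000-m cell size for PM and once with a 9,000-m cell size for the gaseous pollutants and meteorological variables.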
The air pollution exposure dataset consisted of 210 variables: 35 exposure periods for each of 6 air pollutants (PM10, PM2.5, NO2, sulfur dioxide [SO2], O3, and carbon monoxide [CO]). The exposure periods were the day of the survey (lag0); each of the 1 to 14 days before the survey (lag1 to lag14); the moving averages over 1 week, 1 month, 3 months, 6 months, 1 year, 2 years, and 3 years before the survey; and the average for each calendar year from 2005 to 2017. Meteorological data such as relative humidity, wind speed, precipitation, cloudiness, insolation, and surface pressure were included for the same exposure periods as the air pollution data. Wind direction and temperature were handled differently: average exposure was not included for wind direction, whereas the highest and lowest temperatures from the day of the survey to 14 days before the survey were included. The variables included in the dataset and their descriptions are presented in Supplemental Material 2. These datasets were established separately for each baseline and follow-up survey.
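The count of 210 variables follows from 15 daily lags, 7 moving averages, and 13 calendar-year averages (35 periods) for each of the 6 pollutants. The sketch below enumerates them and shows how lag values and moving averages could be derived from a daily series. The variable labels are hypothetical (the actual names are listed in Supplemental Material 2), and ending each moving-average window on the day before the survey is an assumption.

```python
from datetime import date, timedelta

POLLUTANTS = ["PM10", "PM2.5", "NO2", "SO2", "O3", "CO"]
LAGS = [f"lag{k}" for k in range(15)]                      # lag0 to lag14
MOVING_AVERAGES = ["1w", "1m", "3m", "6m", "1y", "2y", "3y"]
CALENDAR_YEARS = [f"y{year}" for year in range(2005, 2018)]  # 2005 to 2017
EXPOSURE_PERIODS = LAGS + MOVING_AVERAGES + CALENDAR_YEARS


def variable_names() -> list[str]:
    """Hypothetical labels, one per pollutant and exposure period."""
    return [f"{p}_{e}" for p in POLLUTANTS for e in EXPOSURE_PERIODS]


def lag_value(daily: dict[date, float], survey_day: date, k: int) -> float:
    """Concentration k days before the survey (k = 0 is the survey day)."""
    return daily[survey_day - timedelta(days=k)]


def moving_average(daily: dict[date, float], survey_day: date, n_days: int) -> float:
    """Mean over the n_days days before the survey.

    Assumption: the window ends on the day before the survey.
    """
    window = [daily[survey_day - timedelta(days=k)] for k in range(1, n_days + 1)]
    return sum(window) / n_days
```

The number of days that stands for "1 month" or "1 year" is left to the caller because the source does not specify it.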
If the address of a cohort participant changed during the follow-up survey, the average exposure concentration was calculated based on the assumption that the participant moved at the midpoint between the previous and current surveys. For example, if a participant's address changed in the following survey and the interval from the previous survey was 2 years, the participant was considered to have moved 1 year before the current survey.
participants with unknown addresses. The mean baseline exposure levels of the cohort participants to the 6 air pollutants on the day of the survey are shown in Table 4. Detailed information on the exposure levels in the baseline survey is shown in Supplemental Materials 3 and 4. The exposure levels were higher in the spring and winter, except for O3. In addition, PM levels were higher in KoGES Ansan and Ansung, whereas NO2 levels were lower in KoGES CAVAS, which is a rural-based cohort (Supplemental Materials 5-10). To compare the relative pollutant levels in Korea, the mean exposure levels over the 13 years in each grid, regardless of the survey date, were divided into quintiles. The highest quintile was indicated in red, the lowest in yellow, and "no participant within the grid cell" in white (Figure 2A). The exposure levels of PM were higher in metropolitan and surrounding areas. The exposure level of NO2, which is related to traffic, was high in urban areas. Areas around harbor regions had high SO2 levels, whereas coastal areas had high O3 exposure levels. The average level of PM2.5 in each year was presented to show the spatial distribution and trends of the relative air pollution exposure levels of the KoGES participants throughout the 13 years (Figure 2B).
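The midpoint assumption for participants whose address changed between surveys can be sketched as follows. The function names and the dictionary-based daily series are hypothetical; the sketch simply takes concentrations from the old address before the assumed move date and from the new address from that date onward.

```python
from datetime import date, timedelta


def assumed_move_date(previous_survey: date, current_survey: date) -> date:
    """Midpoint between two survey dates (rounded down to a whole day)."""
    return previous_survey + (current_survey - previous_survey) // 2


def mean_exposure_with_move(
    old_series: dict[date, float],
    new_series: dict[date, float],
    start: date,
    end: date,
    move_date: date,
) -> float:
    """Average daily exposure over [start, end] across a change of address."""
    values = []
    day = start
    while day <= end:
        # Old address before the assumed move date, new address afterwards.
        series = old_series if day < move_date else new_series
        values.append(series[day])
        day += timedelta(days=1)
    return sum(values) / len(values)
```

With surveys on 1 January 2010 and 1 January 2012, for example, the assumed move date is 1 January 2011, which matches the 2-year example in the text.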

Ethics statement
The study protocol was approved by the Institutional Review Board of the Korea Disease Control and Prevention Agency (No. 2019-05-04-2C-A). Written informed consent for participation in the KoGES cohort was obtained from all study participants and was confirmed by the Institutional Review Board.

STRENGTHS AND WEAKNESSES
A high concentration of air pollution over a short period can have an immediate effect on human health. However, long-term exposure to air pollution at relatively low concentrations may have a delayed effect. Thus, research on the long-term effects on human health is important for the general population. Short-term effects of air pollution on human health can be analyzed using summary results from public data. However, research on long-term exposure to air pollution requires epidemiological data that include various health-related information as covariates or outcomes of the study. Thus, an integrated dataset of the KoGES and air pollution has been established. It has the following strengths: First, KoGES participants are continually followed up. In particular, the KoGES Ansan and Ansung study has been followed up for over 20 years at 2-year intervals, which makes it possible to see the trends and cumulative effects of air pollution on health. During the 9 follow-ups, 91.2% of the baseline cohort participants visited again at least once, while 39.2% (3,995 out of 10,030) completed all the follow-up visits. Furthermore, 59.3% and 40.5% of KoGES CAVAS and KoGES HEXA cohort participants, respectively, completed at least 1 visit after the baseline. Second, the KoGES includes large population-based cohorts and covers a wide range of Korea, including urban, rural, and industrialized areas, as presented in Figure 1. However, a systematic sampling procedure was not applied to represent the Korean population. Third, the KoGES cohort data contain an extensive range of health-related and disease-related information, including lifestyle factors, anthropometric and clinical measurements, medical history, and genetic information, which makes it possible to reduce residual confounding from factors other than the risk factors and outcomes of interest.
Fourth, the KoGES contains various clinical variables, including inflammation and oxidative stress markers, which are closely related to air pollution. In addition, Korean chip data, which are suitable for genome research in Korea, are also available, making it possible to conduct gene-environment interaction studies. Fifth, the KoGES data are linked to health outcome data from the National Statistical Office, the National Health Insurance Service, and the National Cancer Center, making it possible to study the effects of air pollution on the incidence and mortality of various diseases. At present, researchers can access only the cause-of-death data from the National Statistical Office. Finally, air pollution and meteorological data estimated by grid and by administrative division unit are also available. They can be merged with other epidemiological data for various research purposes.
However, caution should be exercised when using the integrated database, owing to several limitations. First, PM2.5, which is known to have more adverse effects than PM10, may not have been fully adjusted for by the data assimilation method during 2005-2014 due to the lack of monitoring data. However, the current air pollution estimations use the CMAQ model, which does not draw upon monitoring data, and additional adjustments were applied with satellite-derived AOD, normalized difference vegetation index, and meteorological variables to improve the precision of the estimated concentrations. Second, the degree of consistency between the measurement data and the modeling data was relatively low for the annual exposure of PM2.5, which may be related to the lack of monitoring data compared to other pollutants. This should be considered when interpreting the long-term exposure data. Third, personal exposure levels were estimated based on the participants' residential addresses without considering personal space-time activity patterns. Lastly, the exposure levels of gaseous air pollutants were estimated in 9-km grid units, so all participants within the same 9-km grid unit were assigned the same pollution levels, which may cause area-level confounding.
Even within the same area, the air pollution exposure level can vary according to the proximity of traffic routes. In particular, metropolitan areas are highly populated, and residential areas are close to traffic routes. Therefore, we plan to estimate air pollution levels at a finer spatial resolution, using 100-m grid units, and to establish an integrated dataset with recent data. However, a more accurate method for assessing individual air pollution levels is to use personal monitoring sensors. Air pollution monitoring sensors have been developed to assess personal exposure levels, but they need to become smaller, lighter, more convenient, and more precise before they can be used in a large population.

DATA ACCESSIBILITY
Researchers who wish to use the integrated dataset or the estimated air pollution and meteorological data can access the KoGES epidemiological data online sharing system (http://www.kdca.go.kr/research/KoGES/data sharing, https://nih.go.kr/ko/main/contents.do?menuNo=300566) or the National Biobank of Korea (http://www.nih.go.kr/biobank/assess&sharing). A guidebook for using the integrated dataset is also available in the online sharing system.