Epidemiologic Questionnaire (EPI-Q) – a scalable, app-based health survey linked to electronic health record and genotype data

The Epidemiologic Questionnaire (EPI-Q) was established to collect broad, uniform, self-reported health data to supplement electronic health record (EHR) and genotype information from participants in the University of Michigan (UM) Precision Health cohorts. Recruitment of EPI-Q participants, who were already enrolled in 1 of 3 ongoing UM Precision Health cohorts—the Michigan Genomics Initiative, Mental Health Biobank, and Metabolism, Endocrinology, and Diabetes cohorts—began in March 2020. Of 54,043 retrospective invitations, 5,577 individuals enrolled, representing a 10.3% response rate. Of these, 3,502 (63.7%) were female, and the average age was 56.1 years (standard deviation, 15.4). The baseline survey comprises 11 modules on topics including personal and family health history, lifestyle, and cancer screening and history. Additionally, 11 optional modules cover topics including financial toxicity, occupational exposure, and life meaning. The questions are based on standardized and validated instruments used in other cohorts, and we share resources to expedite development of similar surveys. Data are collected via the MyDataHelps platform, which enables current and future participants to share non-Michigan Medicine EHR data. Recruitment is ongoing. Cohort data are available to those with institutional review board approval; for details, contact the Data Office for Clinical and Translational Research (DataOffice@umich.edu).

studies are linked to other data sources, including cancer registry data, vital statistics, neighborhood-level exposures, and prescription claims data (Supplementary Material 1).
EPI-Q was initially developed for MGI, the largest and longest-standing cohort among the UM Precision Health studies, with the aim of re-engaging individuals (Figure 1). Because MGI recruitment occurs at preoperative or perioperative appointments for surgical and diagnostic procedures requiring anesthesia, the cohort is less healthy overall than the general patient population. This makes it a rich source for studying nearly all disease outcomes [2]. Numerous publications, many related to genetics, have been produced using MGI data [3-5]. Researchers conducting genetic analyses, including those utilizing MGI data, often employ external samples as derivation, validation, or replication cohorts. The UK Biobank [6,7], arguably the best-known and most democratized EHR-linked biobank worldwide, has made its data available to researchers across the globe, making it a popular choice for external samples in genetic analyses. In addition to linking EHR information with genotype data, the UK Biobank has gathered extensive self-reported data across various health-related domains, including behavior and lifestyle. Such data are often incomplete or absent in standard EHR-linked databases (see Supplementary Material 2 for a discussion of the limitations of EHR data for research). Thanks to its collection of survey data, the UK Biobank can define phenotypes and adjust for confounders not present in EHR and genotype data.
Considering the standard set by the UK Biobank, the EPI-Q survey was designed to include domains and instruments found in the UK Biobank questionnaire. This not only helps account for potential confounders, but also ensures that the survey items are similar, enabling UM Precision Health cohorts to participate in consortium-wide analyses of survey items and biobank-based meta-analyses. Furthermore, given the numerous UM Precision Health cohorts, collecting consistent self-reported data across them facilitates multi-cohort analyses. In addition to MGI (originally known as the Anesthesiology Collection Effort) and MY PART, other Precision Health cohorts include the Metabolism, Endocrinology, and Diabetes (MEND), Mental Health Biobank (MHB), and Michigan Predictive Activity and Clinical Trajectories (MIPACT) cohorts. Some of these cohorts were recruited through specific clinics, such as the metabolism, endocrinology, and diabetes clinics for MEND and mental health clinics for MHB (Figure 1), while MIPACT is designed to collect wearable data from MM patients. MY PART is collaborating with partners outside of MM to expand to Detroit (77.9% Black [8]), Flint (56.7% Black [8]), Dearborn (44.9% Arab [9]), and Grand Rapids (15.7% Hispanic [8]). This expansion aims to diversify the cohort and position UM Precision Health as a leading resource for Arab/Middle Eastern/North African health, given the large population residing in Michigan [10]. EPI-Q was developed with the future in mind: recruitment, consent, and survey administration, as well as participant-initiated external EHR sharing protocols, were designed to be safe, secure, scalable, and extendable to cohorts under the UM Precision Health umbrella.
In essence, EPI-Q allows users of UM Precision Health cohort data to better adjust for confounders, compare participants across cohorts, and engage in consortium-wide and external meta-analyses, particularly with the UK Biobank. In addition, data collection on topics including financial toxicity, occupational exposures, and life meaning broadens its appeal to researchers beyond the realms of health and genetics.

STUDY PARTICIPANTS
Participants are invited to take part in the study via email. Consent and surveying are managed electronically through the MyDataHelps platform (CareEvolution, Ann Arbor, MI, USA). Participants have the option to enroll and participate either through the MyDataHelps mobile application or the web browser platform on a computer. Once the consent process is complete, participants may fill out the one-time baseline survey. Upon completion, optional modules become available on the participant's dashboard within the MyDataHelps platform. All survey modules are administered simultaneously (i.e., cross-sectionally).
The NDI, which is operationalized as quartiles, is the average of the proportions of (i) female-headed families with children, (ii) households using public assistance income, (iii) people with income below the poverty level in the last 12 months, and (iv) the population (age 16 years and older) unemployed at the census tract level; it is based on the participant's residential address as reported in the Michigan Medicine patient portal.
While the EPI-Q survey data are collected cross-sectionally, the consent and data collection platform allows us to consider new survey modules on timely topics (indeed, an optional coronavirus disease 2019 [COVID-19] module has already been added) and, potentially, longitudinal follow-up of baseline responses. Additionally, the EHR data and other linkable databases are updated on an ongoing basis.
The primary recruitment mechanism targeted adults receiving anesthesia for surgical or diagnostic procedures (91%), with some participants recruited from mental health (1%) or metabolism, endocrinology, and diabetes (6%) clinics, as well as a wearable data study (2%). Inclusive of the pilot phase (Supplementary Materials 3 and 4), a total of 54,043 invitations were sent to all MGI participants who (1) were alive at the time of EPI-Q recruitment, (2) had an email address registered in the UM Patient Portal, and (3) had a biospecimen on file (not necessarily genotyped yet). Participants were incentivized with an ancestry report upon completion of the baseline survey. Of those invited, 5,577 enrolled, constituting a response rate of 10.3%. Descriptive statistics comparing EPI-Q respondents and non-respondents through June 30, 2022 are summarized in Table 1 (with a comparison of EHR-derived vs. EPI-Q self-report data presented in Supplementary Material 5).
Based on data extracted from EHRs (Supplementary Material 6), the respondents were generally slightly younger, with a mean age of 56.1 compared to 57.5 years (p = 0.03). They were more likely to be female (64 vs. 56%; p < 0.01), White (92 vs. 90%; p < 0.01), and married (57 vs. 54%; p < 0.01). They were also more likely to have consumed alcohol (80 vs. 73%; p < 0.01) and to have never smoked (62 vs. 55%; p < 0.01). Multivariable logistic regression models corroborated the findings that the response likelihood was higher among female participants and married individuals, and lower among non-Hispanic Black participants and both current and former smokers (Supplementary Materials 7 and 8). The respondents were less likely to have qualified through recruitment into MGI (84 vs. 91%; p < 0.01). This is probably because the MGI recruitment process has been ongoing for longer than the other cohorts. In other words, participation rates tended to be higher among individuals who had enrolled in a qualifying cohort (MGI, MEND, MHB, MIPACT) more recently.
Participants in EPI-Q, who are drawn from the UM Precision Health cohorts, predominantly represent the MM catchment area. This area encompasses the central, lower peninsula of Michigan, and its population is more predominantly White than the overall population of Michigan. Figure 2 provides a county-level map showing the origins of EPI-Q participants, with a relative distribution detailed in Supplementary Material 9. Notably, some participants, primarily from neighboring states, were eligible for MGI because they traveled to MM for specialized surgical procedures that required anesthesia.

Ethics statement
This study received ethical and regulatory approval from the Institutional Review Boards (IRBs) of the University of Michigan Medical School (IRBMED), under IRB No. HUM00155782.

MEASUREMENTS
The EPI-Q questionnaire was primarily modeled after the tool used by the UK Biobank, with the aim of enabling UM Precision Health cohorts to participate in meta-analyses. The UK Biobank questionnaire collects extensive data on socio-demographic factors, family and early life history, psychosocial aspects, lifestyle, medical history, and cognitive function. It is administered to participants in person.
We began by reviewing the UK Biobank questionnaire regarding the relevance and applicability of its questions to a United States context. The questions were then categorized into broad domain-based modules such as alcohol use, sexual history, and feelings and mood. Following this initial review, we compared these questions with those used in other United States-based biobanks and UM cohort studies. These included the National Institutes of Health All of Us Research Program [11], the Veterans Administration Million Veteran Program [12], and the UM Genes for Good initiative [13]. These foundational resources were selected due to their potential for genome-wide association study (GWAS) meta-analyses, with a particular focus on the UK Biobank. The resulting structure comprised 11 baseline modules and 11 optional modules, with a COVID-19 module added later (Figure 1). Supplementary Material 12 highlights the overlap between questions in EPI-Q and the UK Biobank. As detailed in Supplementary Material 13, several modules encompass multiple domains as defined by the study team. Experts from various departments at UM, including Anesthesiology, Biostatistics, Environmental Health Science, Epidemiology, Health Behavior and Health Education, Occupational Health, Oncology, and Psychiatry, recommended domain-specific survey instruments and reviewed the resulting portions of the survey. The UM Survey Research Center reviewed the drafts to assess participant burden, to ensure consistency with UM panel surveys and widely used national surveys, and to verify the appropriateness of wording and responses in line with evidence-based survey practices (Figure 3). A comprehensive table listing all questions and
1 Data on patients who are not recorded as deceased.
2 Data according to 2020 American Community Survey 5-year estimates.
The EPI-Q questionnaire consists of 22 modules, divided into baseline and optional surveys. The baseline survey includes 11 modules: alcohol use, cancer history and screening, family, feelings and mood, hearing, home and personal details, personal health, physical activity, sexual orientation and history, smoking, and social and recreational activity. The incentive for baseline survey completion is a UM-generated ancestry report, which uses genotyped biospecimens collected from the participant's enrollment in a UM Precision Health cohort that qualifies for EPI-Q.
The optional surveys also include 11 modules: depression, diet and eating habits, healthcare access and utilization, life meaning, life satisfaction, anxiety and stress, occupational exposures, pain, physical activity, substance use, and vision. These modules were selected to reduce the participant burden for baseline completion and incentive receipt, while still addressing areas of interest to the research team and collaborators. They either elaborate on topics covered in the baseline survey (e.g., alcohol use) or collect new information (e.g., healthcare access and utilization). Following completion of the pilot phase, an additional optional COVID-19 module received IRB approval. This module seeks information about diagnoses, exposure and symptom history, and vaccination history and hesitancy.
In Supplementary Material 14, we highlight 3 survey instruments that are less commonly used or represent emerging areas of research: occupational exposure, financial toxicity, and life meaning.

Completion rates
Completion rates, defined as the proportion of enrollees who completed any part of a given survey to which they had access, were high among the 5,498 participants for the incentivized baseline modules. These rates ranged from 80.2% for the personal health module to 94.4% for the personal and family attributes module. Overall, 79.9% (n = 4,393) of participants completed all baseline modules. Completion rates were lower for the non-incentivized optional modules, although they remained relatively high. These rates ranged from 62.7% for life satisfaction to 75.8% for vision. The optional occupational exposure module had a noticeably lower completion rate of 44.2%. Among those who completed any of the optional modules, 38.0% (n = 1,841) finished all 11 of the original optional modules (excluding the COVID-19 module, which was added later). The completion rates by module are summarized in Supplementary Material 15.

Survey length (time to complete)
Survey lengths for each module, as well as the baseline and optional surveys, were assessed in the original sample of 601 participants, following the distribution of the initial 5,000 invitations. After removing outliers (values more than 1.5 × the interquartile range [IQR] below the first quartile or above the third quartile), the average completion time for the baseline survey was determined to be 20.8 minutes (95% confidence interval [CI], 11.5 to 30.0), with an average of 163.3 questions answered across 10.8 of the 11 modules. The optional survey had an average completion time of 19.1 minutes (95% CI, 5.1 to 33.1). The average time to complete each module ranged from 0.7 minutes for the baseline sexual orientation and history module to 5.2 minutes for the optional occupational exposure module. The average survey lengths for each module are summarized in Supplementary Material 16.
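As an illustration of the outlier rule described above, the following minimal Python sketch drops completion times lying more than 1.5 × IQR beyond the quartiles before averaging. The function name, the quartile interpolation method, and the example times are assumptions made for illustration; they are not study data or the study's actual code.

```python
import statistics


def remove_iqr_outliers(values, k=1.5):
    """Keep values within [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    # Quartile interpolation method is an assumption; the paper does not specify one.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if lower <= v <= upper]


# Hypothetical completion times in minutes (illustrative only).
times = [12.0, 15.5, 18.0, 20.0, 21.5, 24.0, 26.0, 95.0]
kept = remove_iqr_outliers(times)
mean_time = statistics.mean(kept)
```

In this toy example the 95-minute session (for instance, a survey left open in a browser tab) is excluded, so the mean reflects typical completion times.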

Concordance between Epidemiologic Questionnaire and electronic health record for sex and cancer history variables
Several variables are recorded in both the EHR and EPI-Q. In this study, we examined the concordance of sex as reported in these 2 sources: self-reported sex at birth via EPI-Q and archived sex as documented in the EHR. We assessed concordance using the Cohen kappa (κ), a measure of the proportion of cases in agreement while accounting for the number of agreements expected to occur by chance. Despite some discrepancies, we generally noted extremely high consistency, with κ = 0.986 (self-report vs. EHR-recorded; Supplementary Material 17).
In contrast to sex, substantial disagreement was observed regarding cancer history (Supplementary Materials 18 and 19). Among the 1,850 respondents who reported a history of cancer, 96.7% (n = 1,789) had a cancer diagnosis documented in their health records (see qualifying phecodes in Supplementary Material 20). This discrepancy could be due to several factors: (1) certain types of cancer may not have been included in our EHR-based classification, (2) individuals may have inaccurately reported their cancer history, such as misinterpreting non-positive cancer screening results or including precancerous lesions, or (3) individuals may have received a cancer diagnosis from another healthcare provider, and thus, the diagnosis was not recorded in their MM EHR. Among individuals with a history of cancer as documented in their EHR, self-reported cancer history varied based on the type of cancer, ranging from 100% (e.g., myeloid leukemia, chronic) to 17% (neurofibromatosis) (Supplementary Material 20). More nuance is explored in Supplementary Material 18, including the distribution of time since cancer diagnosis (Supplementary Material 21), self-report by time since cancer diagnosis (Supplementary Material 22), and exploratory models for cancer self-report (Supplementary Material 23).
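For readers unfamiliar with the statistic, Cohen's κ is computed as (p_o − p_e) / (1 − p_e), where p_o is the observed proportion of agreement and p_e is the agreement expected by chance from the marginal totals. The sketch below applies this formula to a hypothetical 2 × 2 cross-table; the counts are invented for illustration and are not the EPI-Q data.

```python
def cohen_kappa(table):
    """Cohen's kappa for a square table (rows: source A, columns: source B)."""
    k = len(table)
    total = sum(sum(row) for row in table)
    # Observed agreement: share of cases on the diagonal.
    observed = sum(table[i][i] for i in range(k)) / total
    # Chance agreement: sum over categories of the product of marginal shares.
    row_share = [sum(row) / total for row in table]
    col_share = [sum(table[i][j] for i in range(k)) / total for j in range(k)]
    expected = sum(r * c for r, c in zip(row_share, col_share))
    return (observed - expected) / (1 - expected)


# Hypothetical counts: rows = self-reported sex, columns = EHR-recorded sex.
example_table = [[60, 1],
                 [1, 38]]
kappa = cohen_kappa(example_table)
```

With 98 of 100 hypothetical cases in agreement, κ is slightly lower than the raw agreement of 0.98 because some agreement would be expected by chance alone.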

Differences in occupational exposures, financial toxicity, and life meaning by self-reported history of cancer
The unique data fields of EPI-Q have the potential to generate novel research questions across scientific fields. First, we analyzed responses to questions about occupational exposures based on self-reported cancer history (Supplementary Material 24). We observed only 2 statistically significant differences: individuals without a self-reported history of cancer were more likely to report (1) exposure to a cramped workspace and (2) walking or running as part of their job.
Second, we compared responses to questions regarding financial toxicity by self-reported history of cancer (Supplementary Material 25). Unlike occupational exposure, we observed many statistically significant differences in the responses. However, a self-reported history of cancer did not consistently correlate with greater financial toxicity. Importantly, questions about financial toxicity were posed only to individuals who reported receiving "treatment for a new or ongoing illness or condition in the past 7 days." Therefore, many people with a self-reported cancer history were likely responding to these questions in relation to a noncancer illness or condition. Furthermore, we are only presenting the mean differences between those with and without a self-reported history of cancer, without considering potential confounding factors such as age, employment, or income.
Third, we analyzed responses to questions about life meaning based on self-reported cancer history (Supplementary Material 26). The Comprehensive Measure of Meaning instrument, which we refer to as "life meaning" in this context, is divided into 3 domains: coherence (6 questions), significance (6 questions), and direction (9 questions). Each question was rated on a 7-point Likert scale, with 1 indicating strong disagreement and 7 indicating strong agreement. For each individual, we calculated a simple average of the non-missing responses, both by domain and overall. In each domain, as well as overall, individuals with a self-reported history of cancer reported significantly higher levels of life meaning. It is important to reiterate that we are reporting mean differences in life meaning scores based on self-reported cancer history, without considering potential confounding factors such as age, sex, education, or income. The distribution of average scores, both overall and for each domain, is depicted in Supplementary Material 27.
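The scoring described above can be sketched as follows. This is a hedged illustration, not the study's code: the data layout (a dictionary of per-domain response lists with `None` for skipped items) and the function name are hypothetical, and the overall score is assumed to be the average across all non-missing items rather than the average of the 3 domain means.

```python
from statistics import mean

# Number of questions per domain, as stated in the text.
DOMAINS = {"coherence": 6, "significance": 6, "direction": 9}


def score_life_meaning(responses):
    """Average non-missing 1-7 ratings per domain and overall.

    `responses` maps a domain name to a list of ratings, with None for
    skipped items (hypothetical layout).
    """
    scores = {}
    all_answered = []
    for domain in DOMAINS:
        answered = [r for r in responses.get(domain, []) if r is not None]
        scores[domain] = mean(answered) if answered else None
        all_answered.extend(answered)
    # Assumption: overall score pools every answered item.
    scores["overall"] = mean(all_answered) if all_answered else None
    return scores


# Hypothetical participant who skipped one coherence item and the direction domain.
participant = {
    "coherence": [7, 6, None, 5, 6, 6],
    "significance": [4, 4, 4, 4, 4, 4],
    "direction": [None] * 9,
}
participant_scores = score_life_meaning(participant)
```

Under this pooling assumption, domains with more answered items contribute more to the overall score; averaging the domain means instead would weight the 3 domains equally.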

Life meaning: genome-wide association study
We conducted a proof-of-concept GWAS focused on life meaning. The study of life meaning, along with other psychosocial measures such as life purpose, life satisfaction, and happiness, has been increasing [15,16]. This includes studies that employ genetic analyses [17,18]. Our sample consisted of 2,433 participants, primarily of European ancestry (as inferred from available genotype data), who answered at least 1 question in the life meaning module. We conducted a separate GWAS for each domain of life meaning and for overall life meaning (Supplementary Material 28) using ENCORE, a web-based analysis tool used for GWAS at UM. All GWAS analyses were executed with an inverse normalized outcome using a fast linear mixed model with kinship adjustment (SAIGE [19]). All models were adjusted for age at survey, sex, qualifying study, genotyping batch, and the first 10 principal components of the genotype data. We considered variants with a minor allele frequency greater than 0.1% and a minor allele count exceeding 20.
Notably, these results are solely for demonstrative purposes. We observed no clear association signal that deviated significantly from random findings. Future research in this field should carefully consider the operationalization of the questions in the life meaning module. When used alongside MGI, the EPI-Q study provides a gateway to a wide array of research questions that may not be answerable through other means, particularly questions that involve creative applications of genetic data.
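To make the preprocessing steps concrete, the sketch below shows one common way to inverse-normalize an outcome (a rank-based transform) and to apply the stated variant thresholds. It is an assumption-laden illustration: the Blom-type offset c = 3/8, the tie handling, and the function names are choices made here, and the exact procedure used within ENCORE and SAIGE may differ.

```python
from statistics import NormalDist


def inverse_normal_transform(values, c=3.0 / 8.0):
    """Rank-based inverse normal transform; tied values share their average rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    pos = 0
    while pos < n:
        end = pos
        while end + 1 < n and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg_rank = (pos + end) / 2.0 + 1.0  # 1-based average rank of the tie group
        for idx in order[pos:end + 1]:
            ranks[idx] = avg_rank
        pos = end + 1
    std_normal = NormalDist()
    # Map ranks to (0, 1) and then to standard normal quantiles.
    return [std_normal.inv_cdf((r - c) / (n - 2.0 * c + 1.0)) for r in ranks]


def passes_variant_filter(maf, mac, min_maf=0.001, min_mac=20):
    """Thresholds from the text: MAF > 0.1% and MAC > 20."""
    return maf > min_maf and mac > min_mac


# Hypothetical overall life meaning scores (illustrative only).
scores = [4.2, 6.1, 5.0, 6.1, 3.3]
z = inverse_normal_transform(scores)
```

Note that with 2,433 participants (4,866 alleles), a minor allele count above 20 already implies a minor allele frequency above roughly 0.4%, so the count threshold is the binding one at this sample size.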

STRENGTHS AND WEAKNESSES
This survey has several strengths. First, the survey benefits from construction on an existing, large, and institutionally supported cohort base. Participants in the EPI-Q survey are individuals who are interested and actively involved in research, drawn from a growing and sustainable pool of cohorts. Second, because it is administered electronically via a mobile app, the survey is scalable. It can be administered to tens of thousands of current and future participants in a relatively short period, and thanks to remote electronic consent, it can reach a geographically diverse area. Furthermore, the setup and consent processes are flexible and can accommodate future longitudinal data capture through the existing data collection infrastructure. Third, this survey incorporates established survey instruments that include questions from the UK Biobank, the National Institutes of Health All of Us initiative [11], the Million Veteran Program [12], and the 9-item Patient Health Questionnaire (PHQ-9) [20]. This, coupled with the numerous potential data linkages, positions the EPI-Q resource favorably for meta-analyses and validation studies. Fourth, Michigan is home to one of the largest Middle Eastern and North African populations in the United States. Our initiative is uniquely positioned to establish a large health-related database on the Arab/Middle Eastern/North African community, a group for which data are currently lacking for health research [21]. Fifth, this survey addresses a data gap and is crucial for enhancing our holistic understanding of health and well-being. This gap exists due to the imperfections and inaccuracies in our EHR data, bearing in mind that EHRs are not designed for research purposes. Additionally, the MyDataHelps platform enables users of non-MM healthcare systems to share their primary care EHR record with us, thereby minimizing known issues associated with using academic medical center-based EHR data for research [2].
The EPI-Q study also has several weaknesses. First, the current participant base lacks diversity, being predominantly White. The proportion of White participants exceeds that of both the state and MM. Historically, health research, especially genomic research, has been disproportionately focused on White/European males [22]. Our team, along with Precision Health, is actively working to increase diversity so that both the cohort and the research outcomes can better serve the people of Michigan and beyond. Second, in engaging with a broad group of diverse researchers, decisions had to be made regarding the nature of the survey questions included. We recognize that our broad epidemiologic questionnaire may not delve into preferred instruments in sufficient detail for many well-defined, domain-specific research questions. Third, the various study populations and recruitment mechanisms used to acquire the participant cohorts for EPI-Q present analytical challenges in obtaining internally consistent and externally valid results (Table 1 and Figure 1, Supplementary Material 8). Fourth, our cross-tables of EHR-based and self-reported cancer histories reveal a significant number of individuals who have qualifying cancer diagnoses in their EHR but did not self-report this information. This discrepancy could stem from the qualifying cancer PheWAS codes (phecodes) used (listed in Supplementary Material 20) being too broad, meaning that some qualifying cancer phecodes may not have been recognized as a cancer diagnosis by the patient. We have reported the proportion of individuals who self-report cancer by qualifying cancer phecode (Supplementary Material 20), but 94.5% of individuals with an EHR-defined history of cancer have multiple qualifying cancer phecodes on their EHR, complicating this area for future work. Fifth, while the financial toxicity module is based on the validated FACIT-COST instrument, it was originally validated in cancer patients [23,24] and was modified to consider any recent chronic condition for use in EPI-Q. However, since its development and EPI-Q use, the instrument has been validated in patients with diabetes [25] and in those with chronic conditions [26].

Figure 1. A simplified schematic representation of EPI-Q contents by domain and relationship between Michigan Medicine clinics and related University of Michigan (UM) Precision Health cohorts. EHR, electronic health record; COVID-19, coronavirus disease 2019.

Figure 2. A county-level map representing Epidemiologic Questionnaire (EPI-Q) participants (n = 5,498) as raw counts (A) and per 100,000 of total county population (B) based on their residential address as reported in their Michigan Medicine patient portal. EPI-Q participants who do not have available residential addresses are excluded. (A) Out of 5,498 participants. Participants without county information (n = 66) or who live out of state (n = 366) not shown. (B) County population data from 2020 Census Demographics and Housing Survey.

Figure 3. A diagram summarizing different surveys that were consulted in the development of the Epidemiologic Questionnaire along with their respective domains (in italics). MGI, Michigan Genomics Initiative; MM, Michigan Medicine.

Table 1. Comparison of EHR-based characteristics among those who did and did not respond to the Epidemiologic Questionnaire (EPI-Q)1
2 Calculated via the Welch two-sample t-test for numeric variables and the Pearson chi-square test for categorical variables.
3 Alcohol consumption is captured through Clarity Social History in the electronic health record.

Table 2 presents a comparison of the state of Michigan, MM, and UM Precision Health cohorts regarding size, age, sex, and race/ethnicity. A more detailed comparison between UM Precision Health and EPI-Q participants can be found in Supplementary Material 10. Future participants from UM Precision Health cohorts, both within and beyond MM, will be invited to participate, as shown in Supplementary Material 11.

Table 2. Comparison of Michigan, Michigan Medicine, and UM Precision Health cohorts by size, age, sex, and race/ethnicity
UM, University of Michigan; MGI, Michigan Genomics Initiative; ACE, Anesthesiology Collection Effort; MEND, Metabolism, Endocrinology & Diabetes; MHB, Mental Health Biobank; MIPACT, Michigan Predictive Activity and Clinical Trajectories study; MY PART, Michigan and You - Partnering to Advance Research Together.