Forget Neandertals – this Science study is about genotype-phenotype correlations in electronic health records (EHRs).
A study published last week in Science described a large-scale genetic association study of Neandertal-derived alleles with clinical phenotypes from electronic health records (EHRs). Here, I focus less on the Neandertal aspect of the study – which to me is really just a gimmick and not medically relevant – and more on the ability to use EHR data for unbiased association studies against a large number of clinical traits captured in real-world datasets. I also provide some thoughts on how this same approach could be used for drug discovery.
[Disclaimer: I am a Merck/MSD employee. The opinions I am expressing are my own and do not necessarily represent the position of my employer.]
The study used clinical data from the Electronic Medical Records and Genomics (eMERGE) Network, a consortium that unites EHR systems linked to patient genetic data from nine sites across the United States. The clinical data came primarily from ICD-9 billing codes, an imperfect but decent way to capture clinical phenotypes from EHRs. In total, 28,416 adults of European ancestry from across the eMERGE sites had both genotype data and sufficient EHR data to define clinical phenotypes (n=13,686 in the discovery set; n=14,730 in the replication set).
First, 1495 genotyped common Neandertal SNPs were tested for association with a set of 46 high-prevalence EHR phenotypes. After replication and corrections for a non-Neandertal genetic relationship matrix (GRM), three traits – depression (P = 0.031), mood disorders (P = 0.029), and actinic keratosis (P = 0.036) – remained significant by the study’s statistical criteria.
Second, they performed a phenome-wide association study (PheWAS) of these 1495 Neandertal SNPs with 1152 EHR-derived phenotypes. Four Neandertal SNP–phenotype associations were identified (Table 2): rs3917862 [a SNP associated with hypercoagulable state (P < 10⁻⁹)], rs12049593 [associated with protein-calorie malnutrition], rs11030043 [associated with a phenotype encompassing incontinence, bladder pain, and urinary tract disorders], and rs901033 [associated with a tobacco use disorder].
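To make the PheWAS idea concrete, here is a minimal sketch of testing one variant against many binary phenotypes. It is not the study's pipeline: it swaps in a plain 2x2 chi-square test on carrier status (no covariates, no replication step) and a Bonferroni threshold across phenotypes, and the phenotype names and simulated data are hypothetical.

```python
import math
import random


def chi2_p_value_2x2(a, b, c, d):
    """Pearson chi-square p-value (1 df, no continuity correction) for a 2x2 table.

    a: carriers with phenotype,     b: carriers without phenotype,
    c: non-carriers with phenotype, d: non-carriers without phenotype.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        return 1.0  # degenerate table: nothing to test
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Survival function of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2))
    return math.erfc(math.sqrt(chi2 / 2))


def phewas(carrier, phenotypes, alpha=0.05):
    """Test one variant against every phenotype.

    carrier:    list of bool, one entry per individual
    phenotypes: dict mapping phenotype name -> list of bool case status
    Returns (name, p_value, passes_bonferroni) tuples sorted by p-value.
    """
    threshold = alpha / len(phenotypes)  # Bonferroni across phenotypes
    results = []
    for name, status in phenotypes.items():
        a = sum(1 for g, y in zip(carrier, status) if g and y)
        b = sum(1 for g, y in zip(carrier, status) if g and not y)
        c = sum(1 for g, y in zip(carrier, status) if not g and y)
        d = sum(1 for g, y in zip(carrier, status) if not g and not y)
        p = chi2_p_value_2x2(a, b, c, d)
        results.append((name, p, p < threshold))
    return sorted(results, key=lambda r: r[1])


# Simulated (hypothetical) cohort: one phenotype enriched in carriers, two null.
rng = random.Random(0)
n_people = 2000
carrier = [rng.random() < 0.10 for _ in range(n_people)]
phenotypes = {
    "hypothetical_associated": [rng.random() < (0.30 if g else 0.10) for g in carrier],
    "hypothetical_null_1": [rng.random() < 0.10 for _ in range(n_people)],
    "hypothetical_null_2": [rng.random() < 0.20 for _ in range(n_people)],
}
for name, p, significant in phewas(carrier, phenotypes):
    print(f"{name}: p={p:.3g} significant={significant}")
```

A real analysis would more likely use regression with covariates such as age, sex, and ancestry, and would have to correct across both SNPs and phenotypes; the sketch only shows the shape of the scan.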
Third, they compared the distribution of replicating phenotype associations for a set of 1056 LD-pruned (r² < 0.5) Neandertal SNPs with the associations found in five allele-frequency-matched non-Neandertal SNP sets. As shown in Figure 2, Neandertal SNPs were consistently associated with more neurological and psychiatric phenotypes and fewer digestive phenotypes.
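One way to build such frequency-matched comparison sets is sketched below. The matching rule (pick a random unused SNP whose allele frequency lies within a fixed tolerance), the tolerance value, and all SNP identifiers are assumptions for illustration; the paper's actual matching procedure may differ.

```python
import random


def draw_matched_set(target_freqs, pool, tolerance=0.01, rng=None):
    """Draw one control SNP per target allele frequency.

    target_freqs: allele frequencies of the SNPs of interest
    pool:         dict mapping candidate control SNP id -> allele frequency
    Each pool SNP is used at most once per set. Returns a list of SNP ids,
    with None wherever no unused candidate lies within `tolerance`.
    """
    rng = rng or random.Random()
    used = set()
    matched = []
    for freq in target_freqs:
        candidates = [
            snp for snp, f in pool.items()
            if snp not in used and abs(f - freq) <= tolerance
        ]
        if not candidates:
            matched.append(None)
            continue
        choice = rng.choice(candidates)
        used.add(choice)
        matched.append(choice)
    return matched


# Hypothetical data: three SNPs of interest and a pool of 500 candidate controls.
rng = random.Random(1)
target_freqs = [0.05, 0.12, 0.30]
pool = {f"ctrl_{i}": rng.uniform(0.01, 0.5) for i in range(500)}

# Five independent matched sets, mirroring the five comparison sets in the study.
matched_sets = [draw_matched_set(target_freqs, pool, rng=rng) for _ in range(5)]
for i, snps in enumerate(matched_sets, start=1):
    print(f"matched set {i}: {snps}")
```

Each matched set would then be run through the same association scan, so that the number of replicating associations per phenotype category can be compared between the SNPs of interest and the controls.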
Again, I am not so interested in the Neandertal component of the study, but I am quite interested in how genetic data can be linked to real-world clinical data to derive insight into causal human biology.
For drug discovery, imagine the following: you have a gene whose protein product you think is a good drug target for a specific indication. You search existing genetic databases (e.g., GWAS, OMIM, ExAC) and find a coding variant that is associated with that same indication in a way that is encouraging for drug development (e.g., loss-of-function protects from disease). You then test this same variant in an EHR-derived database of diverse clinical phenotypes, using the PheWAS framework described in the Science Neandertal study. You find that the variant is clearly associated with the disease phenotype of interest (i.e., you replicate the association in the EHR data) AND you find that the same variant is NOT reproducibly associated with phenotypes that may be considered proxies for adverse drug events (e.g., bleeding, cardiovascular disease, infection). Furthermore, you search other databases (e.g., East London Genes & Health) and find that humans who are complete loss-of-function for the gene are relatively healthy. This seems like a pretty good starting point for a drug discovery program.
Conversely, consider the following: again, you have a target of interest for a drug discovery program. You find functional variants in the gene (e.g., pQTLs, eQTLs, loss-of-function), but there are no published genetic data linking these variants to clinical phenotypes. The protein quantitative trait loci (pQTL) data are statistically robust, to the point where you are completely confident that the functional variant influences the target of interest. Using the PheWAS framework described in the Science Neandertal study, you test these functional variants for association with EHR-derived clinical data…and you find nothing! If a variant that clearly perturbs the target has no detectable clinical consequence, this seems like a very bad target for drug discovery.
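The two scenarios can be written down as a simple triage rule over PheWAS results for a single functional variant. This is only a sketch: the function name, the labels it returns, the phenotype names, and the Bonferroni threshold are all hypothetical, and it ignores direction of effect, which matters in practice (a loss-of-function variant that protects from disease is very different from one that increases risk).

```python
def triage_target(p_values, efficacy_phenotype, safety_phenotypes, alpha=0.05):
    """Classify a target from the PheWAS p-values of one functional variant.

    p_values:           dict mapping phenotype name -> association p-value
    efficacy_phenotype: the indication the drug is meant to treat
    safety_phenotypes:  phenotypes treated as proxies for adverse drug events
    """
    threshold = alpha / len(p_values)  # Bonferroni across tested phenotypes
    hits = {name for name, p in p_values.items() if p < threshold}
    safety_hits = hits & set(safety_phenotypes)

    if efficacy_phenotype in hits and not safety_hits:
        return "encouraging"  # first scenario: on-indication signal, no safety flags
    if efficacy_phenotype in hits:
        return "efficacy signal with possible safety liability"
    if not hits:
        return "no clinical association"  # second scenario: functional variant, no phenotype
    return "off-indication associations only"


safety = ["bleeding", "cardiovascular disease", "infection"]

# First scenario: association with the (hypothetical) indication, nothing else.
print(triage_target(
    {"disease_x": 1e-10, "bleeding": 0.4, "cardiovascular disease": 0.6, "infection": 0.7},
    "disease_x", safety,
))

# Second scenario: a robust functional variant with no clinical association at all.
print(triage_target(
    {"disease_x": 0.5, "bleeding": 0.4, "cardiovascular disease": 0.6, "infection": 0.7},
    "disease_x", safety,
))
```

The "no clinical association" outcome is only as informative as the statistical power behind it, which is one reason cohort size matters so much for this kind of analysis.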
As these EHR-derived databases grow in size, and as more individuals have their genomes sequenced and genotyped, these real-world datasets will likely play an important role in the drug discovery process. The current Science Neandertal study included “only” 28K individuals. Imagine the power of the approach once genotype-phenotype data are available on tens of millions of geographically diverse individuals from across the world (e.g., US Precision Medicine Initiative, UK’s Genomics England, Finland’s genome initiative).
That day is not far away.