This blog post pertains to the Systems Immunology graduate course at Harvard Medical School (Immunology 306qc; see here), which is led by Drs. Christophe Benoist, Nick Haining and Nir Hacohen. My lecture is on the role of human genetics as a tool for understanding the human immune system in health and disease. What follows is an informal description of my lecture. The slide deck for the lecture can be downloaded here. Throughout, I have added key references, with links to the manuscripts and other web-based resources embedded within the blog (and also listed at the end). I highlight five key manuscripts (#1, #2, #3, #4, and #5), which should be reviewed prior to the lecture; the other references, while interesting, are optional.
It is increasingly clear that humans serve as the best model organism for understanding human health and disease. One reason for this paradigm shift is the lack of fidelity of most animal models to human disease. For systems immunology, the mouse is a powerful model organism to understand fundamental mechanisms of the immune system. However, studies in humans are required to understand how these mechanisms can be translated into new biomarkers and drugs.
The “cycle” of discovery research should start and end with the patient. Genetics (through GWAS, next-gen sequencing, etc.) and genomics (e.g., expression profiling) performed on samples from patients linked with clinical data can uncover novel mechanisms of disease. Because datasets are large and complex, sophisticated computational modeling is required to derive meaningful biological pathways from genetics and genomics data. However, this is just a first step. These computational models generate hypotheses that must be tested directly in patient samples. If the goal is drug and biomarker discovery, then this will focus the biological experiments for a desired outcome. For example, one biological experiment might be to test whether a genetic mutation is gain-of-function or loss-of-function, so that a high-throughput assay can be designed for a small molecule drug screen. As new discoveries are made – and new technologies are developed – this process is iterated to advance scientific knowledge.
Genetics is a powerful tool for several reasons: (1) Through mutation, genetics serves as nature’s perturbation of most genes in the human genome; (2) genetics links physiological states in humans to specific genes in a manner that differentiates cause from consequence; (3) genetics indicates whether a mutation or pathway is gain- or loss-of-function, thereby providing a direction of effect for therapeutic modulation; (4) genetics provides an allelic series to estimate the range of effect on perturbing a molecule, thereby mimicking a dose-response curve for drug discovery; (5) genetics uncovers previously unsuspected biological pathways that are important in disease in a manner that is unbiased; and (6) there is a wealth of genetic data being generated from GWAS and next-generation sequencing studies (which will only increase going forward).
However, genetics is only the first step towards understanding human biology. A major challenge is to use human genetics in a strategic way to gain insight into biological processes…and this is not straightforward! This lecture should shed light on ways in which human genetics can be used as a tool – not the only tool, but one that is very powerful – to understand the complex human immune system in health and disease.
Genetic architecture of common complex traits and rare Mendelian diseases
For a useful discussion on genetic strategies to identify variants associated with human traits, please see a nice review article from my wonderful colleague, Soumya Raychaudhuri, published in Cell in 2011 (download key paper #1 here). Several of the concepts highlighted below are explained in this review.
It is useful to understand basic features of population genetics. I like to discuss these features in terms of what is often referred to as the “genetic architecture” of human traits – the number, frequency and effect size of alleles that contribute to a biological trait (whether that is a common disease, rare disease, or non-disease quantitative phenotype such as T cell proliferation or CRP levels).
Mendelian diseases segregate faithfully within a family according to Mendel’s laws. For a given family, the underlying genetic architecture is generally a single mutation (i.e., causal allele) in one gene that is rare in the general population and highly penetrant in family members who inherit the mutation (i.e., large effect size). Often, the causal mutation disrupts the protein-coding structure of a gene, thereby pinpointing the causal gene. Different families with the same phenotype may have distinct mutations, but very often these mutations fall within one gene (monogenic) or a small number of genes (oligogenic). Examples of Mendelian diseases include the autosomal recessive disease cystic fibrosis (>1000 mutations in the CFTR gene, although the delta-F508 is the most frequent mutation) and the autosomal dominant disease Marfan’s syndrome (mutations in FBN1).
In contrast, complex diseases do not segregate within families according to Mendel’s rules. In a population of affected individuals, the underlying genetic architecture for a given disease is often highly polygenic (many alleles within many different gene loci, each with a relatively small effect size), with substantial influence by environmental and stochastic factors. Examples include rheumatoid arthritis, type 2 diabetes, and myocardial infarction.
How do we know the genetic architecture of complex traits is highly polygenic? This understanding has evolved from a confluence of important discoveries: a draft sequence of the human genome; a catalog of common DNA polymorphisms (such as HapMap); high-throughput methods to genotype hundreds of thousands of single nucleotide polymorphisms (SNPs); and statistical methods to analyze extremely large datasets (e.g., PLINK). These advances led to the first generation of genome-wide association studies (GWAS), which identified alleles associated with a wide variety of complex traits. To date, GWAS and related methods have identified nearly 3000 loci for approximately 300 complex human traits, as reported in the National Human Genome Research Institute (NHGRI) GWAS catalog. As an example for one common autoimmune disease – rheumatoid arthritis – GWAS has grown the list of genetic risk loci from 1 (the MHC) to over 100 (see here and unpublished).
Several themes have emerged from GWAS that shed light on the genetic architecture of complex traits: hundreds (if not thousands) of alleles contribute to risk of any given complex disease; each allele has a small effect on risk; and most alleles discovered to date are common in the general population (although this is a biased estimate, as only common alleles have been tested by contemporary GWAS).
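To make the "many alleles, each of small effect" picture concrete, here is a minimal Python sketch of an additive polygenic score: the sum, across SNPs, of the number of risk alleles carried multiplied by the per-allele log odds ratio. The SNP names and effect sizes are invented for illustration and do not come from any study, and the function name is hypothetical. The additive model is itself an assumption.

```python
from math import exp

def polygenic_score(log_odds_ratios, dosages):
    """Additive score: sum of (per-allele log odds ratio x allele count).

    log_odds_ratios: {snp_id: per-allele log odds ratio}
    dosages: {snp_id: number of effect alleles carried (0, 1 or 2)}
    SNPs missing from `dosages` are treated as dosage 0.
    """
    return sum(beta * dosages.get(snp, 0) for snp, beta in log_odds_ratios.items())

# Hypothetical effect sizes (illustrative only, not real GWAS results).
log_odds_ratios = {"snp_1": 0.10, "snp_2": -0.05, "snp_3": 0.08}

# Hypothetical genotype for one individual.
dosages = {"snp_1": 2, "snp_2": 1, "snp_3": 0}

score = polygenic_score(log_odds_ratios, dosages)
relative_odds = exp(score)  # odds relative to an individual carrying no effect alleles
print(f"score = {score:.2f}, relative odds = {relative_odds:.2f}")
```

Each individual term is small, which is why hundreds of loci are needed before the score separates individuals meaningfully.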
In contrast to Mendelian diseases, it has been more challenging for GWAS to shed light on causal alleles and causal genes. This is because the best signal from a GWAS generally falls outside of protein-coding sequences, there are often many SNPs highly correlated with the top SNP (known as linkage disequilibrium, or LD), there is no obvious causal allele that can be identified from the SNPs in LD with each other, and there are often many genes in the region (or genetic locus). A few themes have emerged, however: the majority of causal alleles from GWAS likely influence gene expression rather than protein sequence; occasionally one allele is an obvious functional allele (e.g., changes the protein-coding structure of a gene), which helps pinpoint the causal allele and causal gene; by comparing genes across multiple risk loci for a given disease, it is often possible to select the most likely causal gene; and some loci may contain independent variants associated with disease, providing an “allelic series” which helps to identify the causal gene and enables exploration of biology. These concepts are explained in more detail below.
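As a rough illustration of why LD makes it hard to pinpoint the causal allele, the Python sketch below computes the squared correlation (r-squared) between genotype dosages at two SNPs. The genotype vectors are made up, and the function name is hypothetical. Correlating unphased 0/1/2 dosages only approximates the haplotype-based r-squared; in practice one would use a dedicated tool such as PLINK.

```python
def ld_r2(geno_a, geno_b):
    """Squared Pearson correlation between two genotype dosage vectors (0/1/2)."""
    n = len(geno_a)
    if n != len(geno_b) or n < 2:
        raise ValueError("need two equal-length vectors with at least 2 individuals")
    mean_a = sum(geno_a) / n
    mean_b = sum(geno_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(geno_a, geno_b))
    var_a = sum((a - mean_a) ** 2 for a in geno_a)
    var_b = sum((b - mean_b) ** 2 for b in geno_b)
    if var_a == 0 or var_b == 0:
        raise ValueError("r-squared is undefined for a monomorphic SNP")
    return cov * cov / (var_a * var_b)

# Hypothetical dosages for 8 individuals (illustrative only).
index_snp = [0, 1, 2, 1, 0, 2, 1, 0]
proxy_snp = [0, 1, 2, 1, 0, 2, 1, 1]  # differs from the index SNP in one individual

r2 = ld_r2(index_snp, proxy_snp)
print(f"r2 between index SNP and proxy = {r2:.2f}")
```

When r-squared is close to 1, the two SNPs carry nearly the same association signal, so statistics alone cannot tell which one is causal.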
While these themes are insightful for geneticists, they fall short of what most biologists desire. In other words, how do we derive biological insight from GWAS and next-generation sequencing studies?
Deriving biological insight from the genetics of complex traits
Here, I start with the premise that most complex traits are highly polygenic, with a mixture of common alleles of small / modest effect, and a smaller number of low-frequency and rare variants, also of small / modest effect. [Note: Very few rare variants have been identified to date, although this may change over time with next-generation sequencing studies in large populations. For a theoretical assessment of how I arrive at the relative proportion of common, low-frequency and rare alleles, see our modeling paper published in Nature Genetics (download here)].
Below I offer at least five strategies to gain biological insight from the genetics of complex traits. A recent paper published in Nature (2012) on inflammatory bowel disease (IBD) highlights many of these approaches (download key paper #2 here). As you read the IBD paper, see if you can pick out the different approaches that were used (as described below).
Approach #1: Straightforward annotation – After a GWAS, it is possible to annotate whether SNPs in LD with the index SNP (i.e., the SNP most strongly associated in a GWAS) have biological relevance. This is relatively easy for protein-coding variants. For example, it is often assumed that a missense variant in LD with an index SNP is the causal allele. While this might be true for variants in some genes associated with different autoimmune diseases (e.g., PTPN22, IL6R), it is dangerous to assume this is always true without more complete information (e.g., functional data on the missense variant, conditional analysis after fine-mapping in large patient populations). Once a candidate causal allele is identified, a series of biological experiments is required – often in primary cells derived directly from humans that carry the alleles – to understand whether an allele is gain-of-function or loss-of-function and in what cell type the allele exerts its effect. [For a recent manuscript on the IL6R variant published in PLoS Genetics by John Todd’s group, see here.]
Non-coding variants are more difficult to annotate. There are tools such as HaploReg that can help assign biological function to non-coding variants based on sequence motifs. Further support that a non-coding variant influences biological function comes from studies of gene expression. Termed expression quantitative trait loci (or eQTLs), this approach can indicate if a variant influences gene expression in the general population. While this cannot determine which SNP is causal, eQTL data provide strong support that at least one of the SNPs in LD with the index SNP exerts its effect through gene expression. Further, eQTLs can indicate whether an allele increases or decreases gene expression, a first step towards understanding gain- or loss-of-function. Large eQTL databases are being generated in immune cells, including a database, ImmVar, led by Christophe Benoist and colleagues. [For a nice eQTL paper, see a publication by Fairfax et al. published in Nature Genetics here.]
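The core of an eQTL test can be sketched as a simple regression of expression level on allele dosage, where the sign of the slope suggests whether the allele increases or decreases expression. The Python below is a minimal sketch with invented data and a hypothetical function name. Real eQTL analyses also adjust for covariates (e.g., ancestry, batch) and correct for multiple testing, which this sketch omits.

```python
def eqtl_fit(dosages, expression):
    """Ordinary least-squares fit of expression on allele dosage (0/1/2).

    Returns (slope, intercept); the slope is the change in expression
    per additional copy of the allele.
    """
    n = len(dosages)
    if n != len(expression) or n < 2:
        raise ValueError("need two equal-length vectors with at least 2 samples")
    mean_g = sum(dosages) / n
    mean_e = sum(expression) / n
    sxx = sum((g - mean_g) ** 2 for g in dosages)
    if sxx == 0:
        raise ValueError("all samples have the same genotype")
    sxy = sum((g - mean_g) * (e - mean_e) for g, e in zip(dosages, expression))
    slope = sxy / sxx
    intercept = mean_e - slope * mean_g
    return slope, intercept

# Hypothetical data: allele dosage and normalized expression for 6 donors.
dosages = [0, 0, 1, 1, 2, 2]
expression = [5.0, 5.2, 6.1, 5.9, 7.0, 7.2]

slope, intercept = eqtl_fit(dosages, expression)
direction = "increases" if slope > 0 else "decreases"
print(f"each allele copy {direction} expression by {abs(slope):.2f} units")
```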
Approach #2: integrate across associated loci – Another approach to gain biological insight is focused on analysis of associated loci in aggregate. This differs from Approach #1 by focusing not on a single gene / variant, but on all associated variants. External databases can be used to systematically analyze GWAS loci together. For some computational methods, this allows one to pick the most likely causal gene from a region of LD that contains multiple genes. (Note: I use “causal gene” here very loosely. What I mean is that one gene in the region influences a trait of interest through the effect of a nearby genetic variant. I do not mean that alleles in this gene are the sole cause of disease.) For example, text from published PubMed abstracts is one very powerful approach to pick which gene in the region of LD is the most likely causal gene (see link to GRAIL here). Another approach is to examine protein-protein interaction networks (DAPPLE, see here) or gene sets linked to biological pathways (MAGENTA, see here). Once a causal gene is selected, downstream functional experiments can be conducted on that gene, or the pathway implicated by the genes in aggregate.
Computational methods that integrate data across loci also allow one to identify specific cell types or key biological pathways. In addition to the approaches described above, it is possible to use gene expression data (for an example, download AJHG manuscript here) or epigenetic data (for an example, download Nature Genetics manuscript here). Using these approaches, CD4+ memory T cells have been implicated as a key pathogenic cell type in RA. This now enables follow-up functional studies in the relevant primary human cells, human stem cells or transformed human cell lines to understand the impact of variants in aggregate on biological processes. As one example, we are performing systematic RNAi of RA genes implicated by GWAS in primary CD4+ memory T cells, in order to understand the impact of these genes on different aspects of T cell biology (proliferation, plasticity, etc.).
Approach #3: regional pleiotropy – The term pleiotropy refers to one allele, gene, or locus that has an effect on multiple traits (see Wikipedia entry here). This can be a powerful tool to gain insight into complex traits. For example, there are now many examples of genes that, when completely knocked-out in humans (homozygous nulls), cause human primary immunodeficiency, or PID (for review, see key paper #3 here). For IBD and many other common autoimmune diseases, there appears to be overlap between genes that cause PID and genes that influence risk of autoimmunity. Thus, by using genetics to link disease physiology in one set of diseases (e.g., IBD) to another set of diseases that have been studied in detail (e.g., PID), it is possible to understand basic mechanisms.
Approach #4: allelic pleiotropy – This is one of my favorite observations in human genetics: the same allele can influence different phenotypes. For autoimmunity, it is now well-accepted that the same alleles may influence risk of distinct diseases (see here for a review I wrote in NEJM back in 2008 on the overlap between celiac disease and type 1 diabetes). What is more interesting, however, is when an allele influences two apparently distinct phenotypes. One of my favorite examples is an allele in IL6R, which influences risk of both cardiovascular disease (CVD) and rheumatoid arthritis, but protects from asthma (see here and here). What is even more biologically interesting is that the same allele influences inflammation (as measured by C-reactive protein, CRP) in healthy controls. This provides strong evidence that inflammation is indeed a causal factor in risk of CVD. Further, as the molecular mechanism by which the IL6R allele exerts its effect is understood, this provides a more detailed map of the biology that links a gene (IL6R), pathway (inflammation) and disease (CVD, RA, asthma).
The concept of Mendelian Randomization is important (as discussed here). It is a method of using measured variation in genes of known function (e.g., IL6R) to examine the causal effect of a modifiable exposure on disease in non-experimental studies. It will be increasingly used in human genetics.
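One common way to turn this idea into a number is the Wald ratio: the effect of the variant on the outcome divided by its effect on the exposure. The Python sketch below assumes a single genetic variant that affects disease only through the exposure, and uses a first-order approximation for the standard error that ignores uncertainty in the exposure estimate. All numeric values are invented for illustration, and the function name is hypothetical.

```python
def wald_ratio(beta_exposure, beta_outcome, se_outcome):
    """Wald ratio estimate of the causal effect of an exposure on an outcome.

    beta_exposure: per-allele effect of the variant on the exposure (e.g., a biomarker)
    beta_outcome:  per-allele effect of the variant on the outcome (e.g., log odds of disease)
    se_outcome:    standard error of beta_outcome
    Returns (estimate, approximate standard error).
    """
    if beta_exposure == 0:
        raise ValueError("variant has no effect on the exposure; ratio is undefined")
    estimate = beta_outcome / beta_exposure
    # First-order approximation: treats beta_exposure as known without error.
    std_error = se_outcome / abs(beta_exposure)
    return estimate, std_error

# Hypothetical per-allele effects (illustrative only, not real study results):
# the allele lowers a biomarker by 0.10 units and lowers log odds of disease by 0.05.
estimate, std_error = wald_ratio(beta_exposure=-0.10, beta_outcome=-0.05, se_outcome=0.01)
print(f"causal estimate = {estimate:.2f} log odds per unit of exposure (SE ~ {std_error:.2f})")
```

The estimate is only as good as the assumption that the variant acts on disease solely through the measured exposure; pleiotropy of the kind described above can violate it.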
Approach #5: allelic series – Finally, a series of alleles in the same gene can provide important biological insight. First, it provides strong genetic evidence that the gene is causal, rather than just being a gene in a region of LD (see the example of IFIH1 and type 1 diabetes here). And second, an allelic series provides distinct genetic perturbations that allow one to derive genotype-function dose-response curves. If some alleles are gain-of-function (GOF) and other alleles are loss-of-function (LOF), and these alleles are associated with a spectrum of human traits of interest, then this allows a direct connection between function and phenotype in the model organism that matters the most…humans. This principle has been used to predict the effect of target perturbation for drug discovery, as is the case for alleles in the gene PCSK9 and risk of CVD (see here). I think this approach will be used more and more often in drug discovery (see Future Directions, below).
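The dose-response idea can be sketched in a few lines of Python: for each allele in the series, pair its measured effect on gene function with its effect on the trait, fit a line, and read off the predicted trait effect for a drug that perturbs the target by a given amount. The allele labels and numbers below are invented for illustration, the function names are hypothetical, and a linear relationship is an assumption that real data may not support.

```python
def fit_dose_response(function_effects, trait_effects):
    """Least-squares line relating change in gene function to change in trait.

    Returns (slope, intercept).
    """
    n = len(function_effects)
    if n != len(trait_effects) or n < 2:
        raise ValueError("need at least two alleles with paired measurements")
    mean_x = sum(function_effects) / n
    mean_y = sum(trait_effects) / n
    sxx = sum((x - mean_x) ** 2 for x in function_effects)
    if sxx == 0:
        raise ValueError("alleles must differ in their functional effect")
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(function_effects, trait_effects))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical allelic series (illustrative only): fractional change in gene
# activity, and the associated change in log odds of disease.
alleles = {
    "strong_LOF": (-0.8, -0.40),
    "mild_LOF": (-0.3, -0.15),
    "neutral": (0.0, 0.00),
    "GOF": (0.4, 0.20),
}
xs = [effect[0] for effect in alleles.values()]
ys = [effect[1] for effect in alleles.values()]

slope, intercept = fit_dose_response(xs, ys)

# Predicted trait effect of a hypothetical drug that inhibits the target by 50%.
predicted = slope * (-0.5) + intercept
print(f"predicted change in log odds at 50% inhibition: {predicted:.2f}")
```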
Deriving biological insight from the genetics of Mendelian traits
Here, I use human primary immunodeficiency (PID; for review, see key paper #3 here) as an example of how human genetics can be used to derive biological insight from rare, Mendelian diseases. As a specific example, I highlight mutations in the X-linked gene MAGT1 that cause PID (see key paper #4 here). If you are interested and especially motivated, more information about inferring causality and functional significance of human coding DNA variants can be found in a review published in Human Molecular Genetics in 2012 (see key paper #5 here). And if you are really, really motivated, read about rare mutations associated with autoinflammatory diseases (see here).
Returning briefly to the genetic architecture of Mendelian diseases…recall that monogenic diseases segregate faithfully within a family according to Mendel’s laws. For a given family, the underlying genetic architecture is generally a single mutation in one gene that is rare in the general population and highly penetrant in family members who inherit the mutation (i.e., large effect size). Often, the causal mutation disrupts the protein-coding structure of a gene, thereby pinpointing the causal gene.
In the past (1980’s until mid-2000’s), linkage analysis was performed in families, followed by lots of work (biology and sequencing) to find the causal mutation and causal gene. Now, thanks to advances in genome sequencing, it is possible to sequence the entire genome at a relatively affordable cost (thousands of dollars). While it still requires a lot of work to find the causal mutation and then to perform functional studies to understand biology, next-generation sequencing has truly revolutionized the discovery process.
Today, sequencing is no longer the bottleneck – interpretation of the sequence is rate limiting. Interpretation is facilitated by large databases of genome sequences in the human population (see here), as discussed in key paper #5.
As a general rule of thumb, a gene must have distinct mutations in unrelated families that share a rare phenotype such as PID. The MAGT1 story, published in Nature in 2011, illustrates this concept (see key paper #4 here). Two mutations in MAGT1 were found in two different families. Moreover, functional follow-up in cells derived from patients with these mutations revealed new biological insight into T cell biology.
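The rule of thumb above can be expressed as a simple filter: keep only genes in which at least two unrelated families carry distinct candidate mutations. The Python below is a minimal sketch. The family, gene and variant identifiers are placeholders rather than real data, the function name is hypothetical, and it assumes each family's list has already been restricted to rare, protein-altering variants that segregate with disease.

```python
from collections import defaultdict

def recurrent_genes(family_variants, min_families=2):
    """Genes with distinct candidate mutations in at least `min_families` families.

    family_variants: {family_id: [(gene, variant_id), ...]}
    """
    families_by_gene = defaultdict(set)
    variants_by_gene = defaultdict(set)
    for family, variants in family_variants.items():
        for gene, variant in variants:
            families_by_gene[gene].add(family)
            variants_by_gene[gene].add(variant)
    return sorted(
        gene
        for gene in families_by_gene
        if len(families_by_gene[gene]) >= min_families
        and len(variants_by_gene[gene]) >= 2  # require distinct mutations
    )

# Placeholder data (illustrative only).
family_variants = {
    "family_1": [("GENE_A", "variant_A1"), ("GENE_B", "variant_B1")],
    "family_2": [("GENE_A", "variant_A2"), ("GENE_C", "variant_C1")],
    "family_3": [("GENE_B", "variant_B1")],
}

candidates = recurrent_genes(family_variants)
print(candidates)
```

In this toy input, GENE_B is excluded even though it appears in two families, because both carry the same variant, which could reflect shared ancestry rather than independent evidence.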
Genetic discoveries in rare diseases can often be extrapolated to common traits. As discussed in the section on complex traits, some genes that harbor rare mutations that lead to PID (when inherited in a homozygous state) also harbor common variants that lead to autoimmunity. This further underscores the concept of gene-based pleiotropy. One of the more interesting examples is the development of drugs to treat common traits based on genetic discoveries from rare, monogenic diseases. One of the best examples is mutations in JAK3, which cause PID (see review that I wrote with John O’Shea in Immunity on the JAK-STAT pathway). Based largely on this observation, it was hypothesized that drugs that inhibit JAK3 would be an effective immunosuppressive therapy for diseases such as RA. Indeed, a small molecule inhibitor of JAK3, tofacitinib (which is actually a non-selective inhibitor of JAK1-3), is approved to treat RA (see here).
One of the more promising aspects of human genetics is the ability to discover, in an unbiased way, novel drug targets.
Generating a “therapeutic hypothesis” – i.e., a prediction that perturbing a target in a given manner would benefit patients in the human population with minimal toxicity – is a critical component of target validation. As discussed above, and reiterated here, human genetics has features that make it a valuable tool for generating novel therapeutic hypotheses: (1) through mutation, it serves as nature’s perturbation of many drug targets in the human genome; (2) it links physiological state in humans to a target perturbation in a manner that differentiates cause from consequence; (3) it indicates whether a target perturbation is gain- or loss-of-function, thereby providing a direction of effect for therapeutic modulation; (4) it provides an allelic series for range of effect on perturbing a potential drug target, thereby mimicking a dose-response curve; and (5) it uncovers biological pathways that are important in disease.
A major challenge, however, is to develop a systematic strategy to use human genetics for target validation. Here, I propose two complementary strategies. I focus on inflammatory traits not just because it is the topic of this course, but for more objective reasons: (1) there is a wealth of GWAS data from a wide variety of inflammatory traits (e.g., psoriasis, RA, T1D, IBD); (2) inflammation is a biological mechanism that spans a broad range of diseases beyond autoimmune diseases (e.g., CVD); (3) the relevant human tissue is available from a simple blood draw, which enables functional studies to assess gain-of-function (GOF) or loss-of-function (LOF); (4) primary human immunodeficiencies can uncover rare alleles that provide an extreme example of target perturbation; and (5) decades of basic science research have established a detailed view of immunological pathways, even though the pathways have not yet been definitively linked to human disease physiology.
The two complementary strategies, which are based on the concepts discussed above, are:
(1) single genes – single drug targets: Human genetics can identify genes with alleles that are associated with a clinical trait of interest. As discussed above, this allows dose response curves to be estimated at the time of target validation. One of the best examples is LDL cholesterol. In unpublished data, we have an example of how this approach works for RA.
(2) multiple genes – biological pathways: Human genetics can identify biological pathways that are altered in human disease. Genes implicated by GWAS, when analyzed with other genomic datasets, point to discrete biological pathways and critical cell types (as discussed above). Pathways identified through human genetics overcome limitations of other pathway-based approaches such as genome-wide expression profiling in patients with disease, which cannot easily distinguish between cause and consequence. However, a challenge of the pathway-based approach is that it is not immediately obvious from genetic data alone whether a pathway is up-regulated or down-regulated; functional follow-up studies in peripheral blood or affected tissue are required. In RA, we have used this approach to identify important biological pathways (e.g., up-regulation of CD40 signaling in B cells). As proof-of-concept, we have conducted a pilot cell-based phenotypic drug screen (in press PLoS Genetics).