I have come across three reports in the last few days that help me think about the question: How many genomes is enough? My conclusion – we need a lot! Here are some thoughts and objective data that support this conclusion.
(1) Clinical sequencing for rare disease – JAMA reported compelling evidence that exome sequencing identified a molecular diagnosis for patients (Editorial here). One study investigated 2000 consecutive patients who had exome sequencing at one academic medical center over 2 years (here). Another study investigated 814 consecutive pediatric patients over 2.5 years (here). Both groups report that ~25% of patients were “solved” by exome sequencing. All patients had a rare clinical presentation that strongly suggested a genetic etiology.
(2) Inactivating NPC1L1 mutations protect from coronary heart disease – NEJM reported an exome sequencing study in ~22,000 case-control samples to search for coronary heart disease (CHD) genes, with follow-up of a specific inactivating mutation (p.Arg406X in the gene NPC1L1) in ~91,000 case-control samples (here). The data suggest that naturally occurring mutations that disrupt NPC1L1 function are associated with reduced LDL cholesterol levels and reduced risk of CHD. The statistics were not overwhelming despite the large sample size (P=0.008, OR=0.47). However, the same gene had been implicated previously from GWAS, which increases the probability of a true positive association. If there were no prior evidence implicating this gene, then a P value of this size would not stand out from what is expected by chance when every gene in the exome is tested.
(3) Venter estimates 5M genomes by 2020 – Speaking at Singularity University’s Exponential Medicine conference, Craig Venter stated that while we’ve sequenced around 225,000 genomes worldwide to date, we’ll have sequenced something like 20 times that total (or roughly five million complete human genomes) by 2020 (here). Further, Venter estimates that “health hubs” will gather a vast hoard of physiological information about each genome they sequence, and that approximately 1M people will have genetic data linked to clinical data by 2020 via his company, Human Longevity, Inc.
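To make the multiple-testing point in (2) concrete, here is a rough sketch of why P=0.008 is unremarkable without prior evidence. It assumes roughly 20,000 protein-coding genes are tested once each and uses a simple Bonferroni correction at alpha = 0.05; both are my illustrative assumptions, not figures or methods from the NEJM paper.

```python
# Illustrative only: the gene count and the Bonferroni correction are
# assumptions for this sketch, not values taken from the NEJM study.
ASSUMED_GENES_TESTED = 20_000  # assumption: about one test per protein-coding gene
ALPHA = 0.05                   # conventional family-wise error rate


def bonferroni_threshold(alpha, n_tests):
    """Per-test P value threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


observed_p = 0.008  # P value quoted above for NPC1L1

# Threshold if every gene is tested with no prior hypothesis.
exome_wide = bonferroni_threshold(ALPHA, ASSUMED_GENES_TESTED)
# Threshold if only one pre-specified gene is tested (e.g., a prior GWAS hit).
single_gene = bonferroni_threshold(ALPHA, 1)

# Number of genes expected to reach P <= 0.008 purely by chance.
expected_by_chance = ASSUMED_GENES_TESTED * observed_p

passes_exome_wide = observed_p < exome_wide
passes_single_gene = observed_p < single_gene

print(f"Exome-wide threshold: {exome_wide:.1e}")
print(f"Passes exome-wide: {passes_exome_wide}; passes as a single candidate: {passes_single_gene}")
print(f"Genes expected at P <= {observed_p} by chance: {expected_by_chance:.0f}")
```

Under these assumptions the result clears the bar only when NPC1L1 is treated as a pre-specified candidate, which is why the prior GWAS evidence matters so much.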
So let’s put this all together.
If you have a rare disease that is likely to be genetic in etiology, then exome sequencing will lead to a molecular diagnosis ~25% of the time, meaning the majority of cases remain unsolved. It is likely that more cases will be “solved” as more exomes are sequenced (e.g., information can be learned by comparing across families), as more genomes are sequenced (e.g., disease-causing mutations will fall outside of protein-coding genes), and as we learn more about how genetic variants contribute to rare phenotypes (e.g., it is likely that more complicated combinations of genetic variants [including interactions with the environment] contribute to rare traits). Thus…more genomes are needed.
If you have a common disease such as CHD, it is possible to find rare variants that protect from disease – which are very useful in developing novel therapeutics (here). However, these rare variants offer only partial protection (e.g., 50% reduction in risk), which means extremely large sample sizes are required (e.g., likely >100,000 patients plus a larger number of controls). Similar conclusions have been reached for other complex traits such as schizophrenia (Nature editorial here). Thus…more genomes are needed.
If you want to sequence genomes from the general population and you want to gather health information to link with genomic data, then the composition of clinical phenotypes will greatly impact the ability to “predict all that is predictable”. For example, if 100,000 patients (of your 1M with phenotype data) have CHD, then you will begin to approach the necessary statistical power to discover rare, inactivating mutations in genes such as NPC1L1 in a completely unbiased way (i.e., without any prior knowledge). But this requires accurate clinical phenotype data and phenotypes that are prevalent in the general population…or a strategy to enrich for phenotypes of interest, especially those that are rare in the general population. Thus…more genomes are needed.
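For readers who want to see roughly where the "100,000 patients" intuition comes from, here is a back-of-the-envelope power sketch. It uses a normal-approximation two-proportion test, which is my simplification and not the method of any study cited above. The carrier frequency in controls (1 in 1,000), the odds ratio of 0.5, and the exome-wide threshold of 2.5e-6 are hypothetical inputs chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist

# All inputs below are hypothetical and chosen for illustration only.
CONTROL_CARRIER_FREQ = 0.001  # assumption: 1 in 1,000 controls carries an inactivating variant
ODDS_RATIO = 0.5              # assumption: carriers have about half the odds of disease
ALPHA = 2.5e-6                # assumption: exome-wide threshold (0.05 / 20,000 genes)


def carrier_freq_in_cases(control_freq, odds_ratio):
    """Carrier frequency in cases implied by the control frequency and the odds ratio."""
    odds = odds_ratio * control_freq / (1 - control_freq)
    return odds / (1 + odds)


def power_two_proportions(p_cases, p_controls, n_cases, n_controls, alpha):
    """Approximate power of a two-sided two-proportion z-test (normal approximation)."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)
    # Pooled carrier frequency under the null hypothesis of no association.
    p_bar = (p_cases * n_cases + p_controls * n_controls) / (n_cases + n_controls)
    se_null = sqrt(p_bar * (1 - p_bar) * (1 / n_cases + 1 / n_controls))
    se_alt = sqrt(p_cases * (1 - p_cases) / n_cases
                  + p_controls * (1 - p_controls) / n_controls)
    diff = abs(p_cases - p_controls)
    return nd.cdf((diff - z_alpha * se_null) / se_alt)


p_cases = carrier_freq_in_cases(CONTROL_CARRIER_FREQ, ODDS_RATIO)

# A study of 10,000 cases and 10,000 controls.
power_small = power_two_proportions(p_cases, CONTROL_CARRIER_FREQ, 10_000, 10_000, ALPHA)
# 100,000 cases drawn from a cohort of 1M, with the remaining 900,000 as controls.
power_large = power_two_proportions(p_cases, CONTROL_CARRIER_FREQ, 100_000, 900_000, ALPHA)

print(f"Power with 10,000 cases / 10,000 controls: {power_small:.4f}")
print(f"Power with 100,000 cases / 900,000 controls: {power_large:.2f}")
```

The exact numbers depend heavily on the assumed carrier frequency and effect size, so treat the output as an order-of-magnitude guide rather than a study design.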
This is my very long way of saying: we need a lot of genomes with clinical data, folks.
I love the ambitious number of 5M genomes in 5 years…and I bet that it will happen. This may sound like an audacious number, but it really isn’t when you consider that the price of genome sequencing will continue to drop and will approach that of a routine medical test (e.g., complete blood count, x-ray, newborn screening). This suggests that clinical genome sequencing can be repurposed for discovery research. Further, there are countries such as England that are proposing national genome sequencing projects (Genomics England), and there are companies such as Venter’s new venture, Human Longevity, Inc., that are proposing to sequence large numbers of individuals.
My prediction is that we will need – and the world will get – many millions of genomes in the near future. We should not stop at 5M, but we should continue to sequence until each of us has our own personal genome.
We should anticipate this day, as it will come soon. The public should be prepared to make these data available in a way that is appropriately accessible (for example, see Global Alliance for Genomics & Health). Clinicians should be prepared to incorporate this information into clinical practice. Pharmaceutical companies should be prepared to use this information to develop novel therapies (see slide deck here). Digital companies should be prepared to link personal genomes to personal health data.