Disease Gene Identification
One interest of my lab is identifying disease-causing genes using genome-wide data. Through collaborations, I am conducting several studies aimed at understanding the genetic etiology of Mendelian and complex diseases using genome sequencing data. Here are a few examples:
Chronic obstructive pulmonary disease (COPD) is characterized by an irreversible airflow limitation in response to inhalation of noxious stimuli, such as cigarette smoke. However, only 15-20% of smokers develop COPD, suggesting a role for genetic predisposition. We performed whole exome sequencing in 62 highly susceptible smokers and 30 exceptionally resistant smokers to identify rare variants that contribute to disease risk or to resistance to cigarette smoke.
Using an integrative approach including in silico (whole exome sequencing and airway transcriptomics) and in vitro (cigarette smoke-induced cytotoxicity in miRNA knockdown cell lines), we identified two candidate genes (TACC2 and MYO1E) that augment cigarette smoke-induced cytotoxicity, and potentially COPD susceptibility.
Identifying candidate COPD genes through genomic and functional approaches.
The neuronal ceroid lipofuscinoses (NCLs) are a group of fatal, typically recessive neurodegenerative lysosomal storage diseases. While clinically similar, they are genetically distinct and result from mutations in at least twelve different genes. We investigated mutations in twelve NCL genes in ~61,000 individuals represented in the Exome Aggregation Consortium (ExAC) whole exome sequencing database.
Variants extracted from ExAC were separated into pathogenic alleles and neutral polymorphisms using a decision flowchart. The analysis identified numerous variants that are annotated as pathogenic in public repositories but have a predicted frequency that is not consistent with patient studies. After filtering out the neutral polymorphic variants, carrier frequencies calculated from ExAC vary across populations and correlate well with incidence estimated from numbers of living NCL patients in the US.
Decision flowchart for identifying pathogenic NCL variants.
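The step from allele frequencies to carrier frequency and expected incidence can be illustrated with a short calculation. The sketch below is only an illustration, not the published analysis: it assumes Hardy-Weinberg equilibrium and a fully recessive disease, and it sums the pathogenic allele frequencies for one gene. The function name and the example frequencies are hypothetical and are not taken from ExAC.

```python
def carrier_and_incidence(pathogenic_allele_freqs):
    """Return (carrier frequency, expected incidence) for one recessive gene.

    Assumes Hardy-Weinberg equilibrium: with combined pathogenic allele
    frequency q, carriers are 2 * (1 - q) * q and affected individuals are q**2.
    """
    q = sum(pathogenic_allele_freqs)
    if not 0.0 <= q <= 1.0:
        raise ValueError("combined allele frequency must lie in [0, 1]")
    carrier_frequency = 2.0 * (1.0 - q) * q
    expected_incidence = q * q
    return carrier_frequency, expected_incidence


# Hypothetical frequencies for three pathogenic alleles of a single gene.
example_freqs = [0.001, 0.0005, 0.0015]
carrier, incidence = carrier_and_incidence(example_freqs)
print(f"carrier frequency: {carrier:.6f}, expected incidence: {incidence:.2e}")
```

With these made-up inputs the combined allele frequency is 0.003, so roughly 1 in 167 people would be carriers and about 9 in a million births would be affected. Comparing such expected incidences with observed patient counts is one way to spot variants whose frequency is too high to be pathogenic.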
Whole exome sequencing identifies genes associated with Tourette’s Disorder in multiplex families
Tourette’s Disorder (TD) is a neurodevelopmental disorder that affects about 0.7% of the population and is one of the most heritable neurodevelopmental disorders. Nevertheless, because of its polygenic nature and genetic heterogeneity, the genetic etiology of TD is not well understood. In this study, we combined segregation information from 13 TD multiplex families with high-throughput sequencing and genotyping to identify genes associated with TD. Using whole-exome sequencing and genotyping array data, we identified both small and large genetic variants in these individuals. We then combined multiple types of evidence to prioritize candidate genes for TD, including variant segregation pattern, variant function prediction, candidate gene expression, protein-protein interaction networks, and candidate genes from previous studies. From the 13 families, 71 strong candidate genes were identified, including both known genes for neurodevelopmental disorders and novel genes, such as HTRA3, CDHR1, and ZDHHC17. The candidate genes are enriched in several gene ontology categories, such as dynein complex and synaptic membrane. The candidate genes and pathways identified in this study provide biological insight into TD etiology and potential targets for future studies.
Protein-protein interaction (PPI) networks. (A) PPI network of the 71 TD top candidate genes. Only genes that can be connected are shown. (B) PPI networks of the 71 TD top candidate genes not in (A). Other NDD_all genes were added as intermediate nodes if they interact with more than one TD top candidate gene. For intermediate nodes, only interactions with top candidate genes were included. (C) PPI network formed by NDD_all genes identified in axoneme (GO:0005930). (D) PPI networks formed by NDD_all genes in synaptic membrane (GO:0097060). To simplify the network, interactions between non-candidate genes were removed. PPI networks were defined by three databases: ConsensusPathDB, STRING, and GIANT_v2. Genes are colored by gene list (see the published manuscript for details).
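The idea of combining several lines of evidence to prioritize TD candidate genes can be sketched in a few lines of code. This is a hypothetical illustration only: the text does not say how the evidence types were weighted or combined, so the sketch simply counts how many evidence types support each gene. The function name, the evidence labels, and the placeholder gene names (GENE_A and so on) are assumptions made for the example and do not represent study results.

```python
from typing import Dict, List, Tuple

# Evidence types named in the study description. Equal weighting is an
# assumption of this sketch, not the published scoring scheme.
EVIDENCE_TYPES = (
    "segregation",
    "function_prediction",
    "expression",
    "ppi_network",
    "previous_studies",
)


def rank_candidates(evidence: Dict[str, Dict[str, bool]]) -> List[Tuple[str, int]]:
    """Rank genes by the number of evidence types that support them."""
    scored = []
    for gene, flags in evidence.items():
        score = sum(1 for name in EVIDENCE_TYPES if flags.get(name, False))
        scored.append((gene, score))
    # Highest score first; ties are broken alphabetically by gene name.
    return sorted(scored, key=lambda item: (-item[1], item[0]))


# Placeholder genes with made-up evidence flags.
example_evidence = {
    "GENE_A": {"segregation": True, "expression": True, "ppi_network": True},
    "GENE_B": {"segregation": True},
    "GENE_C": {"segregation": True, "function_prediction": True},
}
ranked = rank_candidates(example_evidence)
for gene, score in ranked:
    print(gene, score)
```

A simple count keeps the example transparent. A real prioritization would likely weight the evidence types differently and treat segregation within families as a required filter rather than as one vote among several.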