Mobile DNA elements are discrete DNA sequences that have the remarkable ability to transport or duplicate themselves to other regions of the host genome. Because of the ubiquity of this process, mobile elements account for at least 40-50% of the content of mammalian genomes, including human.

Approximately 45% of the human genome can currently be recognized as being derived from transposable elements.


In the human genome, retrotransposons, especially Alu and L1 elements, have played a significant role in shaping genomic diversity and evolution. These repeat elements tend to promote unequal crossover and are an important contributor to genomic instability. Mobile element insertions can also cause disease, either directly by interrupting a gene or by mediating nonhomologous recombination, resulting in disease-causing insertions and deletions. In addition to their genomic impact, mobile elements are highly useful as genetic markers for tracing relationships among populations and species.

Despite the profound impact of mobile elements on human genomic diversity, their repetitive nature makes them particularly difficult to study at the whole-genome level. We have recently developed a high-throughput, low-cost method (ME-Scan) to genotype mobile DNA element insertions using next-generation sequencing technology. We are currently applying this method to human samples from worldwide populations. Using the data generated, we will be able to answer questions related to several aspects of mobile element biology and assess their impact on human genomic diversity.

ME-Scan library construction procedure. (a) DNA fragmentation; (b) end repair; (c) A-tailing; (d) adaptor ligation; (e) first PCR amplification; (f) bead capture; (g) second PCR amplification; (h) library validation; (i) high-throughput sequencing.




Polymorphic mobile element insertions contribute to gene expression and alternative splicing in human tissues


Mobile elements are a major source of human structural variants, and some mobile elements can regulate gene expression and alternative splicing. However, the impact of polymorphic mobile element insertions (pMEIs) on gene expression and splicing in diverse human tissues has not been thoroughly studied. The multi-tissue gene expression and whole-genome sequencing data generated by the Genotype-Tissue Expression (GTEx) project provide a great opportunity to systematically determine the role of pMEIs in gene expression regulation in human tissues.


Using the GTEx whole-genome sequencing data, we identified 20,545 high-quality pMEIs from 639 individuals. We then identified pMEI-associated expression quantitative trait loci (eQTLs) and splicing quantitative trait loci (sQTLs) in 48 tissues by joint analysis of variants including pMEIs, single-nucleotide polymorphisms, and insertions/deletions. pMEIs were predicted to be the potential causal variant for 3,522 of the 30,147 significant eQTLs and 3,717 of the 21,529 significant sQTLs. The pMEI-associated eQTLs and sQTLs show a high level of tissue specificity, and the pMEIs were enriched in the proximity of affected genes and in regulatory elements. Using reporter assays, we confirmed that several pMEIs associated with eQTLs and sQTLs can alter gene expression levels and isoform proportions.


Overall, our study shows that pMEIs are associated with thousands of gene expression and splicing variations in different tissues, and that pMEIs could have a significant role in regulating tissue-specific gene expression and splicing. The detailed mechanisms by which pMEIs regulate genes in different tissues will be an important direction for future human genomic studies.



Counts of nrMEIs (nrAlu, nrL1, nrSVA) and rMEIs (rAlu, rL1, rSVA) relative to the reference genome (GRCh38) in each individual



Overview of the ME-only eQTL analysis. (a) The number of detected eQTLs with Benjamini-Hochberg FDR < 10% in each tissue. Bars are colored by tissue clusters based on cis-eQTL as shown in (b, tree). (b) Similarity (Spearman’s correlation coefficient ρ) between different tissues based on cis-eQTL FDR values (lower triangle) and gene expression TPM values (upper triangle). Gene-pMEI pairs with FDR < 10% in at least one tissue were selected for the analysis. The tree on the left of the plot is based on hierarchical clustering of the cis-eQTL results, and its branches are colored into five groups. Tissue text colors in (a, b) are based on the hierarchical clustering tree of the TPM results (data not shown). (c) The relationship between the eQTL count (FDR < 10%) and the individual count in different tissues. Tissue text is colored by tissue clusters based on cis-eQTL in (b, tree). The axes are in log scale. (d) Gene-pMEI pair count versus the number of tissues in which the pairs were detected as significant, for coding and noncoding genes. (e) Effect size (beta value) distribution for coding and noncoding eQTLs of different types of pMEIs.
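The Benjamini-Hochberg FDR threshold used in the caption above can be illustrated with a minimal Python sketch. This is the generic textbook procedure, not the study's actual pipeline; the function name `benjamini_hochberg` and the five p-values are hypothetical and chosen only for illustration.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values never increase as the raw p-values decrease (monotonicity).
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical nominal p-values for five gene-pMEI pairs in one tissue.
p = [0.001, 0.04, 0.03, 0.2, 0.5]
q = benjamini_hochberg(p)

# Keep pairs whose adjusted value is below the 10% FDR cutoff.
significant = [qi < 0.10 for qi in q]
```

With these made-up inputs, the first three pairs pass the 10% cutoff and the last two do not. A real analysis would apply the correction to all tested gene-variant pairs within each tissue.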