Phenome-wide analysis of copy number variants in 470,727 UK Biobank genomes
Summary
This Nature study presents a large-scale whole-genome sequencing (WGS) phenome-wide association study (PheWAS) of copy number variants (CNVs) in 470,727 UK Biobank participants. Using DRAGEN-based CNV calls (>10 kb) with stringent QC and orthogonal validation, the authors link deletions and duplications to 13,336 binary traits, 1,911 quantitative traits and 2,941 plasma proteins (Olink) to map CNV-driven protein quantitative trait loci (pQTLs), phenotype associations and gene-level effects. They also introduce CNV + PTV collapsing models to integrate CNV deletions with protein-truncating variants and perform pan-ancestry meta-analyses to find ancestry-specific signals.
Author take
This is one of the largest WGS CNV PheWAS datasets to date: high-resolution CNV mapping, proteomic links and multi-model analyses that point to druggable dosage effects. If you care about gene dosage, biomarkers or target validation, this paper is a must-read.
Key Points
- Dataset: 470,727 high-quality UKB WGS genomes; 102,717 unique deletions and 80,147 unique duplications after QC (size >10 kb).
- Multimodal phenotypes: associations tested against 13,336 binary traits, 1,911 quantitative traits and 2,941 plasma proteins, the latter in the ~49,736 samples with proteomics data.
- pQTLs: identified many rare and common CNV cis- and trans-pQTLs with expected dosage effects (loss→lower protein, gain→higher protein).
- Variant-level results: 189 significant CNV–binary associations and 892 significant CNV–quantitative associations (P < 1×10⁻⁸); rare CNVs typically had larger effects.
- Gene-level collapsing: aggregating CNVs overlapping the same gene(s) increased power and revealed associations missed by per-CNV tests (e.g. MSH2–colorectal cancer; APOB duplications and LDL phospholipids).
- CNV + PTV model: integrating deletions with protein-truncating variants found many extra signals and helped prioritise causal genes (e.g. HBB vs HBD for thalassaemia).
- Drug-target and biomarker leads: PDZK1 duplication linked to raised urate (gout), PMP22 duplication recapitulated CMT1A, HNF1B duplication associated with kidney biomarkers; SLC2A9 enhancer deletion linked to reduced gout risk.
- Pan-ancestry analyses: revealed ancestry-specific CNV associations (for example, deletions associated with sickle-cell and α-thalassaemia in non-European groups).
Content summary
The team called CNVs with DRAGEN v3.7.8, applied stringent sample- and variant-level QC (exclusion of low-quality regions, a recalibrated QUAL threshold of 35, merging of highly overlapping calls), and validated the call set via trio Mendelian checks, Hardy–Weinberg equilibrium and concordance with curated CNV resources. Most CNVs were very rare (≈99.8% had a frequency below 1%), and duplications tended to be larger than deletions.
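As a rough illustration of the variant-level QC described above, the following Python sketch filters calls on size and QUAL and flags pairs of calls that overlap heavily enough to be merged. Only the 10 kb size floor and the QUAL value of 35 come from the summary; the `CnvCall` type, the inclusive QUAL comparison and the 50% reciprocal-overlap cut-off are assumptions made for the example, not the authors' published settings.

```python
from dataclasses import dataclass

MIN_SIZE_BP = 10_000           # from the summary: CNVs larger than 10 kb
MIN_QUAL = 35                  # from the summary; treating it as inclusive is an assumption
MIN_RECIPROCAL_OVERLAP = 0.5   # assumption: the summary gives no merge threshold


@dataclass(frozen=True)
class CnvCall:
    """Hypothetical container for one CNV call (not a DRAGEN data structure)."""
    chrom: str
    start: int   # 0-based, inclusive
    end: int     # exclusive
    kind: str    # "DEL" or "DUP"
    qual: float

    @property
    def size(self) -> int:
        return self.end - self.start


def passes_qc(call: CnvCall) -> bool:
    """Keep calls that clear both the size floor and the QUAL threshold."""
    return call.size > MIN_SIZE_BP and call.qual >= MIN_QUAL


def reciprocal_overlap(a: CnvCall, b: CnvCall) -> float:
    """Shared length as a fraction of the longer call (0.0 if the calls are incompatible)."""
    if a.chrom != b.chrom or a.kind != b.kind:
        return 0.0
    shared = min(a.end, b.end) - max(a.start, b.start)
    if shared <= 0:
        return 0.0
    return min(shared / a.size, shared / b.size)


def same_event(a: CnvCall, b: CnvCall) -> bool:
    """Treat two calls as one CNV when they overlap reciprocally above the cut-off."""
    return reciprocal_overlap(a, b) >= MIN_RECIPROCAL_OVERLAP
```

Reciprocal overlap is used here because it stops a small call nested inside a much larger one from being merged with it; other merge criteria, such as breakpoint distance, would be equally plausible.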
They ran three genetic models per CNV (dominant deletion, recessive deletion, dominant duplication) and applied a conservative study-wide significance threshold (P < 1×10⁻⁸). Proteomics pQTL mapping confirmed cis dosage effects and revealed trans effects; CNVs often had larger effect sizes than SNV pQTLs. Variant-level PheWAS identified hotspots (e.g. 16p11.2–16p13.3, 17p12, 21p11.2).
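One plausible way to read the three genetic models is as carrier codings derived from each participant's copy number at an autosomal CNV, where two copies is the reference state. The sketch below assumes exactly that: the copy-number cut-offs and the function names are illustrative guesses rather than the paper's definitions, and only the threshold of 1×10⁻⁸ is taken from the summary.

```python
STUDY_WIDE_P = 1e-8  # from the summary: study-wide significance threshold


def encode_models(copy_number: int) -> dict[str, int]:
    """Code one participant as carrier (1) or non-carrier (0) under each model.

    Assumed definitions for an autosomal locus (reference copy number = 2):
      dominant deletion    : at least one copy lost (copy number below 2)
      recessive deletion   : both copies lost (copy number of 0)
      dominant duplication : at least one copy gained (copy number above 2)
    """
    if copy_number < 0:
        raise ValueError("copy number cannot be negative")
    return {
        "dominant_deletion": int(copy_number < 2),
        "recessive_deletion": int(copy_number == 0),
        "dominant_duplication": int(copy_number > 2),
    }


def is_significant(p_value: float) -> bool:
    """Apply the study-wide threshold with a strict inequality."""
    return p_value < STUDY_WIDE_P
```

Under this coding a homozygous deletion carrier counts in both deletion models, so the recessive test asks whether complete loss has an effect beyond what the dominant test already captures.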
Gene-level collapsing aggregated distinct CNVs by gene to boost carrier counts and detect associations not apparent at the single-CNV level. The CNV + PTV collapsing model further increased power and clarified causality in multi-gene loci. Integrating CNV→protein→phenotype relationships helped nominate proteins as biomarkers or therapeutic targets (TMPRSS5, the MSR1 network and PDZK1, among others).
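The collapsing idea can be sketched in a few lines of Python: carriers of any qualifying deletion overlapping a gene are pooled into one carrier set for that gene, and the CNV + PTV variant of the model adds carriers of protein-truncating variants in the same gene. The function name, the input layout and the sample IDs below are all invented for illustration; the paper's actual qualifying-variant rules are not given in this summary.

```python
from collections import defaultdict
from typing import Iterable, Optional


def collapse_by_gene(
    deletion_carriers: Iterable[tuple[str, str]],
    ptv_carriers: Optional[Iterable[tuple[str, str]]] = None,
) -> dict[str, set[str]]:
    """Pool carriers per gene.

    Each input item is a (sample_id, gene) pair: one pair per deletion that
    overlaps the gene, or per PTV that falls in it. A sample is counted once
    per gene however many qualifying variants it carries.
    """
    carriers: dict[str, set[str]] = defaultdict(set)
    for sample_id, gene in deletion_carriers:
        carriers[gene].add(sample_id)
    for sample_id, gene in ptv_carriers or ():
        carriers[gene].add(sample_id)
    return dict(carriers)


# Made-up example data: two deletion carriers and two PTV carriers for MSH2.
deletions = [("s1", "MSH2"), ("s2", "MSH2"), ("s2", "HBB")]
ptvs = [("s3", "MSH2"), ("s1", "MSH2")]

cnv_only = collapse_by_gene(deletions)
cnv_plus_ptv = collapse_by_gene(deletions, ptvs)
```

In this toy example the MSH2 carrier set grows from two samples to three once PTVs are added, which is the mechanism by which the combined model gains power; the resulting carrier indicator would then be tested against each phenotype in the same way as a single variant.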
Limitations noted by the authors: analyses were restricted to autosomes and to CNVs >10 kb (shorter events remain less well captured), and multigene CNVs introduce complexity and possible passenger effects. The augmented CNV + PTV analyses mitigated some of these issues, but the authors emphasised the need for functional follow-up.
Context and relevance
This study expands PheWAS beyond SNVs and small indels by systematically incorporating CNVs at WGS resolution, linking dosage changes to proteomic and clinical phenotypes. It strengthens the case that dosage-increasing CNVs (duplications) can highlight drug-inhibition targets and that integrating CNVs with PTVs improves gene prioritisation. The dataset and summary statistics (and an online portal) are provided as a community resource for target discovery and mechanistic follow-up.
Why should I read this
Short answer: because it saves you time. If you work on drug targets, biomarkers or complex-trait genetics, this paper gives you a gigantic, validated CNV-to-protein-to-trait map — with clear examples where copy gain points to inhibition strategies (and copy loss to mimicking LoF). It flags new loci, sharpens known ones and provides a ready resource to mine for follow-up experiments or target validation.
