The 1000 Chinese Pangenome empowers medical and population genetics
Summary
The 1KCP project built 1,116 high-quality diploid genome assemblies from Chinese individuals (55 de novo, 1,061 pangenome-informed) and used them to create the 1KCP pangenome and a catalogue of variants. Using PIGA, a new pangenome-informed assembly workflow that leverages modest-coverage hybrid long- and short-read data, the team produced a pangenome that adds 405.3 Mb of non-reference sequence (277.5 Mb novel relative to prior Chinese and HPRC pangenomes) and resolved 35.4 million small-variant sites, ~110.5k structural-variant (SV) sites, 0.86 million nested variants and extensive tandem-repeat (TR) variation. They also annotated non-reference functional elements, performed pan-variant eQTL mapping (identifying 15,722 eGenes and 2.68 million unique eVariants), demonstrated medically relevant genic and cluster-level variation (including exon-overlapping SVs, TR expansions and high-resolution HLA alleles), and released a pan-variant imputation panel and a public data portal for browsing and imputation.
Key Points
- The 1KCP dataset comprises 1,116 diploid assemblies representing 2,232 haplotypes and captures extensive Chinese genomic diversity.
- PIGA, a pangenome-informed assembly workflow, enables cost-effective joint diploid assembly from modest-coverage hybrid sequencing.
- The 1KCP pangenome contains 405.3 Mb of non-reference sequence; 34.6 Mb of this is common (AF > 0.05) while much is rare or singleton, highlighting the need for large samples to detect rare sequences.
- Variant catalogue: ~35.4M small variants, 110,530 SV sites, 1.03M candidate TR sites and 0.86M nested variants; ~33% of SVs are novel and most novel SVs are rare (AF ≤ 0.01).
- Most SV sites (80.3%) are multiallelic; merging similar alleles reduced this complexity and produced 164,216 merged SV alleles suitable for population genetics.
- Pan-variant eQTL mapping shows that complex variants (SVs, TRs, nested variants) explain a meaningful portion of cis-heritability and account for many lead regulatory variants; TRs are highly enriched among fine-mapped causal candidates.
- Medical findings include 5,239 exon-overlapping SVs across 3,326 genes (1,013 of these SVs affect 623 clinically flagged genes), TR expansions linked to fragile sites and gene dysregulation, and high-resolution four-field HLA typing that reveals much greater allele diversity than previously observed.
- The 1KCP pan-variant imputation panel (26.3M small variants, ~101k SVs, 1.48M nested variants, many TR and HLA alleles) shows strong imputation performance and enables imputation of previously inaccessible variant classes.
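The frequency thresholds quoted in the key points (common at AF > 0.05, rare at AF ≤ 0.01) can be made concrete with a small sketch. It assumes allele frequency is simply allele count divided by the 2,232 haplotypes, treats singletons as their own class, and labels the band in between "low-frequency"; the paper may bin alleles differently, and the function name and example counts are hypothetical.

```python
from collections import Counter

# 1,116 diploid assemblies x 2 haplotypes each (figure from the summary).
TOTAL_HAPLOTYPES = 2232


def frequency_class(allele_count, total=TOTAL_HAPLOTYPES):
    """Assign a frequency class to an allele from its haplotype count.

    Thresholds follow the summary (common: AF > 0.05, rare: AF <= 0.01).
    The singleton and low-frequency bins are assumptions of this sketch.
    """
    if not 0 < allele_count <= total:
        raise ValueError("allele_count must be between 1 and total")
    if allele_count == 1:
        return "singleton"
    af = allele_count / total
    if af <= 0.01:
        return "rare"
    if af <= 0.05:
        return "low-frequency"
    return "common"


# Hypothetical allele counts, for illustration only.
counts = {"sv_a": 1, "sv_b": 20, "sv_c": 100, "sv_d": 500}
classes = {name: frequency_class(count) for name, count in counts.items()}
summary = Counter(classes.values())
```

With 2,232 haplotypes, an allele needs to be seen on at least 112 of them to count as common under this rule, while anything seen on 22 or fewer is rare, which illustrates why a cohort of this size is needed to observe rare sequence more than once.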
Content summary
The study scales pangenomics to >1,000 Chinese genomes by combining high-coverage HiFi assemblies for a subset with a new PIGA workflow to assemble the remainder from modest-coverage hybrid data. Quality metrics indicate most assemblies have contig NG50 > 40 Mb and low base error rates; PIGA assemblies show slightly lower QV and more structural errors in repetitive/unreliable regions but overall good variant concordance for SNVs and SVs. A path-guided pangenome annotation finds non-reference genic and regulatory sequences enriched for TRs. The authors carefully represent multiallelic and nested variation, apply merging to produce population-friendly SV alleles, and deeply characterise TR motif and length diversity.
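For readers unfamiliar with the quality metrics mentioned above, here is a minimal sketch of the standard definitions of contig NG50 and Phred-scaled QV. It is not the authors' pipeline: the function names, the toy contig lengths and the genome size are all hypothetical, and real assessments use dedicated tools rather than hand-rolled code.

```python
import math


def ng50(contig_lengths, genome_size):
    """Return the contig NG50.

    NG50 is the length of the shortest contig in the smallest set of
    longest contigs that together cover at least half of the *genome*
    size (not half of the assembly size, which would give N50).
    Returns 0 if the assembly covers less than half the genome.
    """
    half = genome_size / 2
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= half:
            return length
    return 0


def phred_qv(errors, bases):
    """Phred-scaled consensus quality: QV = -10 * log10(error rate)."""
    if errors == 0:
        return math.inf
    return -10 * math.log10(errors / bases)


# Toy values in arbitrary units, for illustration only.
toy_contigs = [80, 60, 40, 20]
toy_ng50 = ng50(toy_contigs, genome_size=300)

# One error per 100,000 bases corresponds to QV 50.
toy_qv = phred_qv(errors=1, bases=100_000)
```

Because NG50 is normalised to a fixed genome size, it allows assemblies of different total lengths to be compared fairly, which matters when contrasting the de novo and PIGA assemblies.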
The resource was used for medical-focused analyses: identification of exon-disrupting SVs (many rare and under purifying selection), unbiased TR expansion screening (2,427 expansions found, some in genic and fragile-site regions), resolution of complex gene-cluster haplotypes (example: HP–HPR cluster with deletions linked to cholesterol traits), and four-field HLA typing showing far more non-coding diversity than previously observed. Pan-variant eQTL mapping across ~1,101 individuals integrates SNVs, indels, SVs, nested variants and TRs, finding thousands of eGenes and pinpointing complex variants as credible causal regulators; colocalisation with BioBank Japan GWAS highlights examples such as an 18 kb GSTM1 deletion affecting platelet count. Finally, a comprehensive imputation panel and an online portal make the data accessible for future association studies and clinical interpretation.
Context and relevance
This work addresses major gaps left by short-read cohorts and small pangenomes: rare and complex variants (SVs, TRs and nested alleles) are difficult to detect without assembly-based pangenomes and larger cohorts. By scaling to over a thousand assemblies and providing a pan-variant imputation panel, 1KCP substantially improves representation of low-frequency and rare alleles in an East Asian context and enables researchers to test the contribution of complex variants to expression and disease. The resource is directly relevant to people running genetic-association studies, clinical variant interpretation for East Asian patients, and method developers working on multiallelic variant representation, pangenome graphs and imputation of complex loci.
Why should I read this?
Short version: because this is a proper, large-scale assembly-based pangenome for Chinese genomes — it finds loads of rare and complex variants that short-read studies miss, ties some of them to gene expression and disease signals, and gives you an imputation panel and portal so you can actually use the data. If you work in human genetics, population genomics or clinical genomics (especially with East Asian cohorts) this paper saves you time and expands what you can test.
Author note (style)
Punchy: this is a major resource rather than a niche methods note — the dataset, the PIGA approach and the pan-variant analyses together make the 1KCP a must-have reference for anyone studying Chinese genomic variation or complex structural and repeat-mediated genetic effects.
