Tutorial: LD Clumping and Identifying Independent Signals with cellink#
This tutorial demonstrates how to perform linkage disequilibrium (LD) clumping to identify independent genetic signals from QTL analysis results. LD clumping is essential for distinguishing truly independent associations from those driven by correlation between nearby variants due to linkage disequilibrium. When performing genome-wide QTL mapping, you often identify multiple significant variants for the same gene or phenotype. Many of these variants may be in high LD with each other, meaning they tag the same underlying causal variant. Clumping helps identify “index” or “lead” variants that represent independent signals, which is crucial for reporting interpretable results with non-redundant associations, fine-mapping causal variants and prioritizing variants for functional validation.
This tutorial shows two complementary approaches:
Post-QTL clumping: Clump significant results after QTL analysis
Pre-analysis LD pruning: Reduce variant sets before analysis using reference LD
We’ll use PLINK for clumping operations and demonstrate integration with cellink data structures. This tutorial assumes you’ve already performed eQTL analysis following earlier tutorials.
Environment Setup#
We begin by importing necessary libraries and defining parameters for clumping analysis.
import numpy as np
import pandas as pd
import subprocess
from pathlib import Path
import scanpy as sc
from cellink.resources import get_dummy_onek1k
from cellink.utils import column_normalize, gaussianize
from cellink.io import to_plink
chrom = 22
cis_window = 500_000
cell_type = "CD8 Naive"
celltype_key = "predicted.celltype.l2"
pb_gex_key = f"PB_{cell_type}"
original_donor_col = "donor_id"
# Clumping parameters
clump_p1 = 0.0000005 # Significance threshold for index variants
clump_p2 = 0.00001 # Secondary significance threshold
clump_r2 = 0.1 # LD threshold
clump_kb = 500 # Physical distance window (kb)
dd = get_dummy_onek1k(config_path="../../src/cellink/resources/config/dummy_onek1k.yaml", verify_checksum=False)
dd
[2025-12-29 03:59:35,372] INFO:root: /Users/larnoldt/cellink_data/dummy_onek1k/dummy_onek1k.dd.h5 already exists
[2025-12-29 03:59:35,373] WARNING:root: No checksum provided, skipping verification
[2025-12-29 03:59:36,560] INFO:root: Loaded dummy OneK1K dataset: (100, 146939, 125366, 34073)
╔═ DonorData(n_donors=100, n_cells_per_donor=[613-2,731], donor_id='donor_id') ═══════════════════════════════╗ ║ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ║ ║ ┃ G (donors) ┃ C (cells) ┃ ║ ║ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ ║ ║ │ AnnData object with n_obs × n_vars = 100 × 146,939 │ View of AnnData object with n_obs × n_vars = │ ║ ║ │ │ 125,366 × 34,073 │ ║ ║ │ var: 'chrom', 'pos', 'a0', 'a1', 'AC', │ obs: 'orig.ident', 'nCount_RNA', │ ║ ║ │ 'AC_Hemi', 'AC_Het', 'AC_Hom', 'AF', 'AN', 'ER2', │ 'nFeature_RNA', 'percent.mt', 'donor_id', │ ║ ║ │ 'ExcHet', 'HWE', 'IMPUTED', 'maf', 'NS', 'R2', │ 'pool_number', 'predicted.celltype.l2', │ ║ ║ │ 'TYPED', 'TYPED_ONLY', 'id', 'id_mask', 'length', │ 'predicted.celltype.l2.score', 'age', │ ║ ║ │ 'quality', 'pos_hg19', 'id_hg19' │ 'organism_ontology_term_id', │ ║ ║ │ │ 'tissue_ontology_term_id', │ ║ ║ │ │ 'assay_ontology_term_id', │ ║ ║ │ │ 'disease_ontology_term_id', │ ║ ║ │ │ 'cell_type_ontology_term_id', │ ║ ║ │ │ 'self_reported_ethnicity_ontology_term_id', │ ║ ║ │ │ 'development_stage_ontology_term_id', │ ║ ║ │ │ 'sex_ontology_term_id', 'is_primary_data', │ ║ ║ │ │ 'suspension_type', 'tissue_type', 'cell_type', │ ║ ║ │ │ 'assay', 'disease', 'organism', 'sex', 'tissue', │ ║ ║ │ │ 'self_reported_ethnicity', 'development_stage', │ ║ ║ │ │ 'observation_joinid' │ ║ ║ │ uns: 'kinship' │ var: 'vst.mean', 'vst.variance', │ ║ ║ │ │ 'vst.variance.expected', │ ║ ║ │ │ 'vst.variance.standardized', 'vst.variable', │ ║ ║ │ │ 'feature_is_filtered', 'feature_name', │ ║ ║ │ │ 'feature_reference', 'feature_biotype', │ ║ ║ │ │ 'feature_length', 'feature_type', 'start', 'end', │ ║ ║ │ │ 'chrom' │ ║ ║ │ obsm: 'gPCs' │ uns: 'cell_type_ontology_term_id_colors', │ ║ ║ │ │ 'citation', 'default_embedding', │ ║ ║ │ │ 'schema_reference', 'schema_version', 'title' │ ║ ║ │ varm: 'filter' │ obsm: 'X_azimuth_spca', 'X_azimuth_umap', │ ║ ║ │ │ 'X_harmony', 'X_pca', 'X_umap' │ ║ ║ │ │ varm: 'PCs' │ ║ ║ └────────────────────────────────────────────────────┴────────────────────────────────────────────────────┘ ║ ╚═════════════════════════════════════════════════════════════════════════════════════════════════════════════╝
Load Data and Perform QTL Analysis#
First, we’ll load the OneK1K dataset and perform a basic eQTL analysis to generate results for clumping. If you’ve already run eQTL analysis from Tutorial 1, you can skip this section and load your existing results.
dd.G.obs["donor_id"] = dd.G.obs.index
sc.pp.normalize_total(dd.C)
sc.pp.log1p(dd.C)
dd = dd[..., dd.C.obs[celltype_key] == cell_type, :].copy()
dd = dd.sel(G_var=dd.G.var.chrom == str(chrom), C_var=dd.C.var.chrom == str(chrom)).copy()
dd.aggregate(key_added=pb_gex_key, sync_var=True, verbose=True)
dd.aggregate(obs=["sex", "age"], func="first", add_to_obs=True)
print(f"Dataset shape: {dd.shape}")
[2025-12-29 03:59:38,002] INFO:cellink._core.donordata: Aggregated X to PB_CD8 Naive
[2025-12-29 03:59:38,003] INFO:cellink._core.donordata: Observation found for 100 donors.
Dataset shape: (100, 136776, 4756, 871)
Quick eQTL Analysis#
We’ll run a simplified eQTL analysis to generate association statistics for clumping demonstration.
from cellink.at.gwas import GWAS
from tqdm.auto import tqdm
F = np.concatenate(
[
np.ones((dd.shape[0], 1)),
dd.G.obs[["sex"]].values - 1,
dd.G.obs[["age"]].values,
dd.G.obsm["gPCs"].values[:, :10],
],
axis=1,
)
F[:, 2:] = column_normalize(F[:, 2:])
if hasattr(dd.G.X, "compute"):
dd.G.X = dd.G.X.compute()
results = []
for gene, row in tqdm(list(dd.C.var.iterrows())[:20], desc="Running eQTL tests"):
Y = gaussianize(dd.G.obsm[pb_gex_key][[gene]].values + 1e-5 * np.random.randn(dd.shape[0], 1))
start = max(0, row.start - cis_window)
end = row.end + cis_window
_G = dd.G[:, (dd.G.var.pos >= start) & (dd.G.var.pos <= end)]
_G = _G[:, (_G.X.std(0) != 0)]
if _G.shape[1] == 0:
continue
gwas = GWAS(Y, F)
gwas.test_association(_G.X)
for i, snp_id in enumerate(_G.var.index):
results.append(
{
"SNP": snp_id,
"CHR": chrom,
"BP": _G.var.iloc[i]["pos"],
"GENE": gene,
"BETA": gwas.getBetaSNP().ravel()[i],
"P": gwas.getPv().ravel()[i],
}
)
eqtl_results = pd.DataFrame(results)
print(f"Total associations tested: {len(eqtl_results)}")
print(f"Significant associations (P < 1e-4): {(eqtl_results['P'] < 1e-2).sum()}")
eqtl_results.head()
Total associations tested: 57544
Significant associations (P < 1e-4): 510
| SNP | CHR | BP | GENE | BETA | P | |
|---|---|---|---|---|---|---|
| 0 | 22_16388891_G_A | 22 | 16388891 | ENSG00000233866 | 0.304794 | 0.37518 |
| 1 | 22_16388968_C_T | 22 | 16388968 | ENSG00000233866 | 0.304794 | 0.37518 |
| 2 | 22_16389525_A_G | 22 | 16389525 | ENSG00000233866 | 0.304794 | 0.37518 |
| 3 | 22_16390411_G_A | 22 | 16390411 | ENSG00000233866 | 0.304794 | 0.37518 |
| 4 | 22_16391555_G_C | 22 | 16391555 | ENSG00000233866 | 0.304794 | 0.37518 |
Approach 1: Post-QTL Clumping with PLINK#
After obtaining QTL results, we identify independent signals by clumping variants based on LD patterns in our data. This approach uses your actual study genotypes to compute LD. Step 1: Export Genotype Data to PLINK Format PLINK requires binary format files (.bed, .bim, .fam). We’ll export our genotype data from the DonorData object.
plink_prefix = "chr22_genotypes"
to_plink(dd.G, plink_prefix)
Writing FAM... done.
Writing BIM...
done.
Step 1: Prepare Summary Statistics for Clumping#
PLINK clumping requires summary statistics in a specific format. We’ll prepare our eQTL results accordingly.
def prepare_clump_input(eqtl_results, output_file):
"""
Prepare summary statistics for PLINK clumping.
Parameters
----------
eqtl_results : pd.DataFrame
DataFrame with columns: SNP, CHR, BP, P (and optionally GENE, BETA)
output_file : str
Path to output file
"""
clump_input = eqtl_results[["SNP", "CHR", "BP", "P"]].copy()
clump_input = clump_input.sort_values("P")
clump_input.to_csv(output_file, sep="\t", index=False)
print(f"Prepared {len(clump_input)} associations for clumping")
print(f"Saved to: {output_file}")
return output_file
sumstats_file = prepare_clump_input(eqtl_results, "eqtl_sumstats.txt")
Prepared 57544 associations for clumping
Saved to: eqtl_sumstats.txt
Step 2: Run PLINK Clumping#
Now we’ll run PLINK’s clumping algorithm to identify independent signals. Here we perform clumping on the original genotype data. When your study sample is small or you want to use standard reference LD patterns, you can perform clumping using external genotype data, such as the 1000 Genomes Project data. This is particularly useful for comparing results across studies. The cellink package provides convenient access to 1000 Genomes data via the function cellink.resources.get_1000genomes.
def run_plink_clump(
bfile, sumstats_file, output_prefix, clump_p1=0.00000005, clump_p2=0.00001, clump_r2=0.1, clump_kb=500
):
"""
Run PLINK clumping to identify independent signals.
Parameters
----------
bfile : str
Prefix for PLINK binary files (.bed/.bim/.fam)
sumstats_file : str
Path to summary statistics file
output_prefix : str
Prefix for output files
clump_p1 : float
P-value threshold for index variants
clump_p2 : float
Secondary significance threshold
clump_r2 : float
LD r² threshold
clump_kb : int
Physical distance window in kb
"""
cmd = [
"plink",
"--bfile",
bfile,
"--clump",
sumstats_file,
"--clump-p1",
str(clump_p1),
"--clump-p2",
str(clump_p2),
"--clump-r2",
str(clump_r2),
"--clump-kb",
str(clump_kb),
"--out",
output_prefix,
]
print("Running PLINK clumping...")
print(f"Command: {' '.join(cmd)}")
result = subprocess.run(cmd, capture_output=True, text=True)
if result.returncode != 0:
print("PLINK stderr:", result.stderr)
raise RuntimeError(f"PLINK clumping failed with return code {result.returncode}")
print("Clumping completed successfully")
return f"{output_prefix}.clumped"
clumped_file = run_plink_clump(
bfile=plink_prefix,
sumstats_file=sumstats_file,
output_prefix="eqtl_clumped",
clump_p1=0.0001,
clump_p2=0.001,
clump_r2=0.1,
clump_kb=500,
)
Running PLINK clumping...
Command: plink --bfile chr22_genotypes --clump eqtl_sumstats.txt --clump-p1 0.0001 --clump-p2 0.001 --clump-r2 0.1 --clump-kb 500 --out eqtl_clumped
Clumping completed successfully
Step 3: Parse and Interpret Clumping Results#
PLINK clumping output identifies index SNPs and their associated clumped SNPs. We’ll parse this and integrate it with our original results.
def parse_clumped_results(clumped_file):
"""
Parse PLINK clumped output file.
Returns
-------
pd.DataFrame
DataFrame with index SNPs and their properties
"""
clumped = pd.read_csv(clumped_file, delim_whitespace=True)
print(f"Identified {len(clumped)} independent signals")
print("\nIndex SNPs:")
print(clumped[["CHR", "SNP", "BP", "P", "TOTAL", "NSIG", "S05", "S01"]].head(10))
return clumped
clumped_results = parse_clumped_results(clumped_file)
Identified 1 independent signals
Index SNPs:
CHR SNP BP P TOTAL NSIG S05 S01
0 22 22_16589219_A_G 16589219 0.000089 147 140 7 0
Step 4: Annotate Results with Clumping Information#
We’ll add clumping information back to our original eQTL results, marking which variants are independent index SNPs.
def annotate_with_clumping(eqtl_results, clumped_results):
"""
Annotate eQTL results with clumping information.
Parameters
----------
eqtl_results : pd.DataFrame
Original eQTL association results
clumped_results : pd.DataFrame
PLINK clumped output
Returns
-------
pd.DataFrame
Annotated results with clumping status
"""
index_snps = set(clumped_results["SNP"])
eqtl_results["is_index_snp"] = eqtl_results["SNP"].isin(index_snps)
clumped_mapping = {}
for _, row in clumped_results.iterrows():
index_snp = row["SNP"]
if pd.notna(row.get("SP2")):
clumped_snps = row["SP2"].split(",")
for snp in clumped_snps:
clumped_mapping[snp.strip()] = index_snp
eqtl_results["index_snp"] = eqtl_results["SNP"].map(
lambda x: x if x in index_snps else clumped_mapping.get(x, None)
)
clump_sizes = clumped_results.set_index("SNP")["TOTAL"].to_dict()
eqtl_results["clump_size"] = eqtl_results["index_snp"].map(clump_sizes)
return eqtl_results
annotated_results = annotate_with_clumping(eqtl_results, clumped_results)
print(f"\nTotal associations: {len(annotated_results)}")
print(f"Independent signals (index SNPs): {annotated_results['is_index_snp'].sum()}")
print(f"Associations assigned to clumps: {annotated_results['index_snp'].notna().sum()}")
print("\nIndex SNPs and their associations:")
index_results = annotated_results[annotated_results["is_index_snp"]].sort_values("P")
print(index_results[["SNP", "GENE", "BP", "BETA", "P", "clump_size"]].head(10))
Total associations: 57544
Independent signals (index SNPs): 17
Associations assigned to clumps: 17
Index SNPs and their associations:
SNP GENE BP BETA P \
6699 22_16589219_A_G ENSG00000189295 16589219 2.676809 0.000089
4918 22_16589219_A_G ENSG00000267338 16589219 0.928965 0.219391
44421 22_16589219_A_G ENSG00000177663 16589219 -0.864473 0.246456
16946 22_16589219_A_G ENSG00000237689 16589219 -0.820279 0.271732
26449 22_16589219_A_G ENSG00000253460 16589219 -0.783818 0.298028
114 22_16589219_A_G ENSG00000273362 16589219 0.721878 0.320765
14041 22_16589219_A_G ENSG00000172967 16589219 -0.713094 0.339452
8832 22_16589219_A_G ENSG00000288024 16589219 -0.662068 0.377129
20010 22_16589219_A_G ENSG00000230481 16589219 -0.657744 0.384286
33059 22_16589219_A_G ENSG00000253481 16589219 -0.522308 0.494430
clump_size
6699 147.0
4918 147.0
44421 147.0
16946 147.0
26449 147.0
114 147.0
14041 147.0
8832 147.0
20010 147.0
33059 147.0
Approach 2: Pre-Analysis LD Pruning#
For some analyses, you may want to reduce your variant set before analysis by removing variants in high LD. This is called LD pruning and creates a set of approximately independent variants.
def run_ld_pruning(bfile, output_prefix, window_size=50, step_size=5, r2_threshold=0.5):
"""
Perform LD pruning to create an independent variant set.
Parameters
----------
bfile : str
Prefix for PLINK binary files
output_prefix : str
Prefix for output files
window_size : int
Window size in variant count
step_size : int
Step size for window sliding
r2_threshold : float
r² threshold for pruning
"""
cmd = [
"plink",
"--bfile",
bfile,
"--indep-pairwise",
str(window_size),
str(step_size),
str(r2_threshold),
"--out",
output_prefix,
]
print("Running LD pruning...")
result = subprocess.run(cmd, capture_output=True, text=True)
if result.returncode != 0:
raise RuntimeError(f"LD pruning failed: {result.stderr}")
prune_in = pd.read_csv(f"{output_prefix}.prune.in", header=None, names=["SNP"])
print(f"Retained {len(prune_in)} variants after LD pruning")
return prune_in
pruned_variants = run_ld_pruning(
bfile=plink_prefix, output_prefix="ld_pruned", window_size=50, step_size=5, r2_threshold=0.5
)
dd_pruned = dd[:, dd.G.var.index.isin(pruned_variants["SNP"])].copy()
print(f"Original variants: {dd.shape[1]}")
print(f"After LD pruning: {dd_pruned.shape[1]}")
print(f"Reduction: {(1 - dd_pruned.shape[1]/dd.shape[1])*100:.1f}%")
Running LD pruning...
Retained 30977 variants after LD pruning
Original variants: 136776
After LD pruning: 30977
Reduction: 77.4%
Gene-Level Analysis with Clumping#
For eQTL analysis, it’s often useful to identify independent signals per gene. Here’s how to perform gene-stratified clumping.
def clump_per_gene(eqtl_results, bfile, output_dir, clump_p1=0.00000005, clump_r2=0.1, clump_kb=500):
"""
Perform clumping separately for each gene.
This identifies independent eQTL signals per gene.
"""
output_dir = Path(output_dir)
output_dir.mkdir(parents=True, exist_ok=True)
gene_clumped_results = []
for gene in eqtl_results["GENE"].unique():
gene_data = eqtl_results[eqtl_results["GENE"] == gene]
if len(gene_data) < 2:
continue
gene_sumstats = output_dir / f"{gene}_sumstats.txt"
gene_data[["SNP", "CHR", "BP", "P"]].to_csv(gene_sumstats, sep="\t", index=False)
gene_clumped = run_plink_clump(
bfile=bfile,
sumstats_file=str(gene_sumstats),
output_prefix=str(output_dir / f"{gene}_clumped"),
clump_p1=clump_p1,
clump_r2=clump_r2,
clump_kb=clump_kb,
)
try:
gene_clumped_df = parse_clumped_results(gene_clumped)
gene_clumped_df["GENE"] = gene
gene_clumped_results.append(gene_clumped_df)
except Exception:
continue
all_gene_clumped = pd.concat(gene_clumped_results, ignore_index=True)
return all_gene_clumped
gene_clumped = clump_per_gene(
eqtl_results=eqtl_results,
bfile=plink_prefix,
output_dir="gene_clumped",
clump_p1=0.000001,
clump_r2=0.1,
clump_kb=500,
)
print("\nGene-level clumping summary:")
print(gene_clumped.groupby("GENE").size().describe())
Running PLINK clumping...
Command: plink --bfile chr22_genotypes --clump gene_clumped/ENSG00000233866_sumstats.txt --clump-p1 1e-06 --clump-p2 1e-05 --clump-r2 0.1 --clump-kb 500 --out gene_clumped/ENSG00000233866_clumped
Clumping completed successfully
Running PLINK clumping...
Command: plink --bfile chr22_genotypes --clump gene_clumped/ENSG00000273362_sumstats.txt --clump-p1 1e-06 --clump-p2 1e-05 --clump-r2 0.1 --clump-kb 500 --out gene_clumped/ENSG00000273362_clumped
Clumping completed successfully
Running PLINK clumping...
Command: plink --bfile chr22_genotypes --clump gene_clumped/ENSG00000198445_sumstats.txt --clump-p1 1e-06 --clump-p2 1e-05 --clump-r2 0.1 --clump-kb 500 --out gene_clumped/ENSG00000198445_clumped
Clumping completed successfully
Running PLINK clumping...
Command: plink --bfile chr22_genotypes --clump gene_clumped/ENSG00000287285_sumstats.txt --clump-p1 1e-06 --clump-p2 1e-05 --clump-r2 0.1 --clump-kb 500 --out gene_clumped/ENSG00000287285_clumped
Clumping completed successfully
Running PLINK clumping...
Command: plink --bfile chr22_genotypes --clump gene_clumped/ENSG00000267338_sumstats.txt --clump-p1 1e-06 --clump-p2 1e-05 --clump-r2 0.1 --clump-kb 500 --out gene_clumped/ENSG00000267338_clumped
Clumping completed successfully
Running PLINK clumping...
Command: plink --bfile chr22_genotypes --clump gene_clumped/ENSG00000189295_sumstats.txt --clump-p1 1e-06 --clump-p2 1e-05 --clump-r2 0.1 --clump-kb 500 --out gene_clumped/ENSG00000189295_clumped
Clumping completed successfully
Running PLINK clumping...
Command: plink --bfile chr22_genotypes --clump gene_clumped/ENSG00000288024_sumstats.txt --clump-p1 1e-06 --clump-p2 1e-05 --clump-r2 0.1 --clump-kb 500 --out gene_clumped/ENSG00000288024_clumped
Clumping completed successfully
Running PLINK clumping...
Command: plink --bfile chr22_genotypes --clump gene_clumped/ENSG00000235343_sumstats.txt --clump-p1 1e-06 --clump-p2 1e-05 --clump-r2 0.1 --clump-kb 500 --out gene_clumped/ENSG00000235343_clumped
Clumping completed successfully
Running PLINK clumping...
Command: plink --bfile chr22_genotypes --clump gene_clumped/ENSG00000172967_sumstats.txt --clump-p1 1e-06 --clump-p2 1e-05 --clump-r2 0.1 --clump-kb 500 --out gene_clumped/ENSG00000172967_clumped
Clumping completed successfully
Running PLINK clumping...
Command: plink --bfile chr22_genotypes --clump gene_clumped/ENSG00000237689_sumstats.txt --clump-p1 1e-06 --clump-p2 1e-05 --clump-r2 0.1 --clump-kb 500 --out gene_clumped/ENSG00000237689_clumped
Clumping completed successfully
Running PLINK clumping...
Command: plink --bfile chr22_genotypes --clump gene_clumped/ENSG00000230481_sumstats.txt --clump-p1 1e-06 --clump-p2 1e-05 --clump-r2 0.1 --clump-kb 500 --out gene_clumped/ENSG00000230481_clumped
Clumping completed successfully
Running PLINK clumping...
Command: plink --bfile chr22_genotypes --clump gene_clumped/ENSG00000253691_sumstats.txt --clump-p1 1e-06 --clump-p2 1e-05 --clump-r2 0.1 --clump-kb 500 --out gene_clumped/ENSG00000253691_clumped
Clumping completed successfully
Running PLINK clumping...
Command: plink --bfile chr22_genotypes --clump gene_clumped/ENSG00000253460_sumstats.txt --clump-p1 1e-06 --clump-p2 1e-05 --clump-r2 0.1 --clump-kb 500 --out gene_clumped/ENSG00000253460_clumped
Clumping completed successfully
Running PLINK clumping...
Command: plink --bfile chr22_genotypes --clump gene_clumped/ENSG00000254264_sumstats.txt --clump-p1 1e-06 --clump-p2 1e-05 --clump-r2 0.1 --clump-kb 500 --out gene_clumped/ENSG00000254264_clumped
Clumping completed successfully
Running PLINK clumping...
Command: plink --bfile chr22_genotypes --clump gene_clumped/ENSG00000253481_sumstats.txt --clump-p1 1e-06 --clump-p2 1e-05 --clump-r2 0.1 --clump-kb 500 --out gene_clumped/ENSG00000253481_clumped
Clumping completed successfully
Running PLINK clumping...
Command: plink --bfile chr22_genotypes --clump gene_clumped/ENSG00000215568_sumstats.txt --clump-p1 1e-06 --clump-p2 1e-05 --clump-r2 0.1 --clump-kb 500 --out gene_clumped/ENSG00000215568_clumped
Clumping completed successfully
Running PLINK clumping...
Command: plink --bfile chr22_genotypes --clump gene_clumped/ENSG00000273442_sumstats.txt --clump-p1 1e-06 --clump-p2 1e-05 --clump-r2 0.1 --clump-kb 500 --out gene_clumped/ENSG00000273442_clumped
Clumping completed successfully
Running PLINK clumping...
Command: plink --bfile chr22_genotypes --clump gene_clumped/ENSG00000177663_sumstats.txt --clump-p1 1e-06 --clump-p2 1e-05 --clump-r2 0.1 --clump-kb 500 --out gene_clumped/ENSG00000177663_clumped
Clumping completed successfully
Running PLINK clumping...
Command: plink --bfile chr22_genotypes --clump gene_clumped/ENSG00000183307_sumstats.txt --clump-p1 1e-06 --clump-p2 1e-05 --clump-r2 0.1 --clump-kb 500 --out gene_clumped/ENSG00000183307_clumped
Clumping completed successfully
Running PLINK clumping...
Command: plink --bfile chr22_genotypes --clump gene_clumped/ENSG00000235478_sumstats.txt --clump-p1 1e-06 --clump-p2 1e-05 --clump-r2 0.1 --clump-kb 500 --out gene_clumped/ENSG00000235478_clumped
Clumping completed successfully
Identified 4 independent signals
Index SNPs:
CHR SNP BP P TOTAL NSIG S05 S01
0 22 22_17197417_A_G 17197417 0.00124 39 26 9 4
1 22 22_16940689_T_TGTGTGC 16940689 0.00164 50 6 20 24
2 22 22_17395142_G_A 17395142 0.00365 26 14 9 3
3 22 22_17511574_A_G 17511574 0.00835 46 39 5 2
Gene-level clumping summary:
count 1.0
mean 4.0
std NaN
min 4.0
25% 4.0
50% 4.0
75% 4.0
max 4.0
dtype: float64
Integration with cellink workflows#
Finally, let’s demonstrate how to integrate clumped results back into the cellink ecosystem.
def add_clumping_to_gdata(gdata, clumped_results, key="clumping_info"):
"""
Add clumping information to the variant annotations in gdata.
Parameters
----------
gdata : AnnData
Genotype data object (dd.G)
clumped_results : pd.DataFrame
PLINK clumped output
key : str
Key to store clumping info in gdata.var
"""
index_snps = set(clumped_results["SNP"])
gdata.var[f"{key}_is_index"] = gdata.var.index.isin(index_snps)
clump_size_map = clumped_results.set_index("SNP")["TOTAL"].to_dict()
gdata.var[f"{key}_clump_size"] = gdata.var.index.map(clump_size_map)
nsig_map = clumped_results.set_index("SNP")["NSIG"].to_dict()
gdata.var[f"{key}_n_sig_secondary"] = gdata.var.index.map(nsig_map)
print(f"Added clumping annotations to gdata.var with prefix '{key}_'")
print(f"Index SNPs marked: {gdata.var[f'{key}_is_index'].sum()}")
add_clumping_to_gdata(dd.G, clumped_results, key="eqtl_clump")
dd.G
Added clumping annotations to gdata.var with prefix 'eqtl_clump_'
Index SNPs marked: 1
AnnData object with n_obs × n_vars = 100 × 136776
obs: 'donor_id', 'sex', 'age'
var: 'chrom', 'pos', 'a0', 'a1', 'AC', 'AC_Hemi', 'AC_Het', 'AC_Hom', 'AF', 'AN', 'ER2', 'ExcHet', 'HWE', 'IMPUTED', 'maf', 'NS', 'R2', 'TYPED', 'TYPED_ONLY', 'id', 'id_mask', 'length', 'quality', 'pos_hg19', 'id_hg19', 'eqtl_clump_is_index', 'eqtl_clump_clump_size', 'eqtl_clump_n_sig_secondary'
uns: 'kinship'
obsm: 'gPCs', 'PB_CD8 Naive'
varm: 'filter'