Skip to main content
Ctrl+K

cellink

  • Tutorials
    • Tutorial: Pseudobulk eQTL Analysis with cellink
    • Tutorial: eQTL Analysis with JaxQTL and TensorQTL using cellink
    • Tutorial: Annotating Genetic Variants with cellink
    • Tutorial: Rare Variant Association Testing with cellink
    • Tutorial: LD Clumping and Identifying Independent Signals with cellink
    • Tutorial: Colocalization Analysis - Linking eQTLs to GWAS Signals with cellink
    • Tutorial: Integrating GWAS with Single-Cell Data using cellink
    • Tutorial: Spatially Resolved GWAS Mapping with gsMap
    • Tutorial: eQTL Analysis with SAIGE-QTL using cellink
    • Tutorial: Using EHR Data as Donor-Level Input in cellink
    • Tutorial: Using the MILDataset and PyTorch DataLoader in cellink
  • API
    • DonorData
      • cellink.DonorData
    • Preprocessing pp
      • cellink.pp.variant_qc
      • cellink.pp.cell_level_obs_filter
      • cellink.pp.donor_level_obs_filter
      • cellink.pp.donor_level_var_filter
      • cellink.pp.log_transform
      • cellink.pp.low_abundance_filter
      • cellink.pp.missing_values_filter
      • cellink.pp.normalize
    • Input-Output io
      • cellink.io.from_sgkit_dataset
      • cellink.io.read_plink
      • cellink.io.read_bgen
      • cellink.io.read_sgkit_zarr
      • cellink.io.read_pgen_zarr
      • cellink.io.stream_pgen_to_zarr
      • cellink.io.to_plink
      • cellink.io.write_variants_to_vcf
    • Tools tl
      • cellink.tl.get_snp_df
      • cellink.tl.run_favor
      • cellink.tl.run_snpeff
      • cellink.tl.run_vep
      • cellink.tl.add_vep_annos_to_gdata
      • cellink.tl.combine_annotations
      • cellink.tl.aggregate_annotations_for_varm
      • cellink.tl.run_burden_test
      • cellink.tl.run_skat_test
      • cellink.tl.beta_weighting
    • External tools tl.external
      • cellink.tl.external.calculate_ld
      • cellink.tl.external.run_jaxqtl
      • cellink.tl.external.read_jaxqtl_results
      • cellink.tl.external.run_mixmil
      • cellink.tl.external.calculate_pcs
      • cellink.tl.external.run_tensorqtl
      • cellink.tl.external.read_tensorqtl_results
      • cellink.tl.external.run_scdrs
      • cellink.tl.external.run_seismic
      • cellink.tl.external.run_magma_pipeline
      • cellink.tl.external.run_saigeqtl
      • cellink.tl.external.configure_saigeqtl_runner
      • cellink.tl.external.get_saigeqtl_runner
      • cellink.tl.external.make_group_file
      • cellink.tl.external.read_saigeqtl_results
      • cellink.tl.external.load_gsmap_results
      • cellink.tl.external.format_gsmap_sumstats
    • Plotting
      • cellink.pl.locus
      • cellink.pl.manhattan
      • cellink.pl.qq
      • cellink.pl.expression_by_genotype
      • cellink.pl.volcano
    • Machine Learning ml
      • cellink.ml.MILDataset
      • cellink.ml.mil_collate_fn
      • cellink.ml.DonorMILModel
    • Association Testing at
      • cellink.at.acat_test
      • cellink.at.compute_acat
      • cellink.at.GWAS
      • cellink.at.Skat
    • Utils
      • cellink.utils.column_normalize
      • cellink.utils.gaussianize
      • cellink.utils.one_hot_encode_genotypes
      • cellink.utils.dosage_per_strand
    • Resources
      • cellink.resources.get_1000genomes
      • cellink.resources.get_1000genomes_grch38
      • cellink.resources.get_dummy_onek1k
      • cellink.resources.get_onek1k
      • cellink.resources.get_eqtl_catalog_dataset_associations
      • cellink.resources.get_eqtl_catalog_datasets
      • cellink.resources.get_gwas_catalog_studies
      • cellink.resources.get_gwas_catalog_study
      • cellink.resources.get_gwas_catalog_study_summary_stats
      • cellink.resources.get_pgs_catalog_score
      • cellink.resources.get_pgs_catalog_scores
      • cellink.resources.get_1000genomes_ld_scores
      • cellink.resources.get_1000genomes_ld_weights
  • Changelog
  • Contributing guide
  • References
  • .ipynb

Tutorial: LD Clumping and Identifying Independent Signals with cellink

Contents

  • Environment Setup
  • Load Data and Perform QTL Analysis
  • Quick eQTL Analysis
  • Approach 1: Post-QTL Clumping with PLINK
    • Step 1: Prepare Summary Statistics for Clumping
    • Step 2: Run PLINK Clumping
    • Step 3: Parse and Interpret Clumping Results
    • Step 4: Annotate Results with Clumping Information
  • Approach 2: Pre-Analysis LD Pruning
    • Gene-Level Analysis with Clumping
    • Integration with cellink workflows

Tutorial: LD Clumping and Identifying Independent Signals with cellink#

This tutorial demonstrates how to perform linkage disequilibrium (LD) clumping to identify independent genetic signals from QTL analysis results. LD clumping is essential for distinguishing truly independent associations from those driven by correlation between nearby variants due to linkage disequilibrium. When performing genome-wide QTL mapping, you often identify multiple significant variants for the same gene or phenotype. Many of these variants may be in high LD with each other, meaning they tag the same underlying causal variant. Clumping helps identify “index” or “lead” variants that represent independent signals, which is crucial for reporting interpretable results with non-redundant associations, fine-mapping causal variants and prioritizing variants for functional validation.

This tutorial shows two complementary approaches:

  • Post-QTL clumping: Clump significant results after QTL analysis

  • Pre-analysis LD pruning: Reduce variant sets before analysis using reference LD

We’ll use PLINK for clumping operations and demonstrate integration with cellink data structures. This tutorial assumes you’ve already performed eQTL analysis following earlier tutorials.

Environment Setup#

We begin by importing necessary libraries and defining parameters for clumping analysis.

import numpy as np
import pandas as pd
import subprocess
from pathlib import Path

import scanpy as sc
from cellink.resources import get_dummy_onek1k
from cellink.utils import column_normalize, gaussianize
from cellink.io import to_plink

chrom = 22
cis_window = 500_000
cell_type = "CD8 Naive"
celltype_key = "predicted.celltype.l2"
pb_gex_key = f"PB_{cell_type}"
original_donor_col = "donor_id"

# Clumping parameters
clump_p1 = 0.0000005  # Significance threshold for index variants
clump_p2 = 0.00001  # Secondary significance threshold
clump_r2 = 0.1  # LD threshold
clump_kb = 500  # Physical distance window (kb)
dd = get_dummy_onek1k(config_path="../../src/cellink/resources/config/dummy_onek1k.yaml", verify_checksum=False)
dd
[2025-12-29 03:59:35,372] INFO:root: /Users/larnoldt/cellink_data/dummy_onek1k/dummy_onek1k.dd.h5 already exists
[2025-12-29 03:59:35,373] WARNING:root: No checksum provided, skipping verification
[2025-12-29 03:59:36,560] INFO:root: Loaded dummy OneK1K dataset: (100, 146939, 125366, 34073)
╔═ DonorData(n_donors=100, n_cells_per_donor=[613-2,731], donor_id='donor_id') ═══════════════════════════════╗
║ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ║
║ ┃ G (donors)                                         ┃ C (cells)                                          ┃ ║
║ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ ║
║ │ AnnData object with n_obs × n_vars = 100 × 146,939 │ View of AnnData object with n_obs × n_vars =       │ ║
║ │                                                    │ 125,366 × 34,073                                   │ ║
║ │     var: 'chrom', 'pos', 'a0', 'a1', 'AC',         │     obs: 'orig.ident', 'nCount_RNA',               │ ║
║ │ 'AC_Hemi', 'AC_Het', 'AC_Hom', 'AF', 'AN', 'ER2',  │ 'nFeature_RNA', 'percent.mt', 'donor_id',          │ ║
║ │ 'ExcHet', 'HWE', 'IMPUTED', 'maf', 'NS', 'R2',     │ 'pool_number', 'predicted.celltype.l2',            │ ║
║ │ 'TYPED', 'TYPED_ONLY', 'id', 'id_mask', 'length',  │ 'predicted.celltype.l2.score', 'age',              │ ║
║ │ 'quality', 'pos_hg19', 'id_hg19'                   │ 'organism_ontology_term_id',                       │ ║
║ │                                                    │ 'tissue_ontology_term_id',                         │ ║
║ │                                                    │ 'assay_ontology_term_id',                          │ ║
║ │                                                    │ 'disease_ontology_term_id',                        │ ║
║ │                                                    │ 'cell_type_ontology_term_id',                      │ ║
║ │                                                    │ 'self_reported_ethnicity_ontology_term_id',        │ ║
║ │                                                    │ 'development_stage_ontology_term_id',              │ ║
║ │                                                    │ 'sex_ontology_term_id', 'is_primary_data',         │ ║
║ │                                                    │ 'suspension_type', 'tissue_type', 'cell_type',     │ ║
║ │                                                    │ 'assay', 'disease', 'organism', 'sex', 'tissue',   │ ║
║ │                                                    │ 'self_reported_ethnicity', 'development_stage',    │ ║
║ │                                                    │ 'observation_joinid'                               │ ║
║ │     uns: 'kinship'                                 │     var: 'vst.mean', 'vst.variance',               │ ║
║ │                                                    │ 'vst.variance.expected',                           │ ║
║ │                                                    │ 'vst.variance.standardized', 'vst.variable',       │ ║
║ │                                                    │ 'feature_is_filtered', 'feature_name',             │ ║
║ │                                                    │ 'feature_reference', 'feature_biotype',            │ ║
║ │                                                    │ 'feature_length', 'feature_type', 'start', 'end',  │ ║
║ │                                                    │ 'chrom'                                            │ ║
║ │     obsm: 'gPCs'                                   │     uns: 'cell_type_ontology_term_id_colors',      │ ║
║ │                                                    │ 'citation', 'default_embedding',                   │ ║
║ │                                                    │ 'schema_reference', 'schema_version', 'title'      │ ║
║ │     varm: 'filter'                                 │     obsm: 'X_azimuth_spca', 'X_azimuth_umap',      │ ║
║ │                                                    │ 'X_harmony', 'X_pca', 'X_umap'                     │ ║
║ │                                                    │     varm: 'PCs'                                    │ ║
║ └────────────────────────────────────────────────────┴────────────────────────────────────────────────────┘ ║
╚═════════════════════════════════════════════════════════════════════════════════════════════════════════════╝

Load Data and Perform QTL Analysis#

First, we’ll load the OneK1K dataset and perform a basic eQTL analysis to generate results for clumping. If you’ve already run eQTL analysis from Tutorial 1, you can skip this section and load your existing results.

dd.G.obs["donor_id"] = dd.G.obs.index

sc.pp.normalize_total(dd.C)
sc.pp.log1p(dd.C)

dd = dd[..., dd.C.obs[celltype_key] == cell_type, :].copy()
dd = dd.sel(G_var=dd.G.var.chrom == str(chrom), C_var=dd.C.var.chrom == str(chrom)).copy()

dd.aggregate(key_added=pb_gex_key, sync_var=True, verbose=True)
dd.aggregate(obs=["sex", "age"], func="first", add_to_obs=True)

print(f"Dataset shape: {dd.shape}")
[2025-12-29 03:59:38,002] INFO:cellink._core.donordata: Aggregated X to PB_CD8 Naive
[2025-12-29 03:59:38,003] INFO:cellink._core.donordata: Observation found for 100 donors.
Dataset shape: (100, 136776, 4756, 871)

Quick eQTL Analysis#

We’ll run a simplified eQTL analysis to generate association statistics for clumping demonstration.

from cellink.at.gwas import GWAS
from tqdm.auto import tqdm

F = np.concatenate(
    [
        np.ones((dd.shape[0], 1)),
        dd.G.obs[["sex"]].values - 1,
        dd.G.obs[["age"]].values,
        dd.G.obsm["gPCs"].values[:, :10],
    ],
    axis=1,
)
F[:, 2:] = column_normalize(F[:, 2:])

if hasattr(dd.G.X, "compute"):
    dd.G.X = dd.G.X.compute()

results = []
for gene, row in tqdm(list(dd.C.var.iterrows())[:20], desc="Running eQTL tests"):
    Y = gaussianize(dd.G.obsm[pb_gex_key][[gene]].values + 1e-5 * np.random.randn(dd.shape[0], 1))

    start = max(0, row.start - cis_window)
    end = row.end + cis_window
    _G = dd.G[:, (dd.G.var.pos >= start) & (dd.G.var.pos <= end)]
    _G = _G[:, (_G.X.std(0) != 0)]

    if _G.shape[1] == 0:
        continue

    gwas = GWAS(Y, F)
    gwas.test_association(_G.X)

    for i, snp_id in enumerate(_G.var.index):
        results.append(
            {
                "SNP": snp_id,
                "CHR": chrom,
                "BP": _G.var.iloc[i]["pos"],
                "GENE": gene,
                "BETA": gwas.getBetaSNP().ravel()[i],
                "P": gwas.getPv().ravel()[i],
            }
        )

eqtl_results = pd.DataFrame(results)
print(f"Total associations tested: {len(eqtl_results)}")
print(f"Significant associations (P < 1e-4): {(eqtl_results['P'] < 1e-2).sum()}")
eqtl_results.head()
Total associations tested: 57544
Significant associations (P < 1e-4): 510
SNP CHR BP GENE BETA P
0 22_16388891_G_A 22 16388891 ENSG00000233866 0.304794 0.37518
1 22_16388968_C_T 22 16388968 ENSG00000233866 0.304794 0.37518
2 22_16389525_A_G 22 16389525 ENSG00000233866 0.304794 0.37518
3 22_16390411_G_A 22 16390411 ENSG00000233866 0.304794 0.37518
4 22_16391555_G_C 22 16391555 ENSG00000233866 0.304794 0.37518

Approach 1: Post-QTL Clumping with PLINK#

After obtaining QTL results, we identify independent signals by clumping variants based on LD patterns in our data. This approach uses your actual study genotypes to compute LD. Step 1: Export Genotype Data to PLINK Format PLINK requires binary format files (.bed, .bim, .fam). We’ll export our genotype data from the DonorData object.

plink_prefix = "chr22_genotypes"
to_plink(dd.G, plink_prefix)
Writing FAM... done.
Writing BIM...
done.

Step 1: Prepare Summary Statistics for Clumping#

PLINK clumping requires summary statistics in a specific format. We’ll prepare our eQTL results accordingly.

def prepare_clump_input(eqtl_results, output_file):
    """
    Prepare summary statistics for PLINK clumping.

    Parameters
    ----------
    eqtl_results : pd.DataFrame
        DataFrame with columns: SNP, CHR, BP, P (and optionally GENE, BETA)
    output_file : str
        Path to output file
    """
    clump_input = eqtl_results[["SNP", "CHR", "BP", "P"]].copy()
    clump_input = clump_input.sort_values("P")

    clump_input.to_csv(output_file, sep="\t", index=False)
    print(f"Prepared {len(clump_input)} associations for clumping")
    print(f"Saved to: {output_file}")

    return output_file


sumstats_file = prepare_clump_input(eqtl_results, "eqtl_sumstats.txt")
Prepared 57544 associations for clumping
Saved to: eqtl_sumstats.txt

Step 2: Run PLINK Clumping#

Now we’ll run PLINK’s clumping algorithm to identify independent signals. Here we perform clumping on the original genotype data. When your study sample is small or you want to use standard reference LD patterns, you can perform clumping using external genotype data, such as the 1000 Genomes Project data. This is particularly useful for comparing results across studies. The cellink package provides convenient access to 1000 Genomes data via the function cellink.resources.get_1000genomes.

def run_plink_clump(
    bfile, sumstats_file, output_prefix, clump_p1=0.00000005, clump_p2=0.00001, clump_r2=0.1, clump_kb=500
):
    """
    Run PLINK clumping to identify independent signals.

    Parameters
    ----------
    bfile : str
        Prefix for PLINK binary files (.bed/.bim/.fam)
    sumstats_file : str
        Path to summary statistics file
    output_prefix : str
        Prefix for output files
    clump_p1 : float
        P-value threshold for index variants
    clump_p2 : float
        Secondary significance threshold
    clump_r2 : float
        LD r² threshold
    clump_kb : int
        Physical distance window in kb
    """
    cmd = [
        "plink",
        "--bfile",
        bfile,
        "--clump",
        sumstats_file,
        "--clump-p1",
        str(clump_p1),
        "--clump-p2",
        str(clump_p2),
        "--clump-r2",
        str(clump_r2),
        "--clump-kb",
        str(clump_kb),
        "--out",
        output_prefix,
    ]

    print("Running PLINK clumping...")
    print(f"Command: {' '.join(cmd)}")

    result = subprocess.run(cmd, capture_output=True, text=True)

    if result.returncode != 0:
        print("PLINK stderr:", result.stderr)
        raise RuntimeError(f"PLINK clumping failed with return code {result.returncode}")

    print("Clumping completed successfully")
    return f"{output_prefix}.clumped"


clumped_file = run_plink_clump(
    bfile=plink_prefix,
    sumstats_file=sumstats_file,
    output_prefix="eqtl_clumped",
    clump_p1=0.0001,
    clump_p2=0.001,
    clump_r2=0.1,
    clump_kb=500,
)
Running PLINK clumping...
Command: plink --bfile chr22_genotypes --clump eqtl_sumstats.txt --clump-p1 0.0001 --clump-p2 0.001 --clump-r2 0.1 --clump-kb 500 --out eqtl_clumped
Clumping completed successfully

Step 3: Parse and Interpret Clumping Results#

PLINK clumping output identifies index SNPs and their associated clumped SNPs. We’ll parse this and integrate it with our original results.

def parse_clumped_results(clumped_file):
    """
    Parse PLINK clumped output file.

    Returns
    -------
    pd.DataFrame
        DataFrame with index SNPs and their properties
    """
    clumped = pd.read_csv(clumped_file, delim_whitespace=True)

    print(f"Identified {len(clumped)} independent signals")
    print("\nIndex SNPs:")
    print(clumped[["CHR", "SNP", "BP", "P", "TOTAL", "NSIG", "S05", "S01"]].head(10))

    return clumped


clumped_results = parse_clumped_results(clumped_file)
Identified 1 independent signals

Index SNPs:
   CHR              SNP        BP         P  TOTAL  NSIG  S05  S01
0   22  22_16589219_A_G  16589219  0.000089    147   140    7    0

Step 4: Annotate Results with Clumping Information#

We’ll add clumping information back to our original eQTL results, marking which variants are independent index SNPs.

def annotate_with_clumping(eqtl_results, clumped_results):
    """
    Annotate eQTL results with clumping information.

    Parameters
    ----------
    eqtl_results : pd.DataFrame
        Original eQTL association results
    clumped_results : pd.DataFrame
        PLINK clumped output

    Returns
    -------
    pd.DataFrame
        Annotated results with clumping status
    """
    index_snps = set(clumped_results["SNP"])
    eqtl_results["is_index_snp"] = eqtl_results["SNP"].isin(index_snps)

    clumped_mapping = {}
    for _, row in clumped_results.iterrows():
        index_snp = row["SNP"]
        if pd.notna(row.get("SP2")):
            clumped_snps = row["SP2"].split(",")
            for snp in clumped_snps:
                clumped_mapping[snp.strip()] = index_snp

    eqtl_results["index_snp"] = eqtl_results["SNP"].map(
        lambda x: x if x in index_snps else clumped_mapping.get(x, None)
    )

    clump_sizes = clumped_results.set_index("SNP")["TOTAL"].to_dict()
    eqtl_results["clump_size"] = eqtl_results["index_snp"].map(clump_sizes)

    return eqtl_results


annotated_results = annotate_with_clumping(eqtl_results, clumped_results)

print(f"\nTotal associations: {len(annotated_results)}")
print(f"Independent signals (index SNPs): {annotated_results['is_index_snp'].sum()}")
print(f"Associations assigned to clumps: {annotated_results['index_snp'].notna().sum()}")

print("\nIndex SNPs and their associations:")
index_results = annotated_results[annotated_results["is_index_snp"]].sort_values("P")
print(index_results[["SNP", "GENE", "BP", "BETA", "P", "clump_size"]].head(10))
Total associations: 57544
Independent signals (index SNPs): 17
Associations assigned to clumps: 17

Index SNPs and their associations:
                   SNP             GENE        BP      BETA         P  \
6699   22_16589219_A_G  ENSG00000189295  16589219  2.676809  0.000089   
4918   22_16589219_A_G  ENSG00000267338  16589219  0.928965  0.219391   
44421  22_16589219_A_G  ENSG00000177663  16589219 -0.864473  0.246456   
16946  22_16589219_A_G  ENSG00000237689  16589219 -0.820279  0.271732   
26449  22_16589219_A_G  ENSG00000253460  16589219 -0.783818  0.298028   
114    22_16589219_A_G  ENSG00000273362  16589219  0.721878  0.320765   
14041  22_16589219_A_G  ENSG00000172967  16589219 -0.713094  0.339452   
8832   22_16589219_A_G  ENSG00000288024  16589219 -0.662068  0.377129   
20010  22_16589219_A_G  ENSG00000230481  16589219 -0.657744  0.384286   
33059  22_16589219_A_G  ENSG00000253481  16589219 -0.522308  0.494430   

       clump_size  
6699        147.0  
4918        147.0  
44421       147.0  
16946       147.0  
26449       147.0  
114         147.0  
14041       147.0  
8832        147.0  
20010       147.0  
33059       147.0

Approach 2: Pre-Analysis LD Pruning#

For some analyses, you may want to reduce your variant set before analysis by removing variants in high LD. This is called LD pruning and creates a set of approximately independent variants.

def run_ld_pruning(bfile, output_prefix, window_size=50, step_size=5, r2_threshold=0.5):
    """
    Perform LD pruning to create an independent variant set.

    Parameters
    ----------
    bfile : str
        Prefix for PLINK binary files
    output_prefix : str
        Prefix for output files
    window_size : int
        Window size in variant count
    step_size : int
        Step size for window sliding
    r2_threshold : float
        r² threshold for pruning
    """
    cmd = [
        "plink",
        "--bfile",
        bfile,
        "--indep-pairwise",
        str(window_size),
        str(step_size),
        str(r2_threshold),
        "--out",
        output_prefix,
    ]

    print("Running LD pruning...")
    result = subprocess.run(cmd, capture_output=True, text=True)

    if result.returncode != 0:
        raise RuntimeError(f"LD pruning failed: {result.stderr}")

    prune_in = pd.read_csv(f"{output_prefix}.prune.in", header=None, names=["SNP"])
    print(f"Retained {len(prune_in)} variants after LD pruning")

    return prune_in


pruned_variants = run_ld_pruning(
    bfile=plink_prefix, output_prefix="ld_pruned", window_size=50, step_size=5, r2_threshold=0.5
)

dd_pruned = dd[:, dd.G.var.index.isin(pruned_variants["SNP"])].copy()
print(f"Original variants: {dd.shape[1]}")
print(f"After LD pruning: {dd_pruned.shape[1]}")
print(f"Reduction: {(1 - dd_pruned.shape[1]/dd.shape[1])*100:.1f}%")
Running LD pruning...
Retained 30977 variants after LD pruning
Original variants: 136776
After LD pruning: 30977
Reduction: 77.4%

Gene-Level Analysis with Clumping#

For eQTL analysis, it’s often useful to identify independent signals per gene. Here’s how to perform gene-stratified clumping.

def clump_per_gene(eqtl_results, bfile, output_dir, clump_p1=0.00000005, clump_r2=0.1, clump_kb=500):
    """
    Perform clumping separately for each gene.

    This identifies independent eQTL signals per gene.
    """
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    gene_clumped_results = []

    for gene in eqtl_results["GENE"].unique():
        gene_data = eqtl_results[eqtl_results["GENE"] == gene]

        if len(gene_data) < 2:
            continue

        gene_sumstats = output_dir / f"{gene}_sumstats.txt"
        gene_data[["SNP", "CHR", "BP", "P"]].to_csv(gene_sumstats, sep="\t", index=False)

        gene_clumped = run_plink_clump(
            bfile=bfile,
            sumstats_file=str(gene_sumstats),
            output_prefix=str(output_dir / f"{gene}_clumped"),
            clump_p1=clump_p1,
            clump_r2=clump_r2,
            clump_kb=clump_kb,
        )

        try:
            gene_clumped_df = parse_clumped_results(gene_clumped)
            gene_clumped_df["GENE"] = gene
            gene_clumped_results.append(gene_clumped_df)

        except Exception:
            continue

    all_gene_clumped = pd.concat(gene_clumped_results, ignore_index=True)

    return all_gene_clumped


gene_clumped = clump_per_gene(
    eqtl_results=eqtl_results,
    bfile=plink_prefix,
    output_dir="gene_clumped",
    clump_p1=0.000001,
    clump_r2=0.1,
    clump_kb=500,
)

print("\nGene-level clumping summary:")
print(gene_clumped.groupby("GENE").size().describe())
Running PLINK clumping...
Command: plink --bfile chr22_genotypes --clump gene_clumped/ENSG00000233866_sumstats.txt --clump-p1 1e-06 --clump-p2 1e-05 --clump-r2 0.1 --clump-kb 500 --out gene_clumped/ENSG00000233866_clumped
Clumping completed successfully
Running PLINK clumping...
Command: plink --bfile chr22_genotypes --clump gene_clumped/ENSG00000273362_sumstats.txt --clump-p1 1e-06 --clump-p2 1e-05 --clump-r2 0.1 --clump-kb 500 --out gene_clumped/ENSG00000273362_clumped
Clumping completed successfully
Running PLINK clumping...
Command: plink --bfile chr22_genotypes --clump gene_clumped/ENSG00000198445_sumstats.txt --clump-p1 1e-06 --clump-p2 1e-05 --clump-r2 0.1 --clump-kb 500 --out gene_clumped/ENSG00000198445_clumped
Clumping completed successfully
Running PLINK clumping...
Command: plink --bfile chr22_genotypes --clump gene_clumped/ENSG00000287285_sumstats.txt --clump-p1 1e-06 --clump-p2 1e-05 --clump-r2 0.1 --clump-kb 500 --out gene_clumped/ENSG00000287285_clumped
Clumping completed successfully
Running PLINK clumping...
Command: plink --bfile chr22_genotypes --clump gene_clumped/ENSG00000267338_sumstats.txt --clump-p1 1e-06 --clump-p2 1e-05 --clump-r2 0.1 --clump-kb 500 --out gene_clumped/ENSG00000267338_clumped
Clumping completed successfully
Running PLINK clumping...
Command: plink --bfile chr22_genotypes --clump gene_clumped/ENSG00000189295_sumstats.txt --clump-p1 1e-06 --clump-p2 1e-05 --clump-r2 0.1 --clump-kb 500 --out gene_clumped/ENSG00000189295_clumped
Clumping completed successfully
Running PLINK clumping...
Command: plink --bfile chr22_genotypes --clump gene_clumped/ENSG00000288024_sumstats.txt --clump-p1 1e-06 --clump-p2 1e-05 --clump-r2 0.1 --clump-kb 500 --out gene_clumped/ENSG00000288024_clumped
Clumping completed successfully
Running PLINK clumping...
Command: plink --bfile chr22_genotypes --clump gene_clumped/ENSG00000235343_sumstats.txt --clump-p1 1e-06 --clump-p2 1e-05 --clump-r2 0.1 --clump-kb 500 --out gene_clumped/ENSG00000235343_clumped
Clumping completed successfully
Running PLINK clumping...
Command: plink --bfile chr22_genotypes --clump gene_clumped/ENSG00000172967_sumstats.txt --clump-p1 1e-06 --clump-p2 1e-05 --clump-r2 0.1 --clump-kb 500 --out gene_clumped/ENSG00000172967_clumped
Clumping completed successfully
Running PLINK clumping...
Command: plink --bfile chr22_genotypes --clump gene_clumped/ENSG00000237689_sumstats.txt --clump-p1 1e-06 --clump-p2 1e-05 --clump-r2 0.1 --clump-kb 500 --out gene_clumped/ENSG00000237689_clumped
Clumping completed successfully
Running PLINK clumping...
Command: plink --bfile chr22_genotypes --clump gene_clumped/ENSG00000230481_sumstats.txt --clump-p1 1e-06 --clump-p2 1e-05 --clump-r2 0.1 --clump-kb 500 --out gene_clumped/ENSG00000230481_clumped
Clumping completed successfully
Running PLINK clumping...
Command: plink --bfile chr22_genotypes --clump gene_clumped/ENSG00000253691_sumstats.txt --clump-p1 1e-06 --clump-p2 1e-05 --clump-r2 0.1 --clump-kb 500 --out gene_clumped/ENSG00000253691_clumped
Clumping completed successfully
Running PLINK clumping...
Command: plink --bfile chr22_genotypes --clump gene_clumped/ENSG00000253460_sumstats.txt --clump-p1 1e-06 --clump-p2 1e-05 --clump-r2 0.1 --clump-kb 500 --out gene_clumped/ENSG00000253460_clumped
Clumping completed successfully
Running PLINK clumping...
Command: plink --bfile chr22_genotypes --clump gene_clumped/ENSG00000254264_sumstats.txt --clump-p1 1e-06 --clump-p2 1e-05 --clump-r2 0.1 --clump-kb 500 --out gene_clumped/ENSG00000254264_clumped
Clumping completed successfully
Running PLINK clumping...
Command: plink --bfile chr22_genotypes --clump gene_clumped/ENSG00000253481_sumstats.txt --clump-p1 1e-06 --clump-p2 1e-05 --clump-r2 0.1 --clump-kb 500 --out gene_clumped/ENSG00000253481_clumped
Clumping completed successfully
Running PLINK clumping...
Command: plink --bfile chr22_genotypes --clump gene_clumped/ENSG00000215568_sumstats.txt --clump-p1 1e-06 --clump-p2 1e-05 --clump-r2 0.1 --clump-kb 500 --out gene_clumped/ENSG00000215568_clumped
Clumping completed successfully
Running PLINK clumping...
Command: plink --bfile chr22_genotypes --clump gene_clumped/ENSG00000273442_sumstats.txt --clump-p1 1e-06 --clump-p2 1e-05 --clump-r2 0.1 --clump-kb 500 --out gene_clumped/ENSG00000273442_clumped
Clumping completed successfully
Running PLINK clumping...
Command: plink --bfile chr22_genotypes --clump gene_clumped/ENSG00000177663_sumstats.txt --clump-p1 1e-06 --clump-p2 1e-05 --clump-r2 0.1 --clump-kb 500 --out gene_clumped/ENSG00000177663_clumped
Clumping completed successfully
Running PLINK clumping...
Command: plink --bfile chr22_genotypes --clump gene_clumped/ENSG00000183307_sumstats.txt --clump-p1 1e-06 --clump-p2 1e-05 --clump-r2 0.1 --clump-kb 500 --out gene_clumped/ENSG00000183307_clumped
Clumping completed successfully
Running PLINK clumping...
Command: plink --bfile chr22_genotypes --clump gene_clumped/ENSG00000235478_sumstats.txt --clump-p1 1e-06 --clump-p2 1e-05 --clump-r2 0.1 --clump-kb 500 --out gene_clumped/ENSG00000235478_clumped
Clumping completed successfully
Identified 4 independent signals

Index SNPs:
   CHR                    SNP        BP        P  TOTAL  NSIG  S05  S01
0   22        22_17197417_A_G  17197417  0.00124     39    26    9    4
1   22  22_16940689_T_TGTGTGC  16940689  0.00164     50     6   20   24
2   22        22_17395142_G_A  17395142  0.00365     26    14    9    3
3   22        22_17511574_A_G  17511574  0.00835     46    39    5    2

Gene-level clumping summary:
count    1.0
mean     4.0
std      NaN
min      4.0
25%      4.0
50%      4.0
75%      4.0
max      4.0
dtype: float64

Integration with cellink workflows#

Finally, let’s demonstrate how to integrate clumped results back into the cellink ecosystem.

def add_clumping_to_gdata(gdata, clumped_results, key="clumping_info"):
    """
    Add clumping information to the variant annotations in gdata.

    Parameters
    ----------
    gdata : AnnData
        Genotype data object (dd.G)
    clumped_results : pd.DataFrame
        PLINK clumped output
    key : str
        Key to store clumping info in gdata.var
    """
    index_snps = set(clumped_results["SNP"])
    gdata.var[f"{key}_is_index"] = gdata.var.index.isin(index_snps)

    clump_size_map = clumped_results.set_index("SNP")["TOTAL"].to_dict()
    gdata.var[f"{key}_clump_size"] = gdata.var.index.map(clump_size_map)

    nsig_map = clumped_results.set_index("SNP")["NSIG"].to_dict()
    gdata.var[f"{key}_n_sig_secondary"] = gdata.var.index.map(nsig_map)

    print(f"Added clumping annotations to gdata.var with prefix '{key}_'")
    print(f"Index SNPs marked: {gdata.var[f'{key}_is_index'].sum()}")


add_clumping_to_gdata(dd.G, clumped_results, key="eqtl_clump")

dd.G
Added clumping annotations to gdata.var with prefix 'eqtl_clump_'
Index SNPs marked: 1
AnnData object with n_obs × n_vars = 100 × 136776
    obs: 'donor_id', 'sex', 'age'
    var: 'chrom', 'pos', 'a0', 'a1', 'AC', 'AC_Hemi', 'AC_Het', 'AC_Hom', 'AF', 'AN', 'ER2', 'ExcHet', 'HWE', 'IMPUTED', 'maf', 'NS', 'R2', 'TYPED', 'TYPED_ONLY', 'id', 'id_mask', 'length', 'quality', 'pos_hg19', 'id_hg19', 'eqtl_clump_is_index', 'eqtl_clump_clump_size', 'eqtl_clump_n_sig_secondary'
    uns: 'kinship'
    obsm: 'gPCs', 'PB_CD8 Naive'
    varm: 'filter'

previous

Tutorial: Rare Variant Association Testing with cellink

next

Tutorial: Colocalization Analysis - Linking eQTLs to GWAS Signals with cellink

Contents
  • Environment Setup
  • Load Data and Perform QTL Analysis
  • Quick eQTL Analysis
  • Approach 1: Post-QTL Clumping with PLINK
    • Step 1: Prepare Summary Statistics for Clumping
    • Step 2: Run PLINK Clumping
    • Step 3: Parse and Interpret Clumping Results
    • Step 4: Annotate Results with Clumping Information
  • Approach 2: Pre-Analysis LD Pruning
    • Gene-Level Analysis with Clumping
    • Integration with cellink workflows

By Jan Engelmann, Lucas Arnoldt, Eva Holtkamp

© Copyright 2026, Theislab..