Skip to main content
Ctrl+K

cellink

  • Tutorials
    • Tutorial: Pseudobulk eQTL Analysis with cellink
    • Tutorial: eQTL Analysis with JaxQTL and TensorQTL using cellink
    • Tutorial: Annotating Genetic Variants with cellink
    • Tutorial: Rare Variant Association Testing with cellink
    • Tutorial: LD Clumping and Identifying Independent Signals with cellink
    • Tutorial: Colocalization Analysis - Linking eQTLs to GWAS Signals with cellink
    • Tutorial: Integrating GWAS with Single-Cell Data using cellink
    • Tutorial: Spatially Resolved GWAS Mapping with gsMap
    • Tutorial: eQTL Analysis with SAIGE-QTL using cellink
    • Tutorial: Using EHR Data as Donor-Level Input in cellink
    • Tutorial: Using the MILDataset and PyTorch DataLoader in cellink
  • API
    • DonorData
      • cellink.DonorData
    • Preprocessing pp
      • cellink.pp.variant_qc
      • cellink.pp.cell_level_obs_filter
      • cellink.pp.donor_level_obs_filter
      • cellink.pp.donor_level_var_filter
      • cellink.pp.log_transform
      • cellink.pp.low_abundance_filter
      • cellink.pp.missing_values_filter
      • cellink.pp.normalize
    • Input-Output io
      • cellink.io.from_sgkit_dataset
      • cellink.io.read_plink
      • cellink.io.read_bgen
      • cellink.io.read_sgkit_zarr
      • cellink.io.read_pgen_zarr
      • cellink.io.stream_pgen_to_zarr
      • cellink.io.to_plink
      • cellink.io.write_variants_to_vcf
    • Tools tl
      • cellink.tl.get_snp_df
      • cellink.tl.run_favor
      • cellink.tl.run_snpeff
      • cellink.tl.run_vep
      • cellink.tl.add_vep_annos_to_gdata
      • cellink.tl.combine_annotations
      • cellink.tl.aggregate_annotations_for_varm
      • cellink.tl.run_burden_test
      • cellink.tl.run_skat_test
      • cellink.tl.beta_weighting
    • External tools tl.external
      • cellink.tl.external.calculate_ld
      • cellink.tl.external.run_jaxqtl
      • cellink.tl.external.read_jaxqtl_results
      • cellink.tl.external.run_mixmil
      • cellink.tl.external.calculate_pcs
      • cellink.tl.external.run_tensorqtl
      • cellink.tl.external.read_tensorqtl_results
      • cellink.tl.external.run_scdrs
      • cellink.tl.external.run_seismic
      • cellink.tl.external.run_magma_pipeline
      • cellink.tl.external.run_saigeqtl
      • cellink.tl.external.configure_saigeqtl_runner
      • cellink.tl.external.get_saigeqtl_runner
      • cellink.tl.external.make_group_file
      • cellink.tl.external.read_saigeqtl_results
      • cellink.tl.external.load_gsmap_results
      • cellink.tl.external.format_gsmap_sumstats
    • Plotting
      • cellink.pl.locus
      • cellink.pl.manhattan
      • cellink.pl.qq
      • cellink.pl.expression_by_genotype
      • cellink.pl.volcano
    • Machine Learning ml
      • cellink.ml.MILDataset
      • cellink.ml.mil_collate_fn
      • cellink.ml.DonorMILModel
    • Association Testing at
      • cellink.at.acat_test
      • cellink.at.compute_acat
      • cellink.at.GWAS
      • cellink.at.Skat
    • Utils
      • cellink.utils.column_normalize
      • cellink.utils.gaussianize
      • cellink.utils.one_hot_encode_genotypes
      • cellink.utils.dosage_per_strand
    • Resources
      • cellink.resources.get_1000genomes
      • cellink.resources.get_1000genomes_grch38
      • cellink.resources.get_dummy_onek1k
      • cellink.resources.get_onek1k
      • cellink.resources.get_eqtl_catalog_dataset_associations
      • cellink.resources.get_eqtl_catalog_datasets
      • cellink.resources.get_gwas_catalog_studies
      • cellink.resources.get_gwas_catalog_study
      • cellink.resources.get_gwas_catalog_study_summary_stats
      • cellink.resources.get_pgs_catalog_score
      • cellink.resources.get_pgs_catalog_scores
      • cellink.resources.get_1000genomes_ld_scores
      • cellink.resources.get_1000genomes_ld_weights
  • Changelog
  • Contributing guide
  • References
  • .ipynb

Tutorial: Rare Variant Association Testing with cellink

Contents

  • Environment Setup
  • Load and Prepare Data
  • Normalize and Filter Expression Data
  • Step 1: Annotate Variants with Functional Information
  • Step 2: Burden Testing
    • Filtering for Rare Variants
    • Custom MAF Weights
    • Run burden tests using each annotation individually for weighting
  • Step 3: Combine Multiple Tests with ACAT
  • Step 4: SKAT

Tutorial: Rare Variant Association Testing with cellink#

This tutorial demonstrates how to perform rare variant association testing (RVAT) in single-cell data using the cellink package. We focus on two major types of RVAT: burden tests and the Sequence Kernel Association Test (SKAT). Rare variants can be especially informative for understanding disease biology, but they require careful aggregation and testing approaches due to their low frequency.

We’ll walk through how to:

  • Prepare single-cell and genotype data using DonorData objects.

  • Apply burden tests with multiple variant annotation-based weights.

  • Combine p-values from multiple tests using ACAT.

  • Perform SKAT for gene-level rare variant analysis.

This tutorial uses a subset of data from the 1k1k project, filtered to the CD8 Naive T cell type, and focuses on chromosome 22 for runtime feasibility. It builds on earlier tutorials that cover pseudobulk expression and variant annotation, which should be reviewed for full context.

Note: This notebook assumes that you have already run the variant annotation step as described in the Variant Annotation Tutorial.

Prerequisites: To run this notebook, you need to install the required dependencies for RVAT. Use the following command to install the necessary dependencies:

pip install -e "cellink[rvat, datasets]"
conda install -c conda-forge chiscore

Environment Setup#

We begin by importing necessary libraries and defining key parameters for our analysis. cellink provides utilities that extend AnnData to handle both donor-level genotype data and single-cell RNA-seq data efficiently.

%load_ext autoreload
%autoreload 2
import pandas as pd
from pathlib import Path
import warnings

import anndata as ad
import scanpy as sc
import dask.array as da
import numpy as np
from tqdm.auto import tqdm

import cellink as cl
from cellink._core import DAnn
from cellink.tl._rvat import run_burden_test, run_skat_test, beta_weighting
from cellink.utils import column_normalize, gaussianize
from cellink.at.acat import acat_test
from cellink.resources import get_dummy_onek1k
DATA = Path(cl.__file__).parent.parent.parent / "docs/tutorials/data"
n_gpcs = 20
n_epcs = 15
batch_e_pcs_n_top_genes = 2000
chrom = 22
cis_window = 100_000
cell_type = "CD4 Naive"
pb_gex_key = f"PB_{cell_type}"  # pseudobulk expression in dd.G.obsm[key_added]
original_donor_col = "donor_id"
min_percent_donors_expressed = 0.1
celltype_key = "predicted.celltype.l2"
do_debug = False

Load and Prepare Data#

Here, we load a prepared dataset (onek1k) that includes genotype and expression information from human donors. (This is a subset of the full OneK1K dataset, which can be downloaded, and prepared using get_onek1k()) We also extract gene annotations using Ensembl via pybiomart, which are essential for defining cis-windows during eQTL analysis.

dd = get_dummy_onek1k(config_path="../../src/cellink/resources/config/dummy_onek1k.yaml", verify_checksum=False)
dd
[2026-04-02 08:10:16,307] INFO:root: /home/dnanexus/cellink_data/dummy_onek1k/dummy_onek1k.dd.h5 already exists
[2026-04-02 08:10:16,308] WARNING:root: No checksum provided, skipping verification
[2026-04-02 08:10:18,425] INFO:root: Loaded dummy OneK1K dataset: (100, 146939, 125366, 34073)
╔═ DonorData(n_donors=100, n_cells_per_donor=[613-2,731], donor_id='donor_id') ═══════════════════════════════╗
║ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ║
║ ┃ G (donors)                                         ┃ C (cells)                                          ┃ ║
║ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ ║
║ │ AnnData object with n_obs × n_vars = 100 × 146,939 │ View of AnnData object with n_obs × n_vars =       │ ║
║ │                                                    │ 125,366 × 34,073                                   │ ║
║ │     var: 'chrom', 'pos', 'a0', 'a1', 'AC',         │     obs: 'orig.ident', 'nCount_RNA',               │ ║
║ │ 'AC_Hemi', 'AC_Het', 'AC_Hom', 'AF', 'AN', 'ER2',  │ 'nFeature_RNA', 'percent.mt', 'donor_id',          │ ║
║ │ 'ExcHet', 'HWE', 'IMPUTED', 'maf', 'NS', 'R2',     │ 'pool_number', 'predicted.celltype.l2',            │ ║
║ │ 'TYPED', 'TYPED_ONLY', 'id', 'id_mask', 'length',  │ 'predicted.celltype.l2.score', 'age',              │ ║
║ │ 'quality', 'pos_hg19', 'id_hg19'                   │ 'organism_ontology_term_id',                       │ ║
║ │                                                    │ 'tissue_ontology_term_id',                         │ ║
║ │                                                    │ 'assay_ontology_term_id',                          │ ║
║ │                                                    │ 'disease_ontology_term_id',                        │ ║
║ │                                                    │ 'cell_type_ontology_term_id',                      │ ║
║ │                                                    │ 'self_reported_ethnicity_ontology_term_id',        │ ║
║ │                                                    │ 'development_stage_ontology_term_id',              │ ║
║ │                                                    │ 'sex_ontology_term_id', 'is_primary_data',         │ ║
║ │                                                    │ 'suspension_type', 'tissue_type', 'cell_type',     │ ║
║ │                                                    │ 'assay', 'disease', 'organism', 'sex', 'tissue',   │ ║
║ │                                                    │ 'self_reported_ethnicity', 'development_stage',    │ ║
║ │                                                    │ 'observation_joinid'                               │ ║
║ │     uns: 'kinship'                                 │     var: 'vst.mean', 'vst.variance',               │ ║
║ │                                                    │ 'vst.variance.expected',                           │ ║
║ │                                                    │ 'vst.variance.standardized', 'vst.variable',       │ ║
║ │                                                    │ 'feature_is_filtered', 'feature_name',             │ ║
║ │                                                    │ 'feature_reference', 'feature_biotype',            │ ║
║ │                                                    │ 'feature_length', 'feature_type', 'start', 'end',  │ ║
║ │                                                    │ 'chrom'                                            │ ║
║ │     obsm: 'gPCs'                                   │     uns: 'cell_type_ontology_term_id_colors',      │ ║
║ │                                                    │ 'citation', 'default_embedding',                   │ ║
║ │                                                    │ 'schema_reference', 'schema_version', 'title'      │ ║
║ │     varm: 'filter'                                 │     obsm: 'X_azimuth_spca', 'X_azimuth_umap',      │ ║
║ │                                                    │ 'X_harmony', 'X_pca', 'X_umap'                     │ ║
║ │                                                    │     varm: 'PCs'                                    │ ║
║ └────────────────────────────────────────────────────┴────────────────────────────────────────────────────┘ ║
╚═════════════════════════════════════════════════════════════════════════════════════════════════════════════╝

"""
def _get_ensembl_gene_id_start_end_chr():
    from pybiomart import Server

    server = Server(host="http://www.ensembl.org")
    dataset = server.marts["ENSEMBL_MART_ENSEMBL"].datasets["hsapiens_gene_ensembl"]
    ensembl_gene_id_start_end_chr = dataset.query(
        attributes=["ensembl_gene_id", "start_position", "end_position", "chromosome_name"]
    )
    ensembl_gene_id_start_end_chr = ensembl_gene_id_start_end_chr.set_index("Gene stable ID")
    ensembl_gene_id_start_end_chr = ensembl_gene_id_start_end_chr.rename(
        columns={
            "Gene start (bp)": GAnn.start,
            "Gene end (bp)": GAnn.end,
            "Chromosome/scaffold name": GAnn.chrom,
        }
    )
    return ensembl_gene_id_start_end_chr
"""
'\ndef _get_ensembl_gene_id_start_end_chr():\n    from pybiomart import Server\n\n    server = Server(host="http://www.ensembl.org")\n    dataset = server.marts["ENSEMBL_MART_ENSEMBL"].datasets["hsapiens_gene_ensembl"]\n    ensembl_gene_id_start_end_chr = dataset.query(\n        attributes=["ensembl_gene_id", "start_position", "end_position", "chromosome_name"]\n    )\n    ensembl_gene_id_start_end_chr = ensembl_gene_id_start_end_chr.set_index("Gene stable ID")\n    ensembl_gene_id_start_end_chr = ensembl_gene_id_start_end_chr.rename(\n        columns={\n            "Gene start (bp)": GAnn.start,\n            "Gene end (bp)": GAnn.end,\n            "Chromosome/scaffold name": GAnn.chrom,\n        }\n    )\n    return ensembl_gene_id_start_end_chr\n'
"""
ensembl_gene_id_start_end_chr = _get_ensembl_gene_id_start_end_chr()
ensembl_gene_id_start_end_chr
"""
'\nensembl_gene_id_start_end_chr = _get_ensembl_gene_id_start_end_chr()\nensembl_gene_id_start_end_chr\n'
"""
dd.C.var = dd.C.var.join(ensembl_gene_id_start_end_chr)
dd.C.obs[DAnn.donor] = dd.C.obs[original_donor_col]
"""
dd.G.obsm["gPCs"] = dd.G.obsm["gPCs"][dd.G.obsm["gPCs"].columns[:n_gpcs]]

Normalize and Filter Expression Data#

We normalize single-cell expression using total-count normalization and log transformation. We then filter the dataset to only include cells of a specific type (CD8 Naive). This step ensures we’re analyzing homogeneous populations, which increases power in eQTL detection.

sc.pp.normalize_total(dd.C)
sc.pp.log1p(dd.C)
sc.pp.normalize_total(dd.C)

# are the expression pcs computed by pseudobulking across all cell types?
mdata = sc.get.aggregate(dd.C, by=DAnn.donor, func="mean")
mdata.X = mdata.layers.pop("mean")

sc.pp.highly_variable_genes(mdata, n_top_genes=batch_e_pcs_n_top_genes)
sc.tl.pca(mdata, n_comps=n_epcs)

dd.G.obsm["ePCs"] = mdata[dd.G.obs_names].obsm["X_pca"]
dd = dd[..., dd.C.obs[celltype_key] == cell_type, :].copy()
dd
╔═ DonorData(n_donors=981, n_cells_per_donor=[6-927], donor_id='donor_id') ═══════════════════════════════════╗
║ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ║
║ ┃ G (donors)                                         ┃ C (cells)                                          ┃ ║
║ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ ║
║ │ AnnData object with n_obs × n_vars = 981 × 136,776 │ AnnData object with n_obs × n_vars = 259,012 ×     │ ║
║ │                                                    │ 36,469                                             │ ║
║ │     var: 'chrom', 'pos', 'a0', 'a1', 'AC',         │     obs: 'orig.ident', 'nCount_RNA',               │ ║
║ │ 'AC_Hemi', 'AC_Het', 'AC_Hom', 'AF', 'AN', 'ER2',  │ 'nFeature_RNA', 'percent.mt', 'donor_id',          │ ║
║ │ 'ExcHet', 'HWE', 'IMPUTED', 'maf', 'NS', 'R2',     │ 'pool_number', 'predicted.celltype.l2',            │ ║
║ │ 'TYPED', 'TYPED_ONLY', 'id', 'id_mask', 'length',  │ 'predicted.celltype.l2.score', 'age',              │ ║
║ │ 'quality', 'pos_hg19', 'id_hg19'                   │ 'organism_ontology_term_id',                       │ ║
║ │                                                    │ 'tissue_ontology_term_id',                         │ ║
║ │                                                    │ 'assay_ontology_term_id',                          │ ║
║ │                                                    │ 'disease_ontology_term_id',                        │ ║
║ │                                                    │ 'cell_type_ontology_term_id',                      │ ║
║ │                                                    │ 'self_reported_ethnicity_ontology_term_id',        │ ║
║ │                                                    │ 'development_stage_ontology_term_id',              │ ║
║ │                                                    │ 'sex_ontology_term_id', 'is_primary_data',         │ ║
║ │                                                    │ 'suspension_type', 'tissue_type', 'cell_type',     │ ║
║ │                                                    │ 'assay', 'disease', 'organism', 'sex', 'tissue',   │ ║
║ │                                                    │ 'self_reported_ethnicity', 'development_stage',    │ ║
║ │                                                    │ 'observation_joinid'                               │ ║
║ │     uns: 'kinship'                                 │     var: 'vst.mean', 'vst.variance',               │ ║
║ │                                                    │ 'vst.variance.expected',                           │ ║
║ │                                                    │ 'vst.variance.standardized', 'vst.variable',       │ ║
║ │                                                    │ 'feature_is_filtered', 'feature_name',             │ ║
║ │                                                    │ 'feature_reference', 'feature_biotype',            │ ║
║ │                                                    │ 'feature_length', 'feature_type', 'start', 'end',  │ ║
║ │                                                    │ 'chrom'                                            │ ║
║ │     obsm: 'gPCs', 'ePCs'                           │     uns: 'cell_type_ontology_term_id_colors',      │ ║
║ │                                                    │ 'citation', 'default_embedding',                   │ ║
║ │                                                    │ 'schema_reference', 'schema_version', 'title',     │ ║
║ │                                                    │ 'log1p'                                            │ ║
║ │     varm: 'filter'                                 │     obsm: 'X_azimuth_spca', 'X_azimuth_umap',      │ ║
║ │                                                    │ 'X_harmony', 'X_pca', 'X_umap'                     │ ║
║ │                                                    │     varm: 'PCs'                                    │ ║
║ └────────────────────────────────────────────────────┴────────────────────────────────────────────────────┘ ║
╚═════════════════════════════════════════════════════════════════════════════════════════════════════════════╝

dd.aggregate(key_added=pb_gex_key, sync_var=True, verbose=True)
dd.aggregate(obs=["sex", "age"], func="first", add_to_obs=True)
dd
╔═ DonorData(n_donors=981, n_cells_per_donor=[6-927], donor_id='donor_id') ═══════════════════════════════════╗
║ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ║
║ ┃ G (donors)                                         ┃ C (cells)                                          ┃ ║
║ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ ║
║ │ AnnData object with n_obs × n_vars = 981 × 136,776 │ AnnData object with n_obs × n_vars = 259,012 ×     │ ║
║ │                                                    │ 36,469                                             │ ║
║ │     obs: 'sex', 'age'                              │     obs: 'orig.ident', 'nCount_RNA',               │ ║
║ │                                                    │ 'nFeature_RNA', 'percent.mt', 'donor_id',          │ ║
║ │                                                    │ 'pool_number', 'predicted.celltype.l2',            │ ║
║ │                                                    │ 'predicted.celltype.l2.score', 'age',              │ ║
║ │                                                    │ 'organism_ontology_term_id',                       │ ║
║ │                                                    │ 'tissue_ontology_term_id',                         │ ║
║ │                                                    │ 'assay_ontology_term_id',                          │ ║
║ │                                                    │ 'disease_ontology_term_id',                        │ ║
║ │                                                    │ 'cell_type_ontology_term_id',                      │ ║
║ │                                                    │ 'self_reported_ethnicity_ontology_term_id',        │ ║
║ │                                                    │ 'development_stage_ontology_term_id',              │ ║
║ │                                                    │ 'sex_ontology_term_id', 'is_primary_data',         │ ║
║ │                                                    │ 'suspension_type', 'tissue_type', 'cell_type',     │ ║
║ │                                                    │ 'assay', 'disease', 'organism', 'sex', 'tissue',   │ ║
║ │                                                    │ 'self_reported_ethnicity', 'development_stage',    │ ║
║ │                                                    │ 'observation_joinid'                               │ ║
║ │     var: 'chrom', 'pos', 'a0', 'a1', 'AC',         │     var: 'vst.mean', 'vst.variance',               │ ║
║ │ 'AC_Hemi', 'AC_Het', 'AC_Hom', 'AF', 'AN', 'ER2',  │ 'vst.variance.expected',                           │ ║
║ │ 'ExcHet', 'HWE', 'IMPUTED', 'maf', 'NS', 'R2',     │ 'vst.variance.standardized', 'vst.variable',       │ ║
║ │ 'TYPED', 'TYPED_ONLY', 'id', 'id_mask', 'length',  │ 'feature_is_filtered', 'feature_name',             │ ║
║ │ 'quality', 'pos_hg19', 'id_hg19'                   │ 'feature_reference', 'feature_biotype',            │ ║
║ │                                                    │ 'feature_length', 'feature_type', 'start', 'end',  │ ║
║ │                                                    │ 'chrom'                                            │ ║
║ │     uns: 'kinship'                                 │     uns: 'cell_type_ontology_term_id_colors',      │ ║
║ │                                                    │ 'citation', 'default_embedding',                   │ ║
║ │                                                    │ 'schema_reference', 'schema_version', 'title',     │ ║
║ │                                                    │ 'log1p'                                            │ ║
║ │     obsm: 'gPCs', 'ePCs', 'PB_CD4 Naive'           │     obsm: 'X_azimuth_spca', 'X_azimuth_umap',      │ ║
║ │                                                    │ 'X_harmony', 'X_pca', 'X_umap'                     │ ║
║ │     varm: 'filter'                                 │     varm: 'PCs'                                    │ ║
║ └────────────────────────────────────────────────────┴────────────────────────────────────────────────────┘ ║
╚═════════════════════════════════════════════════════════════════════════════════════════════════════════════╝

print(f"{pb_gex_key} shape:", dd.G.obsm[pb_gex_key].shape)
print("dd.shape:", dd.shape)

keep_genes = ((dd.G.obsm[pb_gex_key] > 0).mean(axis=0) >= min_percent_donors_expressed).values
dd = dd[..., keep_genes]
print("after filtering")
print(f"{pb_gex_key} shape:", dd.G.obsm[pb_gex_key].shape)
print("dd.shape:", dd.shape)
PB_CD4 Naive shape: (981, 36469)
dd.shape: (981, 136776, 259012, 36469)
after filtering
PB_CD4 Naive shape: (981, 15770)
dd.shape: (981, 136776, 259012, 15770)
# alternative to dd[:, dd.G.var.chrom == str(chrom), :, dd.C.var.chrom == str(chrom)]
dd = dd.sel(G_var=dd.G.var.chrom == str(chrom), C_var=dd.C.var.chrom == str(chrom)).copy()
dd
╔═ DonorData(n_donors=981, n_cells_per_donor=[6-927], donor_id='donor_id') ═══════════════════════════════════╗
║ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ║
║ ┃ G (donors)                                         ┃ C (cells)                                          ┃ ║
║ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ ║
║ │ AnnData object with n_obs × n_vars = 981 × 136,776 │ AnnData object with n_obs × n_vars = 259,012 × 404 │ ║
║ │     obs: 'sex', 'age'                              │     obs: 'orig.ident', 'nCount_RNA',               │ ║
║ │                                                    │ 'nFeature_RNA', 'percent.mt', 'donor_id',          │ ║
║ │                                                    │ 'pool_number', 'predicted.celltype.l2',            │ ║
║ │                                                    │ 'predicted.celltype.l2.score', 'age',              │ ║
║ │                                                    │ 'organism_ontology_term_id',                       │ ║
║ │                                                    │ 'tissue_ontology_term_id',                         │ ║
║ │                                                    │ 'assay_ontology_term_id',                          │ ║
║ │                                                    │ 'disease_ontology_term_id',                        │ ║
║ │                                                    │ 'cell_type_ontology_term_id',                      │ ║
║ │                                                    │ 'self_reported_ethnicity_ontology_term_id',        │ ║
║ │                                                    │ 'development_stage_ontology_term_id',              │ ║
║ │                                                    │ 'sex_ontology_term_id', 'is_primary_data',         │ ║
║ │                                                    │ 'suspension_type', 'tissue_type', 'cell_type',     │ ║
║ │                                                    │ 'assay', 'disease', 'organism', 'sex', 'tissue',   │ ║
║ │                                                    │ 'self_reported_ethnicity', 'development_stage',    │ ║
║ │                                                    │ 'observation_joinid'                               │ ║
║ │     var: 'chrom', 'pos', 'a0', 'a1', 'AC',         │     var: 'vst.mean', 'vst.variance',               │ ║
║ │ 'AC_Hemi', 'AC_Het', 'AC_Hom', 'AF', 'AN', 'ER2',  │ 'vst.variance.expected',                           │ ║
║ │ 'ExcHet', 'HWE', 'IMPUTED', 'maf', 'NS', 'R2',     │ 'vst.variance.standardized', 'vst.variable',       │ ║
║ │ 'TYPED', 'TYPED_ONLY', 'id', 'id_mask', 'length',  │ 'feature_is_filtered', 'feature_name',             │ ║
║ │ 'quality', 'pos_hg19', 'id_hg19'                   │ 'feature_reference', 'feature_biotype',            │ ║
║ │                                                    │ 'feature_length', 'feature_type', 'start', 'end',  │ ║
║ │                                                    │ 'chrom'                                            │ ║
║ │     uns: 'kinship'                                 │     uns: 'cell_type_ontology_term_id_colors',      │ ║
║ │                                                    │ 'citation', 'default_embedding',                   │ ║
║ │                                                    │ 'schema_reference', 'schema_version', 'title',     │ ║
║ │                                                    │ 'log1p'                                            │ ║
║ │     obsm: 'gPCs', 'ePCs', 'PB_CD4 Naive'           │     obsm: 'X_azimuth_spca', 'X_azimuth_umap',      │ ║
║ │                                                    │ 'X_harmony', 'X_pca', 'X_umap'                     │ ║
║ │     varm: 'filter'                                 │     varm: 'PCs'                                    │ ║
║ └────────────────────────────────────────────────────┴────────────────────────────────────────────────────┘ ║
╚═════════════════════════════════════════════════════════════════════════════════════════════════════════════╝

Step 1: Annotate Variants with Functional Information#

To inform the weighting in burden tests, we annotate variants using the VEP tool. These annotations include predicted functional consequences (e.g., missense, stop gained), CADD scores, and distance to gene transcription start sites (TSS).

This information allows biologically meaningful prioritization of variants when aggregating their effects.

vep_annotation_file = DATA / "variant_annotation/variants_vep_annotated.txt"
cl.tl.add_vep_annos_to_gdata(vep_anno_file=vep_annotation_file, gdata=dd.G, dummy_consequence=True)
dd.G.uns["variant_annotation_vep"]
Consequence_3_prime_UTR_variant Consequence_5_prime_UTR_variant Consequence_NMD_transcript_variant Consequence_coding_sequence_variant Consequence_downstream_gene_variant Consequence_frameshift_variant Consequence_inframe_deletion Consequence_inframe_insertion Consequence_intergenic_variant Consequence_intron_variant ... Codons cDNA_position gnomADe_ASJ_AF CDS_position gnomADe_AF IMPACT BIOTYPE CLIN_SIG Protein_position PHENO
snp_id gene_id transcript_id
22_16388891_G_A ENSG00000230643 ENST00000447704 0 0 0 0 1 0 0 0 0 0 ... - - NaN - NaN MODIFIER unprocessed_pseudogene - - -
22_16388968_C_T ENSG00000230643 ENST00000447704 0 0 0 0 1 0 0 0 0 0 ... - - NaN - NaN MODIFIER unprocessed_pseudogene - - -
22_16389525_A_G ENSG00000230643 ENST00000447704 0 0 0 0 0 0 0 0 0 0 ... - 78/118 NaN - NaN MODIFIER unprocessed_pseudogene - - -
22_16390411_G_A ENSG00000230643 ENST00000447704 0 0 0 0 0 0 0 0 0 0 ... - - NaN - NaN MODIFIER unprocessed_pseudogene - - -
22_16391555_G_C ENSG00000230643 ENST00000447704 0 0 0 0 0 0 0 0 0 0 ... - - NaN - NaN MODIFIER unprocessed_pseudogene - - -
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
22_50796327_CCA_C ENSG00000100239 ENST00000395741 0 0 0 0 0 0 0 0 0 1 ... - - NaN - NaN MODIFIER protein_coding - - -
22_50798021_A_G ENSG00000100239 ENST00000395741 0 0 0 0 0 0 0 0 0 1 ... - - NaN - NaN MODIFIER protein_coding - - -
22_50798635_T_C ENSG00000100239 ENST00000395741 0 0 0 0 0 0 0 0 0 1 ... - - NaN - NaN MODIFIER protein_coding - - -
22_50799821_A_C ENSG00000100239 ENST00000395741 0 0 0 0 0 0 0 0 0 1 ... - - NaN - NaN MODIFIER protein_coding - - -
22_50802428_C_G ENSG00000100239 ENST00000395741 0 0 0 0 0 0 0 0 0 1 ... - - NaN - NaN MODIFIER protein_coding - - -

169322 rows × 57 columns

cl.tl.aggregate_annotations_for_varm(
    dd.G, "variant_annotation_vep", agg_type="first", return_data=True
)  # TODO change agg type
gene_id transcript_id Consequence_3_prime_UTR_variant Consequence_5_prime_UTR_variant Consequence_NMD_transcript_variant Consequence_coding_sequence_variant Consequence_downstream_gene_variant Consequence_frameshift_variant Consequence_inframe_deletion Consequence_inframe_insertion ... Codons cDNA_position gnomADe_ASJ_AF CDS_position gnomADe_AF IMPACT BIOTYPE CLIN_SIG Protein_position PHENO
snp_id
22_16388891_G_A ENSG00000230643 ENST00000447704 0 0 0 0 1 0 0 0 ... - - NaN - NaN MODIFIER unprocessed_pseudogene - - -
22_16388968_C_T ENSG00000230643 ENST00000447704 0 0 0 0 1 0 0 0 ... - - NaN - NaN MODIFIER unprocessed_pseudogene - - -
22_16389525_A_G ENSG00000230643 ENST00000447704 0 0 0 0 0 0 0 0 ... - 78/118 NaN - NaN MODIFIER unprocessed_pseudogene - - -
22_16390411_G_A ENSG00000230643 ENST00000447704 0 0 0 0 0 0 0 0 ... - - NaN - NaN MODIFIER unprocessed_pseudogene - - -
22_16391555_G_C ENSG00000230643 ENST00000447704 0 0 0 0 0 0 0 0 ... - - NaN - NaN MODIFIER unprocessed_pseudogene - - -
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
22_50796327_CCA_C ENSG00000100239 ENST00000395741 0 0 0 0 0 0 0 0 ... - - NaN - NaN MODIFIER protein_coding - - -
22_50798021_A_G ENSG00000100239 ENST00000395741 0 0 0 0 0 0 0 0 ... - - NaN - NaN MODIFIER protein_coding - - -
22_50798635_T_C ENSG00000100239 ENST00000395741 0 0 0 0 0 0 0 0 ... - - NaN - NaN MODIFIER protein_coding - - -
22_50799821_A_C ENSG00000100239 ENST00000395741 0 0 0 0 0 0 0 0 ... - - NaN - NaN MODIFIER protein_coding - - -
22_50802428_C_G ENSG00000100239 ENST00000395741 0 0 0 0 0 0 0 0 ... - - NaN - NaN MODIFIER protein_coding - - -

136776 rows × 59 columns

dd.G.varm["variant_annotation"].columns
Index(['gene_id', 'transcript_id', 'Consequence_3_prime_UTR_variant',
       'Consequence_5_prime_UTR_variant', 'Consequence_NMD_transcript_variant',
       'Consequence_coding_sequence_variant',
       'Consequence_downstream_gene_variant', 'Consequence_frameshift_variant',
       'Consequence_inframe_deletion', 'Consequence_inframe_insertion',
       'Consequence_intergenic_variant', 'Consequence_intron_variant',
       'Consequence_mature_miRNA_variant', 'Consequence_missense_variant',
       'Consequence_non_coding_transcript_exon_variant',
       'Consequence_non_coding_transcript_variant',
       'Consequence_protein_altering_variant',
       'Consequence_splice_acceptor_variant',
       'Consequence_splice_donor_5th_base_variant',
       'Consequence_splice_donor_region_variant',
       'Consequence_splice_donor_variant',
       'Consequence_splice_polypyrimidine_tract_variant',
       'Consequence_splice_region_variant', 'Consequence_start_lost',
       'Consequence_stop_gained', 'Consequence_stop_lost',
       'Consequence_stop_retained_variant', 'Consequence_synonymous_variant',
       'Consequence_upstream_gene_variant', 'gnomADe_AMR_AF', 'DISTANCE',
       'CANONICAL', 'STRAND', 'CADD_RAW', 'ENSP', 'gnomADe_AFR_AF',
       'gnomADe_EAS_AF', 'SOMATIC', 'TSSDistance', 'gnomADe_SAS_AF', 'SIFT',
       'gnomADe_FIN_AF', 'gnomADe_OTH_AF', 'PolyPhen', 'FLAGS', 'Amino_acids',
       'CADD_PHRED', 'Existing_variation', 'gnomADe_NFE_AF', 'Codons',
       'cDNA_position', 'gnomADe_ASJ_AF', 'CDS_position', 'gnomADe_AF',
       'IMPACT', 'BIOTYPE', 'CLIN_SIG', 'Protein_position', 'PHENO'],
      dtype='object')
dd.G.uns["variant_annotation_vep"]["CADD_RAW"].describe()
count    32966.000000
mean         0.122248
std          0.542875
min         -1.705125
25%         -0.115726
50%          0.023902
75%          0.205452
max          9.936896
Name: CADD_RAW, dtype: float64

Step 2: Burden Testing#

We now perform burden tests, which evaluate whether the aggregate effect of rare variants within a gene is associated with a phenotype — in this case, gene expression in CD8 Naive cells. We use different weighting schemes allow us to prioritize variants differently based on functional scores or allele frequency, including:

  • CADD_RAW: Raw CADD scores.

  • maf_beta: Beta weighting based on minor allele frequency (MAF).

  • tss_distance: Distance to the transcription start site (TSS).

  • tss_distance_exp: Exponential weighting based on TSS distance.

burden_agg_fct = "sum"
run_lrt = True
annotation_cols = ["CADD_RAW", "maf_beta", "tss_distance", "tss_distance_exp"]

rare_maf_threshold = 0.05

Filtering for Rare Variants#

Burden tests focus on rare variants—typically those with minor allele frequency (MAF) below 0.01 or 0.05. We filter our genotype data accordingly to isolate variants in this frequency range.

dd = dd.sel(G_var=dd.G.var.maf < rare_maf_threshold).copy()
dd
╔═ DonorData(n_donors=981, n_cells_per_donor=[6-927], donor_id='donor_id') ═══════════════════════════════════╗
║ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ║
║ ┃ G (donors)                                         ┃ C (cells)                                          ┃ ║
║ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ ║
║ │ AnnData object with n_obs × n_vars = 981 × 39,454  │ AnnData object with n_obs × n_vars = 259,012 × 404 │ ║
║ │     obs: 'sex', 'age'                              │     obs: 'orig.ident', 'nCount_RNA',               │ ║
║ │                                                    │ 'nFeature_RNA', 'percent.mt', 'donor_id',          │ ║
║ │                                                    │ 'pool_number', 'predicted.celltype.l2',            │ ║
║ │                                                    │ 'predicted.celltype.l2.score', 'age',              │ ║
║ │                                                    │ 'organism_ontology_term_id',                       │ ║
║ │                                                    │ 'tissue_ontology_term_id',                         │ ║
║ │                                                    │ 'assay_ontology_term_id',                          │ ║
║ │                                                    │ 'disease_ontology_term_id',                        │ ║
║ │                                                    │ 'cell_type_ontology_term_id',                      │ ║
║ │                                                    │ 'self_reported_ethnicity_ontology_term_id',        │ ║
║ │                                                    │ 'development_stage_ontology_term_id',              │ ║
║ │                                                    │ 'sex_ontology_term_id', 'is_primary_data',         │ ║
║ │                                                    │ 'suspension_type', 'tissue_type', 'cell_type',     │ ║
║ │                                                    │ 'assay', 'disease', 'organism', 'sex', 'tissue',   │ ║
║ │                                                    │ 'self_reported_ethnicity', 'development_stage',    │ ║
║ │                                                    │ 'observation_joinid'                               │ ║
║ │     var: 'chrom', 'pos', 'a0', 'a1', 'AC',         │     var: 'vst.mean', 'vst.variance',               │ ║
║ │ 'AC_Hemi', 'AC_Het', 'AC_Hom', 'AF', 'AN', 'ER2',  │ 'vst.variance.expected',                           │ ║
║ │ 'ExcHet', 'HWE', 'IMPUTED', 'maf', 'NS', 'R2',     │ 'vst.variance.standardized', 'vst.variable',       │ ║
║ │ 'TYPED', 'TYPED_ONLY', 'id', 'id_mask', 'length',  │ 'feature_is_filtered', 'feature_name',             │ ║
║ │ 'quality', 'pos_hg19', 'id_hg19'                   │ 'feature_reference', 'feature_biotype',            │ ║
║ │                                                    │ 'feature_length', 'feature_type', 'start', 'end',  │ ║
║ │                                                    │ 'chrom'                                            │ ║
║ │     uns: 'kinship', 'variant_annotation_vep'       │     uns: 'cell_type_ontology_term_id_colors',      │ ║
║ │                                                    │ 'citation', 'default_embedding',                   │ ║
║ │                                                    │ 'schema_reference', 'schema_version', 'title',     │ ║
║ │                                                    │ 'log1p'                                            │ ║
║ │     obsm: 'gPCs', 'ePCs', 'PB_CD4 Naive'           │     obsm: 'X_azimuth_spca', 'X_azimuth_umap',      │ ║
║ │                                                    │ 'X_harmony', 'X_pca', 'X_umap'                     │ ║
║ │     varm: 'filter', 'variant_annotation'           │     varm: 'PCs'                                    │ ║
║ └────────────────────────────────────────────────────┴────────────────────────────────────────────────────┘ ║
╚═════════════════════════════════════════════════════════════════════════════════════════════════════════════╝

Custom MAF Weights#

We add custom MAF weights commonly used in burden tests, such as Beta(MAF, 1, 25). These weights prioritize rarer variants in the analysis. TSS distance weight as used in the SAIGE-QTL paper are added manually for each gene

dd.G.varm["variant_annotation"]["maf_beta"] = beta_weighting(dd.G.var["maf"])

Run burden tests using each annotation individually for weighting#

Next, we specify the covariate matrix F, which includes sex, age, genetic PCs, and expression PCs. This controls for potential confounders in the association analysis. We then iterate over all genes on chromosome 22, define a cis-window of ±500kb, and run a burden test using each of the annotation-based weighting schemes.

# This specifies covariates/fixed effects
F = np.concatenate(
    [
        np.ones((dd.shape[0], 1)),
        dd.G.obs[["sex"]].values - 1,
        dd.G.obs[["age"]].values,
        dd.G.obsm["gPCs"].values,
        dd.G.obsm["ePCs"],
    ],
    axis=1,
).astype(np.float64)
F[:, 2:] = column_normalize(F[:, 2:])
results = []
if isinstance(dd.G.X, da.Array | ad._core.views.DaskArrayView):
    if dd.G.is_view:
        dd._G = dd._G.copy()
    dd.G.X = dd.G.X.compute()

if do_debug:
    warnings.filterwarnings("ignore", category=RuntimeWarning)

for gene, row in tqdm(dd.C.var.iterrows(), total=dd.shape[3]):
    Y = gaussianize(dd.G.obsm[pb_gex_key][[gene]].values.astype(np.float64) + 1e-5 * np.random.randn(dd.shape[0], 1))

    start = max(0, row.start - cis_window)
    end = row.end + cis_window
    _G = dd.G[:, (dd.G.var.pos < end)]
    _G = _G[:, (_G.var.pos > start)]
    _G = _G[:, (_G.X.std(0) != 0)]
    _G = _G.copy()

    # TODO make strand aware
    _G.varm["variant_annotation"]["tss_distance"] = np.abs(row.start - _G.var["pos"])
    _G.varm["variant_annotation"]["tss_distance_exp"] = np.exp(-1e-5 * _G.varm["variant_annotation"]["tss_distance"])

    rdf = run_burden_test(
        _G, Y, F, gene, annotation_cols=annotation_cols, burden_agg_fct=burden_agg_fct, run_lrt=run_lrt
    )
    results.append(rdf)

rdf = pd.concat(results)
rdf
burden_gene egene weight_col burden_agg_fct pv beta betaste lrt
0 ENSG00000206195 ENSG00000206195 CADD_RAW sum NaN NaN NaN NaN
1 ENSG00000206195 ENSG00000206195 maf_beta sum NaN NaN NaN NaN
2 ENSG00000206195 ENSG00000206195 tss_distance sum NaN NaN NaN NaN
3 ENSG00000206195 ENSG00000206195 tss_distance_exp sum NaN NaN NaN NaN
0 ENSG00000177663 ENSG00000177663 CADD_RAW sum NaN NaN NaN NaN
... ... ... ... ... ... ... ... ...
3 ENSG00000100299 ENSG00000100299 tss_distance_exp sum 0.024520 -3.623585e-03 1.611285e-03 5.057453
0 ENSG00000079974 ENSG00000079974 CADD_RAW sum NaN NaN NaN NaN
1 ENSG00000079974 ENSG00000079974 maf_beta sum 0.343694 6.644163e-04 7.016820e-04 0.896602
2 ENSG00000079974 ENSG00000079974 tss_distance sum 0.977858 4.148339e-09 1.494648e-07 0.000770
3 ENSG00000079974 ENSG00000079974 tss_distance_exp sum 0.244656 1.876147e-02 1.612602e-02 1.353567

1616 rows × 8 columns

Step 3: Combine Multiple Tests with ACAT#

To summarize evidence from multiple annotations, we use the ACAT (Aggregated Cauchy Association Test). This meta-analysis method combines p-values from the burden tests per gene into a single statistic, improving power and interpretability.

combined = rdf.dropna(subset=["pv"]).groupby("egene")["pv"].agg(lambda x: acat_test(x.values)).reset_index()
combined.sort_values("pv")
egene pv
84 ENSG00000100219 2.287059e-14
290 ENSG00000212939 2.274714e-11
75 ENSG00000100154 8.218144e-05
186 ENSG00000133460 9.702130e-05
198 ENSG00000167077 1.625786e-04
... ... ...
158 ENSG00000100427 9.859188e-01
338 ENSG00000244625 9.891262e-01
107 ENSG00000100294 9.921173e-01
44 ENSG00000100030 9.949769e-01
27 ENSG00000099957 9.957546e-01

402 rows × 2 columns

Step 4: SKAT#

As an alternative to burden testing, we apply the Sequence Kernel Association Test (SKAT), which models the variance component of aggregated rare variants rather than assuming a unidirectional effect. SKAT is more robust when variant effects differ in direction or magnitude.

Currently, cellink supports SKAT with the standard Beta(MAF, 1, 25) weighting scheme.

import logging

logger = logging.getLogger()
logger.setLevel(logging.WARNING)  # Suppress INFO and DEBUG esp. from SKAT Test
results = []

for gene, row in tqdm(dd.C.var.iterrows(), total=dd.shape[3]):
    Y = gaussianize(dd.G.obsm[pb_gex_key][[gene]].values.astype(np.float64) + 1e-5 * np.random.randn(dd.shape[0], 1))

    start = max(0, row.start - cis_window)
    end = row.end + cis_window
    _G = dd.G[:, (dd.G.var.pos < end)]
    _G = _G[:, (_G.var.pos > start)]
    _G = _G[:, (_G.X.std(0) != 0)]

    rdict = run_skat_test(_G, Y, F, gene)
    results.append(rdict)

rdf = pd.DataFrame(results)
rdf
burden_gene egene weight_col pv
0 ENSG00000206195 ENSG00000206195 maf_beta [[1.0]]
1 ENSG00000177663 ENSG00000177663 maf_beta [[0.7540438404596421]]
2 ENSG00000069998 ENSG00000069998 maf_beta [[0.03897879062889942]]
3 ENSG00000185837 ENSG00000185837 maf_beta [[0.8117376585909591]]
4 ENSG00000093072 ENSG00000093072 maf_beta [[0.9819123716041604]]
... ... ... ... ...
399 ENSG00000205560 ENSG00000205560 maf_beta [[0.02343900249308839]]
400 ENSG00000100288 ENSG00000100288 maf_beta [[0.5147669611167375]]
401 ENSG00000205559 ENSG00000205559 maf_beta [[0.23597911440366892]]
402 ENSG00000100299 ENSG00000100299 maf_beta [[0.0004379433659393861]]
403 ENSG00000079974 ENSG00000079974 maf_beta [[0.03259277644181413]]

404 rows × 4 columns

rdf
burden_gene egene weight_col pv
0 ENSG00000206195 ENSG00000206195 maf_beta [[1.0]]
1 ENSG00000177663 ENSG00000177663 maf_beta [[0.7540438404596421]]
2 ENSG00000069998 ENSG00000069998 maf_beta [[0.03897879062889942]]
3 ENSG00000185837 ENSG00000185837 maf_beta [[0.8117376585909591]]
4 ENSG00000093072 ENSG00000093072 maf_beta [[0.9819123716041604]]
... ... ... ... ...
399 ENSG00000205560 ENSG00000205560 maf_beta [[0.02343900249308839]]
400 ENSG00000100288 ENSG00000100288 maf_beta [[0.5147669611167375]]
401 ENSG00000205559 ENSG00000205559 maf_beta [[0.23597911440366892]]
402 ENSG00000100299 ENSG00000100299 maf_beta [[0.0004379433659393861]]
403 ENSG00000079974 ENSG00000079974 maf_beta [[0.03259277644181413]]

404 rows × 4 columns

previous

Tutorial: Annotating Genetic Variants with cellink

next

Tutorial: LD Clumping and Identifying Independent Signals with cellink

Contents
  • Environment Setup
  • Load and Prepare Data
  • Normalize and Filter Expression Data
  • Step 1: Annotate Variants with Functional Information
  • Step 2: Burden Testing
    • Filtering for Rare Variants
    • Custom MAF Weights
    • Run burden tests using each annotation individually for weighting
  • Step 3: Combine Multiple Tests with ACAT
  • Step 4: SKAT

By Jan Engelmann, Lucas Arnoldt, Eva Holtkamp

© Copyright 2026, Theislab..