Tutorial: Rare Variant Association Testing with `cellink`#

This tutorial demonstrates how to perform rare variant association testing (RVAT) in single-cell data using the cellink package. We focus on two major types of RVAT: burden tests and the Sequence Kernel Association Test (SKAT). Rare variants can be especially informative for understanding disease biology, but they require careful aggregation and testing approaches due to their low frequency.

We’ll walk through how to:

Prepare single-cell and genotype data using DonorData objects.
Apply burden tests with multiple variant annotation-based weights.
Combine p-values from multiple tests using ACAT.
Perform SKAT for gene-level rare variant analysis.

This tutorial uses a subset of data from the 1k1k project, filtered to the CD8 Naive T cell type, and focuses on chromosome 22 for runtime feasibility. It builds on earlier tutorials that cover pseudobulk expression and variant annotation, which should be reviewed for full context.

Note: This notebook assumes that you have already run the variant annotation step as described in the Variant Annotation Tutorial.

Prerequisites: To run this notebook, you need to install the required dependencies for RVAT. Use the following command to install the necessary dependencies:

pip install -e "cellink[rvat, datasets]"
conda install -c conda-forge chiscore

Environment Setup#

We begin by importing necessary libraries and defining key parameters for our analysis. cellink provides utilities that extend AnnData to handle both donor-level genotype data and single-cell RNA-seq data efficiently.

%load_ext autoreload
%autoreload 2

import pandas as pd

from pathlib import Path
import warnings

import anndata as ad
import scanpy as sc
import dask.array as da
import numpy as np
from tqdm.auto import tqdm

import cellink as cl
from cellink._core import DAnn
from cellink.tl._rvat import run_burden_test, run_skat_test, beta_weighting
from cellink.utils import column_normalize, gaussianize
from cellink.at.acat import acat_test
from cellink.resources import get_dummy_onek1k

DATA = Path(cl.__file__).parent.parent.parent / "docs/tutorials/data"

n_gpcs = 20
n_epcs = 15
batch_e_pcs_n_top_genes = 2000
chrom = 22
cis_window = 100_000
cell_type = "CD4 Naive"
pb_gex_key = f"PB_{cell_type}"  # pseudobulk expression in dd.G.obsm[key_added]
original_donor_col = "donor_id"
min_percent_donors_expressed = 0.1
celltype_key = "predicted.celltype.l2"
do_debug = False

Load and Prepare Data#

Here, we load a prepared dataset (onek1k) that includes genotype and expression information from human donors. (This is a subset of the full OneK1K dataset, which can be downloaded, and prepared using get_onek1k()) We also extract gene annotations using Ensembl via pybiomart, which are essential for defining cis-windows during eQTL analysis.

dd = get_dummy_onek1k(config_path="../../src/cellink/resources/config/dummy_onek1k.yaml", verify_checksum=False)
dd

[2026-04-02 08:10:16,307] INFO:root: /home/dnanexus/cellink_data/dummy_onek1k/dummy_onek1k.dd.h5 already exists
[2026-04-02 08:10:16,308] WARNING:root: No checksum provided, skipping verification
[2026-04-02 08:10:18,425] INFO:root: Loaded dummy OneK1K dataset: (100, 146939, 125366, 34073)

╔═ DonorData(n_donors=100, n_cells_per_donor=[613-2,731], donor_id='donor_id') ═══════════════════════════════╗
║ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ║
║ ┃ G (donors)                                         ┃ C (cells)                                          ┃ ║
║ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ ║
║ │ AnnData object with n_obs × n_vars = 100 × 146,939 │ View of AnnData object with n_obs × n_vars =       │ ║
║ │                                                    │ 125,366 × 34,073                                   │ ║
║ │     var: 'chrom', 'pos', 'a0', 'a1', 'AC',         │     obs: 'orig.ident', 'nCount_RNA',               │ ║
║ │ 'AC_Hemi', 'AC_Het', 'AC_Hom', 'AF', 'AN', 'ER2',  │ 'nFeature_RNA', 'percent.mt', 'donor_id',          │ ║
║ │ 'ExcHet', 'HWE', 'IMPUTED', 'maf', 'NS', 'R2',     │ 'pool_number', 'predicted.celltype.l2',            │ ║
║ │ 'TYPED', 'TYPED_ONLY', 'id', 'id_mask', 'length',  │ 'predicted.celltype.l2.score', 'age',              │ ║
║ │ 'quality', 'pos_hg19', 'id_hg19'                   │ 'organism_ontology_term_id',                       │ ║
║ │                                                    │ 'tissue_ontology_term_id',                         │ ║
║ │                                                    │ 'assay_ontology_term_id',                          │ ║
║ │                                                    │ 'disease_ontology_term_id',                        │ ║
║ │                                                    │ 'cell_type_ontology_term_id',                      │ ║
║ │                                                    │ 'self_reported_ethnicity_ontology_term_id',        │ ║
║ │                                                    │ 'development_stage_ontology_term_id',              │ ║
║ │                                                    │ 'sex_ontology_term_id', 'is_primary_data',         │ ║
║ │                                                    │ 'suspension_type', 'tissue_type', 'cell_type',     │ ║
║ │                                                    │ 'assay', 'disease', 'organism', 'sex', 'tissue',   │ ║
║ │                                                    │ 'self_reported_ethnicity', 'development_stage',    │ ║
║ │                                                    │ 'observation_joinid'                               │ ║
║ │     uns: 'kinship'                                 │     var: 'vst.mean', 'vst.variance',               │ ║
║ │                                                    │ 'vst.variance.expected',                           │ ║
║ │                                                    │ 'vst.variance.standardized', 'vst.variable',       │ ║
║ │                                                    │ 'feature_is_filtered', 'feature_name',             │ ║
║ │                                                    │ 'feature_reference', 'feature_biotype',            │ ║
║ │                                                    │ 'feature_length', 'feature_type', 'start', 'end',  │ ║
║ │                                                    │ 'chrom'                                            │ ║
║ │     obsm: 'gPCs'                                   │     uns: 'cell_type_ontology_term_id_colors',      │ ║
║ │                                                    │ 'citation', 'default_embedding',                   │ ║
║ │                                                    │ 'schema_reference', 'schema_version', 'title'      │ ║
║ │     varm: 'filter'                                 │     obsm: 'X_azimuth_spca', 'X_azimuth_umap',      │ ║
║ │                                                    │ 'X_harmony', 'X_pca', 'X_umap'                     │ ║
║ │                                                    │     varm: 'PCs'                                    │ ║
║ └────────────────────────────────────────────────────┴────────────────────────────────────────────────────┘ ║
╚═════════════════════════════════════════════════════════════════════════════════════════════════════════════╝

"""
def _get_ensembl_gene_id_start_end_chr():
    from pybiomart import Server

    server = Server(host="http://www.ensembl.org")
    dataset = server.marts["ENSEMBL_MART_ENSEMBL"].datasets["hsapiens_gene_ensembl"]
    ensembl_gene_id_start_end_chr = dataset.query(
        attributes=["ensembl_gene_id", "start_position", "end_position", "chromosome_name"]
    )
    ensembl_gene_id_start_end_chr = ensembl_gene_id_start_end_chr.set_index("Gene stable ID")
    ensembl_gene_id_start_end_chr = ensembl_gene_id_start_end_chr.rename(
        columns={
            "Gene start (bp)": GAnn.start,
            "Gene end (bp)": GAnn.end,
            "Chromosome/scaffold name": GAnn.chrom,
        }
    )
    return ensembl_gene_id_start_end_chr
"""

'\ndef _get_ensembl_gene_id_start_end_chr():\n    from pybiomart import Server\n\n    server = Server(host="http://www.ensembl.org")\n    dataset = server.marts["ENSEMBL_MART_ENSEMBL"].datasets["hsapiens_gene_ensembl"]\n    ensembl_gene_id_start_end_chr = dataset.query(\n        attributes=["ensembl_gene_id", "start_position", "end_position", "chromosome_name"]\n    )\n    ensembl_gene_id_start_end_chr = ensembl_gene_id_start_end_chr.set_index("Gene stable ID")\n    ensembl_gene_id_start_end_chr = ensembl_gene_id_start_end_chr.rename(\n        columns={\n            "Gene start (bp)": GAnn.start,\n            "Gene end (bp)": GAnn.end,\n            "Chromosome/scaffold name": GAnn.chrom,\n        }\n    )\n    return ensembl_gene_id_start_end_chr\n'

"""
ensembl_gene_id_start_end_chr = _get_ensembl_gene_id_start_end_chr()
ensembl_gene_id_start_end_chr
"""

'\nensembl_gene_id_start_end_chr = _get_ensembl_gene_id_start_end_chr()\nensembl_gene_id_start_end_chr\n'

"""
dd.C.var = dd.C.var.join(ensembl_gene_id_start_end_chr)
dd.C.obs[DAnn.donor] = dd.C.obs[original_donor_col]
"""
dd.G.obsm["gPCs"] = dd.G.obsm["gPCs"][dd.G.obsm["gPCs"].columns[:n_gpcs]]

Normalize and Filter Expression Data#

We normalize single-cell expression using total-count normalization and log transformation. We then filter the dataset to only include cells of a specific type (CD8 Naive). This step ensures we’re analyzing homogeneous populations, which increases power in eQTL detection.

sc.pp.normalize_total(dd.C)
sc.pp.log1p(dd.C)
sc.pp.normalize_total(dd.C)

# are the expression pcs computed by pseudobulking across all cell types?
mdata = sc.get.aggregate(dd.C, by=DAnn.donor, func="mean")
mdata.X = mdata.layers.pop("mean")

sc.pp.highly_variable_genes(mdata, n_top_genes=batch_e_pcs_n_top_genes)
sc.tl.pca(mdata, n_comps=n_epcs)

dd.G.obsm["ePCs"] = mdata[dd.G.obs_names].obsm["X_pca"]

dd = dd[..., dd.C.obs[celltype_key] == cell_type, :].copy()
dd

╔═ DonorData(n_donors=981, n_cells_per_donor=[6-927], donor_id='donor_id') ═══════════════════════════════════╗
║ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ║
║ ┃ G (donors)                                         ┃ C (cells)                                          ┃ ║
║ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ ║
║ │ AnnData object with n_obs × n_vars = 981 × 136,776 │ AnnData object with n_obs × n_vars = 259,012 ×     │ ║
║ │                                                    │ 36,469                                             │ ║
║ │     var: 'chrom', 'pos', 'a0', 'a1', 'AC',         │     obs: 'orig.ident', 'nCount_RNA',               │ ║
║ │ 'AC_Hemi', 'AC_Het', 'AC_Hom', 'AF', 'AN', 'ER2',  │ 'nFeature_RNA', 'percent.mt', 'donor_id',          │ ║
║ │ 'ExcHet', 'HWE', 'IMPUTED', 'maf', 'NS', 'R2',     │ 'pool_number', 'predicted.celltype.l2',            │ ║
║ │ 'TYPED', 'TYPED_ONLY', 'id', 'id_mask', 'length',  │ 'predicted.celltype.l2.score', 'age',              │ ║
║ │ 'quality', 'pos_hg19', 'id_hg19'                   │ 'organism_ontology_term_id',                       │ ║
║ │                                                    │ 'tissue_ontology_term_id',                         │ ║
║ │                                                    │ 'assay_ontology_term_id',                          │ ║
║ │                                                    │ 'disease_ontology_term_id',                        │ ║
║ │                                                    │ 'cell_type_ontology_term_id',                      │ ║
║ │                                                    │ 'self_reported_ethnicity_ontology_term_id',        │ ║
║ │                                                    │ 'development_stage_ontology_term_id',              │ ║
║ │                                                    │ 'sex_ontology_term_id', 'is_primary_data',         │ ║
║ │                                                    │ 'suspension_type', 'tissue_type', 'cell_type',     │ ║
║ │                                                    │ 'assay', 'disease', 'organism', 'sex', 'tissue',   │ ║
║ │                                                    │ 'self_reported_ethnicity', 'development_stage',    │ ║
║ │                                                    │ 'observation_joinid'                               │ ║
║ │     uns: 'kinship'                                 │     var: 'vst.mean', 'vst.variance',               │ ║
║ │                                                    │ 'vst.variance.expected',                           │ ║
║ │                                                    │ 'vst.variance.standardized', 'vst.variable',       │ ║
║ │                                                    │ 'feature_is_filtered', 'feature_name',             │ ║
║ │                                                    │ 'feature_reference', 'feature_biotype',            │ ║
║ │                                                    │ 'feature_length', 'feature_type', 'start', 'end',  │ ║
║ │                                                    │ 'chrom'                                            │ ║
║ │     obsm: 'gPCs', 'ePCs'                           │     uns: 'cell_type_ontology_term_id_colors',      │ ║
║ │                                                    │ 'citation', 'default_embedding',                   │ ║
║ │                                                    │ 'schema_reference', 'schema_version', 'title',     │ ║
║ │                                                    │ 'log1p'                                            │ ║
║ │     varm: 'filter'                                 │     obsm: 'X_azimuth_spca', 'X_azimuth_umap',      │ ║
║ │                                                    │ 'X_harmony', 'X_pca', 'X_umap'                     │ ║
║ │                                                    │     varm: 'PCs'                                    │ ║
║ └────────────────────────────────────────────────────┴────────────────────────────────────────────────────┘ ║
╚═════════════════════════════════════════════════════════════════════════════════════════════════════════════╝

dd.aggregate(key_added=pb_gex_key, sync_var=True, verbose=True)
dd.aggregate(obs=["sex", "age"], func="first", add_to_obs=True)
dd

╔═ DonorData(n_donors=981, n_cells_per_donor=[6-927], donor_id='donor_id') ═══════════════════════════════════╗
║ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ║
║ ┃ G (donors)                                         ┃ C (cells)                                          ┃ ║
║ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ ║
║ │ AnnData object with n_obs × n_vars = 981 × 136,776 │ AnnData object with n_obs × n_vars = 259,012 ×     │ ║
║ │                                                    │ 36,469                                             │ ║
║ │     obs: 'sex', 'age'                              │     obs: 'orig.ident', 'nCount_RNA',               │ ║
║ │                                                    │ 'nFeature_RNA', 'percent.mt', 'donor_id',          │ ║
║ │                                                    │ 'pool_number', 'predicted.celltype.l2',            │ ║
║ │                                                    │ 'predicted.celltype.l2.score', 'age',              │ ║
║ │                                                    │ 'organism_ontology_term_id',                       │ ║
║ │                                                    │ 'tissue_ontology_term_id',                         │ ║
║ │                                                    │ 'assay_ontology_term_id',                          │ ║
║ │                                                    │ 'disease_ontology_term_id',                        │ ║
║ │                                                    │ 'cell_type_ontology_term_id',                      │ ║
║ │                                                    │ 'self_reported_ethnicity_ontology_term_id',        │ ║
║ │                                                    │ 'development_stage_ontology_term_id',              │ ║
║ │                                                    │ 'sex_ontology_term_id', 'is_primary_data',         │ ║
║ │                                                    │ 'suspension_type', 'tissue_type', 'cell_type',     │ ║
║ │                                                    │ 'assay', 'disease', 'organism', 'sex', 'tissue',   │ ║
║ │                                                    │ 'self_reported_ethnicity', 'development_stage',    │ ║
║ │                                                    │ 'observation_joinid'                               │ ║
║ │     var: 'chrom', 'pos', 'a0', 'a1', 'AC',         │     var: 'vst.mean', 'vst.variance',               │ ║
║ │ 'AC_Hemi', 'AC_Het', 'AC_Hom', 'AF', 'AN', 'ER2',  │ 'vst.variance.expected',                           │ ║
║ │ 'ExcHet', 'HWE', 'IMPUTED', 'maf', 'NS', 'R2',     │ 'vst.variance.standardized', 'vst.variable',       │ ║
║ │ 'TYPED', 'TYPED_ONLY', 'id', 'id_mask', 'length',  │ 'feature_is_filtered', 'feature_name',             │ ║
║ │ 'quality', 'pos_hg19', 'id_hg19'                   │ 'feature_reference', 'feature_biotype',            │ ║
║ │                                                    │ 'feature_length', 'feature_type', 'start', 'end',  │ ║
║ │                                                    │ 'chrom'                                            │ ║
║ │     uns: 'kinship'                                 │     uns: 'cell_type_ontology_term_id_colors',      │ ║
║ │                                                    │ 'citation', 'default_embedding',                   │ ║
║ │                                                    │ 'schema_reference', 'schema_version', 'title',     │ ║
║ │                                                    │ 'log1p'                                            │ ║
║ │     obsm: 'gPCs', 'ePCs', 'PB_CD4 Naive'           │     obsm: 'X_azimuth_spca', 'X_azimuth_umap',      │ ║
║ │                                                    │ 'X_harmony', 'X_pca', 'X_umap'                     │ ║
║ │     varm: 'filter'                                 │     varm: 'PCs'                                    │ ║
║ └────────────────────────────────────────────────────┴────────────────────────────────────────────────────┘ ║
╚═════════════════════════════════════════════════════════════════════════════════════════════════════════════╝

print(f"{pb_gex_key} shape:", dd.G.obsm[pb_gex_key].shape)
print("dd.shape:", dd.shape)

keep_genes = ((dd.G.obsm[pb_gex_key] > 0).mean(axis=0) >= min_percent_donors_expressed).values
dd = dd[..., keep_genes]
print("after filtering")
print(f"{pb_gex_key} shape:", dd.G.obsm[pb_gex_key].shape)
print("dd.shape:", dd.shape)

PB_CD4 Naive shape: (981, 36469)
dd.shape: (981, 136776, 259012, 36469)
after filtering
PB_CD4 Naive shape: (981, 15770)
dd.shape: (981, 136776, 259012, 15770)

# alternative to dd[:, dd.G.var.chrom == str(chrom), :, dd.C.var.chrom == str(chrom)]
dd = dd.sel(G_var=dd.G.var.chrom == str(chrom), C_var=dd.C.var.chrom == str(chrom)).copy()
dd

╔═ DonorData(n_donors=981, n_cells_per_donor=[6-927], donor_id='donor_id') ═══════════════════════════════════╗
║ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ║
║ ┃ G (donors)                                         ┃ C (cells)                                          ┃ ║
║ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ ║
║ │ AnnData object with n_obs × n_vars = 981 × 136,776 │ AnnData object with n_obs × n_vars = 259,012 × 404 │ ║
║ │     obs: 'sex', 'age'                              │     obs: 'orig.ident', 'nCount_RNA',               │ ║
║ │                                                    │ 'nFeature_RNA', 'percent.mt', 'donor_id',          │ ║
║ │                                                    │ 'pool_number', 'predicted.celltype.l2',            │ ║
║ │                                                    │ 'predicted.celltype.l2.score', 'age',              │ ║
║ │                                                    │ 'organism_ontology_term_id',                       │ ║
║ │                                                    │ 'tissue_ontology_term_id',                         │ ║
║ │                                                    │ 'assay_ontology_term_id',                          │ ║
║ │                                                    │ 'disease_ontology_term_id',                        │ ║
║ │                                                    │ 'cell_type_ontology_term_id',                      │ ║
║ │                                                    │ 'self_reported_ethnicity_ontology_term_id',        │ ║
║ │                                                    │ 'development_stage_ontology_term_id',              │ ║
║ │                                                    │ 'sex_ontology_term_id', 'is_primary_data',         │ ║
║ │                                                    │ 'suspension_type', 'tissue_type', 'cell_type',     │ ║
║ │                                                    │ 'assay', 'disease', 'organism', 'sex', 'tissue',   │ ║
║ │                                                    │ 'self_reported_ethnicity', 'development_stage',    │ ║
║ │                                                    │ 'observation_joinid'                               │ ║
║ │     var: 'chrom', 'pos', 'a0', 'a1', 'AC',         │     var: 'vst.mean', 'vst.variance',               │ ║
║ │ 'AC_Hemi', 'AC_Het', 'AC_Hom', 'AF', 'AN', 'ER2',  │ 'vst.variance.expected',                           │ ║
║ │ 'ExcHet', 'HWE', 'IMPUTED', 'maf', 'NS', 'R2',     │ 'vst.variance.standardized', 'vst.variable',       │ ║
║ │ 'TYPED', 'TYPED_ONLY', 'id', 'id_mask', 'length',  │ 'feature_is_filtered', 'feature_name',             │ ║
║ │ 'quality', 'pos_hg19', 'id_hg19'                   │ 'feature_reference', 'feature_biotype',            │ ║
║ │                                                    │ 'feature_length', 'feature_type', 'start', 'end',  │ ║
║ │                                                    │ 'chrom'                                            │ ║
║ │     uns: 'kinship'                                 │     uns: 'cell_type_ontology_term_id_colors',      │ ║
║ │                                                    │ 'citation', 'default_embedding',                   │ ║
║ │                                                    │ 'schema_reference', 'schema_version', 'title',     │ ║
║ │                                                    │ 'log1p'                                            │ ║
║ │     obsm: 'gPCs', 'ePCs', 'PB_CD4 Naive'           │     obsm: 'X_azimuth_spca', 'X_azimuth_umap',      │ ║
║ │                                                    │ 'X_harmony', 'X_pca', 'X_umap'                     │ ║
║ │     varm: 'filter'                                 │     varm: 'PCs'                                    │ ║
║ └────────────────────────────────────────────────────┴────────────────────────────────────────────────────┘ ║
╚═════════════════════════════════════════════════════════════════════════════════════════════════════════════╝

Step 1: Annotate Variants with Functional Information#

To inform the weighting in burden tests, we annotate variants using the VEP tool. These annotations include predicted functional consequences (e.g., missense, stop gained), CADD scores, and distance to gene transcription start sites (TSS).

This information allows biologically meaningful prioritization of variants when aggregating their effects.

vep_annotation_file = DATA / "variant_annotation/variants_vep_annotated.txt"

cl.tl.add_vep_annos_to_gdata(vep_anno_file=vep_annotation_file, gdata=dd.G, dummy_consequence=True)
dd.G.uns["variant_annotation_vep"]

			Consequence_3_prime_UTR_variant	Consequence_5_prime_UTR_variant	Consequence_NMD_transcript_variant	Consequence_coding_sequence_variant	Consequence_downstream_gene_variant	Consequence_frameshift_variant	Consequence_inframe_deletion	Consequence_inframe_insertion	Consequence_intergenic_variant	Consequence_intron_variant	...	Codons	cDNA_position	gnomADe_ASJ_AF	CDS_position	gnomADe_AF	IMPACT	BIOTYPE	CLIN_SIG	Protein_position	PHENO
snp_id	gene_id	transcript_id
22_16388891_G_A	ENSG00000230643	ENST00000447704	0	0	0	0	1	0	0	0	0	0	...	-	-	NaN	-	NaN	MODIFIER	unprocessed_pseudogene	-	-	-
22_16388968_C_T	ENSG00000230643	ENST00000447704	0	0	0	0	1	0	0	0	0	0	...	-	-	NaN	-	NaN	MODIFIER	unprocessed_pseudogene	-	-	-
22_16389525_A_G	ENSG00000230643	ENST00000447704	0	0	0	0	0	0	0	0	0	0	...	-	78/118	NaN	-	NaN	MODIFIER	unprocessed_pseudogene	-	-	-
22_16390411_G_A	ENSG00000230643	ENST00000447704	0	0	0	0	0	0	0	0	0	0	...	-	-	NaN	-	NaN	MODIFIER	unprocessed_pseudogene	-	-	-
22_16391555_G_C	ENSG00000230643	ENST00000447704	0	0	0	0	0	0	0	0	0	0	...	-	-	NaN	-	NaN	MODIFIER	unprocessed_pseudogene	-	-	-
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
22_50796327_CCA_C	ENSG00000100239	ENST00000395741	0	0	0	0	0	0	0	0	0	1	...	-	-	NaN	-	NaN	MODIFIER	protein_coding	-	-	-
22_50798021_A_G	ENSG00000100239	ENST00000395741	0	0	0	0	0	0	0	0	0	1	...	-	-	NaN	-	NaN	MODIFIER	protein_coding	-	-	-
22_50798635_T_C	ENSG00000100239	ENST00000395741	0	0	0	0	0	0	0	0	0	1	...	-	-	NaN	-	NaN	MODIFIER	protein_coding	-	-	-
22_50799821_A_C	ENSG00000100239	ENST00000395741	0	0	0	0	0	0	0	0	0	1	...	-	-	NaN	-	NaN	MODIFIER	protein_coding	-	-	-
22_50802428_C_G	ENSG00000100239	ENST00000395741	0	0	0	0	0	0	0	0	0	1	...	-	-	NaN	-	NaN	MODIFIER	protein_coding	-	-	-

169322 rows × 57 columns

cl.tl.aggregate_annotations_for_varm(
    dd.G, "variant_annotation_vep", agg_type="first", return_data=True
)  # TODO change agg type

	gene_id	transcript_id	Consequence_3_prime_UTR_variant	Consequence_5_prime_UTR_variant	Consequence_NMD_transcript_variant	Consequence_coding_sequence_variant	Consequence_downstream_gene_variant	Consequence_frameshift_variant	Consequence_inframe_deletion	Consequence_inframe_insertion	...	Codons	cDNA_position	gnomADe_ASJ_AF	CDS_position	gnomADe_AF	IMPACT	BIOTYPE	CLIN_SIG	Protein_position	PHENO
snp_id
22_16388891_G_A	ENSG00000230643	ENST00000447704	0	0	0	0	1	0	0	0	...	-	-	NaN	-	NaN	MODIFIER	unprocessed_pseudogene	-	-	-
22_16388968_C_T	ENSG00000230643	ENST00000447704	0	0	0	0	1	0	0	0	...	-	-	NaN	-	NaN	MODIFIER	unprocessed_pseudogene	-	-	-
22_16389525_A_G	ENSG00000230643	ENST00000447704	0	0	0	0	0	0	0	0	...	-	78/118	NaN	-	NaN	MODIFIER	unprocessed_pseudogene	-	-	-
22_16390411_G_A	ENSG00000230643	ENST00000447704	0	0	0	0	0	0	0	0	...	-	-	NaN	-	NaN	MODIFIER	unprocessed_pseudogene	-	-	-
22_16391555_G_C	ENSG00000230643	ENST00000447704	0	0	0	0	0	0	0	0	...	-	-	NaN	-	NaN	MODIFIER	unprocessed_pseudogene	-	-	-
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
22_50796327_CCA_C	ENSG00000100239	ENST00000395741	0	0	0	0	0	0	0	0	...	-	-	NaN	-	NaN	MODIFIER	protein_coding	-	-	-
22_50798021_A_G	ENSG00000100239	ENST00000395741	0	0	0	0	0	0	0	0	...	-	-	NaN	-	NaN	MODIFIER	protein_coding	-	-	-
22_50798635_T_C	ENSG00000100239	ENST00000395741	0	0	0	0	0	0	0	0	...	-	-	NaN	-	NaN	MODIFIER	protein_coding	-	-	-
22_50799821_A_C	ENSG00000100239	ENST00000395741	0	0	0	0	0	0	0	0	...	-	-	NaN	-	NaN	MODIFIER	protein_coding	-	-	-
22_50802428_C_G	ENSG00000100239	ENST00000395741	0	0	0	0	0	0	0	0	...	-	-	NaN	-	NaN	MODIFIER	protein_coding	-	-	-

136776 rows × 59 columns

dd.G.varm["variant_annotation"].columns

Index(['gene_id', 'transcript_id', 'Consequence_3_prime_UTR_variant',
       'Consequence_5_prime_UTR_variant', 'Consequence_NMD_transcript_variant',
       'Consequence_coding_sequence_variant',
       'Consequence_downstream_gene_variant', 'Consequence_frameshift_variant',
       'Consequence_inframe_deletion', 'Consequence_inframe_insertion',
       'Consequence_intergenic_variant', 'Consequence_intron_variant',
       'Consequence_mature_miRNA_variant', 'Consequence_missense_variant',
       'Consequence_non_coding_transcript_exon_variant',
       'Consequence_non_coding_transcript_variant',
       'Consequence_protein_altering_variant',
       'Consequence_splice_acceptor_variant',
       'Consequence_splice_donor_5th_base_variant',
       'Consequence_splice_donor_region_variant',
       'Consequence_splice_donor_variant',
       'Consequence_splice_polypyrimidine_tract_variant',
       'Consequence_splice_region_variant', 'Consequence_start_lost',
       'Consequence_stop_gained', 'Consequence_stop_lost',
       'Consequence_stop_retained_variant', 'Consequence_synonymous_variant',
       'Consequence_upstream_gene_variant', 'gnomADe_AMR_AF', 'DISTANCE',
       'CANONICAL', 'STRAND', 'CADD_RAW', 'ENSP', 'gnomADe_AFR_AF',
       'gnomADe_EAS_AF', 'SOMATIC', 'TSSDistance', 'gnomADe_SAS_AF', 'SIFT',
       'gnomADe_FIN_AF', 'gnomADe_OTH_AF', 'PolyPhen', 'FLAGS', 'Amino_acids',
       'CADD_PHRED', 'Existing_variation', 'gnomADe_NFE_AF', 'Codons',
       'cDNA_position', 'gnomADe_ASJ_AF', 'CDS_position', 'gnomADe_AF',
       'IMPACT', 'BIOTYPE', 'CLIN_SIG', 'Protein_position', 'PHENO'],
      dtype='object')

dd.G.uns["variant_annotation_vep"]["CADD_RAW"].describe()

count    32966.000000
mean         0.122248
std          0.542875
min         -1.705125
25%         -0.115726
50%          0.023902
75%          0.205452
max          9.936896
Name: CADD_RAW, dtype: float64

Step 2: Burden Testing#

We now perform burden tests, which evaluate whether the aggregate effect of rare variants within a gene is associated with a phenotype — in this case, gene expression in CD8 Naive cells. We use different weighting schemes allow us to prioritize variants differently based on functional scores or allele frequency, including:

CADD_RAW: Raw CADD scores.
maf_beta: Beta weighting based on minor allele frequency (MAF).
tss_distance: Distance to the transcription start site (TSS).
tss_distance_exp: Exponential weighting based on TSS distance.

burden_agg_fct = "sum"
run_lrt = True
annotation_cols = ["CADD_RAW", "maf_beta", "tss_distance", "tss_distance_exp"]

rare_maf_threshold = 0.05

Filtering for Rare Variants#

Burden tests focus on rare variants—typically those with minor allele frequency (MAF) below 0.01 or 0.05. We filter our genotype data accordingly to isolate variants in this frequency range.

dd = dd.sel(G_var=dd.G.var.maf < rare_maf_threshold).copy()
dd

╔═ DonorData(n_donors=981, n_cells_per_donor=[6-927], donor_id='donor_id') ═══════════════════════════════════╗
║ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ║
║ ┃ G (donors)                                         ┃ C (cells)                                          ┃ ║
║ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ ║
║ │ AnnData object with n_obs × n_vars = 981 × 39,454  │ AnnData object with n_obs × n_vars = 259,012 × 404 │ ║
║ │     obs: 'sex', 'age'                              │     obs: 'orig.ident', 'nCount_RNA',               │ ║
║ │                                                    │ 'nFeature_RNA', 'percent.mt', 'donor_id',          │ ║
║ │                                                    │ 'pool_number', 'predicted.celltype.l2',            │ ║
║ │                                                    │ 'predicted.celltype.l2.score', 'age',              │ ║
║ │                                                    │ 'organism_ontology_term_id',                       │ ║
║ │                                                    │ 'tissue_ontology_term_id',                         │ ║
║ │                                                    │ 'assay_ontology_term_id',                          │ ║
║ │                                                    │ 'disease_ontology_term_id',                        │ ║
║ │                                                    │ 'cell_type_ontology_term_id',                      │ ║
║ │                                                    │ 'self_reported_ethnicity_ontology_term_id',        │ ║
║ │                                                    │ 'development_stage_ontology_term_id',              │ ║
║ │                                                    │ 'sex_ontology_term_id', 'is_primary_data',         │ ║
║ │                                                    │ 'suspension_type', 'tissue_type', 'cell_type',     │ ║
║ │                                                    │ 'assay', 'disease', 'organism', 'sex', 'tissue',   │ ║
║ │                                                    │ 'self_reported_ethnicity', 'development_stage',    │ ║
║ │                                                    │ 'observation_joinid'                               │ ║
║ │     var: 'chrom', 'pos', 'a0', 'a1', 'AC',         │     var: 'vst.mean', 'vst.variance',               │ ║
║ │ 'AC_Hemi', 'AC_Het', 'AC_Hom', 'AF', 'AN', 'ER2',  │ 'vst.variance.expected',                           │ ║
║ │ 'ExcHet', 'HWE', 'IMPUTED', 'maf', 'NS', 'R2',     │ 'vst.variance.standardized', 'vst.variable',       │ ║
║ │ 'TYPED', 'TYPED_ONLY', 'id', 'id_mask', 'length',  │ 'feature_is_filtered', 'feature_name',             │ ║
║ │ 'quality', 'pos_hg19', 'id_hg19'                   │ 'feature_reference', 'feature_biotype',            │ ║
║ │                                                    │ 'feature_length', 'feature_type', 'start', 'end',  │ ║
║ │                                                    │ 'chrom'                                            │ ║
║ │     uns: 'kinship', 'variant_annotation_vep'       │     uns: 'cell_type_ontology_term_id_colors',      │ ║
║ │                                                    │ 'citation', 'default_embedding',                   │ ║
║ │                                                    │ 'schema_reference', 'schema_version', 'title',     │ ║
║ │                                                    │ 'log1p'                                            │ ║
║ │     obsm: 'gPCs', 'ePCs', 'PB_CD4 Naive'           │     obsm: 'X_azimuth_spca', 'X_azimuth_umap',      │ ║
║ │                                                    │ 'X_harmony', 'X_pca', 'X_umap'                     │ ║
║ │     varm: 'filter', 'variant_annotation'           │     varm: 'PCs'                                    │ ║
║ └────────────────────────────────────────────────────┴────────────────────────────────────────────────────┘ ║
╚═════════════════════════════════════════════════════════════════════════════════════════════════════════════╝

Custom MAF Weights#

We add custom MAF weights commonly used in burden tests, such as Beta(MAF, 1, 25). These weights prioritize rarer variants in the analysis. TSS distance weight as used in the SAIGE-QTL paper are added manually for each gene

dd.G.varm["variant_annotation"]["maf_beta"] = beta_weighting(dd.G.var["maf"])

Run burden tests using each annotation individually for weighting#

Next, we specify the covariate matrix F, which includes sex, age, genetic PCs, and expression PCs. This controls for potential confounders in the association analysis. We then iterate over all genes on chromosome 22, define a cis-window of ±500kb, and run a burden test using each of the annotation-based weighting schemes.

# This specifies covariates/fixed effects
F = np.concatenate(
    [
        np.ones((dd.shape[0], 1)),
        dd.G.obs[["sex"]].values - 1,
        dd.G.obs[["age"]].values,
        dd.G.obsm["gPCs"].values,
        dd.G.obsm["ePCs"],
    ],
    axis=1,
).astype(np.float64)
F[:, 2:] = column_normalize(F[:, 2:])

results = []
if isinstance(dd.G.X, da.Array | ad._core.views.DaskArrayView):
    if dd.G.is_view:
        dd._G = dd._G.copy()
    dd.G.X = dd.G.X.compute()

if do_debug:
    warnings.filterwarnings("ignore", category=RuntimeWarning)

for gene, row in tqdm(dd.C.var.iterrows(), total=dd.shape[3]):
    Y = gaussianize(dd.G.obsm[pb_gex_key][[gene]].values.astype(np.float64) + 1e-5 * np.random.randn(dd.shape[0], 1))

    start = max(0, row.start - cis_window)
    end = row.end + cis_window
    _G = dd.G[:, (dd.G.var.pos < end)]
    _G = _G[:, (_G.var.pos > start)]
    _G = _G[:, (_G.X.std(0) != 0)]
    _G = _G.copy()

    # TODO make strand aware
    _G.varm["variant_annotation"]["tss_distance"] = np.abs(row.start - _G.var["pos"])
    _G.varm["variant_annotation"]["tss_distance_exp"] = np.exp(-1e-5 * _G.varm["variant_annotation"]["tss_distance"])

    rdf = run_burden_test(
        _G, Y, F, gene, annotation_cols=annotation_cols, burden_agg_fct=burden_agg_fct, run_lrt=run_lrt
    )
    results.append(rdf)

rdf = pd.concat(results)
rdf

	burden_gene	egene	weight_col	burden_agg_fct	pv	beta	betaste	lrt
0	ENSG00000206195	ENSG00000206195	CADD_RAW	sum	NaN	NaN	NaN	NaN
1	ENSG00000206195	ENSG00000206195	maf_beta	sum	NaN	NaN	NaN	NaN
2	ENSG00000206195	ENSG00000206195	tss_distance	sum	NaN	NaN	NaN	NaN
3	ENSG00000206195	ENSG00000206195	tss_distance_exp	sum	NaN	NaN	NaN	NaN
0	ENSG00000177663	ENSG00000177663	CADD_RAW	sum	NaN	NaN	NaN	NaN
...	...	...	...	...	...	...	...	...
3	ENSG00000100299	ENSG00000100299	tss_distance_exp	sum	0.024520	-3.623585e-03	1.611285e-03	5.057453
0	ENSG00000079974	ENSG00000079974	CADD_RAW	sum	NaN	NaN	NaN	NaN
1	ENSG00000079974	ENSG00000079974	maf_beta	sum	0.343694	6.644163e-04	7.016820e-04	0.896602
2	ENSG00000079974	ENSG00000079974	tss_distance	sum	0.977858	4.148339e-09	1.494648e-07	0.000770
3	ENSG00000079974	ENSG00000079974	tss_distance_exp	sum	0.244656	1.876147e-02	1.612602e-02	1.353567

1616 rows × 8 columns

Step 3: Combine Multiple Tests with ACAT#

To summarize evidence from multiple annotations, we use the ACAT (Aggregated Cauchy Association Test). This meta-analysis method combines p-values from the burden tests per gene into a single statistic, improving power and interpretability.

combined = rdf.dropna(subset=["pv"]).groupby("egene")["pv"].agg(lambda x: acat_test(x.values)).reset_index()
combined.sort_values("pv")

	egene	pv
84	ENSG00000100219	2.287059e-14
290	ENSG00000212939	2.274714e-11
75	ENSG00000100154	8.218144e-05
186	ENSG00000133460	9.702130e-05
198	ENSG00000167077	1.625786e-04
...	...	...
158	ENSG00000100427	9.859188e-01
338	ENSG00000244625	9.891262e-01
107	ENSG00000100294	9.921173e-01
44	ENSG00000100030	9.949769e-01
27	ENSG00000099957	9.957546e-01

402 rows × 2 columns

Step 4: SKAT#

As an alternative to burden testing, we apply the Sequence Kernel Association Test (SKAT), which models the variance component of aggregated rare variants rather than assuming a unidirectional effect. SKAT is more robust when variant effects differ in direction or magnitude.

Currently, cellink supports SKAT with the standard Beta(MAF, 1, 25) weighting scheme.

import logging

logger = logging.getLogger()
logger.setLevel(logging.WARNING)  # Suppress INFO and DEBUG esp. from SKAT Test

results = []

for gene, row in tqdm(dd.C.var.iterrows(), total=dd.shape[3]):
    Y = gaussianize(dd.G.obsm[pb_gex_key][[gene]].values.astype(np.float64) + 1e-5 * np.random.randn(dd.shape[0], 1))

    start = max(0, row.start - cis_window)
    end = row.end + cis_window
    _G = dd.G[:, (dd.G.var.pos < end)]
    _G = _G[:, (_G.var.pos > start)]
    _G = _G[:, (_G.X.std(0) != 0)]

    rdict = run_skat_test(_G, Y, F, gene)
    results.append(rdict)

rdf = pd.DataFrame(results)
rdf

	burden_gene	egene	weight_col	pv
0	ENSG00000206195	ENSG00000206195	maf_beta	[[1.0]]
1	ENSG00000177663	ENSG00000177663	maf_beta	[[0.7540438404596421]]
2	ENSG00000069998	ENSG00000069998	maf_beta	[[0.03897879062889942]]
3	ENSG00000185837	ENSG00000185837	maf_beta	[[0.8117376585909591]]
4	ENSG00000093072	ENSG00000093072	maf_beta	[[0.9819123716041604]]
...	...	...	...	...
399	ENSG00000205560	ENSG00000205560	maf_beta	[[0.02343900249308839]]
400	ENSG00000100288	ENSG00000100288	maf_beta	[[0.5147669611167375]]
401	ENSG00000205559	ENSG00000205559	maf_beta	[[0.23597911440366892]]
402	ENSG00000100299	ENSG00000100299	maf_beta	[[0.0004379433659393861]]
403	ENSG00000079974	ENSG00000079974	maf_beta	[[0.03259277644181413]]

404 rows × 4 columns

rdf

	burden_gene	egene	weight_col	pv
0	ENSG00000206195	ENSG00000206195	maf_beta	[[1.0]]
1	ENSG00000177663	ENSG00000177663	maf_beta	[[0.7540438404596421]]
2	ENSG00000069998	ENSG00000069998	maf_beta	[[0.03897879062889942]]
3	ENSG00000185837	ENSG00000185837	maf_beta	[[0.8117376585909591]]
4	ENSG00000093072	ENSG00000093072	maf_beta	[[0.9819123716041604]]
...	...	...	...	...
399	ENSG00000205560	ENSG00000205560	maf_beta	[[0.02343900249308839]]
400	ENSG00000100288	ENSG00000100288	maf_beta	[[0.5147669611167375]]
401	ENSG00000205559	ENSG00000205559	maf_beta	[[0.23597911440366892]]
402	ENSG00000100299	ENSG00000100299	maf_beta	[[0.0004379433659393861]]
403	ENSG00000079974	ENSG00000079974	maf_beta	[[0.03259277644181413]]

404 rows × 4 columns

Tutorial: Rare Variant Association Testing with cellink

Contents

Tutorial: Rare Variant Association Testing with `cellink`#

Environment Setup#

Load and Prepare Data#

Normalize and Filter Expression Data#

Step 1: Annotate Variants with Functional Information#

Step 2: Burden Testing#

Filtering for Rare Variants#

Custom MAF Weights#

Run burden tests using each annotation individually for weighting#

Step 3: Combine Multiple Tests with ACAT#

Step 4: SKAT#

Tutorial: Rare Variant Association Testing with cellink

Contents

Tutorial: Rare Variant Association Testing with cellink#

Environment Setup#

Load and Prepare Data#

Normalize and Filter Expression Data#

Step 1: Annotate Variants with Functional Information#

Step 2: Burden Testing#

Filtering for Rare Variants#

Custom MAF Weights#

Run burden tests using each annotation individually for weighting#

Step 3: Combine Multiple Tests with ACAT#

Step 4: SKAT#

Tutorial: Rare Variant Association Testing with `cellink`#