Tutorial: Rare Variant Association Testing with cellink#
This tutorial demonstrates how to perform rare variant association testing (RVAT) in single-cell data using the cellink package. We focus on two major types of RVAT: burden tests and the Sequence Kernel Association Test (SKAT). Rare variants can be especially informative for understanding disease biology, but they require careful aggregation and testing approaches due to their low frequency.
We’ll walk through how to:
Prepare single-cell and genotype data using
DonorDataobjects.Apply burden tests with multiple variant annotation-based weights.
Combine p-values from multiple tests using ACAT.
Perform SKAT for gene-level rare variant analysis.
This tutorial uses a subset of data from the 1k1k project, filtered to the CD8 Naive T cell type, and focuses on chromosome 22 for runtime feasibility. It builds on earlier tutorials that cover pseudobulk expression and variant annotation, which should be reviewed for full context.
Note: This notebook assumes that you have already run the variant annotation step as described in the Variant Annotation Tutorial.
Prerequisites: To run this notebook, you need to install the required dependencies for RVAT. Use the following command to install the necessary dependencies:
pip install -e "cellink[rvat, datasets]"
conda install -c conda-forge chiscore
Environment Setup#
We begin by importing necessary libraries and defining key parameters for our analysis. cellink provides utilities that extend AnnData to handle both donor-level genotype data and single-cell RNA-seq data efficiently.
%load_ext autoreload
%autoreload 2
import pandas as pd
from pathlib import Path
import warnings
import anndata as ad
import scanpy as sc
import dask.array as da
import numpy as np
from tqdm.auto import tqdm
import cellink as cl
from cellink._core import DAnn
from cellink.tl._rvat import run_burden_test, run_skat_test, beta_weighting
from cellink.utils import column_normalize, gaussianize
from cellink.at.acat import acat_test
from cellink.resources import get_dummy_onek1k
DATA = Path(cl.__file__).parent.parent.parent / "docs/tutorials/data"
n_gpcs = 20
n_epcs = 15
batch_e_pcs_n_top_genes = 2000
chrom = 22
cis_window = 100_000
cell_type = "CD4 Naive"
pb_gex_key = f"PB_{cell_type}" # pseudobulk expression in dd.G.obsm[key_added]
original_donor_col = "donor_id"
min_percent_donors_expressed = 0.1
celltype_key = "predicted.celltype.l2"
do_debug = False
Load and Prepare Data#
Here, we load a prepared dataset (onek1k) that includes genotype and expression information from human donors. (This is a subset of the full OneK1K dataset, which can be downloaded, and prepared using get_onek1k()) We also extract gene annotations using Ensembl via pybiomart, which are essential for defining cis-windows during eQTL analysis.
dd = get_dummy_onek1k(config_path="../../src/cellink/resources/config/dummy_onek1k.yaml", verify_checksum=False)
dd
[2026-04-02 08:10:16,307] INFO:root: /home/dnanexus/cellink_data/dummy_onek1k/dummy_onek1k.dd.h5 already exists
[2026-04-02 08:10:16,308] WARNING:root: No checksum provided, skipping verification
[2026-04-02 08:10:18,425] INFO:root: Loaded dummy OneK1K dataset: (100, 146939, 125366, 34073)
╔═ DonorData(n_donors=100, n_cells_per_donor=[613-2,731], donor_id='donor_id') ═══════════════════════════════╗ ║ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ║ ║ ┃ G (donors) ┃ C (cells) ┃ ║ ║ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ ║ ║ │ AnnData object with n_obs × n_vars = 100 × 146,939 │ View of AnnData object with n_obs × n_vars = │ ║ ║ │ │ 125,366 × 34,073 │ ║ ║ │ var: 'chrom', 'pos', 'a0', 'a1', 'AC', │ obs: 'orig.ident', 'nCount_RNA', │ ║ ║ │ 'AC_Hemi', 'AC_Het', 'AC_Hom', 'AF', 'AN', 'ER2', │ 'nFeature_RNA', 'percent.mt', 'donor_id', │ ║ ║ │ 'ExcHet', 'HWE', 'IMPUTED', 'maf', 'NS', 'R2', │ 'pool_number', 'predicted.celltype.l2', │ ║ ║ │ 'TYPED', 'TYPED_ONLY', 'id', 'id_mask', 'length', │ 'predicted.celltype.l2.score', 'age', │ ║ ║ │ 'quality', 'pos_hg19', 'id_hg19' │ 'organism_ontology_term_id', │ ║ ║ │ │ 'tissue_ontology_term_id', │ ║ ║ │ │ 'assay_ontology_term_id', │ ║ ║ │ │ 'disease_ontology_term_id', │ ║ ║ │ │ 'cell_type_ontology_term_id', │ ║ ║ │ │ 'self_reported_ethnicity_ontology_term_id', │ ║ ║ │ │ 'development_stage_ontology_term_id', │ ║ ║ │ │ 'sex_ontology_term_id', 'is_primary_data', │ ║ ║ │ │ 'suspension_type', 'tissue_type', 'cell_type', │ ║ ║ │ │ 'assay', 'disease', 'organism', 'sex', 'tissue', │ ║ ║ │ │ 'self_reported_ethnicity', 'development_stage', │ ║ ║ │ │ 'observation_joinid' │ ║ ║ │ uns: 'kinship' │ var: 'vst.mean', 'vst.variance', │ ║ ║ │ │ 'vst.variance.expected', │ ║ ║ │ │ 'vst.variance.standardized', 'vst.variable', │ ║ ║ │ │ 'feature_is_filtered', 'feature_name', │ ║ ║ │ │ 'feature_reference', 'feature_biotype', │ ║ ║ │ │ 'feature_length', 'feature_type', 'start', 'end', │ ║ ║ │ │ 'chrom' │ ║ ║ │ obsm: 'gPCs' │ uns: 'cell_type_ontology_term_id_colors', │ ║ ║ │ │ 'citation', 'default_embedding', │ ║ ║ │ │ 'schema_reference', 'schema_version', 'title' │ ║ ║ │ varm: 'filter' │ obsm: 'X_azimuth_spca', 'X_azimuth_umap', │ ║ ║ │ │ 'X_harmony', 'X_pca', 'X_umap' │ ║ ║ │ │ varm: 'PCs' │ ║ ║ └────────────────────────────────────────────────────┴────────────────────────────────────────────────────┘ ║ ╚═════════════════════════════════════════════════════════════════════════════════════════════════════════════╝
"""
def _get_ensembl_gene_id_start_end_chr():
from pybiomart import Server
server = Server(host="http://www.ensembl.org")
dataset = server.marts["ENSEMBL_MART_ENSEMBL"].datasets["hsapiens_gene_ensembl"]
ensembl_gene_id_start_end_chr = dataset.query(
attributes=["ensembl_gene_id", "start_position", "end_position", "chromosome_name"]
)
ensembl_gene_id_start_end_chr = ensembl_gene_id_start_end_chr.set_index("Gene stable ID")
ensembl_gene_id_start_end_chr = ensembl_gene_id_start_end_chr.rename(
columns={
"Gene start (bp)": GAnn.start,
"Gene end (bp)": GAnn.end,
"Chromosome/scaffold name": GAnn.chrom,
}
)
return ensembl_gene_id_start_end_chr
"""
'\ndef _get_ensembl_gene_id_start_end_chr():\n from pybiomart import Server\n\n server = Server(host="http://www.ensembl.org")\n dataset = server.marts["ENSEMBL_MART_ENSEMBL"].datasets["hsapiens_gene_ensembl"]\n ensembl_gene_id_start_end_chr = dataset.query(\n attributes=["ensembl_gene_id", "start_position", "end_position", "chromosome_name"]\n )\n ensembl_gene_id_start_end_chr = ensembl_gene_id_start_end_chr.set_index("Gene stable ID")\n ensembl_gene_id_start_end_chr = ensembl_gene_id_start_end_chr.rename(\n columns={\n "Gene start (bp)": GAnn.start,\n "Gene end (bp)": GAnn.end,\n "Chromosome/scaffold name": GAnn.chrom,\n }\n )\n return ensembl_gene_id_start_end_chr\n'
"""
ensembl_gene_id_start_end_chr = _get_ensembl_gene_id_start_end_chr()
ensembl_gene_id_start_end_chr
"""
'\nensembl_gene_id_start_end_chr = _get_ensembl_gene_id_start_end_chr()\nensembl_gene_id_start_end_chr\n'
"""
dd.C.var = dd.C.var.join(ensembl_gene_id_start_end_chr)
dd.C.obs[DAnn.donor] = dd.C.obs[original_donor_col]
"""
dd.G.obsm["gPCs"] = dd.G.obsm["gPCs"][dd.G.obsm["gPCs"].columns[:n_gpcs]]
Normalize and Filter Expression Data#
We normalize single-cell expression using total-count normalization and log transformation. We then filter the dataset to only include cells of a specific type (CD8 Naive). This step ensures we’re analyzing homogeneous populations, which increases power in eQTL detection.
sc.pp.normalize_total(dd.C)
sc.pp.log1p(dd.C)
sc.pp.normalize_total(dd.C)
# are the expression pcs computed by pseudobulking across all cell types?
mdata = sc.get.aggregate(dd.C, by=DAnn.donor, func="mean")
mdata.X = mdata.layers.pop("mean")
sc.pp.highly_variable_genes(mdata, n_top_genes=batch_e_pcs_n_top_genes)
sc.tl.pca(mdata, n_comps=n_epcs)
dd.G.obsm["ePCs"] = mdata[dd.G.obs_names].obsm["X_pca"]
dd = dd[..., dd.C.obs[celltype_key] == cell_type, :].copy()
dd
╔═ DonorData(n_donors=981, n_cells_per_donor=[6-927], donor_id='donor_id') ═══════════════════════════════════╗ ║ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ║ ║ ┃ G (donors) ┃ C (cells) ┃ ║ ║ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ ║ ║ │ AnnData object with n_obs × n_vars = 981 × 136,776 │ AnnData object with n_obs × n_vars = 259,012 × │ ║ ║ │ │ 36,469 │ ║ ║ │ var: 'chrom', 'pos', 'a0', 'a1', 'AC', │ obs: 'orig.ident', 'nCount_RNA', │ ║ ║ │ 'AC_Hemi', 'AC_Het', 'AC_Hom', 'AF', 'AN', 'ER2', │ 'nFeature_RNA', 'percent.mt', 'donor_id', │ ║ ║ │ 'ExcHet', 'HWE', 'IMPUTED', 'maf', 'NS', 'R2', │ 'pool_number', 'predicted.celltype.l2', │ ║ ║ │ 'TYPED', 'TYPED_ONLY', 'id', 'id_mask', 'length', │ 'predicted.celltype.l2.score', 'age', │ ║ ║ │ 'quality', 'pos_hg19', 'id_hg19' │ 'organism_ontology_term_id', │ ║ ║ │ │ 'tissue_ontology_term_id', │ ║ ║ │ │ 'assay_ontology_term_id', │ ║ ║ │ │ 'disease_ontology_term_id', │ ║ ║ │ │ 'cell_type_ontology_term_id', │ ║ ║ │ │ 'self_reported_ethnicity_ontology_term_id', │ ║ ║ │ │ 'development_stage_ontology_term_id', │ ║ ║ │ │ 'sex_ontology_term_id', 'is_primary_data', │ ║ ║ │ │ 'suspension_type', 'tissue_type', 'cell_type', │ ║ ║ │ │ 'assay', 'disease', 'organism', 'sex', 'tissue', │ ║ ║ │ │ 'self_reported_ethnicity', 'development_stage', │ ║ ║ │ │ 'observation_joinid' │ ║ ║ │ uns: 'kinship' │ var: 'vst.mean', 'vst.variance', │ ║ ║ │ │ 'vst.variance.expected', │ ║ ║ │ │ 'vst.variance.standardized', 'vst.variable', │ ║ ║ │ │ 'feature_is_filtered', 'feature_name', │ ║ ║ │ │ 'feature_reference', 'feature_biotype', │ ║ ║ │ │ 'feature_length', 'feature_type', 'start', 'end', │ ║ ║ │ │ 'chrom' │ ║ ║ │ obsm: 'gPCs', 'ePCs' │ uns: 'cell_type_ontology_term_id_colors', │ ║ ║ │ │ 'citation', 'default_embedding', │ ║ ║ │ │ 'schema_reference', 'schema_version', 'title', │ ║ ║ │ │ 'log1p' │ ║ ║ │ varm: 'filter' │ obsm: 'X_azimuth_spca', 'X_azimuth_umap', │ ║ ║ │ │ 'X_harmony', 'X_pca', 'X_umap' │ ║ ║ │ │ varm: 'PCs' │ ║ ║ └────────────────────────────────────────────────────┴────────────────────────────────────────────────────┘ ║ ╚═════════════════════════════════════════════════════════════════════════════════════════════════════════════╝
dd.aggregate(key_added=pb_gex_key, sync_var=True, verbose=True)
dd.aggregate(obs=["sex", "age"], func="first", add_to_obs=True)
dd
╔═ DonorData(n_donors=981, n_cells_per_donor=[6-927], donor_id='donor_id') ═══════════════════════════════════╗ ║ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ║ ║ ┃ G (donors) ┃ C (cells) ┃ ║ ║ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ ║ ║ │ AnnData object with n_obs × n_vars = 981 × 136,776 │ AnnData object with n_obs × n_vars = 259,012 × │ ║ ║ │ │ 36,469 │ ║ ║ │ obs: 'sex', 'age' │ obs: 'orig.ident', 'nCount_RNA', │ ║ ║ │ │ 'nFeature_RNA', 'percent.mt', 'donor_id', │ ║ ║ │ │ 'pool_number', 'predicted.celltype.l2', │ ║ ║ │ │ 'predicted.celltype.l2.score', 'age', │ ║ ║ │ │ 'organism_ontology_term_id', │ ║ ║ │ │ 'tissue_ontology_term_id', │ ║ ║ │ │ 'assay_ontology_term_id', │ ║ ║ │ │ 'disease_ontology_term_id', │ ║ ║ │ │ 'cell_type_ontology_term_id', │ ║ ║ │ │ 'self_reported_ethnicity_ontology_term_id', │ ║ ║ │ │ 'development_stage_ontology_term_id', │ ║ ║ │ │ 'sex_ontology_term_id', 'is_primary_data', │ ║ ║ │ │ 'suspension_type', 'tissue_type', 'cell_type', │ ║ ║ │ │ 'assay', 'disease', 'organism', 'sex', 'tissue', │ ║ ║ │ │ 'self_reported_ethnicity', 'development_stage', │ ║ ║ │ │ 'observation_joinid' │ ║ ║ │ var: 'chrom', 'pos', 'a0', 'a1', 'AC', │ var: 'vst.mean', 'vst.variance', │ ║ ║ │ 'AC_Hemi', 'AC_Het', 'AC_Hom', 'AF', 'AN', 'ER2', │ 'vst.variance.expected', │ ║ ║ │ 'ExcHet', 'HWE', 'IMPUTED', 'maf', 'NS', 'R2', │ 'vst.variance.standardized', 'vst.variable', │ ║ ║ │ 'TYPED', 'TYPED_ONLY', 'id', 'id_mask', 'length', │ 'feature_is_filtered', 'feature_name', │ ║ ║ │ 'quality', 'pos_hg19', 'id_hg19' │ 'feature_reference', 'feature_biotype', │ ║ ║ │ │ 'feature_length', 'feature_type', 'start', 'end', │ ║ ║ │ │ 'chrom' │ ║ ║ │ uns: 'kinship' │ uns: 'cell_type_ontology_term_id_colors', │ ║ ║ │ │ 'citation', 'default_embedding', │ ║ ║ │ │ 'schema_reference', 'schema_version', 'title', │ ║ ║ │ │ 'log1p' │ ║ ║ │ obsm: 'gPCs', 'ePCs', 'PB_CD4 Naive' │ obsm: 'X_azimuth_spca', 'X_azimuth_umap', │ ║ ║ │ │ 'X_harmony', 'X_pca', 'X_umap' │ ║ ║ │ varm: 'filter' │ varm: 'PCs' │ ║ ║ └────────────────────────────────────────────────────┴────────────────────────────────────────────────────┘ ║ ╚═════════════════════════════════════════════════════════════════════════════════════════════════════════════╝
print(f"{pb_gex_key} shape:", dd.G.obsm[pb_gex_key].shape)
print("dd.shape:", dd.shape)
keep_genes = ((dd.G.obsm[pb_gex_key] > 0).mean(axis=0) >= min_percent_donors_expressed).values
dd = dd[..., keep_genes]
print("after filtering")
print(f"{pb_gex_key} shape:", dd.G.obsm[pb_gex_key].shape)
print("dd.shape:", dd.shape)
PB_CD4 Naive shape: (981, 36469)
dd.shape: (981, 136776, 259012, 36469)
after filtering
PB_CD4 Naive shape: (981, 15770)
dd.shape: (981, 136776, 259012, 15770)
# alternative to dd[:, dd.G.var.chrom == str(chrom), :, dd.C.var.chrom == str(chrom)]
dd = dd.sel(G_var=dd.G.var.chrom == str(chrom), C_var=dd.C.var.chrom == str(chrom)).copy()
dd
╔═ DonorData(n_donors=981, n_cells_per_donor=[6-927], donor_id='donor_id') ═══════════════════════════════════╗ ║ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ║ ║ ┃ G (donors) ┃ C (cells) ┃ ║ ║ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ ║ ║ │ AnnData object with n_obs × n_vars = 981 × 136,776 │ AnnData object with n_obs × n_vars = 259,012 × 404 │ ║ ║ │ obs: 'sex', 'age' │ obs: 'orig.ident', 'nCount_RNA', │ ║ ║ │ │ 'nFeature_RNA', 'percent.mt', 'donor_id', │ ║ ║ │ │ 'pool_number', 'predicted.celltype.l2', │ ║ ║ │ │ 'predicted.celltype.l2.score', 'age', │ ║ ║ │ │ 'organism_ontology_term_id', │ ║ ║ │ │ 'tissue_ontology_term_id', │ ║ ║ │ │ 'assay_ontology_term_id', │ ║ ║ │ │ 'disease_ontology_term_id', │ ║ ║ │ │ 'cell_type_ontology_term_id', │ ║ ║ │ │ 'self_reported_ethnicity_ontology_term_id', │ ║ ║ │ │ 'development_stage_ontology_term_id', │ ║ ║ │ │ 'sex_ontology_term_id', 'is_primary_data', │ ║ ║ │ │ 'suspension_type', 'tissue_type', 'cell_type', │ ║ ║ │ │ 'assay', 'disease', 'organism', 'sex', 'tissue', │ ║ ║ │ │ 'self_reported_ethnicity', 'development_stage', │ ║ ║ │ │ 'observation_joinid' │ ║ ║ │ var: 'chrom', 'pos', 'a0', 'a1', 'AC', │ var: 'vst.mean', 'vst.variance', │ ║ ║ │ 'AC_Hemi', 'AC_Het', 'AC_Hom', 'AF', 'AN', 'ER2', │ 'vst.variance.expected', │ ║ ║ │ 'ExcHet', 'HWE', 'IMPUTED', 'maf', 'NS', 'R2', │ 'vst.variance.standardized', 'vst.variable', │ ║ ║ │ 'TYPED', 'TYPED_ONLY', 'id', 'id_mask', 'length', │ 'feature_is_filtered', 'feature_name', │ ║ ║ │ 'quality', 'pos_hg19', 'id_hg19' │ 'feature_reference', 'feature_biotype', │ ║ ║ │ │ 'feature_length', 'feature_type', 'start', 'end', │ ║ ║ │ │ 'chrom' │ ║ ║ │ uns: 'kinship' │ uns: 'cell_type_ontology_term_id_colors', │ ║ ║ │ │ 'citation', 'default_embedding', │ ║ ║ │ │ 'schema_reference', 'schema_version', 'title', │ ║ ║ │ │ 'log1p' │ ║ ║ │ obsm: 'gPCs', 'ePCs', 'PB_CD4 Naive' │ obsm: 'X_azimuth_spca', 'X_azimuth_umap', │ ║ ║ │ │ 'X_harmony', 'X_pca', 'X_umap' │ ║ ║ │ varm: 'filter' │ varm: 'PCs' │ ║ ║ └────────────────────────────────────────────────────┴────────────────────────────────────────────────────┘ ║ ╚═════════════════════════════════════════════════════════════════════════════════════════════════════════════╝
Step 1: Annotate Variants with Functional Information#
To inform the weighting in burden tests, we annotate variants using the VEP tool. These annotations include predicted functional consequences (e.g., missense, stop gained), CADD scores, and distance to gene transcription start sites (TSS).
This information allows biologically meaningful prioritization of variants when aggregating their effects.
vep_annotation_file = DATA / "variant_annotation/variants_vep_annotated.txt"
cl.tl.add_vep_annos_to_gdata(vep_anno_file=vep_annotation_file, gdata=dd.G, dummy_consequence=True)
dd.G.uns["variant_annotation_vep"]
| Consequence_3_prime_UTR_variant | Consequence_5_prime_UTR_variant | Consequence_NMD_transcript_variant | Consequence_coding_sequence_variant | Consequence_downstream_gene_variant | Consequence_frameshift_variant | Consequence_inframe_deletion | Consequence_inframe_insertion | Consequence_intergenic_variant | Consequence_intron_variant | ... | Codons | cDNA_position | gnomADe_ASJ_AF | CDS_position | gnomADe_AF | IMPACT | BIOTYPE | CLIN_SIG | Protein_position | PHENO | |||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| snp_id | gene_id | transcript_id | |||||||||||||||||||||
| 22_16388891_G_A | ENSG00000230643 | ENST00000447704 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | - | - | NaN | - | NaN | MODIFIER | unprocessed_pseudogene | - | - | - |
| 22_16388968_C_T | ENSG00000230643 | ENST00000447704 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | - | - | NaN | - | NaN | MODIFIER | unprocessed_pseudogene | - | - | - |
| 22_16389525_A_G | ENSG00000230643 | ENST00000447704 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | - | 78/118 | NaN | - | NaN | MODIFIER | unprocessed_pseudogene | - | - | - |
| 22_16390411_G_A | ENSG00000230643 | ENST00000447704 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | - | - | NaN | - | NaN | MODIFIER | unprocessed_pseudogene | - | - | - |
| 22_16391555_G_C | ENSG00000230643 | ENST00000447704 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | - | - | NaN | - | NaN | MODIFIER | unprocessed_pseudogene | - | - | - |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 22_50796327_CCA_C | ENSG00000100239 | ENST00000395741 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | - | - | NaN | - | NaN | MODIFIER | protein_coding | - | - | - |
| 22_50798021_A_G | ENSG00000100239 | ENST00000395741 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | - | - | NaN | - | NaN | MODIFIER | protein_coding | - | - | - |
| 22_50798635_T_C | ENSG00000100239 | ENST00000395741 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | - | - | NaN | - | NaN | MODIFIER | protein_coding | - | - | - |
| 22_50799821_A_C | ENSG00000100239 | ENST00000395741 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | - | - | NaN | - | NaN | MODIFIER | protein_coding | - | - | - |
| 22_50802428_C_G | ENSG00000100239 | ENST00000395741 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | - | - | NaN | - | NaN | MODIFIER | protein_coding | - | - | - |
169322 rows × 57 columns
cl.tl.aggregate_annotations_for_varm(
dd.G, "variant_annotation_vep", agg_type="first", return_data=True
) # TODO change agg type
| gene_id | transcript_id | Consequence_3_prime_UTR_variant | Consequence_5_prime_UTR_variant | Consequence_NMD_transcript_variant | Consequence_coding_sequence_variant | Consequence_downstream_gene_variant | Consequence_frameshift_variant | Consequence_inframe_deletion | Consequence_inframe_insertion | ... | Codons | cDNA_position | gnomADe_ASJ_AF | CDS_position | gnomADe_AF | IMPACT | BIOTYPE | CLIN_SIG | Protein_position | PHENO | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| snp_id | |||||||||||||||||||||
| 22_16388891_G_A | ENSG00000230643 | ENST00000447704 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | - | - | NaN | - | NaN | MODIFIER | unprocessed_pseudogene | - | - | - |
| 22_16388968_C_T | ENSG00000230643 | ENST00000447704 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | - | - | NaN | - | NaN | MODIFIER | unprocessed_pseudogene | - | - | - |
| 22_16389525_A_G | ENSG00000230643 | ENST00000447704 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | - | 78/118 | NaN | - | NaN | MODIFIER | unprocessed_pseudogene | - | - | - |
| 22_16390411_G_A | ENSG00000230643 | ENST00000447704 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | - | - | NaN | - | NaN | MODIFIER | unprocessed_pseudogene | - | - | - |
| 22_16391555_G_C | ENSG00000230643 | ENST00000447704 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | - | - | NaN | - | NaN | MODIFIER | unprocessed_pseudogene | - | - | - |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 22_50796327_CCA_C | ENSG00000100239 | ENST00000395741 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | - | - | NaN | - | NaN | MODIFIER | protein_coding | - | - | - |
| 22_50798021_A_G | ENSG00000100239 | ENST00000395741 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | - | - | NaN | - | NaN | MODIFIER | protein_coding | - | - | - |
| 22_50798635_T_C | ENSG00000100239 | ENST00000395741 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | - | - | NaN | - | NaN | MODIFIER | protein_coding | - | - | - |
| 22_50799821_A_C | ENSG00000100239 | ENST00000395741 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | - | - | NaN | - | NaN | MODIFIER | protein_coding | - | - | - |
| 22_50802428_C_G | ENSG00000100239 | ENST00000395741 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | - | - | NaN | - | NaN | MODIFIER | protein_coding | - | - | - |
136776 rows × 59 columns
dd.G.varm["variant_annotation"].columns
Index(['gene_id', 'transcript_id', 'Consequence_3_prime_UTR_variant',
'Consequence_5_prime_UTR_variant', 'Consequence_NMD_transcript_variant',
'Consequence_coding_sequence_variant',
'Consequence_downstream_gene_variant', 'Consequence_frameshift_variant',
'Consequence_inframe_deletion', 'Consequence_inframe_insertion',
'Consequence_intergenic_variant', 'Consequence_intron_variant',
'Consequence_mature_miRNA_variant', 'Consequence_missense_variant',
'Consequence_non_coding_transcript_exon_variant',
'Consequence_non_coding_transcript_variant',
'Consequence_protein_altering_variant',
'Consequence_splice_acceptor_variant',
'Consequence_splice_donor_5th_base_variant',
'Consequence_splice_donor_region_variant',
'Consequence_splice_donor_variant',
'Consequence_splice_polypyrimidine_tract_variant',
'Consequence_splice_region_variant', 'Consequence_start_lost',
'Consequence_stop_gained', 'Consequence_stop_lost',
'Consequence_stop_retained_variant', 'Consequence_synonymous_variant',
'Consequence_upstream_gene_variant', 'gnomADe_AMR_AF', 'DISTANCE',
'CANONICAL', 'STRAND', 'CADD_RAW', 'ENSP', 'gnomADe_AFR_AF',
'gnomADe_EAS_AF', 'SOMATIC', 'TSSDistance', 'gnomADe_SAS_AF', 'SIFT',
'gnomADe_FIN_AF', 'gnomADe_OTH_AF', 'PolyPhen', 'FLAGS', 'Amino_acids',
'CADD_PHRED', 'Existing_variation', 'gnomADe_NFE_AF', 'Codons',
'cDNA_position', 'gnomADe_ASJ_AF', 'CDS_position', 'gnomADe_AF',
'IMPACT', 'BIOTYPE', 'CLIN_SIG', 'Protein_position', 'PHENO'],
dtype='object')
dd.G.uns["variant_annotation_vep"]["CADD_RAW"].describe()
count 32966.000000
mean 0.122248
std 0.542875
min -1.705125
25% -0.115726
50% 0.023902
75% 0.205452
max 9.936896
Name: CADD_RAW, dtype: float64
Step 2: Burden Testing#
We now perform burden tests, which evaluate whether the aggregate effect of rare variants within a gene is associated with a phenotype — in this case, gene expression in CD8 Naive cells. We use different weighting schemes allow us to prioritize variants differently based on functional scores or allele frequency, including:
CADD_RAW: Raw CADD scores.
maf_beta: Beta weighting based on minor allele frequency (MAF).
tss_distance: Distance to the transcription start site (TSS).
tss_distance_exp: Exponential weighting based on TSS distance.
burden_agg_fct = "sum"
run_lrt = True
annotation_cols = ["CADD_RAW", "maf_beta", "tss_distance", "tss_distance_exp"]
rare_maf_threshold = 0.05
Filtering for Rare Variants#
Burden tests focus on rare variants—typically those with minor allele frequency (MAF) below 0.01 or 0.05. We filter our genotype data accordingly to isolate variants in this frequency range.
dd = dd.sel(G_var=dd.G.var.maf < rare_maf_threshold).copy()
dd
╔═ DonorData(n_donors=981, n_cells_per_donor=[6-927], donor_id='donor_id') ═══════════════════════════════════╗ ║ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ║ ║ ┃ G (donors) ┃ C (cells) ┃ ║ ║ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ ║ ║ │ AnnData object with n_obs × n_vars = 981 × 39,454 │ AnnData object with n_obs × n_vars = 259,012 × 404 │ ║ ║ │ obs: 'sex', 'age' │ obs: 'orig.ident', 'nCount_RNA', │ ║ ║ │ │ 'nFeature_RNA', 'percent.mt', 'donor_id', │ ║ ║ │ │ 'pool_number', 'predicted.celltype.l2', │ ║ ║ │ │ 'predicted.celltype.l2.score', 'age', │ ║ ║ │ │ 'organism_ontology_term_id', │ ║ ║ │ │ 'tissue_ontology_term_id', │ ║ ║ │ │ 'assay_ontology_term_id', │ ║ ║ │ │ 'disease_ontology_term_id', │ ║ ║ │ │ 'cell_type_ontology_term_id', │ ║ ║ │ │ 'self_reported_ethnicity_ontology_term_id', │ ║ ║ │ │ 'development_stage_ontology_term_id', │ ║ ║ │ │ 'sex_ontology_term_id', 'is_primary_data', │ ║ ║ │ │ 'suspension_type', 'tissue_type', 'cell_type', │ ║ ║ │ │ 'assay', 'disease', 'organism', 'sex', 'tissue', │ ║ ║ │ │ 'self_reported_ethnicity', 'development_stage', │ ║ ║ │ │ 'observation_joinid' │ ║ ║ │ var: 'chrom', 'pos', 'a0', 'a1', 'AC', │ var: 'vst.mean', 'vst.variance', │ ║ ║ │ 'AC_Hemi', 'AC_Het', 'AC_Hom', 'AF', 'AN', 'ER2', │ 'vst.variance.expected', │ ║ ║ │ 'ExcHet', 'HWE', 'IMPUTED', 'maf', 'NS', 'R2', │ 'vst.variance.standardized', 'vst.variable', │ ║ ║ │ 'TYPED', 'TYPED_ONLY', 'id', 'id_mask', 'length', │ 'feature_is_filtered', 'feature_name', │ ║ ║ │ 'quality', 'pos_hg19', 'id_hg19' │ 'feature_reference', 'feature_biotype', │ ║ ║ │ │ 'feature_length', 'feature_type', 'start', 'end', │ ║ ║ │ │ 'chrom' │ ║ ║ │ uns: 'kinship', 'variant_annotation_vep' │ uns: 'cell_type_ontology_term_id_colors', │ ║ ║ │ │ 'citation', 'default_embedding', │ ║ ║ │ │ 'schema_reference', 'schema_version', 'title', │ ║ ║ │ │ 'log1p' │ ║ ║ │ obsm: 'gPCs', 'ePCs', 'PB_CD4 Naive' │ obsm: 'X_azimuth_spca', 'X_azimuth_umap', │ ║ ║ │ │ 'X_harmony', 'X_pca', 'X_umap' │ ║ ║ │ varm: 'filter', 'variant_annotation' │ varm: 'PCs' │ ║ ║ └────────────────────────────────────────────────────┴────────────────────────────────────────────────────┘ ║ ╚═════════════════════════════════════════════════════════════════════════════════════════════════════════════╝
Custom MAF Weights#
We add custom MAF weights commonly used in burden tests, such as Beta(MAF, 1, 25). These weights prioritize rarer variants in the analysis. TSS distance weight as used in the SAIGE-QTL paper are added manually for each gene
dd.G.varm["variant_annotation"]["maf_beta"] = beta_weighting(dd.G.var["maf"])
Run burden tests using each annotation individually for weighting#
Next, we specify the covariate matrix F, which includes sex, age, genetic PCs, and expression PCs. This controls for potential confounders in the association analysis.
We then iterate over all genes on chromosome 22, define a cis-window of ±500kb, and run a burden test using each of the annotation-based weighting schemes.
# This specifies covariates/fixed effects
F = np.concatenate(
[
np.ones((dd.shape[0], 1)),
dd.G.obs[["sex"]].values - 1,
dd.G.obs[["age"]].values,
dd.G.obsm["gPCs"].values,
dd.G.obsm["ePCs"],
],
axis=1,
).astype(np.float64)
F[:, 2:] = column_normalize(F[:, 2:])
results = []
if isinstance(dd.G.X, da.Array | ad._core.views.DaskArrayView):
if dd.G.is_view:
dd._G = dd._G.copy()
dd.G.X = dd.G.X.compute()
if do_debug:
warnings.filterwarnings("ignore", category=RuntimeWarning)
for gene, row in tqdm(dd.C.var.iterrows(), total=dd.shape[3]):
Y = gaussianize(dd.G.obsm[pb_gex_key][[gene]].values.astype(np.float64) + 1e-5 * np.random.randn(dd.shape[0], 1))
start = max(0, row.start - cis_window)
end = row.end + cis_window
_G = dd.G[:, (dd.G.var.pos < end)]
_G = _G[:, (_G.var.pos > start)]
_G = _G[:, (_G.X.std(0) != 0)]
_G = _G.copy()
# TODO make strand aware
_G.varm["variant_annotation"]["tss_distance"] = np.abs(row.start - _G.var["pos"])
_G.varm["variant_annotation"]["tss_distance_exp"] = np.exp(-1e-5 * _G.varm["variant_annotation"]["tss_distance"])
rdf = run_burden_test(
_G, Y, F, gene, annotation_cols=annotation_cols, burden_agg_fct=burden_agg_fct, run_lrt=run_lrt
)
results.append(rdf)
rdf = pd.concat(results)
rdf
| burden_gene | egene | weight_col | burden_agg_fct | pv | beta | betaste | lrt | |
|---|---|---|---|---|---|---|---|---|
| 0 | ENSG00000206195 | ENSG00000206195 | CADD_RAW | sum | NaN | NaN | NaN | NaN |
| 1 | ENSG00000206195 | ENSG00000206195 | maf_beta | sum | NaN | NaN | NaN | NaN |
| 2 | ENSG00000206195 | ENSG00000206195 | tss_distance | sum | NaN | NaN | NaN | NaN |
| 3 | ENSG00000206195 | ENSG00000206195 | tss_distance_exp | sum | NaN | NaN | NaN | NaN |
| 0 | ENSG00000177663 | ENSG00000177663 | CADD_RAW | sum | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3 | ENSG00000100299 | ENSG00000100299 | tss_distance_exp | sum | 0.024520 | -3.623585e-03 | 1.611285e-03 | 5.057453 |
| 0 | ENSG00000079974 | ENSG00000079974 | CADD_RAW | sum | NaN | NaN | NaN | NaN |
| 1 | ENSG00000079974 | ENSG00000079974 | maf_beta | sum | 0.343694 | 6.644163e-04 | 7.016820e-04 | 0.896602 |
| 2 | ENSG00000079974 | ENSG00000079974 | tss_distance | sum | 0.977858 | 4.148339e-09 | 1.494648e-07 | 0.000770 |
| 3 | ENSG00000079974 | ENSG00000079974 | tss_distance_exp | sum | 0.244656 | 1.876147e-02 | 1.612602e-02 | 1.353567 |
1616 rows × 8 columns
Step 3: Combine Multiple Tests with ACAT#
To summarize evidence from multiple annotations, we use the ACAT (Aggregated Cauchy Association Test). This meta-analysis method combines p-values from the burden tests per gene into a single statistic, improving power and interpretability.
combined = rdf.dropna(subset=["pv"]).groupby("egene")["pv"].agg(lambda x: acat_test(x.values)).reset_index()
combined.sort_values("pv")
| egene | pv | |
|---|---|---|
| 84 | ENSG00000100219 | 2.287059e-14 |
| 290 | ENSG00000212939 | 2.274714e-11 |
| 75 | ENSG00000100154 | 8.218144e-05 |
| 186 | ENSG00000133460 | 9.702130e-05 |
| 198 | ENSG00000167077 | 1.625786e-04 |
| ... | ... | ... |
| 158 | ENSG00000100427 | 9.859188e-01 |
| 338 | ENSG00000244625 | 9.891262e-01 |
| 107 | ENSG00000100294 | 9.921173e-01 |
| 44 | ENSG00000100030 | 9.949769e-01 |
| 27 | ENSG00000099957 | 9.957546e-01 |
402 rows × 2 columns
Step 4: SKAT#
As an alternative to burden testing, we apply the Sequence Kernel Association Test (SKAT), which models the variance component of aggregated rare variants rather than assuming a unidirectional effect. SKAT is more robust when variant effects differ in direction or magnitude.
Currently, cellink supports SKAT with the standard Beta(MAF, 1, 25) weighting scheme.
import logging
logger = logging.getLogger()
logger.setLevel(logging.WARNING) # Suppress INFO and DEBUG esp. from SKAT Test
results = []
for gene, row in tqdm(dd.C.var.iterrows(), total=dd.shape[3]):
Y = gaussianize(dd.G.obsm[pb_gex_key][[gene]].values.astype(np.float64) + 1e-5 * np.random.randn(dd.shape[0], 1))
start = max(0, row.start - cis_window)
end = row.end + cis_window
_G = dd.G[:, (dd.G.var.pos < end)]
_G = _G[:, (_G.var.pos > start)]
_G = _G[:, (_G.X.std(0) != 0)]
rdict = run_skat_test(_G, Y, F, gene)
results.append(rdict)
rdf = pd.DataFrame(results)
rdf
| burden_gene | egene | weight_col | pv | |
|---|---|---|---|---|
| 0 | ENSG00000206195 | ENSG00000206195 | maf_beta | [[1.0]] |
| 1 | ENSG00000177663 | ENSG00000177663 | maf_beta | [[0.7540438404596421]] |
| 2 | ENSG00000069998 | ENSG00000069998 | maf_beta | [[0.03897879062889942]] |
| 3 | ENSG00000185837 | ENSG00000185837 | maf_beta | [[0.8117376585909591]] |
| 4 | ENSG00000093072 | ENSG00000093072 | maf_beta | [[0.9819123716041604]] |
| ... | ... | ... | ... | ... |
| 399 | ENSG00000205560 | ENSG00000205560 | maf_beta | [[0.02343900249308839]] |
| 400 | ENSG00000100288 | ENSG00000100288 | maf_beta | [[0.5147669611167375]] |
| 401 | ENSG00000205559 | ENSG00000205559 | maf_beta | [[0.23597911440366892]] |
| 402 | ENSG00000100299 | ENSG00000100299 | maf_beta | [[0.0004379433659393861]] |
| 403 | ENSG00000079974 | ENSG00000079974 | maf_beta | [[0.03259277644181413]] |
404 rows × 4 columns
rdf
| burden_gene | egene | weight_col | pv | |
|---|---|---|---|---|
| 0 | ENSG00000206195 | ENSG00000206195 | maf_beta | [[1.0]] |
| 1 | ENSG00000177663 | ENSG00000177663 | maf_beta | [[0.7540438404596421]] |
| 2 | ENSG00000069998 | ENSG00000069998 | maf_beta | [[0.03897879062889942]] |
| 3 | ENSG00000185837 | ENSG00000185837 | maf_beta | [[0.8117376585909591]] |
| 4 | ENSG00000093072 | ENSG00000093072 | maf_beta | [[0.9819123716041604]] |
| ... | ... | ... | ... | ... |
| 399 | ENSG00000205560 | ENSG00000205560 | maf_beta | [[0.02343900249308839]] |
| 400 | ENSG00000100288 | ENSG00000100288 | maf_beta | [[0.5147669611167375]] |
| 401 | ENSG00000205559 | ENSG00000205559 | maf_beta | [[0.23597911440366892]] |
| 402 | ENSG00000100299 | ENSG00000100299 | maf_beta | [[0.0004379433659393861]] |
| 403 | ENSG00000079974 | ENSG00000079974 | maf_beta | [[0.03259277644181413]] |
404 rows × 4 columns