cellink.resources.get_dummy_onek1k
- cellink.resources.get_dummy_onek1k(config_path='./cellink/resources/config/dummy_onek1k.yaml', data_home=None, verify_checksum=True)
Download and load the dummy OneK1K dataset.
This function downloads a pre-processed subset of the OneK1K dataset containing:
- Full chromosome 22 genotype data
- 0.1% sample of SNPs from chromosomes 1-21
- ~100 donors (randomly sampled)
- All cell types and expression data preserved
- Gene annotations pre-included (GAnn.start, GAnn.end, GAnn.chrom)
- Only QTL-relevant genes (chr 1-22, within ±1Mb of any SNP)
The dummy dataset is ideal for tutorials, testing, and demonstrations as it is ~100x smaller than the full OneK1K dataset while maintaining the same structure and API. Unlike the full dataset, gene annotations are already included, eliminating the need for pybiomart calls in tutorials.
- Parameters:
config_path (str, default="./cellink/resources/config/dummy_onek1k.yaml") – Path to the YAML configuration file listing the remote dummy dataset file.
data_home (str or None, optional) – Directory where data should be stored. If None, uses the default cellinkdata directory.
verify_checksum (bool, default=True) – If True, verifies the checksum of the downloaded file.
- Return type:
cellink.DonorData
- Returns:
A DonorData object containing preprocessed genotype (G) and expression (C) data, along with kinship and principal component metadata. Gene annotations are already included in dd.C.var.
Examples
>>> from cellink.resources import get_dummy_onek1k
>>> dd = get_dummy_onek1k()
>>> print(dd.shape)  # (n_donors, n_snps, n_cells, n_genes)
>>> # Gene annotations are already included!
>>> print(dd.C.var[[GAnn.start, GAnn.end, GAnn.chrom]].head())
Notes
The dummy dataset maintains the same structure as the full OneK1K dataset:
- dd.G: Genotype data (donors x SNPs)
- dd.C: Single-cell expression data (cells x genes)
- dd.G.obsm["gPCs"]: Genotype principal components
- dd.G.uns["kinship"]: Kinship matrix
- dd.C.var[GAnn.start/end/chrom]: Gene annotations (pre-included!)
The dataset is provided as a single file (.dd.h5 or .dd.zarr) that can be quickly downloaded and loaded without additional preprocessing.
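The verify_checksum option guards against a truncated or corrupted download. This page does not say which hash algorithm or config field cellink uses, so the following is only a minimal sketch of the general idea. It assumes SHA-256, and the helper names file_sha256 and checksum_matches are hypothetical, not part of cellink:

```python
import hashlib


def file_sha256(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks.

    Assumption: SHA-256 is used here for illustration only; the
    algorithm cellink actually applies is not stated in this page.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read fixed-size chunks so large files never sit fully in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(path, expected_hex):
    """Compare a file's digest with an expected hex string (hypothetical helper)."""
    return file_sha256(path) == expected_hex.strip().lower()
```

Reading in chunks keeps memory use flat regardless of file size, which matters for genotype and expression archives.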
Key differences from full dataset:
- Gene annotations are pre-included (no pybiomart needed)
- Only genes on chromosomes 1-22 are included
- Only genes within ±1Mb of any SNP (QTL-relevant)
- Reduces gene count from ~20k to ~5-10k for faster processing
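The "within ±1Mb of any SNP" criterion can be illustrated with a small standalone sketch. This is not the preprocessing code used to build the dummy dataset, which is not shown here. The function name genes_near_snps, the tuple layouts, and the toy coordinates are assumptions made for the example:

```python
from bisect import bisect_left
from collections import defaultdict

WINDOW = 1_000_000  # ±1 Mb, as described above
AUTOSOMES = {str(i) for i in range(1, 23)}  # chromosomes 1-22


def genes_near_snps(genes, snps, window=WINDOW):
    """Return names of autosomal genes with at least one SNP inside
    [start - window, end + window] on the same chromosome.

    genes: iterable of (name, chrom, start, end)  (hypothetical layout)
    snps:  iterable of (chrom, pos)               (hypothetical layout)
    """
    positions = defaultdict(list)
    for chrom, pos in snps:
        positions[chrom].append(pos)
    for chrom in positions:
        positions[chrom].sort()

    kept = []
    for name, chrom, start, end in genes:
        if chrom not in AUTOSOMES:
            continue
        pos = positions.get(chrom, [])
        # First SNP at or beyond the left edge of the window.
        i = bisect_left(pos, start - window)
        if i < len(pos) and pos[i] <= end + window:
            kept.append(name)
    return kept


# Toy data, invented for illustration only.
genes = [
    ("geneA", "22", 20_000_000, 20_050_000),  # SNP 850 kb downstream: kept
    ("geneB", "22", 40_000_000, 40_010_000),  # no SNP within 1 Mb
    ("geneC", "1", 5_000_000, 5_020_000),     # nearest SNP ~2 Mb away
    ("geneD", "X", 100, 200),                 # not on chromosomes 1-22
]
snps = [("22", 20_900_000), ("1", 7_000_000)]

kept = genes_near_snps(genes, snps)
print(kept)  # ['geneA']
```

Sorting SNP positions once per chromosome and using a binary search keeps the check at O(log n) per gene, rather than scanning every SNP for every gene.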
See also
get_onek1k – Load the full OneK1K dataset