cellink.resources.get_onek1k
- cellink.resources.get_onek1k(config_path='./cellink/resources/config/onek1k.yaml', data_home=None, verify_checksum=True, only_download=False, rerun_preprocessing=False, worker_processes=0, max_memory=None, variants_chunk_size=32000, samples_chunk_size=981)
Download and preprocess the OneK1K genotype and expression dataset.
This function downloads the genotype and expression files listed in a YAML configuration, optionally verifies checksums, converts VCF files to Zarr format, performs PLINK preprocessing (filtering, pruning, and kinship computation), and loads the dataset into a DonorData object. Genotype preprocessing requires PLINK and is only performed if preprocessed outputs are not already present.
Additionally, it:
- Performs liftover to hg19 coordinates for variant positions.
- Computes donor principal components (gPCs) from genotype data.
- Aligns expression data from CellxGene to the genotype data.
- Encodes donor metadata such as sex and age.
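Because downloading and preprocessing are controlled by separate flags, the two phases can be run as separate calls. The following is a minimal sketch, not the library's recommended workflow: `fetch_then_load` is a hypothetical helper, and it assumes that a second call reuses files already present in `data_home` rather than downloading them again. The loader is passed in as an argument so the helper can be exercised without the real dataset.

```python
def fetch_then_load(get_onek1k, data_home=None):
    """Download first, then preprocess and load in a second call.

    `get_onek1k` is expected to behave like cellink.resources.get_onek1k.
    Assumption: the second call finds the downloaded files in `data_home`
    and does not fetch them again.
    """
    # Phase 1: fetch the remote files only, skipping the data conversion.
    get_onek1k(data_home=data_home, only_download=True)
    # Phase 2: convert, preprocess (requires PLINK), and load the dataset.
    return get_onek1k(data_home=data_home)


# Usage (requires cellink, PLINK, and network access):
#   from cellink.resources import get_onek1k
#   dd = fetch_then_load(get_onek1k)
```

Splitting the phases this way can be convenient when the download happens on a machine with network access and the preprocessing on a compute node.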
- Parameters:
config_path (str, default="./cellink/resources/config/onek1k.yaml") – Path to the YAML configuration file listing remote genotype and expression files.
data_home (str or None, optional) – Directory where data should be stored. If None, uses the default cellink data directory.
verify_checksum (bool, default=True) – If True, verifies the checksum of downloaded files.
only_download (bool, default=False) – If True, only downloads the data without running the data conversion.
rerun_preprocessing (bool, default=False) – If True, re-runs all preprocessing steps even if outputs already exist.
worker_processes (int, default=0) – Number of worker processes for the vcf2zarr explode and encode steps. 0 means use a single process (no parallelism).
max_memory (str or None, optional) – Approximate upper bound on memory usage during the encode step (e.g. "8G", "32G"). If None and worker_processes > 0, vcf2zarr uses its default memory limit. If worker_processes is 0, this argument is ignored, because single-process encoding does not require a memory bound.
variants_chunk_size (int, default=32000) – Chunk size in the variants dimension for the output Zarr store.
samples_chunk_size (int, default=981) – Chunk size in the samples dimension. OneK1K has 981 donors, so the default stores all samples in a single chunk.
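The effect of the two chunk-size parameters can be estimated with simple arithmetic. The sketch below is illustrative only: `chunk_grid` and `chunk_bytes` are hypothetical helpers, the variant count of 1,000,000 is made up, and the byte estimate assumes diploid genotype calls stored as one byte per allele, which may not match the actual Zarr encoding.

```python
import math

N_DONORS = 981  # donor count stated in the samples_chunk_size description


def chunk_grid(n_variants, n_samples=N_DONORS,
               variants_chunk_size=32000, samples_chunk_size=981):
    """Number of chunks along the (variants, samples) dimensions."""
    return (math.ceil(n_variants / variants_chunk_size),
            math.ceil(n_samples / samples_chunk_size))


def chunk_bytes(variants_chunk_size=32000, samples_chunk_size=981,
                ploidy=2, itemsize=1):
    """Uncompressed size of one genotype chunk.

    Assumption: calls are stored per allele with `itemsize` bytes each.
    """
    return variants_chunk_size * samples_chunk_size * ploidy * itemsize


# Hypothetical variant count, for illustration only.
print(chunk_grid(1_000_000))        # (32, 1): all samples in one chunk
print(chunk_grid(1_000_000, samples_chunk_size=100))  # (32, 10)
print(chunk_bytes())                # 62784000 bytes per chunk, uncompressed
```

Under these assumptions a default chunk holds roughly 60 MiB before compression, which gives a rough sense of how the chunk sizes interact with the memory available per worker.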
- Return type:
cellink.DonorData
- Returns:
A DonorData object containing preprocessed genotype (G) and expression (C) data, along with kinship and principal component metadata.
- Raises:
FileNotFoundError – If any required genotype or expression files are missing after download.
RuntimeError – If preprocessing steps (VCF conversion, PLINK operations, or liftover) fail, e.g. due to insufficient memory or invalid chunk sizes.
ValueError – If variant liftover or donor alignment cannot be performed.
EnvironmentError – If PLINK is required for preprocessing but is not available on PATH.
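When handling these exceptions, note that in Python 3 `EnvironmentError` is an alias of `OSError`, and `FileNotFoundError` is a subclass of `OSError`, so the more specific class has to be checked first. The sketch below shows one way to pre-check for PLINK and to tell the documented failures apart. `plink_available` and `classify_failure` are hypothetical helpers, and the executable name `plink` is an assumption (an installation may expose a different name).

```python
import shutil


def plink_available(executable="plink"):
    """Return True if `executable` is found on PATH.

    Assumption: the PLINK binary is named "plink" on this system.
    """
    return shutil.which(executable) is not None


def classify_failure(exc):
    """Map an exception from the documented Raises list to a short label."""
    # FileNotFoundError subclasses OSError, so test it before OSError.
    if isinstance(exc, FileNotFoundError):
        return "missing-files"
    # EnvironmentError is the same class as OSError in Python 3.
    if isinstance(exc, OSError):
        return "plink-unavailable"
    if isinstance(exc, RuntimeError):
        return "preprocessing-failed"
    if isinstance(exc, ValueError):
        return "liftover-or-alignment-failed"
    return "unknown"


# Usage (requires cellink):
#   from cellink.resources import get_onek1k
#   try:
#       dd = get_onek1k()
#   except (OSError, RuntimeError, ValueError) as exc:
#       print(classify_failure(exc))
```

Checking `plink_available()` before starting a long download can surface a missing dependency early, although the function itself only needs PLINK when preprocessing actually runs.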