cellink.resources.get_1000genomes
- cellink.resources.get_1000genomes(config_path='./cellink/resources/config/1000genomes.yaml', data_home=None, verify_checksum=True, only_download=False, rerun_preprocessing=False, worker_processes=0, max_memory=None, variants_chunk_size=32000, samples_chunk_size=2504)
Download and preprocess the 1000 Genomes Project genotype data.
This function downloads genotype files specified in a YAML configuration, optionally verifies checksums, converts all VCF files across chromosomes to a single Zarr store using vcf2zarr, and loads the result into an AnnData object using cellink.
- Parameters:
config_path (str, default="./cellink/resources/config/1000genomes.yaml") – Path to the YAML configuration file listing remote genotype files.
data_home (str or None, optional) – Directory where data should be stored. If None, uses the default cellink data directory.
verify_checksum (bool, default=True) – If True, verifies the checksum of downloaded files.
only_download (bool, default=False) – If True, only downloads the data without running the data conversion.
rerun_preprocessing (bool, default=False) – If True, re-runs VCF to Zarr conversion even if output already exists.
worker_processes (int, default=0) – Number of worker processes for vcf2zarr explode and encode steps. 0 means use a single process (no parallelism).
max_memory (str or None, optional) – Approximate upper bound on memory usage during the encode step (e.g. "8G", "32G"). If None and worker_processes > 0, vcf2zarr will use its default memory limit. If worker_processes is 0, this argument is ignored entirely as single-process encoding does not require a memory bound.
variants_chunk_size (int, default=32000) – Chunk size in the variants dimension for the output Zarr store.
samples_chunk_size (int, default=2504) – Chunk size in the samples dimension. The 1000 Genomes Project has 2,504 samples, so the default stores all samples in a single chunk.
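The effect of the two chunk-size parameters can be illustrated with a little arithmetic. The sketch below is not part of cellink; `chunk_grid` is a hypothetical helper, and the variant count of 1,000,000 is an arbitrary illustrative figure, not the actual size of the dataset. It only counts chunks along the variants and samples dimensions and ignores any trailing dimensions such as ploidy.

```python
import math


def chunk_grid(n_variants, n_samples,
               variants_chunk_size=32000, samples_chunk_size=2504):
    """Return (variant_chunks, sample_chunks) for a variants x samples array.

    Hypothetical helper for illustration only; not part of cellink.
    """
    return (
        math.ceil(n_variants / variants_chunk_size),
        math.ceil(n_samples / samples_chunk_size),
    )


# With the documented defaults, all 2,504 samples fall into one chunk.
# The variant count here is an assumed, illustrative number.
default_grid = chunk_grid(n_variants=1_000_000, n_samples=2504)

# A smaller samples chunk size splits the samples dimension into several chunks.
smaller_grid = chunk_grid(n_variants=1_000_000, n_samples=2504,
                          samples_chunk_size=1000)

print(default_grid, smaller_grid)
```

Keeping all samples in one chunk, as the default does, means a read of one block of variants touches a single chunk along the samples dimension.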
- Return type:
anndata.AnnData
- Returns:
Genotype data across all autosomes in a single Zarr-backed AnnData object.
- Raises:
FileNotFoundError – If any required VCF files are missing after download.
RuntimeError – If vcf2zarr conversion fails, e.g. due to insufficient memory or chunk sizes that exceed array dimensions.
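A minimal usage sketch, assuming cellink is installed and that the function can be imported from `cellink.resources` as the signature above suggests. The names `build_kwargs` and `load_1000genomes` are hypothetical wrappers introduced here for illustration. The sketch passes `max_memory` only when `worker_processes` is greater than 0, since the argument is documented as ignored otherwise, and it reports the two documented exceptions before re-raising them.

```python
def build_kwargs(worker_processes=0, max_memory=None):
    """Assemble keyword arguments for get_1000genomes (hypothetical helper).

    max_memory is dropped when worker_processes is 0, because the
    documentation states it is ignored in single-process mode.
    """
    kwargs = {"worker_processes": worker_processes}
    if worker_processes > 0 and max_memory is not None:
        kwargs["max_memory"] = max_memory
    return kwargs


def load_1000genomes(data_home=None, worker_processes=0, max_memory=None):
    """Download, convert, and load the data (hypothetical wrapper)."""
    # Imported lazily so that defining this wrapper does not require cellink.
    from cellink.resources import get_1000genomes

    kwargs = build_kwargs(worker_processes, max_memory)
    try:
        return get_1000genomes(data_home=data_home, **kwargs)
    except FileNotFoundError as err:
        # Documented: raised if required VCF files are missing after download.
        print(f"Missing VCF files after download: {err}")
        raise
    except RuntimeError as err:
        # Documented: raised if the vcf2zarr conversion fails.
        print(f"vcf2zarr conversion failed: {err}")
        raise
```

Calling `load_1000genomes(worker_processes=4, max_memory="8G")` would then start the download and conversion; the call is left out of the block because it fetches the full dataset.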