cellink.io.stream_pgen_to_zarr

cellink.io.stream_pgen_to_zarr(pgen_path, output_path, *, max_variants=None, max_samples=None, chunk_samples=4096, chunk_variants=2048, memory_limit_gb=10.0, compressor='zstd', compression_level=7, sparse=False, sparse_format='csc', return_adata=False)

Stream one or more PGEN files into a single Zarr v3, AnnData-compatible store.

When multiple pgen_path entries are provided, they are treated as disjoint variant sets for the SAME set of samples (e.g. rare + common split pgens). Variants are concatenated column-wise; obs (samples) must be identical across all inputs and are taken from the first file.
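The concatenation rule can be illustrated with a minimal pure-Python sketch. It is not the library's implementation: the `(sample_ids, variant_ids, rows)` tuple layout, the `concat_variant_blocks` helper, and the explicit disjointness check are assumptions made for illustration only.

```python
def concat_variant_blocks(blocks):
    """Concatenate genotype blocks column-wise after checking the samples match.

    Each block is a (sample_ids, variant_ids, rows) tuple, where rows[i][j]
    is the genotype of sample i at variant j. This layout is illustrative only.
    """
    first_samples, _, _ = blocks[0]
    variant_ids = []
    rows = [[] for _ in first_samples]
    for sample_ids, block_variants, block_rows in blocks:
        # obs (samples) must be identical across all inputs.
        if sample_ids != first_samples:
            raise ValueError("samples differ between inputs")
        # Assumed check: the inputs are meant to be disjoint variant sets.
        overlap = set(variant_ids) & set(block_variants)
        if overlap:
            raise ValueError(f"variant sets are not disjoint: {sorted(overlap)}")
        variant_ids.extend(block_variants)
        # Append this block's columns to each sample's row.
        for row, block_row in zip(rows, block_rows):
            row.extend(block_row)
    return first_samples, variant_ids, rows


# Toy data: two samples, one rare variant and two common variants.
rare = (["s1", "s2"], ["rs1"], [[0], [1]])
common = (["s1", "s2"], ["rs2", "rs3"], [[2, 1], [0, 1]])
samples, variants, genotypes = concat_variant_blocks([rare, common])
```

The sample list is taken from the first block, mirroring the rule that obs comes from the first file, while the variant axis grows with each additional input.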

Parameters:

pgen_path (str or list[str]) – Path(s) to .pgen file(s). Extension optional.
output_path (str) – Output Zarr v3 directory.
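max_variants (int, optional) – Cap on the number of variants. The cap is applied to each file separately and the per-file counts are then summed, so with N input files the total can reach N × max_variants.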
max_samples (int, optional) – Cap samples (applied once from the first file).
chunk_samples (int) – Zarr chunk size along sample axis.
chunk_variants (int) – Zarr chunk size along variant axis.
memory_limit_gb (float) – Max RAM per read block in GB.
compressor (str) – Blosc compressor name (‘zstd’, ‘lz4’, ‘zlib’).
compression_level (int) – Compression level 1-9.
sparse (bool) – If True, accumulate a SciPy sparse matrix and write it as AnnData sparse X in the format given by sparse_format. If False (default), stream directly into a dense Zarr dataset. Use sparse=True for rare variants (low density) and the dense path for common variants.
sparse_format ({"csc", "csr"}) – Sparse matrix format when sparse=True. Default is ‘csc’, which is more efficient for variant-wise access (e.g. association tests, per-variant filtering). Use ‘csr’ if your workload is primarily sample-wise (e.g. per-sample operations).
return_adata (bool) – If True, return the written AnnData (Dask-backed X for the dense path). Default is False: this is normally a one-off conversion step, so the data isn’t re-opened/held onto unless asked for. Load it back later with read_pgen_zarr().
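To get a feel for the chunking, memory, and sparse-versus-dense parameters above, the following back-of-the-envelope sketch may help. It makes several assumptions that the documentation does not state: genotypes are stored as one byte each (`itemsize=1`), a GB is 10^9 bytes, sparse indices are 4-byte integers, and a read block spans all samples. The helper names are hypothetical and the library may size its blocks differently.

```python
def dense_chunk_bytes(chunk_samples=4096, chunk_variants=2048, itemsize=1):
    """Uncompressed size of one dense Zarr chunk (itemsize is an assumption)."""
    return chunk_samples * chunk_variants * itemsize


def variants_per_read_block(n_samples, memory_limit_gb=10.0, itemsize=1):
    """How many full variant columns fit in the budget, assuming 1 GB = 10**9 bytes."""
    budget_bytes = int(memory_limit_gb * 10**9)
    return max(1, budget_bytes // (n_samples * itemsize))


def csc_bytes(nnz, n_variants, itemsize=1, index_itemsize=4):
    """Uncompressed CSC size: one value and one row index per nonzero,
    plus n_variants + 1 column pointers."""
    return nnz * (itemsize + index_itemsize) + (n_variants + 1) * index_itemsize


default_chunk = dense_chunk_bytes()            # 4096 * 2048 * 1 byte = 8 MiB
block_width = variants_per_read_block(500_000)  # columns per 10 GB block

# 1,000 samples x 100 variants at 1% density: dense vs CSC footprint.
dense_small = 1_000 * 100 * 1
sparse_small = csc_bytes(nnz=1_000, n_variants=100)
```

Under these assumptions a default chunk is 8 MiB before compression, and at 1% density the CSC layout is far smaller than the dense one, which is consistent with the advice to use sparse=True for rare variants.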

Return type:

AnnData | None

Returns:

AnnData or None – The written AnnData if return_adata=True, otherwise None.
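A call might look like the sketch below, which uses only the parameters listed in the signature. The wrapper functions `conversion_kwargs` and `convert` and the file paths are hypothetical, and the import is deferred so the snippet can be read and loaded without cellink installed.

```python
def conversion_kwargs(rare_variants, return_adata=False):
    """Pick storage options following the guidance in the parameter list."""
    if rare_variants:
        # Low-density genotypes: sparse X, CSC for variant-wise access.
        kwargs = {"sparse": True, "sparse_format": "csc"}
    else:
        # Common variants: stream into a dense Zarr dataset (the default).
        kwargs = {"sparse": False}
    kwargs["return_adata"] = return_adata
    return kwargs


def convert(pgen_paths, output_path, rare_variants=False, return_adata=False):
    """Hypothetical wrapper around cellink.io.stream_pgen_to_zarr."""
    # Imported lazily: requires the cellink package to be installed.
    from cellink.io import stream_pgen_to_zarr

    return stream_pgen_to_zarr(
        pgen_paths,
        output_path,
        **conversion_kwargs(rare_variants, return_adata),
    )


# Example call (placeholder paths, not run here):
# adata = convert(["cohort_rare", "cohort_common"], "cohort.zarr",
#                 rare_variants=True, return_adata=True)
```

With return_adata left at False the call returns None, and the store can be loaded later with read_pgen_zarr().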
