cellink.io.stream_pgen_to_zarr
- cellink.io.stream_pgen_to_zarr(pgen_path, output_path, *, max_variants=None, max_samples=None, chunk_samples=4096, chunk_variants=2048, memory_limit_gb=10.0, compressor='zstd', compression_level=7, sparse=False, sparse_format='csc')
Stream one or more PGEN files into a single AnnData-compatible Zarr v3 store.
When multiple pgen_path entries are provided, they are treated as disjoint variant sets for the same set of samples (e.g. a cohort split into rare-variant and common-variant PGEN files). Variants are concatenated column-wise; obs (samples) must be identical across all inputs and is taken from the first file.
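The column-wise concatenation can be pictured as giving each input file its own contiguous range of output columns. The following is a minimal sketch, not cellink's implementation: the helper `variant_column_ranges` and the per-file variant counts are hypothetical, and it assumes `max_variants` caps each file independently before the counts are summed, as the parameter description suggests.

```python
from typing import Optional


def variant_column_ranges(
    variants_per_file: list[int],
    max_variants: Optional[int] = None,
) -> list[tuple[int, int]]:
    """Return the half-open [start, stop) output column range for each input file.

    Hypothetical helper for illustration only. Assumes the cap is applied
    to each file separately and the kept counts are then summed.
    """
    ranges = []
    start = 0
    for n in variants_per_file:
        # Keep at most max_variants columns from this file.
        kept = n if max_variants is None else min(n, max_variants)
        ranges.append((start, start + kept))
        start += kept
    return ranges


# Two made-up inputs: 1,500 rare variants and 400 common variants.
ranges = variant_column_ranges([1500, 400], max_variants=1000)
total_variants = ranges[-1][1]
print(ranges, total_variants)
```

Under these assumptions the first file is truncated to 1,000 columns, the second contributes all 400, and the output has 1,400 variants in total.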
- Parameters:
pgen_path (str or list[str]) – Path(s) to .pgen file(s). Extension optional.
output_path (str) – Output Zarr v3 directory.
max_variants (int, optional) – Cap on the number of variants. The cap is applied to each file separately, so with multiple inputs the total is the sum of the per-file counts.
max_samples (int, optional) – Cap samples (applied once from the first file).
chunk_samples (int) – Zarr chunk size along sample axis.
chunk_variants (int) – Zarr chunk size along variant axis.
memory_limit_gb (float) – Max RAM per read block in GB.
compressor (str) – Blosc compressor name ('zstd', 'lz4', or 'zlib').
compression_level (int) – Compression level, from 1 to 9.
sparse (bool) – If True, accumulate a SciPy sparse matrix (stored in the format given by sparse_format) and write it as AnnData sparse X. If False (default), stream directly into a dense Zarr dataset. Use sparse=True for rare variants (low density) and the dense path for common variants.
sparse_format ({"csc", "csr"}) – Sparse matrix format when sparse=True. Default is 'csc', which is more efficient for variant-wise access (e.g. association tests, per-variant filtering). Use 'csr' if your workload is primarily sample-wise (e.g. per-sample operations).
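To make the `sparse_format` trade-off concrete, the sketch below builds the three CSC arrays (`data`, `indices`, `indptr`) by hand for a tiny made-up genotype matrix. It is a plain-Python illustration of the standard CSC layout, not cellink's code; `dense_to_csc` and `column_nonzeros` are hypothetical helpers.

```python
def dense_to_csc(matrix):
    """Convert a dense samples x variants matrix (list of rows) to CSC arrays."""
    n_rows = len(matrix)
    n_cols = len(matrix[0]) if n_rows else 0
    data, indices, indptr = [], [], [0]
    for col in range(n_cols):
        for row in range(n_rows):
            value = matrix[row][col]
            if value != 0:
                data.append(value)    # non-zero genotype value
                indices.append(row)   # sample (row) index of that value
        indptr.append(len(data))      # end of this variant's slice
    return data, indices, indptr


def column_nonzeros(csc, col):
    """Return (sample_index, value) pairs for one variant via a single slice."""
    data, indices, indptr = csc
    start, stop = indptr[col], indptr[col + 1]
    return list(zip(indices[start:stop], data[start:stop]))


# 3 samples x 4 variants, mostly zeros as with rare variants.
genotypes = [
    [0, 1, 0, 0],
    [0, 0, 0, 2],
    [1, 0, 0, 0],
]
csc = dense_to_csc(genotypes)
print(column_nonzeros(csc, 3))
```

Reading one variant touches a single contiguous slice of `data`. In CSR the same holds for one sample instead, which is why the choice should follow the dominant access pattern.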
- Return type:
- Returns:
AnnData with Dask-backed X.
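A minimal usage sketch follows, built only from the signature documented above. The file paths are placeholders and the helpers `build_kwargs` and `convert` are hypothetical. The import is deferred into `convert` so that the keyword arguments can be inspected without cellink installed.

```python
def build_kwargs(rare_variants: bool) -> dict:
    """Pick storage options following the guidance in the parameter list."""
    kwargs = {
        "chunk_samples": 4096,      # documented default
        "chunk_variants": 2048,     # documented default
        "memory_limit_gb": 10.0,    # documented default
        "compressor": "zstd",
        "compression_level": 7,
    }
    if rare_variants:
        # Low-density data: sparse storage, CSC for variant-wise access.
        kwargs.update(sparse=True, sparse_format="csc")
    else:
        # Common variants: stream into a dense Zarr dataset.
        kwargs.update(sparse=False)
    return kwargs


def convert(pgen_path, output_path, rare_variants=False):
    """Call stream_pgen_to_zarr with the options chosen above."""
    from cellink.io import stream_pgen_to_zarr  # requires cellink

    return stream_pgen_to_zarr(pgen_path, output_path, **build_kwargs(rare_variants))


# Example call with placeholder paths (not executed here):
# adata = convert(["cohort_rare.pgen", "cohort_common.pgen"], "cohort.zarr", rare_variants=True)
```

Passing a list of paths relies on the multi-file behaviour described at the top of this page, so both files are assumed to contain the same samples.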