cellink.tl.aggregate_annotations_for_varm
- cellink.tl.aggregate_annotations_for_varm(gdata, annotation_key, agg_type='unique_list_max', return_data=False)
Aggregates a DataFrame of variant annotations according to the specified aggregation type so that there is only one row per variant ID. This means that annotations are aggregated across different gene/transcript contexts.
- Parameters:
gdata (object) – The genomic data object containing annotations stored in uns under specific keys.
annotation_key (str) – Key to access the annotations within gdata.uns. The annotations are expected to be stored as a pandas DataFrame.
agg_type (str) –
Aggregation type that determines how annotation values are combined. Options are:
- "unique_list_max": Unique string values are aggregated into a comma-separated string, and numeric columns are aggregated by their maximum value.
- "list": Aggregates all values into a list, preserving duplicates.
- "str": Aggregates all values into a single comma-separated string.
- "first": Drops duplicates and keeps only the first occurrence for each variant-context pair.
Default is "unique_list_max".
return_data (bool) – If True, the aggregated DataFrame is returned in addition to modifying the gdata object. Default is False.
- Returns:
pd.DataFrame – The aggregated DataFrame is returned if return_data is True. Otherwise, the function writes the aggregated annotations to gdata.varm["variant_annotation"].
Examples
>>> aggregate_annotations_for_varm(gdata, "variant_annotation_vep", agg_type="unique_list_max")
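
To make the agg_type options concrete, the following is a minimal sketch in plain pandas of how each mode could collapse several annotation rows into one row per variant. It is not cellink's implementation: the helper name aggregate_sketch, the toy table and its column names (variant_id, gene, score) are hypothetical, and details such as how "str" and "first" treat numeric columns are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical annotation table: one row per variant/gene context.
annotations = pd.DataFrame(
    {
        "variant_id": ["v1", "v1", "v1", "v2"],
        "gene": ["A", "B", "A", "C"],
        "score": [0.1, 0.7, 0.3, 0.5],
    }
)


def combine(values, agg_type):
    """Collapse the values of one column for one variant (illustrative only)."""
    if agg_type == "list":
        # Keep every value, duplicates included.
        return list(values)
    if agg_type == "str":
        # Join every value into one comma-separated string.
        return ",".join(map(str, values))
    if agg_type == "first":
        # Keep only the first occurrence.
        return values.iloc[0]
    if agg_type == "unique_list_max":
        if pd.api.types.is_numeric_dtype(values):
            # Numeric columns: maximum value.
            return values.max()
        # String columns: unique values, first-seen order, comma-separated.
        return ",".join(dict.fromkeys(map(str, values)))
    raise ValueError(f"unknown agg_type: {agg_type!r}")


def aggregate_sketch(df, agg_type="unique_list_max", id_col="variant_id"):
    """Return a DataFrame with one row per value of id_col."""
    value_cols = [col for col in df.columns if col != id_col]
    columns = {}
    for col in value_cols:
        columns[col] = pd.Series(
            {key: combine(group[col], agg_type) for key, group in df.groupby(id_col)}
        )
    return pd.DataFrame(columns)


per_variant = aggregate_sketch(annotations, agg_type="unique_list_max")
print(per_variant)
```

In this sketch the result is indexed by variant ID, which is the shape needed for a per-variant table such as the one the function stores in gdata.varm["variant_annotation"].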