cellink.tl.get_snp_df
- cellink.tl.get_snp_df(variant_codes, server='https://grch37.rest.ensembl.org/')
Retrieve SNP (Single Nucleotide Polymorphism) information and overlap with genes from Ensembl.
This function takes a list of SNP identifiers, queries the Ensembl REST API to retrieve information about the SNPs and their overlapping genes, and returns this data in the form of two dataframes. The first dataframe contains SNP-related data, including whether each SNP is located within a protein-coding gene and its clinical significance. The second dataframe provides gene-related information for the overlapping genes.
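- Parameters:
variant_codes: A list of SNP identifiers to look up, such as "1_55516888_T_C".
server: Base URL of the Ensembl REST server to query. Defaults to 'https://grch37.rest.ensembl.org/'.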
- Returns:
tuple of pd.DataFrame
A tuple containing:
- var_df: A dataframe with SNP-related information, including whether the SNP is within a gene and its clinical significance. Each row corresponds to an SNP.
- gene_df: A dataframe containing information about the genes that overlap with the SNPs, with genes as the index.
Notes
The function uses the Ensembl REST API to query data, specifically querying for overlapping regions between SNPs and genes. The results include SNPs sourced from dbSNP, and the genes returned are limited to protein-coding genes.
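The overlap query described above can be sketched roughly as follows. This is a minimal illustration, not cellink's actual implementation: it assumes variant codes follow the `chrom_pos_ref_alt` pattern seen in the example, and that the lookup goes through Ensembl's `overlap/region` endpoint with `feature` and `biotype` filters. The helper names `parse_variant_code`, `overlap_url`, and `fetch_overlaps` are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def parse_variant_code(code):
    """Split a 'chrom_pos_ref_alt' code into parts (format assumed from the example)."""
    chrom, pos, ref, alt = code.split("_")
    return chrom, int(pos), ref, alt


def overlap_url(code, server="https://grch37.rest.ensembl.org/", species="human"):
    """Build an overlap/region URL covering the single base at the variant position."""
    chrom, pos, _ref, _alt = parse_variant_code(code)
    region = f"{chrom}:{pos}-{pos}"
    # Ask for gene and variation features; restrict genes to protein-coding ones.
    query = urlencode(
        [("feature", "gene"), ("feature", "variation"), ("biotype", "protein_coding")]
    )
    return f"{server.rstrip('/')}/overlap/region/{species}/{region}?{query}"


def fetch_overlaps(code, server="https://grch37.rest.ensembl.org/"):
    """Query the server and return the decoded JSON list (requires network access)."""
    request = Request(overlap_url(code, server), headers={"Content-Type": "application/json"})
    with urlopen(request) as response:
        return json.load(response)
```

Calling `overlap_url("1_55516888_T_C")` only builds the request URL, so it can be inspected offline; `fetch_overlaps` performs the actual HTTP request and depends on the server being reachable.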
Example
>>> variant_codes = ["1_55516888_T_C", "2_117900001_A_G"]
>>> var_df, gene_df = get_snp_df(variant_codes)
>>> var_df.head()
            snp_id  is_in_gene  genes  ...  clinical_significance
0   1_55516888_T_C        True  GENE1  ...             pathogenic
1  2_117900001_A_G       False  GENE2  ...                 benign
>>> gene_df.head()
              biotype     start       end  ...  strand
id
GENE1  protein_coding   5500000   5600000  ...       1
GENE2  protein_coding  11780000  11800000  ...      -1