Retrieving KEGG Genes Using Bioservices in Python

Biopython's built-in KEGG support is limited to a thin REST wrapper and a few parsers in the Bio.KEGG module. The bioservices library offers a fuller interface for retrieving and interacting with KEGG data. To fetch all available genes in the KEGG database, you would need to iterate through each organism and collect its genes. This process is slow, as there are thousands of organisms and millions of genes in the KEGG database.

The following script uses bioservices to fetch the list of available organisms and then retrieves the genes for the first few of them:

from bioservices import KEGG

# Initialize KEGG API
kegg_api = KEGG()

# Get the list of organisms
organisms_raw = kegg_api.list("organism")
organisms = [entry.split("\t")[1] for entry in organisms_raw.split("\n") if entry]
#['hsa', 'ptr', 'pps', 'ggo', 'pon', 'nle', 'hmh', 'mcc', 'mcf', 'mthb', 'mni', 'csab', 'caty', 'panu', 'tge', 'mleu', 'rro', 'rbb', 'tfn', 'pteh', 'cang', 'cjc', 'sbq', 'cimi', 'csyr', 'mmur', 'lcat', 'pcoq', 'oga', 'mmu', 'mcal', ... , 'loki', 'psyt', 'agw', 'arg']

# Limit the number of organisms and genes for demonstration purposes
organism_limit = 3
gene_limit = 10

# Iterate through the organisms
for organism in organisms[:organism_limit]:
    print(f"Organism: {organism}")

    # Get the gene list for the current organism, skipping blank lines
    genes_raw = kegg_api.list(organism)
    genes = [line for line in genes_raw.split("\n") if line][:gene_limit]

    # Iterate through the genes and print gene identifiers
    for gene_entry in genes:
        gene_id = gene_entry.split("\t")[0]
        print(f"Gene ID: {gene_id}")

    print("\n")

#Organism: hsa
#Gene ID: hsa:102466751
#Gene ID: hsa:100302278
#Gene ID: hsa:79501
#Gene ID: hsa:112268260
#Gene ID: hsa:729759
#Gene ID: hsa:124904706
#Gene ID: hsa:105378947
#Gene ID: hsa:113219467
#Gene ID: hsa:81399
#Gene ID: hsa:148398
#...

This code prints the identifiers of the first 10 genes for each of the first 3 organisms in the KEGG organism list. Change the organism_limit and gene_limit variables to process more or fewer organisms and genes.
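
The splitting logic in the script can be pulled into a small helper, which makes it easier to check without network access. This is a minimal sketch: it assumes each line of the listing is tab-separated with the gene identifier in the first column, as the output above suggests. The function name parse_gene_ids and the sample text are illustrative and not part of bioservices.

```python
def parse_gene_ids(listing, limit=None):
    """Return gene identifiers from a tab-separated KEGG listing.

    Assumes the identifier is the first tab-separated field of each line.
    """
    # Skip blank lines, such as the one left by a trailing newline
    gene_ids = [line.split("\t")[0] for line in listing.split("\n") if line.strip()]
    return gene_ids if limit is None else gene_ids[:limit]


# Illustrative input: the identifiers come from the output above,
# the second column is a placeholder rather than real KEGG text
sample = "hsa:79501\tplaceholder\nhsa:729759\tplaceholder\nhsa:81399\tplaceholder\n"
print(parse_gene_ids(sample, limit=2))
```

With a live connection, parse_gene_ids(kegg_api.list("hsa"), limit=10) could stand in for the splitting done inside the loop.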

Because the approach above needs one request per organism, fetching every gene in KEGG takes a significant amount of time. It’s usually more practical to focus on specific organisms or pathways of interest.
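
As one way to focus on pathways, the sketch below groups an organism's genes by pathway. It assumes that kegg_api.link("pathway", "hsa") returns tab-separated gene/pathway pairs, one per line, which is how the KEGG REST link operation is described. The helper names genes_by_pathway and fetch_pathway_genes are hypothetical, and the sample pairs are made up for illustration.

```python
from collections import defaultdict


def genes_by_pathway(link_text):
    """Group gene identifiers by pathway from tab-separated gene/pathway pairs."""
    pathways = defaultdict(list)
    for line in link_text.split("\n"):
        fields = line.split("\t")
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        gene_id, pathway_id = fields
        pathways[pathway_id].append(gene_id)
    return dict(pathways)


def fetch_pathway_genes(organism="hsa"):
    """Fetch gene/pathway pairs for one organism (needs network access)."""
    # Imported here so the parsing helper works without bioservices installed
    from bioservices import KEGG

    # Assumption: link() returns the pairs as plain text on success
    return genes_by_pathway(KEGG().link("pathway", organism))


# Made-up pairs in the assumed format, for illustration only
sample_pairs = "org:g1\tpath:p1\norg:g2\tpath:p1\norg:g1\tpath:p2\n"
print(genes_by_pathway(sample_pairs))
```

Only the parsing helper runs here. Calling fetch_pathway_genes("hsa") would issue a single request for the whole organism rather than one request per gene.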
