Downloading DNA and Protein Sequences from NCBI Using Biopython and Taxonomy ID

Author: gene_x

Tags: packages, python, Biopython

Biopython is a Python library for working with biological data. Its Entrez module is an interface to the NCBI Entrez databases, including GenBank, and can be used to download sequences for a given taxonomy ID. The following script searches the nucleotide database for a taxonomy ID and fetches each matching record:

from Bio import Entrez
from Bio import SeqIO

# Set your email address (required by NCBI)
Entrez.email = "your_email@example.com"

# Specify the taxonomy ID
taxonomy_id = "your_taxonomy_id_here"

# Search for records in the nucleotide database using the taxonomy ID
search_query = f"txid{taxonomy_id}[Organism:exp]"
handle = Entrez.esearch(db="nucleotide", term=search_query)

# Parse the search results
record = Entrez.read(handle)
handle.close()

# Get the GenBank IDs of the records
genbank_ids = record["IdList"]

# Fetch the sequences using the GenBank IDs
sequences = []
for genbank_id in genbank_ids:
    handle = Entrez.efetch(db="nucleotide", id=genbank_id, rettype="gb", retmode="text")
    seq_record = SeqIO.read(handle, "genbank")
    handle.close()
    sequences.append(seq_record)

# Print the fetched sequences
for seq in sequences:
    print(seq)

Replace "your_email@example.com" with your actual email address, and "your_taxonomy_id_here" with the numeric NCBI taxonomy ID you're interested in (for example, 9606 for Homo sapiens). The script searches the NCBI nucleotide database for that taxonomy ID and downloads the records whose IDs the search returns.
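As a minimal sketch of how the search term is put together, the helper below builds the same query string as the script and rejects non-numeric input early. The function name build_taxonomy_query and the include_subtaxa parameter are hypothetical and not part of Biopython. The sketch relies on the Entrez field qualifiers [Organism:exp], which also matches taxa below the given ID, and [Organism:noexp], which matches only the given taxon.

```python
def build_taxonomy_query(taxonomy_id, include_subtaxa=True):
    """Return an Entrez search term for a numeric NCBI taxonomy ID.

    Hypothetical helper for illustration; not part of Biopython.
    """
    tax_id = str(taxonomy_id).strip()
    if not tax_id.isdigit():
        raise ValueError(f"taxonomy ID must be numeric, got {taxonomy_id!r}")
    # "exp" expands the search to descendant taxa, "noexp" does not.
    field = "Organism:exp" if include_subtaxa else "Organism:noexp"
    return f"txid{tax_id}[{field}]"


print(build_taxonomy_query(9606))  # txid9606[Organism:exp]
```

The returned string can be passed as the term argument of Entrez.esearch() in place of search_query.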

Here's a brief explanation of the code:

  • Import the necessary modules from Biopython: Entrez for accessing the NCBI databases and SeqIO for handling sequence records.
  • Set your email address. NCBI asks for it so that they can contact you about problems with your requests, such as excessive usage, before blocking access.
  • Specify the taxonomy ID you want to search for.
  • Create a search query using the taxonomy ID, and search the nucleotide database.
  • Parse the search results to get a list of GenBank IDs.
  • Use the GenBank IDs to fetch the sequences, and store them in a list.
  • Print the fetched sequences in the console.
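The fetch step above sends one request per ID. One possible variation, sketched here under assumptions, is to request several records per call by passing a comma-separated ID list to Entrez.efetch() and reading the multi-record response with SeqIO.parse(). The names chunked and fetch_nucleotide_records and the batch size of 100 are illustrative choices, not values from the original script.

```python
def chunked(ids, size):
    """Yield consecutive slices of ids, each with at most size items."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def fetch_nucleotide_records(ids, batch_size=100):
    """Fetch GenBank records in batches (needs network access).

    Assumes Entrez.email has already been set by the caller.
    """
    # Imported here so that chunked() can be used without Biopython installed.
    from Bio import Entrez, SeqIO

    records = []
    for batch in chunked(ids, batch_size):
        handle = Entrez.efetch(db="nucleotide", id=",".join(batch),
                               rettype="gb", retmode="text")
        # SeqIO.parse() reads every record in the response; SeqIO.read() expects exactly one.
        records.extend(SeqIO.parse(handle, "genbank"))
        handle.close()
    return records
```

With the variables from the script, this would be called as sequences = fetch_nucleotide_records(genbank_ids).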

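Note that Entrez.esearch() returns at most 20 IDs by default, so the script as written fetches only the first 20 matching records, not every sequence associated with the taxonomy ID. To retrieve more, pass a larger retmax value to Entrez.esearch(). Because the script sends one efetch request per ID, a large result set can take a while to download. Also make sure Biopython is installed in your Python environment before running the script. If it isn't, you can install it using pip: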

pip install biopython
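Since Entrez.esearch() returns only 20 IDs unless retmax is set, collecting more IDs means asking for larger or additional pages. The following is a rough sketch of paging with the retstart and retmax parameters. The names collect_ids and make_entrez_search, the page size of 500, and the cap of 2000 IDs are assumptions made for illustration. The paging logic takes the search function as an argument so it can be tried without network access.

```python
def collect_ids(search_page, page_size=500, max_ids=2000):
    """Page through search results until max_ids are collected or results run out.

    search_page(retstart, retmax) must return a mapping with "IdList" and
    "Count", which is the shape Entrez.read() gives for an esearch result.
    """
    ids = []
    retstart = 0
    while len(ids) < max_ids:
        result = search_page(retstart, page_size)
        page = list(result["IdList"])
        if not page:
            break
        ids.extend(page)
        retstart += page_size
        if retstart >= int(result["Count"]):
            break
    return ids[:max_ids]


def make_entrez_search(term, db="nucleotide"):
    """Return a search_page callable backed by Entrez.esearch (needs network access)."""
    from Bio import Entrez  # set Entrez.email before calling the returned function

    def search_page(retstart, retmax):
        handle = Entrez.esearch(db=db, term=term, retstart=retstart, retmax=retmax)
        result = Entrez.read(handle)
        handle.close()
        return result

    return search_page
```

With the script's variables, genbank_ids = collect_ids(make_entrez_search(search_query)) would replace the single esearch call. The cap is deliberate: some taxonomy IDs match millions of records.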

To download protein sequences instead of nucleotide sequences, only a few modifications to the code above are needed. The main change is to search and fetch data from the "protein" database instead of the "nucleotide" database. Here's the updated code:

from Bio import Entrez
from Bio import SeqIO

# Set your email address (required by NCBI)
Entrez.email = "your_email@example.com"

# Specify the taxonomy ID
taxonomy_id = "your_taxonomy_id_here"

# Search for records in the protein database using the taxonomy ID
search_query = f"txid{taxonomy_id}[Organism:exp]"
handle = Entrez.esearch(db="protein", term=search_query)

# Parse the search results
record = Entrez.read(handle)
handle.close()

# Get the protein IDs of the records
protein_ids = record["IdList"]

# Fetch the sequences using the protein IDs
sequences = []
for protein_id in protein_ids:
    handle = Entrez.efetch(db="protein", id=protein_id, rettype="gb", retmode="text")
    seq_record = SeqIO.read(handle, "genbank")
    handle.close()
    sequences.append(seq_record)

# Print the fetched sequences
for seq in sequences:
    print(seq)

This code is almost identical to the previous one, but with two important changes:

  1. The Entrez.esearch() function has its db parameter set to "protein" instead of "nucleotide", so the search is performed in the protein database.
  2. The Entrez.efetch() function also has its db parameter set to "protein", so the records are fetched from the protein database.
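Because the two scripts differ only in the db value, the fetch step can be written once and parameterized. This is a small sketch under that assumption. The names SUPPORTED_DATABASES, efetch_arguments, and fetch_records are hypothetical. The rettype is left as a parameter because, as far as I know, NCBI's E-utilities documentation lists "gp" (GenPept) for protein flat files, while this sketch keeps the "gb" value used in both scripts above.

```python
SUPPORTED_DATABASES = ("nucleotide", "protein")


def efetch_arguments(db, ids, rettype="gb"):
    """Build the keyword arguments for Entrez.efetch() for either database."""
    if db not in SUPPORTED_DATABASES:
        raise ValueError(f"unsupported database: {db!r}")
    return {"db": db, "id": ",".join(ids), "rettype": rettype, "retmode": "text"}


def fetch_records(db, ids):
    """Fetch records from the chosen database (needs network access).

    Assumes Entrez.email has already been set by the caller.
    """
    # Imported here so that efetch_arguments() can be used without Biopython installed.
    from Bio import Entrez, SeqIO

    handle = Entrez.efetch(**efetch_arguments(db, ids))
    records = list(SeqIO.parse(handle, "genbank"))
    handle.close()
    return records
```

For example, fetch_records("protein", protein_ids) would stand in for the loop in the protein script.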


© 2023 XGenes.com Impressum