Author Archives: gene_x

Preparing custom gtf and bed files for nextflow RNA-seq pipeline

-- 0, generate the ref.gtf and ref.bed from GenBank file: HD21-2_new.gtf and HD21-2_new.bed --

#[PREPARE gtf from gff] https://github.com/gpertea/gffread/issues/3
#DEL gb2gff.py HD21-2_chr.gbk > HD21-2.gtf
#DEL gb2gff.py HD21-2_plasmid_pHD21-2_1.gbk >> HD21-2.gtf
#DEL gb2gff.py HD21-2_plasmid_pHD21-2_2.gbk >> HD21-2.gtf

#-- STEP 0.0: download the gff3 files --
cat HD21-2_chr.gff3 HD21-2_plasmid1.gff3 HD21-2_plasmid2.gff3 > HD21-2.gff3

#-- STEP 0.1: rename CDS to exon! --
gffread -E -F -O -T HD21-2.gff3 -o HD21-2.gtf
sed -i -e "s/\tCDS\t/\texon\t/g" HD21-2.gtf 
#Genbank_CDS --> Genbank_exon
#cmsearch_CDS --> cmsearch_exon
#tRNAscan-SE_CDS --> tRNAscan-SE_exon
#or xxx_CDS --> xxx_exon --> xxx[cmsearch, Genbank, tRNAscan] exon
#
#-- STEP 0.2: add protein_coding "xxxx" to type exon
#exon (former Genbank CDS): 2657 = (Genbank transcript + cmsearch transcript + tRNAscan-SE transcript) = (2586+61+10)
#Genbank region:                                   7
#        transcript:                            2649
#Genbank_exon == Genbank transcript             2586
#tRNAscan-SE_exon == tRNAscan-SE transcript       61
#cmsearch_exon == cmsearch transcript             10
------------------------------------------------------
#total                                          5313 (transcript 2649-->bed, exon 2657, region 7)

#in kate "\texon" --> "_exon"
grep "Genbank_exon" HD21-2.gtf > HD21-2_Genbank_exon.gtf            #add gene_biotype "protein_coding"; at the end of line (2586)
grep "tRNAscan-SE_exon" HD21-2.gtf > HD21-2_tRNAscan-SE_exon.gtf    #add gene_biotype "tRNA"; at the end of line (61)
grep "cmsearch_exon" HD21-2.gtf > HD21-2_cmsearch_exon.gtf          #add gene_biotype "RNase_P_RNA";| gene_biotype "ncRNA";| gene_biotype "rRNA";| gene_biotype "tmRNA";| gene_biotype "SRP_RNA"; at the end (10)

grep -i -e "cmsearch" HD21-2.gtf > temp
grep -i -e "gene_biotype" temp > temp2
CP052994.1  cmsearch    transcript  1286239 1286622 .   -   .   transcript_id "rna-HKH70_06005"; gene_id "gene-HKH70_06005"; gene_name "rnpB"; Dbxref "RFAM:RF00011"; gbkey "ncRNA"; gene "rnpB"; inference "COORDINATES: profile:INFERNAL:1.1.1"; locus_tag "HKH70_06005"; product "RNase P RNA component class B"; Name "rnpB"; gene_biotype "RNase_P_RNA";
CP052994.1  cmsearch    transcript  1452715 1452905 .   -   .   transcript_id "rna-HKH70_06865"; gene_id "gene-HKH70_06865"; gene_name "ssrS"; Dbxref "RFAM:RF00013"; gbkey "ncRNA"; gene "ssrS"; inference "COORDINATES: profile:INFERNAL:1.1.1"; locus_tag "HKH70_06865"; product "6S RNA"; Name "ssrS"; gene_biotype "ncRNA";
CP052994.1  cmsearch    transcript  1706296 1706410 .   -   .   transcript_id "rna-HKH70_08190"; gene_id "gene-HKH70_08190"; gene_name "rrf"; Dbxref "RFAM:RF00001"; gbkey "rRNA"; gene "rrf"; inference "COORDINATES: profile:INFERNAL:1.1.1"; locus_tag "HKH70_08190"; product "5S ribosomal RNA"; Name "rrf"; gene_biotype "rRNA";
...

grep -v "_exon" HD21-2.gtf > HD21-2_non_exon.gtf #2656
cat HD21-2_non_exon.gtf HD21-2_Genbank_exon.gtf HD21-2_tRNAscan-SE_exon.gtf HD21-2_cmsearch_exon.gtf  > HD21-2_new.gtf #5313

#in kate "exon" --> "\texon"
#[Eventually] sort the HD21-2_new.gtf using libreoffice
#[Eventually] replace "" with "; "transcript_id with transcript_id; "\n with \n in kate

#-- STEP 0.3: PREPARE bed from gtf
#[Eventually] replace '\t0\t' with '\t.\t' in the gff-file in the kate-editor
gffread -E -F --bed HD21-2_new.gtf -o HD21-2_new.bed  #   .. loaded 2649 genomic features from HD21-2_new.gtf

#[IMPORTANT] delelte [*.1] in gtf, bed. In fa could be e.g. ">CP052994 Staphylococcus epidermidis strain HD21-2 chromosome, complete genome"
#GTF: CP052994.1      Genbank exon    1       1356    .       +       .       transcript_id "gene-HKH70_00005"; gene_id "gene-HKH70_00005"; gene_name "dnaA"; Dbxref "NCBI_GP:QRJ41002.1"; Name "QRJ41002.1"; gbkey "CDS"; gene "dnaA"; inference "COORDINATES: similar to AA sequence:RefSeq:WP_017723675.1"; locus_tag "HKH70_00005"; product "chromosomal replication initiator protein DnaA"; protein_id "QRJ41002.1"; transl_table "11"; gene_biotype "protein_coding";
#BED: CP052994.1      0       1356    gene-HKH70_00005        100     +       0       1356    0,0,0   1       1356,   0,      CDS=0:1356;CDSphase=0;geneID=gene-HKH70_00005;gene_name=dnaA;Name=dnaA;gbkey=Gene;gene=dnaA;gene_biotype=protein_coding;locus_tag=HKH70_00005

-------
-- 2 --
conda activate rnaseq
# --> !!!! SUCCESSFUL !!!!
nextflow run rnaseq --reads '/home/jhuang/DATA/Data_Anna_HAPDICS_RNASeq/raw_data/*_R{1,2}.fq.gz' --fasta /home/jhuang/DATA/Data_Anna_HAPDICS_RNASeq/HD21-2.fa --gtf /home/jhuang/DATA/Data_Anna_HAPDICS_RNASeq/HD21-2_new.gtf --bed12 /home/jhuang/DATA/Data_Anna_HAPDICS_RNASeq/HD21-2_new.bed -profile standard --aligner hisat2 --fcGroupFeatures gene_id --fcGroupFeaturesType gene_biotype --saveReference --skip_genebody_coverage --skip_dupradar --skip_preseq --skip_edger  -resume
#default fcGroupFeatures="gene_biotype" (or gbkey)
      --skip_qc                     Skip all QC steps apart from MultiQC
      --skip_fastqc                 Skip FastQC
      --skip_rseqc                  Skip RSeQC
      --skip_genebody_coverage      Skip calculating genebody coverage
      --skip_preseq                 Skip Preseq
      --skip_dupradar               Skip dupRadar (and Picard MarkDups)
      --skip_edger                  Skip edgeR MDS plot and heatmap
      --skip_multiqc                Skip MultiQC

47 popular functions in Biopython

Leave a reply

Here are 47 popular functions in Biopython:

Seq: Represents a biological sequence, such as DNA or protein.
SeqRecord: Represents a sequence with metadata, including ID and description.
SeqIO: Handles reading and writing sequence files in various formats.
AlignIO: Handles reading and writing multiple sequence alignment files in various formats.
Bio.Alphabet: Defines sequence alphabets, such as IUPAC codes for DNA and protein.
Bio.SeqUtils: Provides various utilities for working with sequences, such as computing molecular weight and codon usage.
Bio.SeqIO.QualityIO: Handles reading and writing FASTQ files.
Bio.SeqIO.PhdIO: Handles reading and writing PHD files.
Bio.SeqIO.FastaIO: Handles reading and writing FASTA files.
Bio.SeqIO.ClustalIO: Handles reading and writing ClustalW files.
Bio.SeqIO.EmbossIO: Handles reading and writing EMBOSS files.
Bio.SeqIO.GenBankIO: Handles reading and writing GenBank files.
Bio.SeqIO.SwissIO: Handles reading and writing Swiss-Prot files.
Bio.SeqIO.TabIO: Handles reading and writing tab-separated files.
Bio.SeqIO.NexusIO: Handles reading and writing NEXUS files.
Bio.SearchIO: Parses search results from various tools, such as BLAST and HMMER.
Bio.Blast: Runs BLAST searches against various databases.
Bio.KEGG: Provides access to the KEGG database.
Bio.PDB: Handles reading and writing PDB files, as well as structure manipulation and analysis.
Bio.Align: Represents a multiple sequence alignment, with associated metadata and annotations.
Bio.Align.Applications: Wrappers for running multiple sequence alignment tools, such as ClustalW and Muscle.
Bio.Restriction: Provides tools for simulating and analyzing DNA restriction digests.
Bio.SeqFeature: Represents a feature or annotation on a sequence, such as a gene or a mutation.
Bio.Graphics.GenomeDiagram: Generates graphical representations of genome sequences, with features and annotations.
Bio.Graphics.BasicChromosome: Generates basic chromosome diagrams.
Bio.Emboss.Applications: Wrappers for running EMBOSS tools, such as Needle and Water.
Bio.SeqIO.Example: Provides example sequence data for testing and learning.
Bio.AlignIO.Example: Provides example alignment data for testing and learning.
Bio.Align.AlignInfo: Computes various summary statistics and information content for a multiple sequence alignment.
Bio.Align.PairwiseAligner: Computes pairwise alignments between two sequences.
Bio.Motif: Represents sequence motifs, such as transcription factor binding sites.
Bio.Motif.PWM: Represents a position weight matrix, commonly used to represent sequence motifs.
Bio.Alphabet.IUPAC: Defines IUPAC codes for various sequence alphabets, such as DNA and protein.
Bio.Alphabet.Gapped: Defines a gapped sequence alphabet, used for representing alignments.
Bio.SeqIO.AceIO: Handles reading and writing ACE files, commonly used for sequencing data.
Bio.SeqIO.GameXMLIO: Handles reading and writing GAME XML files, used for gene annotation.
Bio.SeqIO.LasIO: A module in Biopython that provides a parser for the PacBio LAS file format. The LAS file format is used to store PacBio long-read sequencing data, including raw signal data, basecalls, and metadata.
Bio.SeqUtils.molecular_weight(seq) – Calculates the molecular weight of a sequence.
Bio.Phylo.PAML.codeml() – A module for running the PAML codeml program for estimating rates of evolution and selection.
Bio.AlignIO.parse(handle, format) – Parses multiple sequence alignments from a file.
Bio.Alphabet.IUPAC.unambiguous_dna – Defines the unambiguous DNA alphabet.
Bio.Seq.reverse_complement(seq) – Returns the reverse complement of a sequence.
Bio.Seq.translate(seq) – Translates a nucleotide sequence into the corresponding amino acid sequence.
Bio.SeqUtils.seq1(seq) – Converts a protein sequence from three-letter code to one-letter code.
Bio.SeqIO.write(records, handle, format) – Writes a sequence record or a list of sequence records to a file.
Bio.PDB.PDBList() – A module for downloading PDB files from the RCSB PDB server.
Bio.Blast.NCBIWWW.qblast(program, database, sequence) – Sends a sequence to the NCBI BLAST server for homology search.

GAMOLA2 pre- and postprocessing

Leave a reply

#-- view process.py --
with open("MT880872_CDS.fasta", "r") as f:
    for line in f:
        if line.startswith(">"):
            header_words = line.strip().split("locus_tag=")
            third_word = header_words[1]
            third_word_ = third_word.strip().split("]")
            third_word__ = third_word_[0]
            print(">"+third_word__)
        else:
            print(line.strip()) 

 #input_file: MT880870_.fasta as pan_genome_reference_.fa; MT880870_.fasta is modified from MT880870.fasta by simplying the fasta headers.
echo ">MT880870_genes" > MT880870_genes.fa;
merge_seq.py MT880870_.fasta >> MT880870_genes.fa;
samtools faidx MT880870_genes.fa
bioawk -c fastx '{ print $name, length($seq) }' < MT880870_.fasta > length.txt 
generate_gene_model.py length.txt > /media/jhuang/Titisee/GAMOLA2/Results/gene_models/MT880870_genes.fa.combined;
cp MT880870_genes.fa /media/jhuang/Titisee/GAMOLA2/Input_sequences

python process.py > MT880871_.fasta
echo ">MT880871_genes" > MT880871_genes.fa;
merge_seq.py MT880871_.fasta >> MT880871_genes.fa;
samtools faidx MT880871_genes.fa
bioawk -c fastx '{ print $name, length($seq) }' < MT880871_.fasta > length.txt 
generate_gene_model.py length.txt > /media/jhuang/Titisee/GAMOLA2/Results/gene_models/MT880871_genes.fa.combined;
cp MT880871_genes.fa /media/jhuang/Titisee/GAMOLA2/Input_sequences

python process.py > MT880872_.fasta
echo ">MT880872_genes" > MT880872_genes.fa;
merge_seq.py MT880872_.fasta >> MT880872_genes.fa;
samtools faidx MT880872_genes.fa
bioawk -c fastx '{ print $name, length($seq) }' < MT880872_.fasta > length.txt 
generate_gene_model.py length.txt > /media/jhuang/Titisee/GAMOLA2/Results/gene_models/MT880872_genes.fa.combined;
cp MT880872_genes.fa /media/jhuang/Titisee/GAMOLA2/Input_sequences

#---- 4 ----
cd /media/jhuang/Titisee/GAMOLA2
#WARNING: !!swissprot as default database, manually choose nr as Blast_db --> too slow, better choose swissprot, it is quick!!
./Gamola.pl    #No Glimmer model and Critica database due to self-extracted ORF; choosing nr.pal or swissprot.pal as Blast_db; COG2014 as COG_db; Pfam-A.hmm as Pfam_db; No Rfam_db; TIGRFAMS_15.0_HMM.LIB as TIGRfam_db
/media/jhuang/Titisee/GAMOLA2/Results/COG_results
#grep "Class: ," roary_182s_95.fa_COG_*
for sample in 10601 10710 11187 11377 12474 12504  1520 1551 2092 2160 2467   289 3455 4694 5862 5863 6166 6464 7618 8007 8406 8784 9157 9373 953; do
  sed -i -e 's/Class: ,/Class: None, None/g' roary_182s_95.fa_COG_${sample}
done
./Gamola.pl

#---- 5 ---- update locustag
cp /media/jhuang/Titisee/GAMOLA2/Consolidated_results/MT880870_genes.fa/MT880870_genes.fa.gb ./
awk '{print $0 "ORIGIN"> "file" NR}' RS='ORIGIN' MT880870_genes.fa.gb
sed -i 's/</ /g' file2
sed -i 's/^ //g' file2
cat file1 file2 > MT880870_genes_.fa.gb
iconv -t UTF-8 -f Windows-1252 MT880870_genes_.fa.gb
#89600270 Jan 26 13:44 roary_182s_95__.fa.gb
#-- if ran [3.2 selected]--
python ~/Scripts/update_locustag.py MT880870_genes_.fa.gb MT880870_.fasta.fai > MT880870_genes__.fa.gb
> MT880870_genes__.fa.gb    #clean old *__.fa.gb

#---- 6.6 ----
cut -d$'\t' -f1 MT880870_.fasta.fai > MT880870_f1
process_gamola_gb_plus.py MT880870_f1 MT880870_genes__.fa.gb > annotated_MT880870.txt  #!!

Some fields may be wrong.
  BiopythonParserWarning)
/usr/local/lib/python2.7/dist-packages/Bio/GenBank/__init__.py:1047: BiopythonParserWarning: Ignoring invalid location: '1186620..1185596'
sed -i -e 's/1186620\.\.1185596/1185596\.\.1186620/g' roary__.fa.gb
#DEBUG "group_12676" -> "group_12667"

#---- ADD DNA-Sequences add the end of table as the last column ----
cut -d',' -f1-1 annotated_MT880870.txt > get_seq_ORF.sh
#extend the file get_seq_ORF.sh to extract all DNA-sequences
#mv the bash file under DIR of MT880870_.fasta.fai 
#samtools faidx MT880870_.fasta "PLKLOBMN_00001" > seq_ORF.fasta
#samtools faidx MT880870_.fasta "PLKLOBMN_00002" >> seq_ORF.fasta
#or in the case simply "cp MT880870_.fasta seq_ORF.fasta"
awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);}  END {printf("\n");}' < seq_ORF.fasta > seq_ORF_.fasta
grep -v ">" seq_ORF_.fasta > seq_ORF__.txt
#add "Sequence" to the first line of seq_ORF__.txt
paste -d',' annotated_MT880870.txt seq_ORF__.txt > annotated_MT880870_.txt

#---- repeat 5-->6.6 for MT880871 and MT880872 --------------
cp /media/jhuang/Titisee/GAMOLA2/Consolidated_results/MT880871_genes.fa/MT880871_genes.fa.gb ./
awk '{print $0 "ORIGIN"> "file" NR}' RS='ORIGIN' MT880871_genes.fa.gb
sed -i 's/</ /g' file2
sed -i 's/^ //g' file2
cat file1 file2 > MT880871_genes_.fa.gb
iconv -t UTF-8 -f Windows-1252 MT880871_genes_.fa.gb
python ~/Scripts/update_locustag.py MT880871_genes_.fa.gb MT880871_.fasta.fai > MT880871_genes__.fa.gb
#remove the last line 'ORIGIN'
cut -d$'\t' -f1 MT880871_.fasta.fai > MT880871_f1
process_gamola_gb_plus.py MT880871_f1 MT880871_genes__.fa.gb > annotated_MT880871.txt 

cp MT880871_.fasta seq_ORF_71.fasta
awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);}  END {printf("\n");}' < seq_ORF_71.fasta > seq_ORF_71_.fasta
grep -v ">" seq_ORF_71_.fasta > seq_ORF_71__.txt
#add 'Sequence' in the first line in seq_ORF_71__.txt
paste -d',' annotated_MT880871.txt seq_ORF_71__.txt > annotated_MT880871_.txt

#--

cp /media/jhuang/Titisee/GAMOLA2/Consolidated_results/MT880872_genes.fa/MT880872_genes.fa.gb ./
awk '{print $0 "ORIGIN"> "file" NR}' RS='ORIGIN' MT880872_genes.fa.gb
sed -i 's/</ /g' file2
sed -i 's/^ //g' file2
cat file1 file2 > MT880872_genes_.fa.gb
iconv -t UTF-8 -f Windows-1252 MT880872_genes_.fa.gb
python ~/Scripts/update_locustag.py MT880872_genes_.fa.gb MT880872_.fasta.fai > MT880872_genes__.fa.gb
#remove the last line 'ORIGIN'
cut -d$'\t' -f1 MT880872_.fasta.fai > MT880872_f1
process_gamola_gb_plus.py MT880872_f1 MT880872_genes__.fa.gb > annotated_MT880872.txt 

cp MT880872_.fasta seq_ORF_72.fasta
awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);}  END {printf("\n");}' < seq_ORF_72.fasta > seq_ORF_72_.fasta
grep -v ">" seq_ORF_72_.fasta > seq_ORF_72__.txt
#add 'Sequence' in the first line in seq_ORF_72__.txt
paste -d',' annotated_MT880872.txt seq_ORF_72__.txt > annotated_MT880872_.txt

mv annotated_MT880870_.txt annotated__MT880870_.txt
#~/Tools/csv2xls-0.4/csv_to_xls.py annotated__MT880870_.txt annotated_MT880871_.txt annotated_MT880872_.txt -d',' -o annotated_MT880870-MT880872.xls

The script that takes two file names as command line arguments, merges the files based on matching first words, removes the key from file2, and deletes all lines in the merged file in which the key does not occur in file2.

import sys

#"","baseMean","log2FoldChange","lfcSE","stat","pvalue","padj","Gene","Product","PFAM Label","PFAM Match", "PFAM Interpro", "PFAM Product", "TIGR Label", "TIGR Product","COG Label", "COG Code","COG Product","Translation"  #delete "Locus Tag",
#python3 merge2.py hS_vs_TSB_output/background annotated_CP052994_2996.txt hS_vs_TSB_output/background_
#python3 merge2.py hS_vs_TSB_output/downregulated_filtered annotated_CP052994_2996.txt hS_vs_TSB_output/downregulated_filtered_
#python3 merge2.py hS_vs_TSB_output/upregulated_filtered annotated_CP052994_2996.txt hS_vs_TSB_output/upregulated_filtered_
#python3 merge2.py infection_vs_nose_output/background annotated_CP052994_2996.txt infection_vs_nose_output/background_
#python3 merge2.py infection_vs_nose_output/downregulated_filtered annotated_CP052994_2996.txt infection_vs_nose_output/downregulated_filtered_
#python3 merge2.py infection_vs_nose_output/upregulated_filtered annotated_CP052994_2996.txt infection_vs_nose_output/upregulated_filtered_
#sed -i -e 's/HKH70_09175/HKH70_09175 (PLKLOBMN_00173)/g' background_
#sed -i -e 's/HKH70_05165/HKH70_05165 (PLKLOBMN_00016)/g' background_
#sed -i -e 's/HKH70_05170/HKH70_05170 (PLKLOBMN_00017)/g' background_
#sed -i -e 's/HKH70_05175/HKH70_05175 (PLKLOBMN_00018)/g' background_
#python3 ~/Tools/csv2xls-0.4/csv_to_xls.py background_ upregulated_filtered_ downregulated_filtered_ -d',' -o degenes_hS_vs_TSB.xls
#python3 ~/Tools/csv2xls-0.4/csv_to_xls.py background_ upregulated_filtered_ downregulated_filtered_ -d',' -o degenes_infection_vs_nose.xls
# Check if both file names were provided as command line arguments
if len(sys.argv) != 4:
    print("Usage: python merge2.py file1.txt file2.txt output_file.txt")
    sys.exit(1)

# Open file1 for reading and file2 for editing
with open(sys.argv[1], 'r') as f1, open(sys.argv[2], 'r+') as f2:
    # Read file2 into a dictionary
    f2_dict = {}
    for line in f2:
        key, content = line.strip().split(',', 1)
        f2_dict[key] = content

    # Merge file1 and file2
    merged_lines = []
    for line in f1:
        key = line.split(',')[0]
        if key in f2_dict:
            content = f2_dict[key]
            merged_lines.append(line.strip() + ',' + content)

    ## Move the file pointer to the beginning of file2
    #f2.seek(0)
    ## Truncate file2 to delete all its contents
    #f2.truncate()
    ## Rewrite file2 with the updated content
    #for key, content in f2_dict.items():
    #    f2.write(content + '\n')

    # Overwrite file1 with the merged content
    with open(sys.argv[3], 'w') as f3:
        f3.write('\n'.join(merged_lines))

Retrieving KEGG Genes Using Bioservices in Python

Leave a reply

Biopython does not have built-in support for KEGG database. However, you can use the bioservices library to retrieve and interact with KEGG data. To fetch all available genes in the KEGG database, you would need to iterate through each organism and collect all their genes. Note that this process might take a long time and may not be efficient, as there are thousands of organisms and millions of genes in the KEGG database.

Use the bioservices library to fetch the list of available organisms and retrieve genes for the first few organisms:

from bioservices import KEGG

# Initialize KEGG API
kegg_api = KEGG()

# Get the list of organisms
organisms_raw = kegg_api.list("organism")
organisms = [entry.split("\t")[1] for entry in organisms_raw.split("\n") if entry]
#['hsa', 'ptr', 'pps', 'ggo', 'pon', 'nle', 'hmh', 'mcc', 'mcf', 'mthb', 'mni', 'csab', 'caty', 'panu', 'tge', 'mleu', 'rro', 'rbb', 'tfn', 'pteh', 'cang', 'cjc', 'sbq', 'cimi', 'csyr', 'mmur', 'lcat', 'pcoq', 'oga', 'mmu', 'mcal', ... , 'loki', 'psyt', 'agw', 'arg']

# Limit the number of organisms and genes for demonstration purposes
organism_limit = 3
gene_limit = 10

# Iterate through the organisms
for organism in organisms[:organism_limit]:
    print(f"Organism: {organism}")

    # Get the list of genes for the current organism
    genes = kegg_api.list(f"{organism}").split("\n")[:gene_limit]

    # Iterate through the genes and print gene identifiers
    for gene_entry in genes:
        gene_id = gene_entry.split("\t")[0]
        print(f"Gene ID: {gene_id}")

    print("\n")

#Organism: hsa
#Gene ID: hsa:102466751
#Gene ID: hsa:100302278
#Gene ID: hsa:79501
#Gene ID: hsa:112268260
#Gene ID: hsa:729759
#Gene ID: hsa:124904706
#Gene ID: hsa:105378947
#Gene ID: hsa:113219467
#Gene ID: hsa:81399
#Gene ID: hsa:148398
#...

This code will print the gene identifiers of the first 10 genes for the first 3 organisms in the KEGG database. You can modify the organism_limit and gene_limit variables to change the number of organisms and genes processed.

Remember that fetching all genes from the KEGG database might take a significant amount of time and may not be efficient. It’s usually more practical to focus on specific organisms or pathways of interest.

Running PICRUSt2: A Step-by-Step Guide to Predicting Microbial Community Functional Profiles

Leave a reply

PICRUSt2 (Phylogenetic Investigation of Communities by Reconstruction of Unobserved States) is a bioinformatics tool used to predict the functional content of a microbial community based on marker gene sequences (e.g., 16S rRNA gene sequences).

Here’s a step-by-step guide on how to run PICRUSt2:

Install PICRUSt2 and its dependencies: You can install PICRUSt2 using conda. If you don’t have conda installed, first install Miniconda or Anaconda. Then, create a conda environment and install PICRUSt2 using the following commands:
```
conda create -n picrust2_env
conda activate picrust2_env
conda install -c bioconda -c conda-forge picrust2
```
Prepare input files: You will need two input files to run PICRUSt2: a sequence alignment file (in FASTA format) and an OTU (Operational Taxonomic Unit) table or ASV (Amplicon Sequence Variant) table (in BIOM format). Make sure your input files are properly formatted and contain the necessary information.
Place sequences into reference phylogeny: Run the place_seqs.py script to place your input sequences into a reference phylogeny. Replace
and with your input FASTA file and desired output directory, respectively. place_seqs.py -s -o /out.tre -p 1 –intermediate /intermediate
Run PICRUSt2 pipeline: Execute the picrust2_pipeline.py script to run the PICRUSt2 pipeline. Replace
with your OTU/ASV table, and with your desired output directory. picrust2_pipeline.py -i -s -o -p 1
Generate output files: The pipeline will generate various output files in the specified output directory, including predicted functional profiles in TSV format. You can further analyze and visualize these results using other tools and software (e.g., STAMP or R).

For more detailed information and advanced options, consult the official PICRUSt2 documentation: https://github.com/picrust/picrust2/wiki

Creating a Simple Snake Game (贪吃蛇) in Python Using Pygame

Leave a reply

贪吃蛇是一款经典的游戏，可以用Python的Pygame库来实现。以下是一个简单的贪吃蛇游戏实现：

首先，确保已经安装了Python和Pygame库。如果没有安装Pygame，请使用以下命令进行安装：
```
pip install pygame
```

创建一个新的Python文件（例如：snake.py），然后复制以下代码到文件中：

import sys
import pygame
import random

pygame.init()

# 游戏窗口设置
width, height = 640, 480
screen = pygame.display.set_mode((width, height))
pygame.display.set_caption('贪吃蛇')

# 定义颜色
black = (0, 0, 0)
white = (255, 255, 255)
red = (255, 0, 0)

# 蛇和食物的大小
block_size = 20

# 初始化蛇的位置
snake = [(width // 2, height // 2)]
snake_speed = block_size
snake_direction = 'right'

# 初始化食物的位置
food = (random.randint(1, width // block_size - 1) * block_size, random.randint(1, height // block_size - 1) * block_size)

clock = pygame.time.Clock()

while True:
    for event in pygame.event.get():
        if event.type == pygame.QUIT:
            pygame.quit()
            sys.exit()

        # 检测键盘方向键
        if event.type == pygame.KEYDOWN:
            if event.key == pygame.K_UP and snake_direction != 'down':
                snake_direction = 'up'
            elif event.key == pygame.K_DOWN and snake_direction != 'up':
                snake_direction = 'down'
            elif event.key == pygame.K_LEFT and snake_direction != 'right':
                snake_direction = 'left'
            elif event.key == pygame.K_RIGHT and snake_direction != 'left':
                snake_direction = 'right'

    x, y = snake[0]
    if snake_direction == 'up':
        y -= snake_speed
    elif snake_direction == 'down':
        y += snake_speed
    elif snake_direction == 'left':
        x -= snake_speed
    elif snake_direction == 'right':
        x += snake_speed

    snake.insert(0, (x, y))

    # 检测蛇是否吃到食物
    if snake[0] == food:
        food = (random.randint(1, width // block_size - 1) * block_size, random.randint(1, height // block_size - 1) * block_size)
    else:
        snake.pop()

    # 检测蛇是否撞到边界或自身
    if (x < 0 or x >= width or y < 0 or y >= height) or snake[0] in snake[1:]:
        pygame.quit()
        sys.exit()

    screen.fill(black)

    # 绘制蛇
    for segment in snake:
        pygame.draw.rect(screen, white, pygame.Rect(segment[0], segment[1], block_size, block_size))

    # 绘制食物
    pygame.draw.rect(screen, red, pygame.Rect(food[0], food[1], block_size, block_size))

    pygame.display.flip()

    # 设置游戏速度
    clock.tick(4)

运行贪吃蛇游戏：
- 保存Python文件（例如：snake.py）。
- 打开终端（Linux，macOS）或命令提示符（Windows）。
- 导航到包含snake.py文件的目录。
- 运行以下命令启动游戏：
  
  python3 snake.py
- 现在，你应该看到一个贪吃蛇游戏窗口。你可以使用方向键来控制蛇的移动。当蛇吃到红色的食物时，它会变长。如果蛇碰到屏幕边缘或者自己的身体，游戏就会结束。

Enteroviruses (肠道病毒): Classification, Transmission, and Prevention of a Diverse Group of Viral Infections

Leave a reply

Enteroviruses are a group of viruses that belong to the Picornaviridae family. They are small, single-stranded RNA viruses that can cause a wide range of illnesses in humans and animals. There are over 100 different types of enteroviruses, which can be further classified into several subgroups, such as:

Polioviruses: These viruses are responsible for causing poliomyelitis (polio), a highly infectious disease that primarily affects young children and can lead to paralysis or even death in severe cases. The introduction of polio vaccines has significantly reduced the incidence of polio worldwide, and it is now eradicated in most countries.
Coxsackieviruses (Group A and B): These viruses can cause a variety of illnesses, including hand, foot, and mouth disease (HFMD), herpangina, myocarditis, and meningitis.
Echoviruses: These viruses can cause mild to severe illnesses, such as respiratory infections, gastrointestinal illnesses, and meningitis.
Enteroviruses 68-71: These include the Enterovirus D68 (EV-D68), which can cause severe respiratory illness, particularly in children with asthma or other pre-existing respiratory conditions.

Enteroviruses are primarily spread through the fecal-oral route, contact with contaminated surfaces, and respiratory droplets from an infected person. They can cause a wide range of symptoms, from mild to severe, and may sometimes lead to more serious complications.

Prevention measures for enterovirus infections include practicing good personal hygiene, such as washing hands regularly, disinfecting surfaces, and avoiding close contact with infected individuals. There is no specific antiviral treatment for enterovirus infections, and management typically focuses on relieving symptoms and providing supportive care. In some cases, hospitalization may be necessary for severe infections or complications.

肠道病毒是属于Picornaviridae科的一组病毒。它们是小型的、单链RNA病毒，可在人类和动物中引起各种疾病。肠道病毒有100多种不同类型，可以进一步分类为几个亚组，如：

脊髓灰质炎病毒：这些病毒会导致小儿麻痹症（脊髓灰质炎），这是一种主要影响儿童的高度传染性疾病，严重情况下可导致瘫痪甚至死亡。脊髓灰质炎疫苗的推广已显著降低了全球脊髓灰质炎的发病率，现在在大多数国家已经消除。
库萨奇病毒（A组和B组）：这些病毒可引起多种疾病，包括手足口病（HFMD）、口腔疱疹、心肌炎和脑膜炎。
回声病毒：这些病毒可引起从轻度到严重的各种疾病，如呼吸道感染、胃肠道疾病和脑膜炎。
肠道病毒68-71：包括肠道病毒D68（EV-D68），可导致严重的呼吸道疾病，尤其是在已有哮喘或其他呼吸道疾病的儿童中。

肠道病毒主要通过粪-口途径、接触污染表面和感染者的呼吸道飞沫传播。它们可引起各种症状，从轻度到严重，并可能导致更严重的并发症。

预防肠道病毒感染的措施包括保持良好的个人卫生，如勤洗手、消毒表面和避免与感染者密切接触。目前没有针对肠道病毒感染的特异性抗病毒治疗，治疗通常集中在缓解症状和提供支持性护理。在某些情况下，严重感染或并发症可能需要住院治疗。

The p53-R175H Mutation (p53-R175H突变): Implications for Cancer Development and Emerging Therapeutic Strategies

Leave a reply

The p53-R175H mutation refers to a specific change in the TP53 gene, which encodes for the p53 protein. The p53 protein is a crucial tumor suppressor that plays a significant role in preventing the development of cancer by regulating cell growth and apoptosis (programmed cell death).

In the p53-R175H mutation, the amino acid arginine (R) at position 175 in the p53 protein is replaced by histidine (H). This single amino acid substitution can lead to a loss of p53 function, causing the protein to lose its ability to regulate cell growth and trigger apoptosis properly. As a result, cells with this mutation may grow uncontrollably and evade normal cell death processes, which can contribute to the development of cancer.

The p53-R175H mutation has been observed in various types of human cancers, including breast, ovarian, lung, and colorectal cancers. In some cases, it is associated with a more aggressive tumor behavior and a poorer prognosis.

Cancer treatments targeting the p53-R175H mutation are under investigation, with some researchers focusing on the development of small molecules or other strategies to restore the normal function of p53 or induce cell death in cells harboring the mutation. Additionally, understanding the role of p53-R175H and other TP53 mutations can help improve the development of personalized cancer therapies.

p53-R175H突变是指TP53基因中的一个特定变化，该基因编码p53蛋白。p53蛋白是一种关键的肿瘤抑制蛋白，通过调节细胞生长和凋亡（程序性细胞死亡）起着防止癌症发展的重要作用。

在p53-R175H突变中，p53蛋白中位于175位的氨基酸精氨酸（R）被组氨酸（H）取代。这种单一氨基酸替换可能导致p53功能丧失，使蛋白质失去正常调节细胞生长和触发凋亡的能力。因此，携带这种突变的细胞可能会失控地生长并逃避正常的细胞死亡过程，从而有助于癌症的发展。

p53-R175H突变已在各种类型的人类癌症中观察到，包括乳腺癌、卵巢癌、肺癌和结直肠癌。在某些情况下，它与更具侵袭性的肿瘤行为和较差的预后有关。

针对p53-R175H突变的癌症治疗正在研究中，一些研究人员专注于开发小分子或其他策略，以恢复p53的正常功能或诱导携带突变的细胞死亡。此外，了解p53-R175H和其他TP53突变在癌症发展中的作用有助于改善个性化癌症治疗的研发。

MtlD基因: Exploring the Role of Mannitol-1-Phosphate Dehydrogenase in Microbial Metabolism and Survival

Leave a reply

MtlD refers to a gene that encodes for the enzyme mannitol-1-phosphate dehydrogenase. This enzyme is involved in the metabolism of mannitol, a sugar alcohol that serves as a source of carbon and energy for certain microorganisms. Mannitol-1-phosphate dehydrogenase catalyzes the conversion of mannitol-1-phosphate to fructose-6-phosphate, which can then be further processed in the glycolysis pathway to generate energy (ATP).

The MtlD gene is present in a variety of bacteria and some fungi, where mannitol metabolism plays a role in their growth and survival. In some plant-associated bacteria, mannitol may be utilized as a compatible solute to help these bacteria cope with osmotic stress in their environment. Additionally, in some pathogenic bacteria, the ability to metabolize mannitol may provide a competitive advantage during infection or colonization of host tissues.

Understanding the function and regulation of MtlD and its corresponding enzyme is important for the study of bacterial metabolism, stress adaptation, and potential applications in biotechnology or as a target for developing antimicrobial strategies.

MtlD是指编码甘露醇-1-磷酸脱氢酶的基因。这种酶参与甘露醇的代谢，甘露醇是一种糖醇，为某些微生物提供碳源和能量。甘露醇-1-磷酸脱氢酶催化甘露醇-1-磷酸转化为果糖-6-磷酸，然后在糖酵解途径中进一步处理以产生能量（ATP）。

MtlD基因存在于多种细菌和一些真菌中，在它们的生长和生存中，甘露醇代谢起着重要作用。在一些与植物相关的细菌中，甘露醇可能被用作兼容溶质，帮助这些细菌应对环境中的渗透压应激。此外，在一些致病细菌中，代谢甘露醇的能力可能在感染或定殖宿主组织过程中提供竞争优势。

了解MtlD及其相应酶的功能和调控对于研究细菌代谢、抗压适应以及在生物技术中的潜在应用或作为开发抗微生物策略的靶标具有重要意义。

The Importance of acrA and acrB Genes in Antibiotic Resistance and Bacterial Physiology

Leave a reply

acrA和acrB基因编码的AcrA和AcrB蛋白质是AcrAB-TolC泵的两个组成部分。AcrA是一种膜融合蛋白，位于AcrB和TolC之间，起到连接的作用。AcrB是一种跨膜效应器蛋白，能够识别各种不同类型的有毒化合物并驱动其从细胞内排出。TolC是一种外膜通道蛋白，负责将底物从细胞内排到细胞外。这三个蛋白质共同组成了AcrAB-TolC泵，使细菌能够有效地排出各种有毒化合物。

除了在细菌的抗生素耐药性中起着重要的作用之外，acrA和acrB基因在细菌的生长和代谢中也发挥着重要的作用。例如，在某些细菌中，AcrAB-TolC泵可能参与细胞外分泌、细胞表面性状的调控、生物膜的合成等重要的生理过程。

在临床实践中，对于抗生素治疗失效的细菌感染，常常是由于这些细菌已经发展出了对多种抗生素的耐药性。因此，对acrA和acrB基因及其编码的蛋白质的研究，对于探索新的抗菌药物和开发更有效的治疗方法具有重要意义。

acrA and acrB are two commonly found bacterial genes that encode proteins critical to many Gram-negative bacteria. The proteins produced by these genes, AcrA and AcrB, along with the protein TolC, work together to form the AcrAB-TolC pump.

The AcrAB-TolC pump is a three-component complex that belongs to the resistance-nodulation-division (RND) family of antibiotic efflux pumps. This pump uses energy to expel a wide range of toxic compounds from within bacterial cells, including antibiotics, dyes, and organic solvents.

acrA and acrB play a crucial role in the antibiotic resistance of bacteria. Studies have shown that mutations or deletions in these genes can cause bacteria to become more or less susceptible to antibiotics. In addition to their role in antibiotic resistance, acrA and acrB are also involved in other biological processes such as bacterial growth and metabolism.

In clinical practice, antibiotic treatment of bacterial infections is becoming increasingly challenging due to the development of multidrug-resistant bacteria. The importance of acrA and acrB in bacterial resistance mechanisms makes them important targets for the development of new antibiotics and more effective treatments for antibiotic-resistant infections.

Microbial bioinformatics

Microbial bioinformatics uses computational tools to analyze genomes, track evolution, and study functions in microorganisms, including bacteria and viruses.

Author Archives: gene_x

Preparing custom gtf and bed files for nextflow RNA-seq pipeline

47 popular functions in Biopython

GAMOLA2 pre- and postprocessing

Retrieving KEGG Genes Using Bioservices in Python

Running PICRUSt2: A Step-by-Step Guide to Predicting Microbial Community Functional Profiles

Creating a Simple Snake Game (贪吃蛇) in Python Using Pygame

Enteroviruses (肠道病毒): Classification, Transmission, and Prevention of a Diverse Group of Viral Infections

The p53-R175H Mutation (p53-R175H突变): Implications for Cancer Development and Emerging Therapeutic Strategies

MtlD基因: Exploring the Role of Mannitol-1-Phosphate Dehydrogenase in Microbial Metabolism and Survival

The Importance of acrA and acrB Genes in Antibiotic Resistance and Bacterial Physiology