Author Archives: gene_x

Creating and Training a Toy Transformer Model with Python: A Step-by-Step Guide

GPT (short for “Generative Pre-trained Transformer”) is a family of large language models built on the Transformer architecture, an advanced deep learning architecture introduced by Vaswani et al. in the 2017 paper “Attention Is All You Need”. GPT and its successors, such as GPT-2 and GPT-3, have been developed and released by OpenAI.

You can indeed train a toy transformer model with Python. It will be far smaller and less powerful than the GPT models, but it can help you understand the core concepts and workings of the transformer architecture. Several Python libraries are available for creating and training transformer models, including TensorFlow and PyTorch.

Here is a high-level overview of the process:

  • Install necessary libraries: Install TensorFlow or PyTorch, depending on which library you prefer for deep learning tasks.

  • Prepare the dataset: Collect and preprocess the text data you will use to train your model. The data should be tokenized (converted into smaller units, such as words or subwords) and encoded into a suitable format for the model.

  • Define the model architecture: Implement the transformer model architecture, including the attention mechanism, multi-head self-attention, positional encoding, feed-forward layers, and other necessary components.

  • Train the model: Train your toy transformer model on the prepared dataset using an appropriate optimizer, learning rate schedule, and loss function (e.g., cross-entropy loss for language modeling tasks). Make sure to set aside a portion of the data for validation to monitor the model's performance during training and prevent overfitting.

  • Evaluate the model: After training, evaluate your toy transformer model on a test dataset to measure its performance. Common evaluation metrics for language models include perplexity, BLEU score, and ROUGE score, depending on the specific task you're focusing on.

  • Fine-tune and experiment: You can fine-tune your toy transformer model by adjusting hyperparameters, such as learning rate, batch size, and model size (e.g., the number of layers and attention heads). Experiment with different settings to improve the model's performance.

Keep in mind that training a transformer model, even a toy one, can be computationally expensive and may require a significant amount of time and resources. Using a GPU or cloud-based services, such as Google Colab, can help speed up the training process.

There are also many tutorials and code examples available online to help you get started with building and training your own toy transformer model in Python. Remember to consult the official documentation of the deep learning libraries (TensorFlow or PyTorch) and any other resources to ensure you’re using the most up-to-date practices and techniques.
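To make these steps concrete, here is a minimal sketch of a character-level transformer language model in PyTorch. The model class, hyperparameters, toy corpus, and names such as ToyTransformerLM are purely illustrative and not taken from any particular tutorial:

import torch
import torch.nn as nn

# Toy corpus and character-level "tokenizer" (illustrative only)
text = "hello world, hello transformer. " * 200
chars = sorted(set(text))
stoi = {c: i for i, c in enumerate(chars)}
data = torch.tensor([stoi[c] for c in text], dtype=torch.long)

vocab_size, block_size, d_model = len(chars), 32, 64

class ToyTransformerLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(block_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4,
                                           dim_feedforward=128, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, idx):
        b, t = idx.shape
        pos = torch.arange(t, device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)
        # Causal mask: each position may only attend to earlier positions
        mask = torch.triu(torch.full((t, t), float("-inf"), device=idx.device), diagonal=1)
        x = self.encoder(x, mask=mask)
        return self.lm_head(x)

model = ToyTransformerLM()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
loss_fn = nn.CrossEntropyLoss()

# Next-character prediction with cross-entropy loss
for step in range(200):
    ix = torch.randint(0, len(data) - block_size - 1, (16,))
    xb = torch.stack([data[i:i + block_size] for i in ix])
    yb = torch.stack([data[i + 1:i + block_size + 1] for i in ix])
    logits = model(xb)
    loss = loss_fn(logits.reshape(-1, vocab_size), yb.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % 50 == 0:
        print(step, loss.item())

Even on a CPU this trains in a minute or two on the toy corpus and shows the loss dropping, which is enough to verify that the architecture and training loop are wired up correctly.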

Downloading DNA and Protein Sequences from NCBI Using Biopython and Taxonomy ID

Biopython is a useful library for working with biological data in Python. You can use it to download sequences from GenBank by specifying a taxonomy ID. Here’s a code example to download sequences using Biopython’s Entrez module, which is an interface to the NCBI Entrez databases, including GenBank:

from Bio import Entrez
from Bio import SeqIO

# Set your email address (required by NCBI)
Entrez.email = "your_email@example.com"

# Specify the taxonomy ID
taxonomy_id = "your_taxonomy_id_here"

# Search for records in the nucleotide database using the taxonomy ID
search_query = f"txid{taxonomy_id}[Organism:exp]"
handle = Entrez.esearch(db="nucleotide", term=search_query)

# Parse the search results
record = Entrez.read(handle)
handle.close()

# Get the GenBank IDs of the records
genbank_ids = record["IdList"]

# Fetch the sequences using the GenBank IDs
sequences = []
for genbank_id in genbank_ids:
    handle = Entrez.efetch(db="nucleotide", id=genbank_id, rettype="gb", retmode="text")
    seq_record = SeqIO.read(handle, "genbank")
    handle.close()
    sequences.append(seq_record)

# Print the fetched sequences
for seq in sequences:
    print(seq)

Replace “your_email@example.com” with your actual email address, and “your_taxonomy_id_here” with the specific taxonomy ID you’re interested in. This script will search the nucleotide database in NCBI GenBank using the specified taxonomy ID and download the corresponding sequences.

Here’s a brief explanation of the code:

  • Import the necessary modules from Biopython: Entrez for accessing the NCBI databases and SeqIO for handling sequence records.
  • Set your email address. This is required by NCBI to track the usage of their resources.
  • Specify the taxonomy ID you want to search for.
  • Create a search query using the taxonomy ID, and search the nucleotide database.
  • Parse the search results to get a list of GenBank IDs.
  • Use the GenBank IDs to fetch the sequences, and store them in a list.
  • Print the fetched sequences in the console.

Please note that the script will fetch all the sequences associated with the specified taxonomy ID. If there are many sequences, it might take a while to download them all. Also, make sure you have installed Biopython in your Python environment before running the script. If you haven’t, you can install it using pip:

pip install biopython
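Also note that Entrez.esearch returns only the first 20 record IDs by default. A small variation of the script above (reusing the same search_query) raises this limit with retmax and fetches the records in batches rather than one request per ID:

# Retrieve up to 500 IDs instead of the default 20
handle = Entrez.esearch(db="nucleotide", term=search_query, retmax=500)
record = Entrez.read(handle)
handle.close()
genbank_ids = record["IdList"]

# Fetch the records in batches of 100 instead of one request per ID
sequences = []
batch_size = 100
for start in range(0, len(genbank_ids), batch_size):
    batch = genbank_ids[start:start + batch_size]
    handle = Entrez.efetch(db="nucleotide", id=",".join(batch),
                           rettype="gb", retmode="text")
    sequences.extend(SeqIO.parse(handle, "genbank"))
    handle.close()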

If you want to download protein sequences instead of nucleotide sequences, you can make a few modifications to the code I provided earlier. The main change is to search and fetch data from the “protein” database instead of the “nucleotide” database. Here’s the updated code:

from Bio import Entrez
from Bio import SeqIO

# Set your email address (required by NCBI)
Entrez.email = "your_email@example.com"

# Specify the taxonomy ID
taxonomy_id = "your_taxonomy_id_here"

# Search for records in the protein database using the taxonomy ID
search_query = f"txid{taxonomy_id}[Organism:exp]"
handle = Entrez.esearch(db="protein", term=search_query)

# Parse the search results
record = Entrez.read(handle)
handle.close()

# Get the protein IDs of the records
protein_ids = record["IdList"]

# Fetch the sequences using the protein IDs
sequences = []
for protein_id in protein_ids:
    handle = Entrez.efetch(db="protein", id=protein_id, rettype="gb", retmode="text")
    seq_record = SeqIO.read(handle, "genbank")
    handle.close()
    sequences.append(seq_record)

# Print the fetched sequences
for seq in sequences:
    print(seq)

This code is almost identical to the previous one, but with two important changes:

  1. The Entrez.esearch() function has its db parameter set to “protein” instead of “nucleotide”. This change ensures that the search is performed in the protein database rather than the nucleotide database.
  2. The Entrez.efetch() function also has its db parameter set to “protein” for the same reason. This change ensures that the sequences are fetched from the protein database.

Preparing custom GTF and BED files for the Nextflow RNA-seq pipeline

-- 0, generate the ref.gtf and ref.bed from GenBank file: HD21-2_new.gtf and HD21-2_new.bed --

#[PREPARE gtf from gff] https://github.com/gpertea/gffread/issues/3
#DEL gb2gff.py HD21-2_chr.gbk > HD21-2.gtf
#DEL gb2gff.py HD21-2_plasmid_pHD21-2_1.gbk >> HD21-2.gtf
#DEL gb2gff.py HD21-2_plasmid_pHD21-2_2.gbk >> HD21-2.gtf

#-- STEP 0.0: download the gff3 files --
cat HD21-2_chr.gff3 HD21-2_plasmid1.gff3 HD21-2_plasmid2.gff3 > HD21-2.gff3

#-- STEP 0.1: rename CDS to exon! --
gffread -E -F -O -T HD21-2.gff3 -o HD21-2.gtf
sed -i -e "s/\tCDS\t/\texon\t/g" HD21-2.gtf 
#Genbank_CDS --> Genbank_exon
#cmsearch_CDS --> cmsearch_exon
#tRNAscan-SE_CDS --> tRNAscan-SE_exon
#or xxx_CDS --> xxx_exon --> xxx[cmsearch, Genbank, tRNAscan] exon
#
#-- STEP 0.2: add gene_biotype "xxxx" to the exon lines
#exon (former Genbank CDS): 2657 = (Genbank transcript + cmsearch transcript + tRNAscan-SE transcript) = (2586+61+10)
#Genbank region:                                   7
#        transcript:                            2649
#Genbank_exon == Genbank transcript             2586
#tRNAscan-SE_exon == tRNAscan-SE transcript       61
#cmsearch_exon == cmsearch transcript             10
------------------------------------------------------
#total                                          5313 (transcript 2649-->bed, exon 2657, region 7)

#in kate "\texon" --> "_exon"
grep "Genbank_exon" HD21-2.gtf > HD21-2_Genbank_exon.gtf            #add gene_biotype "protein_coding"; at the end of line (2586)
grep "tRNAscan-SE_exon" HD21-2.gtf > HD21-2_tRNAscan-SE_exon.gtf    #add gene_biotype "tRNA"; at the end of line (61)
grep "cmsearch_exon" HD21-2.gtf > HD21-2_cmsearch_exon.gtf          #add gene_biotype "RNase_P_RNA";| gene_biotype "ncRNA";| gene_biotype "rRNA";| gene_biotype "tmRNA";| gene_biotype "SRP_RNA"; at the end (10)

grep -i -e "cmsearch" HD21-2.gtf > temp
grep -i -e "gene_biotype" temp > temp2
CP052994.1  cmsearch    transcript  1286239 1286622 .   -   .   transcript_id "rna-HKH70_06005"; gene_id "gene-HKH70_06005"; gene_name "rnpB"; Dbxref "RFAM:RF00011"; gbkey "ncRNA"; gene "rnpB"; inference "COORDINATES: profile:INFERNAL:1.1.1"; locus_tag "HKH70_06005"; product "RNase P RNA component class B"; Name "rnpB"; gene_biotype "RNase_P_RNA";
CP052994.1  cmsearch    transcript  1452715 1452905 .   -   .   transcript_id "rna-HKH70_06865"; gene_id "gene-HKH70_06865"; gene_name "ssrS"; Dbxref "RFAM:RF00013"; gbkey "ncRNA"; gene "ssrS"; inference "COORDINATES: profile:INFERNAL:1.1.1"; locus_tag "HKH70_06865"; product "6S RNA"; Name "ssrS"; gene_biotype "ncRNA";
CP052994.1  cmsearch    transcript  1706296 1706410 .   -   .   transcript_id "rna-HKH70_08190"; gene_id "gene-HKH70_08190"; gene_name "rrf"; Dbxref "RFAM:RF00001"; gbkey "rRNA"; gene "rrf"; inference "COORDINATES: profile:INFERNAL:1.1.1"; locus_tag "HKH70_08190"; product "5S ribosomal RNA"; Name "rrf"; gene_biotype "rRNA";
...

grep -v "_exon" HD21-2.gtf > HD21-2_non_exon.gtf #2656
cat HD21-2_non_exon.gtf HD21-2_Genbank_exon.gtf HD21-2_tRNAscan-SE_exon.gtf HD21-2_cmsearch_exon.gtf  > HD21-2_new.gtf #5313

#in kate "exon" --> "\texon"
#[Eventually] sort the HD21-2_new.gtf using libreoffice
#[Eventually] replace "" with "; "transcript_id with transcript_id; "\n with \n in kate

#-- STEP 0.3: PREPARE bed from gtf
#[Eventually] replace '\t0\t' with '\t.\t' in the gff-file in the kate-editor
gffread -E -F --bed HD21-2_new.gtf -o HD21-2_new.bed  #   .. loaded 2649 genomic features from HD21-2_new.gtf

#[IMPORTANT] delete [*.1] in gtf, bed. In the fa the header could be e.g. ">CP052994 Staphylococcus epidermidis strain HD21-2 chromosome, complete genome"
#GTF: CP052994.1      Genbank exon    1       1356    .       +       .       transcript_id "gene-HKH70_00005"; gene_id "gene-HKH70_00005"; gene_name "dnaA"; Dbxref "NCBI_GP:QRJ41002.1"; Name "QRJ41002.1"; gbkey "CDS"; gene "dnaA"; inference "COORDINATES: similar to AA sequence:RefSeq:WP_017723675.1"; locus_tag "HKH70_00005"; product "chromosomal replication initiator protein DnaA"; protein_id "QRJ41002.1"; transl_table "11"; gene_biotype "protein_coding";
#BED: CP052994.1      0       1356    gene-HKH70_00005        100     +       0       1356    0,0,0   1       1356,   0,      CDS=0:1356;CDSphase=0;geneID=gene-HKH70_00005;gene_name=dnaA;Name=dnaA;gbkey=Gene;gene=dnaA;gene_biotype=protein_coding;locus_tag=HKH70_00005

-------
-- 2 --
conda activate rnaseq
# --> !!!! SUCCESSFUL !!!!
nextflow run rnaseq --reads '/home/jhuang/DATA/Data_Anna_HAPDICS_RNASeq/raw_data/*_R{1,2}.fq.gz' --fasta /home/jhuang/DATA/Data_Anna_HAPDICS_RNASeq/HD21-2.fa --gtf /home/jhuang/DATA/Data_Anna_HAPDICS_RNASeq/HD21-2_new.gtf --bed12 /home/jhuang/DATA/Data_Anna_HAPDICS_RNASeq/HD21-2_new.bed -profile standard --aligner hisat2 --fcGroupFeatures gene_id --fcGroupFeaturesType gene_biotype --saveReference --skip_genebody_coverage --skip_dupradar --skip_preseq --skip_edger  -resume
#default fcGroupFeatures="gene_biotype" (or gbkey)
      --skip_qc                     Skip all QC steps apart from MultiQC
      --skip_fastqc                 Skip FastQC
      --skip_rseqc                  Skip RSeQC
      --skip_genebody_coverage      Skip calculating genebody coverage
      --skip_preseq                 Skip Preseq
      --skip_dupradar               Skip dupRadar (and Picard MarkDups)
      --skip_edger                  Skip edgeR MDS plot and heatmap
      --skip_multiqc                Skip MultiQC

47 popular functions in Biopython

Here are 47 popular functions and modules in Biopython:

  • Seq: Represents a biological sequence, such as DNA or protein.
  • SeqRecord: Represents a sequence with metadata, including ID and description.
  • SeqIO: Handles reading and writing sequence files in various formats.
  • AlignIO: Handles reading and writing multiple sequence alignment files in various formats.
  • Bio.Alphabet: Defines sequence alphabets, such as IUPAC codes for DNA and protein.
  • Bio.SeqUtils: Provides various utilities for working with sequences, such as computing molecular weight and codon usage.
  • Bio.SeqIO.QualityIO: Handles reading and writing FASTQ files.
  • Bio.SeqIO.PhdIO: Handles reading and writing PHD files.
  • Bio.SeqIO.FastaIO: Handles reading and writing FASTA files.
  • Bio.SeqIO.ClustalIO: Handles reading and writing ClustalW files.
  • Bio.SeqIO.EmbossIO: Handles reading and writing EMBOSS files.
  • Bio.SeqIO.GenBankIO: Handles reading and writing GenBank files.
  • Bio.SeqIO.SwissIO: Handles reading and writing Swiss-Prot files.
  • Bio.SeqIO.TabIO: Handles reading and writing tab-separated files.
  • Bio.SeqIO.NexusIO: Handles reading and writing NEXUS files.
  • Bio.SearchIO: Parses search results from various tools, such as BLAST and HMMER.
  • Bio.Blast: Runs BLAST searches against various databases.
  • Bio.KEGG: Provides access to the KEGG database.
  • Bio.PDB: Handles reading and writing PDB files, as well as structure manipulation and analysis.
  • Bio.Align: Represents a multiple sequence alignment, with associated metadata and annotations.
  • Bio.Align.Applications: Wrappers for running multiple sequence alignment tools, such as ClustalW and Muscle.
  • Bio.Restriction: Provides tools for simulating and analyzing DNA restriction digests.
  • Bio.SeqFeature: Represents a feature or annotation on a sequence, such as a gene or a mutation.
  • Bio.Graphics.GenomeDiagram: Generates graphical representations of genome sequences, with features and annotations.
  • Bio.Graphics.BasicChromosome: Generates basic chromosome diagrams.
  • Bio.Emboss.Applications: Wrappers for running EMBOSS tools, such as Needle and Water.
  • Bio.SeqIO.Example: Provides example sequence data for testing and learning.
  • Bio.AlignIO.Example: Provides example alignment data for testing and learning.
  • Bio.Align.AlignInfo: Computes various summary statistics and information content for a multiple sequence alignment.
  • Bio.Align.PairwiseAligner: Computes pairwise alignments between two sequences.
  • Bio.Motif: Represents sequence motifs, such as transcription factor binding sites.
  • Bio.Motif.PWM: Represents a position weight matrix, commonly used to represent sequence motifs.
  • Bio.Alphabet.IUPAC: Defines IUPAC codes for various sequence alphabets, such as DNA and protein.
  • Bio.Alphabet.Gapped: Defines a gapped sequence alphabet, used for representing alignments.
  • Bio.SeqIO.AceIO: Handles reading and writing ACE files, commonly used for sequencing data.
  • Bio.SeqIO.GameXMLIO: Handles reading and writing GAME XML files, used for gene annotation.
  • Bio.SeqIO.LasIO: A module in Biopython that provides a parser for the PacBio LAS file format. The LAS file format is used to store PacBio long-read sequencing data, including raw signal data, basecalls, and metadata.
  • Bio.SeqUtils.molecular_weight(seq) – Calculates the molecular weight of a sequence.
  • Bio.Phylo.PAML.codeml() – A module for running the PAML codeml program for estimating rates of evolution and selection.
  • Bio.AlignIO.parse(handle, format) – Parses multiple sequence alignments from a file.
  • Bio.Alphabet.IUPAC.unambiguous_dna – Defines the unambiguous DNA alphabet.
  • Bio.Seq.reverse_complement(seq) – Returns the reverse complement of a sequence.
  • Bio.Seq.translate(seq) – Translates a nucleotide sequence into the corresponding amino acid sequence.
  • Bio.SeqUtils.seq1(seq) – Converts a protein sequence from three-letter code to one-letter code.
  • Bio.SeqIO.write(records, handle, format) – Writes a sequence record or a list of sequence records to a file.
  • Bio.PDB.PDBList() – A module for downloading PDB files from the RCSB PDB server.
  • Bio.Blast.NCBIWWW.qblast(program, database, sequence) – Sends a sequence to the NCBI BLAST server for homology search.
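As a quick illustration, here is a minimal sketch exercising a few of the items above (Seq, SeqRecord, SeqIO, and Bio.SeqUtils); the sequence and output file name are made up:

from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio import SeqIO
from Bio.SeqUtils import molecular_weight, seq1

dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")
print(dna.reverse_complement())   # reverse complement of the DNA sequence
print(dna.translate())            # translation into amino acids
print(molecular_weight(dna))      # molecular weight (DNA is the default seq_type)
print(seq1("MetAlaIleVal"))       # three-letter to one-letter protein code

# Wrap the sequence in a SeqRecord and write it out as FASTA
record = SeqRecord(dna, id="demo_seq", description="toy example")
SeqIO.write([record], "demo.fasta", "fasta")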

GAMOLA2 pre- and postprocessing

#-- view process.py (rewrites the FASTA headers so that only the locus_tag value is kept) --
with open("MT880872_CDS.fasta", "r") as f:
    for line in f:
        if line.startswith(">"):
            header_words = line.strip().split("locus_tag=")
            third_word = header_words[1]
            third_word_ = third_word.strip().split("]")
            third_word__ = third_word_[0]
            print(">"+third_word__)
        else:
            print(line.strip()) 

#input_file: MT880870_.fasta as pan_genome_reference_.fa; MT880870_.fasta is modified from MT880870.fasta by simplifying the fasta headers.
echo ">MT880870_genes" > MT880870_genes.fa;
merge_seq.py MT880870_.fasta >> MT880870_genes.fa;
samtools faidx MT880870_genes.fa
bioawk -c fastx '{ print $name, length($seq) }' < MT880870_.fasta > length.txt 
generate_gene_model.py length.txt > /media/jhuang/Titisee/GAMOLA2/Results/gene_models/MT880870_genes.fa.combined;
cp MT880870_genes.fa /media/jhuang/Titisee/GAMOLA2/Input_sequences

python process.py > MT880871_.fasta
echo ">MT880871_genes" > MT880871_genes.fa;
merge_seq.py MT880871_.fasta >> MT880871_genes.fa;
samtools faidx MT880871_genes.fa
bioawk -c fastx '{ print $name, length($seq) }' < MT880871_.fasta > length.txt 
generate_gene_model.py length.txt > /media/jhuang/Titisee/GAMOLA2/Results/gene_models/MT880871_genes.fa.combined;
cp MT880871_genes.fa /media/jhuang/Titisee/GAMOLA2/Input_sequences

python process.py > MT880872_.fasta
echo ">MT880872_genes" > MT880872_genes.fa;
merge_seq.py MT880872_.fasta >> MT880872_genes.fa;
samtools faidx MT880872_genes.fa
bioawk -c fastx '{ print $name, length($seq) }' < MT880872_.fasta > length.txt 
generate_gene_model.py length.txt > /media/jhuang/Titisee/GAMOLA2/Results/gene_models/MT880872_genes.fa.combined;
cp MT880872_genes.fa /media/jhuang/Titisee/GAMOLA2/Input_sequences

#---- 4 ----
cd /media/jhuang/Titisee/GAMOLA2
#WARNING: swissprot is the default database; manually choosing nr as Blast_db is too slow, so better choose swissprot, which is quick!
./Gamola.pl    #No Glimmer model and Critica database due to self-extracted ORF; choosing nr.pal or swissprot.pal as Blast_db; COG2014 as COG_db; Pfam-A.hmm as Pfam_db; No Rfam_db; TIGRFAMS_15.0_HMM.LIB as TIGRfam_db
/media/jhuang/Titisee/GAMOLA2/Results/COG_results
#grep "Class: ," roary_182s_95.fa_COG_*
for sample in 10601 10710 11187 11377 12474 12504  1520 1551 2092 2160 2467   289 3455 4694 5862 5863 6166 6464 7618 8007 8406 8784 9157 9373 953; do
  sed -i -e 's/Class: ,/Class: None, None/g' roary_182s_95.fa_COG_${sample}
done
./Gamola.pl

#---- 5 ---- update locustag
cp /media/jhuang/Titisee/GAMOLA2/Consolidated_results/MT880870_genes.fa/MT880870_genes.fa.gb ./
awk '{print $0 "ORIGIN"> "file" NR}' RS='ORIGIN' MT880870_genes.fa.gb
sed -i 's/</ /g' file2
sed -i 's/^ //g' file2
cat file1 file2 > MT880870_genes_.fa.gb
iconv -t UTF-8 -f Windows-1252 MT880870_genes_.fa.gb
#89600270 Jan 26 13:44 roary_182s_95__.fa.gb
#-- if ran [3.2 selected]--
python ~/Scripts/update_locustag.py MT880870_genes_.fa.gb MT880870_.fasta.fai > MT880870_genes__.fa.gb
> MT880870_genes__.fa.gb    #clean old *__.fa.gb

#---- 6.6 ----
cut -d$'\t' -f1 MT880870_.fasta.fai > MT880870_f1
process_gamola_gb_plus.py MT880870_f1 MT880870_genes__.fa.gb > annotated_MT880870.txt  #!!

#The GenBank parser may warn, e.g.:
#/usr/local/lib/python2.7/dist-packages/Bio/GenBank/__init__.py:1047: BiopythonParserWarning: Ignoring invalid location: '1186620..1185596' (Some fields may be wrong.)
#Fix the reversed coordinates in the GenBank file:
sed -i -e 's/1186620\.\.1185596/1185596\.\.1186620/g' roary__.fa.gb
#DEBUG "group_12676" -> "group_12667"

#---- ADD DNA sequences at the end of the table as the last column ----
cut -d',' -f1-1 annotated_MT880870.txt > get_seq_ORF.sh
#extend the file get_seq_ORF.sh to extract all DNA-sequences
#mv the bash file under DIR of MT880870_.fasta.fai 
#samtools faidx MT880870_.fasta "PLKLOBMN_00001" > seq_ORF.fasta
#samtools faidx MT880870_.fasta "PLKLOBMN_00002" >> seq_ORF.fasta
#or in this case simply "cp MT880870_.fasta seq_ORF.fasta"
awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);}  END {printf("\n");}' < seq_ORF.fasta > seq_ORF_.fasta
grep -v ">" seq_ORF_.fasta > seq_ORF__.txt
#add "Sequence" to the first line of seq_ORF__.txt
paste -d',' annotated_MT880870.txt seq_ORF__.txt > annotated_MT880870_.txt

#---- repeat 5-->6.6 for MT880871 and MT880872 --------------
cp /media/jhuang/Titisee/GAMOLA2/Consolidated_results/MT880871_genes.fa/MT880871_genes.fa.gb ./
awk '{print $0 "ORIGIN"> "file" NR}' RS='ORIGIN' MT880871_genes.fa.gb
sed -i 's/</ /g' file2
sed -i 's/^ //g' file2
cat file1 file2 > MT880871_genes_.fa.gb
iconv -t UTF-8 -f Windows-1252 MT880871_genes_.fa.gb
python ~/Scripts/update_locustag.py MT880871_genes_.fa.gb MT880871_.fasta.fai > MT880871_genes__.fa.gb
#remove the last line 'ORIGIN'
cut -d$'\t' -f1 MT880871_.fasta.fai > MT880871_f1
process_gamola_gb_plus.py MT880871_f1 MT880871_genes__.fa.gb > annotated_MT880871.txt 

cp MT880871_.fasta seq_ORF_71.fasta
awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);}  END {printf("\n");}' < seq_ORF_71.fasta > seq_ORF_71_.fasta
grep -v ">" seq_ORF_71_.fasta > seq_ORF_71__.txt
#add 'Sequence' in the first line in seq_ORF_71__.txt
paste -d',' annotated_MT880871.txt seq_ORF_71__.txt > annotated_MT880871_.txt

#--

cp /media/jhuang/Titisee/GAMOLA2/Consolidated_results/MT880872_genes.fa/MT880872_genes.fa.gb ./
awk '{print $0 "ORIGIN"> "file" NR}' RS='ORIGIN' MT880872_genes.fa.gb
sed -i 's/</ /g' file2
sed -i 's/^ //g' file2
cat file1 file2 > MT880872_genes_.fa.gb
iconv -t UTF-8 -f Windows-1252 MT880872_genes_.fa.gb
python ~/Scripts/update_locustag.py MT880872_genes_.fa.gb MT880872_.fasta.fai > MT880872_genes__.fa.gb
#remove the last line 'ORIGIN'
cut -d$'\t' -f1 MT880872_.fasta.fai > MT880872_f1
process_gamola_gb_plus.py MT880872_f1 MT880872_genes__.fa.gb > annotated_MT880872.txt 

cp MT880872_.fasta seq_ORF_72.fasta
awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);}  END {printf("\n");}' < seq_ORF_72.fasta > seq_ORF_72_.fasta
grep -v ">" seq_ORF_72_.fasta > seq_ORF_72__.txt
#add 'Sequence' in the first line in seq_ORF_72__.txt
paste -d',' annotated_MT880872.txt seq_ORF_72__.txt > annotated_MT880872_.txt

mv annotated_MT880870_.txt annotated__MT880870_.txt
#~/Tools/csv2xls-0.4/csv_to_xls.py annotated__MT880870_.txt annotated_MT880871_.txt annotated_MT880872_.txt -d',' -o annotated_MT880870-MT880872.xls

The following script takes three file names as command-line arguments: it merges file1 and file2 on matching first fields (keys), appends the matching content from file2 (without the key) to each line of file1, drops the lines of file1 whose key does not occur in file2, and writes the merged result to the output file.

import sys

#"","baseMean","log2FoldChange","lfcSE","stat","pvalue","padj","Gene","Product","PFAM Label","PFAM Match", "PFAM Interpro", "PFAM Product", "TIGR Label", "TIGR Product","COG Label", "COG Code","COG Product","Translation"  #delete "Locus Tag",
#python3 merge2.py hS_vs_TSB_output/background annotated_CP052994_2996.txt hS_vs_TSB_output/background_
#python3 merge2.py hS_vs_TSB_output/downregulated_filtered annotated_CP052994_2996.txt hS_vs_TSB_output/downregulated_filtered_
#python3 merge2.py hS_vs_TSB_output/upregulated_filtered annotated_CP052994_2996.txt hS_vs_TSB_output/upregulated_filtered_
#python3 merge2.py infection_vs_nose_output/background annotated_CP052994_2996.txt infection_vs_nose_output/background_
#python3 merge2.py infection_vs_nose_output/downregulated_filtered annotated_CP052994_2996.txt infection_vs_nose_output/downregulated_filtered_
#python3 merge2.py infection_vs_nose_output/upregulated_filtered annotated_CP052994_2996.txt infection_vs_nose_output/upregulated_filtered_
#sed -i -e 's/HKH70_09175/HKH70_09175 (PLKLOBMN_00173)/g' background_
#sed -i -e 's/HKH70_05165/HKH70_05165 (PLKLOBMN_00016)/g' background_
#sed -i -e 's/HKH70_05170/HKH70_05170 (PLKLOBMN_00017)/g' background_
#sed -i -e 's/HKH70_05175/HKH70_05175 (PLKLOBMN_00018)/g' background_
#python3 ~/Tools/csv2xls-0.4/csv_to_xls.py background_ upregulated_filtered_ downregulated_filtered_ -d',' -o degenes_hS_vs_TSB.xls
#python3 ~/Tools/csv2xls-0.4/csv_to_xls.py background_ upregulated_filtered_ downregulated_filtered_ -d',' -o degenes_infection_vs_nose.xls
# Check that all three file names (file1, file2, output) were provided as command line arguments
if len(sys.argv) != 4:
    print("Usage: python merge2.py file1.txt file2.txt output_file.txt")
    sys.exit(1)

# Open file1 for reading; file2 is opened with 'r+' so the commented-out block below could rewrite it
with open(sys.argv[1], 'r') as f1, open(sys.argv[2], 'r+') as f2:
    # Read file2 into a dictionary
    f2_dict = {}
    for line in f2:
        key, content = line.strip().split(',', 1)
        f2_dict[key] = content

    # Merge file1 and file2
    merged_lines = []
    for line in f1:
        key = line.split(',')[0]
        if key in f2_dict:
            content = f2_dict[key]
            merged_lines.append(line.strip() + ',' + content)

    ## Move the file pointer to the beginning of file2
    #f2.seek(0)
    ## Truncate file2 to delete all its contents
    #f2.truncate()
    ## Rewrite file2 with the updated content
    #for key, content in f2_dict.items():
    #    f2.write(content + '\n')

    # Overwrite file1 with the merged content
    with open(sys.argv[3], 'w') as f3:
        f3.write('\n'.join(merged_lines))
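For example (hypothetical lines): if file1 contains "gene-HKH70_00005,1052.3,2.1" and file2 contains "gene-HKH70_00005,dnaA,chromosomal replication initiator protein DnaA", then running python3 merge2.py file1 file2 output_file writes "gene-HKH70_00005,1052.3,2.1,dnaA,chromosomal replication initiator protein DnaA" to output_file, while lines of file1 whose key is absent from file2 are dropped.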

Retrieving KEGG Genes Using Bioservices in Python

Biopython's built-in KEGG support (Bio.KEGG) is limited to a thin REST wrapper and a few parsers. The bioservices library offers a more convenient way to retrieve and interact with KEGG data. To fetch all available genes in the KEGG database, you would need to iterate through each organism and collect all their genes. Note that this process can take a very long time and is not efficient, as there are thousands of organisms and millions of genes in the KEGG database.

Use the bioservices library to fetch the list of available organisms and retrieve genes for the first few organisms:

from bioservices import KEGG

# Initialize KEGG API
kegg_api = KEGG()

# Get the list of organisms
organisms_raw = kegg_api.list("organism")
organisms = [entry.split("\t")[1] for entry in organisms_raw.split("\n") if entry]
#['hsa', 'ptr', 'pps', 'ggo', 'pon', 'nle', 'hmh', 'mcc', 'mcf', 'mthb', 'mni', 'csab', 'caty', 'panu', 'tge', 'mleu', 'rro', 'rbb', 'tfn', 'pteh', 'cang', 'cjc', 'sbq', 'cimi', 'csyr', 'mmur', 'lcat', 'pcoq', 'oga', 'mmu', 'mcal', ... , 'loki', 'psyt', 'agw', 'arg']

# Limit the number of organisms and genes for demonstration purposes
organism_limit = 3
gene_limit = 10

# Iterate through the organisms
for organism in organisms[:organism_limit]:
    print(f"Organism: {organism}")

    # Get the list of genes for the current organism
    genes = kegg_api.list(f"{organism}").split("\n")[:gene_limit]

    # Iterate through the genes and print gene identifiers
    for gene_entry in genes:
        gene_id = gene_entry.split("\t")[0]
        print(f"Gene ID: {gene_id}")

    print("\n")

#Organism: hsa
#Gene ID: hsa:102466751
#Gene ID: hsa:100302278
#Gene ID: hsa:79501
#Gene ID: hsa:112268260
#Gene ID: hsa:729759
#Gene ID: hsa:124904706
#Gene ID: hsa:105378947
#Gene ID: hsa:113219467
#Gene ID: hsa:81399
#Gene ID: hsa:148398
#...

This code will print the gene identifiers of the first 10 genes for the first 3 organisms in the KEGG database. You can modify the organism_limit and gene_limit variables to change the number of organisms and genes processed.

Remember that fetching all genes from the KEGG database might take a significant amount of time and may not be efficient. It’s usually more practical to focus on specific organisms or pathways of interest.
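For such targeted queries, the same kegg_api object can fetch individual entries or search by keyword. A short sketch (the gene ID hsa:7157 for human TP53 and the keyword are only examples):

# Fetch the flat-file entry for a single gene (hsa:7157 = human TP53)
entry_text = kegg_api.get("hsa:7157")
print(entry_text[:200])

# Parse the flat-file entry into a dictionary
parsed = kegg_api.parse(entry_text)
print(parsed["NAME"])

# Find pathways whose name or description mentions a keyword
print(kegg_api.find("pathway", "glycolysis"))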

Running PICRUSt2: A Step-by-Step Guide to Predicting Microbial Community Functional Profiles

PICRUSt2 (Phylogenetic Investigation of Communities by Reconstruction of Unobserved States) is a bioinformatics tool used to predict the functional content of a microbial community based on marker gene sequences (e.g., 16S rRNA gene sequences).

Here’s a step-by-step guide on how to run PICRUSt2:

  1. Install PICRUSt2 and its dependencies: You can install PICRUSt2 using conda. If you don’t have conda installed, first install Miniconda or Anaconda. Then, create a conda environment and install PICRUSt2 using the following commands:

    conda create -n picrust2_env
    conda activate picrust2_env
    conda install -c bioconda -c conda-forge picrust2
  2. Prepare input files: You will need two input files to run PICRUSt2: a FASTA file of your representative sequences (ASVs/OTUs) and an OTU (Operational Taxonomic Unit) or ASV (Amplicon Sequence Variant) abundance table (in BIOM format). Make sure your input files are properly formatted and contain the necessary information.

  3. Place sequences into the reference phylogeny: Run the place_seqs.py script to place your input sequences into the reference phylogeny. Replace <input.fasta> and <output_dir> with your input FASTA file and desired output directory, respectively:

    place_seqs.py -s <input.fasta> -o <output_dir>/out.tre -p 1 --intermediate <output_dir>/intermediate
  4. Run the PICRUSt2 pipeline: Execute the picrust2_pipeline.py script to run the full PICRUSt2 pipeline. Replace <table.biom> with your OTU/ASV table and <output_dir> with your desired output directory:

    picrust2_pipeline.py -i <table.biom> -s <input.fasta> -o <output_dir> -p 1
  5. Generate output files: The pipeline will generate various output files in the specified output directory, including predicted functional profiles in TSV format. You can further analyze and visualize these results using other tools and software (e.g., STAMP or R).

For more detailed information and advanced options, consult the official PICRUSt2 documentation: https://github.com/picrust/picrust2/wiki

Creating a Simple Snake Game (贪吃蛇) in Python Using Pygame

Snake (贪吃蛇) is a classic game that can be implemented with Python's Pygame library. Below is a simple implementation of the game:

  • First, make sure Python and the Pygame library are installed. If Pygame is not installed yet, install it with the following command:

    pip install pygame
  • Create a new Python file (for example, snake.py) and copy the following code into it:

    import sys
    import pygame
    import random
    
    pygame.init()
    
    # Game window setup
    width, height = 640, 480
    screen = pygame.display.set_mode((width, height))
    pygame.display.set_caption('贪吃蛇')
    
    # Define colors
    black = (0, 0, 0)
    white = (255, 255, 255)
    red = (255, 0, 0)
    
    # Size of the snake segments and the food
    block_size = 20
    
    # Initialize the snake's position
    snake = [(width // 2, height // 2)]
    snake_speed = block_size
    snake_direction = 'right'
    
    # Initialize the food's position
    food = (random.randint(1, width // block_size - 1) * block_size, random.randint(1, height // block_size - 1) * block_size)
    
    clock = pygame.time.Clock()
    
    while True:
        for event in pygame.event.get():
            if event.type == pygame.QUIT:
                pygame.quit()
                sys.exit()
    
            # Handle arrow-key input
            if event.type == pygame.KEYDOWN:
                if event.key == pygame.K_UP and snake_direction != 'down':
                    snake_direction = 'up'
                elif event.key == pygame.K_DOWN and snake_direction != 'up':
                    snake_direction = 'down'
                elif event.key == pygame.K_LEFT and snake_direction != 'right':
                    snake_direction = 'left'
                elif event.key == pygame.K_RIGHT and snake_direction != 'left':
                    snake_direction = 'right'
    
        x, y = snake[0]
        if snake_direction == 'up':
            y -= snake_speed
        elif snake_direction == 'down':
            y += snake_speed
        elif snake_direction == 'left':
            x -= snake_speed
        elif snake_direction == 'right':
            x += snake_speed
    
        snake.insert(0, (x, y))
    
        # Check whether the snake has eaten the food
        if snake[0] == food:
            food = (random.randint(1, width // block_size - 1) * block_size, random.randint(1, height // block_size - 1) * block_size)
        else:
            snake.pop()
    
        # Check whether the snake hit the boundary or itself
        if (x < 0 or x >= width or y < 0 or y >= height) or snake[0] in snake[1:]:
            pygame.quit()
            sys.exit()
    
        screen.fill(black)
    
        # Draw the snake
        for segment in snake:
            pygame.draw.rect(screen, white, pygame.Rect(segment[0], segment[1], block_size, block_size))
    
        # Draw the food
        pygame.draw.rect(screen, red, pygame.Rect(food[0], food[1], block_size, block_size))
    
        pygame.display.flip()
    
        # Set the game speed
        clock.tick(4)
  • Run the Snake game:

    • Save the Python file (for example, snake.py).

    • Open a terminal (Linux, macOS) or Command Prompt (Windows).

    • Navigate to the directory containing the snake.py file.

    • Start the game with the following command:

      python3 snake.py

    • You should now see a Snake game window. Use the arrow keys to control the snake. The snake grows longer each time it eats the red food; the game ends if the snake hits the edge of the screen or its own body.

Enteroviruses (肠道病毒): Classification, Transmission, and Prevention of a Diverse Group of Viral Infections

Enteroviruses are a group of viruses that belong to the Picornaviridae family. They are small, single-stranded RNA viruses that can cause a wide range of illnesses in humans and animals. There are over 100 different types of enteroviruses, which can be further classified into several subgroups, such as:

  1. Polioviruses: These viruses are responsible for causing poliomyelitis (polio), a highly infectious disease that primarily affects young children and can lead to paralysis or even death in severe cases. The introduction of polio vaccines has significantly reduced the incidence of polio worldwide, and the disease has been eliminated in most countries.

  2. Coxsackieviruses (Group A and B): These viruses can cause a variety of illnesses, including hand, foot, and mouth disease (HFMD), herpangina, myocarditis, and meningitis.

  3. Echoviruses: These viruses can cause mild to severe illnesses, such as respiratory infections, gastrointestinal illnesses, and meningitis.

  4. Enteroviruses 68-71: These include the Enterovirus D68 (EV-D68), which can cause severe respiratory illness, particularly in children with asthma or other pre-existing respiratory conditions.

Enteroviruses are primarily spread through the fecal-oral route, contact with contaminated surfaces, and respiratory droplets from an infected person. They can cause a wide range of symptoms, from mild to severe, and may sometimes lead to more serious complications.

Prevention measures for enterovirus infections include practicing good personal hygiene, such as washing hands regularly, disinfecting surfaces, and avoiding close contact with infected individuals. There is no specific antiviral treatment for enterovirus infections, and management typically focuses on relieving symptoms and providing supportive care. In some cases, hospitalization may be necessary for severe infections or complications.

The p53-R175H Mutation (p53-R175H突变): Implications for Cancer Development and Emerging Therapeutic Strategies

The p53-R175H mutation refers to a specific change in the TP53 gene, which encodes the p53 protein. The p53 protein is a crucial tumor suppressor that plays a significant role in preventing the development of cancer by regulating cell growth and apoptosis (programmed cell death).

In the p53-R175H mutation, the amino acid arginine (R) at position 175 in the p53 protein is replaced by histidine (H). This single amino acid substitution can lead to a loss of p53 function, causing the protein to lose its ability to regulate cell growth and trigger apoptosis properly. As a result, cells with this mutation may grow uncontrollably and evade normal cell death processes, which can contribute to the development of cancer.

The p53-R175H mutation has been observed in various types of human cancers, including breast, ovarian, lung, and colorectal cancers. In some cases, it is associated with a more aggressive tumor behavior and a poorer prognosis.

Cancer treatments targeting the p53-R175H mutation are under investigation, with some researchers focusing on the development of small molecules or other strategies to restore the normal function of p53 or induce cell death in cells harboring the mutation. Additionally, understanding the role of p53-R175H and other TP53 mutations can help improve the development of personalized cancer therapies.
