Author Archives: gene_x

TODO: mapping virus reads should use TopHat2, BWA or Bowtie2 could not detect splice events.

RNA sequencing (RNA-seq)

Library preparation for strand specific RNA sequencing was carried out using the NEXTflex Directional RNA-Seq Kit (Bioo Scientific) according to the manufacturer’s instructions.
Libraries were sequenced on the Illumina HiSeq 2500 platform.
To allow detection of splice events that extend over the origin, reads were mapped to two concatenated copies of the MCVSyn or MCVSyn-hpko genomes using TopHat2 v 2.0.13 [94].
The positions of mapped reads and junctions were subsequently collapsed back on unit-length genomes.
From the resulting SAM files, we counted the number of unspliced reads that extended over splice sites of junctions detected by TopHat to determine splice site efficacy and frequency of individual junctions.
To estimate transcript abundance, for each combination of splice junctions that mapped within either the major early or late transcription cassettes we calculated a relative strand-specific combinatorial frequency value by multiplying observed frequency values for individual donor sites.
The relative ratio of late to early transcripts was subsequently estimated by calculation of normalized RPKM (reads per kilobase per million mapped reads) for each of the transcripts.
Based on the Juliane paper, “A Comprehensive Analysis of Replicating Merkel Cell Polyomavirus Genomes Delineates the Viral Transcription Program and Suggests a Role for mcv-miR-M1 in Episomal Persistence,” particularly in the “RNA sequencing (RNA-seq)” methods section, they used strand-specific RNA sequencing. This explains why they are able to separate the coverage into blue (positive strand) and red (negative strand). The original text reads: “Library preparation for strand-specific RNA sequencing was carried out using the NEXTflex Directional RNA-Seq Kit (Bioo Scientific) according to the manufacturer’s instructions.”
Unfortunately, your RNA-seq data is not strand-specific. Additionally, the primary focus of your sequencing is on the human genome, while the virus is a by-product, meaning we only have a very limited number of virus reads. For example, on average, there are only about 10 reads for VP1 and VP2 per sample. For this reason, creating a plot similar to the one in Juliane’s paper, with separate coverage for positive and negative strands, is not feasible.

Deciphering S. aureus with metatranscriptomics (Marc’s project)

RNAseq analaysis using nextflow set reference as CP000255

prepare reference

mkdir raw_data; cd raw_data
ln -s ../F24A430002261_BACwhbaP/P50_LH1+2/P50_LH1+2_1.fq.gz P50_LH1and2_R1.fq.gz
ln -s ../F24A430002261_BACwhbaP/P50_LH1+2/P50_LH1+2_2.fq.gz P50_LH1and2_R2.fq.gz

Downloading CP000255.fasta and CP000255.gff from GenBank

mv ~/Downloads/sequence\(3\).fasta CP000255.fasta
mv ~/Downloads/sequence\(2\).gff3 CP000255.gff

MRSA: USA300
EMRSA (CC22): CC22 strain
MSSA: Newmann strain

#https://journals.asm.org/doi/10.1128/jb.01000-07
#Staphylococcus aureus subsp. aureus strain Newman (Taxonomy ID: 426430)
#https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=426430

#Staphylococcus aureus subsp. aureus USA300_FPR3757, complete genome
https://www.ncbi.nlm.nih.gov/nuccore/CP000255.1

#Methicillin-resistant Staphylococcus aureus CC22-MRSA-IV as an agent of dairy cow intramammary infections
https://pubmed.ncbi.nlm.nih.gov/30473348/

sed 's/\tCDS\t/\texon\t/g' CP000255.gff > CP000255_m.gff
grep -P "\texon\t" CP000255_m.gff | sort | wc -l  #2630
# -- DEBUG_2: combination of 'CP059040_m.gff' and 'exon' results in ERROR, using 'transcript' instead! (but there is no 'transcript' in the column 3, strange????) --
--gff "/home/jhuang/DATA/Data_Marc_RNAseq_2024/CP000255_m.gff" --featurecounts_feature_type 'transcript'

Preparing the directory trimmed

mkdir trimmed trimmed_unpaired;
for sample_id in P50_LH1and2; do \
    java -jar /home/jhuang/Tools/Trimmomatic-0.36/trimmomatic-0.36.jar PE -threads 100 raw_data/${sample_id}_R1.fq.gz raw_data/${sample_id}_R2.fq.gz trimmed/${sample_id}_R1.fq.gz trimmed_unpaired/${sample_id}_R1.fq.gz trimmed/${sample_id}_R2.fq.gz trimmed_unpaired/${sample_id}_R2.fq.gz ILLUMINACLIP:/home/jhuang/Tools/Trimmomatic-0.36/adapters/TruSeq3-PE-2.fa:2:30:10:8:TRUE LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36 AVGQUAL:20;
done

Preparing samplesheet.csv

sample,fastq_1,fastq_2,strandedness
P50_LH1and2,P50_LH1and2_R1.fq.gz,P50_LH1and2_R2.fq.gz,auto

nextflow run

mv trimmed/*.fq.gz .
# ---- SUCCESSFUL with directly downloaded gff3 and fasta from NCBI using docker ----
docker pull nfcore/rnaseq
ln -s /home/jhuang/Tools/nf-core-rnaseq-3.12.0/ rnaseq
(host_env) /usr/local/bin/nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results    --fasta "/home/jhuang/DATA/Data_Marc_RNAseq_2024/CP000255.fasta" --gff "/home/jhuang/DATA/Data_Marc_RNAseq_2024/CP000255_m.gff"        -profile docker -resume  --max_cpus 55 --max_memory 512.GB --max_time 2400.h    --save_align_intermeds --save_unaligned --save_reference    --aligner 'star_salmon'    --gtf_group_features 'gene_id'  --gtf_extra_attributes 'gene_name' --featurecounts_group_type 'gene_biotype' --featurecounts_feature_type 'transcript'

# -- DEBUG_2: combination of 'CP059040_m.gff' and 'exon' results in ERROR, using 'transcript' instead!
--gff "/home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040_m.gff" --featurecounts_feature_type 'transcript'

# ---- SUCCESSFUL with directly downloaded gff3 and fasta from NCBI using docker after replacing 'CDS' with 'exon' ----
(host_env) /usr/local/bin/nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results    --fasta "/home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040.fasta" --gff "/home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040_m.gff"        -profile docker -resume  --max_cpus 55 --max_memory 512.GB --max_time 2400.h    --save_align_intermeds --save_unaligned --save_reference    --aligner 'star_salmon'    --gtf_group_features 'gene_id'  --gtf_extra_attributes 'gene_name' --featurecounts_group_type 'gene_biotype' --featurecounts_feature_type 'transcript'

# -- in results/genome/CP059040_genes.gtf -->
#P059040.1      Genbank transcript      1       1398    .       +       .       transcript_id "gene-H0N29_00005"; gene_id "gene-H0N29_00005"; gene_name "dnaA"; Name "dnaA"; gbkey "Gene"; gene "dnaA"; gene_biotype "protein_coding"; locus_tag "H0N29_00005";
#CP059040.1      Genbank exon    1       1398    .       +       .       transcript_id "gene-H0N29_00005"; gene_id "gene-H0N29_00005"; gene_name "dnaA"; Dbxref "NCBI_GP:QNT85048.1"; Name "QNT85048.1"; gbkey "CDS"; gene "dnaA"; inference "COORDINATES: similar to AA sequence:RefSeq:YP_004996698.1"; locus_tag "H0N29_00005"; product "chromosomal replication initiator protein DnaA"; protein_id "QNT85048.1"; transl_table "11";
#CP059040.1      Genbank transcript      1496    2644    .       +       .       transcript_id "gene-H0N29_00010"; gene_id "gene-H0N29_00010"; Name "H0N29_00010"; gbkey "Gene"; gene_biotype "protein_coding"; locus_tag "H0N29_00010";
#CP059040.1      Genbank exon    1496    2644    .       +       .       transcript_id "gene-H0N29_00010"; gene_id "gene-H0N29_00010"; Dbxref "NCBI_GP:QNT85049.1"; Name "QNT85049.1"; gbkey "CDS"; inference "COORDINATES: similar to AA sequence:RefSeq:WP_010116376.1"; locus_tag "H0N29_00010"; product "DNA polymerase III subunit beta"; protein_id "QNT85049.1"; transl_table "11";

# -- DEBUG_3: make sure the header of fasta is the same to the *_m.gff file
>CP000255.1 Staphylococcus aureus subsp. aureus USA300_FPR3757, complete genome

Calculate Host and Bacterial Compositions using vrap

download all S epidermidis genomes and identified all ST2 isolates from them!

Staphylococcus aureus  ID: 1280
Staphylococcus epidermidis Taxonomy ID: 1282

esearch -db nucleotide -query "txid3052345[Organism:exp]" | efetch -format fasta > virus_genome_3052345.fasta  # 23788

esearch -db nucleotide -query "txid1280[Organism:exp]" | \
efetch -format fasta -email j.huang@uke.de > genome_1280_ncbi.fasta
esearch -db nucleotide -query "txid1282[Organism:exp]" | \
efetch -format fasta -email j.huang@uke.de > genome_1282_ncbi.fasta
cd /mnt/md1/DATA_md1/Data_Marc_RNAseq_2024
python ~/Scripts/filter_fasta.py genome_1280_ncbi.fasta complete_genome_1280_ncbi.fasta  #3478, 9.2G
python ~/Scripts/filter_fasta.py genome_1282_ncbi.fasta complete_genome_1282_ncbi.fasta  #447, 1.1G

# ---- Download related genomes from ENA ----
https://www.ebi.ac.uk/ena/browser/view/1280
#Click "Sequence" and download "Counts" and "Taxon descendants count" if there is enough time! Downloading time points is 10.01.2025.
https://www.ebi.ac.uk/ena/browser/view/1282

#python ~/Scripts/filter_fasta.py  ena_1282_sequence_taxon_descendants_count.fasta complete_genome_1282_ena_taxon_descendants.fasta
#S.aureus, Staphylococcus epidermidis
python ~/Scripts/filter_fasta.py ena_1280_sequence.fasta complete_genome_1280_ena.fasta  #2187, 5.8G
python ~/Scripts/filter_fasta.py ena_1282_sequence.fasta complete_genome_1282_ena.fasta  #243, 596M

replace virus to S.aureus

run vrap

ln -s ~/Tools/vrap/ .
conda activate /home/jhuang/miniconda3/envs/vrap
vrap/vrap.py  -1 trimmed/P50_LH1and2_R1.fq.gz -2 trimmed/P50_LH1and2_R2.fq.gz -o P50_LH1and2 --bt2idx=/home/jhuang/REFs/genome  --host=/home/jhuang/REFs/genome.fa --virus=/home/jhuang/REFs/ncbi_dataset/data/genomic.fna --nt=/mnt/nvme0n1p1/blast/nt --nr=/mnt/nvme0n1p1/blast/nr  -t 100 -l 200  -g

vrap/vrap.py  -1 trimmed/P50_LH1and2_R1.fq.gz -2 trimmed/P50_LH1and2_R2.fq.gz -o P50_LH1and2_1280 --bt2idx=/home/jhuang/REFs/genome  --host=/home/jhuang/REFs/genome.fa --virus=/home/jhuang/REFs/genomes_by_taxid/complete_genome_1280_ncbi.fasta --nt=/mnt/nvme0n1p1/blast/nt --nr=/mnt/nvme0n1p1/blast/nr  -t 100 -l 200  -g

#TODO: change virus_user_db --> S.aureus_user_db
#ref|NZ_CP075562.1| --> Staphylococcus aureus strain Sta2021010 chromosome, complete genome

mv mapped mapped.sam
samtools view -S -b mapped.sam > mapped.bam
samtools sort mapped.bam -o mapped_sorted.bam
samtools index mapped_sorted.bam
samtools view -H mapped_sorted.bam
samtools flagstat mapped_sorted.bam

@PG     ID:bowtie2      PN:bowtie2      VN:2.4.4        CL:"/usr/bin/bowtie2-align-s --wrapper basic-0 -x /home/jhuang/REFs/genome -q --fast -p 100 --passthrough -1 /mnt/md1/DATA_md1/Data_Marc_RNAseq_2024/P50_LH1and2/flash/flash.notCombined_1.fastq -2 /mnt/md1/DATA_md1/Data_Marc_RNAseq_2024/P50_LH1and2/flash/flash.notCombined_2.fastq -U /mnt/md1/DATA_md1/Data_Marc_RNAseq_2024/P50_LH1and2/flash/flash.extendedFrags.fastq"
24061654 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
22492206 + 0 mapped (93.48% : N/A)
12342028 + 0 paired in sequencing
6171014 + 0 read1
6171014 + 0 read2
10326710 + 0 properly paired (83.67% : N/A)
11134274 + 0 with itself and mate mapped
56254 + 0 singletons (0.46% : N/A)
421598 + 0 with mate mapped to a different chr
15942 + 0 with mate mapped to a different chr (mapQ>=5)

Using the bowtie of vrap to map the reads on ref_genome/reference.fasta (The reference refers to the closest related genome found from the list generated by vrap)

vrap/vrap.py  -1 trimmed/P50_LH1and2_R1.fq.gz -2 trimmed/P50_LH1and2_R2.fq.gz -o P50_LH1and2_on_CP015646 --host CP015646.fasta -t 100 -l 200 -g
cd bowtie
mv mapped mapped.sam
samtools view -S -b mapped.sam > mapped.bam
samtools sort mapped.bam -o mapped_sorted.bam
samtools index mapped_sorted.bam
samtools view -H mapped_sorted.bam
samtools flagstat mapped_sorted.bam
#24061651 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
541762 + 0 mapped (2.25% : N/A)
12342022 + 0 paired in sequencing
6171011 + 0 read1
6171011 + 0 read2
382256 + 0 properly paired (3.10% : N/A)
388562 + 0 with itself and mate mapped
22349 + 0 singletons (0.18% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

bamCoverage -b mapped_sorted.bam -o ../../P50_LH1and2_on_CP015646_reads_coverage.bw

samtools depth -a mapped_sorted.bam | awk '{sum+=$3; count++} END {if (count > 0) print sum/count; else print 0}'
#21.4043

samtools depth -a mapped_sorted.bam | awk '{sum+=$3; count++} END {print sum/count}'
samtools depth -a mapped_sorted.bam | awk -v len=2742807 '{sum+=$3} END {print sum/len}'
#21.4043
samtools depth mapped_sorted.bam | awk '$3 > 0 {count++} END {print count}'
samtools depth mapped_sorted.bam | awk '$3 > 10 {count++} END {print count}'

python ~/Scripts/calculate_coverage.py mapped_sorted.bam
#Average coverage: 24.41

Report

I hope this email finds you well. Below is a summary of the analysis results based on the sequencing data:

    Host and Bacterial Composition:
        * The human genome accounted for 22,492,206 reads out of the total 24,061,654 reads (93.48%).

    Genome Assembly and Annotation:
        * After removing the human reads, we assembled the bacterial genome.
        * For annotation, we used the S. aureus user database (S.aureus_user_db), which includes 3,478 complete S. aureus genomes available in GenBank as of January 10, 2025.
        * The annotation summary is provided in the attached file assembly_summary.xlsx. The sequence CP015646.1 appears to be the closest genome for the data, as indicated in the attached table (highlighted in yellow).

    Mapping and Coverage:
        * Using CP015646.1 as the reference genome, 541,762 reads (2.25%) were mapped to this reference.
        * The average coverage is approximately 21x, with 951,154 positions out of 2,742,807 in the reference genome covered by at least 10 reads.
        * The attached figure xxx.png shows the coverage of reads mapped to CP015646.1.

Conclusion:
The analysis suggests that the majority of reads originate from the human genome, and the S. aureus reads are insufficient for a robust RNA-seq analysis.

Calculate the percentages of reads are human rRNA and human mRNA, see the attached review “Current concepts, advances, and challenges in deciphering the human microbiota with metatranscriptomics”.

run nextflow set human as reference

(OPTION_1, NOT_USED) Use Annotation Tools: Annotation pipelines like FeatureCounts or HTSeq can classify reads based on genomic annotations (e.g., GTF/GFF files). Use a comprehensive human genome annotation to distinguish between mRNA and rRNA regions.

        diff /mnt/nvme0n1p1/ref/Homo_sapiens/UCSC/hg38/Sequence/WholeGenomeFasta/genome.fa /home/jhuang/REFs/genome.fa

        featureCounts -a /mnt/nvme0n1p1/ref/Homo_sapiens/UCSC/hg38/gencode.v27.annotation.gtf -o counts.txt -t exon -g gene_type aligned.bam

Here, gene_type in the GTF file may distinguish mRNA and rRNA features.

/mnt/nvme0n1p1/ref/Homo_sapiens/UCSC/hg38/gencode.v27.annotation.gtf

(OPTION_2, USED, under host_env) /usr/local/bin/nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results    --fasta "/mnt/nvme0n1p1/ref/Homo_sapiens/Ensembl/GRCh38/Sequence/WholeGenomeFasta/genome.fa" --gtf "/mnt/nvme0n1p1/ref/Homo_sapiens/Ensembl/GRCh38/Annotation/Genes/genes.gtf"   --star_index "/mnt/nvme0n1p1/ref/Homo_sapiens/Ensembl/GRCh38/Sequence/STARIndex/"    -profile docker -resume  --max_cpus 55 --max_memory 512.GB --max_time 2400.h    --save_align_intermeds --save_unaligned --save_reference    --aligner 'star_salmon'
#--pseudo_aligner 'salmon'

'GRCh38' {
    fasta       = "/mnt/nvme0n1p1/ref/Homo_sapiens/Ensembl/GRCh38/Sequence/WholeGenomeFasta/genome.fa"
    bwa         = "/mnt/nvme0n1p1/ref/Homo_sapiens/Ensembl/GRCh38/Sequence/BWAIndex/version0.6.0/"
    bowtie2     = "/mnt/nvme0n1p1/ref/Homo_sapiens/Ensembl/GRCh38/Sequence/Bowtie2Index/"
    star        = "/mnt/nvme0n1p1/ref/Homo_sapiens/Ensembl/GRCh38/Sequence/STARIndex/"
    bismark     = "/mnt/nvme0n1p1/ref/Homo_sapiens/Ensembl/GRCh38/Sequence/BismarkIndex/"
    gtf         = "/mnt/nvme0n1p1/ref/Homo_sapiens/Ensembl/GRCh38/Annotation/Genes/genes.gtf"
    bed12       = "/mnt/nvme0n1p1/ref/Homo_sapiens/Ensembl/GRCh38/Annotation/Genes/genes.bed"
    mito_name   = "chrM"
    macs_gsize  = "2.7e9"
    blacklist   = "/mnt/nvme0n1p1/ref/Homo_sapiens/UCSC/hg38/blacklists/hg38-blacklist.bed"

Report

Indeed, until now I haven’t had direct experience with analyzing RNA-seq data from metatranscriptomic samples, as my expertise has primarily been with data derived from cultured samples. Ideally, sequencing from cultured samples would be the optimal approach.

However, if RNA-seq from metatranscriptomic samples becomes necessary, it would be crucial to deplete host (human) mRNA, host (human) rRNA, and microbial rRNA before sequencing (refer to the attached review PIIS0168952523001270.pdf, "Current concepts, advances, and challenges in deciphering the human microbiota with metatranscriptomics"). This step is important for enriching bacterial mRNA and could potentially enhance the quality of sequencing data.

This morning, I explored the origins of reads within the human genome. Despite over 20 million reads originating from humans, only a small fraction successfully mapped to unique positions (83,500 reads). The remaining reads failed due to reasons such as Unassigned_MultiMapping (the most common issue) and Unassigned_NoFeatures. Among the 83,500 mapped reads, only 0.01% originated from rRNA. For more detailed biotype analysis, please refer to featurecounts_biotype_plot-1.png.

Based on these findings, depleting human mRNA appears to be the most critical step for improvement.

Time synchronization failed

Leave a reply

https://askubuntu.com/questions/1046214/enable-system-clock-synchronization

Steps to Enable NTP Support

Check if systemd-timesyncd is Running

Ensure that the default time synchronization service is installed and active. Run:

sudo systemctl status systemd-timesyncd

If it’s inactive or not installed, enable and start it:

sudo systemctl enable systemd-timesyncd
sudo systemctl start systemd-timesyncd

Install and Configure ntp (if needed)

If systemd-timesyncd is not available or you prefer using ntp, follow these steps:

Install the ntp package:

sudo apt update sudo apt install ntp

Ensure the ntp service is enabled and running:

sudo systemctl enable ntp
sudo systemctl start ntp

Manually Set NTP Servers (Optional)

If you’re using systemd-timesyncd, you can configure NTP servers in /etc/systemd/timesyncd.conf. Open the file:

sudo vim /etc/systemd/timesyncd.conf

Add or edit the following lines to include your preferred NTP servers:

[Time] NTP=time.google.com 1.de.pool.ntp.org 2.de.pool.ntp.org

Save the file and restart the service:

sudo systemctl restart systemd-timesyncd

Force Synchronization

After ensuring the service is running, you can force synchronization:

sudo timedatectl set-ntp false sudo timedatectl set-ntp true

Verify Synchronization

Check the time status again:

timedatectl

Confirm that System clock synchronized: yes is displayed.

Steps to Resolve

Check for systemd-timesyncd Installation

Run the following command to see if systemd-timesyncd is installed:

sudo systemctl status systemd-timesyncd

If the service is not found, install and enable it:

sudo apt update
sudo apt install systemd-timesyncd
sudo systemctl enable systemd-timesyncd
sudo systemctl start systemd-timesyncd

Verify NTP Settings

Check if the configuration for NTP servers is correct. Open the file /etc/systemd/timesyncd.conf:

sudo nano /etc/systemd/timesyncd.conf

Ensure the [Time] section includes valid NTP servers:

[Time] NTP=time.google.com 1.de.pool.ntp.org 2.de.pool.ntp.org

Save the file and restart the service:

sudo systemctl restart systemd-timesyncd

Manually Enable NTP Synchronization

If the service is installed and running but synchronization is still not active, force-enable NTP:

sudo timedatectl set-ntp true

Install ntp as an Alternative (Optional)

If systemd-timesyncd is not sufficient or unavailable, you can install and use the ntp package:

sudo apt update sudo apt install ntp sudo systemctl enable ntp sudo systemctl start ntp

Check Synchronization Status

Recheck the time synchronization status:

timedatectl

Look for System clock synchronized: yes.

Die 10 besten Skigebiete Deutschlands: Skifahren mit der Familie

Leave a reply

https://www.travanto.de/urlaubsmagazin/skigebiete-deutschland-fuer-familienurlaub/#Winkl

Deutschland ist ein wahres Wintersportparadies. Travanto verrät Ihnen die 10 besten Skigebiete Deutschlands und stellt Ihnen ausgewählte Ziele vom Mittelgebirge bis zu den Hochalpen für einen Familienurlaub vor. Die Anzahl an Skigebieten in Deutschland ist erstaunlich. Erfahren Sie, welche die 10 besten Skigebiete für einen Urlaub mit der Familie sind. Nach einem ausgiebigen Tag auf den Skiern und Snowboards kehren Sie in Ihr Ferienhaus oder die Ferienwohnung zurück, da es in den eigenen vier Wänden unweit der Piste immer am gemütlichsten ist. Sie und Ihre Liebsten genießen die Gemeinsamkeit im Urlaub inmitten von Bergen und Schnee und haben Gelegenheit, die schönsten Erlebnisse des Tages miteinander zu teilen. Die Skigebiete im Überblick

Oberstdorf-Kleinwalsertal, Bayern
Garmisch-Partenkirchen, Bayern
Reit im Winkl, Bayern
Bayerischer Wald
Berchtesgadener, Bayern
Winterberg, Nordrhein-Westfalen
Willingen im Sauerland, Hessen
Feldberg, Baden-Württemberg
Fichtelberg, Sachsen
Wurmberg, Niedersachen

Identify all occurrences of Phage HH1 MT880870 in S. epidermidis ST2 genomes from public and clinical isolates

Leave a reply

install the bacto environment

mamba create -n bacto -c conda-forge -c bioconda -c defaults python=3.7
mamba activate bacto
(bacto) mamba install -c conda-forge -c bioconda -c defaults trimmomatic fastqc shovill
(bacto) mamba install -c conda-forge -c bioconda -c defaults mlst snippy fasttree
(bacto) mamba install biopython
mamba install -c conda-forge perl=5.22
mamba install -c conda-forge -c bioconda roary

#------ NEED to be debugged for install roary(ERROR) --> install roary in the host env (SUCCESSFUL) ------
#ERROR (bacto) mamba install -c conda-forge -c bioconda -c defaults roary
#ERROR (bacto) cpanm Bio::Roary  #DEBUG NOT WORKING ! Installing Bio::Roary failed. See /home/jhuang/.cpanm/work/1736859390.2714327/build.log for details. Retry with --force to force install it.
#ERROR (bacto) export PERL5LIB=$CONDA_PREFIX/lib/perl5/site_perl/5.22.0/  #DEBUG SUCCESSFUL, repaired also snippy and snippy-core
sudo apt install roary

(bacto) mamba install -c conda-forge -c bioconda -c defaults raxml-ng gubbins snp-sites
(bacto) mamba install -c bioconda iqtree
#(bacto) mamba install -c conda-forge -c bioconda -c defaults perl-bioperl
#(bacto) cpanm Bio::SeqIO
#(bacto) cpanm Bio::Perl
#(bacto) mamba install -c conda-forge -c bioconda -c defaults biopython
#(bacto) mamba install -c conda-forge -c bioconda -c defaults snakemak

alignment of complete genomes

Alignment of Complete Genomes vs Alignment of Core Genes

Alignment of Complete Genomes:
    Useful when all genomes are complete and high-quality.
    Not ideal if many genomes are incomplete or fragmented, as this can introduce noise.

Alignment of Core Genes:
    Preferred when working with a mix of complete and draft genomes.
    Focuses on conserved regions, minimizing issues caused by missing or fragmented data.

How to Ensure Core Gene Alignment?

If your goal is to create a core gene alignment:

    Use a tool like Roary or Panaroo after genome annotation (e.g., with Prokka).
    Extract the core_gene_alignment.aln file for alignment, which represents the concatenated core genes shared across all genomes.

Prepare all gff-files

        bpseudomallei taylorella schromogenes bcereus orhinotracheale aactinomycetemcomitans vcholerae_2 streptomyces spyogenes paeruginosa sgallolyticus bhenselae ecoli senterica_achtman_2 brachyspira leptospira_2 streptothermophilus pdamselae campylobacter_nonjejuni_3 ypseudotuberculosis_achtman_3 abaumannii_2 staphlugdunensis mhominis_3 chlamydiales cdifficile yruckeri bsubtilis sbsec koxytoca kingella tenacibaculum pgingivalis miowae neisseria bordetella_3 cbotulinum edwardsiella ureaplasma pacnes_3 mhyorhinis gallibacterium bbacilliformis sdysgalactiae shominis mcanis borrelia smaltophilia brachyspira_2 campylobacter_nonjejuni_8 psalmonis suberis achromobacter    saureus    mhaemolytica efaecalis cronobacter rhodococcus scanis mplutonius shewanella campylobacter_nonjejuni_7 campylobacter hparasuis listeria_2 plarvae mabscessus hcinaedi csepticum    sepidermidis    mhyopneumoniae mbovis_2 wolbachia vcholerae helicobacter cperfringens bcc ranatipestifer szooepidemicus msynoviae campylobacter_nonjejuni_9 pmultocida_2 arcobacter fpsychrophilum abaumannii bwashoensis mcaseolyticus liberibacter aeromonas cmaltaromaticum vvulnificus mpneumoniae klebsiella brachyspira_5 efaecium pfluorescens hsuis ssuis dnodosus sinorhizobium msciuri spneumoniae vibrio campylobacter_nonjejuni geotrichum lsalivarius tpallidum magalactiae shaemolyticus sagalactiae ecoli_achtman_4 sthermophilus bfragilis kaerogenes blicheniformis_14 vparahaemolyticus mflocculare mgallisepticum vtapetis ecloacae ppentosaceus manserisalpingitidis spseudintermedius brachyspira_4 brachyspira_3 leptospira_3 mcatarrhalis_achtman_6 oralstrep campylobacter_nonjejuni_6 pmultocida aphagocytophilum campylobacter_nonjejuni_2 otsutsugamushi diphtheria_3 hinfluenzae pputida brucella cfreundii mycobacteria_2 mgallisepticum_2 llactis_phage xfastidiosa campylobacter_nonjejuni_4 campylobacter_nonjejuni_5 leptospira

        python ~/Scripts/run_mlst.py complete_genome_1282_ncbi.fasta sepidermidis  #73
        python ~/Scripts/run_mlst.py complete_genome_1282_ena.fasta sepidermidis   #73
        python ~/Scripts/run_mlst.py complete_genome_1282_ena_taxon_descendants.fasta sepidermidis
        grep -P "\tsepidermidis\t2\t" all.mlst > ST2.mlst  #73
        grep -v "NZ_" ST2.mlst > ST2_.mlst
        cut -d'.' -f1 ST2_.mlst | sort > ST2.ids
        cut -d'|' -f2 ST2.mlst | sort > ST2.ids
        diff ST2.ids ../ena_descendants_all_ST2_mlst/ST2.ids

        for sample in AP029227 CP013943 CP040883 CP045648 CP052939 CP052940 CP052955 CP052959 CP052981 CP052984 CP052990 CP052994 CP053088 CP064461 CP064467 CP064473 CP064476 CP064480 CP064485 CP064499 CP064508 CP064516 CP064525 CP064554 CP064557 CP064560 CP064572 CP064574 CP064587 CP064590 CP064599 CP064603 CP064606 CP064609 CP064613 CP064616 CP064619 CP064624 CP064631 CP064637 CP064650 CP073821 CP073824 CP073827 CP073830 CP073835 CP073836 CP073840 CP073841 CP073844 CP073847 CP073850 CP073852 CP073855 CP073857 CP073859 CP073862 CP073863 CP073878 CP073900 CP073904 CP084008 CP093173 CP093179 CP093181 CP093193 CP093196 CP093198 CP093212 CP097512 CP097514 CP120425 CP133663; do
            cp ../genomes_by_taxid/ncbi_all_ST2_mlst/${sample}.1.fasta ./
        done

        cd typing_complete.csv
        grep -P "\tsepidermidis\t2\t" typing_complete.csv > typing_ST2.csv

        echo -e "AP029227\nCP013943\nCP040883\nCP045648\nCP052939\nCP052940\nCP052955\nCP052959\nCP052981\nCP052984\nCP052990\nCP052994\nCP053088\nCP064461\nCP064467\nCP064473\nCP064476\nCP064480\nCP064485\nCP064499\nCP064508\nCP064516\nCP064525\nCP064554\nCP064557\nCP064560\nCP064572\nCP064574\nCP064587\nCP064590\nCP064599\nCP064603\nCP064606\nCP064609\nCP064613\nCP064616\nCP064619\nCP064624\nCP064631\nCP064637\nCP064650\nCP073821\nCP073824\nCP073827\nCP073830\nCP073835\nCP073836\nCP073840\nCP073841\nCP073844\nCP073847\nCP073850\nCP073852\nCP073855\nCP073857\nCP073859\nCP073862\nCP073863\nCP073878\nCP073900\nCP073904\nCP084008\nCP093173\nCP093179\nCP093181\nCP093193\nCP093196\nCP093198\nCP093212\nCP097512\nCP097514\nCP120425\nCP133663" > ids.txt

        cat ids.txt | while read id; do
            efetch -db nuccore -id $id -format gff3 > "${id}.gff"
            efetch -db nuccore -id $id -format fasta > "${id}.fasta"
            # Append the FASTA sequence to the GFF3 file with ##FASTA header
            echo "##FASTA" >> "${id}.gff"
            cat "${id}.fasta" >> "${id}.gff"
            rm "${id}.fasta"  # Optionally remove the separate FASTA file
        done

        for sample in mibi1440 mibi1454 mibi1456 mibi1459 mibi1461 mibi1478 mibi1480 mibi1485 mibi1487 mibi1496 mibi1498 mibi1502 mibi2318 mibi2319; do
            cp ../prokka/${sample}/${sample}.gff ./
        done

Example with Roary:

        roary -f roary_ST2 -e --mafft -p 100 roary_input_ST2/*.gff

        #roary -p 15 -f ./roary -i 95 -cd 99 -s -e -n -v roary_.fa.gb.gff CP012351.gff3 CP003084.gff3 CP023676.gff3
        -i        minimum percentage identity for blastp [95]
        -cd FLOAT percentage of isolates a gene must be in to be core [99]
        -f STR    output directory [.]
        -e        create a multiFASTA alignment of core genes using PRANK
        -n (==--mafft)       fast core gene alignment with MAFFT, use with -e

        -s        dont split paralogs
        -v        verbose output to STDOUT

The file core_gene_alignment.aln can then be used for tree construction:

        iqtree2 -s core_gene_alignment.aln -m MFP -bb 1000 -nt AUTO
        #Alignment was printed to core_gene_alignment.aln.uniqueseq.phy
        #For your convenience alignment with unique sequences printed to core_gene_alignment.aln.uniqueseq.phy
        #Create initial parsimony tree by phylogenetic likelihood library (PLL)... 0.030 seconds
        #Measuring multi-threading efficiency up to 128 CPU cores
        #Increase to 10 rounds for branch lengths

By focusing on core genes, you ensure that all genomes contribute meaningful data to the phylogenetic analysis.

The file that can be viewed as the phylogenetic tree among the generated files is:

    core_gene_alignment.aln.treefile

This file typically contains the tree in a Newick format, which is a widely used format for representing phylogenetic trees.

Public Database and Clinical Isolates

To differentiate the two groups of genomes effectively while maintaining clarity and professionalism, you can use category names that reflect their source or origin. Here are some suggestions:
Category Name Suggestions

    Source-based Names:
        Public Database and Clinical Isolates
        Public Genomes and Clinical Genomes
        Reference Genomes and Patient-derived Genomes

    Origin-based Names:
        External Sources and In-house Sequencing
        Global Genomes and Local Isolates
        Database Genomes and Clinical Strains

    Research Context Names:
        Published Genomes and Newly Sequenced Genomes
        Literature Genomes and Clinical Samples
        Archived Data and Current Study

    Abbreviated or Code-like Names:
        PUB and CLI (Public and Clinical)
        EXDB (External Database) and INHS (In-House Sequencing)
        PG (Public Genomes) and CI (Clinical Isolates)

Tips for Choosing the Best Category Names

    Clarity: Ensure that the names are intuitive and easy for others to understand.
    Neutrality: Avoid names that might imply bias or inferiority (e.g., avoid "old" vs. "new").
    Context Appropriateness: Consider your audience. For a clinical study, terms like "Clinical Isolates" may resonate more, while "Public Database" works better for bioinformatics research.

For instance:

    "Database" and "Clinical" could work well in a manuscript.
    "External" and "In-house" might be better for internal project tracking.

copy contigs of own clinical isolates

#extract all genes from the phages set as the column names, set the genome names from the public database and own isolates as row names.

for sample in mibi1440 mibi1454 mibi1456 mibi1459 mibi1461 mibi1478 mibi1480 mibi1485 mibi1487 mibi1496 mibi1498 mibi1502 mibi2318 mibi2319; do
  cp ../shovill/${sample}/contigs.fa ${sample}.fasta
done

makeblastdb

#MT880870_on_mibi1435.blastn: 2366
#MT880870_on_mibi1436.blastn: 3198
#MT880870_on_mibi1437.blastn: 3198

# -lenth(MT880870)=34053
rename 's/\.1\.fasta$/.fasta/' *.1.fasta
for sample in mibi1440 mibi1454 mibi1456 mibi1459 mibi1461 mibi1478 mibi1480 mibi1485 mibi1487 mibi1496 mibi1498 mibi1502 mibi2318 mibi2319 AP029227 CP013943 CP040883 CP045648 CP052939 CP052940 CP052955 CP052959 CP052981 CP052984 CP052990 CP052994 CP053088 CP064461 CP064467 CP064473 CP064476 CP064480 CP064485 CP064499 CP064508 CP064516 CP064525 CP064554 CP064557 CP064560 CP064572 CP064574 CP064587 CP064590 CP064599 CP064603 CP064606 CP064609 CP064613 CP064616 CP064619 CP064624 CP064631 CP064637 CP064650 CP073821 CP073824 CP073827 CP073830 CP073835 CP073836 CP073840 CP073841 CP073844 CP073847 CP073850 CP073852 CP073855 CP073857 CP073859 CP073862 CP073863 CP073878 CP073900 CP073904 CP084008 CP093173 CP093179 CP093181 CP093193 CP093196 CP093198 CP093212 CP097512 CP097514 CP120425 CP133663; do
    makeblastdb -in ${sample}.fasta -dbtype nucl
done

prepare typing_ST2_until_QPB07622.csv

samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07572.1_1" > QPB07572.fasta
mkdir QPB07572
for sample in mibi1440 mibi1454 mibi1456 mibi1459 mibi1461 mibi1478 mibi1480 mibi1485 mibi1487 mibi1496 mibi1498 mibi1502 mibi2318 mibi2319 AP029227 CP013943 CP040883 CP045648 CP052939 CP052940 CP052955 CP052959 CP052981 CP052984 CP052990 CP052994 CP053088 CP064461 CP064467 CP064473 CP064476 CP064480 CP064485 CP064499 CP064508 CP064516 CP064525 CP064554 CP064557 CP064560 CP064572 CP064574 CP064587 CP064590 CP064599 CP064603 CP064606 CP064609 CP064613 CP064616 CP064619 CP064624 CP064631 CP064637 CP064650 CP073821 CP073824 CP073827 CP073830 CP073835 CP073836 CP073840 CP073841 CP073844 CP073847 CP073850 CP073852 CP073855 CP073857 CP073859 CP073862 CP073863 CP073878 CP073900 CP073904 CP084008 CP093173 CP093179 CP093181 CP093193 CP093196 CP093198 CP093212 CP097512 CP097514 CP120425 CP133663; do
    blastn -db ${sample}.fasta -query QPB07572.fasta -evalue 1e-50 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 > ./QPB07572/${sample}.blastn
done
python ~/Scripts/process_directory.py /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/presence_absence_ST2/QPB07572 /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/presence_absence_ST2/typing_ST2.csv

samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07573.1_2" > QPB07573.fasta
mkdir QPB07573
for sample in mibi1440 mibi1454 mibi1456 mibi1459 mibi1461 mibi1478 mibi1480 mibi1485 mibi1487 mibi1496 mibi1498 mibi1502 mibi2318 mibi2319 AP029227 CP013943 CP040883 CP045648 CP052939 CP052940 CP052955 CP052959 CP052981 CP052984 CP052990 CP052994 CP053088 CP064461 CP064467 CP064473 CP064476 CP064480 CP064485 CP064499 CP064508 CP064516 CP064525 CP064554 CP064557 CP064560 CP064572 CP064574 CP064587 CP064590 CP064599 CP064603 CP064606 CP064609 CP064613 CP064616 CP064619 CP064624 CP064631 CP064637 CP064650 CP073821 CP073824 CP073827 CP073830 CP073835 CP073836 CP073840 CP073841 CP073844 CP073847 CP073850 CP073852 CP073855 CP073857 CP073859 CP073862 CP073863 CP073878 CP073900 CP073904 CP084008 CP093173 CP093179 CP093181 CP093193 CP093196 CP093198 CP093212 CP097512 CP097514 CP120425 CP133663; do
    blastn -db ${sample}.fasta -query QPB07573.fasta -evalue 1e-50 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 > ./QPB07573/${sample}.blastn
done
python ~/Scripts/process_directory.py /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/presence_absence_ST2/QPB07573 /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/presence_absence_ST2/updated_typing_ST2.csv

samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07574.1_3" > QPB07574.fasta
mkdir QPB07574
for sample in mibi1440 mibi1454 mibi1456 mibi1459 mibi1461 mibi1478 mibi1480 mibi1485 mibi1487 mibi1496 mibi1498 mibi1502 mibi2318 mibi2319 AP029227 CP013943 CP040883 CP045648 CP052939 CP052940 CP052955 CP052959 CP052981 CP052984 CP052990 CP052994 CP053088 CP064461 CP064467 CP064473 CP064476 CP064480 CP064485 CP064499 CP064508 CP064516 CP064525 CP064554 CP064557 CP064560 CP064572 CP064574 CP064587 CP064590 CP064599 CP064603 CP064606 CP064609 CP064613 CP064616 CP064619 CP064624 CP064631 CP064637 CP064650 CP073821 CP073824 CP073827 CP073830 CP073835 CP073836 CP073840 CP073841 CP073844 CP073847 CP073850 CP073852 CP073855 CP073857 CP073859 CP073862 CP073863 CP073878 CP073900 CP073904 CP084008 CP093173 CP093179 CP093181 CP093193 CP093196 CP093198 CP093212 CP097512 CP097514 CP120425 CP133663; do
    blastn -db ${sample}.fasta -query QPB07574.fasta -evalue 1e-50 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 > ./QPB07574/${sample}.blastn
done
python ~/Scripts/process_directory.py /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/presence_absence_ST2/QPB07574 /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/presence_absence_ST2/updated_updated_typing_ST2.csv

samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07575.1_4" > QPB07575.fasta
mkdir QPB07575
for sample in mibi1440 mibi1454 mibi1456 mibi1459 mibi1461 mibi1478 mibi1480 mibi1485 mibi1487 mibi1496 mibi1498 mibi1502 mibi2318 mibi2319 AP029227 CP013943 CP040883 CP045648 CP052939 CP052940 CP052955 CP052959 CP052981 CP052984 CP052990 CP052994 CP053088 CP064461 CP064467 CP064473 CP064476 CP064480 CP064485 CP064499 CP064508 CP064516 CP064525 CP064554 CP064557 CP064560 CP064572 CP064574 CP064587 CP064590 CP064599 CP064603 CP064606 CP064609 CP064613 CP064616 CP064619 CP064624 CP064631 CP064637 CP064650 CP073821 CP073824 CP073827 CP073830 CP073835 CP073836 CP073840 CP073841 CP073844 CP073847 CP073850 CP073852 CP073855 CP073857 CP073859 CP073862 CP073863 CP073878 CP073900 CP073904 CP084008 CP093173 CP093179 CP093181 CP093193 CP093196 CP093198 CP093212 CP097512 CP097514 CP120425 CP133663; do
    blastn -db ${sample}.fasta -query QPB07575.fasta -evalue 1e-50 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 > ./QPB07575/${sample}.blastn
done
python ~/Scripts/process_directory.py /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/presence_absence_ST2/QPB07575 /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/presence_absence_ST2/updated_updated_updated_typing_ST2.csv

samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07576.1_5" > QPB07576.fasta
mkdir QPB07576
for sample in mibi1440 mibi1454 mibi1456 mibi1459 mibi1461 mibi1478 mibi1480 mibi1485 mibi1487 mibi1496 mibi1498 mibi1502 mibi2318 mibi2319 AP029227 CP013943 CP040883 CP045648 CP052939 CP052940 CP052955 CP052959 CP052981 CP052984 CP052990 CP052994 CP053088 CP064461 CP064467 CP064473 CP064476 CP064480 CP064485 CP064499 CP064508 CP064516 CP064525 CP064554 CP064557 CP064560 CP064572 CP064574 CP064587 CP064590 CP064599 CP064603 CP064606 CP064609 CP064613 CP064616 CP064619 CP064624 CP064631 CP064637 CP064650 CP073821 CP073824 CP073827 CP073830 CP073835 CP073836 CP073840 CP073841 CP073844 CP073847 CP073850 CP073852 CP073855 CP073857 CP073859 CP073862 CP073863 CP073878 CP073900 CP073904 CP084008 CP093173 CP093179 CP093181 CP093193 CP093196 CP093198 CP093212 CP097512 CP097514 CP120425 CP133663; do
    blastn -db ${sample}.fasta -query QPB07576.fasta -evalue 1e-50 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 > ./QPB07576/${sample}.blastn
done
python ~/Scripts/process_directory.py /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/presence_absence_ST2/QPB07576 /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/presence_absence_ST2/updated_updated_updated_updated_typing_ST2.csv

samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07577.1_6" > QPB07577.fasta
mkdir QPB07577
for sample in mibi1440 mibi1454 mibi1456 mibi1459 mibi1461 mibi1478 mibi1480 mibi1485 mibi1487 mibi1496 mibi1498 mibi1502 mibi2318 mibi2319 AP029227 CP013943 CP040883 CP045648 CP052939 CP052940 CP052955 CP052959 CP052981 CP052984 CP052990 CP052994 CP053088 CP064461 CP064467 CP064473 CP064476 CP064480 CP064485 CP064499 CP064508 CP064516 CP064525 CP064554 CP064557 CP064560 CP064572 CP064574 CP064587 CP064590 CP064599 CP064603 CP064606 CP064609 CP064613 CP064616 CP064619 CP064624 CP064631 CP064637 CP064650 CP073821 CP073824 CP073827 CP073830 CP073835 CP073836 CP073840 CP073841 CP073844 CP073847 CP073850 CP073852 CP073855 CP073857 CP073859 CP073862 CP073863 CP073878 CP073900 CP073904 CP084008 CP093173 CP093179 CP093181 CP093193 CP093196 CP093198 CP093212 CP097512 CP097514 CP120425 CP133663; do
    blastn -db ${sample}.fasta -query QPB07577.fasta -evalue 1e-50 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 > ./QPB07577/${sample}.blastn
done
python ~/Scripts/process_directory.py /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/presence_absence_ST2/QPB07577 /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/presence_absence_ST2/updated_updated_updated_updated_updated_typing_ST2.csv

samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07578.1_7" > QPB07578.fasta
mkdir QPB07578
for sample in mibi1440 mibi1454 mibi1456 mibi1459 mibi1461 mibi1478 mibi1480 mibi1485 mibi1487 mibi1496 mibi1498 mibi1502 mibi2318 mibi2319 AP029227 CP013943 CP040883 CP045648 CP052939 CP052940 CP052955 CP052959 CP052981 CP052984 CP052990 CP052994 CP053088 CP064461 CP064467 CP064473 CP064476 CP064480 CP064485 CP064499 CP064508 CP064516 CP064525 CP064554 CP064557 CP064560 CP064572 CP064574 CP064587 CP064590 CP064599 CP064603 CP064606 CP064609 CP064613 CP064616 CP064619 CP064624 CP064631 CP064637 CP064650 CP073821 CP073824 CP073827 CP073830 CP073835 CP073836 CP073840 CP073841 CP073844 CP073847 CP073850 CP073852 CP073855 CP073857 CP073859 CP073862 CP073863 CP073878 CP073900 CP073904 CP084008 CP093173 CP093179 CP093181 CP093193 CP093196 CP093198 CP093212 CP097512 CP097514 CP120425 CP133663; do
    blastn -db ${sample}.fasta -query QPB07578.fasta -evalue 1e-50 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 > ./QPB07578/${sample}.blastn
done
python ~/Scripts/process_directory.py /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/presence_absence_ST2/QPB07578 /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/presence_absence_ST2/updated_updated_updated_updated_updated_updated_typing_ST2.csv

samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07579.1_8" > QPB07579.fasta
mkdir QPB07579
for sample in mibi1440 mibi1454 mibi1456 mibi1459 mibi1461 mibi1478 mibi1480 mibi1485 mibi1487 mibi1496 mibi1498 mibi1502 mibi2318 mibi2319 AP029227 CP013943 CP040883 CP045648 CP052939 CP052940 CP052955 CP052959 CP052981 CP052984 CP052990 CP052994 CP053088 CP064461 CP064467 CP064473 CP064476 CP064480 CP064485 CP064499 CP064508 CP064516 CP064525 CP064554 CP064557 CP064560 CP064572 CP064574 CP064587 CP064590 CP064599 CP064603 CP064606 CP064609 CP064613 CP064616 CP064619 CP064624 CP064631 CP064637 CP064650 CP073821 CP073824 CP073827 CP073830 CP073835 CP073836 CP073840 CP073841 CP073844 CP073847 CP073850 CP073852 CP073855 CP073857 CP073859 CP073862 CP073863 CP073878 CP073900 CP073904 CP084008 CP093173 CP093179 CP093181 CP093193 CP093196 CP093198 CP093212 CP097512 CP097514 CP120425 CP133663; do
    blastn -db ${sample}.fasta -query QPB07579.fasta -evalue 1e-50 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 > ./QPB07579/${sample}.blastn
done
python ~/Scripts/process_directory.py /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/presence_absence_ST2/QPB07579 /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/presence_absence_ST2/updated_updated_updated_updated_updated_updated_updated_typing_ST2.csv

grep ">" MT880870.fasta > temp
cut -d' ' -f1 temp > temp1

#    region="lcl|MT880870.1_cds_${id}.1_$((i - 78))"  # Adjust index calculation for region
#    echo "samtools faidx MT880870.fasta \"$region\" > ${id}.fasta"
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07580.1_9" > QPB07580.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07581.1_10" > QPB07581.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07582.1_11" > QPB07582.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07583.1_12" > QPB07583.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07584.1_13" > QPB07584.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07585.1_14" > QPB07585.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07586.1_15" > QPB07586.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07587.1_16" > QPB07587.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07588.1_17" > QPB07588.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07589.1_18" > QPB07589.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07590.1_19" > QPB07590.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07591.1_20" > QPB07591.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07592.1_21" > QPB07592.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07593.1_22" > QPB07593.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07594.1_23" > QPB07594.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07595.1_24" > QPB07595.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07596.1_25" > QPB07596.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07597.1_26" > QPB07597.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07598.1_27" > QPB07598.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07599.1_28" > QPB07599.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07600.1_29" > QPB07600.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07601.1_30" > QPB07601.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07602.1_31" > QPB07602.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07603.1_32" > QPB07603.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07604.1_33" > QPB07604.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07605.1_34" > QPB07605.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07606.1_35" > QPB07606.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07607.1_36" > QPB07607.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07608.1_37" > QPB07608.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07609.1_38" > QPB07609.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07610.1_39" > QPB07610.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07611.1_40" > QPB07611.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07612.1_41" > QPB07612.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07613.1_42" > QPB07613.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07614.1_43" > QPB07614.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07615.1_44" > QPB07615.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07616.1_45" > QPB07616.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07617.1_46" > QPB07617.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07618.1_47" > QPB07618.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07619.1_48" > QPB07619.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07620.1_49" > QPB07620.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07621.1_50" > QPB07621.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07622.1_51" > QPB07622.fasta

for i in {580..580}; do
    id=$(printf "QPB07%03d" "$i")
    echo "mkdir ${id}"
    echo "for sample in mibi1440 mibi1454 mibi1456 mibi1459 mibi1461 mibi1478 mibi1480 mibi1485 mibi1487 mibi1496 mibi1498 mibi1502 mibi2318 mibi2319 AP029227 CP013943 CP040883 CP045648 CP052939 CP052940 CP052955 CP052959 CP052981 CP052984 CP052990 CP052994 CP053088 CP064461 CP064467 CP064473 CP064476 CP064480 CP064485 CP064499 CP064508 CP064516 CP064525 CP064554 CP064557 CP064560 CP064572 CP064574 CP064587 CP064590 CP064599 CP064603 CP064606 CP064609 CP064613 CP064616 CP064619 CP064624 CP064631 CP064637 CP064650 CP073821 CP073824 CP073827 CP073830 CP073835 CP073836 CP073840 CP073841 CP073844 CP073847 CP073850 CP073852 CP073855 CP073857 CP073859 CP073862 CP073863 CP073878 CP073900 CP073904 CP084008 CP093173 CP093179 CP093181 CP093193 CP093196 CP093198 CP093212 CP097512 CP097514 CP120425 CP133663; do"
        echo "blastn -db \${sample}.fasta -query ${id}.fasta -evalue 1e-50 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 > ./${id}/\${sample}.blastn"
    echo "done"
done
python ~/Scripts/process_directory.py /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/presence_absence_ST2/QPB07580 /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/presence_absence_ST2/updated_updated_updated_updated_updated_updated_updated_updated_typing_ST2.csv /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/presence_absence_ST2/typing_ST2_until_QPB07580.csv

for i in {581..622}; do
    id_1=$(printf "QPB07%03d" "$((i-1))")
    id=$(printf "QPB07%03d" "$i")
    echo "mkdir ${id}"
    echo "for sample in mibi1440 mibi1454 mibi1456 mibi1459 mibi1461 mibi1478 mibi1480 mibi1485 mibi1487 mibi1496 mibi1498 mibi1502 mibi2318 mibi2319 AP029227 CP013943 CP040883 CP045648 CP052939 CP052940 CP052955 CP052959 CP052981 CP052984 CP052990 CP052994 CP053088 CP064461 CP064467 CP064473 CP064476 CP064480 CP064485 CP064499 CP064508 CP064516 CP064525 CP064554 CP064557 CP064560 CP064572 CP064574 CP064587 CP064590 CP064599 CP064603 CP064606 CP064609 CP064613 CP064616 CP064619 CP064624 CP064631 CP064637 CP064650 CP073821 CP073824 CP073827 CP073830 CP073835 CP073836 CP073840 CP073841 CP073844 CP073847 CP073850 CP073852 CP073855 CP073857 CP073859 CP073862 CP073863 CP073878 CP073900 CP073904 CP084008 CP093173 CP093179 CP093181 CP093193 CP093196 CP093198 CP093212 CP097512 CP097514 CP120425 CP133663; do"
        echo "blastn -db \${sample}.fasta -query ${id}.fasta -evalue 1e-50 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 > ./${id}/\${sample}.blastn"
    echo "done"
    echo "python ~/Scripts/process_directory.py /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/presence_absence_ST2/${id} /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/presence_absence_ST2/typing_ST2_until_${id_1}.csv /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/presence_absence_ST2/typing_ST2_until_${id}.csv"
done

cp typing_ST2_until_QPB07622.csv ../plotTreeHeatmap_ST2/typing_ST2.csv

plot tree heatmap

library(ggtree)
library(ggplot2)
library(dplyr)
setwd("/home/jhuang/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_ST2/")
# -- edit tree --
#https://icytree.org/
#0.000780
info <- read.csv("typing_ST2.csv", sep="\t")
info$name <- info$Isolate
tree <- read.tree("core_gene_alignment.aln.treefile")
cols <- c("Public database"='purple2', "Clinical isolate"='darksalmon')  #purple2  skyblue2

heatmapData2 <- info %>% select(Isolate, QPB07572, QPB07573, QPB07574, QPB07575, QPB07576, QPB07577, QPB07578, QPB07579, QPB07580, QPB07581, QPB07582, QPB07583, QPB07584, QPB07585, QPB07586, QPB07587, QPB07588, QPB07589, QPB07590, QPB07591, QPB07592, QPB07593, QPB07594, QPB07595, QPB07596, QPB07597, QPB07598, QPB07599, QPB07600, QPB07601, QPB07602, QPB07603, QPB07604, QPB07605, QPB07606, QPB07607, QPB07608, QPB07609, QPB07610, QPB07611, QPB07612, QPB07613, QPB07614, QPB07615, QPB07616, QPB07617, QPB07618, QPB07619, QPB07620, QPB07621, QPB07622)  #ST,
rn <- heatmapData2$Isolate
heatmapData2$Isolate <- NULL
heatmapData2 <- as.data.frame(sapply(heatmapData2, as.character))
rownames(heatmapData2) <- rn

heatmap.colours <- c("darkgreen","grey")
names(heatmap.colours) <- c("+","-")

#mydat$Regulation <- factor(mydat$Regulation, levels=c("up","down"))
#circular
p <- ggtree(tree, layout='circular', branch.length='none') %<+% info + geom_tippoint(aes(color=Source)) + scale_color_manual(values=cols) + geom_tiplab2(aes(label=name), offset=1)

png("ggtree.png", width=1260, height=1260)
#svg("ggtree.svg", width=1260, height=1260)
p
dev.off()

#png("ggtree_and_gheatmap.png", width=1290, height=1000)
#svg("ggtree_and_gheatmap.svg", width=17, height=15)
#gheatmap(p, heatmapData2, width=0.1, colnames_position="top", colnames_angle=90, colnames_offset_y = 0.1, hjust=0.5, font.size=4, offset = 5) + scale_fill_manual(values=heatmap.colours) +  theme(legend.text = element_text(size = 14)) + theme(legend.title = element_text(size = 14)) + guides(fill=guide_legend(title=""), color = guide_legend(override.aes = list(size = 5)))

png("ggtree_and_gheatmap.png", width=1290, height=1000)
gheatmap(p, heatmapData2, width=0.5, colnames_position="top", colnames_angle=90, colnames_offset_y = 0.1, hjust=0.5, font.size=0, offset = 8) + scale_fill_manual(values=heatmap.colours) +  theme(legend.text = element_text(size = 16)) + theme(legend.title = element_text(size = 16)) + guides(fill=guide_legend(title=""), color = guide_legend(override.aes = list(size = 5)))
dev.off()

Identify all occurrences of Phages MT880870, MT880871 and MT880872 in S. epidermidis ST2 genomes from public and clinical isolates

Leave a reply

prepare typing_ST2_until_QPB07622.csv under ~/DATA/Data_Luise_Sepi_STKN/presence_absence_ST2

#grep ">" MT880870.fasta | cut -d' ' -f1 > headers.txt
# -- under presence_absence_ST2 --
samtools faidx MT880870.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07572.1_1" > QPB07572.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07573.1_2" > QPB07573.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07574.1_3" > QPB07574.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07575.1_4" > QPB07575.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07576.1_5" > QPB07576.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07577.1_6" > QPB07577.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07578.1_7" > QPB07578.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07579.1_8" > QPB07579.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07580.1_9" > QPB07580.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07581.1_10" > QPB07581.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07582.1_11" > QPB07582.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07583.1_12" > QPB07583.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07584.1_13" > QPB07584.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07585.1_14" > QPB07585.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07586.1_15" > QPB07586.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07587.1_16" > QPB07587.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07588.1_17" > QPB07588.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07589.1_18" > QPB07589.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07590.1_19" > QPB07590.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07591.1_20" > QPB07591.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07592.1_21" > QPB07592.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07593.1_22" > QPB07593.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07594.1_23" > QPB07594.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07595.1_24" > QPB07595.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07596.1_25" > QPB07596.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07597.1_26" > QPB07597.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07598.1_27" > QPB07598.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07599.1_28" > QPB07599.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07600.1_29" > QPB07600.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07601.1_30" > QPB07601.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07602.1_31" > QPB07602.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07603.1_32" > QPB07603.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07604.1_33" > QPB07604.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07605.1_34" > QPB07605.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07606.1_35" > QPB07606.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07607.1_36" > QPB07607.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07608.1_37" > QPB07608.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07609.1_38" > QPB07609.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07610.1_39" > QPB07610.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07611.1_40" > QPB07611.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07612.1_41" > QPB07612.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07613.1_42" > QPB07613.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07614.1_43" > QPB07614.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07615.1_44" > QPB07615.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07616.1_45" > QPB07616.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07617.1_46" > QPB07617.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07618.1_47" > QPB07618.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07619.1_48" > QPB07619.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07620.1_49" > QPB07620.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07621.1_50" > QPB07621.fasta
samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07622.1_51" > QPB07622.fasta

for i in {572..622}; do
    id_1=$(printf "QPB07%03d" "$((i-1))")
    id=$(printf "QPB07%03d" "$i")
    echo "mkdir ${id}"
    echo "for sample in mibi1440 mibi1454 mibi1456 mibi1459 mibi1461 mibi1478 mibi1480 mibi1485 mibi1487 mibi1496 mibi1498 mibi1502 mibi2318 mibi2319 AP029227 CP013943 CP040883 CP045648 CP052939 CP052940 CP052955 CP052959 CP052981 CP052984 CP052990 CP052994 CP053088 CP064461 CP064467 CP064473 CP064476 CP064480 CP064485 CP064499 CP064508 CP064516 CP064525 CP064554 CP064557 CP064560 CP064572 CP064574 CP064587 CP064590 CP064599 CP064603 CP064606 CP064609 CP064613 CP064616 CP064619 CP064624 CP064631 CP064637 CP064650 CP073821 CP073824 CP073827 CP073830 CP073835 CP073836 CP073840 CP073841 CP073844 CP073847 CP073850 CP073852 CP073855 CP073857 CP073859 CP073862 CP073863 CP073878 CP073900 CP073904 CP084008 CP093173 CP093179 CP093181 CP093193 CP093196 CP093198 CP093212 CP097512 CP097514 CP120425 CP133663; do"
        echo "blastn -db \${sample}.fasta -query ${id}.fasta -evalue 1e-50 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 > ./${id}/\${sample}.blastn"
    echo "done"

#under plotTreeHeatmap_ST2
for i in {572..622}; do
    echo "python ~/Scripts/process_directory.py /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/presence_absence_ST2/${id} /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/plotTreeHeatmap_ST2/typing_ST2_until_${id_1}.csv /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/plotTreeHeatmap_ST2/typing_ST2_until_${id}.csv"
done

#reprace all '+' with 'MT880870' in typing_ST2_until_QPB07622.csv
cp ../presence_absence_ST2/typing_ST2_until_QPB07622.csv ./
sed -i 's/+/MT880870/g' typing_ST2_until_QPB07622.csv

prepare typing_ST2_until_QPB07667.csv under ~/DATA/Data_Luise_Sepi_STKN/presence_absence_ST2

# -- under presence_absence_ST2 --
samtools faidx MT880871.fasta
#grep ">" MT880871.fasta | cut -d' ' -f1 > headers.txt
samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07623.1_1" > QPB07623.fasta
samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07624.1_2" > QPB07624.fasta
samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07625.1_3" > QPB07625.fasta
samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07626.1_4" > QPB07626.fasta
samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07627.1_5" > QPB07627.fasta
samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07628.1_6" > QPB07628.fasta
samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07629.1_7" > QPB07629.fasta
samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07630.1_8" > QPB07630.fasta
samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07631.1_9" > QPB07631.fasta
samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07632.1_10" > QPB07632.fasta
samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07633.1_11" > QPB07633.fasta
samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07634.1_12" > QPB07634.fasta
samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07635.1_13" > QPB07635.fasta
samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07636.1_14" > QPB07636.fasta
samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07637.1_15" > QPB07637.fasta
samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07638.1_16" > QPB07638.fasta
samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07639.1_17" > QPB07639.fasta
samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07640.1_18" > QPB07640.fasta
samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07641.1_19" > QPB07641.fasta
samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07642.1_20" > QPB07642.fasta
samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07643.1_21" > QPB07643.fasta
samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07644.1_22" > QPB07644.fasta
samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07645.1_23" > QPB07645.fasta
samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07646.1_24" > QPB07646.fasta
samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07647.1_25" > QPB07647.fasta
samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07648.1_26" > QPB07648.fasta
samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07649.1_27" > QPB07649.fasta
samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07650.1_28" > QPB07650.fasta
samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07651.1_29" > QPB07651.fasta
samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07652.1_30" > QPB07652.fasta
samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07653.1_31" > QPB07653.fasta
samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07654.1_32" > QPB07654.fasta
samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07655.1_33" > QPB07655.fasta
samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07656.1_34" > QPB07656.fasta
samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07657.1_35" > QPB07657.fasta
samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07658.1_36" > QPB07658.fasta
samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07659.1_37" > QPB07659.fasta
samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07660.1_38" > QPB07660.fasta
samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07661.1_39" > QPB07661.fasta
samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07662.1_40" > QPB07662.fasta
samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07663.1_41" > QPB07663.fasta
samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07664.1_42" > QPB07664.fasta
samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07665.1_43" > QPB07665.fasta
samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07666.1_44" > QPB07666.fasta
samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07667.1_45" > QPB07667.fasta

for i in {623..667}; do
    id=$(printf "QPB07%03d" "$i")
    echo "mkdir ${id}"
    echo "for sample in mibi1440 mibi1454 mibi1456 mibi1459 mibi1461 mibi1478 mibi1480 mibi1485 mibi1487 mibi1496 mibi1498 mibi1502 mibi2318 mibi2319 AP029227 CP013943 CP040883 CP045648 CP052939 CP052940 CP052955 CP052959 CP052981 CP052984 CP052990 CP052994 CP053088 CP064461 CP064467 CP064473 CP064476 CP064480 CP064485 CP064499 CP064508 CP064516 CP064525 CP064554 CP064557 CP064560 CP064572 CP064574 CP064587 CP064590 CP064599 CP064603 CP064606 CP064609 CP064613 CP064616 CP064619 CP064624 CP064631 CP064637 CP064650 CP073821 CP073824 CP073827 CP073830 CP073835 CP073836 CP073840 CP073841 CP073844 CP073847 CP073850 CP073852 CP073855 CP073857 CP073859 CP073862 CP073863 CP073878 CP073900 CP073904 CP084008 CP093173 CP093179 CP093181 CP093193 CP093196 CP093198 CP093212 CP097512 CP097514 CP120425 CP133663; do"
        echo "blastn -db \${sample}.fasta -query ${id}.fasta -evalue 1e-50 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 > ./${id}/\${sample}.blastn"
    echo "done"
done

for i in {660..660}; do
    id=$(printf "QPB07%03d" "$i")
    echo "mkdir ${id}"
    echo "for sample in mibi1440 mibi1454 mibi1456 mibi1459 mibi1461 mibi1478 mibi1480 mibi1485 mibi1487 mibi1496 mibi1498 mibi1502 mibi2318 mibi2319 AP029227 CP013943 CP040883 CP045648 CP052939 CP052940 CP052955 CP052959 CP052981 CP052984 CP052990 CP052994 CP053088 CP064461 CP064467 CP064473 CP064476 CP064480 CP064485 CP064499 CP064508 CP064516 CP064525 CP064554 CP064557 CP064560 CP064572 CP064574 CP064587 CP064590 CP064599 CP064603 CP064606 CP064609 CP064613 CP064616 CP064619 CP064624 CP064631 CP064637 CP064650 CP073821 CP073824 CP073827 CP073830 CP073835 CP073836 CP073840 CP073841 CP073844 CP073847 CP073850 CP073852 CP073855 CP073857 CP073859 CP073862 CP073863 CP073878 CP073900 CP073904 CP084008 CP093173 CP093179 CP093181 CP093193 CP093196 CP093198 CP093212 CP097512 CP097514 CP120425 CP133663; do"
        echo "blastn -db \${sample}.fasta -query ${id}.fasta -evalue 1e-10 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 > ./${id}/\${sample}.blastn"
    echo "done"
done

# -- under plotTreeHeatmap_ST2 --
for i in {623..667}; do
    id_1=$(printf "QPB07%03d" "$((i-1))")
    id=$(printf "QPB07%03d" "$i")
    echo "python ~/Scripts/process_directory.py /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/presence_absence_ST2/${id} /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/plotTreeHeatmap_ST2/typing_ST2_until_${id_1}.csv /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/plotTreeHeatmap_ST2/typing_ST2_until_${id}.csv"
done

sed -i 's/+/MT880871/g' typing_ST2_until_QPB07667.csv

prepare typing_ST2_until_QPB07667.csv under ~/DATA/Data_Luise_Sepi_STKN/presence_absence_ST2

# -- under presence_absence_ST2 --
samtools faidx MT880872.fasta
#grep ">" MT880872.fasta | cut -d' ' -f1 > headers.txt

samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07668.1_1" > QPB07668.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07669.1_2" > QPB07669.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07670.1_3" > QPB07670.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07671.1_4" > QPB07671.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07672.1_5" > QPB07672.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07673.1_6" > QPB07673.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07674.1_7" > QPB07674.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07675.1_8" > QPB07675.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07676.1_9" > QPB07676.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07677.1_10" > QPB07677.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07678.1_11" > QPB07678.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07679.1_12" > QPB07679.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07680.1_13" > QPB07680.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07681.1_14" > QPB07681.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07682.1_15" > QPB07682.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07683.1_16" > QPB07683.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07684.1_17" > QPB07684.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07685.1_18" > QPB07685.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07686.1_19" > QPB07686.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07687.1_20" > QPB07687.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07688.1_21" > QPB07688.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07689.1_22" > QPB07689.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07690.1_23" > QPB07690.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07691.1_24" > QPB07691.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07692.1_25" > QPB07692.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07693.1_26" > QPB07693.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07694.1_27" > QPB07694.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07695.1_28" > QPB07695.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07696.1_29" > QPB07696.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07697.1_30" > QPB07697.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07698.1_31" > QPB07698.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07699.1_32" > QPB07699.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07700.1_33" > QPB07700.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07701.1_34" > QPB07701.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07702.1_35" > QPB07702.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07703.1_36" > QPB07703.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07704.1_37" > QPB07704.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07705.1_38" > QPB07705.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07706.1_39" > QPB07706.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07707.1_40" > QPB07707.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07708.1_41" > QPB07708.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07709.1_42" > QPB07709.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07710.1_43" > QPB07710.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07711.1_44" > QPB07711.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07712.1_45" > QPB07712.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07713.1_46" > QPB07713.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07714.1_47" > QPB07714.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07715.1_48" > QPB07715.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07716.1_49" > QPB07716.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07717.1_50" > QPB07717.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07718.1_51" > QPB07718.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07719.1_52" > QPB07719.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07720.1_53" > QPB07720.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07721.1_54" > QPB07721.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07722.1_55" > QPB07722.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07723.1_56" > QPB07723.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07724.1_57" > QPB07724.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07725.1_58" > QPB07725.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07726.1_59" > QPB07726.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07727.1_60" > QPB07727.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07728.1_61" > QPB07728.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07729.1_62" > QPB07729.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07730.1_63" > QPB07730.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07731.1_64" > QPB07731.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07732.1_65" > QPB07732.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07733.1_66" > QPB07733.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07734.1_67" > QPB07734.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07735.1_68" > QPB07735.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07736.1_69" > QPB07736.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07737.1_70" > QPB07737.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07738.1_71" > QPB07738.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07739.1_72" > QPB07739.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07740.1_73" > QPB07740.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07741.1_74" > QPB07741.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07742.1_75" > QPB07742.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07743.1_76" > QPB07743.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07744.1_77" > QPB07744.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07745.1_78" > QPB07745.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07746.1_79" > QPB07746.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07747.1_80" > QPB07747.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07748.1_81" > QPB07748.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07749.1_82" > QPB07749.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07750.1_83" > QPB07750.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07751.1_84" > QPB07751.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07752.1_85" > QPB07752.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07753.1_86" > QPB07753.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07754.1_87" > QPB07754.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07755.1_88" > QPB07755.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07756.1_89" > QPB07756.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07757.1_90" > QPB07757.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07758.1_91" > QPB07758.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07759.1_92" > QPB07759.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07760.1_93" > QPB07760.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07761.1_94" > QPB07761.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07762.1_95" > QPB07762.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07763.1_96" > QPB07763.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07764.1_97" > QPB07764.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07765.1_98" > QPB07765.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07766.1_99" > QPB07766.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07767.1_100" > QPB07767.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07768.1_101" > QPB07768.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07769.1_102" > QPB07769.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07770.1_103" > QPB07770.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07771.1_104" > QPB07771.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07772.1_105" > QPB07772.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07773.1_106" > QPB07773.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07774.1_107" > QPB07774.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07775.1_108" > QPB07775.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07776.1_109" > QPB07776.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07777.1_110" > QPB07777.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07778.1_111" > QPB07778.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07779.1_112" > QPB07779.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07780.1_113" > QPB07780.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07781.1_114" > QPB07781.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07782.1_115" > QPB07782.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07783.1_116" > QPB07783.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07784.1_117" > QPB07784.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07785.1_118" > QPB07785.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07786.1_119" > QPB07786.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07787.1_120" > QPB07787.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07788.1_121" > QPB07788.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07789.1_122" > QPB07789.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07790.1_123" > QPB07790.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07791.1_124" > QPB07791.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07792.1_125" > QPB07792.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07793.1_126" > QPB07793.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07794.1_127" > QPB07794.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07795.1_128" > QPB07795.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07796.1_129" > QPB07796.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07797.1_130" > QPB07797.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07798.1_131" > QPB07798.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07799.1_132" > QPB07799.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07800.1_133" > QPB07800.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07801.1_134" > QPB07801.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07802.1_135" > QPB07802.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07803.1_136" > QPB07803.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07804.1_137" > QPB07804.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07805.1_138" > QPB07805.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07806.1_139" > QPB07806.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07807.1_140" > QPB07807.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07808.1_141" > QPB07808.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07809.1_142" > QPB07809.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07810.1_143" > QPB07810.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07811.1_144" > QPB07811.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07812.1_145" > QPB07812.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07813.1_146" > QPB07813.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07814.1_147" > QPB07814.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07815.1_148" > QPB07815.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07816.1_149" > QPB07816.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07817.1_150" > QPB07817.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07818.1_151" > QPB07818.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07819.1_152" > QPB07819.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07820.1_153" > QPB07820.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07821.1_154" > QPB07821.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07822.1_155" > QPB07822.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07823.1_156" > QPB07823.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07824.1_157" > QPB07824.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07825.1_158" > QPB07825.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07826.1_159" > QPB07826.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07827.1_160" > QPB07827.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07828.1_161" > QPB07828.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07829.1_162" > QPB07829.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07830.1_163" > QPB07830.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07831.1_164" > QPB07831.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07832.1_165" > QPB07832.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07833.1_166" > QPB07833.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07834.1_167" > QPB07834.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07835.1_168" > QPB07835.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07836.1_169" > QPB07836.fasta
samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07837.1_170" > QPB07837.fasta

for i in {668..750}; do
    id=$(printf "QPB07%03d" "$i")
    echo "mkdir ${id}"
    echo "for sample in mibi1440 mibi1454 mibi1456 mibi1459 mibi1461 mibi1478 mibi1480 mibi1485 mibi1487 mibi1496 mibi1498 mibi1502 mibi2318 mibi2319 AP029227 CP013943 CP040883 CP045648 CP052939 CP052940 CP052955 CP052959 CP052981 CP052984 CP052990 CP052994 CP053088 CP064461 CP064467 CP064473 CP064476 CP064480 CP064485 CP064499 CP064508 CP064516 CP064525 CP064554 CP064557 CP064560 CP064572 CP064574 CP064587 CP064590 CP064599 CP064603 CP064606 CP064609 CP064613 CP064616 CP064619 CP064624 CP064631 CP064637 CP064650 CP073821 CP073824 CP073827 CP073830 CP073835 CP073836 CP073840 CP073841 CP073844 CP073847 CP073850 CP073852 CP073855 CP073857 CP073859 CP073862 CP073863 CP073878 CP073900 CP073904 CP084008 CP093173 CP093179 CP093181 CP093193 CP093196 CP093198 CP093212 CP097512 CP097514 CP120425 CP133663; do"
        echo "blastn -db \${sample}.fasta -query ${id}.fasta -evalue 1e-50 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 > ./${id}/\${sample}.blastn"
    echo "done"
done

for i in {751..837}; do
    id=$(printf "QPB07%03d" "$i")
    echo "mkdir ${id}"
    echo "for sample in mibi1440 mibi1454 mibi1456 mibi1459 mibi1461 mibi1478 mibi1480 mibi1485 mibi1487 mibi1496 mibi1498 mibi1502 mibi2318 mibi2319 AP029227 CP013943 CP040883 CP045648 CP052939 CP052940 CP052955 CP052959 CP052981 CP052984 CP052990 CP052994 CP053088 CP064461 CP064467 CP064473 CP064476 CP064480 CP064485 CP064499 CP064508 CP064516 CP064525 CP064554 CP064557 CP064560 CP064572 CP064574 CP064587 CP064590 CP064599 CP064603 CP064606 CP064609 CP064613 CP064616 CP064619 CP064624 CP064631 CP064637 CP064650 CP073821 CP073824 CP073827 CP073830 CP073835 CP073836 CP073840 CP073841 CP073844 CP073847 CP073850 CP073852 CP073855 CP073857 CP073859 CP073862 CP073863 CP073878 CP073900 CP073904 CP084008 CP093173 CP093179 CP093181 CP093193 CP093196 CP093198 CP093212 CP097512 CP097514 CP120425 CP133663; do"
        echo "blastn -db \${sample}.fasta -query ${id}.fasta -evalue 1e-50 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 > ./${id}/\${sample}.blastn"
    echo "done"
done

    for i in {708..708}; do
    id=$(printf "QPB07%03d" "$i")
    echo "mkdir ${id}"
    echo "for sample in mibi1440 mibi1454 mibi1456 mibi1459 mibi1461 mibi1478 mibi1480 mibi1485 mibi1487 mibi1496 mibi1498 mibi1502 mibi2318 mibi2319 AP029227 CP013943 CP040883 CP045648 CP052939 CP052940 CP052955 CP052959 CP052981 CP052984 CP052990 CP052994 CP053088 CP064461 CP064467 CP064473 CP064476 CP064480 CP064485 CP064499 CP064508 CP064516 CP064525 CP064554 CP064557 CP064560 CP064572 CP064574 CP064587 CP064590 CP064599 CP064603 CP064606 CP064609 CP064613 CP064616 CP064619 CP064624 CP064631 CP064637 CP064650 CP073821 CP073824 CP073827 CP073830 CP073835 CP073836 CP073840 CP073841 CP073844 CP073847 CP073850 CP073852 CP073855 CP073857 CP073859 CP073862 CP073863 CP073878 CP073900 CP073904 CP084008 CP093173 CP093179 CP093181 CP093193 CP093196 CP093198 CP093212 CP097512 CP097514 CP120425 CP133663; do"
        echo "blastn -db \${sample}.fasta -query ${id}.fasta -evalue 1e-10 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 > ./${id}/\${sample}.blastn"
    echo "done"
done

# -- under plotTreeHeatmap_ST2 --
for i in {668..837}; do
    id_1=$(printf "QPB07%03d" "$((i-1))")
    id=$(printf "QPB07%03d" "$i")
    echo "python ~/Scripts/process_directory.py /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/presence_absence_ST2/${id} /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/plotTreeHeatmap_ST2/typing_ST2_until_${id_1}.csv /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/plotTreeHeatmap_ST2/typing_ST2_until_${id}.csv"
done

sed -i 's/+/MT880872/g' typing_ST2_until_QPB07837.csv

plot tree heatmap under /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/plotTreeHeatmap_ST2_MT880871

library(ggtree)
library(ggplot2)
library(dplyr)
setwd("/home/jhuang/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_ST2/")
# -- edit tree --
#https://icytree.org/
#0.000780
info <- read.csv("typing_ST2_until_QPB07837.csv", sep="\t")
info$name <- info$Isolate
tree <- read.tree("core_gene_alignment.aln.treefile")
cols <- c("Public database"='purple2', "Clinical isolate"='darksalmon')  #purple2  skyblue2
heatmapData2 <- info %>% select(Isolate, QPB07572, QPB07573, QPB07574, QPB07575, QPB07576, QPB07577, QPB07578, QPB07579, QPB07580, QPB07581, QPB07582, QPB07583, QPB07584, QPB07585, QPB07586, QPB07587, QPB07588, QPB07589, QPB07590, QPB07591, QPB07592, QPB07593, QPB07594, QPB07595, QPB07596, QPB07597, QPB07598, QPB07599, QPB07600, QPB07601, QPB07602, QPB07603, QPB07604, QPB07605, QPB07606, QPB07607, QPB07608, QPB07609, QPB07610, QPB07611, QPB07612, QPB07613, QPB07614, QPB07615, QPB07616, QPB07617, QPB07618, QPB07619, QPB07620, QPB07621, QPB07622, QPB07623, QPB07624, QPB07625, QPB07626, QPB07627, QPB07628, QPB07629, QPB07630, QPB07631, QPB07632, QPB07633, QPB07634, QPB07635, QPB07636, QPB07637, QPB07638, QPB07639, QPB07640, QPB07641, QPB07642, QPB07643, QPB07644, QPB07645, QPB07646, QPB07647, QPB07648, QPB07649, QPB07650, QPB07651, QPB07652, QPB07653, QPB07654, QPB07655, QPB07656, QPB07657, QPB07658, QPB07659, QPB07660, QPB07661, QPB07662, QPB07663, QPB07664, QPB07665, QPB07666, QPB07667, QPB07668, QPB07669, QPB07670, QPB07671, QPB07672, QPB07673, QPB07674, QPB07675, QPB07676, QPB07677, QPB07678, QPB07679, QPB07680, QPB07681, QPB07682, QPB07683, QPB07684, QPB07685, QPB07686, QPB07687, QPB07688, QPB07689, QPB07690, QPB07691, QPB07692, QPB07693, QPB07694, QPB07695, QPB07696, QPB07697, QPB07698, QPB07699, QPB07700, QPB07701, QPB07702, QPB07703, QPB07704, QPB07705, QPB07706, QPB07707, QPB07708, QPB07709, QPB07710, QPB07711, QPB07712, QPB07713, QPB07714, QPB07715, QPB07716, QPB07717, QPB07718, QPB07719, QPB07720, QPB07721, QPB07722, QPB07723, QPB07724, QPB07725, QPB07726, QPB07727, QPB07728, QPB07729, QPB07730, QPB07731, QPB07732, QPB07733, QPB07734, QPB07735, QPB07736, QPB07737, QPB07738, QPB07739, QPB07740, QPB07741, QPB07742, QPB07743, QPB07744, QPB07745, QPB07746, QPB07747, QPB07748, QPB07749, QPB07750, QPB07751, QPB07752, QPB07753, QPB07754, QPB07755, QPB07756, QPB07757, QPB07758, QPB07759, QPB07760, QPB07761, QPB07762, QPB07763, QPB07764, QPB07765, QPB07766, QPB07767, QPB07768, QPB07769, QPB07770, QPB07771, QPB07772, QPB07773, QPB07774, QPB07775, QPB07776, QPB07777, QPB07778, QPB07779, QPB07780, QPB07781, QPB07782, QPB07783, QPB07784, QPB07785, QPB07786, QPB07787, QPB07788, QPB07789, QPB07790, QPB07791, QPB07792, QPB07793, QPB07794, QPB07795, QPB07796, QPB07797, QPB07798, QPB07799, QPB07800, QPB07801, QPB07802, QPB07803, QPB07804, QPB07805, QPB07806, QPB07807, QPB07808, QPB07809, QPB07810, QPB07811, QPB07812, QPB07813, QPB07814, QPB07815, QPB07816, QPB07817, QPB07818, QPB07819, QPB07820, QPB07821, QPB07822, QPB07823, QPB07824, QPB07825, QPB07826, QPB07827, QPB07828, QPB07829, QPB07830, QPB07831, QPB07832, QPB07833, QPB07834, QPB07835, QPB07836, QPB07837)  #ST,
rn <- heatmapData2$Isolate
heatmapData2$Isolate <- NULL
heatmapData2 <- as.data.frame(sapply(heatmapData2, as.character))
rownames(heatmapData2) <- rn
heatmap.colours <- c("darkred", "darkblue", "darkgreen", "grey")
names(heatmap.colours) <- c("MT880870","MT880871","MT880872","-")
#mydat$Regulation <- factor(mydat$Regulation, levels=c("up","down"))
#circular
p <- ggtree(tree, layout='circular', branch.length='none') %<+% info + geom_tippoint(aes(color=Source)) + scale_color_manual(values=cols) + geom_tiplab2(aes(label=name), offset=1)
png("ggtree.png", width=1260, height=1260)
#svg("ggtree.svg", width=1260, height=1260)
p
dev.off()
#png("ggtree_and_gheatmap.png", width=1290, height=1000)
#svg("ggtree_and_gheatmap.svg", width=17, height=15)
#gheatmap(p, heatmapData2, width=0.1, colnames_position="top", colnames_angle=90, colnames_offset_y = 0.1, hjust=0.5, font.size=4, offset = 5) + scale_fill_manual(values=heatmap.colours) +  theme(legend.text = element_text(size = 14)) + theme(legend.title = element_text(size = 14)) + guides(fill=guide_legend(title=""), color = guide_legend(override.aes = list(size = 5)))
png("ggtree_and_gheatmap.png", width=1290, height=1000)
gheatmap(p, heatmapData2, width=0.5, colnames_position="top", colnames_angle=90, colnames_offset_y = 0.1, hjust=0.5, font.size=0, offset = 8) + scale_fill_manual(values=heatmap.colours) +  theme(legend.text = element_text(size = 16)) + theme(legend.title = element_text(size = 16)) + guides(fill=guide_legend(title=""), color = guide_legend(override.aes = list(size = 5)))
dev.off()

Report

Explanation for the discrepancy between the last analysis and this one: Note that many partial genes of the phages do not match because, in the previous analysis, only the longest BLASTn hits were considered and recorded in the table.

We have a total of 14 ST2 genomes from Luise’s project, and I found 73 ST2 genomes from NCBI.

I examined the presence and absence of all 51 + 45 + 170 genes (details below) across the 87 genomes and visualized the results in a heatmap (attached as ggtree_and_gheatmap_phages.png).
- MT880870: 51 putative genes
- MT880871: 45 putative genes
- MT880872: 170 putative genes
In the heatmap:

Genes not present in the corresponding isolate are shown in grey. Genes present are highlighted in the corresponding colors:
- Dark red for MT880870
- Dark blue for MT880871
- Dark green for MT880872
The graphic is based on the Excel file typing_ST2.xlsx. If you have any suggestions or improvements, please let me know.

Enhanced Visualization of Gene Presence in Luise_Sepi_STKN

Leave a reply

prepare typing_ST2_until_QPB07837.csv (under DIR ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates and ENV r_env)

cp ../presence_absence_ST2/QPB*.fasta ./
cp ../plotTreeHeatmap/typing.csv ./typing_until_QPB07571.csv

# -- makeblastdb --
#for sample in mibi1435 mibi1436 mibi1437 mibi1438 mibi1439 mibi1440 mibi1441 mibi1442 mibi1443 mibi1444 mibi1445 mibi1446 mibi1447 mibi1448 mibi1449 mibi1450 mibi1451 mibi1452 mibi1453 mibi1454 mibi1455 mibi1456 mibi1457 mibi1458 mibi1459 mibi1460 mibi1461 mibi1462 mibi1463 mibi1464 mibi1465 mibi1466 mibi1467 mibi1468 mibi1469 mibi1471 mibi1473 mibi1474 mibi1475 mibi1476 mibi1477 mibi1478 mibi1479 mibi1480 mibi1481 mibi1482 mibi1483 mibi1484 mibi1485 mibi1486 mibi1487 mibi1488 mibi1489 mibi1490 mibi1491 mibi1492 mibi1493 mibi1494 mibi1495 mibi1496 mibi1497 mibi1498 mibi1499 mibi1500 mibi1501 mibi1502 mibi1503 mibi1504 mibi1505 mibi1506 mibi2312 mibi2313 mibi2314 mibi2315 mibi2316 mibi2317 mibi2318 mibi2319 mibi2320 mibi2321 mibi2379; do
#    makeblastdb -in ../shovill/${sample}/contigs.fa -dbtype nucl
#done

for i in {572..622}; do
    id_1=$(printf "QPB07%03d" "$((i-1))")
    id=$(printf "QPB07%03d" "$i")
    echo "mkdir ${id}"
    echo "for sample in mibi1435 mibi1436 mibi1437 mibi1438 mibi1439 mibi1440 mibi1441 mibi1442 mibi1443 mibi1444 mibi1445 mibi1446 mibi1447 mibi1448 mibi1449 mibi1450 mibi1451 mibi1452 mibi1453 mibi1454 mibi1455 mibi1456 mibi1457 mibi1458 mibi1459 mibi1460 mibi1461 mibi1462 mibi1463 mibi1464 mibi1465 mibi1466 mibi1467 mibi1468 mibi1469 mibi1471 mibi1473 mibi1474 mibi1475 mibi1476 mibi1477 mibi1478 mibi1479 mibi1480 mibi1481 mibi1482 mibi1483 mibi1484 mibi1485 mibi1486 mibi1487 mibi1488 mibi1489 mibi1490 mibi1491 mibi1492 mibi1493 mibi1494 mibi1495 mibi1496 mibi1497 mibi1498 mibi1499 mibi1500 mibi1501 mibi1502 mibi1503 mibi1504 mibi1505 mibi1506 mibi2312 mibi2313 mibi2314 mibi2315 mibi2316 mibi2317 mibi2318 mibi2319 mibi2320 mibi2321 mibi2379; do"
        echo "blastn -db ../shovill/\${sample}/contigs.fa -query ${id}.fasta -evalue 1e-40 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 > ./${id}/\${sample}.blastn"
    echo "done"
done

for i in {572..622}; do
    id_1=$(printf "QPB07%03d" "$((i-1))")
    id=$(printf "QPB07%03d" "$i")
    echo "python ~/Scripts/process_directory.py ${id} typing_until_${id_1}.csv typing_until_${id}.csv"
done

#reprace all '+' with 'MT880870' in typing_until_QPB07622.csv
sed -i 's/+/MT880870/g' typing_until_QPB07622.csv

prepare typing_until_QPB07667.csv

for i in {623..667}; do
    id=$(printf "QPB07%03d" "$i")
    echo "mkdir ${id}"
    echo "for sample in mibi1435 mibi1436 mibi1437 mibi1438 mibi1439 mibi1440 mibi1441 mibi1442 mibi1443 mibi1444 mibi1445 mibi1446 mibi1447 mibi1448 mibi1449 mibi1450 mibi1451 mibi1452 mibi1453 mibi1454 mibi1455 mibi1456 mibi1457 mibi1458 mibi1459 mibi1460 mibi1461 mibi1462 mibi1463 mibi1464 mibi1465 mibi1466 mibi1467 mibi1468 mibi1469 mibi1471 mibi1473 mibi1474 mibi1475 mibi1476 mibi1477 mibi1478 mibi1479 mibi1480 mibi1481 mibi1482 mibi1483 mibi1484 mibi1485 mibi1486 mibi1487 mibi1488 mibi1489 mibi1490 mibi1491 mibi1492 mibi1493 mibi1494 mibi1495 mibi1496 mibi1497 mibi1498 mibi1499 mibi1500 mibi1501 mibi1502 mibi1503 mibi1504 mibi1505 mibi1506 mibi2312 mibi2313 mibi2314 mibi2315 mibi2316 mibi2317 mibi2318 mibi2319 mibi2320 mibi2321 mibi2379; do"
        echo "blastn -db ../shovill/\${sample}/contigs.fa -query ${id}.fasta -evalue 1e-40 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 > ./${id}/\${sample}.blastn"
    echo "done"
done

for i in {623..667}; do
    id_1=$(printf "QPB07%03d" "$((i-1))")
    id=$(printf "QPB07%03d" "$i")
    echo "python ~/Scripts/process_directory.py ${id} typing_until_${id_1}.csv typing_until_${id}.csv"
done

sed -i 's/+/MT880871/g' typing_until_QPB07667.csv

prepare typing_until_QPB07667.csv

for i in {668..750}; do
    id=$(printf "QPB07%03d" "$i")
    echo "mkdir ${id}"
    echo "for sample in mibi1435 mibi1436 mibi1437 mibi1438 mibi1439 mibi1440 mibi1441 mibi1442 mibi1443 mibi1444 mibi1445 mibi1446 mibi1447 mibi1448 mibi1449 mibi1450 mibi1451 mibi1452 mibi1453 mibi1454 mibi1455 mibi1456 mibi1457 mibi1458 mibi1459 mibi1460 mibi1461 mibi1462 mibi1463 mibi1464 mibi1465 mibi1466 mibi1467 mibi1468 mibi1469 mibi1471 mibi1473 mibi1474 mibi1475 mibi1476 mibi1477 mibi1478 mibi1479 mibi1480 mibi1481 mibi1482 mibi1483 mibi1484 mibi1485 mibi1486 mibi1487 mibi1488 mibi1489 mibi1490 mibi1491 mibi1492 mibi1493 mibi1494 mibi1495 mibi1496 mibi1497 mibi1498 mibi1499 mibi1500 mibi1501 mibi1502 mibi1503 mibi1504 mibi1505 mibi1506 mibi2312 mibi2313 mibi2314 mibi2315 mibi2316 mibi2317 mibi2318 mibi2319 mibi2320 mibi2321 mibi2379; do"
        echo "blastn -db ../shovill/\${sample}/contigs.fa -query ${id}.fasta -evalue 1e-40 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 > ./${id}/\${sample}.blastn"
    echo "done"
done

for i in {751..837}; do
    id=$(printf "QPB07%03d" "$i")
    echo "mkdir ${id}"
    echo "for sample in mibi1435 mibi1436 mibi1437 mibi1438 mibi1439 mibi1440 mibi1441 mibi1442 mibi1443 mibi1444 mibi1445 mibi1446 mibi1447 mibi1448 mibi1449 mibi1450 mibi1451 mibi1452 mibi1453 mibi1454 mibi1455 mibi1456 mibi1457 mibi1458 mibi1459 mibi1460 mibi1461 mibi1462 mibi1463 mibi1464 mibi1465 mibi1466 mibi1467 mibi1468 mibi1469 mibi1471 mibi1473 mibi1474 mibi1475 mibi1476 mibi1477 mibi1478 mibi1479 mibi1480 mibi1481 mibi1482 mibi1483 mibi1484 mibi1485 mibi1486 mibi1487 mibi1488 mibi1489 mibi1490 mibi1491 mibi1492 mibi1493 mibi1494 mibi1495 mibi1496 mibi1497 mibi1498 mibi1499 mibi1500 mibi1501 mibi1502 mibi1503 mibi1504 mibi1505 mibi1506 mibi2312 mibi2313 mibi2314 mibi2315 mibi2316 mibi2317 mibi2318 mibi2319 mibi2320 mibi2321 mibi2379; do"
        echo "blastn -db ../shovill/\${sample}/contigs.fa -query ${id}.fasta -evalue 1e-40 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 > ./${id}/\${sample}.blastn"
    echo "done"
done

for i in {668..837}; do
    id_1=$(printf "QPB07%03d" "$((i-1))")
    id=$(printf "QPB07%03d" "$i")
    echo "python ~/Scripts/process_directory.py ${id} typing_until_${id_1}.csv typing_until_${id}.csv"
done

sed -i 's/+/MT880872/g' typing_until_QPB07837.csv

cut -d$'\t' -f1-43 typing_until_QPB07837.csv > temp1
cut -d$'\t' -f44- typing_until_QPB07837.csv > temp2

sed -i 's/MT880870/+/g' temp1
paste -d$'\t' temp1 temp2 > ggtree_and_gheatmap_mibi_phages.csv

plot tree heatmap under /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates

cp ../plotTreeHeatmap/990_backup.tree ./

library(ggtree)
library(ggplot2)
library(dplyr)
setwd("/home/jhuang/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/")
# -- edit tree --
#https://icytree.org/
#0.000780

# -- for the figure ggtree_and_gheatmap_mibi_phages.png --
info <- read.csv("ggtree_and_gheatmap_mibi.csv", sep="\t")
info$name <- info$Isolate
tree <- read.tree("990_backup.tree")
cols <- c("2"="cornflowerblue","5"="darkgreen","7"="seagreen3","9"="tan","14"="red",  "17"="navyblue", "23"="gold",     "35"="green","59"="orange","73"="pink","81"="purple","86"="magenta","87"="brown", "89"="darksalmon","130"="chocolate4","190"="darkkhaki", "290"="azure3", "297"="maroon","325"="lightgreen",     "454"="blue","487"="cyan", "558"="skyblue2", "766"="blueviolet")
#cols <- c("2"='purple2', "other"='darksalmon')  #purple2  skyblue2
heatmapData2 <- info %>% select(Isolate,  QPB07572, QPB07573, QPB07574, QPB07575, QPB07576, QPB07577, QPB07578, QPB07579, QPB07580, QPB07581, QPB07582, QPB07583, QPB07584, QPB07585, QPB07586, QPB07587, QPB07588, QPB07589, QPB07590, QPB07591, QPB07592, QPB07593, QPB07594, QPB07595, QPB07596, QPB07597, QPB07598, QPB07599, QPB07600, QPB07601, QPB07602, QPB07603, QPB07604, QPB07605, QPB07606, QPB07607, QPB07608, QPB07609, QPB07610, QPB07611, QPB07612, QPB07613, QPB07614, QPB07615, QPB07616, QPB07617, QPB07618, QPB07619, QPB07620, QPB07621, QPB07622, QPB07623, QPB07624, QPB07625, QPB07626, QPB07627, QPB07628, QPB07629, QPB07630, QPB07631, QPB07632, QPB07633, QPB07634, QPB07635, QPB07636, QPB07637, QPB07638, QPB07639, QPB07640, QPB07641, QPB07642, QPB07643, QPB07644, QPB07645, QPB07646, QPB07647, QPB07648, QPB07649, QPB07650, QPB07651, QPB07652, QPB07653, QPB07654, QPB07655, QPB07656, QPB07657, QPB07658, QPB07659, QPB07660, QPB07661, QPB07662, QPB07663, QPB07664, QPB07665, QPB07666, QPB07667, QPB07668, QPB07669, QPB07670, QPB07671, QPB07672, QPB07673, QPB07674, QPB07675, QPB07676, QPB07677, QPB07678, QPB07679, QPB07680, QPB07681, QPB07682, QPB07683, QPB07684, QPB07685, QPB07686, QPB07687, QPB07688, QPB07689, QPB07690, QPB07691, QPB07692, QPB07693, QPB07694, QPB07695, QPB07696, QPB07697, QPB07698, QPB07699, QPB07700, QPB07701, QPB07702, QPB07703, QPB07704, QPB07705, QPB07706, QPB07707, QPB07708, QPB07709, QPB07710, QPB07711, QPB07712, QPB07713, QPB07714, QPB07715, QPB07716, QPB07717, QPB07718, QPB07719, QPB07720, QPB07721, QPB07722, QPB07723, QPB07724, QPB07725, QPB07726, QPB07727, QPB07728, QPB07729, QPB07730, QPB07731, QPB07732, QPB07733, QPB07734, QPB07735, QPB07736, QPB07737, QPB07738, QPB07739, QPB07740, QPB07741, QPB07742, QPB07743, QPB07744, QPB07745, QPB07746, QPB07747, QPB07748, QPB07749, QPB07750, QPB07751, QPB07752, QPB07753, QPB07754, QPB07755, QPB07756, QPB07757, QPB07758, QPB07759, QPB07760, QPB07761, QPB07762, QPB07763, QPB07764, QPB07765, QPB07766, QPB07767, QPB07768, QPB07769, QPB07770, QPB07771, QPB07772, QPB07773, QPB07774, QPB07775, QPB07776, QPB07777, QPB07778, QPB07779, QPB07780, QPB07781, QPB07782, QPB07783, QPB07784, QPB07785, QPB07786, QPB07787, QPB07788, QPB07789, QPB07790, QPB07791, QPB07792, QPB07793, QPB07794, QPB07795, QPB07796, QPB07797, QPB07798, QPB07799, QPB07800, QPB07801, QPB07802, QPB07803, QPB07804, QPB07805, QPB07806, QPB07807, QPB07808, QPB07809, QPB07810, QPB07811, QPB07812, QPB07813, QPB07814, QPB07815, QPB07816, QPB07817, QPB07818, QPB07819, QPB07820, QPB07821, QPB07822, QPB07823, QPB07824, QPB07825, QPB07826, QPB07827, QPB07828, QPB07829, QPB07830, QPB07831, QPB07832, QPB07833, QPB07834, QPB07835, QPB07836, QPB07837)  #ST,
rn <- heatmapData2$Isolate
heatmapData2$Isolate <- NULL
heatmapData2 <- as.data.frame(sapply(heatmapData2, as.character))
rownames(heatmapData2) <- rn

heatmap.colours <- c("darkred", "darkblue", "darkgreen", "grey")
names(heatmap.colours) <- c("MT880870","MT880871","MT880872","-")

#heatmap.colours <- c("cornflowerblue","darkgreen","seagreen3","tan","red",  "navyblue", "gold",     "green","orange","pink","purple","magenta","brown", "darksalmon","chocolate4","darkkhaki", "azure3", "maroon","lightgreen",     "blue","cyan", "skyblue2", "blueviolet",       "darkred", "darkblue", "darkgreen", "grey")
#names(heatmap.colours) <- c("2","5","7","9","14", "17","23",   "35","59","73", "81","86","87","89","130","190","290", "297","325",    "454","487","558","766",       "MT880870","MT880871","MT880872","-")

#mydat$Regulation <- factor(mydat$Regulation, levels=c("up","down"))
#circular

p <- ggtree(tree, layout='circular', branch.length='none') %<+% info + geom_tippoint(aes(color=ST)) + scale_color_manual(values=cols) + geom_tiplab2(aes(label=name), offset=1)
png("ggtree.png", width=1260, height=1260)
#svg("ggtree.svg", width=1260, height=1260)
p
dev.off()

#gheatmap(p, heatmapData2, width=0.1, colnames_position="top", colnames_angle=90, colnames_offset_y = 0.1, hjust=0.5, font.size=4, offset = 5) + scale_fill_manual(values=heatmap.colours) +  theme(legend.text = element_text(size = 14)) + theme(legend.title = element_text(size = 14)) + guides(fill=guide_legend(title=""), color = guide_legend(override.aes = list(size = 5)))

png("ggtree_and_gheatmap_mibi_phages.png", width=1290, height=1000)
#svg("ggtree_and_gheatmap_mibi_phages.svg", width=17, height=15)
gheatmap(p, heatmapData2, width=0.5, colnames_position="top", colnames_angle=90, colnames_offset_y = 0.1, hjust=0.5, font.size=0, offset = 6) + scale_fill_manual(values=heatmap.colours) +  theme(legend.text = element_text(size = 16)) + theme(legend.title = element_text(size = 16)) + guides(fill=guide_legend(title=""), color = guide_legend(override.aes = list(size = 5)))
dev.off()

# -- for the figure ggtree_and_gheatmap_mibi_selected_genes.png --

info <- read.csv("ggtree_and_gheatmap_mibi.csv", sep="\t")
info$name <- info$Isolate
tree <- read.tree("990_backup.tree")
cols <- c("2"="cornflowerblue","5"="darkgreen","7"="seagreen3","9"="tan","14"="red",  "17"="navyblue", "23"="gold",     "35"="green","59"="orange","73"="pink","81"="purple","86"="magenta","87"="brown", "89"="darksalmon","130"="chocolate4","190"="darkkhaki", "290"="azure3", "297"="maroon","325"="lightgreen",     "454"="blue","487"="cyan", "558"="skyblue2", "766"="blueviolet")

heatmapData2 <- info %>% select(Isolate,       SCCmec,  agr.typing,      gyrB,    fumC,    gltA,    icd,     apsS,    sigB,    sarA,    agrC,    yycG,    psm.β,   psm.δ,   hlb,     atlE,    atl,     sdrG,    sdrH,    ebh,     ebpS,    tagB,    capC,    sepA,    dltA,    fmtC,    lipA,    sceD,    SE0760,  esp,     ecpA)  #ST,
rn <- heatmapData2$Isolate
heatmapData2$Isolate <- NULL
heatmapData2 <- as.data.frame(sapply(heatmapData2, as.character))
rownames(heatmapData2) <- rn

#heatmap.colours <- c("darkred", "darkblue", "darkgreen", "grey")
#names(heatmap.colours) <- c("MT880870","MT880871","MT880872","-")

#heatmap.colours <- c("cornflowerblue","darkgreen","seagreen3","tan","red",  "navyblue", "gold",     "green","orange","pink","purple","magenta","brown", "darksalmon","chocolate4","darkkhaki", "azure3", "maroon","lightgreen",     "blue","cyan", "skyblue2", "blueviolet",       "darkred", "darkblue", "darkgreen", "grey")
#names(heatmap.colours) <- c("2","5","7","9","14", "17","23",   "35","59","73", "81","86","87","89","130","190","290", "297","325",    "454","487","558","766",       "MT880870","MT880871","MT880872","-")

heatmap.colours <- c("cornflowerblue","darkgreen","seagreen3","tan","red",  "navyblue", "purple",     "green","cyan",       "darkred", "darkblue", "darkgreen", "grey",  "darkgreen", "grey")
names(heatmap.colours) <- c("SCCmec_type_II(2A)", "SCCmec_type_III(3A)", "SCCmec_type_III(3A) and SCCmec_type_VIII(4A)", "SCCmec_type_IV(2B)", "SCCmec_type_IV(2B&5)", "SCCmec_type_IV(2B) and SCCmec_type_VI(4B)", "SCCmec_type_IVa(2B)", "SCCmec_type_IVb(2B)", "SCCmec_type_IVg(2B)",  "I", "II", "III", "none", "+","-")

#mydat$Regulation <- factor(mydat$Regulation, levels=c("up","down"))
#circular

p <- ggtree(tree, layout='circular', branch.length='none') %<+% info + geom_tippoint(aes(color=ST)) + scale_color_manual(values=cols) + geom_tiplab2(aes(label=name), offset=1)
png("ggtree.png", width=1260, height=1260)
#svg("ggtree.svg", width=1260, height=1260)
p
dev.off()

#gheatmap(p, heatmapData2, width=0.1, colnames_position="top", colnames_angle=90, colnames_offset_y = 0.1, hjust=0.5, font.size=4, offset = 5) + scale_fill_manual(values=heatmap.colours) +  theme(legend.text = element_text(size = 14)) + theme(legend.title = element_text(size = 14)) + guides(fill=guide_legend(title=""), color = guide_legend(override.aes = list(size = 5)))

png("ggtree_and_gheatmap_mibi_selected_genes.png", width=1590, height=1300)
#svg("ggtree_and_gheatmap_mibi_selected_genes.svg", width=17, height=15)
gheatmap(p, heatmapData2, width=2, colnames_position="top", colnames_angle=90, colnames_offset_y = 2.0, hjust=0.7, font.size=4, offset = 8) + scale_fill_manual(values=heatmap.colours) +  theme(legend.text = element_text(size = 16)) + theme(legend.title = element_text(size = 16)) + guides(fill=guide_legend(title=""), color = guide_legend(override.aes = list(size = 5)))
dev.off()

Report

I’ve attached a figure (ggtree_and_gheatmap_mibi_selected_genes.png) that illustrates the presence and absence of the selected genes. This is a visual representation of the table I sent to you earlier. The presence or absence of each gene in the corresponding genomes was determined using a BLASTn comparison between the genome and the gene sequences.

Additionally, I’ve updated the figure ggtree_and_gheatmap_mibi_phages.png. In this new version, all ST types are represented in distinct colors. The raw data for both figures can be found in the attached Excel file (ggtree_and_gheatmap_mibi.xlsx).

Enhanced Visualization of Gene Presence for the Selected Genes in Bongarts_S.epidermidis_HDRNA

Leave a reply

preparing gene seqences

(Optional online search) https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastn&PAGE_TYPE=BlastSearch&BLAST_SPEC=MicrobialGenomes
#Title:Refseq prokaryote representative genomes (contains refseq assembly)
#Molecule Type:mixed DNA
#Update date:2024/10/16
#Number of sequences:1038672

#cat gyrB_revcomp.fasta fumC_revcomp.fasta gltA.fasta icd_revcomp.fasta apsS.fasta sigB_revcomp.fasta sarA_revcomp.fasta ...

# SCCmec,agr.typing,
# gyrB(House keeper),
# fumC,gltA,icd(Metabolic genes),
# apsS,sigB,sarA,agrC,yycG(Virulence regulators),
# psm.β1,hlb(Toxins),
# atlE,sdrG,sdrH,ebh,ebp,tagB(Biofilm formation),
# capC,sepA,dltA,fmtC,lipA,sceD,SE0760(Immune evasion & colonization),
# esp,ecpA(Serine protease)

gene            complement(1547773..1549233)
                /gene="ebp"
                /locus_tag="SAKG22_13940"

CDS             complement(1547773..1549233)
                /gene="ebp"
        samtools faidx AP019545.1.fasta
        samtools faidx AP019545.1.fasta "gi|1602080981|dbj|AP019545.1|":1547773-1549233 > ebp.fasta
        revcomp ebp.fasta > ebp_revcomp.fasta

        cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/gyrB_revcomp.fasta gyrB.fasta

        cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/fumC_revcomp.fasta fumC.fasta
        cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/gltA.fasta .
        cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/icd_revcomp.fasta icd.fasta

        cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/apsS.fasta .
        cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/sigB_revcomp.fasta sigB.fasta
        cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/sarA_revcomp.fasta sarA.fasta
        cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/agrC.fasta .
        cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/yycG.fasta .

        cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/psm-beta1.fasta psm.β1.fasta
        cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/hlb.fasta .

        cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/atlE.fasta .
        cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/sdrG.fasta .
        cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/sdrH_revcomp.fasta sdrH.fasta
        cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/ebh_revcomp.fasta ebh.fasta
        cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/ebp_revcomp.fasta ebp.fasta
        cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/tagB.fasta .

        cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/capC.fasta .
        cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/sepA.fasta .
        cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/dltA.fasta .
        cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/fmtC.fasta .
        cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/lipA.fasta .
        cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/sceD.fasta .
        cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/SE0760.fasta .

        cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/esp.fasta .
        cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/ecpA.fasta .

        #cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/MT880870.fasta .
        #cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/MT880871.fasta .
        #cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/MT880872.fasta .

>gi|1602080981|dbj|AP019545.1|:1547773-1549233
>gi|34980227|gb|AY322150.1|:3091-8793

./gyrB_revcomp.fasta

./agrC.fasta
./CP133692.fasta
./sceD_mibi1435_revcomp.fasta
./CP133682.fasta
./CP133697.fasta
./sdrG.fasta
./sarA_revcomp.fasta
./CP133679.fasta
./CP133686.fasta
./sepA.fasta
./icd.fasta
./CP133691.fasta
./CP133698.fasta
./yycG.fasta
./CP133701.fasta

./sdrH_revcomp.fasta
./SE0760.fasta
./CP133681.fasta
./CP133677.fasta
./yycG_revcomp.fasta
./CP000029.fasta
./atlE_old.fasta
./ebpS.fasta
./ebpS_revcomp.fasta
./atlE_revcomp.fasta
./sceDAE.fasta
./hlb.fasta
>gi|441494790|gb|KC242859.1|:16-1008
ATGGTGAAAAAAACAAAATCCAATTCACTAAAAAAAGTTGCAACACTTGCATTAGCAAAT
TTATTATTAGTTGGTGCACTTACTGACAATAGTGCCAAAGCCGAATCTAAGAAAGATGAT
ACTGATTTGAAGTTAGTTAGTCATAACGTTTATATGTTATCGACCGTTTTGTATCCAAAC
TGGGGGCAATATAAACGCGCTGATTTAATCGGGCAATCTTCTTATATTAAAAATAATGAT
GTCGTAATATTCAATGAAGCATTTGATAATGGCGCATCAGATAAGCTATTAAGTAATGTA
AAAAAAGAATATCCTTACCAAACACCTGTACTCGGTCGTTCTCAATCAGGGTGGGACAAA
ACTGAAGGTAGCTACTCATCAACTGTTGCTGAAGATGGTGGCGTAGCGATTGTAAGTAAA
TATCCTATTAAAGAAAAAATCCAGCATGTTTTCAAAAGCGGTTGTGGATTCGACAATGAT
AGCAACAAAGGCTTTGTTTATACAAAAATAGAGAAAAATGGGAAGAACGTTCATGTTATC
GGTACACATACACAATCTGAAGATTCACGTTGTGGTGCTGGACATGATCGAAAAATTAGA
GCTGAACAAATGAAAGAAATCAGTGATTTTGTTAAAAAGAAAAATATCCCTAAAGATGAA
ACGGTATATATAGGTGGCGACCTTAATGTAAATAAAGGTACTCCAGAGTTCAAAGATATG
CTTAAAAACTTGAATGTAAATGATGTTCTATATGCAGGTCATAATAGCACATGGGACCCT
CAATCAAATTCAATTGCGAAATATAATTATCCTAATGGTAAACCAGAACATTTAGACTAT
ATATTTACAGATAAAGATCATAAACAACCAAAACAATTAGTCAATGAAGTTGTGACTGAA
AAACCTAAGCCATGGGATGTATATGCGTTCCCATATTACTACGTTTACAATGATTTTTCA
GATCATTACCCAATCAAAGCCTATAGTAAATAA
>hlb_
>gi|441494790|gb|KC242859.1| Staphylococcus aureus strain 226 beta-hemolysin (hlb) gene, complete cds
AAAGGAGTGATAATG  ATGGTGAAAAAAACAAAATCCAATTCACTAAAAAAAGTTGCAACACTTGCATTAGCAAATTTATTATTAGTTGGTGCACTTACTGACAATAGTGCCAAAGCCGAATCTAAGAAAGATGATACTGA
TTTGAAGTTAGTTAGTCATAACGTTTATATGTTATCGACCGTTTTGTATCCAAACTGGGGGCAATATAAA
CGCGCTGATTTAATCGGGCAATCTTCTTATATTAAAAATAATGATGTCGTAATATTCAATGAAGCATTTG
ATAATGGCGCATCAGATAAGCTATTAAGTAATGTAAAAAAAGAATATCCTTACCAAACACCTGTACTCGG
TCGTTCTCAATCAGGGTGGGACAAAACTGAAGGTAGCTACTCATCAACTGTTGCTGAAGATGGTGGCGTA
GCGATTGTAAGTAAATATCCTATTAAAGAAAAAATCCAGCATGTTTTCAAAAGCGGTTGTGGATTCGACA
ATGATAGCAACAAAGGCTTTGTTTATACAAAAATAGAGAAAAATGGGAAGAACGTTCATGTTATCGGTAC
ACATACACAATCTGAAGATTCACGTTGTGGTGCTGGACATGATCGAAAAATTAGAGCTGAACAAATGAAA
GAAATCAGTGATTTTGTTAAAAAGAAAAATATCCCTAAAGATGAAACGGTATATATAGGTGGCGACCTTA
ATGTAAATAAAGGTACTCCAGAGTTCAAAGATATGCTTAAAAACTTGAATGTAAATGATGTTCTATATGC
AGGTCATAATAGCACATGGGACCCTCAATCAAATTCAATTGCGAAATATAATTATCCTAATGGTAAACCA
GAACATTTAGACTATATATTTACAGATAAAGATCATAAACAACCAAAACAATTAGTCAATGAAGTTGTGA
CTGAAAAACCTAAGCCATGGGATGTATATGCGTTCCCATATTACTACGTTTACAATGATTTTTCAGATCA
TTACCCAATCAAAGCCTATAGTAAATAA

./dltA.fasta
./atlE.fasta
./gyrB.fasta
./ORF123_sepA_ORF5.fasta
./ebh.fasta
./CP133696.fasta
./yycG_old.fasta
./CP133700.fasta
./CP133695.fasta
./ecpA.fasta
./capBCA_ywtC.fasta
./CP133690.fasta
./gltA.fasta
./esp.fasta
./sigB.fasta
./CP133699.fasta
./CP133685.fasta
./CP133680.fasta
./MT880872.fasta
./icd_revcomp.fasta
./agrC_old.fasta
./ecpA_.fasta
./agrC_revcomp.fasta
./MT880870.fasta
./CP133687.fasta
./CP133693.fasta
./psm-beta1.fasta
>gi|380448412|gb|JQ066320.1| Staphylococcus aureus strain JP1 Psm beta1 (psm beta1) and Psm beta2 (psm beta2) genes, complete cds
ACACGTTTAACAACACAAGAATTATTATATCTAAATGAACTAAAATTAGCAATACCTTGTAAATAAAAAA
TGTTTATATTTTTCACTATTATAGAGCTATTTATCTAAAAAGGTTCAATAAGACTTAAATACGAATTCAG
GCAACTTAATTGTGTTAAATACAGTTTTGAATGCCTAACTGTATTTCTTTTCTCTTTAAAATACAGTTAA
GTACATTATAAGATGTTGTGCGGATAAACAAACTAATTGCATCAAATTTATTTTAAAATAACAACAACAA
AACGTTAAGAGAATAACATTTCGGTGATTTAAAAGCTACGCACGTTTTTGTTATCTTCAAATTTAAATTT
TAAGGAGTGTTTTCA
ATGGAAGGTTTATTTAACGCAATTAAAGATACCGTAACTGCAGCAATTAATAATGATGGC
GCAAAATTAGGCACAAGCATTGTGAGCATCGTTGAAAATGGCGTAGGTTTATTAGGTAAA
TTATTCGGATTCTAA
TTTCAATATGTTATGTAAGTAATCAGTATTATTTCAAAGGTGAGGGAGAGATTTAAATGA
CTGGACTAGCAGAAGCAATCGCAAATACTGTGCAAGCTGCACAACAACATGATAGTGTGAAATTAGGCAC
AAGTATCGTAGACATCGTTGCTAACGGTGTGGGTTTACTAGGTAAATTATTTGGATTCTAATATAATAAC
TAATATTCTTTAAAATAAACTGGGTGAGCATACTTTAATGTTATGCACTCAGTTTATTTTATTTGCAGAA
ATTTGAGCCTCTGTTAAGATTTAGATACATAGACAATATAGGAGATGGGGAAATTGGGATATAAAAATAT
TTTGATAGACTTTGATGATACAATTGTTGATTTTTATGATGCAGA

ATGGAAGGTTTATTTAACGCAATTAAAGATACCGTAACTGCAGCAATTAATAATGATGGC
GCAAAATTAGGCACAAGCATTGTGAGCATCGTTGAAAATGGCGTAGGTTTATTAGGTAAA
TTATTCGGATTCTAA
./psm-beta.fasta
./lipA.fasta
./capC.fasta
./fumC_revcomp.fasta
./CP133689.fasta
./sepA_mibi1435.fasta
./sigB_revcomp.fasta
./sceD.fasta
./fumC.fasta
./Enterococcus_faecium_isolate_E300_pathogenicity_island.fasta
./agrABCD_hld.fasta
./CP133688.fasta
./ecpA_mibi1435_revcomp.fasta
./hlb_.fasta
./CP133684.fasta
./tagB_mibi1435_revcomp.fasta
./CP133678.fasta
./tagB.fasta
./MT880871.fasta
./CP133676.fasta
./sdrH.fasta
./Staphylococcus_epidermidis_RP62A.fasta
./ebh_revcomp.fasta
./CP133683.fasta
./apsS.fasta
./Staphylococcus_aureus_MRSA252.fasta
./fmtC.fasta
./CP133694.fasta
./sarA.fasta

prepare the long fasta if we use only the long sequencing

ln -s ../Data_Holger_S.epidermidis_long/finished_assemblies/HDRNA_01_K01_conservative.fasta HDRNA_01_K01.fasta
ln -s ../Data_Holger_S.epidermidis_long/finished_assemblies/HDRNA_03_K01_bold_bandage.fasta HDRNA_03_K01.fasta
ln -s ../Data_Holger_S.epidermidis_long/finished_assemblies/HDRNA_06_K01_conservative.fasta HDRNA_06_K01.fasta
ln -s ../Data_Holger_S.epidermidis_long/finished_assemblies/HDRNA_07_K01_conservative.fasta HDRNA_07_K01.fasta
ln -s ../Data_Holger_S.epidermidis_long/finished_assemblies/HDRNA_08_K01_conservative.fasta HDRNA_08_K01.fasta
ln -s ../Data_Holger_S.epidermidis_long/finished_assemblies/HDRNA_12_K01_bold.fasta HDRNA_12_K01.fasta
ln -s ../Data_Holger_S.epidermidis_long/finished_assemblies/HDRNA_16_K01_conservative.fasta HDRNA_16_K01.fasta
ln -s ../Data_Holger_S.epidermidis_long/finished_assemblies/HDRNA_17_K01_conservative.fasta HDRNA_17_K01.fasta
ln -s ../Data_Holger_S.epidermidis_long/finished_assemblies/HDRNA_19_K01_bold.fasta HDRNA_19_K01.fasta
ln -s ../Data_Holger_S.epidermidis_long/finished_assemblies/HDRNA_20_K01_conservative.fasta HDRNA_20_K01.fasta

perform blastn searching for long-sequencing

# -- makeblastdb --
for sample in HDRNA_01_K01 HDRNA_03_K01 HDRNA_06_K01 HDRNA_07_K01 HDRNA_08_K01 HDRNA_12_K01 HDRNA_16_K01 HDRNA_17_K01 HDRNA_19_K01 HDRNA_20_K01; do
        makeblastdb -in ../assembly/${sample}.fasta -dbtype nucl
done

for id in gyrB    fumC gltA icd    apsS sigB sarA agrC yycG    psm.β1 hlb    atlE sdrG sdrH ebh ebp tagB    capC sepA dltA fmtC lipA sceD SE0760    esp ecpA; do
echo "mkdir ${id}"
echo "for sample in in HDRNA_01_K01 HDRNA_03_K01 HDRNA_06_K01 HDRNA_07_K01 HDRNA_08_K01 HDRNA_12_K01 HDRNA_16_K01 HDRNA_17_K01 HDRNA_19_K01 HDRNA_20_K01; do"
        echo "blastn -db ../assembly/${sample}.fasta -query ${id}.fasta -evalue 1e-40 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 > ./${id}/\${sample}.blastn"
echo "done"
done

#gyrB    fumC gltA icd    apsS sigB sarA agrC yycG    psm.β1 hlb    atlE sdrG sdrH ebh ebp tagB    capC sepA dltA fmtC lipA sceD SE0760    esp ecpA
python ~/Scripts/process_directory.py gyrB typing.csv typing_until_gyrB.csv
python ~/Scripts/process_directory.py fumC typing_until_gyrB.csv typing_until_fumC.csv
python ~/Scripts/process_directory.py gltA typing_until_fumC.csv typing_until_gltA.csv
python ~/Scripts/process_directory.py icd typing_until_gltA.csv typing_until_icd.csv

python ~/Scripts/process_directory.py apsS typing_until_icd.csv typing_until_apsS.csv
python ~/Scripts/process_directory.py sigB typing_until_apsS.csv typing_until_sigB.csv
python ~/Scripts/process_directory.py sarA typing_until_sigB.csv typing_until_sarA.csv
python ~/Scripts/process_directory.py agrC typing_until_sarA.csv typing_until_agrC.csv
python ~/Scripts/process_directory.py yycG typing_until_agrC.csv typing_until_yycG.csv
python ~/Scripts/process_directory.py psm.β1 typing_until_yycG.csv typing_until_psm.β1.csv
python ~/Scripts/process_directory.py hlb typing_until_psm.β1.csv typing_until_hlb.csv
python ~/Scripts/process_directory.py atlE typing_until_hlb.csv typing_until_atlE.csv
python ~/Scripts/process_directory.py sdrG typing_until_atlE.csv typing_until_sdrG.csv
python ~/Scripts/process_directory.py sdrH typing_until_sdrG.csv typing_until_sdrH.csv

python ~/Scripts/process_directory.py ebh typing_until_sdrH.csv typing_until_ebh.csv
python ~/Scripts/process_directory.py ebp typing_until_ebh.csv typing_until_ebp.csv
python ~/Scripts/process_directory.py tagB typing_until_ebp.csv typing_until_tagB.csv
python ~/Scripts/process_directory.py capC typing_until_tagB.csv typing_until_capC.csv
python ~/Scripts/process_directory.py sepA typing_until_capC.csv typing_until_sepA.csv
python ~/Scripts/process_directory.py dltA typing_until_sepA.csv typing_until_dltA.csv
python ~/Scripts/process_directory.py fmtC typing_until_dltA.csv typing_until_fmtC.csv
python ~/Scripts/process_directory.py lipA typing_until_fmtC.csv typing_until_lipA.csv

python ~/Scripts/process_directory.py sceD typing_until_lipA.csv typing_until_sceD.csv
python ~/Scripts/process_directory.py SE0760 typing_until_sceD.csv typing_until_SE0760.csv
python ~/Scripts/process_directory.py esp typing_until_SE0760.csv typing_until_esp.csv
python ~/Scripts/process_directory.py ecpA typing_until_esp.csv typing_until_ecpA.csv

perform blastn searching for short-sequencing

# -- makeblastdb --
for sample in HDRNA_01_K01 HDRNA_01_K02 HDRNA_01_K03 HDRNA_01_K04 HDRNA_01_K05 HDRNA_01_K06 HDRNA_01_K07 HDRNA_01_K08 HDRNA_01_K09 HDRNA_01_K10 HDRNA_02_K01 HDRNA_03_K01 HDRNA_03_K02 HDRNA_03_K03 HDRNA_03_K04 HDRNA_03_K05 HDRNA_03_K06 HDRNA_03_K07 HDRNA_03_K08 HDRNA_03_K09 HDRNA_03_K10 HDRNA_04_K01 HDRNA_05_K01 HDRNA_06_K01 HDRNA_06_K02 HDRNA_06_K03 HDRNA_06_K04 HDRNA_06_K05 HDRNA_06_K06 HDRNA_06_K07 HDRNA_06_K08 HDRNA_06_K09 HDRNA_06_K10 HDRNA_07_K01 HDRNA_07_K02 HDRNA_07_K03 HDRNA_07_K04 HDRNA_07_K05 HDRNA_07_K06 HDRNA_07_K07 HDRNA_07_K08 HDRNA_07_K09 HDRNA_07_K10 HDRNA_08_K01 HDRNA_08_K02 HDRNA_08_K03 HDRNA_08_K04 HDRNA_08_K05 HDRNA_08_K06 HDRNA_08_K07 HDRNA_08_K08 HDRNA_08_K09 HDRNA_08_K10 HDRNA_10_K01 HDRNA_11_K01 HDRNA_12_K01 HDRNA_12_K02 HDRNA_12_K03 HDRNA_12_K04 HDRNA_12_K05 HDRNA_12_K06 HDRNA_12_K07 HDRNA_12_K08 HDRNA_12_K09 HDRNA_12_K10 HDRNA_16_K01 HDRNA_16_K02 HDRNA_16_K03 HDRNA_16_K04 HDRNA_16_K05 HDRNA_16_K06 HDRNA_16_K07 HDRNA_16_K08 HDRNA_16_K09 HDRNA_16_K10 HDRNA_17_K01 HDRNA_17_K02 HDRNA_17_K03 HDRNA_17_K04 HDRNA_17_K05 HDRNA_17_K06 HDRNA_17_K07 HDRNA_17_K08 HDRNA_17_K09 HDRNA_17_K10 HDRNA_19_K01 HDRNA_19_K02 HDRNA_19_K03 HDRNA_19_K04 HDRNA_19_K05 HDRNA_19_K06 HDRNA_19_K07 HDRNA_19_K08 HDRNA_19_K09 HDRNA_19_K10 HDRNA_20_K01 HDRNA_20_K02 HDRNA_20_K03 HDRNA_20_K04 HDRNA_20_K05 HDRNA_20_K06 HDRNA_20_K07 HDRNA_20_K08 HDRNA_20_K09; do
        makeblastdb -in ../Data_Holger_S.epidermidis_short/shovill/${sample}/contigs.fa -dbtype nucl
done

#Note: The current -evalue is set to 1e-1; it might need to be made more stringent.
for id in gyrB    fumC gltA icd    apsS sigB sarA agrC yycG    psm.β1 hlb    atlE sdrG sdrH ebh ebp tagB    capC sepA dltA fmtC lipA sceD SE0760    esp ecpA; do
echo "mkdir ${id}"
echo "for sample in HDRNA_01_K01 HDRNA_01_K02 HDRNA_01_K03 HDRNA_01_K04 HDRNA_01_K05 HDRNA_01_K06 HDRNA_01_K07 HDRNA_01_K08 HDRNA_01_K09 HDRNA_01_K10 HDRNA_02_K01 HDRNA_03_K01 HDRNA_03_K02 HDRNA_03_K03 HDRNA_03_K04 HDRNA_03_K05 HDRNA_03_K06 HDRNA_03_K07 HDRNA_03_K08 HDRNA_03_K09 HDRNA_03_K10 HDRNA_04_K01 HDRNA_05_K01 HDRNA_06_K01 HDRNA_06_K02 HDRNA_06_K03 HDRNA_06_K04 HDRNA_06_K05 HDRNA_06_K06 HDRNA_06_K07 HDRNA_06_K08 HDRNA_06_K09 HDRNA_06_K10 HDRNA_07_K01 HDRNA_07_K02 HDRNA_07_K03 HDRNA_07_K04 HDRNA_07_K05 HDRNA_07_K06 HDRNA_07_K07 HDRNA_07_K08 HDRNA_07_K09 HDRNA_07_K10 HDRNA_08_K01 HDRNA_08_K02 HDRNA_08_K03 HDRNA_08_K04 HDRNA_08_K05 HDRNA_08_K06 HDRNA_08_K07 HDRNA_08_K08 HDRNA_08_K09 HDRNA_08_K10 HDRNA_10_K01 HDRNA_11_K01 HDRNA_12_K01 HDRNA_12_K02 HDRNA_12_K03 HDRNA_12_K04 HDRNA_12_K05 HDRNA_12_K06 HDRNA_12_K07 HDRNA_12_K08 HDRNA_12_K09 HDRNA_12_K10 HDRNA_16_K01 HDRNA_16_K02 HDRNA_16_K03 HDRNA_16_K04 HDRNA_16_K05 HDRNA_16_K06 HDRNA_16_K07 HDRNA_16_K08 HDRNA_16_K09 HDRNA_16_K10 HDRNA_17_K01 HDRNA_17_K02 HDRNA_17_K03 HDRNA_17_K04 HDRNA_17_K05 HDRNA_17_K06 HDRNA_17_K07 HDRNA_17_K08 HDRNA_17_K09 HDRNA_17_K10 HDRNA_19_K01 HDRNA_19_K02 HDRNA_19_K03 HDRNA_19_K04 HDRNA_19_K05 HDRNA_19_K06 HDRNA_19_K07 HDRNA_19_K08 HDRNA_19_K09 HDRNA_19_K10 HDRNA_20_K01 HDRNA_20_K02 HDRNA_20_K03 HDRNA_20_K04 HDRNA_20_K05 HDRNA_20_K06 HDRNA_20_K07 HDRNA_20_K08 HDRNA_20_K09; do"
        echo "blastn -db ../Data_Holger_S.epidermidis_short/shovill/\${sample}/contigs.fa -query ${id}.fasta -evalue 1e-1 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 > ./${id}/\${sample}.blastn"
echo "done"
done

#gyrB    fumC gltA icd    apsS sigB sarA agrC yycG    psm.β1 hlb    atlE sdrG sdrH ebh ebp tagB    capC sepA dltA fmtC lipA sceD SE0760    esp ecpA
python ~/Scripts/process_directory.py gyrB typing_104.csv typing_until_gyrB.csv
python ~/Scripts/process_directory.py fumC typing_until_gyrB.csv typing_until_fumC.csv
python ~/Scripts/process_directory.py gltA typing_until_fumC.csv typing_until_gltA.csv
python ~/Scripts/process_directory.py icd typing_until_gltA.csv typing_until_icd.csv

python ~/Scripts/process_directory.py apsS typing_until_icd.csv typing_until_apsS.csv
python ~/Scripts/process_directory.py sigB typing_until_apsS.csv typing_until_sigB.csv
python ~/Scripts/process_directory.py sarA typing_until_sigB.csv typing_until_sarA.csv
python ~/Scripts/process_directory.py agrC typing_until_sarA.csv typing_until_agrC.csv
python ~/Scripts/process_directory.py yycG typing_until_agrC.csv typing_until_yycG.csv
python ~/Scripts/process_directory.py psm.β1 typing_until_yycG.csv typing_until_psm.β1.csv
python ~/Scripts/process_directory.py hlb typing_until_psm.β1.csv typing_until_hlb.csv
python ~/Scripts/process_directory.py atlE typing_until_hlb.csv typing_until_atlE.csv
python ~/Scripts/process_directory.py sdrG typing_until_atlE.csv typing_until_sdrG.csv
python ~/Scripts/process_directory.py sdrH typing_until_sdrG.csv typing_until_sdrH.csv

python ~/Scripts/process_directory.py ebh typing_until_sdrH.csv typing_until_ebh.csv
python ~/Scripts/process_directory.py ebp typing_until_ebh.csv typing_until_ebp.csv
python ~/Scripts/process_directory.py tagB typing_until_ebp.csv typing_until_tagB.csv
python ~/Scripts/process_directory.py capC typing_until_tagB.csv typing_until_capC.csv
python ~/Scripts/process_directory.py sepA typing_until_capC.csv typing_until_sepA.csv
python ~/Scripts/process_directory.py dltA typing_until_sepA.csv typing_until_dltA.csv
python ~/Scripts/process_directory.py fmtC typing_until_dltA.csv typing_until_fmtC.csv
python ~/Scripts/process_directory.py lipA typing_until_fmtC.csv typing_until_lipA.csv

python ~/Scripts/process_directory.py sceD typing_until_lipA.csv typing_until_sceD.csv
python ~/Scripts/process_directory.py SE0760 typing_until_sceD.csv typing_until_SE0760.csv
python ~/Scripts/process_directory.py esp typing_until_SE0760.csv typing_until_esp.csv
python ~/Scripts/process_directory.py ecpA typing_until_esp.csv typing_until_ecpA.csv

plotTreeHeatmap

~/DATA/Data_PaulBongarts_S.epidermidis_HDRNA/plotTreeHeatmap

#FastTree -gtr -nt variants/snippy.core_without_reference.aln > plotTreeHeatmap/snippy.core.tree

#cp ../Data_Holger_S.epidermidis_short/plotTreeHeatmap/typing_104.csv .
#cp ../Data_Holger_S.epidermidis_short/results_HDRNA_01-20/variants/snippy.core.aln_104.tree .
#Run step 4

library(ggtree)
library(ggplot2)
library(dplyr)
setwd("/home/jhuang/DATA/Data_PaulBongarts_S.epidermidis_HDRNA/plotTreeHeatmap_short/")
# -- edit tree --
info <- read.csv("typing_until_ecpA.csv", sep="\t")

# Convert 'ST' column to factor with levels in the desired order
info$ST <- factor(info$ST, levels = c("5", "23", "35", "69", "86", "87", "130", "224", "487", "640", "none"))

info$name <- info$Isolate
tree <- read.tree("snippy.core.aln_104.tree")
cols <- c("5"="darkgreen", "23"="cornflowerblue", "35"="green","69"="orange","86"="magenta","87"="brown", "130"="purple","224"="darkkhaki", "487"="cyan", "640"="pink","none"="grey")
#"2"="",
#"7"="seagreen3",
#"9"="tan","14"="red",  "17"="navyblue",
#"73"="pink","81"="purple",
#"89"="darksalmon",
#"454"="blue","487"="cyan", "558"="skyblue2", "766"="blueviolet"
#"5", "23", "35", "69", "86", "87", "130", "224", "487", "640", "none"

heatmapData2 <- info %>% select(Isolate,          gyrB,    fumC, gltA, icd,    apsS, sigB, sarA, agrC, yycG,    psm.β1, hlb,    atlE, sdrG, sdrH, ebh, ebp, tagB,    capC, sepA, dltA, fmtC, lipA, sceD, SE0760,    esp, ecpA)  #ST,
rn <- heatmapData2$Isolate
heatmapData2$Isolate <- NULL
heatmapData2 <- as.data.frame(sapply(heatmapData2, as.character))
rownames(heatmapData2) <- rn

#heatmap.colours <- c("darkred", "darkblue", "darkgreen", "grey")
#names(heatmap.colours) <- c("MT880870","MT880871","MT880872","-")
#heatmap.colours <- c("cornflowerblue","darkgreen","seagreen3","tan","red",  "navyblue", "gold",     "green","orange","pink","purple","magenta","brown", "darksalmon","chocolate4","darkkhaki", "azure3", "maroon","lightgreen",     "blue","cyan", "skyblue2", "blueviolet",       "darkred", "darkblue", "darkgreen", "grey")
#names(heatmap.colours) <- c("2","5","7","9","14", "17","23",   "35","59","73", "81","86","87","89","130","190","290", "297","325",    "454","487","558","766",       "MT880870","MT880871","MT880872","-")

#TEMP_DEACTIVATED!  heatmap.colours <- c("cornflowerblue","darkgreen","seagreen3","tan","red",  "navyblue", "purple",     "green","cyan",       "darkred", "darkblue", "darkgreen", "grey",  "darkgreen", "grey")
#TEMP_DEACTIVATED!  names(heatmap.colours) <- c("SCCmec_type_II(2A)", "SCCmec_type_III(3A)", "SCCmec_type_III(3A) and SCCmec_type_VIII(4A)", "SCCmec_type_IV(2B)", "SCCmec_type_IV(2B&5)", "SCCmec_type_IV(2B) and SCCmec_type_VI(4B)", "SCCmec_type_IVa(2B)", "SCCmec_type_IVb(2B)", "SCCmec_type_IVg(2B)",  "I", "II", "III", "none", "+","-")

## The heatmap colours should correspond to the factor levels
#heatmap.colours <- #setNames(c("cornflowerblue","darkgreen","seagreen3","tan","red","green","orange","pink","brown","magenta","cyan",
#"magenta","navyblue","cornflowerblue","gold","orange","darkgreen", "green", "seagreen3", #"chocolate4","brown","purple","pink", "tan","cyan","red"),
#c("5", "23", "35", "69", "86", "87", "130", "224", "487", "640", "none", "P01", "P02", "P03", "P04", "P05", "P06", "P07", #"P08", "P10", "P11", "P12", "P16", "P17", "P19", "P20"))

#mydat$Regulation <- factor(mydat$Regulation, levels=c("up","down"))
#circular
p <- ggtree(tree, layout='circular', branch.length='none') %<+% info + geom_tippoint(aes(color=ST)) + scale_color_manual(values=cols) + geom_tiplab2(aes(label=name), offset=1)
png("ggtree.png", width=1260, height=1260)
#svg("ggtree.svg", width=1260, height=1260)
p
dev.off()
#gheatmap(p, heatmapData2, width=0.1, colnames_position="top", colnames_angle=90, colnames_offset_y = 0.1, hjust=0.5, font.size=4, offset = 5) + scale_fill_manual(values=heatmap.colours) +  theme(legend.text = element_text(size = 14)) + theme(legend.title = element_text(size = 14)) + guides(fill=guide_legend(title=""), color = guide_legend(override.aes = list(size = 5)))
png("ggtree_and_gheatmap_selected_genes.png", width=1690, height=1400)
#svg("ggtree_and_gheatmap_mibi_selected_genes.svg", width=17, height=15)
gheatmap(p, heatmapData2, width=2, colnames_position="top", colnames_angle=90, colnames_offset_y = 2.0, hjust=0.7, font.size=4, offset = 17) + scale_fill_manual(values=heatmap.colours) +  theme(legend.text = element_text(size = 16)) + theme(legend.title = element_text(size = 16)) + guides(fill=guide_legend(title=""), color = guide_legend(override.aes = list(size = 5)))
dev.off()

Response

As mentioned earlier, the presence-absence matrix of genes was generated using a BLASTn comparison between the genome and the gene sequences. Ensuring the accuracy of these gene sequences is crucial for reliable results. I prepared the sequences by searching GeneBank using the gene names, but there may be some ambiguity where the same gene name corresponds to different sequences.

Attached, you will find the sequences for the 11 genes that were not identified in the isolates: gltA, psm-beta1, hlb, ebp, tagB, capA, sepA, fmtC, sceD, esp, and ecpA. I think it would be helpful if you could verify whether these sequences correspond to the genes we are looking for.

In addition, I have attached a figure showing all samples with short-read sequencing. As illustrated in the plot, SCCmec and agr typing are still pending. If necessary, I can perform the typing and include the information in the plot. Alternatively, we could focus exclusively on the 10 samples obtained through long-read sequencing.

Please let me know how you would prefer to proceed.

RNA-seq Tam on Acinetobacter baumannii strain ATCC 19606 CP059040.1 (Data_Tam_RNAseq_2024)

Leave a reply

http://xgenes.com/article/article-content/209/rna-seq-skin-organoids-on-grch38-chrhsv1-final/ http://xgenes.com/article/article-content/157/prepare-virus-gtf-for-nextflow-run/

Methods

Data was processed using nf-core/rnaseq v3.12.0 (doi: https://doi.org/10.5281/zenodo.1400710) of the nf-core collection of workflows (Ewels et al., 2020).

The pipeline was executed with Nextflow v22.10.5 (Di Tommaso et al., 2017) with the following command:

nextflow run rnaseq/main.nf –input samplesheet.csv –outdir results –fasta /home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040.fasta –gff /home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040.gff -profile docker -resume –max_cpus 55 –max_memory 512.GB –max_time 2400.h –save_align_intermeds –save_unaligned –save_reference –aligner star_salmon –gtf_group_features gene_id –gtf_extra_attributes gene_name –featurecounts_group_type gene_biotype –featurecounts_feature_type transcript

Preparing raw data

They are wildtype strains grown in different medium.
AUM - artificial urine medium
Urine - human urine
MHB - Mueller-Hinton broth
AUM（人工尿液培养基）：pH值、营养成分、无菌性、渗透压、温度、污染物。
Urine（人类尿液）：pH值、比重、温度、污染物、化学成分、微生物负荷。
MHB（Mueller-Hinton培养基）：pH值、无菌性、营养成分、温度、渗透压、抗生素浓度。

mkdir raw_data; cd raw_data
ln -s ../X101SC24105589-Z01-J001/01.RawData/AUM-1/AUM-1_1.fq.gz AUM_r1_R1.fq.gz
ln -s ../X101SC24105589-Z01-J001/01.RawData/AUM-1/AUM-1_2.fq.gz AUM_r1_R2.fq.gz
ln -s ../X101SC24105589-Z01-J001/01.RawData/AUM-2/AUM-2_1.fq.gz AUM_r2_R1.fq.gz
ln -s ../X101SC24105589-Z01-J001/01.RawData/AUM-2/AUM-2_2.fq.gz AUM_r2_R2.fq.gz
ln -s ../X101SC24105589-Z01-J001/01.RawData/AUM-3/AUM-3_1.fq.gz AUM_r3_R1.fq.gz
ln -s ../X101SC24105589-Z01-J001/01.RawData/AUM-3/AUM-3_2.fq.gz AUM_r3_R2.fq.gz

ln -s ../X101SC24105589-Z01-J001/01.RawData/MHB-1/MHB-1_1.fq.gz MHB_r1_R1.fq.gz
ln -s ../X101SC24105589-Z01-J001/01.RawData/MHB-1/MHB-1_2.fq.gz MHB_r1_R2.fq.gz
ln -s ../X101SC24105589-Z01-J001/01.RawData/MHB-2/MHB-2_1.fq.gz MHB_r2_R1.fq.gz
ln -s ../X101SC24105589-Z01-J001/01.RawData/MHB-2/MHB-2_2.fq.gz MHB_r2_R2.fq.gz
ln -s ../X101SC24105589-Z01-J001/01.RawData/MHB-3/MHB-3_1.fq.gz MHB_r3_R1.fq.gz
ln -s ../X101SC24105589-Z01-J001/01.RawData/MHB-3/MHB-3_2.fq.gz MHB_r3_R2.fq.gz

ln -s ../X101SC24105589-Z01-J001/01.RawData/Urine-1/Urine-1_1.fq.gz Urine_r1_R1.fq.gz
ln -s ../X101SC24105589-Z01-J001/01.RawData/Urine-1/Urine-1_2.fq.gz Urine_r1_R2.fq.gz
ln -s ../X101SC24105589-Z01-J001/01.RawData/Urine-2/Urine-2_1.fq.gz Urine_r2_R1.fq.gz
ln -s ../X101SC24105589-Z01-J001/01.RawData/Urine-2/Urine-2_2.fq.gz Urine_r2_R2.fq.gz
ln -s ../X101SC24105589-Z01-J001/01.RawData/Urine-3/Urine-3_1.fq.gz Urine_r3_R1.fq.gz
ln -s ../X101SC24105589-Z01-J001/01.RawData/Urine-3/Urine-3_2.fq.gz Urine_r3_R2.fq.gz

(Optional) using trinity to find the most closely reference

Trinity --seqType fq --max_memory 50G --left trimmed/wt_r1_R1.fastq.gz  --right trimmed/wt_r1_R2.fastq.gz --CPU 12

#https://www.genome.jp/kegg/tables/br08606.html#prok
acb     KGB     Acinetobacter baumannii ATCC 17978  2007    GenBank
abm     KGB     Acinetobacter baumannii SDF     2008    GenBank
aby     KGB     Acinetobacter baumannii AYE     2008    GenBank
abc     KGB     Acinetobacter baumannii ACICU   2008    GenBank
abn     KGB     Acinetobacter baumannii AB0057  2008    GenBank
abb     KGB     Acinetobacter baumannii AB307-0294  2008    GenBank
abx     KGB     Acinetobacter baumannii 1656-2  2012    GenBank
abz     KGB     Acinetobacter baumannii MDR-ZJ06    2012    GenBank
abr     KGB     Acinetobacter baumannii MDR-TJ  2012    GenBank
abd     KGB     Acinetobacter baumannii TCDC-AB0715     2012    GenBank
abh     KGB     Acinetobacter baumannii TYTH-1  2012    GenBank
abad    KGB     Acinetobacter baumannii D1279779    2013    GenBank
abj     KGB     Acinetobacter baumannii BJAB07104   2013    GenBank
abab    KGB     Acinetobacter baumannii BJAB0715    2013    GenBank
abaj    KGB     Acinetobacter baumannii BJAB0868    2013    GenBank
abaz    KGB     Acinetobacter baumannii ZW85-1  2013    GenBank
abk     KGB     Acinetobacter baumannii AbH12O-A2   2014    GenBank
abau    KGB     Acinetobacter baumannii AB030   2014    GenBank
abaa    KGB     Acinetobacter baumannii AB031   2014    GenBank
abw     KGB     Acinetobacter baumannii AC29    2014    GenBank
abal    KGB     Acinetobacter baumannii LAC-4   2015    GenBank
#Note that the Acinetobacter baumannii strain ATCC 19606 chromosome, complete genome (GenBank: CP059040.1) was choosen as reference!

Downloading CP059040.fasta and CP059040.gff from GenBank

(Optional) Preparing CP059040.fasta, CP059040_gene.gff3 and CP059040.bed

#Reference genome: https://www.ncbi.nlm.nih.gov/nuccore/CP059040
cp /media/jhuang/Elements2/Data_Tam_RNASeq3/CP059040.fasta .     # Elements (Anna C.arnes)
cp /media/jhuang/Elements2/Data_Tam_RNASeq3/CP059040_gene.gff3 .
cp /media/jhuang/Elements2/Data_Tam_RNASeq3/CP059040_gene.gtf .
cp /media/jhuang/Elements2/Data_Tam_RNASeq3/CP059040.bed .
rsync -a -P CP059040.fasta jhuang@hamm:~/DATA/Data_Tam_RNAseq_2024/
rsync -a -P CP059040_gene.gff3 jhuang@hamm:~/DATA/Data_Tam_RNAseq_2024/
rsync -a -P CP059040.bed jhuang@hamm:~/DATA/Data_Tam_RNAseq_2024/
(base) jhuang@WS-2290C:/media/jhuang/Elements2/Data_Tam_RNASeq3$ find . -name "CP059040*"
./CP059040.fasta
./CP059040.bed
./CP059040.gb
./CP059040.gff3
./CP059040.gff3_backup
./CP059040_full.gb
./CP059040_gene.gff3
./CP059040_gene.gtf
./CP059040_gene_old.gff3
./CP059040_rRNA.gff3
./CP059040_rRNA_v.gff3

# ---- REF: Acinetobacter baumannii ATCC 17978 (DEBUG, gene_name failed) ----
#gffread -E -F -T GCA_000015425.1_ASM1542v1_genomic.gff -o GCA_000015425.1_ASM1542v1_genomic.gtf_
#grep "CDS" GCA_000015425.1_ASM1542v1_genomic.gtf_ > GCA_000015425.1_ASM1542v1_genomic.gtf
#sed -i -e "s/\tCDS\t/\texon\t/g" GCA_000015425.1_ASM1542v1_genomic.gtf
#gffread -E -F --bed GCA_000015425.1_ASM1542v1_genomic.gtf -o GCA_000015425.1_ASM1542v1_genomic.bed

grep "locus_tag" GCA_000015425.1_ASM1542v1_genomic.gtf_ > GCA_000015425.1_ASM1542v1_genomic.gtf
sed -i -e "s/\ttranscript\t/\texon\t/g" GCA_000015425.1_ASM1542v1_genomic.gtf # or using fc_count_type=transcript
sed -i -e "s/\tgene_name\t/\tName\t/g" GCA_000015425.1_ASM1542v1_genomic.gtf
gffread -E -F --bed GCA_000015425.1_ASM1542v1_genomic.gtf -o GCA_000015425.1_ASM1542v1_genomic.bed
#grep "gene_name" GCA_000015425.1_ASM1542v1_genomic.gtf | wc -l  #69=3887-3803

cp CP059040.gff3 CP059040_backup.gff3
sed -i -e "s/\tGenbank\tgene\t/\tGenbank_gene\t/g" CP059040.gff3
grep "Genbank_gene" CP059040.gff3 > CP059040_gene.gff3
sed -i -e "s/\tGenbank_gene\t/\tGenbank\tgene\t/g" CP059040_gene.gff3

#3796-3754=42--> they are pseudogene since grep "pseudogene" CP059040.gff3 | wc -l = 42
# --------------------------------------------------------------------------------------------------------------------------------------------------
# ---------- PREPARING gff3 file including gene_biotype=protein_coding+gene_biotype=tRNA = total(3754)) and gene_biotype=pseudogene(42) ------------
cp CP059040.gff3 CP059040_backup.gff3
sed -i -e "s/\tGenbank\tgene\t/\tGenbank_gene\t/g" CP059040.gff3
grep "Genbank_gene" CP059040.gff3 > CP059040_gene.gff3
sed -i -e "s/\tGenbank_gene\t/\tGenbank\tgene\t/g" CP059040_gene.gff3
grep "gene_biotype=pseudogene" CP059040.gff3_backup >> CP059040_gene.gff3    #-->3796

#The whole point of the GTF format was to standardise certain aspects that are left open in GFF. Hence, there are many different valid ways to encode the same information in a valid GFF format, and any parser or converter needs to be written specifically for the choices the author of the GFF file made. For example, a GTF file requires the gene ID attribute to be called "gene_id", while in GFF files, it may be "ID", "Gene", something different, or completely missing.
# from gff3 to gtf
sed -i -e "s/\tID=gene-/\tgene_id \"/g" CP059040_gene.gtf
sed -i -e "s/;/\"; /g" CP059040_gene.gtf
sed -i -e "s/=/=\"/g" CP059040_gene.gtf

#sed -i -e "s/\n/\"\n/g" CP059040_gene.gtf
#using editor instead!

#The following is GTF-format.
CP000521.1      Genbank exon    95      1492    .       +       .       transcript_id "gene0"; gene_id "gene0"; Name "A1S_0001"; gbkey "Gene"; gene_biotype "protein_coding"; locus_tag "A1S_0001";

#NZ_MJHA01000001.1       RefSeq  region  1       8663    .       +       .       ID=id0;Dbxref=taxon:575584;Name=unnamed1;collected-by=IG Schaub;collection-date=1948;country=USA: Vancouver;culture-collection=ATCC:19606;gbkey=Src;genome=plasmid;isolation-source=urine;lat-lon=37.53 N 75.4 W;map=unlocalized;mol_type=genomic DNA;nat-host=Homo sapiens;plasmid-name=unnamed1;strain=ATCC 19606;type-material=type strain of Acinetobacter baumannii
#NZ_MJHA01000001.1       RefSeq  gene    228     746     .       -       .       ID=gene0;Name=BIT33_RS00005;gbkey=Gene;gene_biotype=protein_coding;locus_tag=BIT33_RS00005;old_locus_tag=BIT33_18795
#NZ_MJHA01000001.1       Protein Homology        CDS     228     746     .       -       0       ID=cds0;Parent=gene0;Dbxref=Genbank:WP_000839337.1;Name=WP_000839337.1;gbkey=CDS;inference=COORDINATES: similar to AA sequence:RefSeq:WP_000839337.1;product=hypothetical protein;protein_id=WP_000839337.1;transl_table=11

##gff-version 3
##sequence-region CP059040.1 1 3980852
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=470

gffread -E -F --bed CP059040.gff3 -o CP059040.bed    #-->3796
##prepare the GTF-format (see above) --> ERROR! ----> using CP059040.gff3
##stringtie adeIJ.abx_r1.sorted.bam -o adeIJ.abx_r1.sorted_transcripts.gtf -v -G /media/jhuang/Elements/Data_Tam_RNASeq3/CP059040.gff3 -A adeIJ.abx_r1.sorted.gene_abund.txt -C adeIJ.abx_r1.sorted.bam.cov_refs.gtf -e -b adeIJ.abx_r1.sorted_ballgown
#[01/21 10:57:46] Loading reference annotation (guides)..
#GFF warning: merging adjacent/overlapping segments of gene-H0N29_00815 on CP059040.1 (179715-179786, 179788-180810)
#[01/21 10:57:46] 3796 reference transcripts loaded.
#Default stack size for threads: 8388608
#WARNING: no reference transcripts found for genomic sequence "gi|1906906720|gb|CP059040.1|"! (mismatched reference names?)
#WARNING: no reference transcripts were found for the genomic sequences where reads were mapped!
#Please make sure the -G annotation file uses the same naming convention for the genome sequences.
#[01/21 10:58:30] All threads finished.

#  ERROR: failed to find the gene identifier attribute in the 9th column of the provided GTF file.
#  The specified gene identifier attribute is 'Name'
#  An example of attributes included in your GTF annotation is 'ID=exon-H0N29_00075-1;Parent=rna-H0N29_00075;gbkey=rRNA;locus_tag=H0N29_00075;product=16S ribosomal RNA'
#  The program has to termin

#  ERROR: failed to find the gene identifier attribute in the 9th column of the provided GTF file.
#  The specified gene identifier attribute is 'gene_biotype'
#  An example of attributes included in your GTF annotation is 'ID=exon-H0N29_00075-1;Parent=rna-H0N29_00075;gbkey=rRNA;locus_tag=H0N29_00075;product=16S ribosomal RNA'
#  The program has to terminate.

#grep "ID=cds-" CP059040.gff3 | wc -l
#grep "ID=exon-" CP059040.gff3 | wc -l
#grep "ID=gene-" CP059040.gff3 | wc -l   #the same as H0N29_18980/5=3796
grep "gbkey=" CP059040.gff3 | wc -l  7695
grep "ID=id-" CP059040.gff3 | wc -l  5
grep "locus_tag=" CP059040.gff3 | wc -l    7689
#...
cds   3701                             locus_tag=xxxx, no gene_biotype
exon   96                              locus_tag=xxxx, no gene_biotype
gene   3796                            locus_tag=xxxx, gene_biotype=xxxx,
id  (riboswitch+direct_repeat,5)       both no --> ignoring them!!  # grep "ID=id-" CP059040.gff3
rna    96                              locus_tag=xxxx, no gene_biotype
------------------
    7694

cp CP059040.gff3_backup CP059040.gff3
grep "^##" CP059040.gff3 > CP059040_gene.gff3
grep "ID=gene" CP059040.gff3 >> CP059040_gene.gff3
#!!!!VERY_IMPORTANT!!!!: change type '\tCDS\t' to '\texon\t'!
sed -i -e "s/\tgene\t/\texon\t/g" CP059040_gene.gff3

Preparing the directory trimmed

mkdir trimmed trimmed_unpaired;
for sample_id in AUM_r1 AUM_r2 AUM_r3 Urine_r1 Urine_r2 Urine_r3 MHB_r1 MHB_r2 MHB_r3; do \
for sample_id in MHB_r1 MHB_r2 MHB_r3; do \
        java -jar /home/jhuang/Tools/Trimmomatic-0.36/trimmomatic-0.36.jar PE -threads 100 raw_data/${sample_id}_R1.fq.gz raw_data/${sample_id}_R2.fq.gz trimmed/${sample_id}_R1.fq.gz trimmed_unpaired/${sample_id}_R1.fq.gz trimmed/${sample_id}_R2.fq.gz trimmed_unpaired/${sample_id}_R2.fq.gz ILLUMINACLIP:/home/jhuang/Tools/Trimmomatic-0.36/adapters/TruSeq3-PE-2.fa:2:30:10:8:TRUE LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36 AVGQUAL:20; done 2> trimmomatic_pe.log;
done

Preparing samplesheet.csv

sample,fastq_1,fastq_2,strandedness
AUM_r1,AUM_r1_R1.fq.gz,AUM_r1_R2.fq.gz,auto
AUM_r2,AUM_r2_R1.fq.gz,AUM_r2_R2.fq.gz,auto
AUM_r3,AUM_r3_R1.fq.gz,AUM_r3_R2.fq.gz,auto
MHB_r1,MHB_r1_R1.fq.gz,MHB_r1_R2.fq.gz,auto
MHB_r2,MHB_r2_R1.fq.gz,MHB_r2_R2.fq.gz,auto
MHB_r3,MHB_r3_R1.fq.gz,MHB_r3_R2.fq.gz,auto
Urine_r1,Urine_r1_R1.fq.gz,Urine_r1_R2.fq.gz,auto
Urine_r2,Urine_r2_R1.fq.gz,Urine_r2_R2.fq.gz,auto
Urine_r3,Urine_r3_R1.fq.gz,Urine_r3_R2.fq.gz,auto

nextflow run

#Example1: http://xgenes.com/article/article-content/157/prepare-virus-gtf-for-nextflow-run/

docker pull nfcore/rnaseq
ln -s /home/jhuang/Tools/nf-core-rnaseq-3.12.0/ rnaseq

#Default: --gtf_group_features 'gene_id'  --gtf_extra_attributes 'gene_name' --featurecounts_group_type 'gene_biotype' --featurecounts_feature_type 'exon'
#(host_env) !NOT_WORKING! jhuang@WS-2290C:~/DATA/Data_Tam_RNAseq_2024$ /usr/local/bin/nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results    --fasta "/home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040.fasta" --gff "/home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040.gff"        -profile docker -resume  --max_cpus 55 --max_memory 512.GB --max_time 2400.h    --save_align_intermeds --save_unaligned --save_reference    --aligner 'star_salmon'    --gtf_group_features 'gene_id'  --gtf_extra_attributes 'gene_name' --featurecounts_group_type 'gene_biotype' --featurecounts_feature_type 'transcript'

# -- DEBUG_1 (CDS --> exon in CP059040.gff) --
#Checking the record (see below) in results/genome/CP059040.gtf
#In ./results/genome/CP059040.gtf e.g. "CP059040.1      Genbank transcript      1       1398    .       +       .       transcript_id "gene-H0N29_00005"; gene_id "gene-H0N29_00005"; gene_name "dnaA"; Name "dnaA"; gbkey "Gene"; gene "dnaA"; gene_biotype "protein_coding"; locus_tag "H0N29_00005";"
#--featurecounts_feature_type 'transcript' returns only the tRNA results
#Since the tRNA records have "transcript and exon". In gene records, we have "transcript and CDS". replace the CDS with exon

grep -P "\texon\t" CP059040.gff | sort | wc -l    #96
grep -P "cmsearch\texon\t" CP059040.gff | wc -l    #=10  ignal recognition particle sRNA small typ, transfer-messenger RNA, 5S ribosomal RNA
grep -P "Genbank\texon\t" CP059040.gff | wc -l    #=12  16S and 23S ribosomal RNA
grep -P "tRNAscan-SE\texon\t" CP059040.gff | wc -l    #tRNA 74
wc -l star_salmon/AUM_r3/quant.genes.sf  #--featurecounts_feature_type 'transcript' results in 96 records!

grep -P "\tCDS\t" CP059040.gff | wc -l  #3701
sed 's/\tCDS\t/\texon\t/g' CP059040.gff > CP059040_m.gff
grep -P "\texon\t" CP059040_m.gff | sort | wc -l  #3797

# -- DEBUG_2: combination of 'CP059040_m.gff' and 'exon' results in ERROR, using 'transcript' instead!
--gff "/home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040_m.gff" --featurecounts_feature_type 'transcript'

# ---- SUCCESSFUL with directly downloaded gff3 and fasta from NCBI using docker after replacing 'CDS' with 'exon' ----
(host_env) /usr/local/bin/nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results    --fasta "/home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040.fasta" --gff "/home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040_m.gff"        -profile docker -resume  --max_cpus 55 --max_memory 512.GB --max_time 2400.h    --save_align_intermeds --save_unaligned --save_reference    --aligner 'star_salmon'    --gtf_group_features 'gene_id'  --gtf_extra_attributes 'gene_name' --featurecounts_group_type 'gene_biotype' --featurecounts_feature_type 'transcript'

# -- DEBUG_3: make sure the header of fasta is the same to the *_m.gff file

Import data and pca-plot

#mamba activate r_env

#install.packages("ggfun")
# Import the required libraries
library("AnnotationDbi")
library("clusterProfiler")
library("ReactomePA")
library(gplots)
library(tximport)
library(DESeq2)
#library("org.Hs.eg.db")
library(dplyr)
library(tidyverse)
#install.packages("devtools")
#devtools::install_version("gtable", version = "0.3.0")
library(gplots)
library("RColorBrewer")
#install.packages("ggrepel")
library("ggrepel")
# install.packages("openxlsx")
library(openxlsx)
library(EnhancedVolcano)
library(DESeq2)

setwd("~/DATA/Data_Tam_RNAseq_2024/results/star_salmon")
# Define paths to your Salmon output quantification files
files <- c("AUM_r1" = "./AUM_r1/quant.sf",
        "AUM_r2" = "./AUM_r2/quant.sf",
        "AUM_r3" = "./AUM_r3/quant.sf",
        "Urine_r1" = "./Urine_r1/quant.sf",
        "Urine_r2" = "./Urine_r2/quant.sf",
        "Urine_r3" = "./Urine_r3/quant.sf",
        "MHB_r1" = "./MHB_r1/quant.sf",
        "MHB_r2" = "./MHB_r2/quant.sf",
        "MHB_r3" = "./MHB_r3/quant.sf")
# Import the transcript abundance data with tximport
txi <- tximport(files, type = "salmon", txIn = TRUE, txOut = TRUE)
# Define the replicates and condition of the samples
replicate <- factor(c("r1", "r2", "r3", "r1", "r2", "r3", "r1", "r2", "r3"))
condition <- factor(c("AUM","AUM","AUM", "Urine","Urine","Urine", "MHB","MHB","MHB"))
# Define the colData for DESeq2
colData <- data.frame(condition=condition, replicate=replicate, row.names=names(files))
# -- transcript-level count data (x2) --
# Create DESeqDataSet object
dds <- DESeqDataSetFromTximport(txi, colData=colData, design=~condition)
write.csv(counts(dds), file="transcript_counts.csv")
# -- gene-level count data (x2) --
# Read in the tx2gene map from salmon_tx2gene.tsv
tx2gene <- read.table("salmon_tx2gene.tsv", header=FALSE, stringsAsFactors=FALSE)
# Set the column names
colnames(tx2gene) <- c("transcript_id", "gene_id", "gene_name")
# Remove the gene_name column if not needed
tx2gene <- tx2gene[,1:2]
# Import and summarize the Salmon data with tximport
txi <- tximport(files, type = "salmon", tx2gene = tx2gene, txOut = FALSE)
# Continue with the DESeq2 workflow as before...
colData <- data.frame(condition=condition, replicate=replicate, row.names=names(files))
dds <- DESeqDataSetFromTximport(txi, colData=colData, design=~condition+replicate)
#dds <- dds[rowSums(counts(dds) > 3) > 2, ]    #3796->3487
write.csv(counts(dds, normalized=FALSE), file="gene_counts.csv")
dim(counts(dds))
head(counts(dds), 10)
rld <- rlogTransformation(dds)

#We don't need to run DESeq(dds) before estimateSizeFactors(dds). In fact, the typical workflow in DESeq2 is the opposite: we usually run estimateSizeFactors(dds) (and other preprocessing functions) before running the main DESeq(dds) function.
#The estimateSizeFactors function is used to calculate size factors for normalization, which corrects for differences in library size (i.e., the number of read counts) between samples. This normalization step is crucial to ensure that differences in gene expression aren't merely due to differences in sequencing depth between samples.
#The DESeq function, on the other hand, performs the main differential expression analysis, comparing gene expression between different conditions or groups.
#So, the typical workflow is:
#  - Create the DESeqDataSet object.
#  - Use estimateSizeFactors to normalize for library size.
#  - (Optionally, estimate dispersion with estimateDispersions if not using the full DESeq function later.)
#  - Use DESeq for the differential expression analysis.
#  - However, it's worth noting that if you run the main DESeq function directly after creating the DESeqDataSet object, it will automatically perform the normalization (using estimateSizeFactors) and dispersion estimation steps for you. In that case, there's no need to run estimateSizeFactors separately before DESeq.

# draw simple pca and heatmap
#mat <- assay(rld)
#mm <- model.matrix(~condition, colData(rld))
#mat <- limma::removeBatchEffect(mat, batch=rld$batch, design=mm)
#assay(rld) <- mat
# -- pca --
png("pca.png", 1200, 800)
plotPCA(rld, intgroup=c("condition"))
dev.off()
# -- heatmap --
png("heatmap.png", 1200, 800)
distsRL <- dist(t(assay(rld)))
mat <- as.matrix(distsRL)
hc <- hclust(distsRL)
hmcol <- colorRampPalette(brewer.pal(9,"GnBu"))(100)
heatmap.2(mat, Rowv=as.dendrogram(hc),symm=TRUE, trace="none",col = rev(hmcol), margin=c(13, 13))
dev.off()

(Optional (ERROR–>need to be debugged!) ) estimate size factors and dispersion values.

#Size Factors: These are used to normalize the read counts across different samples. The size factor for a sample accounts for differences in sequencing depth (i.e., the total number of reads) and other technical biases between samples. After normalization with size factors, the counts should be comparable across samples. Size factors are usually calculated in a way that they reflect the median or mean ratio of gene expression levels between samples, assuming that most genes are not differentially expressed.
#Dispersion: This refers to the variability or spread of gene expression measurements. In RNA-seq data analysis, each gene has its own dispersion value, which reflects how much the counts for that gene vary between different samples, more than what would be expected just due to the Poisson variation inherent in counting. Dispersion is important for accurately modeling the data and for detecting differentially expressed genes.
#So in summary, size factors are specific to samples (used to make counts comparable across samples), and dispersion values are specific to genes (reflecting variability in gene expression).

sizeFactors(dds)
#NULL
# Estimate size factors
dds <- estimateSizeFactors(dds)
# Estimate dispersions
dds <- estimateDispersions(dds)
#> sizeFactors(dds)

#control_r1 control_r2  HSV.d2_r1  HSV.d2_r2  HSV.d4_r1  HSV.d4_r2  HSV.d6_r1
#2.3282468  2.0251928  1.8036883  1.3767551  0.9341929  1.0911693  0.5454526
#HSV.d6_r2  HSV.d8_r1  HSV.d8_r2
#0.4604461  0.5799834  0.6803681

# (DEBUG) If avgTxLength is Necessary
#To simplify the computation and ensure sizeFactors are calculated:
assays(dds)$avgTxLength <- NULL
dds <- estimateSizeFactors(dds)
sizeFactors(dds)
#If you want to retain avgTxLength but suspect it is causing issues, you can explicitly instruct DESeq2 to compute size factors without correcting for library size with average transcript lengths:
dds <- estimateSizeFactors(dds, controlGenes = NULL, use = FALSE)
sizeFactors(dds)

# If alone with virus data, the following BUG occured:
#Still NULL --> BUG --> using manual calculation method for sizeFactor calculation!
                    HeLa_TO_r1                      HeLa_TO_r2
                    0.9978755                       1.1092227
data.frame(genes = rownames(dds), dispersions = dispersions(dds))

#Given the raw counts, the control_r1 and control_r2 samples seem to have a much lower sequencing depth (total read count) than the other samples. Therefore, when normalization methods are applied, the normalization factors for these control samples will be relatively high, boosting the normalized counts.
1/0.9978755=1.002129023
1/1.1092227=
#bamCoverage --bam ../markDuplicates/${sample}Aligned.sortedByCoord.out.bam -o ${sample}_norm.bw --binSize 10 --scaleFactor  --effectiveGenomeSize 2864785220
bamCoverage --bam ../markDuplicates/HeLa_TO_r1Aligned.sortedByCoord.out.markDups.bam -o HeLa_TO_r1.bw --binSize 10 --scaleFactor 1.002129023     --effectiveGenomeSize 2864785220
bamCoverage --bam ../markDuplicates/HeLa_TO_r2Aligned.sortedByCoord.out.markDups.bam -o HeLa_TO_r2.bw --binSize 10 --scaleFactor  0.901532217        --effectiveGenomeSize 2864785220

raw_counts <- counts(dds)
normalized_counts <- counts(dds, normalized=TRUE)
#write.table(raw_counts, file="raw_counts.txt", sep="\t", quote=F, col.names=NA)
#write.table(normalized_counts, file="normalized_counts.txt", sep="\t", quote=F, col.names=NA)
#convert bam to bigwig using deepTools by feeding inverse of DESeq’s size Factor
estimSf <- function (cds){
    # Get the count matrix
    cts <- counts(cds)
    # Compute the geometric mean
    geomMean <- function(x) prod(x)^(1/length(x))
    # Compute the geometric mean over the line
    gm.mean  <-  apply(cts, 1, geomMean)
    # Zero values are set to NA (avoid subsequentcdsdivision by 0)
    gm.mean[gm.mean == 0] <- NA
    # Divide each line by its corresponding geometric mean
    # sweep(x, MARGIN, STATS, FUN = "-", check.margin = TRUE, ...)
    # MARGIN: 1 or 2 (line or columns)
    # STATS: a vector of length nrow(x) or ncol(x), depending on MARGIN
    # FUN: the function to be applied
    cts <- sweep(cts, 1, gm.mean, FUN="/")
    # Compute the median over the columns
    med <- apply(cts, 2, median, na.rm=TRUE)
    # Return the scaling factor
    return(med)
}
#https://dputhier.github.io/ASG/practicals/rnaseq_diff_Snf2/rnaseq_diff_Snf2.html
#http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#data-transformations-and-visualization
#https://hbctraining.github.io/DGE_workshop/lessons/02_DGE_count_normalization.html
#https://hbctraining.github.io/DGE_workshop/lessons/04_DGE_DESeq2_analysis.html
#https://genviz.org/module-04-expression/0004/02/01/DifferentialExpression/
#DESeq2’s median of ratios [1]
#EdgeR’s trimmed mean of M values (TMM) [2]
#http://www.nathalievialaneix.eu/doc/html/TP1_normalization.html  #very good website!
test_normcount <- sweep(raw_counts, 2, sizeFactors(dds), "/")
sum(test_normcount != normalized_counts)

Select the differentially expressed genes

#https://galaxyproject.eu/posts/2020/08/22/three-steps-to-galaxify-your-tool/
#https://www.biostars.org/p/282295/
#https://www.biostars.org/p/335751/
#> dds$condition
#[1] AUM   AUM   AUM   Urine Urine Urine MHB   MHB   MHB
#Levels: AUM MHB Urine
#CONSOLE: mkdir star_salmon/degenes

setwd("degenes")
#---- relevel to control ----
dds$condition <- relevel(dds$condition, "MHB")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("AUM_vs_MHB","Urine_vs_MHB")

for (i in clist) {
  contrast = paste("condition", i, sep="_")
  res = results(dds, name=contrast)
  res <- res[!is.na(res$log2FoldChange),]
  res_df <- as.data.frame(res)

  write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(i, "all.txt", sep="-"))
  up <- subset(res_df, padj<=0.05 & log2FoldChange>=1.35)
  down <- subset(res_df, padj<=0.05 & log2FoldChange<=-1.35)
  write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(i, "up.txt", sep="-"))
  write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(i, "down.txt", sep="-"))
}

# -- Under host-env --
grep -P "\tgene\t" CP059040.gff > CP059040_gene.gff
python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040_gene.gff AUM_vs_MHB-all.txt AUM_vs_MHB-all.csv
python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040_gene.gff AUM_vs_MHB-up.txt AUM_vs_MHB-up.csv
python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040_gene.gff AUM_vs_MHB-down.txt AUM_vs_MHB-down.csv
python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040_gene.gff Urine_vs_MHB-all.txt Urine_vs_MHB-all.csv
python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040_gene.gff Urine_vs_MHB-up.txt Urine_vs_MHB-up.csv
python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040_gene.gff Urine_vs_MHB-down.txt Urine_vs_MHB-down.csv

#for i in AUM_vs_MHB Urine_vs_MHB; do
#  echo "contrast = paste(\"condition\", \"${i}\", sep=\"_\")"
#  echo "res = results(dds, name=contrast)"
#  echo "res <- res[!is.na(res$log2FoldChange),]"
#  echo "res_df <- as.data.frame(res)"
#  #selectLab = selectLab_italics,
#  echo "png(\"${i}.png\",width=1200, height=1000)"
#  #legendPosition = 'right',legendLabSize = 12,  arrowheads = FALSE,
#  #subtitle=expression(~Delta*\"$(echo $i | cut -d'_' -f1) versus \" *~Delta*\"$(echo $i | cut -d'_' -f3)\"))"
#  echo "EnhancedVolcano(res, lab = rownames(res),x = 'log2FoldChange',y = 'padj', pCutoff=5e-2, FCcutoff=1.2, title='', subtitleLabSize = 18, pointSize = 3.0, labSize = 5.0, colAlpha=1, legendIconSize = 4.0, drawConnectors = TRUE, widthConnectors = 0.5, colConnectors = 'black', subtitle=expression(\"$(echo $i | cut -d'_' -f1) versus $(echo $i | cut -d'_' -f3)\"))"
#  echo "dev.off()"
#done

res <- read.csv("AUM_vs_MHB-all.csv")
# Replace empty GeneName with modified GeneID
res$GeneName <- ifelse(
  res$GeneName == "" | is.na(res$GeneName),
  gsub("gene-", "", res$GeneID),
  res$GeneName
)
duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]
#print(duplicated_genes)
# [1] "bfr"  "lipA" "ahpF" "pcaF" "alr"  "pcaD" "cydB" "lpdA" "pgaC" "ppk1"
#[11] "pcaF" "tuf"  "galE" "murI" "yccS" "rrf"  "rrf"  "arsB" "ptsP" "umuD"
#[21] "map"  "pgaB" "rrf"  "rrf"  "rrf"  "pgaD" "uraH" "benE"
#res[res$GeneName == "bfr", ]

#1st_strategy First occurrence is kept and Subsequent duplicates are removed
#res <- res[!duplicated(res$GeneName), ]
#2nd_strategy keep the row with the smallest padj value for each GeneName
res <- res %>%
  group_by(GeneName) %>%
  slice_min(padj, with_ties = FALSE) %>%
  ungroup()
res <- as.data.frame(res)
# Sort res first by padj (ascending) and then by log2FoldChange (descending)
res <- res[order(res$padj, -res$log2FoldChange), ]

# Assuming res is your dataframe and already processed
# Filter up-regulated genes: log2FoldChange > 2 and padj < 1e-2
up_regulated <- res[res$log2FoldChange > 2 & res$padj < 1e-2, ]
# Filter down-regulated genes: log2FoldChange < -2 and padj < 1e-2
down_regulated <- res[res$log2FoldChange < -2 & res$padj < 1e-2, ]
# Create a new workbook
wb <- createWorkbook()
# Add the complete dataset as the first sheet
addWorksheet(wb, "Complete_Data")
writeData(wb, "Complete_Data", res)
# Add the up-regulated genes as the second sheet
addWorksheet(wb, "Up_Regulated")
writeData(wb, "Up_Regulated", up_regulated)
# Add the down-regulated genes as the third sheet
addWorksheet(wb, "Down_Regulated")
writeData(wb, "Down_Regulated", down_regulated)
# Save the workbook to a file
saveWorkbook(wb, "Gene_Expression_AUM_vs_MHB.xlsx", overwrite = TRUE)

# Set the 'GeneName' column as row.names
rownames(res) <- res$GeneName
# Drop the 'GeneName' column since it's now the row names
res$GeneName <- NULL
head(res)

## Ensure the data frame matches the expected format
## For example, it should have columns: log2FoldChange, padj, etc.
#res <- as.data.frame(res)
## Remove rows with NA in log2FoldChange (if needed)
#res <- res[!is.na(res$log2FoldChange),]

# Replace padj = 0 with a small value
res$padj[res$padj == 0] <- 1e-150

#library(EnhancedVolcano)
# Assuming res is already sorted and processed
png("AUM_vs_MHB.png", width=1200, height=2000)
#max.overlaps = 10
EnhancedVolcano(res,
                lab = rownames(res),
                x = 'log2FoldChange',
                y = 'padj',
                pCutoff = 1e-2,
                FCcutoff = 2,
                title = '',
                subtitleLabSize = 18,
                pointSize = 3.0,
                labSize = 5.0,
                colAlpha = 1,
                legendIconSize = 4.0,
                drawConnectors = TRUE,
                widthConnectors = 0.5,
                colConnectors = 'black',
                subtitle = expression("AUM versus MHB"))
dev.off()

res <- read.csv("Urine_vs_MHB-all.csv")
# Replace empty GeneName with modified GeneID
res$GeneName <- ifelse(
  res$GeneName == "" | is.na(res$GeneName),
  gsub("gene-", "", res$GeneID),
  res$GeneName
)
duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]

res <- res %>%
  group_by(GeneName) %>%
  slice_min(padj, with_ties = FALSE) %>%
  ungroup()
res <- as.data.frame(res)
# Sort res first by padj (ascending) and then by log2FoldChange (descending)
res <- res[order(res$padj, -res$log2FoldChange), ]

# Assuming res is your dataframe and already processed
# Filter up-regulated genes: log2FoldChange > 2 and padj < 1e-2
up_regulated <- res[res$log2FoldChange > 2 & res$padj < 1e-2, ]
# Filter down-regulated genes: log2FoldChange < -2 and padj < 1e-2
down_regulated <- res[res$log2FoldChange < -2 & res$padj < 1e-2, ]
# Create a new workbook
wb <- createWorkbook()
# Add the complete dataset as the first sheet
addWorksheet(wb, "Complete_Data")
writeData(wb, "Complete_Data", res)
# Add the up-regulated genes as the second sheet
addWorksheet(wb, "Up_Regulated")
writeData(wb, "Up_Regulated", up_regulated)
# Add the down-regulated genes as the third sheet
addWorksheet(wb, "Down_Regulated")
writeData(wb, "Down_Regulated", down_regulated)
# Save the workbook to a file
saveWorkbook(wb, "Gene_Expression_Urine_vs_MHB.xlsx", overwrite = TRUE)

# Set the 'GeneName' column as row.names
rownames(res) <- res$GeneName
# Drop the 'GeneName' column since it's now the row names
res$GeneName <- NULL
head(res)

## Ensure the data frame matches the expected format
## For example, it should have columns: log2FoldChange, padj, etc.
#res <- as.data.frame(res)
## Remove rows with NA in log2FoldChange (if needed)
#res <- res[!is.na(res$log2FoldChange),]

# Replace padj = 0 with a small value
res$padj[res$padj == 0] <- 1e-305

#library(EnhancedVolcano)
# Assuming res is already sorted and processed
png("Urine_vs_MHB.png", width=1200, height=2000)
#max.overlaps = 10
EnhancedVolcano(res,
                lab = rownames(res),
                x = 'log2FoldChange',
                y = 'padj',
                pCutoff = 1e-2,
                FCcutoff = 2,
                title = '',
                subtitleLabSize = 18,
                pointSize = 3.0,
                labSize = 5.0,
                colAlpha = 1,
                legendIconSize = 4.0,
                drawConnectors = TRUE,
                widthConnectors = 0.5,
                colConnectors = 'black',
                subtitle = expression("Urine versus MHB"))
dev.off()

Report

Attached are the results of the analysis.

In the Urine_vs_MHB comparison, we identified a total of 259 up-regulated genes (log2FoldChange > 2 and padj < 1e-2) and 138 down-regulated genes (log2FoldChange < -2 and padj < 1e-2) (please refer to the attached volcano plot and Excel files). Notably, the following genes have a p-adjusted value of 0, indicating very high confidence in their differential expression. The bas-series genes (basA, basB, basC, basD, basE, basJ) are particularly prominent:

GeneName    GeneID  baseMean    log2FoldChange  lfcSE   stat    pvalue  padj
basJ    gene-H0N29_05120    11166.90    11.42   0.30    38.27   0   0
basE    gene-H0N29_05085    12006.52    10.45   0.23    45.76   0   0
basD    gene-H0N29_05080    12217.80    10.15   0.24    42.42   0   0
bauA    gene-H0N29_05070    25280.68    9.55    0.19    51.48   0   0
basA    gene-H0N29_05040    9750.68 9.02    0.18    48.90   0   0
basC    gene-H0N29_05075    5034.14 8.58    0.21    40.07   0   0
H0N29_08320 gene-H0N29_08320    4935.78 7.87    0.20    40.01   0   0
barB    gene-H0N29_05105    5187.29 7.81    0.18    43.39   0   0
H0N29_09380 gene-H0N29_09380    3477.26 7.41    0.19    38.91   0   0
H0N29_13950 gene-H0N29_13950    13959.05    6.85    0.15    45.70   0   0
H0N29_10825 gene-H0N29_10825    3664.70 6.44    0.17    37.59   0   0
H0N29_10790 gene-H0N29_10790    2574.12 6.41    0.17    37.86   0   0
H0N29_10010 gene-H0N29_10010    9376.84 -8.14   0.19    -43.70  0   0

In the AUM_vs_MHB comparison, we identified a total of 149 up-regulated genes (log2FoldChange > 2 and padj < 1e-2) and 65 down-regulated genes (log2FoldChange < -2 and padj < 1e-2) (please refer to the attached volcano plot and Excel files). The following genes also show a p-adjusted value of 0, indicating very high confidence in their differential expression:

GeneName    GeneID  baseMean    log2FoldChange  lfcSE   stat    pvalue  padj
putA    gene-H0N29_09870    36100.24    -7.25   0.15    -49.78  0   0
H0N29_10010 gene-H0N29_10010    9376.84 -7.43   0.18    -41.96  0   0

To ensure proper visualization, I replaced the padj = 0 values with small numbers: 1e-305 for Urine_vs_MHB and 1e-150 for AUM_vs_MHB.

We have now identified the significantly expressed genes. If you would like any further analysis based on these genes or need additional plots, please let me know.

(TODO) clustering the genes and draw heatmap

for i in HSV.d2_vs_control HSV.d4_vs_control HSV.d6_vs_control HSV.d8_vs_control HSV.d4_vs_HSV.d2 HSV.d6_vs_HSV.d2 HSV.d8_vs_HSV.d2 HSV.d6_vs_HSV.d4 HSV.d8_vs_HSV.d4 HSV.d8_vs_HSV.d6; do echo "cut -d',' -f1-1 ${i}-up_annotated.txt > ${i}-up.id"; echo "cut -d',' -f1-1 ${i}-down_annotated.txt > ${i}-down.id"; done
cat *.id | sort -u > ids
#add Gene_Id in the first line, delete the ""
GOI <- read.csv("ids")$Gene_Id  #4647
RNASeq.NoCellLine <- assay(rld)
#install.packages("gplots")
library("gplots")
#clustering methods: "ward.D", "ward.D2", "single", "complete", "average" (= UPGMA), "mcquitty" (= WPGMA), "median" (= WPGMC) or "centroid" (= UPGMC).  pearson or spearman
datamat = RNASeq.NoCellLine[GOI, ]
#datamat = RNASeq.NoCellLine
write.csv(as.data.frame(datamat), file ="gene_expressions.txt")
constant_rows <- apply(datamat, 1, function(row) var(row) == 0)
if(any(constant_rows)) {
  cat("Removing", sum(constant_rows), "constant rows.\n")
  datamat <- datamat[!constant_rows, ]
}
hr <- hclust(as.dist(1-cor(t(datamat), method="pearson")), method="complete")
hc <- hclust(as.dist(1-cor(datamat, method="spearman")), method="complete")
mycl = cutree(hr, h=max(hr$height)/1.05)
mycol = c("YELLOW", "BLUE", "ORANGE", "MAGENTA", "CYAN", "RED", "GREEN", "MAROON", "LIGHTBLUE", "PINK", "MAGENTA", "LIGHTCYAN", "LIGHTRED", "LIGHTGREEN");
mycol = mycol[as.vector(mycl)]
#png("DEGs_heatmap.png", width=900, height=800)
#cex.lab=10, labRow="",
png("DEGs_heatmap.png", width=800, height=1000)
heatmap.2(as.matrix(datamat),Rowv=as.dendrogram(hr),Colv = NA, dendrogram = 'row',labRow="",
            scale='row',trace='none',col=bluered(75), cexCol=1.8,
            RowSideColors = mycol, margins=c(10,2), cexRow=1.5, srtCol=30, lhei = c(1, 8), lwid=c(2, 8))  #rownames(datamat)
#heatmap.2(datamat, Rowv=as.dendrogram(hr), col=bluered(75), scale="row", RowSideColors=mycol, trace="none", margin=c(5,5), sepwidth=c(0,0), dendrogram = 'row', Colv = 'false', density.info='none', labRow="", srtCol=30, lhei=c(0.1,2))
dev.off()
#### cluster members #####
write.csv(names(subset(mycl, mycl == '1')),file='cluster1_YELLOW.txt')
write.csv(names(subset(mycl, mycl == '2')),file='cluster2_DARKBLUE.txt')
write.csv(names(subset(mycl, mycl == '3')),file='cluster3_DARKORANGE.txt')
write.csv(names(subset(mycl, mycl == '4')),file='cluster4.txt')
#~/Tools/csv2xls-0.4/csv_to_xls.py cluster*.txt -d',' -o DEGs_heatmap_cluster_members.xls
~/Tools/csv2xls-0.4/csv_to_xls.py \
significant_gene_expressions.txt \
-d',' -o DEGs_heatmap_expression_data.xls;
#### cluster members (advanced) #####
subset_1<-names(subset(mycl, mycl == '1'))
data <- as.data.frame(datamat[rownames(datamat) %in% subset_1, ])  #2575
subset_2<-names(subset(mycl, mycl == '2'))
data <- as.data.frame(datamat[rownames(datamat) %in% subset_2, ])  #1855
subset_3<-names(subset(mycl, mycl == '3'))
data <- as.data.frame(datamat[rownames(datamat) %in% subset_3, ])  #217
subset_4<-names(subset(mycl, mycl == '4'))
data <- as.data.frame(datamat[rownames(datamat) %in% subset_4, ])  #
subset_5<-names(subset(mycl, mycl == '5'))
data <- as.data.frame(datamat[rownames(datamat) %in% subset_5, ])  #
# Initialize an empty data frame for the annotated data
annotated_data <- data.frame()
# Determine total number of genes
total_genes <- length(rownames(data))
# Loop through each gene to annotate
for (i in 1:total_genes) {
    gene <- rownames(data)[i]
    result <- getBM(attributes = c('ensembl_gene_id', 'external_gene_name', 'gene_biotype', 'entrezgene_id', 'chromosome_name', 'start_position', 'end_position', 'strand', 'description'),
                    filters = 'ensembl_gene_id',
                    values = gene,
                    mart = ensembl)
    # If multiple rows are returned, take the first one
    if (nrow(result) > 1) {
        result <- result[1, ]
    }
    # Check if the result is empty
    if (nrow(result) == 0) {
        result <- data.frame(ensembl_gene_id = gene,
                            external_gene_name = NA,
                            gene_biotype = NA,
                            entrezgene_id = NA,
                            chromosome_name = NA,
                            start_position = NA,
                            end_position = NA,
                            strand = NA,
                            description = NA)
    }
    # Transpose expression values
    expression_values <- t(data.frame(t(data[gene, ])))
    colnames(expression_values) <- colnames(data)
    # Combine gene information and expression data
    combined_result <- cbind(result, expression_values)
    # Append to the final dataframe
    annotated_data <- rbind(annotated_data, combined_result)
    # Print progress every 100 genes
    if (i %% 100 == 0) {
        cat(sprintf("Processed gene %d out of %d\n", i, total_genes))
    }
}
# Save the annotated data to a new CSV file
#write.csv(annotated_data, "cluster1_YELLOW.csv", row.names=FALSE)
write.csv(annotated_data, "cluster2_DARKBLUE.csv", row.names=FALSE)
write.csv(annotated_data, "cluster3_DARKORANGE.csv", row.names=FALSE)
write.csv(annotated_data, "cluster4_DARKMAGENTA.csv", row.names=FALSE)
write.csv(annotated_data, "cluster5_DARKCYAN.csv", row.names=FALSE)
#~/Tools/csv2xls-0.4/csv_to_xls.py cluster*.csv -d',' -o DEGs_heatmap_clusters.xls

Kostenart von Nebenkostenabrechnung

Leave a reply

Betriebskosten
- Abwasser / Entwässerung
- Allgemeinstrom
- Antenne
- Breitbandkabel
- Dachrinnenreinigung
- Deichabgabe
- Gartenpflege
- Gebäudereinigung
- Gebäudeversicherung
- Haftpflichtversicherung
- Hausbeleuchtung
- Hauswart
- Heizkosten
- Heizungsanlagen
- Kaltwasser
- Müllabfuhr
- Niederschlagswasser
- Recycling
- Schornsteinreinigung
- Straßenreinigung
- Ungezieferbekämpfung
- Warmwasser
- Warmwasserversorgungsanlagen
- Wartung Aufzug
- Wartung Brandschutz /Rauchwarnmelder
- Wartung Lüftungsanlagen
- Wartung Tiefgarage
- Wäscheeinrichtungen
- Winterdiest
- Sonstige Kostenart hinzufügen
- Kosten für Dichtigkeitsprüfung nicht umlegbar In einigen Bundesländern ist vorgeschrieben, dass Abwasserleitungen auf Dichtigkeit zu überprüfen sind. Die hierdurch entstehenden Kosten sind ebenfalls nicht umlagefähig.
- Schornsteinfegerkosten: Gibt es keine Zentralheizung und keine zentrale Warmwasserversorgung, dann sind die Schornsteinfegerkosten über die “kalten Betriebskosten” abzurechnen. Sie können dann nach Zahl der Wohnungen oder nach anteiliger Wohnfläche umgelegt werden.
- Müllabfuhr: Letztlich kommt es zunächst auf den Mietvertrag an, was Mieter und Vermieter vereinbart haben. Finden Sie nichts dazu im Mietvertrag, wird sich Ihr Vermieter an der gesetzlichen Regelung orientieren. Die Umlage erfolgt dann anhand der Wohnfläche Ihrer Wohnung, sofern die Müllmenge nicht konkret erfasst wird.
- Gebäudeversicherung: Die Kosten für die Gebäudeversicherung dürfen Immobilieneigentümer auf die Mieter umlegen. Bei Mehrfamilienhäusern erfolgt die Umlage meist je nach Anteil der einzelnen Wohnungen an der gesamten Wohnfläche. Vermieter müssen beim Abschluss einer Gebäudeversicherung darauf achten, dass die Prämie nicht überhöht ist.
- Grundsteuer: Zahlung der Grundsteuer als Betriebskosten für die Wohnung Üblich ist, dass die Grundsteuer nach der Wohnfläche der einzelnen Wohnungen (wenn im Mietvertrag keine andere Verteilung vereinbart ist) verteilt wird. Der Vermieter nutzt hierfür den Grundsteuerbescheid.
Direkte Kosten
- Grundsteuer
- Sonstige direkte Kosten

Microbial bioinformatics

Microbial bioinformatics uses computational tools to analyze genomes, track evolution, and study functions in microorganisms, including bacteria and viruses.

Author Archives: gene_x

TODO: mapping virus reads should use TopHat2, BWA or Bowtie2 could not detect splice events.

Deciphering S. aureus with metatranscriptomics (Marc’s project)

Time synchronization failed

Die 10 besten Skigebiete Deutschlands: Skifahren mit der Familie

Identify all occurrences of Phage HH1 MT880870 in S. epidermidis ST2 genomes from public and clinical isolates

Identify all occurrences of Phages MT880870, MT880871 and MT880872 in S. epidermidis ST2 genomes from public and clinical isolates

Enhanced Visualization of Gene Presence in Luise_Sepi_STKN

Enhanced Visualization of Gene Presence for the Selected Genes in Bongarts_S.epidermidis_HDRNA

RNA-seq Tam on Acinetobacter baumannii strain ATCC 19606 CP059040.1 (Data_Tam_RNAseq_2024)

Kostenart von Nebenkostenabrechnung