Author Archives: gene_x

Deciphering S. aureus with metatranscriptomics (Marc’s project)

RNAseq analaysis using nextflow set reference as CP000255

  1. prepare reference

    mkdir raw_data; cd raw_data
    ln -s ../F24A430002261_BACwhbaP/P50_LH1+2/P50_LH1+2_1.fq.gz P50_LH1and2_R1.fq.gz
    ln -s ../F24A430002261_BACwhbaP/P50_LH1+2/P50_LH1+2_2.fq.gz P50_LH1and2_R2.fq.gz
  2. Downloading CP000255.fasta and CP000255.gff from GenBank

    mv ~/Downloads/sequence\(3\).fasta CP000255.fasta
    mv ~/Downloads/sequence\(2\).gff3 CP000255.gff
    
    MRSA: USA300
    EMRSA (CC22): CC22 strain
    MSSA: Newmann strain
    
    #https://journals.asm.org/doi/10.1128/jb.01000-07
    #Staphylococcus aureus subsp. aureus strain Newman (Taxonomy ID: 426430)
    #https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=426430
    
    #Staphylococcus aureus subsp. aureus USA300_FPR3757, complete genome
    https://www.ncbi.nlm.nih.gov/nuccore/CP000255.1
    
    #Methicillin-resistant Staphylococcus aureus CC22-MRSA-IV as an agent of dairy cow intramammary infections
    https://pubmed.ncbi.nlm.nih.gov/30473348/
    
    sed 's/\tCDS\t/\texon\t/g' CP000255.gff > CP000255_m.gff
    grep -P "\texon\t" CP000255_m.gff | sort | wc -l  #2630
    # -- DEBUG_2: combination of 'CP059040_m.gff' and 'exon' results in ERROR, using 'transcript' instead! (but there is no 'transcript' in the column 3, strange????) --
    --gff "/home/jhuang/DATA/Data_Marc_RNAseq_2024/CP000255_m.gff" --featurecounts_feature_type 'transcript'
  3. Preparing the directory trimmed

    mkdir trimmed trimmed_unpaired;
    for sample_id in P50_LH1and2; do \
        java -jar /home/jhuang/Tools/Trimmomatic-0.36/trimmomatic-0.36.jar PE -threads 100 raw_data/${sample_id}_R1.fq.gz raw_data/${sample_id}_R2.fq.gz trimmed/${sample_id}_R1.fq.gz trimmed_unpaired/${sample_id}_R1.fq.gz trimmed/${sample_id}_R2.fq.gz trimmed_unpaired/${sample_id}_R2.fq.gz ILLUMINACLIP:/home/jhuang/Tools/Trimmomatic-0.36/adapters/TruSeq3-PE-2.fa:2:30:10:8:TRUE LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36 AVGQUAL:20;
    done
  4. Preparing samplesheet.csv

    sample,fastq_1,fastq_2,strandedness
    P50_LH1and2,P50_LH1and2_R1.fq.gz,P50_LH1and2_R2.fq.gz,auto
  5. nextflow run

    mv trimmed/*.fq.gz .
    # ---- SUCCESSFUL with directly downloaded gff3 and fasta from NCBI using docker ----
    docker pull nfcore/rnaseq
    ln -s /home/jhuang/Tools/nf-core-rnaseq-3.12.0/ rnaseq
    (host_env) /usr/local/bin/nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results    --fasta "/home/jhuang/DATA/Data_Marc_RNAseq_2024/CP000255.fasta" --gff "/home/jhuang/DATA/Data_Marc_RNAseq_2024/CP000255_m.gff"        -profile docker -resume  --max_cpus 55 --max_memory 512.GB --max_time 2400.h    --save_align_intermeds --save_unaligned --save_reference    --aligner 'star_salmon'    --gtf_group_features 'gene_id'  --gtf_extra_attributes 'gene_name' --featurecounts_group_type 'gene_biotype' --featurecounts_feature_type 'transcript'
    
    # -- DEBUG_2: combination of 'CP059040_m.gff' and 'exon' results in ERROR, using 'transcript' instead!
    --gff "/home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040_m.gff" --featurecounts_feature_type 'transcript'
    
    # ---- SUCCESSFUL with directly downloaded gff3 and fasta from NCBI using docker after replacing 'CDS' with 'exon' ----
    (host_env) /usr/local/bin/nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results    --fasta "/home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040.fasta" --gff "/home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040_m.gff"        -profile docker -resume  --max_cpus 55 --max_memory 512.GB --max_time 2400.h    --save_align_intermeds --save_unaligned --save_reference    --aligner 'star_salmon'    --gtf_group_features 'gene_id'  --gtf_extra_attributes 'gene_name' --featurecounts_group_type 'gene_biotype' --featurecounts_feature_type 'transcript'
    
    # -- in results/genome/CP059040_genes.gtf -->
    #P059040.1      Genbank transcript      1       1398    .       +       .       transcript_id "gene-H0N29_00005"; gene_id "gene-H0N29_00005"; gene_name "dnaA"; Name "dnaA"; gbkey "Gene"; gene "dnaA"; gene_biotype "protein_coding"; locus_tag "H0N29_00005";
    #CP059040.1      Genbank exon    1       1398    .       +       .       transcript_id "gene-H0N29_00005"; gene_id "gene-H0N29_00005"; gene_name "dnaA"; Dbxref "NCBI_GP:QNT85048.1"; Name "QNT85048.1"; gbkey "CDS"; gene "dnaA"; inference "COORDINATES: similar to AA sequence:RefSeq:YP_004996698.1"; locus_tag "H0N29_00005"; product "chromosomal replication initiator protein DnaA"; protein_id "QNT85048.1"; transl_table "11";
    #CP059040.1      Genbank transcript      1496    2644    .       +       .       transcript_id "gene-H0N29_00010"; gene_id "gene-H0N29_00010"; Name "H0N29_00010"; gbkey "Gene"; gene_biotype "protein_coding"; locus_tag "H0N29_00010";
    #CP059040.1      Genbank exon    1496    2644    .       +       .       transcript_id "gene-H0N29_00010"; gene_id "gene-H0N29_00010"; Dbxref "NCBI_GP:QNT85049.1"; Name "QNT85049.1"; gbkey "CDS"; inference "COORDINATES: similar to AA sequence:RefSeq:WP_010116376.1"; locus_tag "H0N29_00010"; product "DNA polymerase III subunit beta"; protein_id "QNT85049.1"; transl_table "11";
    
    # -- DEBUG_3: make sure the header of fasta is the same to the *_m.gff file
    >CP000255.1 Staphylococcus aureus subsp. aureus USA300_FPR3757, complete genome

Calculate Host and Bacterial Compositions using vrap

  1. download all S epidermidis genomes and identified all ST2 isolates from them!

    Staphylococcus aureus  ID: 1280
    Staphylococcus epidermidis Taxonomy ID: 1282
    
    esearch -db nucleotide -query "txid3052345[Organism:exp]" | efetch -format fasta > virus_genome_3052345.fasta  # 23788
    
    esearch -db nucleotide -query "txid1280[Organism:exp]" | \
    efetch -format fasta -email j.huang@uke.de > genome_1280_ncbi.fasta
    esearch -db nucleotide -query "txid1282[Organism:exp]" | \
    efetch -format fasta -email j.huang@uke.de > genome_1282_ncbi.fasta
    cd /mnt/md1/DATA_md1/Data_Marc_RNAseq_2024
    python ~/Scripts/filter_fasta.py genome_1280_ncbi.fasta complete_genome_1280_ncbi.fasta  #3478, 9.2G
    python ~/Scripts/filter_fasta.py genome_1282_ncbi.fasta complete_genome_1282_ncbi.fasta  #447, 1.1G
    
    # ---- Download related genomes from ENA ----
    https://www.ebi.ac.uk/ena/browser/view/1280
    #Click "Sequence" and download "Counts" and "Taxon descendants count" if there is enough time! Downloading time points is 10.01.2025.
    https://www.ebi.ac.uk/ena/browser/view/1282
    
    #python ~/Scripts/filter_fasta.py  ena_1282_sequence_taxon_descendants_count.fasta complete_genome_1282_ena_taxon_descendants.fasta
    #S.aureus, Staphylococcus epidermidis
    python ~/Scripts/filter_fasta.py ena_1280_sequence.fasta complete_genome_1280_ena.fasta  #2187, 5.8G
    python ~/Scripts/filter_fasta.py ena_1282_sequence.fasta complete_genome_1282_ena.fasta  #243, 596M
    
    replace virus to S.aureus
  2. run vrap

    ln -s ~/Tools/vrap/ .
    conda activate /home/jhuang/miniconda3/envs/vrap
    vrap/vrap.py  -1 trimmed/P50_LH1and2_R1.fq.gz -2 trimmed/P50_LH1and2_R2.fq.gz -o P50_LH1and2 --bt2idx=/home/jhuang/REFs/genome  --host=/home/jhuang/REFs/genome.fa --virus=/home/jhuang/REFs/ncbi_dataset/data/genomic.fna --nt=/mnt/nvme0n1p1/blast/nt --nr=/mnt/nvme0n1p1/blast/nr  -t 100 -l 200  -g
    
    vrap/vrap.py  -1 trimmed/P50_LH1and2_R1.fq.gz -2 trimmed/P50_LH1and2_R2.fq.gz -o P50_LH1and2_1280 --bt2idx=/home/jhuang/REFs/genome  --host=/home/jhuang/REFs/genome.fa --virus=/home/jhuang/REFs/genomes_by_taxid/complete_genome_1280_ncbi.fasta --nt=/mnt/nvme0n1p1/blast/nt --nr=/mnt/nvme0n1p1/blast/nr  -t 100 -l 200  -g
    
    #TODO: change virus_user_db --> S.aureus_user_db
    #ref|NZ_CP075562.1| --> Staphylococcus aureus strain Sta2021010 chromosome, complete genome
    
    mv mapped mapped.sam
    samtools view -S -b mapped.sam > mapped.bam
    samtools sort mapped.bam -o mapped_sorted.bam
    samtools index mapped_sorted.bam
    samtools view -H mapped_sorted.bam
    samtools flagstat mapped_sorted.bam
    
    @PG     ID:bowtie2      PN:bowtie2      VN:2.4.4        CL:"/usr/bin/bowtie2-align-s --wrapper basic-0 -x /home/jhuang/REFs/genome -q --fast -p 100 --passthrough -1 /mnt/md1/DATA_md1/Data_Marc_RNAseq_2024/P50_LH1and2/flash/flash.notCombined_1.fastq -2 /mnt/md1/DATA_md1/Data_Marc_RNAseq_2024/P50_LH1and2/flash/flash.notCombined_2.fastq -U /mnt/md1/DATA_md1/Data_Marc_RNAseq_2024/P50_LH1and2/flash/flash.extendedFrags.fastq"
    24061654 + 0 in total (QC-passed reads + QC-failed reads)
    0 + 0 secondary
    0 + 0 supplementary
    0 + 0 duplicates
    22492206 + 0 mapped (93.48% : N/A)
    12342028 + 0 paired in sequencing
    6171014 + 0 read1
    6171014 + 0 read2
    10326710 + 0 properly paired (83.67% : N/A)
    11134274 + 0 with itself and mate mapped
    56254 + 0 singletons (0.46% : N/A)
    421598 + 0 with mate mapped to a different chr
    15942 + 0 with mate mapped to a different chr (mapQ>=5)
  3. Using the bowtie of vrap to map the reads on ref_genome/reference.fasta (The reference refers to the closest related genome found from the list generated by vrap)

    vrap/vrap.py  -1 trimmed/P50_LH1and2_R1.fq.gz -2 trimmed/P50_LH1and2_R2.fq.gz -o P50_LH1and2_on_CP015646 --host CP015646.fasta -t 100 -l 200 -g
    cd bowtie
    mv mapped mapped.sam
    samtools view -S -b mapped.sam > mapped.bam
    samtools sort mapped.bam -o mapped_sorted.bam
    samtools index mapped_sorted.bam
    samtools view -H mapped_sorted.bam
    samtools flagstat mapped_sorted.bam
    #24061651 + 0 in total (QC-passed reads + QC-failed reads)
    0 + 0 secondary
    0 + 0 supplementary
    0 + 0 duplicates
    541762 + 0 mapped (2.25% : N/A)
    12342022 + 0 paired in sequencing
    6171011 + 0 read1
    6171011 + 0 read2
    382256 + 0 properly paired (3.10% : N/A)
    388562 + 0 with itself and mate mapped
    22349 + 0 singletons (0.18% : N/A)
    0 + 0 with mate mapped to a different chr
    0 + 0 with mate mapped to a different chr (mapQ>=5)
    
    bamCoverage -b mapped_sorted.bam -o ../../P50_LH1and2_on_CP015646_reads_coverage.bw
    
    samtools depth -a mapped_sorted.bam | awk '{sum+=$3; count++} END {if (count > 0) print sum/count; else print 0}'
    #21.4043
    
    samtools depth -a mapped_sorted.bam | awk '{sum+=$3; count++} END {print sum/count}'
    samtools depth -a mapped_sorted.bam | awk -v len=2742807 '{sum+=$3} END {print sum/len}'
    #21.4043
    samtools depth mapped_sorted.bam | awk '$3 > 0 {count++} END {print count}'
    samtools depth mapped_sorted.bam | awk '$3 > 10 {count++} END {print count}'
    
    python ~/Scripts/calculate_coverage.py mapped_sorted.bam
    #Average coverage: 24.41
  4. Report

    I hope this email finds you well. Below is a summary of the analysis results based on the sequencing data:
    
        Host and Bacterial Composition:
            * The human genome accounted for 22,492,206 reads out of the total 24,061,654 reads (93.48%).
    
        Genome Assembly and Annotation:
            * After removing the human reads, we assembled the bacterial genome.
            * For annotation, we used the S. aureus user database (S.aureus_user_db), which includes 3,478 complete S. aureus genomes available in GenBank as of January 10, 2025.
            * The annotation summary is provided in the attached file assembly_summary.xlsx. The sequence CP015646.1 appears to be the closest genome for the data, as indicated in the attached table (highlighted in yellow).
    
        Mapping and Coverage:
            * Using CP015646.1 as the reference genome, 541,762 reads (2.25%) were mapped to this reference.
            * The average coverage is approximately 21x, with 951,154 positions out of 2,742,807 in the reference genome covered by at least 10 reads.
            * The attached figure xxx.png shows the coverage of reads mapped to CP015646.1.
    
    Conclusion:
    The analysis suggests that the majority of reads originate from the human genome, and the S. aureus reads are insufficient for a robust RNA-seq analysis.

Calculate the percentages of reads are human rRNA and human mRNA, see the attached review “Current concepts, advances, and challenges in deciphering the human microbiota with metatranscriptomics”.

  1. run nextflow set human as reference

    (OPTION_1, NOT_USED) Use Annotation Tools: Annotation pipelines like FeatureCounts or HTSeq can classify reads based on genomic annotations (e.g., GTF/GFF files). Use a comprehensive human genome annotation to distinguish between mRNA and rRNA regions.
    
            diff /mnt/nvme0n1p1/ref/Homo_sapiens/UCSC/hg38/Sequence/WholeGenomeFasta/genome.fa /home/jhuang/REFs/genome.fa
    
            featureCounts -a /mnt/nvme0n1p1/ref/Homo_sapiens/UCSC/hg38/gencode.v27.annotation.gtf -o counts.txt -t exon -g gene_type aligned.bam
    
    Here, gene_type in the GTF file may distinguish mRNA and rRNA features.
    
    /mnt/nvme0n1p1/ref/Homo_sapiens/UCSC/hg38/gencode.v27.annotation.gtf
    
    (OPTION_2, USED, under host_env) /usr/local/bin/nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results    --fasta "/mnt/nvme0n1p1/ref/Homo_sapiens/Ensembl/GRCh38/Sequence/WholeGenomeFasta/genome.fa" --gtf "/mnt/nvme0n1p1/ref/Homo_sapiens/Ensembl/GRCh38/Annotation/Genes/genes.gtf"   --star_index "/mnt/nvme0n1p1/ref/Homo_sapiens/Ensembl/GRCh38/Sequence/STARIndex/"    -profile docker -resume  --max_cpus 55 --max_memory 512.GB --max_time 2400.h    --save_align_intermeds --save_unaligned --save_reference    --aligner 'star_salmon'
    #--pseudo_aligner 'salmon'
    
    'GRCh38' {
        fasta       = "/mnt/nvme0n1p1/ref/Homo_sapiens/Ensembl/GRCh38/Sequence/WholeGenomeFasta/genome.fa"
        bwa         = "/mnt/nvme0n1p1/ref/Homo_sapiens/Ensembl/GRCh38/Sequence/BWAIndex/version0.6.0/"
        bowtie2     = "/mnt/nvme0n1p1/ref/Homo_sapiens/Ensembl/GRCh38/Sequence/Bowtie2Index/"
        star        = "/mnt/nvme0n1p1/ref/Homo_sapiens/Ensembl/GRCh38/Sequence/STARIndex/"
        bismark     = "/mnt/nvme0n1p1/ref/Homo_sapiens/Ensembl/GRCh38/Sequence/BismarkIndex/"
        gtf         = "/mnt/nvme0n1p1/ref/Homo_sapiens/Ensembl/GRCh38/Annotation/Genes/genes.gtf"
        bed12       = "/mnt/nvme0n1p1/ref/Homo_sapiens/Ensembl/GRCh38/Annotation/Genes/genes.bed"
        mito_name   = "chrM"
        macs_gsize  = "2.7e9"
        blacklist   = "/mnt/nvme0n1p1/ref/Homo_sapiens/UCSC/hg38/blacklists/hg38-blacklist.bed"
  2. Report

    Indeed, until now I haven’t had direct experience with analyzing RNA-seq data from metatranscriptomic samples, as my expertise has primarily been with data derived from cultured samples. Ideally, sequencing from cultured samples would be the optimal approach.
    
    However, if RNA-seq from metatranscriptomic samples becomes necessary, it would be crucial to deplete host (human) mRNA, host (human) rRNA, and microbial rRNA before sequencing (refer to the attached review PIIS0168952523001270.pdf, "Current concepts, advances, and challenges in deciphering the human microbiota with metatranscriptomics"). This step is important for enriching bacterial mRNA and could potentially enhance the quality of sequencing data.
    
    This morning, I explored the origins of reads within the human genome. Despite over 20 million reads originating from humans, only a small fraction successfully mapped to unique positions (83,500 reads). The remaining reads failed due to reasons such as Unassigned_MultiMapping (the most common issue) and Unassigned_NoFeatures. Among the 83,500 mapped reads, only 0.01% originated from rRNA. For more detailed biotype analysis, please refer to featurecounts_biotype_plot-1.png.
    
    Based on these findings, depleting human mRNA appears to be the most critical step for improvement.

Time synchronization failed

https://askubuntu.com/questions/1046214/enable-system-clock-synchronization

Steps to Enable NTP Support

  1. Check if systemd-timesyncd is Running

Ensure that the default time synchronization service is installed and active. Run:

sudo systemctl status systemd-timesyncd

If it’s inactive or not installed, enable and start it:

sudo systemctl enable systemd-timesyncd
sudo systemctl start systemd-timesyncd
  1. Install and Configure ntp (if needed)

If systemd-timesyncd is not available or you prefer using ntp, follow these steps:

Install the ntp package:

sudo apt update sudo apt install ntp

Ensure the ntp service is enabled and running:

sudo systemctl enable ntp
sudo systemctl start ntp
  1. Manually Set NTP Servers (Optional)

If you’re using systemd-timesyncd, you can configure NTP servers in /etc/systemd/timesyncd.conf. Open the file:

sudo vim /etc/systemd/timesyncd.conf

Add or edit the following lines to include your preferred NTP servers:

[Time] NTP=time.google.com 1.de.pool.ntp.org 2.de.pool.ntp.org

Save the file and restart the service:

sudo systemctl restart systemd-timesyncd

  1. Force Synchronization

After ensuring the service is running, you can force synchronization:

sudo timedatectl set-ntp false sudo timedatectl set-ntp true

  1. Verify Synchronization

Check the time status again:

timedatectl

Confirm that System clock synchronized: yes is displayed.


Steps to Resolve

  1. Check for systemd-timesyncd Installation

Run the following command to see if systemd-timesyncd is installed:

sudo systemctl status systemd-timesyncd

If the service is not found, install and enable it:

sudo apt update
sudo apt install systemd-timesyncd
sudo systemctl enable systemd-timesyncd
sudo systemctl start systemd-timesyncd
  1. Verify NTP Settings

Check if the configuration for NTP servers is correct. Open the file /etc/systemd/timesyncd.conf:

sudo nano /etc/systemd/timesyncd.conf

Ensure the [Time] section includes valid NTP servers:

[Time] NTP=time.google.com 1.de.pool.ntp.org 2.de.pool.ntp.org

Save the file and restart the service:

sudo systemctl restart systemd-timesyncd

  1. Manually Enable NTP Synchronization

If the service is installed and running but synchronization is still not active, force-enable NTP:

sudo timedatectl set-ntp true

  1. Install ntp as an Alternative (Optional)

If systemd-timesyncd is not sufficient or unavailable, you can install and use the ntp package:

sudo apt update sudo apt install ntp sudo systemctl enable ntp sudo systemctl start ntp

  1. Check Synchronization Status

Recheck the time synchronization status:

timedatectl

Look for System clock synchronized: yes.

Die 10 besten Skigebiete Deutschlands: Skifahren mit der Familie

https://www.travanto.de/urlaubsmagazin/skigebiete-deutschland-fuer-familienurlaub/#Winkl

Deutschland ist ein wahres Wintersportparadies. Travanto verrät Ihnen die 10 besten Skigebiete Deutschlands und stellt Ihnen ausgewählte Ziele vom Mittelgebirge bis zu den Hochalpen für einen Familienurlaub vor. Die Anzahl an Skigebieten in Deutschland ist erstaunlich. Erfahren Sie, welche die 10 besten Skigebiete für einen Urlaub mit der Familie sind. Nach einem ausgiebigen Tag auf den Skiern und Snowboards kehren Sie in Ihr Ferienhaus oder die Ferienwohnung zurück, da es in den eigenen vier Wänden unweit der Piste immer am gemütlichsten ist. Sie und Ihre Liebsten genießen die Gemeinsamkeit im Urlaub inmitten von Bergen und Schnee und haben Gelegenheit, die schönsten Erlebnisse des Tages miteinander zu teilen. Die Skigebiete im Überblick

  • Oberstdorf-Kleinwalsertal, Bayern
  • Garmisch-Partenkirchen, Bayern
  • Reit im Winkl, Bayern
  • Bayerischer Wald
  • Berchtesgadener, Bayern
  • Winterberg, Nordrhein-Westfalen
  • Willingen im Sauerland, Hessen
  • Feldberg, Baden-Württemberg
  • Fichtelberg, Sachsen
  • Wurmberg, Niedersachen

Identify all occurrences of Phage HH1 MT880870 in S. epidermidis ST2 genomes from public and clinical isolates

ggtree_and_gheatmap_MT880870

  1. install the bacto environment

    mamba create -n bacto -c conda-forge -c bioconda -c defaults python=3.7
    mamba activate bacto
    (bacto) mamba install -c conda-forge -c bioconda -c defaults trimmomatic fastqc shovill
    (bacto) mamba install -c conda-forge -c bioconda -c defaults mlst snippy fasttree
    (bacto) mamba install biopython
    mamba install -c conda-forge perl=5.22
    mamba install -c conda-forge -c bioconda roary
    
    #------ NEED to be debugged for install roary(ERROR) --> install roary in the host env (SUCCESSFUL) ------
    #ERROR (bacto) mamba install -c conda-forge -c bioconda -c defaults roary
    #ERROR (bacto) cpanm Bio::Roary  #DEBUG NOT WORKING ! Installing Bio::Roary failed. See /home/jhuang/.cpanm/work/1736859390.2714327/build.log for details. Retry with --force to force install it.
    #ERROR (bacto) export PERL5LIB=$CONDA_PREFIX/lib/perl5/site_perl/5.22.0/  #DEBUG SUCCESSFUL, repaired also snippy and snippy-core
    sudo apt install roary
    
    (bacto) mamba install -c conda-forge -c bioconda -c defaults raxml-ng gubbins snp-sites
    (bacto) mamba install -c bioconda iqtree
    #(bacto) mamba install -c conda-forge -c bioconda -c defaults perl-bioperl
    #(bacto) cpanm Bio::SeqIO
    #(bacto) cpanm Bio::Perl
    #(bacto) mamba install -c conda-forge -c bioconda -c defaults biopython
    #(bacto) mamba install -c conda-forge -c bioconda -c defaults snakemak
  2. alignment of complete genomes

    Alignment of Complete Genomes vs Alignment of Core Genes
    
    Alignment of Complete Genomes:
        Useful when all genomes are complete and high-quality.
        Not ideal if many genomes are incomplete or fragmented, as this can introduce noise.
    
    Alignment of Core Genes:
        Preferred when working with a mix of complete and draft genomes.
        Focuses on conserved regions, minimizing issues caused by missing or fragmented data.
    
    How to Ensure Core Gene Alignment?
    
    If your goal is to create a core gene alignment:
    
        Use a tool like Roary or Panaroo after genome annotation (e.g., with Prokka).
        Extract the core_gene_alignment.aln file for alignment, which represents the concatenated core genes shared across all genomes.
    
    Prepare all gff-files
    
            bpseudomallei taylorella schromogenes bcereus orhinotracheale aactinomycetemcomitans vcholerae_2 streptomyces spyogenes paeruginosa sgallolyticus bhenselae ecoli senterica_achtman_2 brachyspira leptospira_2 streptothermophilus pdamselae campylobacter_nonjejuni_3 ypseudotuberculosis_achtman_3 abaumannii_2 staphlugdunensis mhominis_3 chlamydiales cdifficile yruckeri bsubtilis sbsec koxytoca kingella tenacibaculum pgingivalis miowae neisseria bordetella_3 cbotulinum edwardsiella ureaplasma pacnes_3 mhyorhinis gallibacterium bbacilliformis sdysgalactiae shominis mcanis borrelia smaltophilia brachyspira_2 campylobacter_nonjejuni_8 psalmonis suberis achromobacter    saureus    mhaemolytica efaecalis cronobacter rhodococcus scanis mplutonius shewanella campylobacter_nonjejuni_7 campylobacter hparasuis listeria_2 plarvae mabscessus hcinaedi csepticum    sepidermidis    mhyopneumoniae mbovis_2 wolbachia vcholerae helicobacter cperfringens bcc ranatipestifer szooepidemicus msynoviae campylobacter_nonjejuni_9 pmultocida_2 arcobacter fpsychrophilum abaumannii bwashoensis mcaseolyticus liberibacter aeromonas cmaltaromaticum vvulnificus mpneumoniae klebsiella brachyspira_5 efaecium pfluorescens hsuis ssuis dnodosus sinorhizobium msciuri spneumoniae vibrio campylobacter_nonjejuni geotrichum lsalivarius tpallidum magalactiae shaemolyticus sagalactiae ecoli_achtman_4 sthermophilus bfragilis kaerogenes blicheniformis_14 vparahaemolyticus mflocculare mgallisepticum vtapetis ecloacae ppentosaceus manserisalpingitidis spseudintermedius brachyspira_4 brachyspira_3 leptospira_3 mcatarrhalis_achtman_6 oralstrep campylobacter_nonjejuni_6 pmultocida aphagocytophilum campylobacter_nonjejuni_2 otsutsugamushi diphtheria_3 hinfluenzae pputida brucella cfreundii mycobacteria_2 mgallisepticum_2 llactis_phage xfastidiosa campylobacter_nonjejuni_4 campylobacter_nonjejuni_5 leptospira
    
            python ~/Scripts/run_mlst.py complete_genome_1282_ncbi.fasta sepidermidis  #73
            python ~/Scripts/run_mlst.py complete_genome_1282_ena.fasta sepidermidis   #73
            python ~/Scripts/run_mlst.py complete_genome_1282_ena_taxon_descendants.fasta sepidermidis
            grep -P "\tsepidermidis\t2\t" all.mlst > ST2.mlst  #73
            grep -v "NZ_" ST2.mlst > ST2_.mlst
            cut -d'.' -f1 ST2_.mlst | sort > ST2.ids
            cut -d'|' -f2 ST2.mlst | sort > ST2.ids
            diff ST2.ids ../ena_descendants_all_ST2_mlst/ST2.ids
    
            for sample in AP029227 CP013943 CP040883 CP045648 CP052939 CP052940 CP052955 CP052959 CP052981 CP052984 CP052990 CP052994 CP053088 CP064461 CP064467 CP064473 CP064476 CP064480 CP064485 CP064499 CP064508 CP064516 CP064525 CP064554 CP064557 CP064560 CP064572 CP064574 CP064587 CP064590 CP064599 CP064603 CP064606 CP064609 CP064613 CP064616 CP064619 CP064624 CP064631 CP064637 CP064650 CP073821 CP073824 CP073827 CP073830 CP073835 CP073836 CP073840 CP073841 CP073844 CP073847 CP073850 CP073852 CP073855 CP073857 CP073859 CP073862 CP073863 CP073878 CP073900 CP073904 CP084008 CP093173 CP093179 CP093181 CP093193 CP093196 CP093198 CP093212 CP097512 CP097514 CP120425 CP133663; do
                cp ../genomes_by_taxid/ncbi_all_ST2_mlst/${sample}.1.fasta ./
            done
    
            cd typing_complete.csv
            grep -P "\tsepidermidis\t2\t" typing_complete.csv > typing_ST2.csv
    
            echo -e "AP029227\nCP013943\nCP040883\nCP045648\nCP052939\nCP052940\nCP052955\nCP052959\nCP052981\nCP052984\nCP052990\nCP052994\nCP053088\nCP064461\nCP064467\nCP064473\nCP064476\nCP064480\nCP064485\nCP064499\nCP064508\nCP064516\nCP064525\nCP064554\nCP064557\nCP064560\nCP064572\nCP064574\nCP064587\nCP064590\nCP064599\nCP064603\nCP064606\nCP064609\nCP064613\nCP064616\nCP064619\nCP064624\nCP064631\nCP064637\nCP064650\nCP073821\nCP073824\nCP073827\nCP073830\nCP073835\nCP073836\nCP073840\nCP073841\nCP073844\nCP073847\nCP073850\nCP073852\nCP073855\nCP073857\nCP073859\nCP073862\nCP073863\nCP073878\nCP073900\nCP073904\nCP084008\nCP093173\nCP093179\nCP093181\nCP093193\nCP093196\nCP093198\nCP093212\nCP097512\nCP097514\nCP120425\nCP133663" > ids.txt
    
            cat ids.txt | while read id; do
                efetch -db nuccore -id $id -format gff3 > "${id}.gff"
                efetch -db nuccore -id $id -format fasta > "${id}.fasta"
                # Append the FASTA sequence to the GFF3 file with ##FASTA header
                echo "##FASTA" >> "${id}.gff"
                cat "${id}.fasta" >> "${id}.gff"
                rm "${id}.fasta"  # Optionally remove the separate FASTA file
            done
    
            for sample in mibi1440 mibi1454 mibi1456 mibi1459 mibi1461 mibi1478 mibi1480 mibi1485 mibi1487 mibi1496 mibi1498 mibi1502 mibi2318 mibi2319; do
                cp ../prokka/${sample}/${sample}.gff ./
            done
    
    Example with Roary:
    
            roary -f roary_ST2 -e --mafft -p 100 roary_input_ST2/*.gff
    
            #roary -p 15 -f ./roary -i 95 -cd 99 -s -e -n -v roary_.fa.gb.gff CP012351.gff3 CP003084.gff3 CP023676.gff3
            -i        minimum percentage identity for blastp [95]
            -cd FLOAT percentage of isolates a gene must be in to be core [99]
            -f STR    output directory [.]
            -e        create a multiFASTA alignment of core genes using PRANK
            -n (==--mafft)       fast core gene alignment with MAFFT, use with -e
    
            -s        dont split paralogs
            -v        verbose output to STDOUT
    
    The file core_gene_alignment.aln can then be used for tree construction:
    
            iqtree2 -s core_gene_alignment.aln -m MFP -bb 1000 -nt AUTO
            #Alignment was printed to core_gene_alignment.aln.uniqueseq.phy
            #For your convenience alignment with unique sequences printed to core_gene_alignment.aln.uniqueseq.phy
            #Create initial parsimony tree by phylogenetic likelihood library (PLL)... 0.030 seconds
            #Measuring multi-threading efficiency up to 128 CPU cores
            #Increase to 10 rounds for branch lengths
    
    By focusing on core genes, you ensure that all genomes contribute meaningful data to the phylogenetic analysis.
    
    The file that can be viewed as the phylogenetic tree among the generated files is:
    
        core_gene_alignment.aln.treefile
    
    This file typically contains the tree in a Newick format, which is a widely used format for representing phylogenetic trees.
  3. Public Database and Clinical Isolates

    To differentiate the two groups of genomes effectively while maintaining clarity and professionalism, you can use category names that reflect their source or origin. Here are some suggestions:
    Category Name Suggestions
    
        Source-based Names:
            Public Database and Clinical Isolates
            Public Genomes and Clinical Genomes
            Reference Genomes and Patient-derived Genomes
    
        Origin-based Names:
            External Sources and In-house Sequencing
            Global Genomes and Local Isolates
            Database Genomes and Clinical Strains
    
        Research Context Names:
            Published Genomes and Newly Sequenced Genomes
            Literature Genomes and Clinical Samples
            Archived Data and Current Study
    
        Abbreviated or Code-like Names:
            PUB and CLI (Public and Clinical)
            EXDB (External Database) and INHS (In-House Sequencing)
            PG (Public Genomes) and CI (Clinical Isolates)
    
    Tips for Choosing the Best Category Names
    
        Clarity: Ensure that the names are intuitive and easy for others to understand.
        Neutrality: Avoid names that might imply bias or inferiority (e.g., avoid "old" vs. "new").
        Context Appropriateness: Consider your audience. For a clinical study, terms like "Clinical Isolates" may resonate more, while "Public Database" works better for bioinformatics research.
    
    For instance:
    
        "Database" and "Clinical" could work well in a manuscript.
        "External" and "In-house" might be better for internal project tracking.
  4. copy contigs of own clinical isolates

    #extract all genes from the phages set as the column names, set the genome names from the public database and own isolates as row names.
    
    for sample in mibi1440 mibi1454 mibi1456 mibi1459 mibi1461 mibi1478 mibi1480 mibi1485 mibi1487 mibi1496 mibi1498 mibi1502 mibi2318 mibi2319; do
      cp ../shovill/${sample}/contigs.fa ${sample}.fasta
    done
  5. makeblastdb

    #MT880870_on_mibi1435.blastn: 2366
    #MT880870_on_mibi1436.blastn: 3198
    #MT880870_on_mibi1437.blastn: 3198
    
    # -lenth(MT880870)=34053
    rename 's/\.1\.fasta$/.fasta/' *.1.fasta
    for sample in mibi1440 mibi1454 mibi1456 mibi1459 mibi1461 mibi1478 mibi1480 mibi1485 mibi1487 mibi1496 mibi1498 mibi1502 mibi2318 mibi2319 AP029227 CP013943 CP040883 CP045648 CP052939 CP052940 CP052955 CP052959 CP052981 CP052984 CP052990 CP052994 CP053088 CP064461 CP064467 CP064473 CP064476 CP064480 CP064485 CP064499 CP064508 CP064516 CP064525 CP064554 CP064557 CP064560 CP064572 CP064574 CP064587 CP064590 CP064599 CP064603 CP064606 CP064609 CP064613 CP064616 CP064619 CP064624 CP064631 CP064637 CP064650 CP073821 CP073824 CP073827 CP073830 CP073835 CP073836 CP073840 CP073841 CP073844 CP073847 CP073850 CP073852 CP073855 CP073857 CP073859 CP073862 CP073863 CP073878 CP073900 CP073904 CP084008 CP093173 CP093179 CP093181 CP093193 CP093196 CP093198 CP093212 CP097512 CP097514 CP120425 CP133663; do
        makeblastdb -in ${sample}.fasta -dbtype nucl
    done
  6. prepare typing_ST2_until_QPB07622.csv

    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07572.1_1" > QPB07572.fasta
    mkdir QPB07572
    for sample in mibi1440 mibi1454 mibi1456 mibi1459 mibi1461 mibi1478 mibi1480 mibi1485 mibi1487 mibi1496 mibi1498 mibi1502 mibi2318 mibi2319 AP029227 CP013943 CP040883 CP045648 CP052939 CP052940 CP052955 CP052959 CP052981 CP052984 CP052990 CP052994 CP053088 CP064461 CP064467 CP064473 CP064476 CP064480 CP064485 CP064499 CP064508 CP064516 CP064525 CP064554 CP064557 CP064560 CP064572 CP064574 CP064587 CP064590 CP064599 CP064603 CP064606 CP064609 CP064613 CP064616 CP064619 CP064624 CP064631 CP064637 CP064650 CP073821 CP073824 CP073827 CP073830 CP073835 CP073836 CP073840 CP073841 CP073844 CP073847 CP073850 CP073852 CP073855 CP073857 CP073859 CP073862 CP073863 CP073878 CP073900 CP073904 CP084008 CP093173 CP093179 CP093181 CP093193 CP093196 CP093198 CP093212 CP097512 CP097514 CP120425 CP133663; do
        blastn -db ${sample}.fasta -query QPB07572.fasta -evalue 1e-50 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 > ./QPB07572/${sample}.blastn
    done
    python ~/Scripts/process_directory.py /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/presence_absence_ST2/QPB07572 /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/presence_absence_ST2/typing_ST2.csv
    
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07573.1_2" > QPB07573.fasta
    mkdir QPB07573
    for sample in mibi1440 mibi1454 mibi1456 mibi1459 mibi1461 mibi1478 mibi1480 mibi1485 mibi1487 mibi1496 mibi1498 mibi1502 mibi2318 mibi2319 AP029227 CP013943 CP040883 CP045648 CP052939 CP052940 CP052955 CP052959 CP052981 CP052984 CP052990 CP052994 CP053088 CP064461 CP064467 CP064473 CP064476 CP064480 CP064485 CP064499 CP064508 CP064516 CP064525 CP064554 CP064557 CP064560 CP064572 CP064574 CP064587 CP064590 CP064599 CP064603 CP064606 CP064609 CP064613 CP064616 CP064619 CP064624 CP064631 CP064637 CP064650 CP073821 CP073824 CP073827 CP073830 CP073835 CP073836 CP073840 CP073841 CP073844 CP073847 CP073850 CP073852 CP073855 CP073857 CP073859 CP073862 CP073863 CP073878 CP073900 CP073904 CP084008 CP093173 CP093179 CP093181 CP093193 CP093196 CP093198 CP093212 CP097512 CP097514 CP120425 CP133663; do
        blastn -db ${sample}.fasta -query QPB07573.fasta -evalue 1e-50 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 > ./QPB07573/${sample}.blastn
    done
    python ~/Scripts/process_directory.py /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/presence_absence_ST2/QPB07573 /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/presence_absence_ST2/updated_typing_ST2.csv
    
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07574.1_3" > QPB07574.fasta
    mkdir QPB07574
    for sample in mibi1440 mibi1454 mibi1456 mibi1459 mibi1461 mibi1478 mibi1480 mibi1485 mibi1487 mibi1496 mibi1498 mibi1502 mibi2318 mibi2319 AP029227 CP013943 CP040883 CP045648 CP052939 CP052940 CP052955 CP052959 CP052981 CP052984 CP052990 CP052994 CP053088 CP064461 CP064467 CP064473 CP064476 CP064480 CP064485 CP064499 CP064508 CP064516 CP064525 CP064554 CP064557 CP064560 CP064572 CP064574 CP064587 CP064590 CP064599 CP064603 CP064606 CP064609 CP064613 CP064616 CP064619 CP064624 CP064631 CP064637 CP064650 CP073821 CP073824 CP073827 CP073830 CP073835 CP073836 CP073840 CP073841 CP073844 CP073847 CP073850 CP073852 CP073855 CP073857 CP073859 CP073862 CP073863 CP073878 CP073900 CP073904 CP084008 CP093173 CP093179 CP093181 CP093193 CP093196 CP093198 CP093212 CP097512 CP097514 CP120425 CP133663; do
        blastn -db ${sample}.fasta -query QPB07574.fasta -evalue 1e-50 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 > ./QPB07574/${sample}.blastn
    done
    python ~/Scripts/process_directory.py /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/presence_absence_ST2/QPB07574 /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/presence_absence_ST2/updated_updated_typing_ST2.csv
    
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07575.1_4" > QPB07575.fasta
    mkdir QPB07575
    for sample in mibi1440 mibi1454 mibi1456 mibi1459 mibi1461 mibi1478 mibi1480 mibi1485 mibi1487 mibi1496 mibi1498 mibi1502 mibi2318 mibi2319 AP029227 CP013943 CP040883 CP045648 CP052939 CP052940 CP052955 CP052959 CP052981 CP052984 CP052990 CP052994 CP053088 CP064461 CP064467 CP064473 CP064476 CP064480 CP064485 CP064499 CP064508 CP064516 CP064525 CP064554 CP064557 CP064560 CP064572 CP064574 CP064587 CP064590 CP064599 CP064603 CP064606 CP064609 CP064613 CP064616 CP064619 CP064624 CP064631 CP064637 CP064650 CP073821 CP073824 CP073827 CP073830 CP073835 CP073836 CP073840 CP073841 CP073844 CP073847 CP073850 CP073852 CP073855 CP073857 CP073859 CP073862 CP073863 CP073878 CP073900 CP073904 CP084008 CP093173 CP093179 CP093181 CP093193 CP093196 CP093198 CP093212 CP097512 CP097514 CP120425 CP133663; do
        blastn -db ${sample}.fasta -query QPB07575.fasta -evalue 1e-50 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 > ./QPB07575/${sample}.blastn
    done
    python ~/Scripts/process_directory.py /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/presence_absence_ST2/QPB07575 /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/presence_absence_ST2/updated_updated_updated_typing_ST2.csv
    
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07576.1_5" > QPB07576.fasta
    mkdir QPB07576
    for sample in mibi1440 mibi1454 mibi1456 mibi1459 mibi1461 mibi1478 mibi1480 mibi1485 mibi1487 mibi1496 mibi1498 mibi1502 mibi2318 mibi2319 AP029227 CP013943 CP040883 CP045648 CP052939 CP052940 CP052955 CP052959 CP052981 CP052984 CP052990 CP052994 CP053088 CP064461 CP064467 CP064473 CP064476 CP064480 CP064485 CP064499 CP064508 CP064516 CP064525 CP064554 CP064557 CP064560 CP064572 CP064574 CP064587 CP064590 CP064599 CP064603 CP064606 CP064609 CP064613 CP064616 CP064619 CP064624 CP064631 CP064637 CP064650 CP073821 CP073824 CP073827 CP073830 CP073835 CP073836 CP073840 CP073841 CP073844 CP073847 CP073850 CP073852 CP073855 CP073857 CP073859 CP073862 CP073863 CP073878 CP073900 CP073904 CP084008 CP093173 CP093179 CP093181 CP093193 CP093196 CP093198 CP093212 CP097512 CP097514 CP120425 CP133663; do
        blastn -db ${sample}.fasta -query QPB07576.fasta -evalue 1e-50 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 > ./QPB07576/${sample}.blastn
    done
    python ~/Scripts/process_directory.py /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/presence_absence_ST2/QPB07576 /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/presence_absence_ST2/updated_updated_updated_updated_typing_ST2.csv
    
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07577.1_6" > QPB07577.fasta
    mkdir QPB07577
    for sample in mibi1440 mibi1454 mibi1456 mibi1459 mibi1461 mibi1478 mibi1480 mibi1485 mibi1487 mibi1496 mibi1498 mibi1502 mibi2318 mibi2319 AP029227 CP013943 CP040883 CP045648 CP052939 CP052940 CP052955 CP052959 CP052981 CP052984 CP052990 CP052994 CP053088 CP064461 CP064467 CP064473 CP064476 CP064480 CP064485 CP064499 CP064508 CP064516 CP064525 CP064554 CP064557 CP064560 CP064572 CP064574 CP064587 CP064590 CP064599 CP064603 CP064606 CP064609 CP064613 CP064616 CP064619 CP064624 CP064631 CP064637 CP064650 CP073821 CP073824 CP073827 CP073830 CP073835 CP073836 CP073840 CP073841 CP073844 CP073847 CP073850 CP073852 CP073855 CP073857 CP073859 CP073862 CP073863 CP073878 CP073900 CP073904 CP084008 CP093173 CP093179 CP093181 CP093193 CP093196 CP093198 CP093212 CP097512 CP097514 CP120425 CP133663; do
        blastn -db ${sample}.fasta -query QPB07577.fasta -evalue 1e-50 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 > ./QPB07577/${sample}.blastn
    done
    python ~/Scripts/process_directory.py /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/presence_absence_ST2/QPB07577 /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/presence_absence_ST2/updated_updated_updated_updated_updated_typing_ST2.csv
    
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07578.1_7" > QPB07578.fasta
    mkdir QPB07578
    for sample in mibi1440 mibi1454 mibi1456 mibi1459 mibi1461 mibi1478 mibi1480 mibi1485 mibi1487 mibi1496 mibi1498 mibi1502 mibi2318 mibi2319 AP029227 CP013943 CP040883 CP045648 CP052939 CP052940 CP052955 CP052959 CP052981 CP052984 CP052990 CP052994 CP053088 CP064461 CP064467 CP064473 CP064476 CP064480 CP064485 CP064499 CP064508 CP064516 CP064525 CP064554 CP064557 CP064560 CP064572 CP064574 CP064587 CP064590 CP064599 CP064603 CP064606 CP064609 CP064613 CP064616 CP064619 CP064624 CP064631 CP064637 CP064650 CP073821 CP073824 CP073827 CP073830 CP073835 CP073836 CP073840 CP073841 CP073844 CP073847 CP073850 CP073852 CP073855 CP073857 CP073859 CP073862 CP073863 CP073878 CP073900 CP073904 CP084008 CP093173 CP093179 CP093181 CP093193 CP093196 CP093198 CP093212 CP097512 CP097514 CP120425 CP133663; do
        blastn -db ${sample}.fasta -query QPB07578.fasta -evalue 1e-50 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 > ./QPB07578/${sample}.blastn
    done
    python ~/Scripts/process_directory.py /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/presence_absence_ST2/QPB07578 /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/presence_absence_ST2/updated_updated_updated_updated_updated_updated_typing_ST2.csv
    
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07579.1_8" > QPB07579.fasta
    mkdir QPB07579
    for sample in mibi1440 mibi1454 mibi1456 mibi1459 mibi1461 mibi1478 mibi1480 mibi1485 mibi1487 mibi1496 mibi1498 mibi1502 mibi2318 mibi2319 AP029227 CP013943 CP040883 CP045648 CP052939 CP052940 CP052955 CP052959 CP052981 CP052984 CP052990 CP052994 CP053088 CP064461 CP064467 CP064473 CP064476 CP064480 CP064485 CP064499 CP064508 CP064516 CP064525 CP064554 CP064557 CP064560 CP064572 CP064574 CP064587 CP064590 CP064599 CP064603 CP064606 CP064609 CP064613 CP064616 CP064619 CP064624 CP064631 CP064637 CP064650 CP073821 CP073824 CP073827 CP073830 CP073835 CP073836 CP073840 CP073841 CP073844 CP073847 CP073850 CP073852 CP073855 CP073857 CP073859 CP073862 CP073863 CP073878 CP073900 CP073904 CP084008 CP093173 CP093179 CP093181 CP093193 CP093196 CP093198 CP093212 CP097512 CP097514 CP120425 CP133663; do
        blastn -db ${sample}.fasta -query QPB07579.fasta -evalue 1e-50 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 > ./QPB07579/${sample}.blastn
    done
    python ~/Scripts/process_directory.py /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/presence_absence_ST2/QPB07579 /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/presence_absence_ST2/updated_updated_updated_updated_updated_updated_updated_typing_ST2.csv
    
    grep ">" MT880870.fasta > temp
    cut -d' ' -f1 temp > temp1
    
    #    region="lcl|MT880870.1_cds_${id}.1_$((i - 78))"  # Adjust index calculation for region
    #    echo "samtools faidx MT880870.fasta \"$region\" > ${id}.fasta"
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07580.1_9" > QPB07580.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07581.1_10" > QPB07581.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07582.1_11" > QPB07582.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07583.1_12" > QPB07583.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07584.1_13" > QPB07584.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07585.1_14" > QPB07585.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07586.1_15" > QPB07586.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07587.1_16" > QPB07587.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07588.1_17" > QPB07588.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07589.1_18" > QPB07589.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07590.1_19" > QPB07590.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07591.1_20" > QPB07591.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07592.1_21" > QPB07592.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07593.1_22" > QPB07593.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07594.1_23" > QPB07594.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07595.1_24" > QPB07595.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07596.1_25" > QPB07596.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07597.1_26" > QPB07597.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07598.1_27" > QPB07598.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07599.1_28" > QPB07599.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07600.1_29" > QPB07600.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07601.1_30" > QPB07601.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07602.1_31" > QPB07602.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07603.1_32" > QPB07603.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07604.1_33" > QPB07604.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07605.1_34" > QPB07605.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07606.1_35" > QPB07606.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07607.1_36" > QPB07607.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07608.1_37" > QPB07608.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07609.1_38" > QPB07609.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07610.1_39" > QPB07610.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07611.1_40" > QPB07611.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07612.1_41" > QPB07612.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07613.1_42" > QPB07613.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07614.1_43" > QPB07614.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07615.1_44" > QPB07615.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07616.1_45" > QPB07616.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07617.1_46" > QPB07617.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07618.1_47" > QPB07618.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07619.1_48" > QPB07619.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07620.1_49" > QPB07620.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07621.1_50" > QPB07621.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07622.1_51" > QPB07622.fasta
    
    for i in {580..580}; do
        id=$(printf "QPB07%03d" "$i")
        echo "mkdir ${id}"
        echo "for sample in mibi1440 mibi1454 mibi1456 mibi1459 mibi1461 mibi1478 mibi1480 mibi1485 mibi1487 mibi1496 mibi1498 mibi1502 mibi2318 mibi2319 AP029227 CP013943 CP040883 CP045648 CP052939 CP052940 CP052955 CP052959 CP052981 CP052984 CP052990 CP052994 CP053088 CP064461 CP064467 CP064473 CP064476 CP064480 CP064485 CP064499 CP064508 CP064516 CP064525 CP064554 CP064557 CP064560 CP064572 CP064574 CP064587 CP064590 CP064599 CP064603 CP064606 CP064609 CP064613 CP064616 CP064619 CP064624 CP064631 CP064637 CP064650 CP073821 CP073824 CP073827 CP073830 CP073835 CP073836 CP073840 CP073841 CP073844 CP073847 CP073850 CP073852 CP073855 CP073857 CP073859 CP073862 CP073863 CP073878 CP073900 CP073904 CP084008 CP093173 CP093179 CP093181 CP093193 CP093196 CP093198 CP093212 CP097512 CP097514 CP120425 CP133663; do"
            echo "blastn -db \${sample}.fasta -query ${id}.fasta -evalue 1e-50 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 > ./${id}/\${sample}.blastn"
        echo "done"
    done
    python ~/Scripts/process_directory.py /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/presence_absence_ST2/QPB07580 /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/presence_absence_ST2/updated_updated_updated_updated_updated_updated_updated_updated_typing_ST2.csv /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/presence_absence_ST2/typing_ST2_until_QPB07580.csv
    
    for i in {581..622}; do
        id_1=$(printf "QPB07%03d" "$((i-1))")
        id=$(printf "QPB07%03d" "$i")
        echo "mkdir ${id}"
        echo "for sample in mibi1440 mibi1454 mibi1456 mibi1459 mibi1461 mibi1478 mibi1480 mibi1485 mibi1487 mibi1496 mibi1498 mibi1502 mibi2318 mibi2319 AP029227 CP013943 CP040883 CP045648 CP052939 CP052940 CP052955 CP052959 CP052981 CP052984 CP052990 CP052994 CP053088 CP064461 CP064467 CP064473 CP064476 CP064480 CP064485 CP064499 CP064508 CP064516 CP064525 CP064554 CP064557 CP064560 CP064572 CP064574 CP064587 CP064590 CP064599 CP064603 CP064606 CP064609 CP064613 CP064616 CP064619 CP064624 CP064631 CP064637 CP064650 CP073821 CP073824 CP073827 CP073830 CP073835 CP073836 CP073840 CP073841 CP073844 CP073847 CP073850 CP073852 CP073855 CP073857 CP073859 CP073862 CP073863 CP073878 CP073900 CP073904 CP084008 CP093173 CP093179 CP093181 CP093193 CP093196 CP093198 CP093212 CP097512 CP097514 CP120425 CP133663; do"
            echo "blastn -db \${sample}.fasta -query ${id}.fasta -evalue 1e-50 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 > ./${id}/\${sample}.blastn"
        echo "done"
        echo "python ~/Scripts/process_directory.py /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/presence_absence_ST2/${id} /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/presence_absence_ST2/typing_ST2_until_${id_1}.csv /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/presence_absence_ST2/typing_ST2_until_${id}.csv"
    done
    
    cp typing_ST2_until_QPB07622.csv ../plotTreeHeatmap_ST2/typing_ST2.csv
  7. plot tree heatmap

    library(ggtree)
    library(ggplot2)
    library(dplyr)
    setwd("/home/jhuang/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_ST2/")
    # -- edit tree --
    #https://icytree.org/
    #0.000780
    info <- read.csv("typing_ST2.csv", sep="\t")
    info$name <- info$Isolate
    tree <- read.tree("core_gene_alignment.aln.treefile")
    cols <- c("Public database"='purple2', "Clinical isolate"='darksalmon')  #purple2  skyblue2
    
    heatmapData2 <- info %>% select(Isolate, QPB07572, QPB07573, QPB07574, QPB07575, QPB07576, QPB07577, QPB07578, QPB07579, QPB07580, QPB07581, QPB07582, QPB07583, QPB07584, QPB07585, QPB07586, QPB07587, QPB07588, QPB07589, QPB07590, QPB07591, QPB07592, QPB07593, QPB07594, QPB07595, QPB07596, QPB07597, QPB07598, QPB07599, QPB07600, QPB07601, QPB07602, QPB07603, QPB07604, QPB07605, QPB07606, QPB07607, QPB07608, QPB07609, QPB07610, QPB07611, QPB07612, QPB07613, QPB07614, QPB07615, QPB07616, QPB07617, QPB07618, QPB07619, QPB07620, QPB07621, QPB07622)  #ST,
    rn <- heatmapData2$Isolate
    heatmapData2$Isolate <- NULL
    heatmapData2 <- as.data.frame(sapply(heatmapData2, as.character))
    rownames(heatmapData2) <- rn
    
    heatmap.colours <- c("darkgreen","grey")
    names(heatmap.colours) <- c("+","-")
    
    #mydat$Regulation <- factor(mydat$Regulation, levels=c("up","down"))
    #circular
    p <- ggtree(tree, layout='circular', branch.length='none') %<+% info + geom_tippoint(aes(color=Source)) + scale_color_manual(values=cols) + geom_tiplab2(aes(label=name), offset=1)
    
    png("ggtree.png", width=1260, height=1260)
    #svg("ggtree.svg", width=1260, height=1260)
    p
    dev.off()
    
    #png("ggtree_and_gheatmap.png", width=1290, height=1000)
    #svg("ggtree_and_gheatmap.svg", width=17, height=15)
    #gheatmap(p, heatmapData2, width=0.1, colnames_position="top", colnames_angle=90, colnames_offset_y = 0.1, hjust=0.5, font.size=4, offset = 5) + scale_fill_manual(values=heatmap.colours) +  theme(legend.text = element_text(size = 14)) + theme(legend.title = element_text(size = 14)) + guides(fill=guide_legend(title=""), color = guide_legend(override.aes = list(size = 5)))
    
    png("ggtree_and_gheatmap.png", width=1290, height=1000)
    gheatmap(p, heatmapData2, width=0.5, colnames_position="top", colnames_angle=90, colnames_offset_y = 0.1, hjust=0.5, font.size=0, offset = 8) + scale_fill_manual(values=heatmap.colours) +  theme(legend.text = element_text(size = 16)) + theme(legend.title = element_text(size = 16)) + guides(fill=guide_legend(title=""), color = guide_legend(override.aes = list(size = 5)))
    dev.off()

Identify all occurrences of Phages MT880870, MT880871 and MT880872 in S. epidermidis ST2 genomes from public and clinical isolates

ggtree_and_gheatmap_ST2_phages

  1. prepare typing_ST2_until_QPB07622.csv under ~/DATA/Data_Luise_Sepi_STKN/presence_absence_ST2

    #grep ">" MT880870.fasta | cut -d' ' -f1 > headers.txt
    # -- under presence_absence_ST2 --
    samtools faidx MT880870.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07572.1_1" > QPB07572.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07573.1_2" > QPB07573.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07574.1_3" > QPB07574.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07575.1_4" > QPB07575.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07576.1_5" > QPB07576.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07577.1_6" > QPB07577.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07578.1_7" > QPB07578.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07579.1_8" > QPB07579.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07580.1_9" > QPB07580.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07581.1_10" > QPB07581.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07582.1_11" > QPB07582.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07583.1_12" > QPB07583.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07584.1_13" > QPB07584.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07585.1_14" > QPB07585.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07586.1_15" > QPB07586.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07587.1_16" > QPB07587.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07588.1_17" > QPB07588.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07589.1_18" > QPB07589.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07590.1_19" > QPB07590.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07591.1_20" > QPB07591.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07592.1_21" > QPB07592.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07593.1_22" > QPB07593.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07594.1_23" > QPB07594.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07595.1_24" > QPB07595.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07596.1_25" > QPB07596.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07597.1_26" > QPB07597.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07598.1_27" > QPB07598.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07599.1_28" > QPB07599.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07600.1_29" > QPB07600.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07601.1_30" > QPB07601.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07602.1_31" > QPB07602.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07603.1_32" > QPB07603.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07604.1_33" > QPB07604.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07605.1_34" > QPB07605.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07606.1_35" > QPB07606.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07607.1_36" > QPB07607.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07608.1_37" > QPB07608.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07609.1_38" > QPB07609.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07610.1_39" > QPB07610.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07611.1_40" > QPB07611.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07612.1_41" > QPB07612.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07613.1_42" > QPB07613.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07614.1_43" > QPB07614.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07615.1_44" > QPB07615.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07616.1_45" > QPB07616.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07617.1_46" > QPB07617.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07618.1_47" > QPB07618.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07619.1_48" > QPB07619.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07620.1_49" > QPB07620.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07621.1_50" > QPB07621.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07622.1_51" > QPB07622.fasta
    
    for i in {572..622}; do
        id_1=$(printf "QPB07%03d" "$((i-1))")
        id=$(printf "QPB07%03d" "$i")
        echo "mkdir ${id}"
        echo "for sample in mibi1440 mibi1454 mibi1456 mibi1459 mibi1461 mibi1478 mibi1480 mibi1485 mibi1487 mibi1496 mibi1498 mibi1502 mibi2318 mibi2319 AP029227 CP013943 CP040883 CP045648 CP052939 CP052940 CP052955 CP052959 CP052981 CP052984 CP052990 CP052994 CP053088 CP064461 CP064467 CP064473 CP064476 CP064480 CP064485 CP064499 CP064508 CP064516 CP064525 CP064554 CP064557 CP064560 CP064572 CP064574 CP064587 CP064590 CP064599 CP064603 CP064606 CP064609 CP064613 CP064616 CP064619 CP064624 CP064631 CP064637 CP064650 CP073821 CP073824 CP073827 CP073830 CP073835 CP073836 CP073840 CP073841 CP073844 CP073847 CP073850 CP073852 CP073855 CP073857 CP073859 CP073862 CP073863 CP073878 CP073900 CP073904 CP084008 CP093173 CP093179 CP093181 CP093193 CP093196 CP093198 CP093212 CP097512 CP097514 CP120425 CP133663; do"
            echo "blastn -db \${sample}.fasta -query ${id}.fasta -evalue 1e-50 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 > ./${id}/\${sample}.blastn"
        echo "done"
    
    #under plotTreeHeatmap_ST2
    for i in {572..622}; do
        echo "python ~/Scripts/process_directory.py /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/presence_absence_ST2/${id} /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/plotTreeHeatmap_ST2/typing_ST2_until_${id_1}.csv /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/plotTreeHeatmap_ST2/typing_ST2_until_${id}.csv"
    done
    
    #reprace all '+' with 'MT880870' in typing_ST2_until_QPB07622.csv
    cp ../presence_absence_ST2/typing_ST2_until_QPB07622.csv ./
    sed -i 's/+/MT880870/g' typing_ST2_until_QPB07622.csv
  2. prepare typing_ST2_until_QPB07667.csv under ~/DATA/Data_Luise_Sepi_STKN/presence_absence_ST2

    # -- under presence_absence_ST2 --
    samtools faidx MT880871.fasta
    #grep ">" MT880871.fasta | cut -d' ' -f1 > headers.txt
    samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07623.1_1" > QPB07623.fasta
    samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07624.1_2" > QPB07624.fasta
    samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07625.1_3" > QPB07625.fasta
    samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07626.1_4" > QPB07626.fasta
    samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07627.1_5" > QPB07627.fasta
    samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07628.1_6" > QPB07628.fasta
    samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07629.1_7" > QPB07629.fasta
    samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07630.1_8" > QPB07630.fasta
    samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07631.1_9" > QPB07631.fasta
    samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07632.1_10" > QPB07632.fasta
    samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07633.1_11" > QPB07633.fasta
    samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07634.1_12" > QPB07634.fasta
    samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07635.1_13" > QPB07635.fasta
    samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07636.1_14" > QPB07636.fasta
    samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07637.1_15" > QPB07637.fasta
    samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07638.1_16" > QPB07638.fasta
    samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07639.1_17" > QPB07639.fasta
    samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07640.1_18" > QPB07640.fasta
    samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07641.1_19" > QPB07641.fasta
    samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07642.1_20" > QPB07642.fasta
    samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07643.1_21" > QPB07643.fasta
    samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07644.1_22" > QPB07644.fasta
    samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07645.1_23" > QPB07645.fasta
    samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07646.1_24" > QPB07646.fasta
    samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07647.1_25" > QPB07647.fasta
    samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07648.1_26" > QPB07648.fasta
    samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07649.1_27" > QPB07649.fasta
    samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07650.1_28" > QPB07650.fasta
    samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07651.1_29" > QPB07651.fasta
    samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07652.1_30" > QPB07652.fasta
    samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07653.1_31" > QPB07653.fasta
    samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07654.1_32" > QPB07654.fasta
    samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07655.1_33" > QPB07655.fasta
    samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07656.1_34" > QPB07656.fasta
    samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07657.1_35" > QPB07657.fasta
    samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07658.1_36" > QPB07658.fasta
    samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07659.1_37" > QPB07659.fasta
    samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07660.1_38" > QPB07660.fasta
    samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07661.1_39" > QPB07661.fasta
    samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07662.1_40" > QPB07662.fasta
    samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07663.1_41" > QPB07663.fasta
    samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07664.1_42" > QPB07664.fasta
    samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07665.1_43" > QPB07665.fasta
    samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07666.1_44" > QPB07666.fasta
    samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07667.1_45" > QPB07667.fasta
    
    for i in {623..667}; do
        id=$(printf "QPB07%03d" "$i")
        echo "mkdir ${id}"
        echo "for sample in mibi1440 mibi1454 mibi1456 mibi1459 mibi1461 mibi1478 mibi1480 mibi1485 mibi1487 mibi1496 mibi1498 mibi1502 mibi2318 mibi2319 AP029227 CP013943 CP040883 CP045648 CP052939 CP052940 CP052955 CP052959 CP052981 CP052984 CP052990 CP052994 CP053088 CP064461 CP064467 CP064473 CP064476 CP064480 CP064485 CP064499 CP064508 CP064516 CP064525 CP064554 CP064557 CP064560 CP064572 CP064574 CP064587 CP064590 CP064599 CP064603 CP064606 CP064609 CP064613 CP064616 CP064619 CP064624 CP064631 CP064637 CP064650 CP073821 CP073824 CP073827 CP073830 CP073835 CP073836 CP073840 CP073841 CP073844 CP073847 CP073850 CP073852 CP073855 CP073857 CP073859 CP073862 CP073863 CP073878 CP073900 CP073904 CP084008 CP093173 CP093179 CP093181 CP093193 CP093196 CP093198 CP093212 CP097512 CP097514 CP120425 CP133663; do"
            echo "blastn -db \${sample}.fasta -query ${id}.fasta -evalue 1e-50 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 > ./${id}/\${sample}.blastn"
        echo "done"
    done
    
    for i in {660..660}; do
        id=$(printf "QPB07%03d" "$i")
        echo "mkdir ${id}"
        echo "for sample in mibi1440 mibi1454 mibi1456 mibi1459 mibi1461 mibi1478 mibi1480 mibi1485 mibi1487 mibi1496 mibi1498 mibi1502 mibi2318 mibi2319 AP029227 CP013943 CP040883 CP045648 CP052939 CP052940 CP052955 CP052959 CP052981 CP052984 CP052990 CP052994 CP053088 CP064461 CP064467 CP064473 CP064476 CP064480 CP064485 CP064499 CP064508 CP064516 CP064525 CP064554 CP064557 CP064560 CP064572 CP064574 CP064587 CP064590 CP064599 CP064603 CP064606 CP064609 CP064613 CP064616 CP064619 CP064624 CP064631 CP064637 CP064650 CP073821 CP073824 CP073827 CP073830 CP073835 CP073836 CP073840 CP073841 CP073844 CP073847 CP073850 CP073852 CP073855 CP073857 CP073859 CP073862 CP073863 CP073878 CP073900 CP073904 CP084008 CP093173 CP093179 CP093181 CP093193 CP093196 CP093198 CP093212 CP097512 CP097514 CP120425 CP133663; do"
            echo "blastn -db \${sample}.fasta -query ${id}.fasta -evalue 1e-10 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 > ./${id}/\${sample}.blastn"
        echo "done"
    done
    
    # -- under plotTreeHeatmap_ST2 --
    for i in {623..667}; do
        id_1=$(printf "QPB07%03d" "$((i-1))")
        id=$(printf "QPB07%03d" "$i")
        echo "python ~/Scripts/process_directory.py /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/presence_absence_ST2/${id} /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/plotTreeHeatmap_ST2/typing_ST2_until_${id_1}.csv /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/plotTreeHeatmap_ST2/typing_ST2_until_${id}.csv"
    done
    
    sed -i 's/+/MT880871/g' typing_ST2_until_QPB07667.csv
  3. prepare typing_ST2_until_QPB07667.csv under ~/DATA/Data_Luise_Sepi_STKN/presence_absence_ST2

    # -- under presence_absence_ST2 --
    samtools faidx MT880872.fasta
    #grep ">" MT880872.fasta | cut -d' ' -f1 > headers.txt
    
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07668.1_1" > QPB07668.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07669.1_2" > QPB07669.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07670.1_3" > QPB07670.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07671.1_4" > QPB07671.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07672.1_5" > QPB07672.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07673.1_6" > QPB07673.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07674.1_7" > QPB07674.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07675.1_8" > QPB07675.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07676.1_9" > QPB07676.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07677.1_10" > QPB07677.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07678.1_11" > QPB07678.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07679.1_12" > QPB07679.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07680.1_13" > QPB07680.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07681.1_14" > QPB07681.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07682.1_15" > QPB07682.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07683.1_16" > QPB07683.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07684.1_17" > QPB07684.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07685.1_18" > QPB07685.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07686.1_19" > QPB07686.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07687.1_20" > QPB07687.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07688.1_21" > QPB07688.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07689.1_22" > QPB07689.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07690.1_23" > QPB07690.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07691.1_24" > QPB07691.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07692.1_25" > QPB07692.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07693.1_26" > QPB07693.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07694.1_27" > QPB07694.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07695.1_28" > QPB07695.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07696.1_29" > QPB07696.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07697.1_30" > QPB07697.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07698.1_31" > QPB07698.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07699.1_32" > QPB07699.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07700.1_33" > QPB07700.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07701.1_34" > QPB07701.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07702.1_35" > QPB07702.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07703.1_36" > QPB07703.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07704.1_37" > QPB07704.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07705.1_38" > QPB07705.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07706.1_39" > QPB07706.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07707.1_40" > QPB07707.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07708.1_41" > QPB07708.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07709.1_42" > QPB07709.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07710.1_43" > QPB07710.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07711.1_44" > QPB07711.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07712.1_45" > QPB07712.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07713.1_46" > QPB07713.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07714.1_47" > QPB07714.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07715.1_48" > QPB07715.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07716.1_49" > QPB07716.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07717.1_50" > QPB07717.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07718.1_51" > QPB07718.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07719.1_52" > QPB07719.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07720.1_53" > QPB07720.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07721.1_54" > QPB07721.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07722.1_55" > QPB07722.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07723.1_56" > QPB07723.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07724.1_57" > QPB07724.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07725.1_58" > QPB07725.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07726.1_59" > QPB07726.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07727.1_60" > QPB07727.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07728.1_61" > QPB07728.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07729.1_62" > QPB07729.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07730.1_63" > QPB07730.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07731.1_64" > QPB07731.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07732.1_65" > QPB07732.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07733.1_66" > QPB07733.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07734.1_67" > QPB07734.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07735.1_68" > QPB07735.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07736.1_69" > QPB07736.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07737.1_70" > QPB07737.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07738.1_71" > QPB07738.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07739.1_72" > QPB07739.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07740.1_73" > QPB07740.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07741.1_74" > QPB07741.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07742.1_75" > QPB07742.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07743.1_76" > QPB07743.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07744.1_77" > QPB07744.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07745.1_78" > QPB07745.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07746.1_79" > QPB07746.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07747.1_80" > QPB07747.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07748.1_81" > QPB07748.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07749.1_82" > QPB07749.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07750.1_83" > QPB07750.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07751.1_84" > QPB07751.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07752.1_85" > QPB07752.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07753.1_86" > QPB07753.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07754.1_87" > QPB07754.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07755.1_88" > QPB07755.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07756.1_89" > QPB07756.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07757.1_90" > QPB07757.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07758.1_91" > QPB07758.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07759.1_92" > QPB07759.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07760.1_93" > QPB07760.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07761.1_94" > QPB07761.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07762.1_95" > QPB07762.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07763.1_96" > QPB07763.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07764.1_97" > QPB07764.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07765.1_98" > QPB07765.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07766.1_99" > QPB07766.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07767.1_100" > QPB07767.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07768.1_101" > QPB07768.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07769.1_102" > QPB07769.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07770.1_103" > QPB07770.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07771.1_104" > QPB07771.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07772.1_105" > QPB07772.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07773.1_106" > QPB07773.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07774.1_107" > QPB07774.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07775.1_108" > QPB07775.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07776.1_109" > QPB07776.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07777.1_110" > QPB07777.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07778.1_111" > QPB07778.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07779.1_112" > QPB07779.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07780.1_113" > QPB07780.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07781.1_114" > QPB07781.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07782.1_115" > QPB07782.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07783.1_116" > QPB07783.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07784.1_117" > QPB07784.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07785.1_118" > QPB07785.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07786.1_119" > QPB07786.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07787.1_120" > QPB07787.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07788.1_121" > QPB07788.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07789.1_122" > QPB07789.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07790.1_123" > QPB07790.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07791.1_124" > QPB07791.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07792.1_125" > QPB07792.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07793.1_126" > QPB07793.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07794.1_127" > QPB07794.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07795.1_128" > QPB07795.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07796.1_129" > QPB07796.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07797.1_130" > QPB07797.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07798.1_131" > QPB07798.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07799.1_132" > QPB07799.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07800.1_133" > QPB07800.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07801.1_134" > QPB07801.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07802.1_135" > QPB07802.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07803.1_136" > QPB07803.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07804.1_137" > QPB07804.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07805.1_138" > QPB07805.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07806.1_139" > QPB07806.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07807.1_140" > QPB07807.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07808.1_141" > QPB07808.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07809.1_142" > QPB07809.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07810.1_143" > QPB07810.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07811.1_144" > QPB07811.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07812.1_145" > QPB07812.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07813.1_146" > QPB07813.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07814.1_147" > QPB07814.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07815.1_148" > QPB07815.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07816.1_149" > QPB07816.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07817.1_150" > QPB07817.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07818.1_151" > QPB07818.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07819.1_152" > QPB07819.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07820.1_153" > QPB07820.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07821.1_154" > QPB07821.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07822.1_155" > QPB07822.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07823.1_156" > QPB07823.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07824.1_157" > QPB07824.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07825.1_158" > QPB07825.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07826.1_159" > QPB07826.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07827.1_160" > QPB07827.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07828.1_161" > QPB07828.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07829.1_162" > QPB07829.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07830.1_163" > QPB07830.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07831.1_164" > QPB07831.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07832.1_165" > QPB07832.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07833.1_166" > QPB07833.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07834.1_167" > QPB07834.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07835.1_168" > QPB07835.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07836.1_169" > QPB07836.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07837.1_170" > QPB07837.fasta
    
    for i in {668..750}; do
        id=$(printf "QPB07%03d" "$i")
        echo "mkdir ${id}"
        echo "for sample in mibi1440 mibi1454 mibi1456 mibi1459 mibi1461 mibi1478 mibi1480 mibi1485 mibi1487 mibi1496 mibi1498 mibi1502 mibi2318 mibi2319 AP029227 CP013943 CP040883 CP045648 CP052939 CP052940 CP052955 CP052959 CP052981 CP052984 CP052990 CP052994 CP053088 CP064461 CP064467 CP064473 CP064476 CP064480 CP064485 CP064499 CP064508 CP064516 CP064525 CP064554 CP064557 CP064560 CP064572 CP064574 CP064587 CP064590 CP064599 CP064603 CP064606 CP064609 CP064613 CP064616 CP064619 CP064624 CP064631 CP064637 CP064650 CP073821 CP073824 CP073827 CP073830 CP073835 CP073836 CP073840 CP073841 CP073844 CP073847 CP073850 CP073852 CP073855 CP073857 CP073859 CP073862 CP073863 CP073878 CP073900 CP073904 CP084008 CP093173 CP093179 CP093181 CP093193 CP093196 CP093198 CP093212 CP097512 CP097514 CP120425 CP133663; do"
            echo "blastn -db \${sample}.fasta -query ${id}.fasta -evalue 1e-50 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 > ./${id}/\${sample}.blastn"
        echo "done"
    done
    
    for i in {751..837}; do
        id=$(printf "QPB07%03d" "$i")
        echo "mkdir ${id}"
        echo "for sample in mibi1440 mibi1454 mibi1456 mibi1459 mibi1461 mibi1478 mibi1480 mibi1485 mibi1487 mibi1496 mibi1498 mibi1502 mibi2318 mibi2319 AP029227 CP013943 CP040883 CP045648 CP052939 CP052940 CP052955 CP052959 CP052981 CP052984 CP052990 CP052994 CP053088 CP064461 CP064467 CP064473 CP064476 CP064480 CP064485 CP064499 CP064508 CP064516 CP064525 CP064554 CP064557 CP064560 CP064572 CP064574 CP064587 CP064590 CP064599 CP064603 CP064606 CP064609 CP064613 CP064616 CP064619 CP064624 CP064631 CP064637 CP064650 CP073821 CP073824 CP073827 CP073830 CP073835 CP073836 CP073840 CP073841 CP073844 CP073847 CP073850 CP073852 CP073855 CP073857 CP073859 CP073862 CP073863 CP073878 CP073900 CP073904 CP084008 CP093173 CP093179 CP093181 CP093193 CP093196 CP093198 CP093212 CP097512 CP097514 CP120425 CP133663; do"
            echo "blastn -db \${sample}.fasta -query ${id}.fasta -evalue 1e-50 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 > ./${id}/\${sample}.blastn"
        echo "done"
    done
    
        for i in {708..708}; do
        id=$(printf "QPB07%03d" "$i")
        echo "mkdir ${id}"
        echo "for sample in mibi1440 mibi1454 mibi1456 mibi1459 mibi1461 mibi1478 mibi1480 mibi1485 mibi1487 mibi1496 mibi1498 mibi1502 mibi2318 mibi2319 AP029227 CP013943 CP040883 CP045648 CP052939 CP052940 CP052955 CP052959 CP052981 CP052984 CP052990 CP052994 CP053088 CP064461 CP064467 CP064473 CP064476 CP064480 CP064485 CP064499 CP064508 CP064516 CP064525 CP064554 CP064557 CP064560 CP064572 CP064574 CP064587 CP064590 CP064599 CP064603 CP064606 CP064609 CP064613 CP064616 CP064619 CP064624 CP064631 CP064637 CP064650 CP073821 CP073824 CP073827 CP073830 CP073835 CP073836 CP073840 CP073841 CP073844 CP073847 CP073850 CP073852 CP073855 CP073857 CP073859 CP073862 CP073863 CP073878 CP073900 CP073904 CP084008 CP093173 CP093179 CP093181 CP093193 CP093196 CP093198 CP093212 CP097512 CP097514 CP120425 CP133663; do"
            echo "blastn -db \${sample}.fasta -query ${id}.fasta -evalue 1e-10 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 > ./${id}/\${sample}.blastn"
        echo "done"
    done
    
    # -- under plotTreeHeatmap_ST2 --
    for i in {668..837}; do
        id_1=$(printf "QPB07%03d" "$((i-1))")
        id=$(printf "QPB07%03d" "$i")
        echo "python ~/Scripts/process_directory.py /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/presence_absence_ST2/${id} /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/plotTreeHeatmap_ST2/typing_ST2_until_${id_1}.csv /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/plotTreeHeatmap_ST2/typing_ST2_until_${id}.csv"
    done
    
    sed -i 's/+/MT880872/g' typing_ST2_until_QPB07837.csv
  4. plot tree heatmap under /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/plotTreeHeatmap_ST2_MT880871

    library(ggtree)
    library(ggplot2)
    library(dplyr)
    setwd("/home/jhuang/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_ST2/")
    # -- edit tree --
    #https://icytree.org/
    #0.000780
    info <- read.csv("typing_ST2_until_QPB07837.csv", sep="\t")
    info$name <- info$Isolate
    tree <- read.tree("core_gene_alignment.aln.treefile")
    cols <- c("Public database"='purple2', "Clinical isolate"='darksalmon')  #purple2  skyblue2
    heatmapData2 <- info %>% select(Isolate, QPB07572, QPB07573, QPB07574, QPB07575, QPB07576, QPB07577, QPB07578, QPB07579, QPB07580, QPB07581, QPB07582, QPB07583, QPB07584, QPB07585, QPB07586, QPB07587, QPB07588, QPB07589, QPB07590, QPB07591, QPB07592, QPB07593, QPB07594, QPB07595, QPB07596, QPB07597, QPB07598, QPB07599, QPB07600, QPB07601, QPB07602, QPB07603, QPB07604, QPB07605, QPB07606, QPB07607, QPB07608, QPB07609, QPB07610, QPB07611, QPB07612, QPB07613, QPB07614, QPB07615, QPB07616, QPB07617, QPB07618, QPB07619, QPB07620, QPB07621, QPB07622, QPB07623, QPB07624, QPB07625, QPB07626, QPB07627, QPB07628, QPB07629, QPB07630, QPB07631, QPB07632, QPB07633, QPB07634, QPB07635, QPB07636, QPB07637, QPB07638, QPB07639, QPB07640, QPB07641, QPB07642, QPB07643, QPB07644, QPB07645, QPB07646, QPB07647, QPB07648, QPB07649, QPB07650, QPB07651, QPB07652, QPB07653, QPB07654, QPB07655, QPB07656, QPB07657, QPB07658, QPB07659, QPB07660, QPB07661, QPB07662, QPB07663, QPB07664, QPB07665, QPB07666, QPB07667, QPB07668, QPB07669, QPB07670, QPB07671, QPB07672, QPB07673, QPB07674, QPB07675, QPB07676, QPB07677, QPB07678, QPB07679, QPB07680, QPB07681, QPB07682, QPB07683, QPB07684, QPB07685, QPB07686, QPB07687, QPB07688, QPB07689, QPB07690, QPB07691, QPB07692, QPB07693, QPB07694, QPB07695, QPB07696, QPB07697, QPB07698, QPB07699, QPB07700, QPB07701, QPB07702, QPB07703, QPB07704, QPB07705, QPB07706, QPB07707, QPB07708, QPB07709, QPB07710, QPB07711, QPB07712, QPB07713, QPB07714, QPB07715, QPB07716, QPB07717, QPB07718, QPB07719, QPB07720, QPB07721, QPB07722, QPB07723, QPB07724, QPB07725, QPB07726, QPB07727, QPB07728, QPB07729, QPB07730, QPB07731, QPB07732, QPB07733, QPB07734, QPB07735, QPB07736, QPB07737, QPB07738, QPB07739, QPB07740, QPB07741, QPB07742, QPB07743, QPB07744, QPB07745, QPB07746, QPB07747, QPB07748, QPB07749, QPB07750, QPB07751, QPB07752, QPB07753, QPB07754, QPB07755, QPB07756, QPB07757, QPB07758, QPB07759, QPB07760, QPB07761, QPB07762, QPB07763, QPB07764, QPB07765, QPB07766, QPB07767, QPB07768, QPB07769, QPB07770, QPB07771, QPB07772, QPB07773, QPB07774, QPB07775, QPB07776, QPB07777, QPB07778, QPB07779, QPB07780, QPB07781, QPB07782, QPB07783, QPB07784, QPB07785, QPB07786, QPB07787, QPB07788, QPB07789, QPB07790, QPB07791, QPB07792, QPB07793, QPB07794, QPB07795, QPB07796, QPB07797, QPB07798, QPB07799, QPB07800, QPB07801, QPB07802, QPB07803, QPB07804, QPB07805, QPB07806, QPB07807, QPB07808, QPB07809, QPB07810, QPB07811, QPB07812, QPB07813, QPB07814, QPB07815, QPB07816, QPB07817, QPB07818, QPB07819, QPB07820, QPB07821, QPB07822, QPB07823, QPB07824, QPB07825, QPB07826, QPB07827, QPB07828, QPB07829, QPB07830, QPB07831, QPB07832, QPB07833, QPB07834, QPB07835, QPB07836, QPB07837)  #ST,
    rn <- heatmapData2$Isolate
    heatmapData2$Isolate <- NULL
    heatmapData2 <- as.data.frame(sapply(heatmapData2, as.character))
    rownames(heatmapData2) <- rn
    heatmap.colours <- c("darkred", "darkblue", "darkgreen", "grey")
    names(heatmap.colours) <- c("MT880870","MT880871","MT880872","-")
    #mydat$Regulation <- factor(mydat$Regulation, levels=c("up","down"))
    #circular
    p <- ggtree(tree, layout='circular', branch.length='none') %<+% info + geom_tippoint(aes(color=Source)) + scale_color_manual(values=cols) + geom_tiplab2(aes(label=name), offset=1)
    png("ggtree.png", width=1260, height=1260)
    #svg("ggtree.svg", width=1260, height=1260)
    p
    dev.off()
    #png("ggtree_and_gheatmap.png", width=1290, height=1000)
    #svg("ggtree_and_gheatmap.svg", width=17, height=15)
    #gheatmap(p, heatmapData2, width=0.1, colnames_position="top", colnames_angle=90, colnames_offset_y = 0.1, hjust=0.5, font.size=4, offset = 5) + scale_fill_manual(values=heatmap.colours) +  theme(legend.text = element_text(size = 14)) + theme(legend.title = element_text(size = 14)) + guides(fill=guide_legend(title=""), color = guide_legend(override.aes = list(size = 5)))
    png("ggtree_and_gheatmap.png", width=1290, height=1000)
    gheatmap(p, heatmapData2, width=0.5, colnames_position="top", colnames_angle=90, colnames_offset_y = 0.1, hjust=0.5, font.size=0, offset = 8) + scale_fill_manual(values=heatmap.colours) +  theme(legend.text = element_text(size = 16)) + theme(legend.title = element_text(size = 16)) + guides(fill=guide_legend(title=""), color = guide_legend(override.aes = list(size = 5)))
    dev.off()
  5. Report

    Explanation for the discrepancy between the last analysis and this one: Note that many partial genes of the phages do not match because, in the previous analysis, only the longest BLASTn hits were considered and recorded in the table.

    We have a total of 14 ST2 genomes from Luise’s project, and I found 73 ST2 genomes from NCBI.

    I examined the presence and absence of all 51 + 45 + 170 genes (details below) across the 87 genomes and visualized the results in a heatmap (attached as ggtree_and_gheatmap_phages.png).

    • MT880870: 51 putative genes
    • MT880871: 45 putative genes
    • MT880872: 170 putative genes

    In the heatmap:

    Genes not present in the corresponding isolate are shown in grey. Genes present are highlighted in the corresponding colors:

    • Dark red for MT880870
    • Dark blue for MT880871
    • Dark green for MT880872

    The graphic is based on the Excel file typing_ST2.xlsx. If you have any suggestions or improvements, please let me know.

Enhanced Visualization of Gene Presence in Luise_Sepi_STKN

ggtree_and_gheatmap_mibi_selected_genes

ggtree_and_gheatmap_mibi_phages

  1. prepare typing_ST2_until_QPB07837.csv (under DIR ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates and ENV r_env)

    cp ../presence_absence_ST2/QPB*.fasta ./
    cp ../plotTreeHeatmap/typing.csv ./typing_until_QPB07571.csv
    
    # -- makeblastdb --
    #for sample in mibi1435 mibi1436 mibi1437 mibi1438 mibi1439 mibi1440 mibi1441 mibi1442 mibi1443 mibi1444 mibi1445 mibi1446 mibi1447 mibi1448 mibi1449 mibi1450 mibi1451 mibi1452 mibi1453 mibi1454 mibi1455 mibi1456 mibi1457 mibi1458 mibi1459 mibi1460 mibi1461 mibi1462 mibi1463 mibi1464 mibi1465 mibi1466 mibi1467 mibi1468 mibi1469 mibi1471 mibi1473 mibi1474 mibi1475 mibi1476 mibi1477 mibi1478 mibi1479 mibi1480 mibi1481 mibi1482 mibi1483 mibi1484 mibi1485 mibi1486 mibi1487 mibi1488 mibi1489 mibi1490 mibi1491 mibi1492 mibi1493 mibi1494 mibi1495 mibi1496 mibi1497 mibi1498 mibi1499 mibi1500 mibi1501 mibi1502 mibi1503 mibi1504 mibi1505 mibi1506 mibi2312 mibi2313 mibi2314 mibi2315 mibi2316 mibi2317 mibi2318 mibi2319 mibi2320 mibi2321 mibi2379; do
    #    makeblastdb -in ../shovill/${sample}/contigs.fa -dbtype nucl
    #done
    
    for i in {572..622}; do
        id_1=$(printf "QPB07%03d" "$((i-1))")
        id=$(printf "QPB07%03d" "$i")
        echo "mkdir ${id}"
        echo "for sample in mibi1435 mibi1436 mibi1437 mibi1438 mibi1439 mibi1440 mibi1441 mibi1442 mibi1443 mibi1444 mibi1445 mibi1446 mibi1447 mibi1448 mibi1449 mibi1450 mibi1451 mibi1452 mibi1453 mibi1454 mibi1455 mibi1456 mibi1457 mibi1458 mibi1459 mibi1460 mibi1461 mibi1462 mibi1463 mibi1464 mibi1465 mibi1466 mibi1467 mibi1468 mibi1469 mibi1471 mibi1473 mibi1474 mibi1475 mibi1476 mibi1477 mibi1478 mibi1479 mibi1480 mibi1481 mibi1482 mibi1483 mibi1484 mibi1485 mibi1486 mibi1487 mibi1488 mibi1489 mibi1490 mibi1491 mibi1492 mibi1493 mibi1494 mibi1495 mibi1496 mibi1497 mibi1498 mibi1499 mibi1500 mibi1501 mibi1502 mibi1503 mibi1504 mibi1505 mibi1506 mibi2312 mibi2313 mibi2314 mibi2315 mibi2316 mibi2317 mibi2318 mibi2319 mibi2320 mibi2321 mibi2379; do"
            echo "blastn -db ../shovill/\${sample}/contigs.fa -query ${id}.fasta -evalue 1e-40 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 > ./${id}/\${sample}.blastn"
        echo "done"
    done
    
    for i in {572..622}; do
        id_1=$(printf "QPB07%03d" "$((i-1))")
        id=$(printf "QPB07%03d" "$i")
        echo "python ~/Scripts/process_directory.py ${id} typing_until_${id_1}.csv typing_until_${id}.csv"
    done
    
    #reprace all '+' with 'MT880870' in typing_until_QPB07622.csv
    sed -i 's/+/MT880870/g' typing_until_QPB07622.csv
  2. prepare typing_until_QPB07667.csv

    for i in {623..667}; do
        id=$(printf "QPB07%03d" "$i")
        echo "mkdir ${id}"
        echo "for sample in mibi1435 mibi1436 mibi1437 mibi1438 mibi1439 mibi1440 mibi1441 mibi1442 mibi1443 mibi1444 mibi1445 mibi1446 mibi1447 mibi1448 mibi1449 mibi1450 mibi1451 mibi1452 mibi1453 mibi1454 mibi1455 mibi1456 mibi1457 mibi1458 mibi1459 mibi1460 mibi1461 mibi1462 mibi1463 mibi1464 mibi1465 mibi1466 mibi1467 mibi1468 mibi1469 mibi1471 mibi1473 mibi1474 mibi1475 mibi1476 mibi1477 mibi1478 mibi1479 mibi1480 mibi1481 mibi1482 mibi1483 mibi1484 mibi1485 mibi1486 mibi1487 mibi1488 mibi1489 mibi1490 mibi1491 mibi1492 mibi1493 mibi1494 mibi1495 mibi1496 mibi1497 mibi1498 mibi1499 mibi1500 mibi1501 mibi1502 mibi1503 mibi1504 mibi1505 mibi1506 mibi2312 mibi2313 mibi2314 mibi2315 mibi2316 mibi2317 mibi2318 mibi2319 mibi2320 mibi2321 mibi2379; do"
            echo "blastn -db ../shovill/\${sample}/contigs.fa -query ${id}.fasta -evalue 1e-40 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 > ./${id}/\${sample}.blastn"
        echo "done"
    done
    
    for i in {623..667}; do
        id_1=$(printf "QPB07%03d" "$((i-1))")
        id=$(printf "QPB07%03d" "$i")
        echo "python ~/Scripts/process_directory.py ${id} typing_until_${id_1}.csv typing_until_${id}.csv"
    done
    
    sed -i 's/+/MT880871/g' typing_until_QPB07667.csv
  3. prepare typing_until_QPB07667.csv

    for i in {668..750}; do
        id=$(printf "QPB07%03d" "$i")
        echo "mkdir ${id}"
        echo "for sample in mibi1435 mibi1436 mibi1437 mibi1438 mibi1439 mibi1440 mibi1441 mibi1442 mibi1443 mibi1444 mibi1445 mibi1446 mibi1447 mibi1448 mibi1449 mibi1450 mibi1451 mibi1452 mibi1453 mibi1454 mibi1455 mibi1456 mibi1457 mibi1458 mibi1459 mibi1460 mibi1461 mibi1462 mibi1463 mibi1464 mibi1465 mibi1466 mibi1467 mibi1468 mibi1469 mibi1471 mibi1473 mibi1474 mibi1475 mibi1476 mibi1477 mibi1478 mibi1479 mibi1480 mibi1481 mibi1482 mibi1483 mibi1484 mibi1485 mibi1486 mibi1487 mibi1488 mibi1489 mibi1490 mibi1491 mibi1492 mibi1493 mibi1494 mibi1495 mibi1496 mibi1497 mibi1498 mibi1499 mibi1500 mibi1501 mibi1502 mibi1503 mibi1504 mibi1505 mibi1506 mibi2312 mibi2313 mibi2314 mibi2315 mibi2316 mibi2317 mibi2318 mibi2319 mibi2320 mibi2321 mibi2379; do"
            echo "blastn -db ../shovill/\${sample}/contigs.fa -query ${id}.fasta -evalue 1e-40 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 > ./${id}/\${sample}.blastn"
        echo "done"
    done
    
    for i in {751..837}; do
        id=$(printf "QPB07%03d" "$i")
        echo "mkdir ${id}"
        echo "for sample in mibi1435 mibi1436 mibi1437 mibi1438 mibi1439 mibi1440 mibi1441 mibi1442 mibi1443 mibi1444 mibi1445 mibi1446 mibi1447 mibi1448 mibi1449 mibi1450 mibi1451 mibi1452 mibi1453 mibi1454 mibi1455 mibi1456 mibi1457 mibi1458 mibi1459 mibi1460 mibi1461 mibi1462 mibi1463 mibi1464 mibi1465 mibi1466 mibi1467 mibi1468 mibi1469 mibi1471 mibi1473 mibi1474 mibi1475 mibi1476 mibi1477 mibi1478 mibi1479 mibi1480 mibi1481 mibi1482 mibi1483 mibi1484 mibi1485 mibi1486 mibi1487 mibi1488 mibi1489 mibi1490 mibi1491 mibi1492 mibi1493 mibi1494 mibi1495 mibi1496 mibi1497 mibi1498 mibi1499 mibi1500 mibi1501 mibi1502 mibi1503 mibi1504 mibi1505 mibi1506 mibi2312 mibi2313 mibi2314 mibi2315 mibi2316 mibi2317 mibi2318 mibi2319 mibi2320 mibi2321 mibi2379; do"
            echo "blastn -db ../shovill/\${sample}/contigs.fa -query ${id}.fasta -evalue 1e-40 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 > ./${id}/\${sample}.blastn"
        echo "done"
    done
    
    for i in {668..837}; do
        id_1=$(printf "QPB07%03d" "$((i-1))")
        id=$(printf "QPB07%03d" "$i")
        echo "python ~/Scripts/process_directory.py ${id} typing_until_${id_1}.csv typing_until_${id}.csv"
    done
    
    sed -i 's/+/MT880872/g' typing_until_QPB07837.csv
    
    cut -d$'\t' -f1-43 typing_until_QPB07837.csv > temp1
    cut -d$'\t' -f44- typing_until_QPB07837.csv > temp2
    
    sed -i 's/MT880870/+/g' temp1
    paste -d$'\t' temp1 temp2 > ggtree_and_gheatmap_mibi_phages.csv
  4. plot tree heatmap under /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates

    cp ../plotTreeHeatmap/990_backup.tree ./
    
    library(ggtree)
    library(ggplot2)
    library(dplyr)
    setwd("/home/jhuang/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/")
    # -- edit tree --
    #https://icytree.org/
    #0.000780
    
    # -- for the figure ggtree_and_gheatmap_mibi_phages.png --
    info <- read.csv("ggtree_and_gheatmap_mibi.csv", sep="\t")
    info$name <- info$Isolate
    tree <- read.tree("990_backup.tree")
    cols <- c("2"="cornflowerblue","5"="darkgreen","7"="seagreen3","9"="tan","14"="red",  "17"="navyblue", "23"="gold",     "35"="green","59"="orange","73"="pink","81"="purple","86"="magenta","87"="brown", "89"="darksalmon","130"="chocolate4","190"="darkkhaki", "290"="azure3", "297"="maroon","325"="lightgreen",     "454"="blue","487"="cyan", "558"="skyblue2", "766"="blueviolet")
    #cols <- c("2"='purple2', "other"='darksalmon')  #purple2  skyblue2
    heatmapData2 <- info %>% select(Isolate,  QPB07572, QPB07573, QPB07574, QPB07575, QPB07576, QPB07577, QPB07578, QPB07579, QPB07580, QPB07581, QPB07582, QPB07583, QPB07584, QPB07585, QPB07586, QPB07587, QPB07588, QPB07589, QPB07590, QPB07591, QPB07592, QPB07593, QPB07594, QPB07595, QPB07596, QPB07597, QPB07598, QPB07599, QPB07600, QPB07601, QPB07602, QPB07603, QPB07604, QPB07605, QPB07606, QPB07607, QPB07608, QPB07609, QPB07610, QPB07611, QPB07612, QPB07613, QPB07614, QPB07615, QPB07616, QPB07617, QPB07618, QPB07619, QPB07620, QPB07621, QPB07622, QPB07623, QPB07624, QPB07625, QPB07626, QPB07627, QPB07628, QPB07629, QPB07630, QPB07631, QPB07632, QPB07633, QPB07634, QPB07635, QPB07636, QPB07637, QPB07638, QPB07639, QPB07640, QPB07641, QPB07642, QPB07643, QPB07644, QPB07645, QPB07646, QPB07647, QPB07648, QPB07649, QPB07650, QPB07651, QPB07652, QPB07653, QPB07654, QPB07655, QPB07656, QPB07657, QPB07658, QPB07659, QPB07660, QPB07661, QPB07662, QPB07663, QPB07664, QPB07665, QPB07666, QPB07667, QPB07668, QPB07669, QPB07670, QPB07671, QPB07672, QPB07673, QPB07674, QPB07675, QPB07676, QPB07677, QPB07678, QPB07679, QPB07680, QPB07681, QPB07682, QPB07683, QPB07684, QPB07685, QPB07686, QPB07687, QPB07688, QPB07689, QPB07690, QPB07691, QPB07692, QPB07693, QPB07694, QPB07695, QPB07696, QPB07697, QPB07698, QPB07699, QPB07700, QPB07701, QPB07702, QPB07703, QPB07704, QPB07705, QPB07706, QPB07707, QPB07708, QPB07709, QPB07710, QPB07711, QPB07712, QPB07713, QPB07714, QPB07715, QPB07716, QPB07717, QPB07718, QPB07719, QPB07720, QPB07721, QPB07722, QPB07723, QPB07724, QPB07725, QPB07726, QPB07727, QPB07728, QPB07729, QPB07730, QPB07731, QPB07732, QPB07733, QPB07734, QPB07735, QPB07736, QPB07737, QPB07738, QPB07739, QPB07740, QPB07741, QPB07742, QPB07743, QPB07744, QPB07745, QPB07746, QPB07747, QPB07748, QPB07749, QPB07750, QPB07751, QPB07752, QPB07753, QPB07754, QPB07755, QPB07756, QPB07757, QPB07758, QPB07759, QPB07760, QPB07761, QPB07762, QPB07763, QPB07764, QPB07765, QPB07766, QPB07767, QPB07768, QPB07769, QPB07770, QPB07771, QPB07772, QPB07773, QPB07774, QPB07775, QPB07776, QPB07777, QPB07778, QPB07779, QPB07780, QPB07781, QPB07782, QPB07783, QPB07784, QPB07785, QPB07786, QPB07787, QPB07788, QPB07789, QPB07790, QPB07791, QPB07792, QPB07793, QPB07794, QPB07795, QPB07796, QPB07797, QPB07798, QPB07799, QPB07800, QPB07801, QPB07802, QPB07803, QPB07804, QPB07805, QPB07806, QPB07807, QPB07808, QPB07809, QPB07810, QPB07811, QPB07812, QPB07813, QPB07814, QPB07815, QPB07816, QPB07817, QPB07818, QPB07819, QPB07820, QPB07821, QPB07822, QPB07823, QPB07824, QPB07825, QPB07826, QPB07827, QPB07828, QPB07829, QPB07830, QPB07831, QPB07832, QPB07833, QPB07834, QPB07835, QPB07836, QPB07837)  #ST,
    rn <- heatmapData2$Isolate
    heatmapData2$Isolate <- NULL
    heatmapData2 <- as.data.frame(sapply(heatmapData2, as.character))
    rownames(heatmapData2) <- rn
    
    heatmap.colours <- c("darkred", "darkblue", "darkgreen", "grey")
    names(heatmap.colours) <- c("MT880870","MT880871","MT880872","-")
    
    #heatmap.colours <- c("cornflowerblue","darkgreen","seagreen3","tan","red",  "navyblue", "gold",     "green","orange","pink","purple","magenta","brown", "darksalmon","chocolate4","darkkhaki", "azure3", "maroon","lightgreen",     "blue","cyan", "skyblue2", "blueviolet",       "darkred", "darkblue", "darkgreen", "grey")
    #names(heatmap.colours) <- c("2","5","7","9","14", "17","23",   "35","59","73", "81","86","87","89","130","190","290", "297","325",    "454","487","558","766",       "MT880870","MT880871","MT880872","-")
    
    #mydat$Regulation <- factor(mydat$Regulation, levels=c("up","down"))
    #circular
    
    p <- ggtree(tree, layout='circular', branch.length='none') %<+% info + geom_tippoint(aes(color=ST)) + scale_color_manual(values=cols) + geom_tiplab2(aes(label=name), offset=1)
    png("ggtree.png", width=1260, height=1260)
    #svg("ggtree.svg", width=1260, height=1260)
    p
    dev.off()
    
    #gheatmap(p, heatmapData2, width=0.1, colnames_position="top", colnames_angle=90, colnames_offset_y = 0.1, hjust=0.5, font.size=4, offset = 5) + scale_fill_manual(values=heatmap.colours) +  theme(legend.text = element_text(size = 14)) + theme(legend.title = element_text(size = 14)) + guides(fill=guide_legend(title=""), color = guide_legend(override.aes = list(size = 5)))
    
    png("ggtree_and_gheatmap_mibi_phages.png", width=1290, height=1000)
    #svg("ggtree_and_gheatmap_mibi_phages.svg", width=17, height=15)
    gheatmap(p, heatmapData2, width=0.5, colnames_position="top", colnames_angle=90, colnames_offset_y = 0.1, hjust=0.5, font.size=0, offset = 6) + scale_fill_manual(values=heatmap.colours) +  theme(legend.text = element_text(size = 16)) + theme(legend.title = element_text(size = 16)) + guides(fill=guide_legend(title=""), color = guide_legend(override.aes = list(size = 5)))
    dev.off()
    
    # -- for the figure ggtree_and_gheatmap_mibi_selected_genes.png --
    
    info <- read.csv("ggtree_and_gheatmap_mibi.csv", sep="\t")
    info$name <- info$Isolate
    tree <- read.tree("990_backup.tree")
    cols <- c("2"="cornflowerblue","5"="darkgreen","7"="seagreen3","9"="tan","14"="red",  "17"="navyblue", "23"="gold",     "35"="green","59"="orange","73"="pink","81"="purple","86"="magenta","87"="brown", "89"="darksalmon","130"="chocolate4","190"="darkkhaki", "290"="azure3", "297"="maroon","325"="lightgreen",     "454"="blue","487"="cyan", "558"="skyblue2", "766"="blueviolet")
    
    heatmapData2 <- info %>% select(Isolate,       SCCmec,  agr.typing,      gyrB,    fumC,    gltA,    icd,     apsS,    sigB,    sarA,    agrC,    yycG,    psm.β,   psm.δ,   hlb,     atlE,    atl,     sdrG,    sdrH,    ebh,     ebpS,    tagB,    capC,    sepA,    dltA,    fmtC,    lipA,    sceD,    SE0760,  esp,     ecpA)  #ST,
    rn <- heatmapData2$Isolate
    heatmapData2$Isolate <- NULL
    heatmapData2 <- as.data.frame(sapply(heatmapData2, as.character))
    rownames(heatmapData2) <- rn
    
    #heatmap.colours <- c("darkred", "darkblue", "darkgreen", "grey")
    #names(heatmap.colours) <- c("MT880870","MT880871","MT880872","-")
    
    #heatmap.colours <- c("cornflowerblue","darkgreen","seagreen3","tan","red",  "navyblue", "gold",     "green","orange","pink","purple","magenta","brown", "darksalmon","chocolate4","darkkhaki", "azure3", "maroon","lightgreen",     "blue","cyan", "skyblue2", "blueviolet",       "darkred", "darkblue", "darkgreen", "grey")
    #names(heatmap.colours) <- c("2","5","7","9","14", "17","23",   "35","59","73", "81","86","87","89","130","190","290", "297","325",    "454","487","558","766",       "MT880870","MT880871","MT880872","-")
    
    heatmap.colours <- c("cornflowerblue","darkgreen","seagreen3","tan","red",  "navyblue", "purple",     "green","cyan",       "darkred", "darkblue", "darkgreen", "grey",  "darkgreen", "grey")
    names(heatmap.colours) <- c("SCCmec_type_II(2A)", "SCCmec_type_III(3A)", "SCCmec_type_III(3A) and SCCmec_type_VIII(4A)", "SCCmec_type_IV(2B)", "SCCmec_type_IV(2B&5)", "SCCmec_type_IV(2B) and SCCmec_type_VI(4B)", "SCCmec_type_IVa(2B)", "SCCmec_type_IVb(2B)", "SCCmec_type_IVg(2B)",  "I", "II", "III", "none", "+","-")
    
    #mydat$Regulation <- factor(mydat$Regulation, levels=c("up","down"))
    #circular
    
    p <- ggtree(tree, layout='circular', branch.length='none') %<+% info + geom_tippoint(aes(color=ST)) + scale_color_manual(values=cols) + geom_tiplab2(aes(label=name), offset=1)
    png("ggtree.png", width=1260, height=1260)
    #svg("ggtree.svg", width=1260, height=1260)
    p
    dev.off()
    
    #gheatmap(p, heatmapData2, width=0.1, colnames_position="top", colnames_angle=90, colnames_offset_y = 0.1, hjust=0.5, font.size=4, offset = 5) + scale_fill_manual(values=heatmap.colours) +  theme(legend.text = element_text(size = 14)) + theme(legend.title = element_text(size = 14)) + guides(fill=guide_legend(title=""), color = guide_legend(override.aes = list(size = 5)))
    
    png("ggtree_and_gheatmap_mibi_selected_genes.png", width=1590, height=1300)
    #svg("ggtree_and_gheatmap_mibi_selected_genes.svg", width=17, height=15)
    gheatmap(p, heatmapData2, width=2, colnames_position="top", colnames_angle=90, colnames_offset_y = 2.0, hjust=0.7, font.size=4, offset = 8) + scale_fill_manual(values=heatmap.colours) +  theme(legend.text = element_text(size = 16)) + theme(legend.title = element_text(size = 16)) + guides(fill=guide_legend(title=""), color = guide_legend(override.aes = list(size = 5)))
    dev.off()
  5. Report

    I’ve attached a figure (ggtree_and_gheatmap_mibi_selected_genes.png) that illustrates the presence and absence of the selected genes. This is a visual representation of the table I sent to you earlier. The presence or absence of each gene in the corresponding genomes was determined using a BLASTn comparison between the genome and the gene sequences.
    
    Additionally, I’ve updated the figure ggtree_and_gheatmap_mibi_phages.png. In this new version, all ST types are represented in distinct colors. The raw data for both figures can be found in the attached Excel file (ggtree_and_gheatmap_mibi.xlsx).

Enhanced Visualization of Gene Presence for the Selected Genes in Bongarts_S.epidermidis_HDRNA

ggtree_and_gheatmap_selected_genes

  1. preparing gene seqences

    (Optional online search) https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastn&PAGE_TYPE=BlastSearch&BLAST_SPEC=MicrobialGenomes
    #Title:Refseq prokaryote representative genomes (contains refseq assembly)
    #Molecule Type:mixed DNA
    #Update date:2024/10/16
    #Number of sequences:1038672
    
    #cat gyrB_revcomp.fasta fumC_revcomp.fasta gltA.fasta icd_revcomp.fasta apsS.fasta sigB_revcomp.fasta sarA_revcomp.fasta ...
    
    # SCCmec,agr.typing,
    # gyrB(House keeper),
    # fumC,gltA,icd(Metabolic genes),
    # apsS,sigB,sarA,agrC,yycG(Virulence regulators),
    # psm.β1,hlb(Toxins),
    # atlE,sdrG,sdrH,ebh,ebp,tagB(Biofilm formation),
    # capC,sepA,dltA,fmtC,lipA,sceD,SE0760(Immune evasion & colonization),
    # esp,ecpA(Serine protease)
    
    gene            complement(1547773..1549233)
                    /gene="ebp"
                    /locus_tag="SAKG22_13940"
    
    CDS             complement(1547773..1549233)
                    /gene="ebp"
            samtools faidx AP019545.1.fasta
            samtools faidx AP019545.1.fasta "gi|1602080981|dbj|AP019545.1|":1547773-1549233 > ebp.fasta
            revcomp ebp.fasta > ebp_revcomp.fasta
    
            cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/gyrB_revcomp.fasta gyrB.fasta
    
            cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/fumC_revcomp.fasta fumC.fasta
            cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/gltA.fasta .
            cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/icd_revcomp.fasta icd.fasta
    
            cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/apsS.fasta .
            cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/sigB_revcomp.fasta sigB.fasta
            cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/sarA_revcomp.fasta sarA.fasta
            cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/agrC.fasta .
            cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/yycG.fasta .
    
            cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/psm-beta1.fasta psm.β1.fasta
            cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/hlb.fasta .
    
            cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/atlE.fasta .
            cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/sdrG.fasta .
            cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/sdrH_revcomp.fasta sdrH.fasta
            cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/ebh_revcomp.fasta ebh.fasta
            cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/ebp_revcomp.fasta ebp.fasta
            cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/tagB.fasta .
    
            cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/capC.fasta .
            cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/sepA.fasta .
            cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/dltA.fasta .
            cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/fmtC.fasta .
            cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/lipA.fasta .
            cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/sceD.fasta .
            cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/SE0760.fasta .
    
            cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/esp.fasta .
            cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/ecpA.fasta .
    
            #cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/MT880870.fasta .
            #cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/MT880871.fasta .
            #cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/MT880872.fasta .
    
    >gi|1602080981|dbj|AP019545.1|:1547773-1549233
    >gi|34980227|gb|AY322150.1|:3091-8793
    
    ./gyrB_revcomp.fasta
    
    ./agrC.fasta
    ./CP133692.fasta
    ./sceD_mibi1435_revcomp.fasta
    ./CP133682.fasta
    ./CP133697.fasta
    ./sdrG.fasta
    ./sarA_revcomp.fasta
    ./CP133679.fasta
    ./CP133686.fasta
    ./sepA.fasta
    ./icd.fasta
    ./CP133691.fasta
    ./CP133698.fasta
    ./yycG.fasta
    ./CP133701.fasta
    
    ./sdrH_revcomp.fasta
    ./SE0760.fasta
    ./CP133681.fasta
    ./CP133677.fasta
    ./yycG_revcomp.fasta
    ./CP000029.fasta
    ./atlE_old.fasta
    ./ebpS.fasta
    ./ebpS_revcomp.fasta
    ./atlE_revcomp.fasta
    ./sceDAE.fasta
    ./hlb.fasta
    >gi|441494790|gb|KC242859.1|:16-1008
    ATGGTGAAAAAAACAAAATCCAATTCACTAAAAAAAGTTGCAACACTTGCATTAGCAAAT
    TTATTATTAGTTGGTGCACTTACTGACAATAGTGCCAAAGCCGAATCTAAGAAAGATGAT
    ACTGATTTGAAGTTAGTTAGTCATAACGTTTATATGTTATCGACCGTTTTGTATCCAAAC
    TGGGGGCAATATAAACGCGCTGATTTAATCGGGCAATCTTCTTATATTAAAAATAATGAT
    GTCGTAATATTCAATGAAGCATTTGATAATGGCGCATCAGATAAGCTATTAAGTAATGTA
    AAAAAAGAATATCCTTACCAAACACCTGTACTCGGTCGTTCTCAATCAGGGTGGGACAAA
    ACTGAAGGTAGCTACTCATCAACTGTTGCTGAAGATGGTGGCGTAGCGATTGTAAGTAAA
    TATCCTATTAAAGAAAAAATCCAGCATGTTTTCAAAAGCGGTTGTGGATTCGACAATGAT
    AGCAACAAAGGCTTTGTTTATACAAAAATAGAGAAAAATGGGAAGAACGTTCATGTTATC
    GGTACACATACACAATCTGAAGATTCACGTTGTGGTGCTGGACATGATCGAAAAATTAGA
    GCTGAACAAATGAAAGAAATCAGTGATTTTGTTAAAAAGAAAAATATCCCTAAAGATGAA
    ACGGTATATATAGGTGGCGACCTTAATGTAAATAAAGGTACTCCAGAGTTCAAAGATATG
    CTTAAAAACTTGAATGTAAATGATGTTCTATATGCAGGTCATAATAGCACATGGGACCCT
    CAATCAAATTCAATTGCGAAATATAATTATCCTAATGGTAAACCAGAACATTTAGACTAT
    ATATTTACAGATAAAGATCATAAACAACCAAAACAATTAGTCAATGAAGTTGTGACTGAA
    AAACCTAAGCCATGGGATGTATATGCGTTCCCATATTACTACGTTTACAATGATTTTTCA
    GATCATTACCCAATCAAAGCCTATAGTAAATAA
    >hlb_
    >gi|441494790|gb|KC242859.1| Staphylococcus aureus strain 226 beta-hemolysin (hlb) gene, complete cds
    AAAGGAGTGATAATG  ATGGTGAAAAAAACAAAATCCAATTCACTAAAAAAAGTTGCAACACTTGCATTAGCAAATTTATTATTAGTTGGTGCACTTACTGACAATAGTGCCAAAGCCGAATCTAAGAAAGATGATACTGA
    TTTGAAGTTAGTTAGTCATAACGTTTATATGTTATCGACCGTTTTGTATCCAAACTGGGGGCAATATAAA
    CGCGCTGATTTAATCGGGCAATCTTCTTATATTAAAAATAATGATGTCGTAATATTCAATGAAGCATTTG
    ATAATGGCGCATCAGATAAGCTATTAAGTAATGTAAAAAAAGAATATCCTTACCAAACACCTGTACTCGG
    TCGTTCTCAATCAGGGTGGGACAAAACTGAAGGTAGCTACTCATCAACTGTTGCTGAAGATGGTGGCGTA
    GCGATTGTAAGTAAATATCCTATTAAAGAAAAAATCCAGCATGTTTTCAAAAGCGGTTGTGGATTCGACA
    ATGATAGCAACAAAGGCTTTGTTTATACAAAAATAGAGAAAAATGGGAAGAACGTTCATGTTATCGGTAC
    ACATACACAATCTGAAGATTCACGTTGTGGTGCTGGACATGATCGAAAAATTAGAGCTGAACAAATGAAA
    GAAATCAGTGATTTTGTTAAAAAGAAAAATATCCCTAAAGATGAAACGGTATATATAGGTGGCGACCTTA
    ATGTAAATAAAGGTACTCCAGAGTTCAAAGATATGCTTAAAAACTTGAATGTAAATGATGTTCTATATGC
    AGGTCATAATAGCACATGGGACCCTCAATCAAATTCAATTGCGAAATATAATTATCCTAATGGTAAACCA
    GAACATTTAGACTATATATTTACAGATAAAGATCATAAACAACCAAAACAATTAGTCAATGAAGTTGTGA
    CTGAAAAACCTAAGCCATGGGATGTATATGCGTTCCCATATTACTACGTTTACAATGATTTTTCAGATCA
    TTACCCAATCAAAGCCTATAGTAAATAA
    
    ./dltA.fasta
    ./atlE.fasta
    ./gyrB.fasta
    ./ORF123_sepA_ORF5.fasta
    ./ebh.fasta
    ./CP133696.fasta
    ./yycG_old.fasta
    ./CP133700.fasta
    ./CP133695.fasta
    ./ecpA.fasta
    ./capBCA_ywtC.fasta
    ./CP133690.fasta
    ./gltA.fasta
    ./esp.fasta
    ./sigB.fasta
    ./CP133699.fasta
    ./CP133685.fasta
    ./CP133680.fasta
    ./MT880872.fasta
    ./icd_revcomp.fasta
    ./agrC_old.fasta
    ./ecpA_.fasta
    ./agrC_revcomp.fasta
    ./MT880870.fasta
    ./CP133687.fasta
    ./CP133693.fasta
    ./psm-beta1.fasta
    >gi|380448412|gb|JQ066320.1| Staphylococcus aureus strain JP1 Psm beta1 (psm beta1) and Psm beta2 (psm beta2) genes, complete cds
    ACACGTTTAACAACACAAGAATTATTATATCTAAATGAACTAAAATTAGCAATACCTTGTAAATAAAAAA
    TGTTTATATTTTTCACTATTATAGAGCTATTTATCTAAAAAGGTTCAATAAGACTTAAATACGAATTCAG
    GCAACTTAATTGTGTTAAATACAGTTTTGAATGCCTAACTGTATTTCTTTTCTCTTTAAAATACAGTTAA
    GTACATTATAAGATGTTGTGCGGATAAACAAACTAATTGCATCAAATTTATTTTAAAATAACAACAACAA
    AACGTTAAGAGAATAACATTTCGGTGATTTAAAAGCTACGCACGTTTTTGTTATCTTCAAATTTAAATTT
    TAAGGAGTGTTTTCA
    ATGGAAGGTTTATTTAACGCAATTAAAGATACCGTAACTGCAGCAATTAATAATGATGGC
    GCAAAATTAGGCACAAGCATTGTGAGCATCGTTGAAAATGGCGTAGGTTTATTAGGTAAA
    TTATTCGGATTCTAA
    TTTCAATATGTTATGTAAGTAATCAGTATTATTTCAAAGGTGAGGGAGAGATTTAAATGA
    CTGGACTAGCAGAAGCAATCGCAAATACTGTGCAAGCTGCACAACAACATGATAGTGTGAAATTAGGCAC
    AAGTATCGTAGACATCGTTGCTAACGGTGTGGGTTTACTAGGTAAATTATTTGGATTCTAATATAATAAC
    TAATATTCTTTAAAATAAACTGGGTGAGCATACTTTAATGTTATGCACTCAGTTTATTTTATTTGCAGAA
    ATTTGAGCCTCTGTTAAGATTTAGATACATAGACAATATAGGAGATGGGGAAATTGGGATATAAAAATAT
    TTTGATAGACTTTGATGATACAATTGTTGATTTTTATGATGCAGA
    
    ATGGAAGGTTTATTTAACGCAATTAAAGATACCGTAACTGCAGCAATTAATAATGATGGC
    GCAAAATTAGGCACAAGCATTGTGAGCATCGTTGAAAATGGCGTAGGTTTATTAGGTAAA
    TTATTCGGATTCTAA
    ./psm-beta.fasta
    ./lipA.fasta
    ./capC.fasta
    ./fumC_revcomp.fasta
    ./CP133689.fasta
    ./sepA_mibi1435.fasta
    ./sigB_revcomp.fasta
    ./sceD.fasta
    ./fumC.fasta
    ./Enterococcus_faecium_isolate_E300_pathogenicity_island.fasta
    ./agrABCD_hld.fasta
    ./CP133688.fasta
    ./ecpA_mibi1435_revcomp.fasta
    ./hlb_.fasta
    ./CP133684.fasta
    ./tagB_mibi1435_revcomp.fasta
    ./CP133678.fasta
    ./tagB.fasta
    ./MT880871.fasta
    ./CP133676.fasta
    ./sdrH.fasta
    ./Staphylococcus_epidermidis_RP62A.fasta
    ./ebh_revcomp.fasta
    ./CP133683.fasta
    ./apsS.fasta
    ./Staphylococcus_aureus_MRSA252.fasta
    ./fmtC.fasta
    ./CP133694.fasta
    ./sarA.fasta
  2. prepare the long fasta if we use only the long sequencing

    ln -s ../Data_Holger_S.epidermidis_long/finished_assemblies/HDRNA_01_K01_conservative.fasta HDRNA_01_K01.fasta
    ln -s ../Data_Holger_S.epidermidis_long/finished_assemblies/HDRNA_03_K01_bold_bandage.fasta HDRNA_03_K01.fasta
    ln -s ../Data_Holger_S.epidermidis_long/finished_assemblies/HDRNA_06_K01_conservative.fasta HDRNA_06_K01.fasta
    ln -s ../Data_Holger_S.epidermidis_long/finished_assemblies/HDRNA_07_K01_conservative.fasta HDRNA_07_K01.fasta
    ln -s ../Data_Holger_S.epidermidis_long/finished_assemblies/HDRNA_08_K01_conservative.fasta HDRNA_08_K01.fasta
    ln -s ../Data_Holger_S.epidermidis_long/finished_assemblies/HDRNA_12_K01_bold.fasta HDRNA_12_K01.fasta
    ln -s ../Data_Holger_S.epidermidis_long/finished_assemblies/HDRNA_16_K01_conservative.fasta HDRNA_16_K01.fasta
    ln -s ../Data_Holger_S.epidermidis_long/finished_assemblies/HDRNA_17_K01_conservative.fasta HDRNA_17_K01.fasta
    ln -s ../Data_Holger_S.epidermidis_long/finished_assemblies/HDRNA_19_K01_bold.fasta HDRNA_19_K01.fasta
    ln -s ../Data_Holger_S.epidermidis_long/finished_assemblies/HDRNA_20_K01_conservative.fasta HDRNA_20_K01.fasta
  3. perform blastn searching for long-sequencing

    # -- makeblastdb --
    for sample in HDRNA_01_K01 HDRNA_03_K01 HDRNA_06_K01 HDRNA_07_K01 HDRNA_08_K01 HDRNA_12_K01 HDRNA_16_K01 HDRNA_17_K01 HDRNA_19_K01 HDRNA_20_K01; do
            makeblastdb -in ../assembly/${sample}.fasta -dbtype nucl
    done
    
    for id in gyrB    fumC gltA icd    apsS sigB sarA agrC yycG    psm.β1 hlb    atlE sdrG sdrH ebh ebp tagB    capC sepA dltA fmtC lipA sceD SE0760    esp ecpA; do
    echo "mkdir ${id}"
    echo "for sample in in HDRNA_01_K01 HDRNA_03_K01 HDRNA_06_K01 HDRNA_07_K01 HDRNA_08_K01 HDRNA_12_K01 HDRNA_16_K01 HDRNA_17_K01 HDRNA_19_K01 HDRNA_20_K01; do"
            echo "blastn -db ../assembly/${sample}.fasta -query ${id}.fasta -evalue 1e-40 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 > ./${id}/\${sample}.blastn"
    echo "done"
    done
    
    #gyrB    fumC gltA icd    apsS sigB sarA agrC yycG    psm.β1 hlb    atlE sdrG sdrH ebh ebp tagB    capC sepA dltA fmtC lipA sceD SE0760    esp ecpA
    python ~/Scripts/process_directory.py gyrB typing.csv typing_until_gyrB.csv
    python ~/Scripts/process_directory.py fumC typing_until_gyrB.csv typing_until_fumC.csv
    python ~/Scripts/process_directory.py gltA typing_until_fumC.csv typing_until_gltA.csv
    python ~/Scripts/process_directory.py icd typing_until_gltA.csv typing_until_icd.csv
    
    python ~/Scripts/process_directory.py apsS typing_until_icd.csv typing_until_apsS.csv
    python ~/Scripts/process_directory.py sigB typing_until_apsS.csv typing_until_sigB.csv
    python ~/Scripts/process_directory.py sarA typing_until_sigB.csv typing_until_sarA.csv
    python ~/Scripts/process_directory.py agrC typing_until_sarA.csv typing_until_agrC.csv
    python ~/Scripts/process_directory.py yycG typing_until_agrC.csv typing_until_yycG.csv
    python ~/Scripts/process_directory.py psm.β1 typing_until_yycG.csv typing_until_psm.β1.csv
    python ~/Scripts/process_directory.py hlb typing_until_psm.β1.csv typing_until_hlb.csv
    python ~/Scripts/process_directory.py atlE typing_until_hlb.csv typing_until_atlE.csv
    python ~/Scripts/process_directory.py sdrG typing_until_atlE.csv typing_until_sdrG.csv
    python ~/Scripts/process_directory.py sdrH typing_until_sdrG.csv typing_until_sdrH.csv
    
    python ~/Scripts/process_directory.py ebh typing_until_sdrH.csv typing_until_ebh.csv
    python ~/Scripts/process_directory.py ebp typing_until_ebh.csv typing_until_ebp.csv
    python ~/Scripts/process_directory.py tagB typing_until_ebp.csv typing_until_tagB.csv
    python ~/Scripts/process_directory.py capC typing_until_tagB.csv typing_until_capC.csv
    python ~/Scripts/process_directory.py sepA typing_until_capC.csv typing_until_sepA.csv
    python ~/Scripts/process_directory.py dltA typing_until_sepA.csv typing_until_dltA.csv
    python ~/Scripts/process_directory.py fmtC typing_until_dltA.csv typing_until_fmtC.csv
    python ~/Scripts/process_directory.py lipA typing_until_fmtC.csv typing_until_lipA.csv
    
    python ~/Scripts/process_directory.py sceD typing_until_lipA.csv typing_until_sceD.csv
    python ~/Scripts/process_directory.py SE0760 typing_until_sceD.csv typing_until_SE0760.csv
    python ~/Scripts/process_directory.py esp typing_until_SE0760.csv typing_until_esp.csv
    python ~/Scripts/process_directory.py ecpA typing_until_esp.csv typing_until_ecpA.csv
  4. perform blastn searching for short-sequencing

    # -- makeblastdb --
    for sample in HDRNA_01_K01 HDRNA_01_K02 HDRNA_01_K03 HDRNA_01_K04 HDRNA_01_K05 HDRNA_01_K06 HDRNA_01_K07 HDRNA_01_K08 HDRNA_01_K09 HDRNA_01_K10 HDRNA_02_K01 HDRNA_03_K01 HDRNA_03_K02 HDRNA_03_K03 HDRNA_03_K04 HDRNA_03_K05 HDRNA_03_K06 HDRNA_03_K07 HDRNA_03_K08 HDRNA_03_K09 HDRNA_03_K10 HDRNA_04_K01 HDRNA_05_K01 HDRNA_06_K01 HDRNA_06_K02 HDRNA_06_K03 HDRNA_06_K04 HDRNA_06_K05 HDRNA_06_K06 HDRNA_06_K07 HDRNA_06_K08 HDRNA_06_K09 HDRNA_06_K10 HDRNA_07_K01 HDRNA_07_K02 HDRNA_07_K03 HDRNA_07_K04 HDRNA_07_K05 HDRNA_07_K06 HDRNA_07_K07 HDRNA_07_K08 HDRNA_07_K09 HDRNA_07_K10 HDRNA_08_K01 HDRNA_08_K02 HDRNA_08_K03 HDRNA_08_K04 HDRNA_08_K05 HDRNA_08_K06 HDRNA_08_K07 HDRNA_08_K08 HDRNA_08_K09 HDRNA_08_K10 HDRNA_10_K01 HDRNA_11_K01 HDRNA_12_K01 HDRNA_12_K02 HDRNA_12_K03 HDRNA_12_K04 HDRNA_12_K05 HDRNA_12_K06 HDRNA_12_K07 HDRNA_12_K08 HDRNA_12_K09 HDRNA_12_K10 HDRNA_16_K01 HDRNA_16_K02 HDRNA_16_K03 HDRNA_16_K04 HDRNA_16_K05 HDRNA_16_K06 HDRNA_16_K07 HDRNA_16_K08 HDRNA_16_K09 HDRNA_16_K10 HDRNA_17_K01 HDRNA_17_K02 HDRNA_17_K03 HDRNA_17_K04 HDRNA_17_K05 HDRNA_17_K06 HDRNA_17_K07 HDRNA_17_K08 HDRNA_17_K09 HDRNA_17_K10 HDRNA_19_K01 HDRNA_19_K02 HDRNA_19_K03 HDRNA_19_K04 HDRNA_19_K05 HDRNA_19_K06 HDRNA_19_K07 HDRNA_19_K08 HDRNA_19_K09 HDRNA_19_K10 HDRNA_20_K01 HDRNA_20_K02 HDRNA_20_K03 HDRNA_20_K04 HDRNA_20_K05 HDRNA_20_K06 HDRNA_20_K07 HDRNA_20_K08 HDRNA_20_K09; do
            makeblastdb -in ../Data_Holger_S.epidermidis_short/shovill/${sample}/contigs.fa -dbtype nucl
    done
    
    #Note: The current -evalue is set to 1e-1; it might need to be made more stringent.
    for id in gyrB    fumC gltA icd    apsS sigB sarA agrC yycG    psm.β1 hlb    atlE sdrG sdrH ebh ebp tagB    capC sepA dltA fmtC lipA sceD SE0760    esp ecpA; do
    echo "mkdir ${id}"
    echo "for sample in HDRNA_01_K01 HDRNA_01_K02 HDRNA_01_K03 HDRNA_01_K04 HDRNA_01_K05 HDRNA_01_K06 HDRNA_01_K07 HDRNA_01_K08 HDRNA_01_K09 HDRNA_01_K10 HDRNA_02_K01 HDRNA_03_K01 HDRNA_03_K02 HDRNA_03_K03 HDRNA_03_K04 HDRNA_03_K05 HDRNA_03_K06 HDRNA_03_K07 HDRNA_03_K08 HDRNA_03_K09 HDRNA_03_K10 HDRNA_04_K01 HDRNA_05_K01 HDRNA_06_K01 HDRNA_06_K02 HDRNA_06_K03 HDRNA_06_K04 HDRNA_06_K05 HDRNA_06_K06 HDRNA_06_K07 HDRNA_06_K08 HDRNA_06_K09 HDRNA_06_K10 HDRNA_07_K01 HDRNA_07_K02 HDRNA_07_K03 HDRNA_07_K04 HDRNA_07_K05 HDRNA_07_K06 HDRNA_07_K07 HDRNA_07_K08 HDRNA_07_K09 HDRNA_07_K10 HDRNA_08_K01 HDRNA_08_K02 HDRNA_08_K03 HDRNA_08_K04 HDRNA_08_K05 HDRNA_08_K06 HDRNA_08_K07 HDRNA_08_K08 HDRNA_08_K09 HDRNA_08_K10 HDRNA_10_K01 HDRNA_11_K01 HDRNA_12_K01 HDRNA_12_K02 HDRNA_12_K03 HDRNA_12_K04 HDRNA_12_K05 HDRNA_12_K06 HDRNA_12_K07 HDRNA_12_K08 HDRNA_12_K09 HDRNA_12_K10 HDRNA_16_K01 HDRNA_16_K02 HDRNA_16_K03 HDRNA_16_K04 HDRNA_16_K05 HDRNA_16_K06 HDRNA_16_K07 HDRNA_16_K08 HDRNA_16_K09 HDRNA_16_K10 HDRNA_17_K01 HDRNA_17_K02 HDRNA_17_K03 HDRNA_17_K04 HDRNA_17_K05 HDRNA_17_K06 HDRNA_17_K07 HDRNA_17_K08 HDRNA_17_K09 HDRNA_17_K10 HDRNA_19_K01 HDRNA_19_K02 HDRNA_19_K03 HDRNA_19_K04 HDRNA_19_K05 HDRNA_19_K06 HDRNA_19_K07 HDRNA_19_K08 HDRNA_19_K09 HDRNA_19_K10 HDRNA_20_K01 HDRNA_20_K02 HDRNA_20_K03 HDRNA_20_K04 HDRNA_20_K05 HDRNA_20_K06 HDRNA_20_K07 HDRNA_20_K08 HDRNA_20_K09; do"
            echo "blastn -db ../Data_Holger_S.epidermidis_short/shovill/\${sample}/contigs.fa -query ${id}.fasta -evalue 1e-1 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 > ./${id}/\${sample}.blastn"
    echo "done"
    done
    
    #gyrB    fumC gltA icd    apsS sigB sarA agrC yycG    psm.β1 hlb    atlE sdrG sdrH ebh ebp tagB    capC sepA dltA fmtC lipA sceD SE0760    esp ecpA
    python ~/Scripts/process_directory.py gyrB typing_104.csv typing_until_gyrB.csv
    python ~/Scripts/process_directory.py fumC typing_until_gyrB.csv typing_until_fumC.csv
    python ~/Scripts/process_directory.py gltA typing_until_fumC.csv typing_until_gltA.csv
    python ~/Scripts/process_directory.py icd typing_until_gltA.csv typing_until_icd.csv
    
    python ~/Scripts/process_directory.py apsS typing_until_icd.csv typing_until_apsS.csv
    python ~/Scripts/process_directory.py sigB typing_until_apsS.csv typing_until_sigB.csv
    python ~/Scripts/process_directory.py sarA typing_until_sigB.csv typing_until_sarA.csv
    python ~/Scripts/process_directory.py agrC typing_until_sarA.csv typing_until_agrC.csv
    python ~/Scripts/process_directory.py yycG typing_until_agrC.csv typing_until_yycG.csv
    python ~/Scripts/process_directory.py psm.β1 typing_until_yycG.csv typing_until_psm.β1.csv
    python ~/Scripts/process_directory.py hlb typing_until_psm.β1.csv typing_until_hlb.csv
    python ~/Scripts/process_directory.py atlE typing_until_hlb.csv typing_until_atlE.csv
    python ~/Scripts/process_directory.py sdrG typing_until_atlE.csv typing_until_sdrG.csv
    python ~/Scripts/process_directory.py sdrH typing_until_sdrG.csv typing_until_sdrH.csv
    
    python ~/Scripts/process_directory.py ebh typing_until_sdrH.csv typing_until_ebh.csv
    python ~/Scripts/process_directory.py ebp typing_until_ebh.csv typing_until_ebp.csv
    python ~/Scripts/process_directory.py tagB typing_until_ebp.csv typing_until_tagB.csv
    python ~/Scripts/process_directory.py capC typing_until_tagB.csv typing_until_capC.csv
    python ~/Scripts/process_directory.py sepA typing_until_capC.csv typing_until_sepA.csv
    python ~/Scripts/process_directory.py dltA typing_until_sepA.csv typing_until_dltA.csv
    python ~/Scripts/process_directory.py fmtC typing_until_dltA.csv typing_until_fmtC.csv
    python ~/Scripts/process_directory.py lipA typing_until_fmtC.csv typing_until_lipA.csv
    
    python ~/Scripts/process_directory.py sceD typing_until_lipA.csv typing_until_sceD.csv
    python ~/Scripts/process_directory.py SE0760 typing_until_sceD.csv typing_until_SE0760.csv
    python ~/Scripts/process_directory.py esp typing_until_SE0760.csv typing_until_esp.csv
    python ~/Scripts/process_directory.py ecpA typing_until_esp.csv typing_until_ecpA.csv
  5. plotTreeHeatmap

    ~/DATA/Data_PaulBongarts_S.epidermidis_HDRNA/plotTreeHeatmap
    
    #FastTree -gtr -nt variants/snippy.core_without_reference.aln > plotTreeHeatmap/snippy.core.tree
    
    #cp ../Data_Holger_S.epidermidis_short/plotTreeHeatmap/typing_104.csv .
    #cp ../Data_Holger_S.epidermidis_short/results_HDRNA_01-20/variants/snippy.core.aln_104.tree .
    #Run step 4
    
    library(ggtree)
    library(ggplot2)
    library(dplyr)
    setwd("/home/jhuang/DATA/Data_PaulBongarts_S.epidermidis_HDRNA/plotTreeHeatmap_short/")
    # -- edit tree --
    info <- read.csv("typing_until_ecpA.csv", sep="\t")
    
    # Convert 'ST' column to factor with levels in the desired order
    info$ST <- factor(info$ST, levels = c("5", "23", "35", "69", "86", "87", "130", "224", "487", "640", "none"))
    
    info$name <- info$Isolate
    tree <- read.tree("snippy.core.aln_104.tree")
    cols <- c("5"="darkgreen", "23"="cornflowerblue", "35"="green","69"="orange","86"="magenta","87"="brown", "130"="purple","224"="darkkhaki", "487"="cyan", "640"="pink","none"="grey")
    #"2"="",
    #"7"="seagreen3",
    #"9"="tan","14"="red",  "17"="navyblue",
    #"73"="pink","81"="purple",
    #"89"="darksalmon",
    #"454"="blue","487"="cyan", "558"="skyblue2", "766"="blueviolet"
    #"5", "23", "35", "69", "86", "87", "130", "224", "487", "640", "none"
    
    heatmapData2 <- info %>% select(Isolate,          gyrB,    fumC, gltA, icd,    apsS, sigB, sarA, agrC, yycG,    psm.β1, hlb,    atlE, sdrG, sdrH, ebh, ebp, tagB,    capC, sepA, dltA, fmtC, lipA, sceD, SE0760,    esp, ecpA)  #ST,
    rn <- heatmapData2$Isolate
    heatmapData2$Isolate <- NULL
    heatmapData2 <- as.data.frame(sapply(heatmapData2, as.character))
    rownames(heatmapData2) <- rn
    
    #heatmap.colours <- c("darkred", "darkblue", "darkgreen", "grey")
    #names(heatmap.colours) <- c("MT880870","MT880871","MT880872","-")
    #heatmap.colours <- c("cornflowerblue","darkgreen","seagreen3","tan","red",  "navyblue", "gold",     "green","orange","pink","purple","magenta","brown", "darksalmon","chocolate4","darkkhaki", "azure3", "maroon","lightgreen",     "blue","cyan", "skyblue2", "blueviolet",       "darkred", "darkblue", "darkgreen", "grey")
    #names(heatmap.colours) <- c("2","5","7","9","14", "17","23",   "35","59","73", "81","86","87","89","130","190","290", "297","325",    "454","487","558","766",       "MT880870","MT880871","MT880872","-")
    
    #TEMP_DEACTIVATED!  heatmap.colours <- c("cornflowerblue","darkgreen","seagreen3","tan","red",  "navyblue", "purple",     "green","cyan",       "darkred", "darkblue", "darkgreen", "grey",  "darkgreen", "grey")
    #TEMP_DEACTIVATED!  names(heatmap.colours) <- c("SCCmec_type_II(2A)", "SCCmec_type_III(3A)", "SCCmec_type_III(3A) and SCCmec_type_VIII(4A)", "SCCmec_type_IV(2B)", "SCCmec_type_IV(2B&5)", "SCCmec_type_IV(2B) and SCCmec_type_VI(4B)", "SCCmec_type_IVa(2B)", "SCCmec_type_IVb(2B)", "SCCmec_type_IVg(2B)",  "I", "II", "III", "none", "+","-")
    
    ## The heatmap colours should correspond to the factor levels
    #heatmap.colours <- #setNames(c("cornflowerblue","darkgreen","seagreen3","tan","red","green","orange","pink","brown","magenta","cyan",
    #"magenta","navyblue","cornflowerblue","gold","orange","darkgreen", "green", "seagreen3", #"chocolate4","brown","purple","pink", "tan","cyan","red"),
    #c("5", "23", "35", "69", "86", "87", "130", "224", "487", "640", "none", "P01", "P02", "P03", "P04", "P05", "P06", "P07", #"P08", "P10", "P11", "P12", "P16", "P17", "P19", "P20"))
    
    #mydat$Regulation <- factor(mydat$Regulation, levels=c("up","down"))
    #circular
    p <- ggtree(tree, layout='circular', branch.length='none') %<+% info + geom_tippoint(aes(color=ST)) + scale_color_manual(values=cols) + geom_tiplab2(aes(label=name), offset=1)
    png("ggtree.png", width=1260, height=1260)
    #svg("ggtree.svg", width=1260, height=1260)
    p
    dev.off()
    #gheatmap(p, heatmapData2, width=0.1, colnames_position="top", colnames_angle=90, colnames_offset_y = 0.1, hjust=0.5, font.size=4, offset = 5) + scale_fill_manual(values=heatmap.colours) +  theme(legend.text = element_text(size = 14)) + theme(legend.title = element_text(size = 14)) + guides(fill=guide_legend(title=""), color = guide_legend(override.aes = list(size = 5)))
    png("ggtree_and_gheatmap_selected_genes.png", width=1690, height=1400)
    #svg("ggtree_and_gheatmap_mibi_selected_genes.svg", width=17, height=15)
    gheatmap(p, heatmapData2, width=2, colnames_position="top", colnames_angle=90, colnames_offset_y = 2.0, hjust=0.7, font.size=4, offset = 17) + scale_fill_manual(values=heatmap.colours) +  theme(legend.text = element_text(size = 16)) + theme(legend.title = element_text(size = 16)) + guides(fill=guide_legend(title=""), color = guide_legend(override.aes = list(size = 5)))
    dev.off()
  6. Response

    As mentioned earlier, the presence-absence matrix of genes was generated using a BLASTn comparison between the genome and the gene sequences. Ensuring the accuracy of these gene sequences is crucial for reliable results. I prepared the sequences by searching GeneBank using the gene names, but there may be some ambiguity where the same gene name corresponds to different sequences.
    
    Attached, you will find the sequences for the 11 genes that were not identified in the isolates: gltA, psm-beta1, hlb, ebp, tagB, capA, sepA, fmtC, sceD, esp, and ecpA. I think it would be helpful if you could verify whether these sequences correspond to the genes we are looking for.
    
    In addition, I have attached a figure showing all samples with short-read sequencing. As illustrated in the plot, SCCmec and agr typing are still pending. If necessary, I can perform the typing and include the information in the plot. Alternatively, we could focus exclusively on the 10 samples obtained through long-read sequencing.
    
    Please let me know how you would prefer to proceed.

RNA-seq Tam on Acinetobacter baumannii strain ATCC 19606 CP059040.1 (Data_Tam_RNAseq_2024)

Urine_vs_MHB

AUM_vs_MHB

http://xgenes.com/article/article-content/209/rna-seq-skin-organoids-on-grch38-chrhsv1-final/ http://xgenes.com/article/article-content/157/prepare-virus-gtf-for-nextflow-run/

Methods

Data was processed using nf-core/rnaseq v3.12.0 (doi: https://doi.org/10.5281/zenodo.1400710) of the nf-core collection of workflows (Ewels et al., 2020).

The pipeline was executed with Nextflow v22.10.5 (Di Tommaso et al., 2017) with the following command:

nextflow run rnaseq/main.nf –input samplesheet.csv –outdir results –fasta /home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040.fasta –gff /home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040.gff -profile docker -resume –max_cpus 55 –max_memory 512.GB –max_time 2400.h –save_align_intermeds –save_unaligned –save_reference –aligner star_salmon –gtf_group_features gene_id –gtf_extra_attributes gene_name –featurecounts_group_type gene_biotype –featurecounts_feature_type transcript

  1. Preparing raw data

    They are wildtype strains grown in different medium.
    AUM - artificial urine medium
    Urine - human urine
    MHB - Mueller-Hinton broth
    AUM(人工尿液培养基):pH值、营养成分、无菌性、渗透压、温度、污染物。
    Urine(人类尿液):pH值、比重、温度、污染物、化学成分、微生物负荷。
    MHB(Mueller-Hinton培养基):pH值、无菌性、营养成分、温度、渗透压、抗生素浓度。
    
    mkdir raw_data; cd raw_data
    ln -s ../X101SC24105589-Z01-J001/01.RawData/AUM-1/AUM-1_1.fq.gz AUM_r1_R1.fq.gz
    ln -s ../X101SC24105589-Z01-J001/01.RawData/AUM-1/AUM-1_2.fq.gz AUM_r1_R2.fq.gz
    ln -s ../X101SC24105589-Z01-J001/01.RawData/AUM-2/AUM-2_1.fq.gz AUM_r2_R1.fq.gz
    ln -s ../X101SC24105589-Z01-J001/01.RawData/AUM-2/AUM-2_2.fq.gz AUM_r2_R2.fq.gz
    ln -s ../X101SC24105589-Z01-J001/01.RawData/AUM-3/AUM-3_1.fq.gz AUM_r3_R1.fq.gz
    ln -s ../X101SC24105589-Z01-J001/01.RawData/AUM-3/AUM-3_2.fq.gz AUM_r3_R2.fq.gz
    
    ln -s ../X101SC24105589-Z01-J001/01.RawData/MHB-1/MHB-1_1.fq.gz MHB_r1_R1.fq.gz
    ln -s ../X101SC24105589-Z01-J001/01.RawData/MHB-1/MHB-1_2.fq.gz MHB_r1_R2.fq.gz
    ln -s ../X101SC24105589-Z01-J001/01.RawData/MHB-2/MHB-2_1.fq.gz MHB_r2_R1.fq.gz
    ln -s ../X101SC24105589-Z01-J001/01.RawData/MHB-2/MHB-2_2.fq.gz MHB_r2_R2.fq.gz
    ln -s ../X101SC24105589-Z01-J001/01.RawData/MHB-3/MHB-3_1.fq.gz MHB_r3_R1.fq.gz
    ln -s ../X101SC24105589-Z01-J001/01.RawData/MHB-3/MHB-3_2.fq.gz MHB_r3_R2.fq.gz
    
    ln -s ../X101SC24105589-Z01-J001/01.RawData/Urine-1/Urine-1_1.fq.gz Urine_r1_R1.fq.gz
    ln -s ../X101SC24105589-Z01-J001/01.RawData/Urine-1/Urine-1_2.fq.gz Urine_r1_R2.fq.gz
    ln -s ../X101SC24105589-Z01-J001/01.RawData/Urine-2/Urine-2_1.fq.gz Urine_r2_R1.fq.gz
    ln -s ../X101SC24105589-Z01-J001/01.RawData/Urine-2/Urine-2_2.fq.gz Urine_r2_R2.fq.gz
    ln -s ../X101SC24105589-Z01-J001/01.RawData/Urine-3/Urine-3_1.fq.gz Urine_r3_R1.fq.gz
    ln -s ../X101SC24105589-Z01-J001/01.RawData/Urine-3/Urine-3_2.fq.gz Urine_r3_R2.fq.gz
  2. (Optional) using trinity to find the most closely reference

    Trinity --seqType fq --max_memory 50G --left trimmed/wt_r1_R1.fastq.gz  --right trimmed/wt_r1_R2.fastq.gz --CPU 12
    
    #https://www.genome.jp/kegg/tables/br08606.html#prok
    acb     KGB     Acinetobacter baumannii ATCC 17978  2007    GenBank
    abm     KGB     Acinetobacter baumannii SDF     2008    GenBank
    aby     KGB     Acinetobacter baumannii AYE     2008    GenBank
    abc     KGB     Acinetobacter baumannii ACICU   2008    GenBank
    abn     KGB     Acinetobacter baumannii AB0057  2008    GenBank
    abb     KGB     Acinetobacter baumannii AB307-0294  2008    GenBank
    abx     KGB     Acinetobacter baumannii 1656-2  2012    GenBank
    abz     KGB     Acinetobacter baumannii MDR-ZJ06    2012    GenBank
    abr     KGB     Acinetobacter baumannii MDR-TJ  2012    GenBank
    abd     KGB     Acinetobacter baumannii TCDC-AB0715     2012    GenBank
    abh     KGB     Acinetobacter baumannii TYTH-1  2012    GenBank
    abad    KGB     Acinetobacter baumannii D1279779    2013    GenBank
    abj     KGB     Acinetobacter baumannii BJAB07104   2013    GenBank
    abab    KGB     Acinetobacter baumannii BJAB0715    2013    GenBank
    abaj    KGB     Acinetobacter baumannii BJAB0868    2013    GenBank
    abaz    KGB     Acinetobacter baumannii ZW85-1  2013    GenBank
    abk     KGB     Acinetobacter baumannii AbH12O-A2   2014    GenBank
    abau    KGB     Acinetobacter baumannii AB030   2014    GenBank
    abaa    KGB     Acinetobacter baumannii AB031   2014    GenBank
    abw     KGB     Acinetobacter baumannii AC29    2014    GenBank
    abal    KGB     Acinetobacter baumannii LAC-4   2015    GenBank
    #Note that the Acinetobacter baumannii strain ATCC 19606 chromosome, complete genome (GenBank: CP059040.1) was choosen as reference!
  3. Downloading CP059040.fasta and CP059040.gff from GenBank

  4. (Optional) Preparing CP059040.fasta, CP059040_gene.gff3 and CP059040.bed

    #Reference genome: https://www.ncbi.nlm.nih.gov/nuccore/CP059040
    cp /media/jhuang/Elements2/Data_Tam_RNASeq3/CP059040.fasta .     # Elements (Anna C.arnes)
    cp /media/jhuang/Elements2/Data_Tam_RNASeq3/CP059040_gene.gff3 .
    cp /media/jhuang/Elements2/Data_Tam_RNASeq3/CP059040_gene.gtf .
    cp /media/jhuang/Elements2/Data_Tam_RNASeq3/CP059040.bed .
    rsync -a -P CP059040.fasta jhuang@hamm:~/DATA/Data_Tam_RNAseq_2024/
    rsync -a -P CP059040_gene.gff3 jhuang@hamm:~/DATA/Data_Tam_RNAseq_2024/
    rsync -a -P CP059040.bed jhuang@hamm:~/DATA/Data_Tam_RNAseq_2024/
    (base) jhuang@WS-2290C:/media/jhuang/Elements2/Data_Tam_RNASeq3$ find . -name "CP059040*"
    ./CP059040.fasta
    ./CP059040.bed
    ./CP059040.gb
    ./CP059040.gff3
    ./CP059040.gff3_backup
    ./CP059040_full.gb
    ./CP059040_gene.gff3
    ./CP059040_gene.gtf
    ./CP059040_gene_old.gff3
    ./CP059040_rRNA.gff3
    ./CP059040_rRNA_v.gff3
    
    # ---- REF: Acinetobacter baumannii ATCC 17978 (DEBUG, gene_name failed) ----
    #gffread -E -F -T GCA_000015425.1_ASM1542v1_genomic.gff -o GCA_000015425.1_ASM1542v1_genomic.gtf_
    #grep "CDS" GCA_000015425.1_ASM1542v1_genomic.gtf_ > GCA_000015425.1_ASM1542v1_genomic.gtf
    #sed -i -e "s/\tCDS\t/\texon\t/g" GCA_000015425.1_ASM1542v1_genomic.gtf
    #gffread -E -F --bed GCA_000015425.1_ASM1542v1_genomic.gtf -o GCA_000015425.1_ASM1542v1_genomic.bed
    
    grep "locus_tag" GCA_000015425.1_ASM1542v1_genomic.gtf_ > GCA_000015425.1_ASM1542v1_genomic.gtf
    sed -i -e "s/\ttranscript\t/\texon\t/g" GCA_000015425.1_ASM1542v1_genomic.gtf # or using fc_count_type=transcript
    sed -i -e "s/\tgene_name\t/\tName\t/g" GCA_000015425.1_ASM1542v1_genomic.gtf
    gffread -E -F --bed GCA_000015425.1_ASM1542v1_genomic.gtf -o GCA_000015425.1_ASM1542v1_genomic.bed
    #grep "gene_name" GCA_000015425.1_ASM1542v1_genomic.gtf | wc -l  #69=3887-3803
    
    cp CP059040.gff3 CP059040_backup.gff3
    sed -i -e "s/\tGenbank\tgene\t/\tGenbank_gene\t/g" CP059040.gff3
    grep "Genbank_gene" CP059040.gff3 > CP059040_gene.gff3
    sed -i -e "s/\tGenbank_gene\t/\tGenbank\tgene\t/g" CP059040_gene.gff3
    
    #3796-3754=42--> they are pseudogene since grep "pseudogene" CP059040.gff3 | wc -l = 42
    # --------------------------------------------------------------------------------------------------------------------------------------------------
    # ---------- PREPARING gff3 file including gene_biotype=protein_coding+gene_biotype=tRNA = total(3754)) and gene_biotype=pseudogene(42) ------------
    cp CP059040.gff3 CP059040_backup.gff3
    sed -i -e "s/\tGenbank\tgene\t/\tGenbank_gene\t/g" CP059040.gff3
    grep "Genbank_gene" CP059040.gff3 > CP059040_gene.gff3
    sed -i -e "s/\tGenbank_gene\t/\tGenbank\tgene\t/g" CP059040_gene.gff3
    grep "gene_biotype=pseudogene" CP059040.gff3_backup >> CP059040_gene.gff3    #-->3796
    
    #The whole point of the GTF format was to standardise certain aspects that are left open in GFF. Hence, there are many different valid ways to encode the same information in a valid GFF format, and any parser or converter needs to be written specifically for the choices the author of the GFF file made. For example, a GTF file requires the gene ID attribute to be called "gene_id", while in GFF files, it may be "ID", "Gene", something different, or completely missing.
    # from gff3 to gtf
    sed -i -e "s/\tID=gene-/\tgene_id \"/g" CP059040_gene.gtf
    sed -i -e "s/;/\"; /g" CP059040_gene.gtf
    sed -i -e "s/=/=\"/g" CP059040_gene.gtf
    
    #sed -i -e "s/\n/\"\n/g" CP059040_gene.gtf
    #using editor instead!
    
    #The following is GTF-format.
    CP000521.1      Genbank exon    95      1492    .       +       .       transcript_id "gene0"; gene_id "gene0"; Name "A1S_0001"; gbkey "Gene"; gene_biotype "protein_coding"; locus_tag "A1S_0001";
    
    #NZ_MJHA01000001.1       RefSeq  region  1       8663    .       +       .       ID=id0;Dbxref=taxon:575584;Name=unnamed1;collected-by=IG Schaub;collection-date=1948;country=USA: Vancouver;culture-collection=ATCC:19606;gbkey=Src;genome=plasmid;isolation-source=urine;lat-lon=37.53 N 75.4 W;map=unlocalized;mol_type=genomic DNA;nat-host=Homo sapiens;plasmid-name=unnamed1;strain=ATCC 19606;type-material=type strain of Acinetobacter baumannii
    #NZ_MJHA01000001.1       RefSeq  gene    228     746     .       -       .       ID=gene0;Name=BIT33_RS00005;gbkey=Gene;gene_biotype=protein_coding;locus_tag=BIT33_RS00005;old_locus_tag=BIT33_18795
    #NZ_MJHA01000001.1       Protein Homology        CDS     228     746     .       -       0       ID=cds0;Parent=gene0;Dbxref=Genbank:WP_000839337.1;Name=WP_000839337.1;gbkey=CDS;inference=COORDINATES: similar to AA sequence:RefSeq:WP_000839337.1;product=hypothetical protein;protein_id=WP_000839337.1;transl_table=11
    
    ##gff-version 3
    ##sequence-region CP059040.1 1 3980852
    ##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=470
    
    gffread -E -F --bed CP059040.gff3 -o CP059040.bed    #-->3796
    ##prepare the GTF-format (see above) --> ERROR! ----> using CP059040.gff3
    ##stringtie adeIJ.abx_r1.sorted.bam -o adeIJ.abx_r1.sorted_transcripts.gtf -v -G /media/jhuang/Elements/Data_Tam_RNASeq3/CP059040.gff3 -A adeIJ.abx_r1.sorted.gene_abund.txt -C adeIJ.abx_r1.sorted.bam.cov_refs.gtf -e -b adeIJ.abx_r1.sorted_ballgown
    #[01/21 10:57:46] Loading reference annotation (guides)..
    #GFF warning: merging adjacent/overlapping segments of gene-H0N29_00815 on CP059040.1 (179715-179786, 179788-180810)
    #[01/21 10:57:46] 3796 reference transcripts loaded.
    #Default stack size for threads: 8388608
    #WARNING: no reference transcripts found for genomic sequence "gi|1906906720|gb|CP059040.1|"! (mismatched reference names?)
    #WARNING: no reference transcripts were found for the genomic sequences where reads were mapped!
    #Please make sure the -G annotation file uses the same naming convention for the genome sequences.
    #[01/21 10:58:30] All threads finished.
    
    #  ERROR: failed to find the gene identifier attribute in the 9th column of the provided GTF file.
    #  The specified gene identifier attribute is 'Name'
    #  An example of attributes included in your GTF annotation is 'ID=exon-H0N29_00075-1;Parent=rna-H0N29_00075;gbkey=rRNA;locus_tag=H0N29_00075;product=16S ribosomal RNA'
    #  The program has to termin
    
    #  ERROR: failed to find the gene identifier attribute in the 9th column of the provided GTF file.
    #  The specified gene identifier attribute is 'gene_biotype'
    #  An example of attributes included in your GTF annotation is 'ID=exon-H0N29_00075-1;Parent=rna-H0N29_00075;gbkey=rRNA;locus_tag=H0N29_00075;product=16S ribosomal RNA'
    #  The program has to terminate.
    
    #grep "ID=cds-" CP059040.gff3 | wc -l
    #grep "ID=exon-" CP059040.gff3 | wc -l
    #grep "ID=gene-" CP059040.gff3 | wc -l   #the same as H0N29_18980/5=3796
    grep "gbkey=" CP059040.gff3 | wc -l  7695
    grep "ID=id-" CP059040.gff3 | wc -l  5
    grep "locus_tag=" CP059040.gff3 | wc -l    7689
    #...
    cds   3701                             locus_tag=xxxx, no gene_biotype
    exon   96                              locus_tag=xxxx, no gene_biotype
    gene   3796                            locus_tag=xxxx, gene_biotype=xxxx,
    id  (riboswitch+direct_repeat,5)       both no --> ignoring them!!  # grep "ID=id-" CP059040.gff3
    rna    96                              locus_tag=xxxx, no gene_biotype
    ------------------
        7694
    
    cp CP059040.gff3_backup CP059040.gff3
    grep "^##" CP059040.gff3 > CP059040_gene.gff3
    grep "ID=gene" CP059040.gff3 >> CP059040_gene.gff3
    #!!!!VERY_IMPORTANT!!!!: change type '\tCDS\t' to '\texon\t'!
    sed -i -e "s/\tgene\t/\texon\t/g" CP059040_gene.gff3
  5. Preparing the directory trimmed

    mkdir trimmed trimmed_unpaired;
    for sample_id in AUM_r1 AUM_r2 AUM_r3 Urine_r1 Urine_r2 Urine_r3 MHB_r1 MHB_r2 MHB_r3; do \
    for sample_id in MHB_r1 MHB_r2 MHB_r3; do \
            java -jar /home/jhuang/Tools/Trimmomatic-0.36/trimmomatic-0.36.jar PE -threads 100 raw_data/${sample_id}_R1.fq.gz raw_data/${sample_id}_R2.fq.gz trimmed/${sample_id}_R1.fq.gz trimmed_unpaired/${sample_id}_R1.fq.gz trimmed/${sample_id}_R2.fq.gz trimmed_unpaired/${sample_id}_R2.fq.gz ILLUMINACLIP:/home/jhuang/Tools/Trimmomatic-0.36/adapters/TruSeq3-PE-2.fa:2:30:10:8:TRUE LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36 AVGQUAL:20; done 2> trimmomatic_pe.log;
    done
  6. Preparing samplesheet.csv

    sample,fastq_1,fastq_2,strandedness
    AUM_r1,AUM_r1_R1.fq.gz,AUM_r1_R2.fq.gz,auto
    AUM_r2,AUM_r2_R1.fq.gz,AUM_r2_R2.fq.gz,auto
    AUM_r3,AUM_r3_R1.fq.gz,AUM_r3_R2.fq.gz,auto
    MHB_r1,MHB_r1_R1.fq.gz,MHB_r1_R2.fq.gz,auto
    MHB_r2,MHB_r2_R1.fq.gz,MHB_r2_R2.fq.gz,auto
    MHB_r3,MHB_r3_R1.fq.gz,MHB_r3_R2.fq.gz,auto
    Urine_r1,Urine_r1_R1.fq.gz,Urine_r1_R2.fq.gz,auto
    Urine_r2,Urine_r2_R1.fq.gz,Urine_r2_R2.fq.gz,auto
    Urine_r3,Urine_r3_R1.fq.gz,Urine_r3_R2.fq.gz,auto
  7. nextflow run

    #Example1: http://xgenes.com/article/article-content/157/prepare-virus-gtf-for-nextflow-run/
    
    docker pull nfcore/rnaseq
    ln -s /home/jhuang/Tools/nf-core-rnaseq-3.12.0/ rnaseq
    
    #Default: --gtf_group_features 'gene_id'  --gtf_extra_attributes 'gene_name' --featurecounts_group_type 'gene_biotype' --featurecounts_feature_type 'exon'
    #(host_env) !NOT_WORKING! jhuang@WS-2290C:~/DATA/Data_Tam_RNAseq_2024$ /usr/local/bin/nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results    --fasta "/home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040.fasta" --gff "/home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040.gff"        -profile docker -resume  --max_cpus 55 --max_memory 512.GB --max_time 2400.h    --save_align_intermeds --save_unaligned --save_reference    --aligner 'star_salmon'    --gtf_group_features 'gene_id'  --gtf_extra_attributes 'gene_name' --featurecounts_group_type 'gene_biotype' --featurecounts_feature_type 'transcript'
    
    # -- DEBUG_1 (CDS --> exon in CP059040.gff) --
    #Checking the record (see below) in results/genome/CP059040.gtf
    #In ./results/genome/CP059040.gtf e.g. "CP059040.1      Genbank transcript      1       1398    .       +       .       transcript_id "gene-H0N29_00005"; gene_id "gene-H0N29_00005"; gene_name "dnaA"; Name "dnaA"; gbkey "Gene"; gene "dnaA"; gene_biotype "protein_coding"; locus_tag "H0N29_00005";"
    #--featurecounts_feature_type 'transcript' returns only the tRNA results
    #Since the tRNA records have "transcript and exon". In gene records, we have "transcript and CDS". replace the CDS with exon
    
    grep -P "\texon\t" CP059040.gff | sort | wc -l    #96
    grep -P "cmsearch\texon\t" CP059040.gff | wc -l    #=10  ignal recognition particle sRNA small typ, transfer-messenger RNA, 5S ribosomal RNA
    grep -P "Genbank\texon\t" CP059040.gff | wc -l    #=12  16S and 23S ribosomal RNA
    grep -P "tRNAscan-SE\texon\t" CP059040.gff | wc -l    #tRNA 74
    wc -l star_salmon/AUM_r3/quant.genes.sf  #--featurecounts_feature_type 'transcript' results in 96 records!
    
    grep -P "\tCDS\t" CP059040.gff | wc -l  #3701
    sed 's/\tCDS\t/\texon\t/g' CP059040.gff > CP059040_m.gff
    grep -P "\texon\t" CP059040_m.gff | sort | wc -l  #3797
    
    # -- DEBUG_2: combination of 'CP059040_m.gff' and 'exon' results in ERROR, using 'transcript' instead!
    --gff "/home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040_m.gff" --featurecounts_feature_type 'transcript'
    
    # ---- SUCCESSFUL with directly downloaded gff3 and fasta from NCBI using docker after replacing 'CDS' with 'exon' ----
    (host_env) /usr/local/bin/nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results    --fasta "/home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040.fasta" --gff "/home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040_m.gff"        -profile docker -resume  --max_cpus 55 --max_memory 512.GB --max_time 2400.h    --save_align_intermeds --save_unaligned --save_reference    --aligner 'star_salmon'    --gtf_group_features 'gene_id'  --gtf_extra_attributes 'gene_name' --featurecounts_group_type 'gene_biotype' --featurecounts_feature_type 'transcript'
    
    # -- DEBUG_3: make sure the header of fasta is the same to the *_m.gff file
  8. Import data and pca-plot

    #mamba activate r_env
    
    #install.packages("ggfun")
    # Import the required libraries
    library("AnnotationDbi")
    library("clusterProfiler")
    library("ReactomePA")
    library(gplots)
    library(tximport)
    library(DESeq2)
    #library("org.Hs.eg.db")
    library(dplyr)
    library(tidyverse)
    #install.packages("devtools")
    #devtools::install_version("gtable", version = "0.3.0")
    library(gplots)
    library("RColorBrewer")
    #install.packages("ggrepel")
    library("ggrepel")
    # install.packages("openxlsx")
    library(openxlsx)
    library(EnhancedVolcano)
    library(DESeq2)
    
    setwd("~/DATA/Data_Tam_RNAseq_2024/results/star_salmon")
    # Define paths to your Salmon output quantification files
    files <- c("AUM_r1" = "./AUM_r1/quant.sf",
            "AUM_r2" = "./AUM_r2/quant.sf",
            "AUM_r3" = "./AUM_r3/quant.sf",
            "Urine_r1" = "./Urine_r1/quant.sf",
            "Urine_r2" = "./Urine_r2/quant.sf",
            "Urine_r3" = "./Urine_r3/quant.sf",
            "MHB_r1" = "./MHB_r1/quant.sf",
            "MHB_r2" = "./MHB_r2/quant.sf",
            "MHB_r3" = "./MHB_r3/quant.sf")
    # Import the transcript abundance data with tximport
    txi <- tximport(files, type = "salmon", txIn = TRUE, txOut = TRUE)
    # Define the replicates and condition of the samples
    replicate <- factor(c("r1", "r2", "r3", "r1", "r2", "r3", "r1", "r2", "r3"))
    condition <- factor(c("AUM","AUM","AUM", "Urine","Urine","Urine", "MHB","MHB","MHB"))
    # Define the colData for DESeq2
    colData <- data.frame(condition=condition, replicate=replicate, row.names=names(files))
    # -- transcript-level count data (x2) --
    # Create DESeqDataSet object
    dds <- DESeqDataSetFromTximport(txi, colData=colData, design=~condition)
    write.csv(counts(dds), file="transcript_counts.csv")
    # -- gene-level count data (x2) --
    # Read in the tx2gene map from salmon_tx2gene.tsv
    tx2gene <- read.table("salmon_tx2gene.tsv", header=FALSE, stringsAsFactors=FALSE)
    # Set the column names
    colnames(tx2gene) <- c("transcript_id", "gene_id", "gene_name")
    # Remove the gene_name column if not needed
    tx2gene <- tx2gene[,1:2]
    # Import and summarize the Salmon data with tximport
    txi <- tximport(files, type = "salmon", tx2gene = tx2gene, txOut = FALSE)
    # Continue with the DESeq2 workflow as before...
    colData <- data.frame(condition=condition, replicate=replicate, row.names=names(files))
    dds <- DESeqDataSetFromTximport(txi, colData=colData, design=~condition+replicate)
    #dds <- dds[rowSums(counts(dds) > 3) > 2, ]    #3796->3487
    write.csv(counts(dds, normalized=FALSE), file="gene_counts.csv")
    dim(counts(dds))
    head(counts(dds), 10)
    rld <- rlogTransformation(dds)
    
    #We don't need to run DESeq(dds) before estimateSizeFactors(dds). In fact, the typical workflow in DESeq2 is the opposite: we usually run estimateSizeFactors(dds) (and other preprocessing functions) before running the main DESeq(dds) function.
    #The estimateSizeFactors function is used to calculate size factors for normalization, which corrects for differences in library size (i.e., the number of read counts) between samples. This normalization step is crucial to ensure that differences in gene expression aren't merely due to differences in sequencing depth between samples.
    #The DESeq function, on the other hand, performs the main differential expression analysis, comparing gene expression between different conditions or groups.
    #So, the typical workflow is:
    #  - Create the DESeqDataSet object.
    #  - Use estimateSizeFactors to normalize for library size.
    #  - (Optionally, estimate dispersion with estimateDispersions if not using the full DESeq function later.)
    #  - Use DESeq for the differential expression analysis.
    #  - However, it's worth noting that if you run the main DESeq function directly after creating the DESeqDataSet object, it will automatically perform the normalization (using estimateSizeFactors) and dispersion estimation steps for you. In that case, there's no need to run estimateSizeFactors separately before DESeq.
    
    # draw simple pca and heatmap
    #mat <- assay(rld)
    #mm <- model.matrix(~condition, colData(rld))
    #mat <- limma::removeBatchEffect(mat, batch=rld$batch, design=mm)
    #assay(rld) <- mat
    # -- pca --
    png("pca.png", 1200, 800)
    plotPCA(rld, intgroup=c("condition"))
    dev.off()
    # -- heatmap --
    png("heatmap.png", 1200, 800)
    distsRL <- dist(t(assay(rld)))
    mat <- as.matrix(distsRL)
    hc <- hclust(distsRL)
    hmcol <- colorRampPalette(brewer.pal(9,"GnBu"))(100)
    heatmap.2(mat, Rowv=as.dendrogram(hc),symm=TRUE, trace="none",col = rev(hmcol), margin=c(13, 13))
    dev.off()
  9. (Optional (ERROR–>need to be debugged!) ) estimate size factors and dispersion values.

    #Size Factors: These are used to normalize the read counts across different samples. The size factor for a sample accounts for differences in sequencing depth (i.e., the total number of reads) and other technical biases between samples. After normalization with size factors, the counts should be comparable across samples. Size factors are usually calculated in a way that they reflect the median or mean ratio of gene expression levels between samples, assuming that most genes are not differentially expressed.
    #Dispersion: This refers to the variability or spread of gene expression measurements. In RNA-seq data analysis, each gene has its own dispersion value, which reflects how much the counts for that gene vary between different samples, more than what would be expected just due to the Poisson variation inherent in counting. Dispersion is important for accurately modeling the data and for detecting differentially expressed genes.
    #So in summary, size factors are specific to samples (used to make counts comparable across samples), and dispersion values are specific to genes (reflecting variability in gene expression).
    
    sizeFactors(dds)
    #NULL
    # Estimate size factors
    dds <- estimateSizeFactors(dds)
    # Estimate dispersions
    dds <- estimateDispersions(dds)
    #> sizeFactors(dds)
    
    #control_r1 control_r2  HSV.d2_r1  HSV.d2_r2  HSV.d4_r1  HSV.d4_r2  HSV.d6_r1
    #2.3282468  2.0251928  1.8036883  1.3767551  0.9341929  1.0911693  0.5454526
    #HSV.d6_r2  HSV.d8_r1  HSV.d8_r2
    #0.4604461  0.5799834  0.6803681
    
    # (DEBUG) If avgTxLength is Necessary
    #To simplify the computation and ensure sizeFactors are calculated:
    assays(dds)$avgTxLength <- NULL
    dds <- estimateSizeFactors(dds)
    sizeFactors(dds)
    #If you want to retain avgTxLength but suspect it is causing issues, you can explicitly instruct DESeq2 to compute size factors without correcting for library size with average transcript lengths:
    dds <- estimateSizeFactors(dds, controlGenes = NULL, use = FALSE)
    sizeFactors(dds)
    
    # If alone with virus data, the following BUG occured:
    #Still NULL --> BUG --> using manual calculation method for sizeFactor calculation!
                        HeLa_TO_r1                      HeLa_TO_r2
                        0.9978755                       1.1092227
    data.frame(genes = rownames(dds), dispersions = dispersions(dds))
    
    #Given the raw counts, the control_r1 and control_r2 samples seem to have a much lower sequencing depth (total read count) than the other samples. Therefore, when normalization methods are applied, the normalization factors for these control samples will be relatively high, boosting the normalized counts.
    1/0.9978755=1.002129023
    1/1.1092227=
    #bamCoverage --bam ../markDuplicates/${sample}Aligned.sortedByCoord.out.bam -o ${sample}_norm.bw --binSize 10 --scaleFactor  --effectiveGenomeSize 2864785220
    bamCoverage --bam ../markDuplicates/HeLa_TO_r1Aligned.sortedByCoord.out.markDups.bam -o HeLa_TO_r1.bw --binSize 10 --scaleFactor 1.002129023     --effectiveGenomeSize 2864785220
    bamCoverage --bam ../markDuplicates/HeLa_TO_r2Aligned.sortedByCoord.out.markDups.bam -o HeLa_TO_r2.bw --binSize 10 --scaleFactor  0.901532217        --effectiveGenomeSize 2864785220
    
    raw_counts <- counts(dds)
    normalized_counts <- counts(dds, normalized=TRUE)
    #write.table(raw_counts, file="raw_counts.txt", sep="\t", quote=F, col.names=NA)
    #write.table(normalized_counts, file="normalized_counts.txt", sep="\t", quote=F, col.names=NA)
    #convert bam to bigwig using deepTools by feeding inverse of DESeq’s size Factor
    estimSf <- function (cds){
        # Get the count matrix
        cts <- counts(cds)
        # Compute the geometric mean
        geomMean <- function(x) prod(x)^(1/length(x))
        # Compute the geometric mean over the line
        gm.mean  <-  apply(cts, 1, geomMean)
        # Zero values are set to NA (avoid subsequentcdsdivision by 0)
        gm.mean[gm.mean == 0] <- NA
        # Divide each line by its corresponding geometric mean
        # sweep(x, MARGIN, STATS, FUN = "-", check.margin = TRUE, ...)
        # MARGIN: 1 or 2 (line or columns)
        # STATS: a vector of length nrow(x) or ncol(x), depending on MARGIN
        # FUN: the function to be applied
        cts <- sweep(cts, 1, gm.mean, FUN="/")
        # Compute the median over the columns
        med <- apply(cts, 2, median, na.rm=TRUE)
        # Return the scaling factor
        return(med)
    }
    #https://dputhier.github.io/ASG/practicals/rnaseq_diff_Snf2/rnaseq_diff_Snf2.html
    #http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#data-transformations-and-visualization
    #https://hbctraining.github.io/DGE_workshop/lessons/02_DGE_count_normalization.html
    #https://hbctraining.github.io/DGE_workshop/lessons/04_DGE_DESeq2_analysis.html
    #https://genviz.org/module-04-expression/0004/02/01/DifferentialExpression/
    #DESeq2’s median of ratios [1]
    #EdgeR’s trimmed mean of M values (TMM) [2]
    #http://www.nathalievialaneix.eu/doc/html/TP1_normalization.html  #very good website!
    test_normcount <- sweep(raw_counts, 2, sizeFactors(dds), "/")
    sum(test_normcount != normalized_counts)
  10. Select the differentially expressed genes

    #https://galaxyproject.eu/posts/2020/08/22/three-steps-to-galaxify-your-tool/
    #https://www.biostars.org/p/282295/
    #https://www.biostars.org/p/335751/
    #> dds$condition
    #[1] AUM   AUM   AUM   Urine Urine Urine MHB   MHB   MHB
    #Levels: AUM MHB Urine
    #CONSOLE: mkdir star_salmon/degenes
    
    setwd("degenes")
    #---- relevel to control ----
    dds$condition <- relevel(dds$condition, "MHB")
    dds = DESeq(dds, betaPrior=FALSE)
    resultsNames(dds)
    clist <- c("AUM_vs_MHB","Urine_vs_MHB")
    
    for (i in clist) {
      contrast = paste("condition", i, sep="_")
      res = results(dds, name=contrast)
      res <- res[!is.na(res$log2FoldChange),]
      res_df <- as.data.frame(res)
    
      write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(i, "all.txt", sep="-"))
      up <- subset(res_df, padj<=0.05 & log2FoldChange>=1.35)
      down <- subset(res_df, padj<=0.05 & log2FoldChange<=-1.35)
      write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(i, "up.txt", sep="-"))
      write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(i, "down.txt", sep="-"))
    }
    
    # -- Under host-env --
    grep -P "\tgene\t" CP059040.gff > CP059040_gene.gff
    python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040_gene.gff AUM_vs_MHB-all.txt AUM_vs_MHB-all.csv
    python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040_gene.gff AUM_vs_MHB-up.txt AUM_vs_MHB-up.csv
    python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040_gene.gff AUM_vs_MHB-down.txt AUM_vs_MHB-down.csv
    python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040_gene.gff Urine_vs_MHB-all.txt Urine_vs_MHB-all.csv
    python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040_gene.gff Urine_vs_MHB-up.txt Urine_vs_MHB-up.csv
    python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040_gene.gff Urine_vs_MHB-down.txt Urine_vs_MHB-down.csv
    
    #for i in AUM_vs_MHB Urine_vs_MHB; do
    #  echo "contrast = paste(\"condition\", \"${i}\", sep=\"_\")"
    #  echo "res = results(dds, name=contrast)"
    #  echo "res <- res[!is.na(res$log2FoldChange),]"
    #  echo "res_df <- as.data.frame(res)"
    #  #selectLab = selectLab_italics,
    #  echo "png(\"${i}.png\",width=1200, height=1000)"
    #  #legendPosition = 'right',legendLabSize = 12,  arrowheads = FALSE,
    #  #subtitle=expression(~Delta*\"$(echo $i | cut -d'_' -f1) versus \" *~Delta*\"$(echo $i | cut -d'_' -f3)\"))"
    #  echo "EnhancedVolcano(res, lab = rownames(res),x = 'log2FoldChange',y = 'padj', pCutoff=5e-2, FCcutoff=1.2, title='', subtitleLabSize = 18, pointSize = 3.0, labSize = 5.0, colAlpha=1, legendIconSize = 4.0, drawConnectors = TRUE, widthConnectors = 0.5, colConnectors = 'black', subtitle=expression(\"$(echo $i | cut -d'_' -f1) versus $(echo $i | cut -d'_' -f3)\"))"
    #  echo "dev.off()"
    #done
    
    res <- read.csv("AUM_vs_MHB-all.csv")
    # Replace empty GeneName with modified GeneID
    res$GeneName <- ifelse(
      res$GeneName == "" | is.na(res$GeneName),
      gsub("gene-", "", res$GeneID),
      res$GeneName
    )
    duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]
    #print(duplicated_genes)
    # [1] "bfr"  "lipA" "ahpF" "pcaF" "alr"  "pcaD" "cydB" "lpdA" "pgaC" "ppk1"
    #[11] "pcaF" "tuf"  "galE" "murI" "yccS" "rrf"  "rrf"  "arsB" "ptsP" "umuD"
    #[21] "map"  "pgaB" "rrf"  "rrf"  "rrf"  "pgaD" "uraH" "benE"
    #res[res$GeneName == "bfr", ]
    
    #1st_strategy First occurrence is kept and Subsequent duplicates are removed
    #res <- res[!duplicated(res$GeneName), ]
    #2nd_strategy keep the row with the smallest padj value for each GeneName
    res <- res %>%
      group_by(GeneName) %>%
      slice_min(padj, with_ties = FALSE) %>%
      ungroup()
    res <- as.data.frame(res)
    # Sort res first by padj (ascending) and then by log2FoldChange (descending)
    res <- res[order(res$padj, -res$log2FoldChange), ]
    
    # Assuming res is your dataframe and already processed
    # Filter up-regulated genes: log2FoldChange > 2 and padj < 1e-2
    up_regulated <- res[res$log2FoldChange > 2 & res$padj < 1e-2, ]
    # Filter down-regulated genes: log2FoldChange < -2 and padj < 1e-2
    down_regulated <- res[res$log2FoldChange < -2 & res$padj < 1e-2, ]
    # Create a new workbook
    wb <- createWorkbook()
    # Add the complete dataset as the first sheet
    addWorksheet(wb, "Complete_Data")
    writeData(wb, "Complete_Data", res)
    # Add the up-regulated genes as the second sheet
    addWorksheet(wb, "Up_Regulated")
    writeData(wb, "Up_Regulated", up_regulated)
    # Add the down-regulated genes as the third sheet
    addWorksheet(wb, "Down_Regulated")
    writeData(wb, "Down_Regulated", down_regulated)
    # Save the workbook to a file
    saveWorkbook(wb, "Gene_Expression_AUM_vs_MHB.xlsx", overwrite = TRUE)
    
    # Set the 'GeneName' column as row.names
    rownames(res) <- res$GeneName
    # Drop the 'GeneName' column since it's now the row names
    res$GeneName <- NULL
    head(res)
    
    ## Ensure the data frame matches the expected format
    ## For example, it should have columns: log2FoldChange, padj, etc.
    #res <- as.data.frame(res)
    ## Remove rows with NA in log2FoldChange (if needed)
    #res <- res[!is.na(res$log2FoldChange),]
    
    # Replace padj = 0 with a small value
    res$padj[res$padj == 0] <- 1e-150
    
    #library(EnhancedVolcano)
    # Assuming res is already sorted and processed
    png("AUM_vs_MHB.png", width=1200, height=2000)
    #max.overlaps = 10
    EnhancedVolcano(res,
                    lab = rownames(res),
                    x = 'log2FoldChange',
                    y = 'padj',
                    pCutoff = 1e-2,
                    FCcutoff = 2,
                    title = '',
                    subtitleLabSize = 18,
                    pointSize = 3.0,
                    labSize = 5.0,
                    colAlpha = 1,
                    legendIconSize = 4.0,
                    drawConnectors = TRUE,
                    widthConnectors = 0.5,
                    colConnectors = 'black',
                    subtitle = expression("AUM versus MHB"))
    dev.off()
    
    res <- read.csv("Urine_vs_MHB-all.csv")
    # Replace empty GeneName with modified GeneID
    res$GeneName <- ifelse(
      res$GeneName == "" | is.na(res$GeneName),
      gsub("gene-", "", res$GeneID),
      res$GeneName
    )
    duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]
    
    res <- res %>%
      group_by(GeneName) %>%
      slice_min(padj, with_ties = FALSE) %>%
      ungroup()
    res <- as.data.frame(res)
    # Sort res first by padj (ascending) and then by log2FoldChange (descending)
    res <- res[order(res$padj, -res$log2FoldChange), ]
    
    # Assuming res is your dataframe and already processed
    # Filter up-regulated genes: log2FoldChange > 2 and padj < 1e-2
    up_regulated <- res[res$log2FoldChange > 2 & res$padj < 1e-2, ]
    # Filter down-regulated genes: log2FoldChange < -2 and padj < 1e-2
    down_regulated <- res[res$log2FoldChange < -2 & res$padj < 1e-2, ]
    # Create a new workbook
    wb <- createWorkbook()
    # Add the complete dataset as the first sheet
    addWorksheet(wb, "Complete_Data")
    writeData(wb, "Complete_Data", res)
    # Add the up-regulated genes as the second sheet
    addWorksheet(wb, "Up_Regulated")
    writeData(wb, "Up_Regulated", up_regulated)
    # Add the down-regulated genes as the third sheet
    addWorksheet(wb, "Down_Regulated")
    writeData(wb, "Down_Regulated", down_regulated)
    # Save the workbook to a file
    saveWorkbook(wb, "Gene_Expression_Urine_vs_MHB.xlsx", overwrite = TRUE)
    
    # Set the 'GeneName' column as row.names
    rownames(res) <- res$GeneName
    # Drop the 'GeneName' column since it's now the row names
    res$GeneName <- NULL
    head(res)
    
    ## Ensure the data frame matches the expected format
    ## For example, it should have columns: log2FoldChange, padj, etc.
    #res <- as.data.frame(res)
    ## Remove rows with NA in log2FoldChange (if needed)
    #res <- res[!is.na(res$log2FoldChange),]
    
    # Replace padj = 0 with a small value
    res$padj[res$padj == 0] <- 1e-305
    
    #library(EnhancedVolcano)
    # Assuming res is already sorted and processed
    png("Urine_vs_MHB.png", width=1200, height=2000)
    #max.overlaps = 10
    EnhancedVolcano(res,
                    lab = rownames(res),
                    x = 'log2FoldChange',
                    y = 'padj',
                    pCutoff = 1e-2,
                    FCcutoff = 2,
                    title = '',
                    subtitleLabSize = 18,
                    pointSize = 3.0,
                    labSize = 5.0,
                    colAlpha = 1,
                    legendIconSize = 4.0,
                    drawConnectors = TRUE,
                    widthConnectors = 0.5,
                    colConnectors = 'black',
                    subtitle = expression("Urine versus MHB"))
    dev.off()
  11. Report

    Attached are the results of the analysis.
    
    In the Urine_vs_MHB comparison, we identified a total of 259 up-regulated genes (log2FoldChange > 2 and padj < 1e-2) and 138 down-regulated genes (log2FoldChange < -2 and padj < 1e-2) (please refer to the attached volcano plot and Excel files). Notably, the following genes have a p-adjusted value of 0, indicating very high confidence in their differential expression. The bas-series genes (basA, basB, basC, basD, basE, basJ) are particularly prominent:
    
    GeneName    GeneID  baseMean    log2FoldChange  lfcSE   stat    pvalue  padj
    basJ    gene-H0N29_05120    11166.90    11.42   0.30    38.27   0   0
    basE    gene-H0N29_05085    12006.52    10.45   0.23    45.76   0   0
    basD    gene-H0N29_05080    12217.80    10.15   0.24    42.42   0   0
    bauA    gene-H0N29_05070    25280.68    9.55    0.19    51.48   0   0
    basA    gene-H0N29_05040    9750.68 9.02    0.18    48.90   0   0
    basC    gene-H0N29_05075    5034.14 8.58    0.21    40.07   0   0
    H0N29_08320 gene-H0N29_08320    4935.78 7.87    0.20    40.01   0   0
    barB    gene-H0N29_05105    5187.29 7.81    0.18    43.39   0   0
    H0N29_09380 gene-H0N29_09380    3477.26 7.41    0.19    38.91   0   0
    H0N29_13950 gene-H0N29_13950    13959.05    6.85    0.15    45.70   0   0
    H0N29_10825 gene-H0N29_10825    3664.70 6.44    0.17    37.59   0   0
    H0N29_10790 gene-H0N29_10790    2574.12 6.41    0.17    37.86   0   0
    H0N29_10010 gene-H0N29_10010    9376.84 -8.14   0.19    -43.70  0   0
    
    In the AUM_vs_MHB comparison, we identified a total of 149 up-regulated genes (log2FoldChange > 2 and padj < 1e-2) and 65 down-regulated genes (log2FoldChange < -2 and padj < 1e-2) (please refer to the attached volcano plot and Excel files). The following genes also show a p-adjusted value of 0, indicating very high confidence in their differential expression:
    
    GeneName    GeneID  baseMean    log2FoldChange  lfcSE   stat    pvalue  padj
    putA    gene-H0N29_09870    36100.24    -7.25   0.15    -49.78  0   0
    H0N29_10010 gene-H0N29_10010    9376.84 -7.43   0.18    -41.96  0   0
    
    To ensure proper visualization, I replaced the padj = 0 values with small numbers: 1e-305 for Urine_vs_MHB and 1e-150 for AUM_vs_MHB.
    
    We have now identified the significantly expressed genes. If you would like any further analysis based on these genes or need additional plots, please let me know.
  12. (TODO) clustering the genes and draw heatmap

    for i in HSV.d2_vs_control HSV.d4_vs_control HSV.d6_vs_control HSV.d8_vs_control HSV.d4_vs_HSV.d2 HSV.d6_vs_HSV.d2 HSV.d8_vs_HSV.d2 HSV.d6_vs_HSV.d4 HSV.d8_vs_HSV.d4 HSV.d8_vs_HSV.d6; do echo "cut -d',' -f1-1 ${i}-up_annotated.txt > ${i}-up.id"; echo "cut -d',' -f1-1 ${i}-down_annotated.txt > ${i}-down.id"; done
    cat *.id | sort -u > ids
    #add Gene_Id in the first line, delete the ""
    GOI <- read.csv("ids")$Gene_Id  #4647
    RNASeq.NoCellLine <- assay(rld)
    #install.packages("gplots")
    library("gplots")
    #clustering methods: "ward.D", "ward.D2", "single", "complete", "average" (= UPGMA), "mcquitty" (= WPGMA), "median" (= WPGMC) or "centroid" (= UPGMC).  pearson or spearman
    datamat = RNASeq.NoCellLine[GOI, ]
    #datamat = RNASeq.NoCellLine
    write.csv(as.data.frame(datamat), file ="gene_expressions.txt")
    constant_rows <- apply(datamat, 1, function(row) var(row) == 0)
    if(any(constant_rows)) {
      cat("Removing", sum(constant_rows), "constant rows.\n")
      datamat <- datamat[!constant_rows, ]
    }
    hr <- hclust(as.dist(1-cor(t(datamat), method="pearson")), method="complete")
    hc <- hclust(as.dist(1-cor(datamat, method="spearman")), method="complete")
    mycl = cutree(hr, h=max(hr$height)/1.05)
    mycol = c("YELLOW", "BLUE", "ORANGE", "MAGENTA", "CYAN", "RED", "GREEN", "MAROON", "LIGHTBLUE", "PINK", "MAGENTA", "LIGHTCYAN", "LIGHTRED", "LIGHTGREEN");
    mycol = mycol[as.vector(mycl)]
    #png("DEGs_heatmap.png", width=900, height=800)
    #cex.lab=10, labRow="",
    png("DEGs_heatmap.png", width=800, height=1000)
    heatmap.2(as.matrix(datamat),Rowv=as.dendrogram(hr),Colv = NA, dendrogram = 'row',labRow="",
                scale='row',trace='none',col=bluered(75), cexCol=1.8,
                RowSideColors = mycol, margins=c(10,2), cexRow=1.5, srtCol=30, lhei = c(1, 8), lwid=c(2, 8))  #rownames(datamat)
    #heatmap.2(datamat, Rowv=as.dendrogram(hr), col=bluered(75), scale="row", RowSideColors=mycol, trace="none", margin=c(5,5), sepwidth=c(0,0), dendrogram = 'row', Colv = 'false', density.info='none', labRow="", srtCol=30, lhei=c(0.1,2))
    dev.off()
    #### cluster members #####
    write.csv(names(subset(mycl, mycl == '1')),file='cluster1_YELLOW.txt')
    write.csv(names(subset(mycl, mycl == '2')),file='cluster2_DARKBLUE.txt')
    write.csv(names(subset(mycl, mycl == '3')),file='cluster3_DARKORANGE.txt')
    write.csv(names(subset(mycl, mycl == '4')),file='cluster4.txt')
    #~/Tools/csv2xls-0.4/csv_to_xls.py cluster*.txt -d',' -o DEGs_heatmap_cluster_members.xls
    ~/Tools/csv2xls-0.4/csv_to_xls.py \
    significant_gene_expressions.txt \
    -d',' -o DEGs_heatmap_expression_data.xls;
    #### cluster members (advanced) #####
    subset_1<-names(subset(mycl, mycl == '1'))
    data <- as.data.frame(datamat[rownames(datamat) %in% subset_1, ])  #2575
    subset_2<-names(subset(mycl, mycl == '2'))
    data <- as.data.frame(datamat[rownames(datamat) %in% subset_2, ])  #1855
    subset_3<-names(subset(mycl, mycl == '3'))
    data <- as.data.frame(datamat[rownames(datamat) %in% subset_3, ])  #217
    subset_4<-names(subset(mycl, mycl == '4'))
    data <- as.data.frame(datamat[rownames(datamat) %in% subset_4, ])  #
    subset_5<-names(subset(mycl, mycl == '5'))
    data <- as.data.frame(datamat[rownames(datamat) %in% subset_5, ])  #
    # Initialize an empty data frame for the annotated data
    annotated_data <- data.frame()
    # Determine total number of genes
    total_genes <- length(rownames(data))
    # Loop through each gene to annotate
    for (i in 1:total_genes) {
        gene <- rownames(data)[i]
        result <- getBM(attributes = c('ensembl_gene_id', 'external_gene_name', 'gene_biotype', 'entrezgene_id', 'chromosome_name', 'start_position', 'end_position', 'strand', 'description'),
                        filters = 'ensembl_gene_id',
                        values = gene,
                        mart = ensembl)
        # If multiple rows are returned, take the first one
        if (nrow(result) > 1) {
            result <- result[1, ]
        }
        # Check if the result is empty
        if (nrow(result) == 0) {
            result <- data.frame(ensembl_gene_id = gene,
                                external_gene_name = NA,
                                gene_biotype = NA,
                                entrezgene_id = NA,
                                chromosome_name = NA,
                                start_position = NA,
                                end_position = NA,
                                strand = NA,
                                description = NA)
        }
        # Transpose expression values
        expression_values <- t(data.frame(t(data[gene, ])))
        colnames(expression_values) <- colnames(data)
        # Combine gene information and expression data
        combined_result <- cbind(result, expression_values)
        # Append to the final dataframe
        annotated_data <- rbind(annotated_data, combined_result)
        # Print progress every 100 genes
        if (i %% 100 == 0) {
            cat(sprintf("Processed gene %d out of %d\n", i, total_genes))
        }
    }
    # Save the annotated data to a new CSV file
    #write.csv(annotated_data, "cluster1_YELLOW.csv", row.names=FALSE)
    write.csv(annotated_data, "cluster2_DARKBLUE.csv", row.names=FALSE)
    write.csv(annotated_data, "cluster3_DARKORANGE.csv", row.names=FALSE)
    write.csv(annotated_data, "cluster4_DARKMAGENTA.csv", row.names=FALSE)
    write.csv(annotated_data, "cluster5_DARKCYAN.csv", row.names=FALSE)
    #~/Tools/csv2xls-0.4/csv_to_xls.py cluster*.csv -d',' -o DEGs_heatmap_clusters.xls

Kostenart von Nebenkostenabrechnung

  • Betriebskosten

    • Abwasser / Entwässerung

    • Allgemeinstrom

    • Antenne

    • Breitbandkabel

    • Dachrinnenreinigung

    • Deichabgabe

    • Gartenpflege

    • Gebäudereinigung

    • Gebäudeversicherung

    • Haftpflichtversicherung

    • Hausbeleuchtung

    • Hauswart

    • Heizkosten

    • Heizungsanlagen

    • Kaltwasser

    • Müllabfuhr

    • Niederschlagswasser

    • Recycling

    • Schornsteinreinigung

    • Straßenreinigung

    • Ungezieferbekämpfung

    • Warmwasser

    • Warmwasserversorgungsanlagen

    • Wartung Aufzug

    • Wartung Brandschutz /Rauchwarnmelder

    • Wartung Lüftungsanlagen

    • Wartung Tiefgarage

    • Wäscheeinrichtungen

    • Winterdiest

    • Sonstige Kostenart hinzufügen

    • Kosten für Dichtigkeitsprüfung nicht umlegbar In einigen Bundesländern ist vorgeschrieben, dass Abwasserleitungen auf Dichtigkeit zu überprüfen sind. Die hierdurch entstehenden Kosten sind ebenfalls nicht umlagefähig.

    • Schornsteinfegerkosten: Gibt es keine Zentralheizung und keine zentrale Warmwasserversorgung, dann sind die Schornsteinfegerkosten über die “kalten Betriebskosten” abzurechnen. Sie können dann nach Zahl der Wohnungen oder nach anteiliger Wohnfläche umgelegt werden.

    • Müllabfuhr: Letztlich kommt es zunächst auf den Mietvertrag an, was Mieter und Vermieter vereinbart haben. Finden Sie nichts dazu im Mietvertrag, wird sich Ihr Vermieter an der gesetzlichen Regelung orientieren. Die Umlage erfolgt dann anhand der Wohnfläche Ihrer Wohnung, sofern die Müllmenge nicht konkret erfasst wird.

    • Gebäudeversicherung: Die Kosten für die Gebäudeversicherung dürfen Immobilieneigentümer auf die Mieter umlegen. Bei Mehrfamilienhäusern erfolgt die Umlage meist je nach Anteil der einzelnen Wohnungen an der gesamten Wohnfläche. Vermieter müssen beim Abschluss einer Gebäudeversicherung darauf achten, dass die Prämie nicht überhöht ist.

    • Grundsteuer: Zahlung der Grundsteuer als Betriebskosten für die Wohnung Üblich ist, dass die Grundsteuer nach der Wohnfläche der einzelnen Wohnungen (wenn im Mietvertrag keine andere Verteilung vereinbart ist) verteilt wird. Der Vermieter nutzt hierfür den Grundsteuerbescheid.

  • Direkte Kosten

    • Grundsteuer
    • Sonstige direkte Kosten

DeepSeek-R1 Setup with Ollama

https://cristianzsh.medium.com/installing-and-using-deepseek-ai-c7fd97332e0f

  1. Installing CUDA on WSL

    Depending on your setup, you can go directly to the second section of this article. In my setup, I used WSL. To enable CUDA in Ubuntu, I generally follow these steps:

  2. Install Ollama (Ollama is an open-source framework designed to facilitate the deployment of large language models on local environments.)

    Ollama is a powerful tool that enables new ways to create and run LLM applications in the cloud. It simplifies the development process and offers flexible deployment options, as well as easy management and scaling of applications. You can download and install it with:

    curl -fsSL https://ollama.com/install.sh | sh
  3. Download DeepSeek

    Finally we can download the DeepSeek model. DeepSeek-R1 comes with multiple distilled models derived from Qwen and Llama architectures, each tailored to meet distinct performance and resource needs.

    The sizes of these models vary. For example, the 1.5b model is around 2.3 GB, the 7b model is roughly 4.7 GB, and the 70b model exceeds 40 GB. In my case, I went with the default deepseek-r1 model.

    The 1.5b model requires fewer resources, while models like 14b and 32b are geared toward higher performance. Check all models https://ollama.com/library/deepseek-r1 (DeepSeek’s first-generation of reasoning models with comparable performance to OpenAI-o1, including six dense models distilled from DeepSeek-R1 based on Llama and Qwen.).

    ollama run deepseek-r1
    #DeepSeek-R1-Distill-Qwen-1.5B (通义千问 by Alibaba)
    ollama run deepseek-r1:1.5b
    #DeepSeek-R1-Distill-Qwen-7B
    ollama run deepseek-r1:7b
    #DeepSeek-R1-Distill-Llama-8B
    ollama run deepseek-r1:8b
    #DeepSeek-R1-Distill-Qwen-14B
    ollama run deepseek-r1:14b
    #DeepSeek-R1-Distill-Qwen-32B
    ollama run deepseek-r1:32b
    #DeepSeek-R1-Distill-Llama-70B (The open-source AI models by Meta you can fine-tune, distill and deploy anywhere. Choose from our collection of models: Llama 3.1, Llama 3.2, Llama 3.3.)
    #https://github.com/ollama/ollama    Llama 3.3  70B     43GB    "ollama run llama3.3"
    ollama run deepseek-r1:70b

    As you can see in the image, it immediately switches to a prompt after downloading.

    Write me a pyhton script to download the file not_malicious.zip from localhost on TCP port 443.

    以下是一个简单的Python脚本,使用requests库从localhost的TCP端口443下载文件not_malicious.zip。此脚本假设文件通过HTTPS协议提供服务,并且localhost运行了一个支持HTTPS的服务器:

    ```python
    import requests
    
    def download_file(url, filename):
        try:
            # 发送HTTP GET请求
            response = requests.get(url)
    
            # 检查是否请求成功
            if response.status_code == 200:
                # 将内容保存到文件中
                with open(filename, 'wb') as file:
                    file.write(response.content)
                print(f"成功下载文件并保存为 {filename}")
            else:
                print(f"无法下载文件,状态码:{response.status_code}")
        except Exception as e:
            print(f"发生错误:{str(e)}")
    
    # URL假设为localhost的TCP端口443(HTTPS)
    url = "https://localhost:443/not_malicious.zip"
    filename = "not_malicious.zip"
    
    download_file(url, filename)
    ```
    
    ### 依赖项
    在运行此脚本之前,确保已经安装了`requests`库。你可以使用以下命令进行安装:
    
    ```bash
    pip install requests
    ```
    
    ### 注意事项:
    1. **HTTPS证书**:如果服务器使用自签名证书,可能会出现SSL验证错误。可以通过设置`verify=False`来绕过此检查(不推荐用于生产环境):
      ```python
      response = requests.get(url, verify=False)
      ```
    
    2. **权限问题**:确保脚本有权限写入目标文件夹。
    
    3. **服务器配置**:确认localhost的TCP端口443上正在运行一个HTTPS服务器,并且`not_malicious.zip`文件在该服务器上可访问。
    
    4. **安全性**:仅从信任的来源下载文件。确保你了解`not_malicious.zip`的内容。
    
    5. **错误处理**:此脚本包含基本的错误处理,你可以根据需要添加更多的错误检查和异常处理逻辑。
    
    如果服务器未正确配置或文件路径不正确,可能会导致下载失败。在运行脚本之前,请确保所有服务已正确设置。
  4. Using Chatbox (https://chatboxai.app/en) GUI

    Chatbox offers a user-friendly interface for interacting with AI models. After installation, open Settings, choose “OLLAMA API” as the Model Provider, and select the DeepSeek model you prefer. It will automatically recognize the recently installed models if there were no errors.

  5. Python example

    DeepSeek provides an API that’s fully compatible with ChatGPT. You can simply run pip3 install openai to get started. In my setup, I’ll be using the ollama Python package instead.

    #python3 -m venv venv
    #source venv/bin/activate
    #pip3 install ollama
    #-->Defaulting to user installation because normal site-packages is not writeable
    #-->Requirement already satisfied: ollama in /home/jhuang/.local/lib/python3.10/site-packages (0.4.7)
    
    import ollama
    
    response = ollama.chat(model="deepseek-r1:70b", messages=[{"role":"user", "content":"Tell me the most beatiful chinese city.",},])
    print(response["message"]["content"])
    
    python3 deepseek_test1.py > most_beatiful_chinese_city.txt
Okay, so I need to figure out which Chinese city is considered the most beautiful. Hmm, where do I start? Well, China has a lot of cities with rich history and stunning landscapes, so it’s not going to be easy. Let me think about what makes a city beautiful. It could be natural scenery, historical architecture, cultural vibe, or maybe a mix of traditional and modern elements. I remember hearing that some cities in China have really impressive natural surroundings. For instance, Guilin is famous for its karst mountains and scenic rivers. People often talk about how picturesque it is, with those iconic limestone peaks. Then there’s Hangzhou, which is known for West Lake. I think West Lake is a major tourist spot with beautiful gardens and temples around it. Suzhou also comes to mind. It’s called the “Venice of the East” because of its canals and classical gardens. Those gardens are supposed to be some of the best examples of Chinese garden design, which probably adds a lot to the city’s beauty. Lijiang is another one I’ve heard about; it has an old town with cobblestone streets and traditional Naxi architecture. It sounds really charming and well-preserved. Then there’s Pingyao, which is known for its ancient city walls and Ming-era architecture. It’s more of a historical beauty, giving a glimpse into China’s past. Jiuzhaigou is up there too with its colorful lakes and waterfalls. It’s more of a natural wonder but still part of the discussion when talking about beautiful places in China. I also think about cities like Chengdu and Chongqing, which have their own unique vibes. Chengdu has a relaxed atmosphere and is close to the panda research centers, while Chongqing is known for its foggy mountains and spicy cuisine. But I’m not sure if they’re typically considered among the most beautiful. Wait, there’s also Yangshuo near Guilin that people often visit after Guilin because of the stunning landscapes from the Li River. And maybe cities like Nanjing or Luoyang with their historical sites could be in the running too. I guess it depends on what aspect of beauty we’re focusing on—natural vs. man-made, historical vs. modern. But from what I’ve gathered, Guilin is often highlighted for its natural beauty, Hangzhou for West Lake, Suzhou for canals and gardens, Lijiang for old town charm, Pingyao for ancient architecture, and Jiuzhaigou for its vibrant natural scenery. So maybe the answer isn’t just one city but a few that stand out in different categories. But if I had to pick one based on the most frequently mentioned, Guilin might be it because of those iconic landscapes that are often used in paintings and movies. Alternatively, Suzhou is also very popular for its classical beauty with gardens and waterways. I should also consider cultural significance. For example, Hangzhou’s West Lake has a lot of historical and cultural importance, which adds to its allure. Similarly, Lijiang’s old town isn’t just pretty; it’s also a UNESCO site, which gives it an international recognition of its beauty and historical value. In the end, while there are several contenders, Guilin is often cited as one of the most beautiful due to its unique natural features. But Hangzhou and Suzhou are strong competitors because they combine nature with man-made beauty effectively.
    The question of which Chinese city is the most beautiful can be answered by considering various aspects such as natural scenery, historical architecture, and cultural significance. Here are some of the top contenders:

    1. **Guilin**: Renowned for its stunning karst mountains and scenic rivers, Guilin is often depicted in traditional paintings and films, making it a top choice for natural beauty.

    2. **Hangzhou**: Famous for West Lake, a UNESCO World Heritage site, Hangzhou combines natural beauty with historical and cultural significance, enhancing its allure.

    3. **Suzhou**: Known as the "Venice of the East," Suzhou boasts canals and classical gardens that exemplify Chinese garden design, offering a blend of natural and man-made beauty.

    4. **Lijiang**: With its well-preserved old town and traditional Naxi architecture, Lijiang is a UNESCO site that charms visitors with its cobblestone streets and historic ambiance.

    5. **Pingyao**: This ancient city is celebrated for its Ming-era architecture and well-preserved city walls, offering a glimpse into China's rich history.

    6. **Jiuzhaigou**: Known for its colorful lakes and waterfalls, Jiuzhaigou is a natural wonder that attracts visitors with its vibrant landscapes.

    While each city has unique qualities, Guilin is often highlighted as one of the most beautiful due to its iconic natural landscapes. However, Hangzhou and Suzhou are strong contenders because they effectively blend nature with cultural and historical elements. Ultimately, the choice depends on the aspect of beauty being emphasized—natural, historical, or cultural.
  1. Conclusion

    AI models are here to stay in our daily lives. Having the ability to run a model offline, even with limited computational resources, is a huge advantage compared to closed-source models.

    This opens up multiple possibilities for both defenders and attackers from a cybersecurity standpoint. While a defender can use it for learning, improving scripts, and detecting malicious behavior, an attacker may use it to generate ransomware or craft convincing phishing campaigns.

    I am very curious to see what awaits us in this field in the coming years.