Author Archives: gene_x

TODO: mapping virus reads should use TopHat2, BWA or Bowtie2 could not detect splice events.

RNA sequencing (RNA-seq)

  • Library preparation for strand specific RNA sequencing was carried out using the NEXTflex Directional RNA-Seq Kit (Bioo Scientific) according to the manufacturer’s instructions.

  • Libraries were sequenced on the Illumina HiSeq 2500 platform.

  • To allow detection of splice events that extend over the origin, reads were mapped to two concatenated copies of the MCVSyn or MCVSyn-hpko genomes using TopHat2 v 2.0.13 [94].

  • The positions of mapped reads and junctions were subsequently collapsed back on unit-length genomes.

  • From the resulting SAM files, we counted the number of unspliced reads that extended over splice sites of junctions detected by TopHat to determine splice site efficacy and frequency of individual junctions.

  • To estimate transcript abundance, for each combination of splice junctions that mapped within either the major early or late transcription cassettes we calculated a relative strand-specific combinatorial frequency value by multiplying observed frequency values for individual donor sites.

  • The relative ratio of late to early transcripts was subsequently estimated by calculation of normalized RPKM (reads per kilobase per million mapped reads) for each of the transcripts.

  • Based on the Juliane paper, “A Comprehensive Analysis of Replicating Merkel Cell Polyomavirus Genomes Delineates the Viral Transcription Program and Suggests a Role for mcv-miR-M1 in Episomal Persistence,” particularly in the “RNA sequencing (RNA-seq)” methods section, they used strand-specific RNA sequencing. This explains why they are able to separate the coverage into blue (positive strand) and red (negative strand). The original text reads: “Library preparation for strand-specific RNA sequencing was carried out using the NEXTflex Directional RNA-Seq Kit (Bioo Scientific) according to the manufacturer’s instructions.”

  • Unfortunately, your RNA-seq data is not strand-specific. Additionally, the primary focus of your sequencing is on the human genome, while the virus is a by-product, meaning we only have a very limited number of virus reads. For example, on average, there are only about 10 reads for VP1 and VP2 per sample. For this reason, creating a plot similar to the one in Juliane’s paper, with separate coverage for positive and negative strands, is not feasible.

Deciphering S. aureus with metatranscriptomics (Marc’s project)

RNAseq analaysis using nextflow set reference as CP000255

  1. prepare reference

    mkdir raw_data; cd raw_data
    ln -s ../F24A430002261_BACwhbaP/P50_LH1+2/P50_LH1+2_1.fq.gz P50_LH1and2_R1.fq.gz
    ln -s ../F24A430002261_BACwhbaP/P50_LH1+2/P50_LH1+2_2.fq.gz P50_LH1and2_R2.fq.gz
  2. Downloading CP000255.fasta and CP000255.gff from GenBank

    mv ~/Downloads/sequence\(3\).fasta CP000255.fasta
    mv ~/Downloads/sequence\(2\).gff3 CP000255.gff
    
    MRSA: USA300
    EMRSA (CC22): CC22 strain
    MSSA: Newmann strain
    
    #https://journals.asm.org/doi/10.1128/jb.01000-07
    #Staphylococcus aureus subsp. aureus strain Newman (Taxonomy ID: 426430)
    #https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=426430
    
    #Staphylococcus aureus subsp. aureus USA300_FPR3757, complete genome
    https://www.ncbi.nlm.nih.gov/nuccore/CP000255.1
    
    #Methicillin-resistant Staphylococcus aureus CC22-MRSA-IV as an agent of dairy cow intramammary infections
    https://pubmed.ncbi.nlm.nih.gov/30473348/
    
    sed 's/\tCDS\t/\texon\t/g' CP000255.gff > CP000255_m.gff
    grep -P "\texon\t" CP000255_m.gff | sort | wc -l  #2630
    # -- DEBUG_2: combination of 'CP059040_m.gff' and 'exon' results in ERROR, using 'transcript' instead! (but there is no 'transcript' in the column 3, strange????) --
    --gff "/home/jhuang/DATA/Data_Marc_RNAseq_2024/CP000255_m.gff" --featurecounts_feature_type 'transcript'
  3. Preparing the directory trimmed

    mkdir trimmed trimmed_unpaired;
    for sample_id in P50_LH1and2; do \
        java -jar /home/jhuang/Tools/Trimmomatic-0.36/trimmomatic-0.36.jar PE -threads 100 raw_data/${sample_id}_R1.fq.gz raw_data/${sample_id}_R2.fq.gz trimmed/${sample_id}_R1.fq.gz trimmed_unpaired/${sample_id}_R1.fq.gz trimmed/${sample_id}_R2.fq.gz trimmed_unpaired/${sample_id}_R2.fq.gz ILLUMINACLIP:/home/jhuang/Tools/Trimmomatic-0.36/adapters/TruSeq3-PE-2.fa:2:30:10:8:TRUE LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36 AVGQUAL:20;
    done
  4. Preparing samplesheet.csv

    sample,fastq_1,fastq_2,strandedness
    P50_LH1and2,P50_LH1and2_R1.fq.gz,P50_LH1and2_R2.fq.gz,auto
  5. nextflow run

    mv trimmed/*.fq.gz .
    # ---- SUCCESSFUL with directly downloaded gff3 and fasta from NCBI using docker ----
    docker pull nfcore/rnaseq
    ln -s /home/jhuang/Tools/nf-core-rnaseq-3.12.0/ rnaseq
    (host_env) /usr/local/bin/nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results    --fasta "/home/jhuang/DATA/Data_Marc_RNAseq_2024/CP000255.fasta" --gff "/home/jhuang/DATA/Data_Marc_RNAseq_2024/CP000255_m.gff"        -profile docker -resume  --max_cpus 55 --max_memory 512.GB --max_time 2400.h    --save_align_intermeds --save_unaligned --save_reference    --aligner 'star_salmon'    --gtf_group_features 'gene_id'  --gtf_extra_attributes 'gene_name' --featurecounts_group_type 'gene_biotype' --featurecounts_feature_type 'transcript'
    
    # -- DEBUG_2: combination of 'CP059040_m.gff' and 'exon' results in ERROR, using 'transcript' instead!
    --gff "/home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040_m.gff" --featurecounts_feature_type 'transcript'
    
    # ---- SUCCESSFUL with directly downloaded gff3 and fasta from NCBI using docker after replacing 'CDS' with 'exon' ----
    (host_env) /usr/local/bin/nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results    --fasta "/home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040.fasta" --gff "/home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040_m.gff"        -profile docker -resume  --max_cpus 55 --max_memory 512.GB --max_time 2400.h    --save_align_intermeds --save_unaligned --save_reference    --aligner 'star_salmon'    --gtf_group_features 'gene_id'  --gtf_extra_attributes 'gene_name' --featurecounts_group_type 'gene_biotype' --featurecounts_feature_type 'transcript'
    
    # -- in results/genome/CP059040_genes.gtf -->
    #P059040.1      Genbank transcript      1       1398    .       +       .       transcript_id "gene-H0N29_00005"; gene_id "gene-H0N29_00005"; gene_name "dnaA"; Name "dnaA"; gbkey "Gene"; gene "dnaA"; gene_biotype "protein_coding"; locus_tag "H0N29_00005";
    #CP059040.1      Genbank exon    1       1398    .       +       .       transcript_id "gene-H0N29_00005"; gene_id "gene-H0N29_00005"; gene_name "dnaA"; Dbxref "NCBI_GP:QNT85048.1"; Name "QNT85048.1"; gbkey "CDS"; gene "dnaA"; inference "COORDINATES: similar to AA sequence:RefSeq:YP_004996698.1"; locus_tag "H0N29_00005"; product "chromosomal replication initiator protein DnaA"; protein_id "QNT85048.1"; transl_table "11";
    #CP059040.1      Genbank transcript      1496    2644    .       +       .       transcript_id "gene-H0N29_00010"; gene_id "gene-H0N29_00010"; Name "H0N29_00010"; gbkey "Gene"; gene_biotype "protein_coding"; locus_tag "H0N29_00010";
    #CP059040.1      Genbank exon    1496    2644    .       +       .       transcript_id "gene-H0N29_00010"; gene_id "gene-H0N29_00010"; Dbxref "NCBI_GP:QNT85049.1"; Name "QNT85049.1"; gbkey "CDS"; inference "COORDINATES: similar to AA sequence:RefSeq:WP_010116376.1"; locus_tag "H0N29_00010"; product "DNA polymerase III subunit beta"; protein_id "QNT85049.1"; transl_table "11";
    
    # -- DEBUG_3: make sure the header of fasta is the same to the *_m.gff file
    >CP000255.1 Staphylococcus aureus subsp. aureus USA300_FPR3757, complete genome

Calculate Host and Bacterial Compositions using vrap

  1. download all S epidermidis genomes and identified all ST2 isolates from them!

    Staphylococcus aureus  ID: 1280
    Staphylococcus epidermidis Taxonomy ID: 1282
    
    esearch -db nucleotide -query "txid3052345[Organism:exp]" | efetch -format fasta > virus_genome_3052345.fasta  # 23788
    
    esearch -db nucleotide -query "txid1280[Organism:exp]" | \
    efetch -format fasta -email j.huang@uke.de > genome_1280_ncbi.fasta
    esearch -db nucleotide -query "txid1282[Organism:exp]" | \
    efetch -format fasta -email j.huang@uke.de > genome_1282_ncbi.fasta
    cd /mnt/md1/DATA_md1/Data_Marc_RNAseq_2024
    python ~/Scripts/filter_fasta.py genome_1280_ncbi.fasta complete_genome_1280_ncbi.fasta  #3478, 9.2G
    python ~/Scripts/filter_fasta.py genome_1282_ncbi.fasta complete_genome_1282_ncbi.fasta  #447, 1.1G
    
    # ---- Download related genomes from ENA ----
    https://www.ebi.ac.uk/ena/browser/view/1280
    #Click "Sequence" and download "Counts" and "Taxon descendants count" if there is enough time! Downloading time points is 10.01.2025.
    https://www.ebi.ac.uk/ena/browser/view/1282
    
    #python ~/Scripts/filter_fasta.py  ena_1282_sequence_taxon_descendants_count.fasta complete_genome_1282_ena_taxon_descendants.fasta
    #S.aureus, Staphylococcus epidermidis
    python ~/Scripts/filter_fasta.py ena_1280_sequence.fasta complete_genome_1280_ena.fasta  #2187, 5.8G
    python ~/Scripts/filter_fasta.py ena_1282_sequence.fasta complete_genome_1282_ena.fasta  #243, 596M
    
    replace virus to S.aureus
  2. run vrap

    ln -s ~/Tools/vrap/ .
    conda activate /home/jhuang/miniconda3/envs/vrap
    vrap/vrap.py  -1 trimmed/P50_LH1and2_R1.fq.gz -2 trimmed/P50_LH1and2_R2.fq.gz -o P50_LH1and2 --bt2idx=/home/jhuang/REFs/genome  --host=/home/jhuang/REFs/genome.fa --virus=/home/jhuang/REFs/ncbi_dataset/data/genomic.fna --nt=/mnt/nvme0n1p1/blast/nt --nr=/mnt/nvme0n1p1/blast/nr  -t 100 -l 200  -g
    
    vrap/vrap.py  -1 trimmed/P50_LH1and2_R1.fq.gz -2 trimmed/P50_LH1and2_R2.fq.gz -o P50_LH1and2_1280 --bt2idx=/home/jhuang/REFs/genome  --host=/home/jhuang/REFs/genome.fa --virus=/home/jhuang/REFs/genomes_by_taxid/complete_genome_1280_ncbi.fasta --nt=/mnt/nvme0n1p1/blast/nt --nr=/mnt/nvme0n1p1/blast/nr  -t 100 -l 200  -g
    
    #TODO: change virus_user_db --> S.aureus_user_db
    #ref|NZ_CP075562.1| --> Staphylococcus aureus strain Sta2021010 chromosome, complete genome
    
    mv mapped mapped.sam
    samtools view -S -b mapped.sam > mapped.bam
    samtools sort mapped.bam -o mapped_sorted.bam
    samtools index mapped_sorted.bam
    samtools view -H mapped_sorted.bam
    samtools flagstat mapped_sorted.bam
    
    @PG     ID:bowtie2      PN:bowtie2      VN:2.4.4        CL:"/usr/bin/bowtie2-align-s --wrapper basic-0 -x /home/jhuang/REFs/genome -q --fast -p 100 --passthrough -1 /mnt/md1/DATA_md1/Data_Marc_RNAseq_2024/P50_LH1and2/flash/flash.notCombined_1.fastq -2 /mnt/md1/DATA_md1/Data_Marc_RNAseq_2024/P50_LH1and2/flash/flash.notCombined_2.fastq -U /mnt/md1/DATA_md1/Data_Marc_RNAseq_2024/P50_LH1and2/flash/flash.extendedFrags.fastq"
    24061654 + 0 in total (QC-passed reads + QC-failed reads)
    0 + 0 secondary
    0 + 0 supplementary
    0 + 0 duplicates
    22492206 + 0 mapped (93.48% : N/A)
    12342028 + 0 paired in sequencing
    6171014 + 0 read1
    6171014 + 0 read2
    10326710 + 0 properly paired (83.67% : N/A)
    11134274 + 0 with itself and mate mapped
    56254 + 0 singletons (0.46% : N/A)
    421598 + 0 with mate mapped to a different chr
    15942 + 0 with mate mapped to a different chr (mapQ>=5)
  3. Using the bowtie of vrap to map the reads on ref_genome/reference.fasta (The reference refers to the closest related genome found from the list generated by vrap)

    vrap/vrap.py  -1 trimmed/P50_LH1and2_R1.fq.gz -2 trimmed/P50_LH1and2_R2.fq.gz -o P50_LH1and2_on_CP015646 --host CP015646.fasta -t 100 -l 200 -g
    cd bowtie
    mv mapped mapped.sam
    samtools view -S -b mapped.sam > mapped.bam
    samtools sort mapped.bam -o mapped_sorted.bam
    samtools index mapped_sorted.bam
    samtools view -H mapped_sorted.bam
    samtools flagstat mapped_sorted.bam
    #24061651 + 0 in total (QC-passed reads + QC-failed reads)
    0 + 0 secondary
    0 + 0 supplementary
    0 + 0 duplicates
    541762 + 0 mapped (2.25% : N/A)
    12342022 + 0 paired in sequencing
    6171011 + 0 read1
    6171011 + 0 read2
    382256 + 0 properly paired (3.10% : N/A)
    388562 + 0 with itself and mate mapped
    22349 + 0 singletons (0.18% : N/A)
    0 + 0 with mate mapped to a different chr
    0 + 0 with mate mapped to a different chr (mapQ>=5)
    
    bamCoverage -b mapped_sorted.bam -o ../../P50_LH1and2_on_CP015646_reads_coverage.bw
    
    samtools depth -a mapped_sorted.bam | awk '{sum+=$3; count++} END {if (count > 0) print sum/count; else print 0}'
    #21.4043
    
    samtools depth -a mapped_sorted.bam | awk '{sum+=$3; count++} END {print sum/count}'
    samtools depth -a mapped_sorted.bam | awk -v len=2742807 '{sum+=$3} END {print sum/len}'
    #21.4043
    samtools depth mapped_sorted.bam | awk '$3 > 0 {count++} END {print count}'
    samtools depth mapped_sorted.bam | awk '$3 > 10 {count++} END {print count}'
    
    python ~/Scripts/calculate_coverage.py mapped_sorted.bam
    #Average coverage: 24.41
  4. Report

    I hope this email finds you well. Below is a summary of the analysis results based on the sequencing data:
    
        Host and Bacterial Composition:
            * The human genome accounted for 22,492,206 reads out of the total 24,061,654 reads (93.48%).
    
        Genome Assembly and Annotation:
            * After removing the human reads, we assembled the bacterial genome.
            * For annotation, we used the S. aureus user database (S.aureus_user_db), which includes 3,478 complete S. aureus genomes available in GenBank as of January 10, 2025.
            * The annotation summary is provided in the attached file assembly_summary.xlsx. The sequence CP015646.1 appears to be the closest genome for the data, as indicated in the attached table (highlighted in yellow).
    
        Mapping and Coverage:
            * Using CP015646.1 as the reference genome, 541,762 reads (2.25%) were mapped to this reference.
            * The average coverage is approximately 21x, with 951,154 positions out of 2,742,807 in the reference genome covered by at least 10 reads.
            * The attached figure xxx.png shows the coverage of reads mapped to CP015646.1.
    
    Conclusion:
    The analysis suggests that the majority of reads originate from the human genome, and the S. aureus reads are insufficient for a robust RNA-seq analysis.

Calculate the percentages of reads are human rRNA and human mRNA, see the attached review “Current concepts, advances, and challenges in deciphering the human microbiota with metatranscriptomics”.

  1. run nextflow set human as reference

    (OPTION_1, NOT_USED) Use Annotation Tools: Annotation pipelines like FeatureCounts or HTSeq can classify reads based on genomic annotations (e.g., GTF/GFF files). Use a comprehensive human genome annotation to distinguish between mRNA and rRNA regions.
    
            diff /mnt/nvme0n1p1/ref/Homo_sapiens/UCSC/hg38/Sequence/WholeGenomeFasta/genome.fa /home/jhuang/REFs/genome.fa
    
            featureCounts -a /mnt/nvme0n1p1/ref/Homo_sapiens/UCSC/hg38/gencode.v27.annotation.gtf -o counts.txt -t exon -g gene_type aligned.bam
    
    Here, gene_type in the GTF file may distinguish mRNA and rRNA features.
    
    /mnt/nvme0n1p1/ref/Homo_sapiens/UCSC/hg38/gencode.v27.annotation.gtf
    
    (OPTION_2, USED, under host_env) /usr/local/bin/nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results    --fasta "/mnt/nvme0n1p1/ref/Homo_sapiens/Ensembl/GRCh38/Sequence/WholeGenomeFasta/genome.fa" --gtf "/mnt/nvme0n1p1/ref/Homo_sapiens/Ensembl/GRCh38/Annotation/Genes/genes.gtf"   --star_index "/mnt/nvme0n1p1/ref/Homo_sapiens/Ensembl/GRCh38/Sequence/STARIndex/"    -profile docker -resume  --max_cpus 55 --max_memory 512.GB --max_time 2400.h    --save_align_intermeds --save_unaligned --save_reference    --aligner 'star_salmon'
    #--pseudo_aligner 'salmon'
    
    'GRCh38' {
        fasta       = "/mnt/nvme0n1p1/ref/Homo_sapiens/Ensembl/GRCh38/Sequence/WholeGenomeFasta/genome.fa"
        bwa         = "/mnt/nvme0n1p1/ref/Homo_sapiens/Ensembl/GRCh38/Sequence/BWAIndex/version0.6.0/"
        bowtie2     = "/mnt/nvme0n1p1/ref/Homo_sapiens/Ensembl/GRCh38/Sequence/Bowtie2Index/"
        star        = "/mnt/nvme0n1p1/ref/Homo_sapiens/Ensembl/GRCh38/Sequence/STARIndex/"
        bismark     = "/mnt/nvme0n1p1/ref/Homo_sapiens/Ensembl/GRCh38/Sequence/BismarkIndex/"
        gtf         = "/mnt/nvme0n1p1/ref/Homo_sapiens/Ensembl/GRCh38/Annotation/Genes/genes.gtf"
        bed12       = "/mnt/nvme0n1p1/ref/Homo_sapiens/Ensembl/GRCh38/Annotation/Genes/genes.bed"
        mito_name   = "chrM"
        macs_gsize  = "2.7e9"
        blacklist   = "/mnt/nvme0n1p1/ref/Homo_sapiens/UCSC/hg38/blacklists/hg38-blacklist.bed"
  2. Report

    Indeed, until now I haven’t had direct experience with analyzing RNA-seq data from metatranscriptomic samples, as my expertise has primarily been with data derived from cultured samples. Ideally, sequencing from cultured samples would be the optimal approach.
    
    However, if RNA-seq from metatranscriptomic samples becomes necessary, it would be crucial to deplete host (human) mRNA, host (human) rRNA, and microbial rRNA before sequencing (refer to the attached review PIIS0168952523001270.pdf, "Current concepts, advances, and challenges in deciphering the human microbiota with metatranscriptomics"). This step is important for enriching bacterial mRNA and could potentially enhance the quality of sequencing data.
    
    This morning, I explored the origins of reads within the human genome. Despite over 20 million reads originating from humans, only a small fraction successfully mapped to unique positions (83,500 reads). The remaining reads failed due to reasons such as Unassigned_MultiMapping (the most common issue) and Unassigned_NoFeatures. Among the 83,500 mapped reads, only 0.01% originated from rRNA. For more detailed biotype analysis, please refer to featurecounts_biotype_plot-1.png.
    
    Based on these findings, depleting human mRNA appears to be the most critical step for improvement.

Time synchronization failed

https://askubuntu.com/questions/1046214/enable-system-clock-synchronization

Steps to Enable NTP Support

  1. Check if systemd-timesyncd is Running

Ensure that the default time synchronization service is installed and active. Run:

sudo systemctl status systemd-timesyncd

If it’s inactive or not installed, enable and start it:

sudo systemctl enable systemd-timesyncd
sudo systemctl start systemd-timesyncd
  1. Install and Configure ntp (if needed)

If systemd-timesyncd is not available or you prefer using ntp, follow these steps:

Install the ntp package:

sudo apt update sudo apt install ntp

Ensure the ntp service is enabled and running:

sudo systemctl enable ntp
sudo systemctl start ntp
  1. Manually Set NTP Servers (Optional)

If you’re using systemd-timesyncd, you can configure NTP servers in /etc/systemd/timesyncd.conf. Open the file:

sudo vim /etc/systemd/timesyncd.conf

Add or edit the following lines to include your preferred NTP servers:

[Time] NTP=time.google.com 1.de.pool.ntp.org 2.de.pool.ntp.org

Save the file and restart the service:

sudo systemctl restart systemd-timesyncd

  1. Force Synchronization

After ensuring the service is running, you can force synchronization:

sudo timedatectl set-ntp false sudo timedatectl set-ntp true

  1. Verify Synchronization

Check the time status again:

timedatectl

Confirm that System clock synchronized: yes is displayed.


Steps to Resolve

  1. Check for systemd-timesyncd Installation

Run the following command to see if systemd-timesyncd is installed:

sudo systemctl status systemd-timesyncd

If the service is not found, install and enable it:

sudo apt update
sudo apt install systemd-timesyncd
sudo systemctl enable systemd-timesyncd
sudo systemctl start systemd-timesyncd
  1. Verify NTP Settings

Check if the configuration for NTP servers is correct. Open the file /etc/systemd/timesyncd.conf:

sudo nano /etc/systemd/timesyncd.conf

Ensure the [Time] section includes valid NTP servers:

[Time] NTP=time.google.com 1.de.pool.ntp.org 2.de.pool.ntp.org

Save the file and restart the service:

sudo systemctl restart systemd-timesyncd

  1. Manually Enable NTP Synchronization

If the service is installed and running but synchronization is still not active, force-enable NTP:

sudo timedatectl set-ntp true

  1. Install ntp as an Alternative (Optional)

If systemd-timesyncd is not sufficient or unavailable, you can install and use the ntp package:

sudo apt update sudo apt install ntp sudo systemctl enable ntp sudo systemctl start ntp

  1. Check Synchronization Status

Recheck the time synchronization status:

timedatectl

Look for System clock synchronized: yes.

Die 10 besten Skigebiete Deutschlands: Skifahren mit der Familie

https://www.travanto.de/urlaubsmagazin/skigebiete-deutschland-fuer-familienurlaub/#Winkl

Deutschland ist ein wahres Wintersportparadies. Travanto verrät Ihnen die 10 besten Skigebiete Deutschlands und stellt Ihnen ausgewählte Ziele vom Mittelgebirge bis zu den Hochalpen für einen Familienurlaub vor. Die Anzahl an Skigebieten in Deutschland ist erstaunlich. Erfahren Sie, welche die 10 besten Skigebiete für einen Urlaub mit der Familie sind. Nach einem ausgiebigen Tag auf den Skiern und Snowboards kehren Sie in Ihr Ferienhaus oder die Ferienwohnung zurück, da es in den eigenen vier Wänden unweit der Piste immer am gemütlichsten ist. Sie und Ihre Liebsten genießen die Gemeinsamkeit im Urlaub inmitten von Bergen und Schnee und haben Gelegenheit, die schönsten Erlebnisse des Tages miteinander zu teilen. Die Skigebiete im Überblick

  • Oberstdorf-Kleinwalsertal, Bayern
  • Garmisch-Partenkirchen, Bayern
  • Reit im Winkl, Bayern
  • Bayerischer Wald
  • Berchtesgadener, Bayern
  • Winterberg, Nordrhein-Westfalen
  • Willingen im Sauerland, Hessen
  • Feldberg, Baden-Württemberg
  • Fichtelberg, Sachsen
  • Wurmberg, Niedersachen

Identify all occurrences of Phage HH1 MT880870 in S. epidermidis ST2 genomes from public and clinical isolates

ggtree_and_gheatmap_MT880870

  1. install the bacto environment

    mamba create -n bacto -c conda-forge -c bioconda -c defaults python=3.7
    mamba activate bacto
    (bacto) mamba install -c conda-forge -c bioconda -c defaults trimmomatic fastqc shovill
    (bacto) mamba install -c conda-forge -c bioconda -c defaults mlst snippy fasttree
    (bacto) mamba install biopython
    mamba install -c conda-forge perl=5.22
    mamba install -c conda-forge -c bioconda roary
    
    #------ NEED to be debugged for install roary(ERROR) --> install roary in the host env (SUCCESSFUL) ------
    #ERROR (bacto) mamba install -c conda-forge -c bioconda -c defaults roary
    #ERROR (bacto) cpanm Bio::Roary  #DEBUG NOT WORKING ! Installing Bio::Roary failed. See /home/jhuang/.cpanm/work/1736859390.2714327/build.log for details. Retry with --force to force install it.
    #ERROR (bacto) export PERL5LIB=$CONDA_PREFIX/lib/perl5/site_perl/5.22.0/  #DEBUG SUCCESSFUL, repaired also snippy and snippy-core
    sudo apt install roary
    
    (bacto) mamba install -c conda-forge -c bioconda -c defaults raxml-ng gubbins snp-sites
    (bacto) mamba install -c bioconda iqtree
    #(bacto) mamba install -c conda-forge -c bioconda -c defaults perl-bioperl
    #(bacto) cpanm Bio::SeqIO
    #(bacto) cpanm Bio::Perl
    #(bacto) mamba install -c conda-forge -c bioconda -c defaults biopython
    #(bacto) mamba install -c conda-forge -c bioconda -c defaults snakemak
  2. alignment of complete genomes

    Alignment of Complete Genomes vs Alignment of Core Genes
    
    Alignment of Complete Genomes:
        Useful when all genomes are complete and high-quality.
        Not ideal if many genomes are incomplete or fragmented, as this can introduce noise.
    
    Alignment of Core Genes:
        Preferred when working with a mix of complete and draft genomes.
        Focuses on conserved regions, minimizing issues caused by missing or fragmented data.
    
    How to Ensure Core Gene Alignment?
    
    If your goal is to create a core gene alignment:
    
        Use a tool like Roary or Panaroo after genome annotation (e.g., with Prokka).
        Extract the core_gene_alignment.aln file for alignment, which represents the concatenated core genes shared across all genomes.
    
    Prepare all gff-files
    
            bpseudomallei taylorella schromogenes bcereus orhinotracheale aactinomycetemcomitans vcholerae_2 streptomyces spyogenes paeruginosa sgallolyticus bhenselae ecoli senterica_achtman_2 brachyspira leptospira_2 streptothermophilus pdamselae campylobacter_nonjejuni_3 ypseudotuberculosis_achtman_3 abaumannii_2 staphlugdunensis mhominis_3 chlamydiales cdifficile yruckeri bsubtilis sbsec koxytoca kingella tenacibaculum pgingivalis miowae neisseria bordetella_3 cbotulinum edwardsiella ureaplasma pacnes_3 mhyorhinis gallibacterium bbacilliformis sdysgalactiae shominis mcanis borrelia smaltophilia brachyspira_2 campylobacter_nonjejuni_8 psalmonis suberis achromobacter    saureus    mhaemolytica efaecalis cronobacter rhodococcus scanis mplutonius shewanella campylobacter_nonjejuni_7 campylobacter hparasuis listeria_2 plarvae mabscessus hcinaedi csepticum    sepidermidis    mhyopneumoniae mbovis_2 wolbachia vcholerae helicobacter cperfringens bcc ranatipestifer szooepidemicus msynoviae campylobacter_nonjejuni_9 pmultocida_2 arcobacter fpsychrophilum abaumannii bwashoensis mcaseolyticus liberibacter aeromonas cmaltaromaticum vvulnificus mpneumoniae klebsiella brachyspira_5 efaecium pfluorescens hsuis ssuis dnodosus sinorhizobium msciuri spneumoniae vibrio campylobacter_nonjejuni geotrichum lsalivarius tpallidum magalactiae shaemolyticus sagalactiae ecoli_achtman_4 sthermophilus bfragilis kaerogenes blicheniformis_14 vparahaemolyticus mflocculare mgallisepticum vtapetis ecloacae ppentosaceus manserisalpingitidis spseudintermedius brachyspira_4 brachyspira_3 leptospira_3 mcatarrhalis_achtman_6 oralstrep campylobacter_nonjejuni_6 pmultocida aphagocytophilum campylobacter_nonjejuni_2 otsutsugamushi diphtheria_3 hinfluenzae pputida brucella cfreundii mycobacteria_2 mgallisepticum_2 llactis_phage xfastidiosa campylobacter_nonjejuni_4 campylobacter_nonjejuni_5 leptospira
    
            python ~/Scripts/run_mlst.py complete_genome_1282_ncbi.fasta sepidermidis  #73
            python ~/Scripts/run_mlst.py complete_genome_1282_ena.fasta sepidermidis   #73
            python ~/Scripts/run_mlst.py complete_genome_1282_ena_taxon_descendants.fasta sepidermidis
            grep -P "\tsepidermidis\t2\t" all.mlst > ST2.mlst  #73
            grep -v "NZ_" ST2.mlst > ST2_.mlst
            cut -d'.' -f1 ST2_.mlst | sort > ST2.ids
            cut -d'|' -f2 ST2.mlst | sort > ST2.ids
            diff ST2.ids ../ena_descendants_all_ST2_mlst/ST2.ids
    
            for sample in AP029227 CP013943 CP040883 CP045648 CP052939 CP052940 CP052955 CP052959 CP052981 CP052984 CP052990 CP052994 CP053088 CP064461 CP064467 CP064473 CP064476 CP064480 CP064485 CP064499 CP064508 CP064516 CP064525 CP064554 CP064557 CP064560 CP064572 CP064574 CP064587 CP064590 CP064599 CP064603 CP064606 CP064609 CP064613 CP064616 CP064619 CP064624 CP064631 CP064637 CP064650 CP073821 CP073824 CP073827 CP073830 CP073835 CP073836 CP073840 CP073841 CP073844 CP073847 CP073850 CP073852 CP073855 CP073857 CP073859 CP073862 CP073863 CP073878 CP073900 CP073904 CP084008 CP093173 CP093179 CP093181 CP093193 CP093196 CP093198 CP093212 CP097512 CP097514 CP120425 CP133663; do
                cp ../genomes_by_taxid/ncbi_all_ST2_mlst/${sample}.1.fasta ./
            done
    
            cd typing_complete.csv
            grep -P "\tsepidermidis\t2\t" typing_complete.csv > typing_ST2.csv
    
            echo -e "AP029227\nCP013943\nCP040883\nCP045648\nCP052939\nCP052940\nCP052955\nCP052959\nCP052981\nCP052984\nCP052990\nCP052994\nCP053088\nCP064461\nCP064467\nCP064473\nCP064476\nCP064480\nCP064485\nCP064499\nCP064508\nCP064516\nCP064525\nCP064554\nCP064557\nCP064560\nCP064572\nCP064574\nCP064587\nCP064590\nCP064599\nCP064603\nCP064606\nCP064609\nCP064613\nCP064616\nCP064619\nCP064624\nCP064631\nCP064637\nCP064650\nCP073821\nCP073824\nCP073827\nCP073830\nCP073835\nCP073836\nCP073840\nCP073841\nCP073844\nCP073847\nCP073850\nCP073852\nCP073855\nCP073857\nCP073859\nCP073862\nCP073863\nCP073878\nCP073900\nCP073904\nCP084008\nCP093173\nCP093179\nCP093181\nCP093193\nCP093196\nCP093198\nCP093212\nCP097512\nCP097514\nCP120425\nCP133663" > ids.txt
    
            cat ids.txt | while read id; do
                efetch -db nuccore -id $id -format gff3 > "${id}.gff"
                efetch -db nuccore -id $id -format fasta > "${id}.fasta"
                # Append the FASTA sequence to the GFF3 file with ##FASTA header
                echo "##FASTA" >> "${id}.gff"
                cat "${id}.fasta" >> "${id}.gff"
                rm "${id}.fasta"  # Optionally remove the separate FASTA file
            done
    
            for sample in mibi1440 mibi1454 mibi1456 mibi1459 mibi1461 mibi1478 mibi1480 mibi1485 mibi1487 mibi1496 mibi1498 mibi1502 mibi2318 mibi2319; do
                cp ../prokka/${sample}/${sample}.gff ./
            done
    
    Example with Roary:
    
            roary -f roary_ST2 -e --mafft -p 100 roary_input_ST2/*.gff
    
            #roary -p 15 -f ./roary -i 95 -cd 99 -s -e -n -v roary_.fa.gb.gff CP012351.gff3 CP003084.gff3 CP023676.gff3
            -i        minimum percentage identity for blastp [95]
            -cd FLOAT percentage of isolates a gene must be in to be core [99]
            -f STR    output directory [.]
            -e        create a multiFASTA alignment of core genes using PRANK
            -n (==--mafft)       fast core gene alignment with MAFFT, use with -e
    
            -s        dont split paralogs
            -v        verbose output to STDOUT
    
    The file core_gene_alignment.aln can then be used for tree construction:
    
            iqtree2 -s core_gene_alignment.aln -m MFP -bb 1000 -nt AUTO
            #Alignment was printed to core_gene_alignment.aln.uniqueseq.phy
            #For your convenience alignment with unique sequences printed to core_gene_alignment.aln.uniqueseq.phy
            #Create initial parsimony tree by phylogenetic likelihood library (PLL)... 0.030 seconds
            #Measuring multi-threading efficiency up to 128 CPU cores
            #Increase to 10 rounds for branch lengths
    
    By focusing on core genes, you ensure that all genomes contribute meaningful data to the phylogenetic analysis.
    
    The file that can be viewed as the phylogenetic tree among the generated files is:
    
        core_gene_alignment.aln.treefile
    
    This file typically contains the tree in a Newick format, which is a widely used format for representing phylogenetic trees.
  3. Public Database and Clinical Isolates

    To differentiate the two groups of genomes effectively while maintaining clarity and professionalism, you can use category names that reflect their source or origin. Here are some suggestions:
    Category Name Suggestions
    
        Source-based Names:
            Public Database and Clinical Isolates
            Public Genomes and Clinical Genomes
            Reference Genomes and Patient-derived Genomes
    
        Origin-based Names:
            External Sources and In-house Sequencing
            Global Genomes and Local Isolates
            Database Genomes and Clinical Strains
    
        Research Context Names:
            Published Genomes and Newly Sequenced Genomes
            Literature Genomes and Clinical Samples
            Archived Data and Current Study
    
        Abbreviated or Code-like Names:
            PUB and CLI (Public and Clinical)
            EXDB (External Database) and INHS (In-House Sequencing)
            PG (Public Genomes) and CI (Clinical Isolates)
    
    Tips for Choosing the Best Category Names
    
        Clarity: Ensure that the names are intuitive and easy for others to understand.
        Neutrality: Avoid names that might imply bias or inferiority (e.g., avoid "old" vs. "new").
        Context Appropriateness: Consider your audience. For a clinical study, terms like "Clinical Isolates" may resonate more, while "Public Database" works better for bioinformatics research.
    
    For instance:
    
        "Database" and "Clinical" could work well in a manuscript.
        "External" and "In-house" might be better for internal project tracking.
  4. copy contigs of own clinical isolates

    #extract all genes from the phages set as the column names, set the genome names from the public database and own isolates as row names.
    
    for sample in mibi1440 mibi1454 mibi1456 mibi1459 mibi1461 mibi1478 mibi1480 mibi1485 mibi1487 mibi1496 mibi1498 mibi1502 mibi2318 mibi2319; do
      cp ../shovill/${sample}/contigs.fa ${sample}.fasta
    done
  5. makeblastdb

    #MT880870_on_mibi1435.blastn: 2366
    #MT880870_on_mibi1436.blastn: 3198
    #MT880870_on_mibi1437.blastn: 3198
    
    # -lenth(MT880870)=34053
    rename 's/\.1\.fasta$/.fasta/' *.1.fasta
    for sample in mibi1440 mibi1454 mibi1456 mibi1459 mibi1461 mibi1478 mibi1480 mibi1485 mibi1487 mibi1496 mibi1498 mibi1502 mibi2318 mibi2319 AP029227 CP013943 CP040883 CP045648 CP052939 CP052940 CP052955 CP052959 CP052981 CP052984 CP052990 CP052994 CP053088 CP064461 CP064467 CP064473 CP064476 CP064480 CP064485 CP064499 CP064508 CP064516 CP064525 CP064554 CP064557 CP064560 CP064572 CP064574 CP064587 CP064590 CP064599 CP064603 CP064606 CP064609 CP064613 CP064616 CP064619 CP064624 CP064631 CP064637 CP064650 CP073821 CP073824 CP073827 CP073830 CP073835 CP073836 CP073840 CP073841 CP073844 CP073847 CP073850 CP073852 CP073855 CP073857 CP073859 CP073862 CP073863 CP073878 CP073900 CP073904 CP084008 CP093173 CP093179 CP093181 CP093193 CP093196 CP093198 CP093212 CP097512 CP097514 CP120425 CP133663; do
        makeblastdb -in ${sample}.fasta -dbtype nucl
    done
  6. prepare typing_ST2_until_QPB07622.csv

    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07572.1_1" > QPB07572.fasta
    mkdir QPB07572
    for sample in mibi1440 mibi1454 mibi1456 mibi1459 mibi1461 mibi1478 mibi1480 mibi1485 mibi1487 mibi1496 mibi1498 mibi1502 mibi2318 mibi2319 AP029227 CP013943 CP040883 CP045648 CP052939 CP052940 CP052955 CP052959 CP052981 CP052984 CP052990 CP052994 CP053088 CP064461 CP064467 CP064473 CP064476 CP064480 CP064485 CP064499 CP064508 CP064516 CP064525 CP064554 CP064557 CP064560 CP064572 CP064574 CP064587 CP064590 CP064599 CP064603 CP064606 CP064609 CP064613 CP064616 CP064619 CP064624 CP064631 CP064637 CP064650 CP073821 CP073824 CP073827 CP073830 CP073835 CP073836 CP073840 CP073841 CP073844 CP073847 CP073850 CP073852 CP073855 CP073857 CP073859 CP073862 CP073863 CP073878 CP073900 CP073904 CP084008 CP093173 CP093179 CP093181 CP093193 CP093196 CP093198 CP093212 CP097512 CP097514 CP120425 CP133663; do
        blastn -db ${sample}.fasta -query QPB07572.fasta -evalue 1e-50 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 > ./QPB07572/${sample}.blastn
    done
    python ~/Scripts/process_directory.py /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/presence_absence_ST2/QPB07572 /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/presence_absence_ST2/typing_ST2.csv
    
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07573.1_2" > QPB07573.fasta
    mkdir QPB07573
    for sample in mibi1440 mibi1454 mibi1456 mibi1459 mibi1461 mibi1478 mibi1480 mibi1485 mibi1487 mibi1496 mibi1498 mibi1502 mibi2318 mibi2319 AP029227 CP013943 CP040883 CP045648 CP052939 CP052940 CP052955 CP052959 CP052981 CP052984 CP052990 CP052994 CP053088 CP064461 CP064467 CP064473 CP064476 CP064480 CP064485 CP064499 CP064508 CP064516 CP064525 CP064554 CP064557 CP064560 CP064572 CP064574 CP064587 CP064590 CP064599 CP064603 CP064606 CP064609 CP064613 CP064616 CP064619 CP064624 CP064631 CP064637 CP064650 CP073821 CP073824 CP073827 CP073830 CP073835 CP073836 CP073840 CP073841 CP073844 CP073847 CP073850 CP073852 CP073855 CP073857 CP073859 CP073862 CP073863 CP073878 CP073900 CP073904 CP084008 CP093173 CP093179 CP093181 CP093193 CP093196 CP093198 CP093212 CP097512 CP097514 CP120425 CP133663; do
        blastn -db ${sample}.fasta -query QPB07573.fasta -evalue 1e-50 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 > ./QPB07573/${sample}.blastn
    done
    python ~/Scripts/process_directory.py /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/presence_absence_ST2/QPB07573 /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/presence_absence_ST2/updated_typing_ST2.csv
    
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07574.1_3" > QPB07574.fasta
    mkdir QPB07574
    for sample in mibi1440 mibi1454 mibi1456 mibi1459 mibi1461 mibi1478 mibi1480 mibi1485 mibi1487 mibi1496 mibi1498 mibi1502 mibi2318 mibi2319 AP029227 CP013943 CP040883 CP045648 CP052939 CP052940 CP052955 CP052959 CP052981 CP052984 CP052990 CP052994 CP053088 CP064461 CP064467 CP064473 CP064476 CP064480 CP064485 CP064499 CP064508 CP064516 CP064525 CP064554 CP064557 CP064560 CP064572 CP064574 CP064587 CP064590 CP064599 CP064603 CP064606 CP064609 CP064613 CP064616 CP064619 CP064624 CP064631 CP064637 CP064650 CP073821 CP073824 CP073827 CP073830 CP073835 CP073836 CP073840 CP073841 CP073844 CP073847 CP073850 CP073852 CP073855 CP073857 CP073859 CP073862 CP073863 CP073878 CP073900 CP073904 CP084008 CP093173 CP093179 CP093181 CP093193 CP093196 CP093198 CP093212 CP097512 CP097514 CP120425 CP133663; do
        blastn -db ${sample}.fasta -query QPB07574.fasta -evalue 1e-50 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 > ./QPB07574/${sample}.blastn
    done
    python ~/Scripts/process_directory.py /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/presence_absence_ST2/QPB07574 /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/presence_absence_ST2/updated_updated_typing_ST2.csv
    
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07575.1_4" > QPB07575.fasta
    mkdir QPB07575
    for sample in mibi1440 mibi1454 mibi1456 mibi1459 mibi1461 mibi1478 mibi1480 mibi1485 mibi1487 mibi1496 mibi1498 mibi1502 mibi2318 mibi2319 AP029227 CP013943 CP040883 CP045648 CP052939 CP052940 CP052955 CP052959 CP052981 CP052984 CP052990 CP052994 CP053088 CP064461 CP064467 CP064473 CP064476 CP064480 CP064485 CP064499 CP064508 CP064516 CP064525 CP064554 CP064557 CP064560 CP064572 CP064574 CP064587 CP064590 CP064599 CP064603 CP064606 CP064609 CP064613 CP064616 CP064619 CP064624 CP064631 CP064637 CP064650 CP073821 CP073824 CP073827 CP073830 CP073835 CP073836 CP073840 CP073841 CP073844 CP073847 CP073850 CP073852 CP073855 CP073857 CP073859 CP073862 CP073863 CP073878 CP073900 CP073904 CP084008 CP093173 CP093179 CP093181 CP093193 CP093196 CP093198 CP093212 CP097512 CP097514 CP120425 CP133663; do
        blastn -db ${sample}.fasta -query QPB07575.fasta -evalue 1e-50 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 > ./QPB07575/${sample}.blastn
    done
    python ~/Scripts/process_directory.py /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/presence_absence_ST2/QPB07575 /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/presence_absence_ST2/updated_updated_updated_typing_ST2.csv
    
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07576.1_5" > QPB07576.fasta
    mkdir QPB07576
    for sample in mibi1440 mibi1454 mibi1456 mibi1459 mibi1461 mibi1478 mibi1480 mibi1485 mibi1487 mibi1496 mibi1498 mibi1502 mibi2318 mibi2319 AP029227 CP013943 CP040883 CP045648 CP052939 CP052940 CP052955 CP052959 CP052981 CP052984 CP052990 CP052994 CP053088 CP064461 CP064467 CP064473 CP064476 CP064480 CP064485 CP064499 CP064508 CP064516 CP064525 CP064554 CP064557 CP064560 CP064572 CP064574 CP064587 CP064590 CP064599 CP064603 CP064606 CP064609 CP064613 CP064616 CP064619 CP064624 CP064631 CP064637 CP064650 CP073821 CP073824 CP073827 CP073830 CP073835 CP073836 CP073840 CP073841 CP073844 CP073847 CP073850 CP073852 CP073855 CP073857 CP073859 CP073862 CP073863 CP073878 CP073900 CP073904 CP084008 CP093173 CP093179 CP093181 CP093193 CP093196 CP093198 CP093212 CP097512 CP097514 CP120425 CP133663; do
        blastn -db ${sample}.fasta -query QPB07576.fasta -evalue 1e-50 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 > ./QPB07576/${sample}.blastn
    done
    python ~/Scripts/process_directory.py /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/presence_absence_ST2/QPB07576 /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/presence_absence_ST2/updated_updated_updated_updated_typing_ST2.csv
    
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07577.1_6" > QPB07577.fasta
    mkdir QPB07577
    for sample in mibi1440 mibi1454 mibi1456 mibi1459 mibi1461 mibi1478 mibi1480 mibi1485 mibi1487 mibi1496 mibi1498 mibi1502 mibi2318 mibi2319 AP029227 CP013943 CP040883 CP045648 CP052939 CP052940 CP052955 CP052959 CP052981 CP052984 CP052990 CP052994 CP053088 CP064461 CP064467 CP064473 CP064476 CP064480 CP064485 CP064499 CP064508 CP064516 CP064525 CP064554 CP064557 CP064560 CP064572 CP064574 CP064587 CP064590 CP064599 CP064603 CP064606 CP064609 CP064613 CP064616 CP064619 CP064624 CP064631 CP064637 CP064650 CP073821 CP073824 CP073827 CP073830 CP073835 CP073836 CP073840 CP073841 CP073844 CP073847 CP073850 CP073852 CP073855 CP073857 CP073859 CP073862 CP073863 CP073878 CP073900 CP073904 CP084008 CP093173 CP093179 CP093181 CP093193 CP093196 CP093198 CP093212 CP097512 CP097514 CP120425 CP133663; do
        blastn -db ${sample}.fasta -query QPB07577.fasta -evalue 1e-50 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 > ./QPB07577/${sample}.blastn
    done
    python ~/Scripts/process_directory.py /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/presence_absence_ST2/QPB07577 /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/presence_absence_ST2/updated_updated_updated_updated_updated_typing_ST2.csv
    
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07578.1_7" > QPB07578.fasta
    mkdir QPB07578
    for sample in mibi1440 mibi1454 mibi1456 mibi1459 mibi1461 mibi1478 mibi1480 mibi1485 mibi1487 mibi1496 mibi1498 mibi1502 mibi2318 mibi2319 AP029227 CP013943 CP040883 CP045648 CP052939 CP052940 CP052955 CP052959 CP052981 CP052984 CP052990 CP052994 CP053088 CP064461 CP064467 CP064473 CP064476 CP064480 CP064485 CP064499 CP064508 CP064516 CP064525 CP064554 CP064557 CP064560 CP064572 CP064574 CP064587 CP064590 CP064599 CP064603 CP064606 CP064609 CP064613 CP064616 CP064619 CP064624 CP064631 CP064637 CP064650 CP073821 CP073824 CP073827 CP073830 CP073835 CP073836 CP073840 CP073841 CP073844 CP073847 CP073850 CP073852 CP073855 CP073857 CP073859 CP073862 CP073863 CP073878 CP073900 CP073904 CP084008 CP093173 CP093179 CP093181 CP093193 CP093196 CP093198 CP093212 CP097512 CP097514 CP120425 CP133663; do
        blastn -db ${sample}.fasta -query QPB07578.fasta -evalue 1e-50 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 > ./QPB07578/${sample}.blastn
    done
    python ~/Scripts/process_directory.py /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/presence_absence_ST2/QPB07578 /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/presence_absence_ST2/updated_updated_updated_updated_updated_updated_typing_ST2.csv
    
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07579.1_8" > QPB07579.fasta
    mkdir QPB07579
    for sample in mibi1440 mibi1454 mibi1456 mibi1459 mibi1461 mibi1478 mibi1480 mibi1485 mibi1487 mibi1496 mibi1498 mibi1502 mibi2318 mibi2319 AP029227 CP013943 CP040883 CP045648 CP052939 CP052940 CP052955 CP052959 CP052981 CP052984 CP052990 CP052994 CP053088 CP064461 CP064467 CP064473 CP064476 CP064480 CP064485 CP064499 CP064508 CP064516 CP064525 CP064554 CP064557 CP064560 CP064572 CP064574 CP064587 CP064590 CP064599 CP064603 CP064606 CP064609 CP064613 CP064616 CP064619 CP064624 CP064631 CP064637 CP064650 CP073821 CP073824 CP073827 CP073830 CP073835 CP073836 CP073840 CP073841 CP073844 CP073847 CP073850 CP073852 CP073855 CP073857 CP073859 CP073862 CP073863 CP073878 CP073900 CP073904 CP084008 CP093173 CP093179 CP093181 CP093193 CP093196 CP093198 CP093212 CP097512 CP097514 CP120425 CP133663; do
        blastn -db ${sample}.fasta -query QPB07579.fasta -evalue 1e-50 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 > ./QPB07579/${sample}.blastn
    done
    python ~/Scripts/process_directory.py /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/presence_absence_ST2/QPB07579 /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/presence_absence_ST2/updated_updated_updated_updated_updated_updated_updated_typing_ST2.csv
    
    grep ">" MT880870.fasta > temp
    cut -d' ' -f1 temp > temp1
    
    #    region="lcl|MT880870.1_cds_${id}.1_$((i - 78))"  # Adjust index calculation for region
    #    echo "samtools faidx MT880870.fasta \"$region\" > ${id}.fasta"
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07580.1_9" > QPB07580.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07581.1_10" > QPB07581.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07582.1_11" > QPB07582.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07583.1_12" > QPB07583.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07584.1_13" > QPB07584.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07585.1_14" > QPB07585.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07586.1_15" > QPB07586.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07587.1_16" > QPB07587.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07588.1_17" > QPB07588.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07589.1_18" > QPB07589.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07590.1_19" > QPB07590.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07591.1_20" > QPB07591.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07592.1_21" > QPB07592.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07593.1_22" > QPB07593.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07594.1_23" > QPB07594.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07595.1_24" > QPB07595.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07596.1_25" > QPB07596.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07597.1_26" > QPB07597.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07598.1_27" > QPB07598.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07599.1_28" > QPB07599.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07600.1_29" > QPB07600.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07601.1_30" > QPB07601.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07602.1_31" > QPB07602.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07603.1_32" > QPB07603.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07604.1_33" > QPB07604.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07605.1_34" > QPB07605.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07606.1_35" > QPB07606.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07607.1_36" > QPB07607.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07608.1_37" > QPB07608.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07609.1_38" > QPB07609.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07610.1_39" > QPB07610.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07611.1_40" > QPB07611.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07612.1_41" > QPB07612.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07613.1_42" > QPB07613.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07614.1_43" > QPB07614.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07615.1_44" > QPB07615.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07616.1_45" > QPB07616.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07617.1_46" > QPB07617.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07618.1_47" > QPB07618.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07619.1_48" > QPB07619.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07620.1_49" > QPB07620.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07621.1_50" > QPB07621.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07622.1_51" > QPB07622.fasta
    
    for i in {580..580}; do
        id=$(printf "QPB07%03d" "$i")
        echo "mkdir ${id}"
        echo "for sample in mibi1440 mibi1454 mibi1456 mibi1459 mibi1461 mibi1478 mibi1480 mibi1485 mibi1487 mibi1496 mibi1498 mibi1502 mibi2318 mibi2319 AP029227 CP013943 CP040883 CP045648 CP052939 CP052940 CP052955 CP052959 CP052981 CP052984 CP052990 CP052994 CP053088 CP064461 CP064467 CP064473 CP064476 CP064480 CP064485 CP064499 CP064508 CP064516 CP064525 CP064554 CP064557 CP064560 CP064572 CP064574 CP064587 CP064590 CP064599 CP064603 CP064606 CP064609 CP064613 CP064616 CP064619 CP064624 CP064631 CP064637 CP064650 CP073821 CP073824 CP073827 CP073830 CP073835 CP073836 CP073840 CP073841 CP073844 CP073847 CP073850 CP073852 CP073855 CP073857 CP073859 CP073862 CP073863 CP073878 CP073900 CP073904 CP084008 CP093173 CP093179 CP093181 CP093193 CP093196 CP093198 CP093212 CP097512 CP097514 CP120425 CP133663; do"
            echo "blastn -db \${sample}.fasta -query ${id}.fasta -evalue 1e-50 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 > ./${id}/\${sample}.blastn"
        echo "done"
    done
    python ~/Scripts/process_directory.py /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/presence_absence_ST2/QPB07580 /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/presence_absence_ST2/updated_updated_updated_updated_updated_updated_updated_updated_typing_ST2.csv /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/presence_absence_ST2/typing_ST2_until_QPB07580.csv
    
    for i in {581..622}; do
        id_1=$(printf "QPB07%03d" "$((i-1))")
        id=$(printf "QPB07%03d" "$i")
        echo "mkdir ${id}"
        echo "for sample in mibi1440 mibi1454 mibi1456 mibi1459 mibi1461 mibi1478 mibi1480 mibi1485 mibi1487 mibi1496 mibi1498 mibi1502 mibi2318 mibi2319 AP029227 CP013943 CP040883 CP045648 CP052939 CP052940 CP052955 CP052959 CP052981 CP052984 CP052990 CP052994 CP053088 CP064461 CP064467 CP064473 CP064476 CP064480 CP064485 CP064499 CP064508 CP064516 CP064525 CP064554 CP064557 CP064560 CP064572 CP064574 CP064587 CP064590 CP064599 CP064603 CP064606 CP064609 CP064613 CP064616 CP064619 CP064624 CP064631 CP064637 CP064650 CP073821 CP073824 CP073827 CP073830 CP073835 CP073836 CP073840 CP073841 CP073844 CP073847 CP073850 CP073852 CP073855 CP073857 CP073859 CP073862 CP073863 CP073878 CP073900 CP073904 CP084008 CP093173 CP093179 CP093181 CP093193 CP093196 CP093198 CP093212 CP097512 CP097514 CP120425 CP133663; do"
            echo "blastn -db \${sample}.fasta -query ${id}.fasta -evalue 1e-50 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 > ./${id}/\${sample}.blastn"
        echo "done"
        echo "python ~/Scripts/process_directory.py /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/presence_absence_ST2/${id} /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/presence_absence_ST2/typing_ST2_until_${id_1}.csv /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/presence_absence_ST2/typing_ST2_until_${id}.csv"
    done
    
    cp typing_ST2_until_QPB07622.csv ../plotTreeHeatmap_ST2/typing_ST2.csv
  7. plot tree heatmap

    library(ggtree)
    library(ggplot2)
    library(dplyr)
    setwd("/home/jhuang/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_ST2/")
    # -- edit tree --
    #https://icytree.org/
    #0.000780
    info <- read.csv("typing_ST2.csv", sep="\t")
    info$name <- info$Isolate
    tree <- read.tree("core_gene_alignment.aln.treefile")
    cols <- c("Public database"='purple2', "Clinical isolate"='darksalmon')  #purple2  skyblue2
    
    heatmapData2 <- info %>% select(Isolate, QPB07572, QPB07573, QPB07574, QPB07575, QPB07576, QPB07577, QPB07578, QPB07579, QPB07580, QPB07581, QPB07582, QPB07583, QPB07584, QPB07585, QPB07586, QPB07587, QPB07588, QPB07589, QPB07590, QPB07591, QPB07592, QPB07593, QPB07594, QPB07595, QPB07596, QPB07597, QPB07598, QPB07599, QPB07600, QPB07601, QPB07602, QPB07603, QPB07604, QPB07605, QPB07606, QPB07607, QPB07608, QPB07609, QPB07610, QPB07611, QPB07612, QPB07613, QPB07614, QPB07615, QPB07616, QPB07617, QPB07618, QPB07619, QPB07620, QPB07621, QPB07622)  #ST,
    rn <- heatmapData2$Isolate
    heatmapData2$Isolate <- NULL
    heatmapData2 <- as.data.frame(sapply(heatmapData2, as.character))
    rownames(heatmapData2) <- rn
    
    heatmap.colours <- c("darkgreen","grey")
    names(heatmap.colours) <- c("+","-")
    
    #mydat$Regulation <- factor(mydat$Regulation, levels=c("up","down"))
    #circular
    p <- ggtree(tree, layout='circular', branch.length='none') %<+% info + geom_tippoint(aes(color=Source)) + scale_color_manual(values=cols) + geom_tiplab2(aes(label=name), offset=1)
    
    png("ggtree.png", width=1260, height=1260)
    #svg("ggtree.svg", width=1260, height=1260)
    p
    dev.off()
    
    #png("ggtree_and_gheatmap.png", width=1290, height=1000)
    #svg("ggtree_and_gheatmap.svg", width=17, height=15)
    #gheatmap(p, heatmapData2, width=0.1, colnames_position="top", colnames_angle=90, colnames_offset_y = 0.1, hjust=0.5, font.size=4, offset = 5) + scale_fill_manual(values=heatmap.colours) +  theme(legend.text = element_text(size = 14)) + theme(legend.title = element_text(size = 14)) + guides(fill=guide_legend(title=""), color = guide_legend(override.aes = list(size = 5)))
    
    png("ggtree_and_gheatmap.png", width=1290, height=1000)
    gheatmap(p, heatmapData2, width=0.5, colnames_position="top", colnames_angle=90, colnames_offset_y = 0.1, hjust=0.5, font.size=0, offset = 8) + scale_fill_manual(values=heatmap.colours) +  theme(legend.text = element_text(size = 16)) + theme(legend.title = element_text(size = 16)) + guides(fill=guide_legend(title=""), color = guide_legend(override.aes = list(size = 5)))
    dev.off()

Identify all occurrences of Phages MT880870, MT880871 and MT880872 in S. epidermidis ST2 genomes from public and clinical isolates

ggtree_and_gheatmap_ST2_phages

  1. prepare typing_ST2_until_QPB07622.csv under ~/DATA/Data_Luise_Sepi_STKN/presence_absence_ST2

    #grep ">" MT880870.fasta | cut -d' ' -f1 > headers.txt
    # -- under presence_absence_ST2 --
    samtools faidx MT880870.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07572.1_1" > QPB07572.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07573.1_2" > QPB07573.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07574.1_3" > QPB07574.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07575.1_4" > QPB07575.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07576.1_5" > QPB07576.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07577.1_6" > QPB07577.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07578.1_7" > QPB07578.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07579.1_8" > QPB07579.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07580.1_9" > QPB07580.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07581.1_10" > QPB07581.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07582.1_11" > QPB07582.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07583.1_12" > QPB07583.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07584.1_13" > QPB07584.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07585.1_14" > QPB07585.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07586.1_15" > QPB07586.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07587.1_16" > QPB07587.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07588.1_17" > QPB07588.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07589.1_18" > QPB07589.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07590.1_19" > QPB07590.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07591.1_20" > QPB07591.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07592.1_21" > QPB07592.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07593.1_22" > QPB07593.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07594.1_23" > QPB07594.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07595.1_24" > QPB07595.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07596.1_25" > QPB07596.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07597.1_26" > QPB07597.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07598.1_27" > QPB07598.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07599.1_28" > QPB07599.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07600.1_29" > QPB07600.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07601.1_30" > QPB07601.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07602.1_31" > QPB07602.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07603.1_32" > QPB07603.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07604.1_33" > QPB07604.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07605.1_34" > QPB07605.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07606.1_35" > QPB07606.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07607.1_36" > QPB07607.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07608.1_37" > QPB07608.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07609.1_38" > QPB07609.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07610.1_39" > QPB07610.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07611.1_40" > QPB07611.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07612.1_41" > QPB07612.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07613.1_42" > QPB07613.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07614.1_43" > QPB07614.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07615.1_44" > QPB07615.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07616.1_45" > QPB07616.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07617.1_46" > QPB07617.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07618.1_47" > QPB07618.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07619.1_48" > QPB07619.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07620.1_49" > QPB07620.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07621.1_50" > QPB07621.fasta
    samtools faidx MT880870.fasta "lcl|MT880870.1_cds_QPB07622.1_51" > QPB07622.fasta
    
    for i in {572..622}; do
        id_1=$(printf "QPB07%03d" "$((i-1))")
        id=$(printf "QPB07%03d" "$i")
        echo "mkdir ${id}"
        echo "for sample in mibi1440 mibi1454 mibi1456 mibi1459 mibi1461 mibi1478 mibi1480 mibi1485 mibi1487 mibi1496 mibi1498 mibi1502 mibi2318 mibi2319 AP029227 CP013943 CP040883 CP045648 CP052939 CP052940 CP052955 CP052959 CP052981 CP052984 CP052990 CP052994 CP053088 CP064461 CP064467 CP064473 CP064476 CP064480 CP064485 CP064499 CP064508 CP064516 CP064525 CP064554 CP064557 CP064560 CP064572 CP064574 CP064587 CP064590 CP064599 CP064603 CP064606 CP064609 CP064613 CP064616 CP064619 CP064624 CP064631 CP064637 CP064650 CP073821 CP073824 CP073827 CP073830 CP073835 CP073836 CP073840 CP073841 CP073844 CP073847 CP073850 CP073852 CP073855 CP073857 CP073859 CP073862 CP073863 CP073878 CP073900 CP073904 CP084008 CP093173 CP093179 CP093181 CP093193 CP093196 CP093198 CP093212 CP097512 CP097514 CP120425 CP133663; do"
            echo "blastn -db \${sample}.fasta -query ${id}.fasta -evalue 1e-50 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 > ./${id}/\${sample}.blastn"
        echo "done"
    
    #under plotTreeHeatmap_ST2
    for i in {572..622}; do
        echo "python ~/Scripts/process_directory.py /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/presence_absence_ST2/${id} /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/plotTreeHeatmap_ST2/typing_ST2_until_${id_1}.csv /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/plotTreeHeatmap_ST2/typing_ST2_until_${id}.csv"
    done
    
    #reprace all '+' with 'MT880870' in typing_ST2_until_QPB07622.csv
    cp ../presence_absence_ST2/typing_ST2_until_QPB07622.csv ./
    sed -i 's/+/MT880870/g' typing_ST2_until_QPB07622.csv
  2. prepare typing_ST2_until_QPB07667.csv under ~/DATA/Data_Luise_Sepi_STKN/presence_absence_ST2

    # -- under presence_absence_ST2 --
    samtools faidx MT880871.fasta
    #grep ">" MT880871.fasta | cut -d' ' -f1 > headers.txt
    samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07623.1_1" > QPB07623.fasta
    samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07624.1_2" > QPB07624.fasta
    samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07625.1_3" > QPB07625.fasta
    samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07626.1_4" > QPB07626.fasta
    samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07627.1_5" > QPB07627.fasta
    samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07628.1_6" > QPB07628.fasta
    samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07629.1_7" > QPB07629.fasta
    samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07630.1_8" > QPB07630.fasta
    samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07631.1_9" > QPB07631.fasta
    samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07632.1_10" > QPB07632.fasta
    samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07633.1_11" > QPB07633.fasta
    samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07634.1_12" > QPB07634.fasta
    samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07635.1_13" > QPB07635.fasta
    samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07636.1_14" > QPB07636.fasta
    samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07637.1_15" > QPB07637.fasta
    samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07638.1_16" > QPB07638.fasta
    samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07639.1_17" > QPB07639.fasta
    samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07640.1_18" > QPB07640.fasta
    samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07641.1_19" > QPB07641.fasta
    samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07642.1_20" > QPB07642.fasta
    samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07643.1_21" > QPB07643.fasta
    samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07644.1_22" > QPB07644.fasta
    samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07645.1_23" > QPB07645.fasta
    samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07646.1_24" > QPB07646.fasta
    samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07647.1_25" > QPB07647.fasta
    samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07648.1_26" > QPB07648.fasta
    samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07649.1_27" > QPB07649.fasta
    samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07650.1_28" > QPB07650.fasta
    samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07651.1_29" > QPB07651.fasta
    samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07652.1_30" > QPB07652.fasta
    samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07653.1_31" > QPB07653.fasta
    samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07654.1_32" > QPB07654.fasta
    samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07655.1_33" > QPB07655.fasta
    samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07656.1_34" > QPB07656.fasta
    samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07657.1_35" > QPB07657.fasta
    samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07658.1_36" > QPB07658.fasta
    samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07659.1_37" > QPB07659.fasta
    samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07660.1_38" > QPB07660.fasta
    samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07661.1_39" > QPB07661.fasta
    samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07662.1_40" > QPB07662.fasta
    samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07663.1_41" > QPB07663.fasta
    samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07664.1_42" > QPB07664.fasta
    samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07665.1_43" > QPB07665.fasta
    samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07666.1_44" > QPB07666.fasta
    samtools faidx MT880871.fasta "lcl|MT880871.1_cds_QPB07667.1_45" > QPB07667.fasta
    
    for i in {623..667}; do
        id=$(printf "QPB07%03d" "$i")
        echo "mkdir ${id}"
        echo "for sample in mibi1440 mibi1454 mibi1456 mibi1459 mibi1461 mibi1478 mibi1480 mibi1485 mibi1487 mibi1496 mibi1498 mibi1502 mibi2318 mibi2319 AP029227 CP013943 CP040883 CP045648 CP052939 CP052940 CP052955 CP052959 CP052981 CP052984 CP052990 CP052994 CP053088 CP064461 CP064467 CP064473 CP064476 CP064480 CP064485 CP064499 CP064508 CP064516 CP064525 CP064554 CP064557 CP064560 CP064572 CP064574 CP064587 CP064590 CP064599 CP064603 CP064606 CP064609 CP064613 CP064616 CP064619 CP064624 CP064631 CP064637 CP064650 CP073821 CP073824 CP073827 CP073830 CP073835 CP073836 CP073840 CP073841 CP073844 CP073847 CP073850 CP073852 CP073855 CP073857 CP073859 CP073862 CP073863 CP073878 CP073900 CP073904 CP084008 CP093173 CP093179 CP093181 CP093193 CP093196 CP093198 CP093212 CP097512 CP097514 CP120425 CP133663; do"
            echo "blastn -db \${sample}.fasta -query ${id}.fasta -evalue 1e-50 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 > ./${id}/\${sample}.blastn"
        echo "done"
    done
    
    for i in {660..660}; do
        id=$(printf "QPB07%03d" "$i")
        echo "mkdir ${id}"
        echo "for sample in mibi1440 mibi1454 mibi1456 mibi1459 mibi1461 mibi1478 mibi1480 mibi1485 mibi1487 mibi1496 mibi1498 mibi1502 mibi2318 mibi2319 AP029227 CP013943 CP040883 CP045648 CP052939 CP052940 CP052955 CP052959 CP052981 CP052984 CP052990 CP052994 CP053088 CP064461 CP064467 CP064473 CP064476 CP064480 CP064485 CP064499 CP064508 CP064516 CP064525 CP064554 CP064557 CP064560 CP064572 CP064574 CP064587 CP064590 CP064599 CP064603 CP064606 CP064609 CP064613 CP064616 CP064619 CP064624 CP064631 CP064637 CP064650 CP073821 CP073824 CP073827 CP073830 CP073835 CP073836 CP073840 CP073841 CP073844 CP073847 CP073850 CP073852 CP073855 CP073857 CP073859 CP073862 CP073863 CP073878 CP073900 CP073904 CP084008 CP093173 CP093179 CP093181 CP093193 CP093196 CP093198 CP093212 CP097512 CP097514 CP120425 CP133663; do"
            echo "blastn -db \${sample}.fasta -query ${id}.fasta -evalue 1e-10 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 > ./${id}/\${sample}.blastn"
        echo "done"
    done
    
    # -- under plotTreeHeatmap_ST2 --
    for i in {623..667}; do
        id_1=$(printf "QPB07%03d" "$((i-1))")
        id=$(printf "QPB07%03d" "$i")
        echo "python ~/Scripts/process_directory.py /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/presence_absence_ST2/${id} /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/plotTreeHeatmap_ST2/typing_ST2_until_${id_1}.csv /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/plotTreeHeatmap_ST2/typing_ST2_until_${id}.csv"
    done
    
    sed -i 's/+/MT880871/g' typing_ST2_until_QPB07667.csv
  3. prepare typing_ST2_until_QPB07667.csv under ~/DATA/Data_Luise_Sepi_STKN/presence_absence_ST2

    # -- under presence_absence_ST2 --
    samtools faidx MT880872.fasta
    #grep ">" MT880872.fasta | cut -d' ' -f1 > headers.txt
    
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07668.1_1" > QPB07668.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07669.1_2" > QPB07669.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07670.1_3" > QPB07670.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07671.1_4" > QPB07671.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07672.1_5" > QPB07672.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07673.1_6" > QPB07673.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07674.1_7" > QPB07674.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07675.1_8" > QPB07675.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07676.1_9" > QPB07676.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07677.1_10" > QPB07677.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07678.1_11" > QPB07678.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07679.1_12" > QPB07679.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07680.1_13" > QPB07680.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07681.1_14" > QPB07681.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07682.1_15" > QPB07682.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07683.1_16" > QPB07683.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07684.1_17" > QPB07684.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07685.1_18" > QPB07685.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07686.1_19" > QPB07686.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07687.1_20" > QPB07687.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07688.1_21" > QPB07688.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07689.1_22" > QPB07689.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07690.1_23" > QPB07690.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07691.1_24" > QPB07691.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07692.1_25" > QPB07692.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07693.1_26" > QPB07693.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07694.1_27" > QPB07694.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07695.1_28" > QPB07695.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07696.1_29" > QPB07696.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07697.1_30" > QPB07697.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07698.1_31" > QPB07698.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07699.1_32" > QPB07699.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07700.1_33" > QPB07700.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07701.1_34" > QPB07701.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07702.1_35" > QPB07702.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07703.1_36" > QPB07703.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07704.1_37" > QPB07704.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07705.1_38" > QPB07705.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07706.1_39" > QPB07706.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07707.1_40" > QPB07707.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07708.1_41" > QPB07708.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07709.1_42" > QPB07709.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07710.1_43" > QPB07710.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07711.1_44" > QPB07711.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07712.1_45" > QPB07712.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07713.1_46" > QPB07713.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07714.1_47" > QPB07714.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07715.1_48" > QPB07715.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07716.1_49" > QPB07716.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07717.1_50" > QPB07717.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07718.1_51" > QPB07718.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07719.1_52" > QPB07719.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07720.1_53" > QPB07720.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07721.1_54" > QPB07721.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07722.1_55" > QPB07722.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07723.1_56" > QPB07723.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07724.1_57" > QPB07724.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07725.1_58" > QPB07725.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07726.1_59" > QPB07726.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07727.1_60" > QPB07727.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07728.1_61" > QPB07728.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07729.1_62" > QPB07729.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07730.1_63" > QPB07730.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07731.1_64" > QPB07731.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07732.1_65" > QPB07732.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07733.1_66" > QPB07733.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07734.1_67" > QPB07734.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07735.1_68" > QPB07735.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07736.1_69" > QPB07736.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07737.1_70" > QPB07737.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07738.1_71" > QPB07738.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07739.1_72" > QPB07739.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07740.1_73" > QPB07740.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07741.1_74" > QPB07741.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07742.1_75" > QPB07742.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07743.1_76" > QPB07743.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07744.1_77" > QPB07744.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07745.1_78" > QPB07745.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07746.1_79" > QPB07746.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07747.1_80" > QPB07747.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07748.1_81" > QPB07748.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07749.1_82" > QPB07749.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07750.1_83" > QPB07750.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07751.1_84" > QPB07751.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07752.1_85" > QPB07752.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07753.1_86" > QPB07753.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07754.1_87" > QPB07754.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07755.1_88" > QPB07755.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07756.1_89" > QPB07756.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07757.1_90" > QPB07757.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07758.1_91" > QPB07758.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07759.1_92" > QPB07759.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07760.1_93" > QPB07760.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07761.1_94" > QPB07761.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07762.1_95" > QPB07762.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07763.1_96" > QPB07763.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07764.1_97" > QPB07764.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07765.1_98" > QPB07765.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07766.1_99" > QPB07766.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07767.1_100" > QPB07767.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07768.1_101" > QPB07768.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07769.1_102" > QPB07769.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07770.1_103" > QPB07770.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07771.1_104" > QPB07771.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07772.1_105" > QPB07772.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07773.1_106" > QPB07773.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07774.1_107" > QPB07774.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07775.1_108" > QPB07775.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07776.1_109" > QPB07776.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07777.1_110" > QPB07777.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07778.1_111" > QPB07778.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07779.1_112" > QPB07779.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07780.1_113" > QPB07780.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07781.1_114" > QPB07781.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07782.1_115" > QPB07782.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07783.1_116" > QPB07783.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07784.1_117" > QPB07784.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07785.1_118" > QPB07785.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07786.1_119" > QPB07786.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07787.1_120" > QPB07787.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07788.1_121" > QPB07788.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07789.1_122" > QPB07789.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07790.1_123" > QPB07790.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07791.1_124" > QPB07791.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07792.1_125" > QPB07792.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07793.1_126" > QPB07793.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07794.1_127" > QPB07794.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07795.1_128" > QPB07795.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07796.1_129" > QPB07796.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07797.1_130" > QPB07797.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07798.1_131" > QPB07798.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07799.1_132" > QPB07799.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07800.1_133" > QPB07800.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07801.1_134" > QPB07801.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07802.1_135" > QPB07802.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07803.1_136" > QPB07803.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07804.1_137" > QPB07804.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07805.1_138" > QPB07805.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07806.1_139" > QPB07806.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07807.1_140" > QPB07807.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07808.1_141" > QPB07808.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07809.1_142" > QPB07809.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07810.1_143" > QPB07810.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07811.1_144" > QPB07811.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07812.1_145" > QPB07812.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07813.1_146" > QPB07813.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07814.1_147" > QPB07814.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07815.1_148" > QPB07815.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07816.1_149" > QPB07816.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07817.1_150" > QPB07817.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07818.1_151" > QPB07818.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07819.1_152" > QPB07819.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07820.1_153" > QPB07820.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07821.1_154" > QPB07821.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07822.1_155" > QPB07822.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07823.1_156" > QPB07823.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07824.1_157" > QPB07824.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07825.1_158" > QPB07825.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07826.1_159" > QPB07826.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07827.1_160" > QPB07827.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07828.1_161" > QPB07828.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07829.1_162" > QPB07829.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07830.1_163" > QPB07830.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07831.1_164" > QPB07831.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07832.1_165" > QPB07832.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07833.1_166" > QPB07833.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07834.1_167" > QPB07834.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07835.1_168" > QPB07835.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07836.1_169" > QPB07836.fasta
    samtools faidx MT880872.fasta "lcl|MT880872.1_cds_QPB07837.1_170" > QPB07837.fasta
    
    for i in {668..750}; do
        id=$(printf "QPB07%03d" "$i")
        echo "mkdir ${id}"
        echo "for sample in mibi1440 mibi1454 mibi1456 mibi1459 mibi1461 mibi1478 mibi1480 mibi1485 mibi1487 mibi1496 mibi1498 mibi1502 mibi2318 mibi2319 AP029227 CP013943 CP040883 CP045648 CP052939 CP052940 CP052955 CP052959 CP052981 CP052984 CP052990 CP052994 CP053088 CP064461 CP064467 CP064473 CP064476 CP064480 CP064485 CP064499 CP064508 CP064516 CP064525 CP064554 CP064557 CP064560 CP064572 CP064574 CP064587 CP064590 CP064599 CP064603 CP064606 CP064609 CP064613 CP064616 CP064619 CP064624 CP064631 CP064637 CP064650 CP073821 CP073824 CP073827 CP073830 CP073835 CP073836 CP073840 CP073841 CP073844 CP073847 CP073850 CP073852 CP073855 CP073857 CP073859 CP073862 CP073863 CP073878 CP073900 CP073904 CP084008 CP093173 CP093179 CP093181 CP093193 CP093196 CP093198 CP093212 CP097512 CP097514 CP120425 CP133663; do"
            echo "blastn -db \${sample}.fasta -query ${id}.fasta -evalue 1e-50 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 > ./${id}/\${sample}.blastn"
        echo "done"
    done
    
    for i in {751..837}; do
        id=$(printf "QPB07%03d" "$i")
        echo "mkdir ${id}"
        echo "for sample in mibi1440 mibi1454 mibi1456 mibi1459 mibi1461 mibi1478 mibi1480 mibi1485 mibi1487 mibi1496 mibi1498 mibi1502 mibi2318 mibi2319 AP029227 CP013943 CP040883 CP045648 CP052939 CP052940 CP052955 CP052959 CP052981 CP052984 CP052990 CP052994 CP053088 CP064461 CP064467 CP064473 CP064476 CP064480 CP064485 CP064499 CP064508 CP064516 CP064525 CP064554 CP064557 CP064560 CP064572 CP064574 CP064587 CP064590 CP064599 CP064603 CP064606 CP064609 CP064613 CP064616 CP064619 CP064624 CP064631 CP064637 CP064650 CP073821 CP073824 CP073827 CP073830 CP073835 CP073836 CP073840 CP073841 CP073844 CP073847 CP073850 CP073852 CP073855 CP073857 CP073859 CP073862 CP073863 CP073878 CP073900 CP073904 CP084008 CP093173 CP093179 CP093181 CP093193 CP093196 CP093198 CP093212 CP097512 CP097514 CP120425 CP133663; do"
            echo "blastn -db \${sample}.fasta -query ${id}.fasta -evalue 1e-50 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 > ./${id}/\${sample}.blastn"
        echo "done"
    done
    
        for i in {708..708}; do
        id=$(printf "QPB07%03d" "$i")
        echo "mkdir ${id}"
        echo "for sample in mibi1440 mibi1454 mibi1456 mibi1459 mibi1461 mibi1478 mibi1480 mibi1485 mibi1487 mibi1496 mibi1498 mibi1502 mibi2318 mibi2319 AP029227 CP013943 CP040883 CP045648 CP052939 CP052940 CP052955 CP052959 CP052981 CP052984 CP052990 CP052994 CP053088 CP064461 CP064467 CP064473 CP064476 CP064480 CP064485 CP064499 CP064508 CP064516 CP064525 CP064554 CP064557 CP064560 CP064572 CP064574 CP064587 CP064590 CP064599 CP064603 CP064606 CP064609 CP064613 CP064616 CP064619 CP064624 CP064631 CP064637 CP064650 CP073821 CP073824 CP073827 CP073830 CP073835 CP073836 CP073840 CP073841 CP073844 CP073847 CP073850 CP073852 CP073855 CP073857 CP073859 CP073862 CP073863 CP073878 CP073900 CP073904 CP084008 CP093173 CP093179 CP093181 CP093193 CP093196 CP093198 CP093212 CP097512 CP097514 CP120425 CP133663; do"
            echo "blastn -db \${sample}.fasta -query ${id}.fasta -evalue 1e-10 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 > ./${id}/\${sample}.blastn"
        echo "done"
    done
    
    # -- under plotTreeHeatmap_ST2 --
    for i in {668..837}; do
        id_1=$(printf "QPB07%03d" "$((i-1))")
        id=$(printf "QPB07%03d" "$i")
        echo "python ~/Scripts/process_directory.py /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/presence_absence_ST2/${id} /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/plotTreeHeatmap_ST2/typing_ST2_until_${id_1}.csv /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/plotTreeHeatmap_ST2/typing_ST2_until_${id}.csv"
    done
    
    sed -i 's/+/MT880872/g' typing_ST2_until_QPB07837.csv
  4. plot tree heatmap under /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/plotTreeHeatmap_ST2_MT880871

    library(ggtree)
    library(ggplot2)
    library(dplyr)
    setwd("/home/jhuang/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_ST2/")
    # -- edit tree --
    #https://icytree.org/
    #0.000780
    info <- read.csv("typing_ST2_until_QPB07837.csv", sep="\t")
    info$name <- info$Isolate
    tree <- read.tree("core_gene_alignment.aln.treefile")
    cols <- c("Public database"='purple2', "Clinical isolate"='darksalmon')  #purple2  skyblue2
    heatmapData2 <- info %>% select(Isolate, QPB07572, QPB07573, QPB07574, QPB07575, QPB07576, QPB07577, QPB07578, QPB07579, QPB07580, QPB07581, QPB07582, QPB07583, QPB07584, QPB07585, QPB07586, QPB07587, QPB07588, QPB07589, QPB07590, QPB07591, QPB07592, QPB07593, QPB07594, QPB07595, QPB07596, QPB07597, QPB07598, QPB07599, QPB07600, QPB07601, QPB07602, QPB07603, QPB07604, QPB07605, QPB07606, QPB07607, QPB07608, QPB07609, QPB07610, QPB07611, QPB07612, QPB07613, QPB07614, QPB07615, QPB07616, QPB07617, QPB07618, QPB07619, QPB07620, QPB07621, QPB07622, QPB07623, QPB07624, QPB07625, QPB07626, QPB07627, QPB07628, QPB07629, QPB07630, QPB07631, QPB07632, QPB07633, QPB07634, QPB07635, QPB07636, QPB07637, QPB07638, QPB07639, QPB07640, QPB07641, QPB07642, QPB07643, QPB07644, QPB07645, QPB07646, QPB07647, QPB07648, QPB07649, QPB07650, QPB07651, QPB07652, QPB07653, QPB07654, QPB07655, QPB07656, QPB07657, QPB07658, QPB07659, QPB07660, QPB07661, QPB07662, QPB07663, QPB07664, QPB07665, QPB07666, QPB07667, QPB07668, QPB07669, QPB07670, QPB07671, QPB07672, QPB07673, QPB07674, QPB07675, QPB07676, QPB07677, QPB07678, QPB07679, QPB07680, QPB07681, QPB07682, QPB07683, QPB07684, QPB07685, QPB07686, QPB07687, QPB07688, QPB07689, QPB07690, QPB07691, QPB07692, QPB07693, QPB07694, QPB07695, QPB07696, QPB07697, QPB07698, QPB07699, QPB07700, QPB07701, QPB07702, QPB07703, QPB07704, QPB07705, QPB07706, QPB07707, QPB07708, QPB07709, QPB07710, QPB07711, QPB07712, QPB07713, QPB07714, QPB07715, QPB07716, QPB07717, QPB07718, QPB07719, QPB07720, QPB07721, QPB07722, QPB07723, QPB07724, QPB07725, QPB07726, QPB07727, QPB07728, QPB07729, QPB07730, QPB07731, QPB07732, QPB07733, QPB07734, QPB07735, QPB07736, QPB07737, QPB07738, QPB07739, QPB07740, QPB07741, QPB07742, QPB07743, QPB07744, QPB07745, QPB07746, QPB07747, QPB07748, QPB07749, QPB07750, QPB07751, QPB07752, QPB07753, QPB07754, QPB07755, QPB07756, QPB07757, QPB07758, QPB07759, QPB07760, QPB07761, QPB07762, QPB07763, QPB07764, QPB07765, QPB07766, QPB07767, QPB07768, QPB07769, QPB07770, QPB07771, QPB07772, QPB07773, QPB07774, QPB07775, QPB07776, QPB07777, QPB07778, QPB07779, QPB07780, QPB07781, QPB07782, QPB07783, QPB07784, QPB07785, QPB07786, QPB07787, QPB07788, QPB07789, QPB07790, QPB07791, QPB07792, QPB07793, QPB07794, QPB07795, QPB07796, QPB07797, QPB07798, QPB07799, QPB07800, QPB07801, QPB07802, QPB07803, QPB07804, QPB07805, QPB07806, QPB07807, QPB07808, QPB07809, QPB07810, QPB07811, QPB07812, QPB07813, QPB07814, QPB07815, QPB07816, QPB07817, QPB07818, QPB07819, QPB07820, QPB07821, QPB07822, QPB07823, QPB07824, QPB07825, QPB07826, QPB07827, QPB07828, QPB07829, QPB07830, QPB07831, QPB07832, QPB07833, QPB07834, QPB07835, QPB07836, QPB07837)  #ST,
    rn <- heatmapData2$Isolate
    heatmapData2$Isolate <- NULL
    heatmapData2 <- as.data.frame(sapply(heatmapData2, as.character))
    rownames(heatmapData2) <- rn
    heatmap.colours <- c("darkred", "darkblue", "darkgreen", "grey")
    names(heatmap.colours) <- c("MT880870","MT880871","MT880872","-")
    #mydat$Regulation <- factor(mydat$Regulation, levels=c("up","down"))
    #circular
    p <- ggtree(tree, layout='circular', branch.length='none') %<+% info + geom_tippoint(aes(color=Source)) + scale_color_manual(values=cols) + geom_tiplab2(aes(label=name), offset=1)
    png("ggtree.png", width=1260, height=1260)
    #svg("ggtree.svg", width=1260, height=1260)
    p
    dev.off()
    #png("ggtree_and_gheatmap.png", width=1290, height=1000)
    #svg("ggtree_and_gheatmap.svg", width=17, height=15)
    #gheatmap(p, heatmapData2, width=0.1, colnames_position="top", colnames_angle=90, colnames_offset_y = 0.1, hjust=0.5, font.size=4, offset = 5) + scale_fill_manual(values=heatmap.colours) +  theme(legend.text = element_text(size = 14)) + theme(legend.title = element_text(size = 14)) + guides(fill=guide_legend(title=""), color = guide_legend(override.aes = list(size = 5)))
    png("ggtree_and_gheatmap.png", width=1290, height=1000)
    gheatmap(p, heatmapData2, width=0.5, colnames_position="top", colnames_angle=90, colnames_offset_y = 0.1, hjust=0.5, font.size=0, offset = 8) + scale_fill_manual(values=heatmap.colours) +  theme(legend.text = element_text(size = 16)) + theme(legend.title = element_text(size = 16)) + guides(fill=guide_legend(title=""), color = guide_legend(override.aes = list(size = 5)))
    dev.off()
  5. Report

    Explanation for the discrepancy between the last analysis and this one: Note that many partial genes of the phages do not match because, in the previous analysis, only the longest BLASTn hits were considered and recorded in the table.

    We have a total of 14 ST2 genomes from Luise’s project, and I found 73 ST2 genomes from NCBI.

    I examined the presence and absence of all 51 + 45 + 170 genes (details below) across the 87 genomes and visualized the results in a heatmap (attached as ggtree_and_gheatmap_phages.png).

    • MT880870: 51 putative genes
    • MT880871: 45 putative genes
    • MT880872: 170 putative genes

    In the heatmap:

    Genes not present in the corresponding isolate are shown in grey. Genes present are highlighted in the corresponding colors:

    • Dark red for MT880870
    • Dark blue for MT880871
    • Dark green for MT880872

    The graphic is based on the Excel file typing_ST2.xlsx. If you have any suggestions or improvements, please let me know.

Enhanced Visualization of Gene Presence in Luise_Sepi_STKN

ggtree_and_gheatmap_mibi_selected_genes

ggtree_and_gheatmap_mibi_phages

  1. prepare typing_ST2_until_QPB07837.csv (under DIR ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates and ENV r_env)

    cp ../presence_absence_ST2/QPB*.fasta ./
    cp ../plotTreeHeatmap/typing.csv ./typing_until_QPB07571.csv
    
    # -- makeblastdb --
    #for sample in mibi1435 mibi1436 mibi1437 mibi1438 mibi1439 mibi1440 mibi1441 mibi1442 mibi1443 mibi1444 mibi1445 mibi1446 mibi1447 mibi1448 mibi1449 mibi1450 mibi1451 mibi1452 mibi1453 mibi1454 mibi1455 mibi1456 mibi1457 mibi1458 mibi1459 mibi1460 mibi1461 mibi1462 mibi1463 mibi1464 mibi1465 mibi1466 mibi1467 mibi1468 mibi1469 mibi1471 mibi1473 mibi1474 mibi1475 mibi1476 mibi1477 mibi1478 mibi1479 mibi1480 mibi1481 mibi1482 mibi1483 mibi1484 mibi1485 mibi1486 mibi1487 mibi1488 mibi1489 mibi1490 mibi1491 mibi1492 mibi1493 mibi1494 mibi1495 mibi1496 mibi1497 mibi1498 mibi1499 mibi1500 mibi1501 mibi1502 mibi1503 mibi1504 mibi1505 mibi1506 mibi2312 mibi2313 mibi2314 mibi2315 mibi2316 mibi2317 mibi2318 mibi2319 mibi2320 mibi2321 mibi2379; do
    #    makeblastdb -in ../shovill/${sample}/contigs.fa -dbtype nucl
    #done
    
    for i in {572..622}; do
        id_1=$(printf "QPB07%03d" "$((i-1))")
        id=$(printf "QPB07%03d" "$i")
        echo "mkdir ${id}"
        echo "for sample in mibi1435 mibi1436 mibi1437 mibi1438 mibi1439 mibi1440 mibi1441 mibi1442 mibi1443 mibi1444 mibi1445 mibi1446 mibi1447 mibi1448 mibi1449 mibi1450 mibi1451 mibi1452 mibi1453 mibi1454 mibi1455 mibi1456 mibi1457 mibi1458 mibi1459 mibi1460 mibi1461 mibi1462 mibi1463 mibi1464 mibi1465 mibi1466 mibi1467 mibi1468 mibi1469 mibi1471 mibi1473 mibi1474 mibi1475 mibi1476 mibi1477 mibi1478 mibi1479 mibi1480 mibi1481 mibi1482 mibi1483 mibi1484 mibi1485 mibi1486 mibi1487 mibi1488 mibi1489 mibi1490 mibi1491 mibi1492 mibi1493 mibi1494 mibi1495 mibi1496 mibi1497 mibi1498 mibi1499 mibi1500 mibi1501 mibi1502 mibi1503 mibi1504 mibi1505 mibi1506 mibi2312 mibi2313 mibi2314 mibi2315 mibi2316 mibi2317 mibi2318 mibi2319 mibi2320 mibi2321 mibi2379; do"
            echo "blastn -db ../shovill/\${sample}/contigs.fa -query ${id}.fasta -evalue 1e-40 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 > ./${id}/\${sample}.blastn"
        echo "done"
    done
    
    for i in {572..622}; do
        id_1=$(printf "QPB07%03d" "$((i-1))")
        id=$(printf "QPB07%03d" "$i")
        echo "python ~/Scripts/process_directory.py ${id} typing_until_${id_1}.csv typing_until_${id}.csv"
    done
    
    #reprace all '+' with 'MT880870' in typing_until_QPB07622.csv
    sed -i 's/+/MT880870/g' typing_until_QPB07622.csv
  2. prepare typing_until_QPB07667.csv

    for i in {623..667}; do
        id=$(printf "QPB07%03d" "$i")
        echo "mkdir ${id}"
        echo "for sample in mibi1435 mibi1436 mibi1437 mibi1438 mibi1439 mibi1440 mibi1441 mibi1442 mibi1443 mibi1444 mibi1445 mibi1446 mibi1447 mibi1448 mibi1449 mibi1450 mibi1451 mibi1452 mibi1453 mibi1454 mibi1455 mibi1456 mibi1457 mibi1458 mibi1459 mibi1460 mibi1461 mibi1462 mibi1463 mibi1464 mibi1465 mibi1466 mibi1467 mibi1468 mibi1469 mibi1471 mibi1473 mibi1474 mibi1475 mibi1476 mibi1477 mibi1478 mibi1479 mibi1480 mibi1481 mibi1482 mibi1483 mibi1484 mibi1485 mibi1486 mibi1487 mibi1488 mibi1489 mibi1490 mibi1491 mibi1492 mibi1493 mibi1494 mibi1495 mibi1496 mibi1497 mibi1498 mibi1499 mibi1500 mibi1501 mibi1502 mibi1503 mibi1504 mibi1505 mibi1506 mibi2312 mibi2313 mibi2314 mibi2315 mibi2316 mibi2317 mibi2318 mibi2319 mibi2320 mibi2321 mibi2379; do"
            echo "blastn -db ../shovill/\${sample}/contigs.fa -query ${id}.fasta -evalue 1e-40 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 > ./${id}/\${sample}.blastn"
        echo "done"
    done
    
    for i in {623..667}; do
        id_1=$(printf "QPB07%03d" "$((i-1))")
        id=$(printf "QPB07%03d" "$i")
        echo "python ~/Scripts/process_directory.py ${id} typing_until_${id_1}.csv typing_until_${id}.csv"
    done
    
    sed -i 's/+/MT880871/g' typing_until_QPB07667.csv
  3. prepare typing_until_QPB07667.csv

    for i in {668..750}; do
        id=$(printf "QPB07%03d" "$i")
        echo "mkdir ${id}"
        echo "for sample in mibi1435 mibi1436 mibi1437 mibi1438 mibi1439 mibi1440 mibi1441 mibi1442 mibi1443 mibi1444 mibi1445 mibi1446 mibi1447 mibi1448 mibi1449 mibi1450 mibi1451 mibi1452 mibi1453 mibi1454 mibi1455 mibi1456 mibi1457 mibi1458 mibi1459 mibi1460 mibi1461 mibi1462 mibi1463 mibi1464 mibi1465 mibi1466 mibi1467 mibi1468 mibi1469 mibi1471 mibi1473 mibi1474 mibi1475 mibi1476 mibi1477 mibi1478 mibi1479 mibi1480 mibi1481 mibi1482 mibi1483 mibi1484 mibi1485 mibi1486 mibi1487 mibi1488 mibi1489 mibi1490 mibi1491 mibi1492 mibi1493 mibi1494 mibi1495 mibi1496 mibi1497 mibi1498 mibi1499 mibi1500 mibi1501 mibi1502 mibi1503 mibi1504 mibi1505 mibi1506 mibi2312 mibi2313 mibi2314 mibi2315 mibi2316 mibi2317 mibi2318 mibi2319 mibi2320 mibi2321 mibi2379; do"
            echo "blastn -db ../shovill/\${sample}/contigs.fa -query ${id}.fasta -evalue 1e-40 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 > ./${id}/\${sample}.blastn"
        echo "done"
    done
    
    for i in {751..837}; do
        id=$(printf "QPB07%03d" "$i")
        echo "mkdir ${id}"
        echo "for sample in mibi1435 mibi1436 mibi1437 mibi1438 mibi1439 mibi1440 mibi1441 mibi1442 mibi1443 mibi1444 mibi1445 mibi1446 mibi1447 mibi1448 mibi1449 mibi1450 mibi1451 mibi1452 mibi1453 mibi1454 mibi1455 mibi1456 mibi1457 mibi1458 mibi1459 mibi1460 mibi1461 mibi1462 mibi1463 mibi1464 mibi1465 mibi1466 mibi1467 mibi1468 mibi1469 mibi1471 mibi1473 mibi1474 mibi1475 mibi1476 mibi1477 mibi1478 mibi1479 mibi1480 mibi1481 mibi1482 mibi1483 mibi1484 mibi1485 mibi1486 mibi1487 mibi1488 mibi1489 mibi1490 mibi1491 mibi1492 mibi1493 mibi1494 mibi1495 mibi1496 mibi1497 mibi1498 mibi1499 mibi1500 mibi1501 mibi1502 mibi1503 mibi1504 mibi1505 mibi1506 mibi2312 mibi2313 mibi2314 mibi2315 mibi2316 mibi2317 mibi2318 mibi2319 mibi2320 mibi2321 mibi2379; do"
            echo "blastn -db ../shovill/\${sample}/contigs.fa -query ${id}.fasta -evalue 1e-40 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 > ./${id}/\${sample}.blastn"
        echo "done"
    done
    
    for i in {668..837}; do
        id_1=$(printf "QPB07%03d" "$((i-1))")
        id=$(printf "QPB07%03d" "$i")
        echo "python ~/Scripts/process_directory.py ${id} typing_until_${id_1}.csv typing_until_${id}.csv"
    done
    
    sed -i 's/+/MT880872/g' typing_until_QPB07837.csv
    
    cut -d$'\t' -f1-43 typing_until_QPB07837.csv > temp1
    cut -d$'\t' -f44- typing_until_QPB07837.csv > temp2
    
    sed -i 's/MT880870/+/g' temp1
    paste -d$'\t' temp1 temp2 > ggtree_and_gheatmap_mibi_phages.csv
  4. plot tree heatmap under /mnt/md1/DATA_md1/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates

    cp ../plotTreeHeatmap/990_backup.tree ./
    
    library(ggtree)
    library(ggplot2)
    library(dplyr)
    setwd("/home/jhuang/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/")
    # -- edit tree --
    #https://icytree.org/
    #0.000780
    
    # -- for the figure ggtree_and_gheatmap_mibi_phages.png --
    info <- read.csv("ggtree_and_gheatmap_mibi.csv", sep="\t")
    info$name <- info$Isolate
    tree <- read.tree("990_backup.tree")
    cols <- c("2"="cornflowerblue","5"="darkgreen","7"="seagreen3","9"="tan","14"="red",  "17"="navyblue", "23"="gold",     "35"="green","59"="orange","73"="pink","81"="purple","86"="magenta","87"="brown", "89"="darksalmon","130"="chocolate4","190"="darkkhaki", "290"="azure3", "297"="maroon","325"="lightgreen",     "454"="blue","487"="cyan", "558"="skyblue2", "766"="blueviolet")
    #cols <- c("2"='purple2', "other"='darksalmon')  #purple2  skyblue2
    heatmapData2 <- info %>% select(Isolate,  QPB07572, QPB07573, QPB07574, QPB07575, QPB07576, QPB07577, QPB07578, QPB07579, QPB07580, QPB07581, QPB07582, QPB07583, QPB07584, QPB07585, QPB07586, QPB07587, QPB07588, QPB07589, QPB07590, QPB07591, QPB07592, QPB07593, QPB07594, QPB07595, QPB07596, QPB07597, QPB07598, QPB07599, QPB07600, QPB07601, QPB07602, QPB07603, QPB07604, QPB07605, QPB07606, QPB07607, QPB07608, QPB07609, QPB07610, QPB07611, QPB07612, QPB07613, QPB07614, QPB07615, QPB07616, QPB07617, QPB07618, QPB07619, QPB07620, QPB07621, QPB07622, QPB07623, QPB07624, QPB07625, QPB07626, QPB07627, QPB07628, QPB07629, QPB07630, QPB07631, QPB07632, QPB07633, QPB07634, QPB07635, QPB07636, QPB07637, QPB07638, QPB07639, QPB07640, QPB07641, QPB07642, QPB07643, QPB07644, QPB07645, QPB07646, QPB07647, QPB07648, QPB07649, QPB07650, QPB07651, QPB07652, QPB07653, QPB07654, QPB07655, QPB07656, QPB07657, QPB07658, QPB07659, QPB07660, QPB07661, QPB07662, QPB07663, QPB07664, QPB07665, QPB07666, QPB07667, QPB07668, QPB07669, QPB07670, QPB07671, QPB07672, QPB07673, QPB07674, QPB07675, QPB07676, QPB07677, QPB07678, QPB07679, QPB07680, QPB07681, QPB07682, QPB07683, QPB07684, QPB07685, QPB07686, QPB07687, QPB07688, QPB07689, QPB07690, QPB07691, QPB07692, QPB07693, QPB07694, QPB07695, QPB07696, QPB07697, QPB07698, QPB07699, QPB07700, QPB07701, QPB07702, QPB07703, QPB07704, QPB07705, QPB07706, QPB07707, QPB07708, QPB07709, QPB07710, QPB07711, QPB07712, QPB07713, QPB07714, QPB07715, QPB07716, QPB07717, QPB07718, QPB07719, QPB07720, QPB07721, QPB07722, QPB07723, QPB07724, QPB07725, QPB07726, QPB07727, QPB07728, QPB07729, QPB07730, QPB07731, QPB07732, QPB07733, QPB07734, QPB07735, QPB07736, QPB07737, QPB07738, QPB07739, QPB07740, QPB07741, QPB07742, QPB07743, QPB07744, QPB07745, QPB07746, QPB07747, QPB07748, QPB07749, QPB07750, QPB07751, QPB07752, QPB07753, QPB07754, QPB07755, QPB07756, QPB07757, QPB07758, QPB07759, QPB07760, QPB07761, QPB07762, QPB07763, QPB07764, QPB07765, QPB07766, QPB07767, QPB07768, QPB07769, QPB07770, QPB07771, QPB07772, QPB07773, QPB07774, QPB07775, QPB07776, QPB07777, QPB07778, QPB07779, QPB07780, QPB07781, QPB07782, QPB07783, QPB07784, QPB07785, QPB07786, QPB07787, QPB07788, QPB07789, QPB07790, QPB07791, QPB07792, QPB07793, QPB07794, QPB07795, QPB07796, QPB07797, QPB07798, QPB07799, QPB07800, QPB07801, QPB07802, QPB07803, QPB07804, QPB07805, QPB07806, QPB07807, QPB07808, QPB07809, QPB07810, QPB07811, QPB07812, QPB07813, QPB07814, QPB07815, QPB07816, QPB07817, QPB07818, QPB07819, QPB07820, QPB07821, QPB07822, QPB07823, QPB07824, QPB07825, QPB07826, QPB07827, QPB07828, QPB07829, QPB07830, QPB07831, QPB07832, QPB07833, QPB07834, QPB07835, QPB07836, QPB07837)  #ST,
    rn <- heatmapData2$Isolate
    heatmapData2$Isolate <- NULL
    heatmapData2 <- as.data.frame(sapply(heatmapData2, as.character))
    rownames(heatmapData2) <- rn
    
    heatmap.colours <- c("darkred", "darkblue", "darkgreen", "grey")
    names(heatmap.colours) <- c("MT880870","MT880871","MT880872","-")
    
    #heatmap.colours <- c("cornflowerblue","darkgreen","seagreen3","tan","red",  "navyblue", "gold",     "green","orange","pink","purple","magenta","brown", "darksalmon","chocolate4","darkkhaki", "azure3", "maroon","lightgreen",     "blue","cyan", "skyblue2", "blueviolet",       "darkred", "darkblue", "darkgreen", "grey")
    #names(heatmap.colours) <- c("2","5","7","9","14", "17","23",   "35","59","73", "81","86","87","89","130","190","290", "297","325",    "454","487","558","766",       "MT880870","MT880871","MT880872","-")
    
    #mydat$Regulation <- factor(mydat$Regulation, levels=c("up","down"))
    #circular
    
    p <- ggtree(tree, layout='circular', branch.length='none') %<+% info + geom_tippoint(aes(color=ST)) + scale_color_manual(values=cols) + geom_tiplab2(aes(label=name), offset=1)
    png("ggtree.png", width=1260, height=1260)
    #svg("ggtree.svg", width=1260, height=1260)
    p
    dev.off()
    
    #gheatmap(p, heatmapData2, width=0.1, colnames_position="top", colnames_angle=90, colnames_offset_y = 0.1, hjust=0.5, font.size=4, offset = 5) + scale_fill_manual(values=heatmap.colours) +  theme(legend.text = element_text(size = 14)) + theme(legend.title = element_text(size = 14)) + guides(fill=guide_legend(title=""), color = guide_legend(override.aes = list(size = 5)))
    
    png("ggtree_and_gheatmap_mibi_phages.png", width=1290, height=1000)
    #svg("ggtree_and_gheatmap_mibi_phages.svg", width=17, height=15)
    gheatmap(p, heatmapData2, width=0.5, colnames_position="top", colnames_angle=90, colnames_offset_y = 0.1, hjust=0.5, font.size=0, offset = 6) + scale_fill_manual(values=heatmap.colours) +  theme(legend.text = element_text(size = 16)) + theme(legend.title = element_text(size = 16)) + guides(fill=guide_legend(title=""), color = guide_legend(override.aes = list(size = 5)))
    dev.off()
    
    # -- for the figure ggtree_and_gheatmap_mibi_selected_genes.png --
    
    info <- read.csv("ggtree_and_gheatmap_mibi.csv", sep="\t")
    info$name <- info$Isolate
    tree <- read.tree("990_backup.tree")
    cols <- c("2"="cornflowerblue","5"="darkgreen","7"="seagreen3","9"="tan","14"="red",  "17"="navyblue", "23"="gold",     "35"="green","59"="orange","73"="pink","81"="purple","86"="magenta","87"="brown", "89"="darksalmon","130"="chocolate4","190"="darkkhaki", "290"="azure3", "297"="maroon","325"="lightgreen",     "454"="blue","487"="cyan", "558"="skyblue2", "766"="blueviolet")
    
    heatmapData2 <- info %>% select(Isolate,       SCCmec,  agr.typing,      gyrB,    fumC,    gltA,    icd,     apsS,    sigB,    sarA,    agrC,    yycG,    psm.β,   psm.δ,   hlb,     atlE,    atl,     sdrG,    sdrH,    ebh,     ebpS,    tagB,    capC,    sepA,    dltA,    fmtC,    lipA,    sceD,    SE0760,  esp,     ecpA)  #ST,
    rn <- heatmapData2$Isolate
    heatmapData2$Isolate <- NULL
    heatmapData2 <- as.data.frame(sapply(heatmapData2, as.character))
    rownames(heatmapData2) <- rn
    
    #heatmap.colours <- c("darkred", "darkblue", "darkgreen", "grey")
    #names(heatmap.colours) <- c("MT880870","MT880871","MT880872","-")
    
    #heatmap.colours <- c("cornflowerblue","darkgreen","seagreen3","tan","red",  "navyblue", "gold",     "green","orange","pink","purple","magenta","brown", "darksalmon","chocolate4","darkkhaki", "azure3", "maroon","lightgreen",     "blue","cyan", "skyblue2", "blueviolet",       "darkred", "darkblue", "darkgreen", "grey")
    #names(heatmap.colours) <- c("2","5","7","9","14", "17","23",   "35","59","73", "81","86","87","89","130","190","290", "297","325",    "454","487","558","766",       "MT880870","MT880871","MT880872","-")
    
    heatmap.colours <- c("cornflowerblue","darkgreen","seagreen3","tan","red",  "navyblue", "purple",     "green","cyan",       "darkred", "darkblue", "darkgreen", "grey",  "darkgreen", "grey")
    names(heatmap.colours) <- c("SCCmec_type_II(2A)", "SCCmec_type_III(3A)", "SCCmec_type_III(3A) and SCCmec_type_VIII(4A)", "SCCmec_type_IV(2B)", "SCCmec_type_IV(2B&5)", "SCCmec_type_IV(2B) and SCCmec_type_VI(4B)", "SCCmec_type_IVa(2B)", "SCCmec_type_IVb(2B)", "SCCmec_type_IVg(2B)",  "I", "II", "III", "none", "+","-")
    
    #mydat$Regulation <- factor(mydat$Regulation, levels=c("up","down"))
    #circular
    
    p <- ggtree(tree, layout='circular', branch.length='none') %<+% info + geom_tippoint(aes(color=ST)) + scale_color_manual(values=cols) + geom_tiplab2(aes(label=name), offset=1)
    png("ggtree.png", width=1260, height=1260)
    #svg("ggtree.svg", width=1260, height=1260)
    p
    dev.off()
    
    #gheatmap(p, heatmapData2, width=0.1, colnames_position="top", colnames_angle=90, colnames_offset_y = 0.1, hjust=0.5, font.size=4, offset = 5) + scale_fill_manual(values=heatmap.colours) +  theme(legend.text = element_text(size = 14)) + theme(legend.title = element_text(size = 14)) + guides(fill=guide_legend(title=""), color = guide_legend(override.aes = list(size = 5)))
    
    png("ggtree_and_gheatmap_mibi_selected_genes.png", width=1590, height=1300)
    #svg("ggtree_and_gheatmap_mibi_selected_genes.svg", width=17, height=15)
    gheatmap(p, heatmapData2, width=2, colnames_position="top", colnames_angle=90, colnames_offset_y = 2.0, hjust=0.7, font.size=4, offset = 8) + scale_fill_manual(values=heatmap.colours) +  theme(legend.text = element_text(size = 16)) + theme(legend.title = element_text(size = 16)) + guides(fill=guide_legend(title=""), color = guide_legend(override.aes = list(size = 5)))
    dev.off()
  5. Report

    I’ve attached a figure (ggtree_and_gheatmap_mibi_selected_genes.png) that illustrates the presence and absence of the selected genes. This is a visual representation of the table I sent to you earlier. The presence or absence of each gene in the corresponding genomes was determined using a BLASTn comparison between the genome and the gene sequences.
    
    Additionally, I’ve updated the figure ggtree_and_gheatmap_mibi_phages.png. In this new version, all ST types are represented in distinct colors. The raw data for both figures can be found in the attached Excel file (ggtree_and_gheatmap_mibi.xlsx).

Enhanced Visualization of Gene Presence for the Selected Genes in Bongarts_S.epidermidis_HDRNA

ggtree_and_gheatmap_selected_genes

  1. preparing gene seqences

    (Optional online search) https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastn&PAGE_TYPE=BlastSearch&BLAST_SPEC=MicrobialGenomes
    #Title:Refseq prokaryote representative genomes (contains refseq assembly)
    #Molecule Type:mixed DNA
    #Update date:2024/10/16
    #Number of sequences:1038672
    
    #cat gyrB_revcomp.fasta fumC_revcomp.fasta gltA.fasta icd_revcomp.fasta apsS.fasta sigB_revcomp.fasta sarA_revcomp.fasta ...
    
    # SCCmec,agr.typing,
    # gyrB(House keeper),
    # fumC,gltA,icd(Metabolic genes),
    # apsS,sigB,sarA,agrC,yycG(Virulence regulators),
    # psm.β1,hlb(Toxins),
    # atlE,sdrG,sdrH,ebh,ebp,tagB(Biofilm formation),
    # capC,sepA,dltA,fmtC,lipA,sceD,SE0760(Immune evasion & colonization),
    # esp,ecpA(Serine protease)
    
    gene            complement(1547773..1549233)
                    /gene="ebp"
                    /locus_tag="SAKG22_13940"
    
    CDS             complement(1547773..1549233)
                    /gene="ebp"
            samtools faidx AP019545.1.fasta
            samtools faidx AP019545.1.fasta "gi|1602080981|dbj|AP019545.1|":1547773-1549233 > ebp.fasta
            revcomp ebp.fasta > ebp_revcomp.fasta
    
            cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/gyrB_revcomp.fasta gyrB.fasta
    
            cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/fumC_revcomp.fasta fumC.fasta
            cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/gltA.fasta .
            cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/icd_revcomp.fasta icd.fasta
    
            cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/apsS.fasta .
            cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/sigB_revcomp.fasta sigB.fasta
            cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/sarA_revcomp.fasta sarA.fasta
            cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/agrC.fasta .
            cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/yycG.fasta .
    
            cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/psm-beta1.fasta psm.β1.fasta
            cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/hlb.fasta .
    
            cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/atlE.fasta .
            cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/sdrG.fasta .
            cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/sdrH_revcomp.fasta sdrH.fasta
            cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/ebh_revcomp.fasta ebh.fasta
            cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/ebp_revcomp.fasta ebp.fasta
            cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/tagB.fasta .
    
            cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/capC.fasta .
            cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/sepA.fasta .
            cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/dltA.fasta .
            cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/fmtC.fasta .
            cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/lipA.fasta .
            cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/sceD.fasta .
            cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/SE0760.fasta .
    
            cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/esp.fasta .
            cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/ecpA.fasta .
    
            #cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/MT880870.fasta .
            #cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/MT880871.fasta .
            #cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/MT880872.fasta .
    
    >gi|1602080981|dbj|AP019545.1|:1547773-1549233
    >gi|34980227|gb|AY322150.1|:3091-8793
    
    ./gyrB_revcomp.fasta
    
    ./agrC.fasta
    ./CP133692.fasta
    ./sceD_mibi1435_revcomp.fasta
    ./CP133682.fasta
    ./CP133697.fasta
    ./sdrG.fasta
    ./sarA_revcomp.fasta
    ./CP133679.fasta
    ./CP133686.fasta
    ./sepA.fasta
    ./icd.fasta
    ./CP133691.fasta
    ./CP133698.fasta
    ./yycG.fasta
    ./CP133701.fasta
    
    ./sdrH_revcomp.fasta
    ./SE0760.fasta
    ./CP133681.fasta
    ./CP133677.fasta
    ./yycG_revcomp.fasta
    ./CP000029.fasta
    ./atlE_old.fasta
    ./ebpS.fasta
    ./ebpS_revcomp.fasta
    ./atlE_revcomp.fasta
    ./sceDAE.fasta
    ./hlb.fasta
    >gi|441494790|gb|KC242859.1|:16-1008
    ATGGTGAAAAAAACAAAATCCAATTCACTAAAAAAAGTTGCAACACTTGCATTAGCAAAT
    TTATTATTAGTTGGTGCACTTACTGACAATAGTGCCAAAGCCGAATCTAAGAAAGATGAT
    ACTGATTTGAAGTTAGTTAGTCATAACGTTTATATGTTATCGACCGTTTTGTATCCAAAC
    TGGGGGCAATATAAACGCGCTGATTTAATCGGGCAATCTTCTTATATTAAAAATAATGAT
    GTCGTAATATTCAATGAAGCATTTGATAATGGCGCATCAGATAAGCTATTAAGTAATGTA
    AAAAAAGAATATCCTTACCAAACACCTGTACTCGGTCGTTCTCAATCAGGGTGGGACAAA
    ACTGAAGGTAGCTACTCATCAACTGTTGCTGAAGATGGTGGCGTAGCGATTGTAAGTAAA
    TATCCTATTAAAGAAAAAATCCAGCATGTTTTCAAAAGCGGTTGTGGATTCGACAATGAT
    AGCAACAAAGGCTTTGTTTATACAAAAATAGAGAAAAATGGGAAGAACGTTCATGTTATC
    GGTACACATACACAATCTGAAGATTCACGTTGTGGTGCTGGACATGATCGAAAAATTAGA
    GCTGAACAAATGAAAGAAATCAGTGATTTTGTTAAAAAGAAAAATATCCCTAAAGATGAA
    ACGGTATATATAGGTGGCGACCTTAATGTAAATAAAGGTACTCCAGAGTTCAAAGATATG
    CTTAAAAACTTGAATGTAAATGATGTTCTATATGCAGGTCATAATAGCACATGGGACCCT
    CAATCAAATTCAATTGCGAAATATAATTATCCTAATGGTAAACCAGAACATTTAGACTAT
    ATATTTACAGATAAAGATCATAAACAACCAAAACAATTAGTCAATGAAGTTGTGACTGAA
    AAACCTAAGCCATGGGATGTATATGCGTTCCCATATTACTACGTTTACAATGATTTTTCA
    GATCATTACCCAATCAAAGCCTATAGTAAATAA
    >hlb_
    >gi|441494790|gb|KC242859.1| Staphylococcus aureus strain 226 beta-hemolysin (hlb) gene, complete cds
    AAAGGAGTGATAATG  ATGGTGAAAAAAACAAAATCCAATTCACTAAAAAAAGTTGCAACACTTGCATTAGCAAATTTATTATTAGTTGGTGCACTTACTGACAATAGTGCCAAAGCCGAATCTAAGAAAGATGATACTGA
    TTTGAAGTTAGTTAGTCATAACGTTTATATGTTATCGACCGTTTTGTATCCAAACTGGGGGCAATATAAA
    CGCGCTGATTTAATCGGGCAATCTTCTTATATTAAAAATAATGATGTCGTAATATTCAATGAAGCATTTG
    ATAATGGCGCATCAGATAAGCTATTAAGTAATGTAAAAAAAGAATATCCTTACCAAACACCTGTACTCGG
    TCGTTCTCAATCAGGGTGGGACAAAACTGAAGGTAGCTACTCATCAACTGTTGCTGAAGATGGTGGCGTA
    GCGATTGTAAGTAAATATCCTATTAAAGAAAAAATCCAGCATGTTTTCAAAAGCGGTTGTGGATTCGACA
    ATGATAGCAACAAAGGCTTTGTTTATACAAAAATAGAGAAAAATGGGAAGAACGTTCATGTTATCGGTAC
    ACATACACAATCTGAAGATTCACGTTGTGGTGCTGGACATGATCGAAAAATTAGAGCTGAACAAATGAAA
    GAAATCAGTGATTTTGTTAAAAAGAAAAATATCCCTAAAGATGAAACGGTATATATAGGTGGCGACCTTA
    ATGTAAATAAAGGTACTCCAGAGTTCAAAGATATGCTTAAAAACTTGAATGTAAATGATGTTCTATATGC
    AGGTCATAATAGCACATGGGACCCTCAATCAAATTCAATTGCGAAATATAATTATCCTAATGGTAAACCA
    GAACATTTAGACTATATATTTACAGATAAAGATCATAAACAACCAAAACAATTAGTCAATGAAGTTGTGA
    CTGAAAAACCTAAGCCATGGGATGTATATGCGTTCCCATATTACTACGTTTACAATGATTTTTCAGATCA
    TTACCCAATCAAAGCCTATAGTAAATAA
    
    ./dltA.fasta
    ./atlE.fasta
    ./gyrB.fasta
    ./ORF123_sepA_ORF5.fasta
    ./ebh.fasta
    ./CP133696.fasta
    ./yycG_old.fasta
    ./CP133700.fasta
    ./CP133695.fasta
    ./ecpA.fasta
    ./capBCA_ywtC.fasta
    ./CP133690.fasta
    ./gltA.fasta
    ./esp.fasta
    ./sigB.fasta
    ./CP133699.fasta
    ./CP133685.fasta
    ./CP133680.fasta
    ./MT880872.fasta
    ./icd_revcomp.fasta
    ./agrC_old.fasta
    ./ecpA_.fasta
    ./agrC_revcomp.fasta
    ./MT880870.fasta
    ./CP133687.fasta
    ./CP133693.fasta
    ./psm-beta1.fasta
    >gi|380448412|gb|JQ066320.1| Staphylococcus aureus strain JP1 Psm beta1 (psm beta1) and Psm beta2 (psm beta2) genes, complete cds
    ACACGTTTAACAACACAAGAATTATTATATCTAAATGAACTAAAATTAGCAATACCTTGTAAATAAAAAA
    TGTTTATATTTTTCACTATTATAGAGCTATTTATCTAAAAAGGTTCAATAAGACTTAAATACGAATTCAG
    GCAACTTAATTGTGTTAAATACAGTTTTGAATGCCTAACTGTATTTCTTTTCTCTTTAAAATACAGTTAA
    GTACATTATAAGATGTTGTGCGGATAAACAAACTAATTGCATCAAATTTATTTTAAAATAACAACAACAA
    AACGTTAAGAGAATAACATTTCGGTGATTTAAAAGCTACGCACGTTTTTGTTATCTTCAAATTTAAATTT
    TAAGGAGTGTTTTCA
    ATGGAAGGTTTATTTAACGCAATTAAAGATACCGTAACTGCAGCAATTAATAATGATGGC
    GCAAAATTAGGCACAAGCATTGTGAGCATCGTTGAAAATGGCGTAGGTTTATTAGGTAAA
    TTATTCGGATTCTAA
    TTTCAATATGTTATGTAAGTAATCAGTATTATTTCAAAGGTGAGGGAGAGATTTAAATGA
    CTGGACTAGCAGAAGCAATCGCAAATACTGTGCAAGCTGCACAACAACATGATAGTGTGAAATTAGGCAC
    AAGTATCGTAGACATCGTTGCTAACGGTGTGGGTTTACTAGGTAAATTATTTGGATTCTAATATAATAAC
    TAATATTCTTTAAAATAAACTGGGTGAGCATACTTTAATGTTATGCACTCAGTTTATTTTATTTGCAGAA
    ATTTGAGCCTCTGTTAAGATTTAGATACATAGACAATATAGGAGATGGGGAAATTGGGATATAAAAATAT
    TTTGATAGACTTTGATGATACAATTGTTGATTTTTATGATGCAGA
    
    ATGGAAGGTTTATTTAACGCAATTAAAGATACCGTAACTGCAGCAATTAATAATGATGGC
    GCAAAATTAGGCACAAGCATTGTGAGCATCGTTGAAAATGGCGTAGGTTTATTAGGTAAA
    TTATTCGGATTCTAA
    ./psm-beta.fasta
    ./lipA.fasta
    ./capC.fasta
    ./fumC_revcomp.fasta
    ./CP133689.fasta
    ./sepA_mibi1435.fasta
    ./sigB_revcomp.fasta
    ./sceD.fasta
    ./fumC.fasta
    ./Enterococcus_faecium_isolate_E300_pathogenicity_island.fasta
    ./agrABCD_hld.fasta
    ./CP133688.fasta
    ./ecpA_mibi1435_revcomp.fasta
    ./hlb_.fasta
    ./CP133684.fasta
    ./tagB_mibi1435_revcomp.fasta
    ./CP133678.fasta
    ./tagB.fasta
    ./MT880871.fasta
    ./CP133676.fasta
    ./sdrH.fasta
    ./Staphylococcus_epidermidis_RP62A.fasta
    ./ebh_revcomp.fasta
    ./CP133683.fasta
    ./apsS.fasta
    ./Staphylococcus_aureus_MRSA252.fasta
    ./fmtC.fasta
    ./CP133694.fasta
    ./sarA.fasta
  2. prepare the long fasta if we use only the long sequencing

    ln -s ../Data_Holger_S.epidermidis_long/finished_assemblies/HDRNA_01_K01_conservative.fasta HDRNA_01_K01.fasta
    ln -s ../Data_Holger_S.epidermidis_long/finished_assemblies/HDRNA_03_K01_bold_bandage.fasta HDRNA_03_K01.fasta
    ln -s ../Data_Holger_S.epidermidis_long/finished_assemblies/HDRNA_06_K01_conservative.fasta HDRNA_06_K01.fasta
    ln -s ../Data_Holger_S.epidermidis_long/finished_assemblies/HDRNA_07_K01_conservative.fasta HDRNA_07_K01.fasta
    ln -s ../Data_Holger_S.epidermidis_long/finished_assemblies/HDRNA_08_K01_conservative.fasta HDRNA_08_K01.fasta
    ln -s ../Data_Holger_S.epidermidis_long/finished_assemblies/HDRNA_12_K01_bold.fasta HDRNA_12_K01.fasta
    ln -s ../Data_Holger_S.epidermidis_long/finished_assemblies/HDRNA_16_K01_conservative.fasta HDRNA_16_K01.fasta
    ln -s ../Data_Holger_S.epidermidis_long/finished_assemblies/HDRNA_17_K01_conservative.fasta HDRNA_17_K01.fasta
    ln -s ../Data_Holger_S.epidermidis_long/finished_assemblies/HDRNA_19_K01_bold.fasta HDRNA_19_K01.fasta
    ln -s ../Data_Holger_S.epidermidis_long/finished_assemblies/HDRNA_20_K01_conservative.fasta HDRNA_20_K01.fasta
  3. perform blastn searching for long-sequencing

    # -- makeblastdb --
    for sample in HDRNA_01_K01 HDRNA_03_K01 HDRNA_06_K01 HDRNA_07_K01 HDRNA_08_K01 HDRNA_12_K01 HDRNA_16_K01 HDRNA_17_K01 HDRNA_19_K01 HDRNA_20_K01; do
            makeblastdb -in ../assembly/${sample}.fasta -dbtype nucl
    done
    
    for id in gyrB    fumC gltA icd    apsS sigB sarA agrC yycG    psm.β1 hlb    atlE sdrG sdrH ebh ebp tagB    capC sepA dltA fmtC lipA sceD SE0760    esp ecpA; do
    echo "mkdir ${id}"
    echo "for sample in in HDRNA_01_K01 HDRNA_03_K01 HDRNA_06_K01 HDRNA_07_K01 HDRNA_08_K01 HDRNA_12_K01 HDRNA_16_K01 HDRNA_17_K01 HDRNA_19_K01 HDRNA_20_K01; do"
            echo "blastn -db ../assembly/${sample}.fasta -query ${id}.fasta -evalue 1e-40 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 > ./${id}/\${sample}.blastn"
    echo "done"
    done
    
    #gyrB    fumC gltA icd    apsS sigB sarA agrC yycG    psm.β1 hlb    atlE sdrG sdrH ebh ebp tagB    capC sepA dltA fmtC lipA sceD SE0760    esp ecpA
    python ~/Scripts/process_directory.py gyrB typing.csv typing_until_gyrB.csv
    python ~/Scripts/process_directory.py fumC typing_until_gyrB.csv typing_until_fumC.csv
    python ~/Scripts/process_directory.py gltA typing_until_fumC.csv typing_until_gltA.csv
    python ~/Scripts/process_directory.py icd typing_until_gltA.csv typing_until_icd.csv
    
    python ~/Scripts/process_directory.py apsS typing_until_icd.csv typing_until_apsS.csv
    python ~/Scripts/process_directory.py sigB typing_until_apsS.csv typing_until_sigB.csv
    python ~/Scripts/process_directory.py sarA typing_until_sigB.csv typing_until_sarA.csv
    python ~/Scripts/process_directory.py agrC typing_until_sarA.csv typing_until_agrC.csv
    python ~/Scripts/process_directory.py yycG typing_until_agrC.csv typing_until_yycG.csv
    python ~/Scripts/process_directory.py psm.β1 typing_until_yycG.csv typing_until_psm.β1.csv
    python ~/Scripts/process_directory.py hlb typing_until_psm.β1.csv typing_until_hlb.csv
    python ~/Scripts/process_directory.py atlE typing_until_hlb.csv typing_until_atlE.csv
    python ~/Scripts/process_directory.py sdrG typing_until_atlE.csv typing_until_sdrG.csv
    python ~/Scripts/process_directory.py sdrH typing_until_sdrG.csv typing_until_sdrH.csv
    
    python ~/Scripts/process_directory.py ebh typing_until_sdrH.csv typing_until_ebh.csv
    python ~/Scripts/process_directory.py ebp typing_until_ebh.csv typing_until_ebp.csv
    python ~/Scripts/process_directory.py tagB typing_until_ebp.csv typing_until_tagB.csv
    python ~/Scripts/process_directory.py capC typing_until_tagB.csv typing_until_capC.csv
    python ~/Scripts/process_directory.py sepA typing_until_capC.csv typing_until_sepA.csv
    python ~/Scripts/process_directory.py dltA typing_until_sepA.csv typing_until_dltA.csv
    python ~/Scripts/process_directory.py fmtC typing_until_dltA.csv typing_until_fmtC.csv
    python ~/Scripts/process_directory.py lipA typing_until_fmtC.csv typing_until_lipA.csv
    
    python ~/Scripts/process_directory.py sceD typing_until_lipA.csv typing_until_sceD.csv
    python ~/Scripts/process_directory.py SE0760 typing_until_sceD.csv typing_until_SE0760.csv
    python ~/Scripts/process_directory.py esp typing_until_SE0760.csv typing_until_esp.csv
    python ~/Scripts/process_directory.py ecpA typing_until_esp.csv typing_until_ecpA.csv
  4. perform blastn searching for short-sequencing

    # -- makeblastdb --
    for sample in HDRNA_01_K01 HDRNA_01_K02 HDRNA_01_K03 HDRNA_01_K04 HDRNA_01_K05 HDRNA_01_K06 HDRNA_01_K07 HDRNA_01_K08 HDRNA_01_K09 HDRNA_01_K10 HDRNA_02_K01 HDRNA_03_K01 HDRNA_03_K02 HDRNA_03_K03 HDRNA_03_K04 HDRNA_03_K05 HDRNA_03_K06 HDRNA_03_K07 HDRNA_03_K08 HDRNA_03_K09 HDRNA_03_K10 HDRNA_04_K01 HDRNA_05_K01 HDRNA_06_K01 HDRNA_06_K02 HDRNA_06_K03 HDRNA_06_K04 HDRNA_06_K05 HDRNA_06_K06 HDRNA_06_K07 HDRNA_06_K08 HDRNA_06_K09 HDRNA_06_K10 HDRNA_07_K01 HDRNA_07_K02 HDRNA_07_K03 HDRNA_07_K04 HDRNA_07_K05 HDRNA_07_K06 HDRNA_07_K07 HDRNA_07_K08 HDRNA_07_K09 HDRNA_07_K10 HDRNA_08_K01 HDRNA_08_K02 HDRNA_08_K03 HDRNA_08_K04 HDRNA_08_K05 HDRNA_08_K06 HDRNA_08_K07 HDRNA_08_K08 HDRNA_08_K09 HDRNA_08_K10 HDRNA_10_K01 HDRNA_11_K01 HDRNA_12_K01 HDRNA_12_K02 HDRNA_12_K03 HDRNA_12_K04 HDRNA_12_K05 HDRNA_12_K06 HDRNA_12_K07 HDRNA_12_K08 HDRNA_12_K09 HDRNA_12_K10 HDRNA_16_K01 HDRNA_16_K02 HDRNA_16_K03 HDRNA_16_K04 HDRNA_16_K05 HDRNA_16_K06 HDRNA_16_K07 HDRNA_16_K08 HDRNA_16_K09 HDRNA_16_K10 HDRNA_17_K01 HDRNA_17_K02 HDRNA_17_K03 HDRNA_17_K04 HDRNA_17_K05 HDRNA_17_K06 HDRNA_17_K07 HDRNA_17_K08 HDRNA_17_K09 HDRNA_17_K10 HDRNA_19_K01 HDRNA_19_K02 HDRNA_19_K03 HDRNA_19_K04 HDRNA_19_K05 HDRNA_19_K06 HDRNA_19_K07 HDRNA_19_K08 HDRNA_19_K09 HDRNA_19_K10 HDRNA_20_K01 HDRNA_20_K02 HDRNA_20_K03 HDRNA_20_K04 HDRNA_20_K05 HDRNA_20_K06 HDRNA_20_K07 HDRNA_20_K08 HDRNA_20_K09; do
            makeblastdb -in ../Data_Holger_S.epidermidis_short/shovill/${sample}/contigs.fa -dbtype nucl
    done
    
    #Note: The current -evalue is set to 1e-1; it might need to be made more stringent.
    for id in gyrB    fumC gltA icd    apsS sigB sarA agrC yycG    psm.β1 hlb    atlE sdrG sdrH ebh ebp tagB    capC sepA dltA fmtC lipA sceD SE0760    esp ecpA; do
    echo "mkdir ${id}"
    echo "for sample in HDRNA_01_K01 HDRNA_01_K02 HDRNA_01_K03 HDRNA_01_K04 HDRNA_01_K05 HDRNA_01_K06 HDRNA_01_K07 HDRNA_01_K08 HDRNA_01_K09 HDRNA_01_K10 HDRNA_02_K01 HDRNA_03_K01 HDRNA_03_K02 HDRNA_03_K03 HDRNA_03_K04 HDRNA_03_K05 HDRNA_03_K06 HDRNA_03_K07 HDRNA_03_K08 HDRNA_03_K09 HDRNA_03_K10 HDRNA_04_K01 HDRNA_05_K01 HDRNA_06_K01 HDRNA_06_K02 HDRNA_06_K03 HDRNA_06_K04 HDRNA_06_K05 HDRNA_06_K06 HDRNA_06_K07 HDRNA_06_K08 HDRNA_06_K09 HDRNA_06_K10 HDRNA_07_K01 HDRNA_07_K02 HDRNA_07_K03 HDRNA_07_K04 HDRNA_07_K05 HDRNA_07_K06 HDRNA_07_K07 HDRNA_07_K08 HDRNA_07_K09 HDRNA_07_K10 HDRNA_08_K01 HDRNA_08_K02 HDRNA_08_K03 HDRNA_08_K04 HDRNA_08_K05 HDRNA_08_K06 HDRNA_08_K07 HDRNA_08_K08 HDRNA_08_K09 HDRNA_08_K10 HDRNA_10_K01 HDRNA_11_K01 HDRNA_12_K01 HDRNA_12_K02 HDRNA_12_K03 HDRNA_12_K04 HDRNA_12_K05 HDRNA_12_K06 HDRNA_12_K07 HDRNA_12_K08 HDRNA_12_K09 HDRNA_12_K10 HDRNA_16_K01 HDRNA_16_K02 HDRNA_16_K03 HDRNA_16_K04 HDRNA_16_K05 HDRNA_16_K06 HDRNA_16_K07 HDRNA_16_K08 HDRNA_16_K09 HDRNA_16_K10 HDRNA_17_K01 HDRNA_17_K02 HDRNA_17_K03 HDRNA_17_K04 HDRNA_17_K05 HDRNA_17_K06 HDRNA_17_K07 HDRNA_17_K08 HDRNA_17_K09 HDRNA_17_K10 HDRNA_19_K01 HDRNA_19_K02 HDRNA_19_K03 HDRNA_19_K04 HDRNA_19_K05 HDRNA_19_K06 HDRNA_19_K07 HDRNA_19_K08 HDRNA_19_K09 HDRNA_19_K10 HDRNA_20_K01 HDRNA_20_K02 HDRNA_20_K03 HDRNA_20_K04 HDRNA_20_K05 HDRNA_20_K06 HDRNA_20_K07 HDRNA_20_K08 HDRNA_20_K09; do"
            echo "blastn -db ../Data_Holger_S.epidermidis_short/shovill/\${sample}/contigs.fa -query ${id}.fasta -evalue 1e-1 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 > ./${id}/\${sample}.blastn"
    echo "done"
    done
    
    #gyrB    fumC gltA icd    apsS sigB sarA agrC yycG    psm.β1 hlb    atlE sdrG sdrH ebh ebp tagB    capC sepA dltA fmtC lipA sceD SE0760    esp ecpA
    python ~/Scripts/process_directory.py gyrB typing_104.csv typing_until_gyrB.csv
    python ~/Scripts/process_directory.py fumC typing_until_gyrB.csv typing_until_fumC.csv
    python ~/Scripts/process_directory.py gltA typing_until_fumC.csv typing_until_gltA.csv
    python ~/Scripts/process_directory.py icd typing_until_gltA.csv typing_until_icd.csv
    
    python ~/Scripts/process_directory.py apsS typing_until_icd.csv typing_until_apsS.csv
    python ~/Scripts/process_directory.py sigB typing_until_apsS.csv typing_until_sigB.csv
    python ~/Scripts/process_directory.py sarA typing_until_sigB.csv typing_until_sarA.csv
    python ~/Scripts/process_directory.py agrC typing_until_sarA.csv typing_until_agrC.csv
    python ~/Scripts/process_directory.py yycG typing_until_agrC.csv typing_until_yycG.csv
    python ~/Scripts/process_directory.py psm.β1 typing_until_yycG.csv typing_until_psm.β1.csv
    python ~/Scripts/process_directory.py hlb typing_until_psm.β1.csv typing_until_hlb.csv
    python ~/Scripts/process_directory.py atlE typing_until_hlb.csv typing_until_atlE.csv
    python ~/Scripts/process_directory.py sdrG typing_until_atlE.csv typing_until_sdrG.csv
    python ~/Scripts/process_directory.py sdrH typing_until_sdrG.csv typing_until_sdrH.csv
    
    python ~/Scripts/process_directory.py ebh typing_until_sdrH.csv typing_until_ebh.csv
    python ~/Scripts/process_directory.py ebp typing_until_ebh.csv typing_until_ebp.csv
    python ~/Scripts/process_directory.py tagB typing_until_ebp.csv typing_until_tagB.csv
    python ~/Scripts/process_directory.py capC typing_until_tagB.csv typing_until_capC.csv
    python ~/Scripts/process_directory.py sepA typing_until_capC.csv typing_until_sepA.csv
    python ~/Scripts/process_directory.py dltA typing_until_sepA.csv typing_until_dltA.csv
    python ~/Scripts/process_directory.py fmtC typing_until_dltA.csv typing_until_fmtC.csv
    python ~/Scripts/process_directory.py lipA typing_until_fmtC.csv typing_until_lipA.csv
    
    python ~/Scripts/process_directory.py sceD typing_until_lipA.csv typing_until_sceD.csv
    python ~/Scripts/process_directory.py SE0760 typing_until_sceD.csv typing_until_SE0760.csv
    python ~/Scripts/process_directory.py esp typing_until_SE0760.csv typing_until_esp.csv
    python ~/Scripts/process_directory.py ecpA typing_until_esp.csv typing_until_ecpA.csv
  5. plotTreeHeatmap

    ~/DATA/Data_PaulBongarts_S.epidermidis_HDRNA/plotTreeHeatmap
    
    #FastTree -gtr -nt variants/snippy.core_without_reference.aln > plotTreeHeatmap/snippy.core.tree
    
    #cp ../Data_Holger_S.epidermidis_short/plotTreeHeatmap/typing_104.csv .
    #cp ../Data_Holger_S.epidermidis_short/results_HDRNA_01-20/variants/snippy.core.aln_104.tree .
    #Run step 4
    
    library(ggtree)
    library(ggplot2)
    library(dplyr)
    setwd("/home/jhuang/DATA/Data_PaulBongarts_S.epidermidis_HDRNA/plotTreeHeatmap_short/")
    # -- edit tree --
    info <- read.csv("typing_until_ecpA.csv", sep="\t")
    
    # Convert 'ST' column to factor with levels in the desired order
    info$ST <- factor(info$ST, levels = c("5", "23", "35", "69", "86", "87", "130", "224", "487", "640", "none"))
    
    info$name <- info$Isolate
    tree <- read.tree("snippy.core.aln_104.tree")
    cols <- c("5"="darkgreen", "23"="cornflowerblue", "35"="green","69"="orange","86"="magenta","87"="brown", "130"="purple","224"="darkkhaki", "487"="cyan", "640"="pink","none"="grey")
    #"2"="",
    #"7"="seagreen3",
    #"9"="tan","14"="red",  "17"="navyblue",
    #"73"="pink","81"="purple",
    #"89"="darksalmon",
    #"454"="blue","487"="cyan", "558"="skyblue2", "766"="blueviolet"
    #"5", "23", "35", "69", "86", "87", "130", "224", "487", "640", "none"
    
    heatmapData2 <- info %>% select(Isolate,          gyrB,    fumC, gltA, icd,    apsS, sigB, sarA, agrC, yycG,    psm.β1, hlb,    atlE, sdrG, sdrH, ebh, ebp, tagB,    capC, sepA, dltA, fmtC, lipA, sceD, SE0760,    esp, ecpA)  #ST,
    rn <- heatmapData2$Isolate
    heatmapData2$Isolate <- NULL
    heatmapData2 <- as.data.frame(sapply(heatmapData2, as.character))
    rownames(heatmapData2) <- rn
    
    #heatmap.colours <- c("darkred", "darkblue", "darkgreen", "grey")
    #names(heatmap.colours) <- c("MT880870","MT880871","MT880872","-")
    #heatmap.colours <- c("cornflowerblue","darkgreen","seagreen3","tan","red",  "navyblue", "gold",     "green","orange","pink","purple","magenta","brown", "darksalmon","chocolate4","darkkhaki", "azure3", "maroon","lightgreen",     "blue","cyan", "skyblue2", "blueviolet",       "darkred", "darkblue", "darkgreen", "grey")
    #names(heatmap.colours) <- c("2","5","7","9","14", "17","23",   "35","59","73", "81","86","87","89","130","190","290", "297","325",    "454","487","558","766",       "MT880870","MT880871","MT880872","-")
    
    #TEMP_DEACTIVATED!  heatmap.colours <- c("cornflowerblue","darkgreen","seagreen3","tan","red",  "navyblue", "purple",     "green","cyan",       "darkred", "darkblue", "darkgreen", "grey",  "darkgreen", "grey")
    #TEMP_DEACTIVATED!  names(heatmap.colours) <- c("SCCmec_type_II(2A)", "SCCmec_type_III(3A)", "SCCmec_type_III(3A) and SCCmec_type_VIII(4A)", "SCCmec_type_IV(2B)", "SCCmec_type_IV(2B&5)", "SCCmec_type_IV(2B) and SCCmec_type_VI(4B)", "SCCmec_type_IVa(2B)", "SCCmec_type_IVb(2B)", "SCCmec_type_IVg(2B)",  "I", "II", "III", "none", "+","-")
    
    ## The heatmap colours should correspond to the factor levels
    #heatmap.colours <- #setNames(c("cornflowerblue","darkgreen","seagreen3","tan","red","green","orange","pink","brown","magenta","cyan",
    #"magenta","navyblue","cornflowerblue","gold","orange","darkgreen", "green", "seagreen3", #"chocolate4","brown","purple","pink", "tan","cyan","red"),
    #c("5", "23", "35", "69", "86", "87", "130", "224", "487", "640", "none", "P01", "P02", "P03", "P04", "P05", "P06", "P07", #"P08", "P10", "P11", "P12", "P16", "P17", "P19", "P20"))
    
    #mydat$Regulation <- factor(mydat$Regulation, levels=c("up","down"))
    #circular
    p <- ggtree(tree, layout='circular', branch.length='none') %<+% info + geom_tippoint(aes(color=ST)) + scale_color_manual(values=cols) + geom_tiplab2(aes(label=name), offset=1)
    png("ggtree.png", width=1260, height=1260)
    #svg("ggtree.svg", width=1260, height=1260)
    p
    dev.off()
    #gheatmap(p, heatmapData2, width=0.1, colnames_position="top", colnames_angle=90, colnames_offset_y = 0.1, hjust=0.5, font.size=4, offset = 5) + scale_fill_manual(values=heatmap.colours) +  theme(legend.text = element_text(size = 14)) + theme(legend.title = element_text(size = 14)) + guides(fill=guide_legend(title=""), color = guide_legend(override.aes = list(size = 5)))
    png("ggtree_and_gheatmap_selected_genes.png", width=1690, height=1400)
    #svg("ggtree_and_gheatmap_mibi_selected_genes.svg", width=17, height=15)
    gheatmap(p, heatmapData2, width=2, colnames_position="top", colnames_angle=90, colnames_offset_y = 2.0, hjust=0.7, font.size=4, offset = 17) + scale_fill_manual(values=heatmap.colours) +  theme(legend.text = element_text(size = 16)) + theme(legend.title = element_text(size = 16)) + guides(fill=guide_legend(title=""), color = guide_legend(override.aes = list(size = 5)))
    dev.off()
  6. Response

    As mentioned earlier, the presence-absence matrix of genes was generated using a BLASTn comparison between the genome and the gene sequences. Ensuring the accuracy of these gene sequences is crucial for reliable results. I prepared the sequences by searching GeneBank using the gene names, but there may be some ambiguity where the same gene name corresponds to different sequences.
    
    Attached, you will find the sequences for the 11 genes that were not identified in the isolates: gltA, psm-beta1, hlb, ebp, tagB, capA, sepA, fmtC, sceD, esp, and ecpA. I think it would be helpful if you could verify whether these sequences correspond to the genes we are looking for.
    
    In addition, I have attached a figure showing all samples with short-read sequencing. As illustrated in the plot, SCCmec and agr typing are still pending. If necessary, I can perform the typing and include the information in the plot. Alternatively, we could focus exclusively on the 10 samples obtained through long-read sequencing.
    
    Please let me know how you would prefer to proceed.

RNA-seq Tam on Acinetobacter baumannii strain ATCC 19606 CP059040.1 (Data_Tam_RNAseq_2024)

Urine_vs_MHB

AUM_vs_MHB

http://xgenes.com/article/article-content/209/rna-seq-skin-organoids-on-grch38-chrhsv1-final/ http://xgenes.com/article/article-content/157/prepare-virus-gtf-for-nextflow-run/

Methods

Data was processed using nf-core/rnaseq v3.12.0 (doi: https://doi.org/10.5281/zenodo.1400710) of the nf-core collection of workflows (Ewels et al., 2020).

The pipeline was executed with Nextflow v22.10.5 (Di Tommaso et al., 2017) with the following command:

nextflow run rnaseq/main.nf –input samplesheet.csv –outdir results –fasta /home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040.fasta –gff /home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040.gff -profile docker -resume –max_cpus 55 –max_memory 512.GB –max_time 2400.h –save_align_intermeds –save_unaligned –save_reference –aligner star_salmon –gtf_group_features gene_id –gtf_extra_attributes gene_name –featurecounts_group_type gene_biotype –featurecounts_feature_type transcript

  1. Preparing raw data

    They are wildtype strains grown in different medium.
    AUM - artificial urine medium
    Urine - human urine
    MHB - Mueller-Hinton broth
    AUM(人工尿液培养基):pH值、营养成分、无菌性、渗透压、温度、污染物。
    Urine(人类尿液):pH值、比重、温度、污染物、化学成分、微生物负荷。
    MHB(Mueller-Hinton培养基):pH值、无菌性、营养成分、温度、渗透压、抗生素浓度。
    
    mkdir raw_data; cd raw_data
    ln -s ../X101SC24105589-Z01-J001/01.RawData/AUM-1/AUM-1_1.fq.gz AUM_r1_R1.fq.gz
    ln -s ../X101SC24105589-Z01-J001/01.RawData/AUM-1/AUM-1_2.fq.gz AUM_r1_R2.fq.gz
    ln -s ../X101SC24105589-Z01-J001/01.RawData/AUM-2/AUM-2_1.fq.gz AUM_r2_R1.fq.gz
    ln -s ../X101SC24105589-Z01-J001/01.RawData/AUM-2/AUM-2_2.fq.gz AUM_r2_R2.fq.gz
    ln -s ../X101SC24105589-Z01-J001/01.RawData/AUM-3/AUM-3_1.fq.gz AUM_r3_R1.fq.gz
    ln -s ../X101SC24105589-Z01-J001/01.RawData/AUM-3/AUM-3_2.fq.gz AUM_r3_R2.fq.gz
    
    ln -s ../X101SC24105589-Z01-J001/01.RawData/MHB-1/MHB-1_1.fq.gz MHB_r1_R1.fq.gz
    ln -s ../X101SC24105589-Z01-J001/01.RawData/MHB-1/MHB-1_2.fq.gz MHB_r1_R2.fq.gz
    ln -s ../X101SC24105589-Z01-J001/01.RawData/MHB-2/MHB-2_1.fq.gz MHB_r2_R1.fq.gz
    ln -s ../X101SC24105589-Z01-J001/01.RawData/MHB-2/MHB-2_2.fq.gz MHB_r2_R2.fq.gz
    ln -s ../X101SC24105589-Z01-J001/01.RawData/MHB-3/MHB-3_1.fq.gz MHB_r3_R1.fq.gz
    ln -s ../X101SC24105589-Z01-J001/01.RawData/MHB-3/MHB-3_2.fq.gz MHB_r3_R2.fq.gz
    
    ln -s ../X101SC24105589-Z01-J001/01.RawData/Urine-1/Urine-1_1.fq.gz Urine_r1_R1.fq.gz
    ln -s ../X101SC24105589-Z01-J001/01.RawData/Urine-1/Urine-1_2.fq.gz Urine_r1_R2.fq.gz
    ln -s ../X101SC24105589-Z01-J001/01.RawData/Urine-2/Urine-2_1.fq.gz Urine_r2_R1.fq.gz
    ln -s ../X101SC24105589-Z01-J001/01.RawData/Urine-2/Urine-2_2.fq.gz Urine_r2_R2.fq.gz
    ln -s ../X101SC24105589-Z01-J001/01.RawData/Urine-3/Urine-3_1.fq.gz Urine_r3_R1.fq.gz
    ln -s ../X101SC24105589-Z01-J001/01.RawData/Urine-3/Urine-3_2.fq.gz Urine_r3_R2.fq.gz
  2. (Optional) using trinity to find the most closely reference

    Trinity --seqType fq --max_memory 50G --left trimmed/wt_r1_R1.fastq.gz  --right trimmed/wt_r1_R2.fastq.gz --CPU 12
    
    #https://www.genome.jp/kegg/tables/br08606.html#prok
    acb     KGB     Acinetobacter baumannii ATCC 17978  2007    GenBank
    abm     KGB     Acinetobacter baumannii SDF     2008    GenBank
    aby     KGB     Acinetobacter baumannii AYE     2008    GenBank
    abc     KGB     Acinetobacter baumannii ACICU   2008    GenBank
    abn     KGB     Acinetobacter baumannii AB0057  2008    GenBank
    abb     KGB     Acinetobacter baumannii AB307-0294  2008    GenBank
    abx     KGB     Acinetobacter baumannii 1656-2  2012    GenBank
    abz     KGB     Acinetobacter baumannii MDR-ZJ06    2012    GenBank
    abr     KGB     Acinetobacter baumannii MDR-TJ  2012    GenBank
    abd     KGB     Acinetobacter baumannii TCDC-AB0715     2012    GenBank
    abh     KGB     Acinetobacter baumannii TYTH-1  2012    GenBank
    abad    KGB     Acinetobacter baumannii D1279779    2013    GenBank
    abj     KGB     Acinetobacter baumannii BJAB07104   2013    GenBank
    abab    KGB     Acinetobacter baumannii BJAB0715    2013    GenBank
    abaj    KGB     Acinetobacter baumannii BJAB0868    2013    GenBank
    abaz    KGB     Acinetobacter baumannii ZW85-1  2013    GenBank
    abk     KGB     Acinetobacter baumannii AbH12O-A2   2014    GenBank
    abau    KGB     Acinetobacter baumannii AB030   2014    GenBank
    abaa    KGB     Acinetobacter baumannii AB031   2014    GenBank
    abw     KGB     Acinetobacter baumannii AC29    2014    GenBank
    abal    KGB     Acinetobacter baumannii LAC-4   2015    GenBank
    #Note that the Acinetobacter baumannii strain ATCC 19606 chromosome, complete genome (GenBank: CP059040.1) was choosen as reference!
  3. Downloading CP059040.fasta and CP059040.gff from GenBank

  4. (Optional) Preparing CP059040.fasta, CP059040_gene.gff3 and CP059040.bed

    #Reference genome: https://www.ncbi.nlm.nih.gov/nuccore/CP059040
    cp /media/jhuang/Elements2/Data_Tam_RNASeq3/CP059040.fasta .     # Elements (Anna C.arnes)
    cp /media/jhuang/Elements2/Data_Tam_RNASeq3/CP059040_gene.gff3 .
    cp /media/jhuang/Elements2/Data_Tam_RNASeq3/CP059040_gene.gtf .
    cp /media/jhuang/Elements2/Data_Tam_RNASeq3/CP059040.bed .
    rsync -a -P CP059040.fasta jhuang@hamm:~/DATA/Data_Tam_RNAseq_2024/
    rsync -a -P CP059040_gene.gff3 jhuang@hamm:~/DATA/Data_Tam_RNAseq_2024/
    rsync -a -P CP059040.bed jhuang@hamm:~/DATA/Data_Tam_RNAseq_2024/
    (base) jhuang@WS-2290C:/media/jhuang/Elements2/Data_Tam_RNASeq3$ find . -name "CP059040*"
    ./CP059040.fasta
    ./CP059040.bed
    ./CP059040.gb
    ./CP059040.gff3
    ./CP059040.gff3_backup
    ./CP059040_full.gb
    ./CP059040_gene.gff3
    ./CP059040_gene.gtf
    ./CP059040_gene_old.gff3
    ./CP059040_rRNA.gff3
    ./CP059040_rRNA_v.gff3
    
    # ---- REF: Acinetobacter baumannii ATCC 17978 (DEBUG, gene_name failed) ----
    #gffread -E -F -T GCA_000015425.1_ASM1542v1_genomic.gff -o GCA_000015425.1_ASM1542v1_genomic.gtf_
    #grep "CDS" GCA_000015425.1_ASM1542v1_genomic.gtf_ > GCA_000015425.1_ASM1542v1_genomic.gtf
    #sed -i -e "s/\tCDS\t/\texon\t/g" GCA_000015425.1_ASM1542v1_genomic.gtf
    #gffread -E -F --bed GCA_000015425.1_ASM1542v1_genomic.gtf -o GCA_000015425.1_ASM1542v1_genomic.bed
    
    grep "locus_tag" GCA_000015425.1_ASM1542v1_genomic.gtf_ > GCA_000015425.1_ASM1542v1_genomic.gtf
    sed -i -e "s/\ttranscript\t/\texon\t/g" GCA_000015425.1_ASM1542v1_genomic.gtf # or using fc_count_type=transcript
    sed -i -e "s/\tgene_name\t/\tName\t/g" GCA_000015425.1_ASM1542v1_genomic.gtf
    gffread -E -F --bed GCA_000015425.1_ASM1542v1_genomic.gtf -o GCA_000015425.1_ASM1542v1_genomic.bed
    #grep "gene_name" GCA_000015425.1_ASM1542v1_genomic.gtf | wc -l  #69=3887-3803
    
    cp CP059040.gff3 CP059040_backup.gff3
    sed -i -e "s/\tGenbank\tgene\t/\tGenbank_gene\t/g" CP059040.gff3
    grep "Genbank_gene" CP059040.gff3 > CP059040_gene.gff3
    sed -i -e "s/\tGenbank_gene\t/\tGenbank\tgene\t/g" CP059040_gene.gff3
    
    #3796-3754=42--> they are pseudogene since grep "pseudogene" CP059040.gff3 | wc -l = 42
    # --------------------------------------------------------------------------------------------------------------------------------------------------
    # ---------- PREPARING gff3 file including gene_biotype=protein_coding+gene_biotype=tRNA = total(3754)) and gene_biotype=pseudogene(42) ------------
    cp CP059040.gff3 CP059040_backup.gff3
    sed -i -e "s/\tGenbank\tgene\t/\tGenbank_gene\t/g" CP059040.gff3
    grep "Genbank_gene" CP059040.gff3 > CP059040_gene.gff3
    sed -i -e "s/\tGenbank_gene\t/\tGenbank\tgene\t/g" CP059040_gene.gff3
    grep "gene_biotype=pseudogene" CP059040.gff3_backup >> CP059040_gene.gff3    #-->3796
    
    #The whole point of the GTF format was to standardise certain aspects that are left open in GFF. Hence, there are many different valid ways to encode the same information in a valid GFF format, and any parser or converter needs to be written specifically for the choices the author of the GFF file made. For example, a GTF file requires the gene ID attribute to be called "gene_id", while in GFF files, it may be "ID", "Gene", something different, or completely missing.
    # from gff3 to gtf
    sed -i -e "s/\tID=gene-/\tgene_id \"/g" CP059040_gene.gtf
    sed -i -e "s/;/\"; /g" CP059040_gene.gtf
    sed -i -e "s/=/=\"/g" CP059040_gene.gtf
    
    #sed -i -e "s/\n/\"\n/g" CP059040_gene.gtf
    #using editor instead!
    
    #The following is GTF-format.
    CP000521.1      Genbank exon    95      1492    .       +       .       transcript_id "gene0"; gene_id "gene0"; Name "A1S_0001"; gbkey "Gene"; gene_biotype "protein_coding"; locus_tag "A1S_0001";
    
    #NZ_MJHA01000001.1       RefSeq  region  1       8663    .       +       .       ID=id0;Dbxref=taxon:575584;Name=unnamed1;collected-by=IG Schaub;collection-date=1948;country=USA: Vancouver;culture-collection=ATCC:19606;gbkey=Src;genome=plasmid;isolation-source=urine;lat-lon=37.53 N 75.4 W;map=unlocalized;mol_type=genomic DNA;nat-host=Homo sapiens;plasmid-name=unnamed1;strain=ATCC 19606;type-material=type strain of Acinetobacter baumannii
    #NZ_MJHA01000001.1       RefSeq  gene    228     746     .       -       .       ID=gene0;Name=BIT33_RS00005;gbkey=Gene;gene_biotype=protein_coding;locus_tag=BIT33_RS00005;old_locus_tag=BIT33_18795
    #NZ_MJHA01000001.1       Protein Homology        CDS     228     746     .       -       0       ID=cds0;Parent=gene0;Dbxref=Genbank:WP_000839337.1;Name=WP_000839337.1;gbkey=CDS;inference=COORDINATES: similar to AA sequence:RefSeq:WP_000839337.1;product=hypothetical protein;protein_id=WP_000839337.1;transl_table=11
    
    ##gff-version 3
    ##sequence-region CP059040.1 1 3980852
    ##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=470
    
    gffread -E -F --bed CP059040.gff3 -o CP059040.bed    #-->3796
    ##prepare the GTF-format (see above) --> ERROR! ----> using CP059040.gff3
    ##stringtie adeIJ.abx_r1.sorted.bam -o adeIJ.abx_r1.sorted_transcripts.gtf -v -G /media/jhuang/Elements/Data_Tam_RNASeq3/CP059040.gff3 -A adeIJ.abx_r1.sorted.gene_abund.txt -C adeIJ.abx_r1.sorted.bam.cov_refs.gtf -e -b adeIJ.abx_r1.sorted_ballgown
    #[01/21 10:57:46] Loading reference annotation (guides)..
    #GFF warning: merging adjacent/overlapping segments of gene-H0N29_00815 on CP059040.1 (179715-179786, 179788-180810)
    #[01/21 10:57:46] 3796 reference transcripts loaded.
    #Default stack size for threads: 8388608
    #WARNING: no reference transcripts found for genomic sequence "gi|1906906720|gb|CP059040.1|"! (mismatched reference names?)
    #WARNING: no reference transcripts were found for the genomic sequences where reads were mapped!
    #Please make sure the -G annotation file uses the same naming convention for the genome sequences.
    #[01/21 10:58:30] All threads finished.
    
    #  ERROR: failed to find the gene identifier attribute in the 9th column of the provided GTF file.
    #  The specified gene identifier attribute is 'Name'
    #  An example of attributes included in your GTF annotation is 'ID=exon-H0N29_00075-1;Parent=rna-H0N29_00075;gbkey=rRNA;locus_tag=H0N29_00075;product=16S ribosomal RNA'
    #  The program has to termin
    
    #  ERROR: failed to find the gene identifier attribute in the 9th column of the provided GTF file.
    #  The specified gene identifier attribute is 'gene_biotype'
    #  An example of attributes included in your GTF annotation is 'ID=exon-H0N29_00075-1;Parent=rna-H0N29_00075;gbkey=rRNA;locus_tag=H0N29_00075;product=16S ribosomal RNA'
    #  The program has to terminate.
    
    #grep "ID=cds-" CP059040.gff3 | wc -l
    #grep "ID=exon-" CP059040.gff3 | wc -l
    #grep "ID=gene-" CP059040.gff3 | wc -l   #the same as H0N29_18980/5=3796
    grep "gbkey=" CP059040.gff3 | wc -l  7695
    grep "ID=id-" CP059040.gff3 | wc -l  5
    grep "locus_tag=" CP059040.gff3 | wc -l    7689
    #...
    cds   3701                             locus_tag=xxxx, no gene_biotype
    exon   96                              locus_tag=xxxx, no gene_biotype
    gene   3796                            locus_tag=xxxx, gene_biotype=xxxx,
    id  (riboswitch+direct_repeat,5)       both no --> ignoring them!!  # grep "ID=id-" CP059040.gff3
    rna    96                              locus_tag=xxxx, no gene_biotype
    ------------------
        7694
    
    cp CP059040.gff3_backup CP059040.gff3
    grep "^##" CP059040.gff3 > CP059040_gene.gff3
    grep "ID=gene" CP059040.gff3 >> CP059040_gene.gff3
    #!!!!VERY_IMPORTANT!!!!: change type '\tCDS\t' to '\texon\t'!
    sed -i -e "s/\tgene\t/\texon\t/g" CP059040_gene.gff3
  5. Preparing the directory trimmed

    mkdir trimmed trimmed_unpaired;
    for sample_id in AUM_r1 AUM_r2 AUM_r3 Urine_r1 Urine_r2 Urine_r3 MHB_r1 MHB_r2 MHB_r3; do \
    for sample_id in MHB_r1 MHB_r2 MHB_r3; do \
            java -jar /home/jhuang/Tools/Trimmomatic-0.36/trimmomatic-0.36.jar PE -threads 100 raw_data/${sample_id}_R1.fq.gz raw_data/${sample_id}_R2.fq.gz trimmed/${sample_id}_R1.fq.gz trimmed_unpaired/${sample_id}_R1.fq.gz trimmed/${sample_id}_R2.fq.gz trimmed_unpaired/${sample_id}_R2.fq.gz ILLUMINACLIP:/home/jhuang/Tools/Trimmomatic-0.36/adapters/TruSeq3-PE-2.fa:2:30:10:8:TRUE LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36 AVGQUAL:20; done 2> trimmomatic_pe.log;
    done
  6. Preparing samplesheet.csv

    sample,fastq_1,fastq_2,strandedness
    AUM_r1,AUM_r1_R1.fq.gz,AUM_r1_R2.fq.gz,auto
    AUM_r2,AUM_r2_R1.fq.gz,AUM_r2_R2.fq.gz,auto
    AUM_r3,AUM_r3_R1.fq.gz,AUM_r3_R2.fq.gz,auto
    MHB_r1,MHB_r1_R1.fq.gz,MHB_r1_R2.fq.gz,auto
    MHB_r2,MHB_r2_R1.fq.gz,MHB_r2_R2.fq.gz,auto
    MHB_r3,MHB_r3_R1.fq.gz,MHB_r3_R2.fq.gz,auto
    Urine_r1,Urine_r1_R1.fq.gz,Urine_r1_R2.fq.gz,auto
    Urine_r2,Urine_r2_R1.fq.gz,Urine_r2_R2.fq.gz,auto
    Urine_r3,Urine_r3_R1.fq.gz,Urine_r3_R2.fq.gz,auto
  7. nextflow run

    #Example1: http://xgenes.com/article/article-content/157/prepare-virus-gtf-for-nextflow-run/
    
    docker pull nfcore/rnaseq
    ln -s /home/jhuang/Tools/nf-core-rnaseq-3.12.0/ rnaseq
    
    #Default: --gtf_group_features 'gene_id'  --gtf_extra_attributes 'gene_name' --featurecounts_group_type 'gene_biotype' --featurecounts_feature_type 'exon'
    #(host_env) !NOT_WORKING! jhuang@WS-2290C:~/DATA/Data_Tam_RNAseq_2024$ /usr/local/bin/nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results    --fasta "/home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040.fasta" --gff "/home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040.gff"        -profile docker -resume  --max_cpus 55 --max_memory 512.GB --max_time 2400.h    --save_align_intermeds --save_unaligned --save_reference    --aligner 'star_salmon'    --gtf_group_features 'gene_id'  --gtf_extra_attributes 'gene_name' --featurecounts_group_type 'gene_biotype' --featurecounts_feature_type 'transcript'
    
    # -- DEBUG_1 (CDS --> exon in CP059040.gff) --
    #Checking the record (see below) in results/genome/CP059040.gtf
    #In ./results/genome/CP059040.gtf e.g. "CP059040.1      Genbank transcript      1       1398    .       +       .       transcript_id "gene-H0N29_00005"; gene_id "gene-H0N29_00005"; gene_name "dnaA"; Name "dnaA"; gbkey "Gene"; gene "dnaA"; gene_biotype "protein_coding"; locus_tag "H0N29_00005";"
    #--featurecounts_feature_type 'transcript' returns only the tRNA results
    #Since the tRNA records have "transcript and exon". In gene records, we have "transcript and CDS". replace the CDS with exon
    
    grep -P "\texon\t" CP059040.gff | sort | wc -l    #96
    grep -P "cmsearch\texon\t" CP059040.gff | wc -l    #=10  ignal recognition particle sRNA small typ, transfer-messenger RNA, 5S ribosomal RNA
    grep -P "Genbank\texon\t" CP059040.gff | wc -l    #=12  16S and 23S ribosomal RNA
    grep -P "tRNAscan-SE\texon\t" CP059040.gff | wc -l    #tRNA 74
    wc -l star_salmon/AUM_r3/quant.genes.sf  #--featurecounts_feature_type 'transcript' results in 96 records!
    
    grep -P "\tCDS\t" CP059040.gff | wc -l  #3701
    sed 's/\tCDS\t/\texon\t/g' CP059040.gff > CP059040_m.gff
    grep -P "\texon\t" CP059040_m.gff | sort | wc -l  #3797
    
    # -- DEBUG_2: combination of 'CP059040_m.gff' and 'exon' results in ERROR, using 'transcript' instead!
    --gff "/home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040_m.gff" --featurecounts_feature_type 'transcript'
    
    # ---- SUCCESSFUL with directly downloaded gff3 and fasta from NCBI using docker after replacing 'CDS' with 'exon' ----
    (host_env) /usr/local/bin/nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results    --fasta "/home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040.fasta" --gff "/home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040_m.gff"        -profile docker -resume  --max_cpus 55 --max_memory 512.GB --max_time 2400.h    --save_align_intermeds --save_unaligned --save_reference    --aligner 'star_salmon'    --gtf_group_features 'gene_id'  --gtf_extra_attributes 'gene_name' --featurecounts_group_type 'gene_biotype' --featurecounts_feature_type 'transcript'
    
    # -- DEBUG_3: make sure the header of fasta is the same to the *_m.gff file
  8. Import data and pca-plot

    #mamba activate r_env
    
    #install.packages("ggfun")
    # Import the required libraries
    library("AnnotationDbi")
    library("clusterProfiler")
    library("ReactomePA")
    library(gplots)
    library(tximport)
    library(DESeq2)
    #library("org.Hs.eg.db")
    library(dplyr)
    library(tidyverse)
    #install.packages("devtools")
    #devtools::install_version("gtable", version = "0.3.0")
    library(gplots)
    library("RColorBrewer")
    #install.packages("ggrepel")
    library("ggrepel")
    # install.packages("openxlsx")
    library(openxlsx)
    library(EnhancedVolcano)
    library(DESeq2)
    
    setwd("~/DATA/Data_Tam_RNAseq_2024/results/star_salmon")
    # Define paths to your Salmon output quantification files
    files <- c("AUM_r1" = "./AUM_r1/quant.sf",
            "AUM_r2" = "./AUM_r2/quant.sf",
            "AUM_r3" = "./AUM_r3/quant.sf",
            "Urine_r1" = "./Urine_r1/quant.sf",
            "Urine_r2" = "./Urine_r2/quant.sf",
            "Urine_r3" = "./Urine_r3/quant.sf",
            "MHB_r1" = "./MHB_r1/quant.sf",
            "MHB_r2" = "./MHB_r2/quant.sf",
            "MHB_r3" = "./MHB_r3/quant.sf")
    # Import the transcript abundance data with tximport
    txi <- tximport(files, type = "salmon", txIn = TRUE, txOut = TRUE)
    # Define the replicates and condition of the samples
    replicate <- factor(c("r1", "r2", "r3", "r1", "r2", "r3", "r1", "r2", "r3"))
    condition <- factor(c("AUM","AUM","AUM", "Urine","Urine","Urine", "MHB","MHB","MHB"))
    # Define the colData for DESeq2
    colData <- data.frame(condition=condition, replicate=replicate, row.names=names(files))
    # -- transcript-level count data (x2) --
    # Create DESeqDataSet object
    dds <- DESeqDataSetFromTximport(txi, colData=colData, design=~condition)
    write.csv(counts(dds), file="transcript_counts.csv")
    # -- gene-level count data (x2) --
    # Read in the tx2gene map from salmon_tx2gene.tsv
    tx2gene <- read.table("salmon_tx2gene.tsv", header=FALSE, stringsAsFactors=FALSE)
    # Set the column names
    colnames(tx2gene) <- c("transcript_id", "gene_id", "gene_name")
    # Remove the gene_name column if not needed
    tx2gene <- tx2gene[,1:2]
    # Import and summarize the Salmon data with tximport
    txi <- tximport(files, type = "salmon", tx2gene = tx2gene, txOut = FALSE)
    # Continue with the DESeq2 workflow as before...
    colData <- data.frame(condition=condition, replicate=replicate, row.names=names(files))
    dds <- DESeqDataSetFromTximport(txi, colData=colData, design=~condition+replicate)
    #dds <- dds[rowSums(counts(dds) > 3) > 2, ]    #3796->3487
    write.csv(counts(dds, normalized=FALSE), file="gene_counts.csv")
    dim(counts(dds))
    head(counts(dds), 10)
    rld <- rlogTransformation(dds)
    
    #We don't need to run DESeq(dds) before estimateSizeFactors(dds). In fact, the typical workflow in DESeq2 is the opposite: we usually run estimateSizeFactors(dds) (and other preprocessing functions) before running the main DESeq(dds) function.
    #The estimateSizeFactors function is used to calculate size factors for normalization, which corrects for differences in library size (i.e., the number of read counts) between samples. This normalization step is crucial to ensure that differences in gene expression aren't merely due to differences in sequencing depth between samples.
    #The DESeq function, on the other hand, performs the main differential expression analysis, comparing gene expression between different conditions or groups.
    #So, the typical workflow is:
    #  - Create the DESeqDataSet object.
    #  - Use estimateSizeFactors to normalize for library size.
    #  - (Optionally, estimate dispersion with estimateDispersions if not using the full DESeq function later.)
    #  - Use DESeq for the differential expression analysis.
    #  - However, it's worth noting that if you run the main DESeq function directly after creating the DESeqDataSet object, it will automatically perform the normalization (using estimateSizeFactors) and dispersion estimation steps for you. In that case, there's no need to run estimateSizeFactors separately before DESeq.
    
    # draw simple pca and heatmap
    #mat <- assay(rld)
    #mm <- model.matrix(~condition, colData(rld))
    #mat <- limma::removeBatchEffect(mat, batch=rld$batch, design=mm)
    #assay(rld) <- mat
    # -- pca --
    png("pca.png", 1200, 800)
    plotPCA(rld, intgroup=c("condition"))
    dev.off()
    # -- heatmap --
    png("heatmap.png", 1200, 800)
    distsRL <- dist(t(assay(rld)))
    mat <- as.matrix(distsRL)
    hc <- hclust(distsRL)
    hmcol <- colorRampPalette(brewer.pal(9,"GnBu"))(100)
    heatmap.2(mat, Rowv=as.dendrogram(hc),symm=TRUE, trace="none",col = rev(hmcol), margin=c(13, 13))
    dev.off()
  9. (Optional (ERROR–>need to be debugged!) ) estimate size factors and dispersion values.

    #Size Factors: These are used to normalize the read counts across different samples. The size factor for a sample accounts for differences in sequencing depth (i.e., the total number of reads) and other technical biases between samples. After normalization with size factors, the counts should be comparable across samples. Size factors are usually calculated in a way that they reflect the median or mean ratio of gene expression levels between samples, assuming that most genes are not differentially expressed.
    #Dispersion: This refers to the variability or spread of gene expression measurements. In RNA-seq data analysis, each gene has its own dispersion value, which reflects how much the counts for that gene vary between different samples, more than what would be expected just due to the Poisson variation inherent in counting. Dispersion is important for accurately modeling the data and for detecting differentially expressed genes.
    #So in summary, size factors are specific to samples (used to make counts comparable across samples), and dispersion values are specific to genes (reflecting variability in gene expression).
    
    sizeFactors(dds)
    #NULL
    # Estimate size factors
    dds <- estimateSizeFactors(dds)
    # Estimate dispersions
    dds <- estimateDispersions(dds)
    #> sizeFactors(dds)
    
    #control_r1 control_r2  HSV.d2_r1  HSV.d2_r2  HSV.d4_r1  HSV.d4_r2  HSV.d6_r1
    #2.3282468  2.0251928  1.8036883  1.3767551  0.9341929  1.0911693  0.5454526
    #HSV.d6_r2  HSV.d8_r1  HSV.d8_r2
    #0.4604461  0.5799834  0.6803681
    
    # (DEBUG) If avgTxLength is Necessary
    #To simplify the computation and ensure sizeFactors are calculated:
    assays(dds)$avgTxLength <- NULL
    dds <- estimateSizeFactors(dds)
    sizeFactors(dds)
    #If you want to retain avgTxLength but suspect it is causing issues, you can explicitly instruct DESeq2 to compute size factors without correcting for library size with average transcript lengths:
    dds <- estimateSizeFactors(dds, controlGenes = NULL, use = FALSE)
    sizeFactors(dds)
    
    # If alone with virus data, the following BUG occured:
    #Still NULL --> BUG --> using manual calculation method for sizeFactor calculation!
                        HeLa_TO_r1                      HeLa_TO_r2
                        0.9978755                       1.1092227
    data.frame(genes = rownames(dds), dispersions = dispersions(dds))
    
    #Given the raw counts, the control_r1 and control_r2 samples seem to have a much lower sequencing depth (total read count) than the other samples. Therefore, when normalization methods are applied, the normalization factors for these control samples will be relatively high, boosting the normalized counts.
    1/0.9978755=1.002129023
    1/1.1092227=
    #bamCoverage --bam ../markDuplicates/${sample}Aligned.sortedByCoord.out.bam -o ${sample}_norm.bw --binSize 10 --scaleFactor  --effectiveGenomeSize 2864785220
    bamCoverage --bam ../markDuplicates/HeLa_TO_r1Aligned.sortedByCoord.out.markDups.bam -o HeLa_TO_r1.bw --binSize 10 --scaleFactor 1.002129023     --effectiveGenomeSize 2864785220
    bamCoverage --bam ../markDuplicates/HeLa_TO_r2Aligned.sortedByCoord.out.markDups.bam -o HeLa_TO_r2.bw --binSize 10 --scaleFactor  0.901532217        --effectiveGenomeSize 2864785220
    
    raw_counts <- counts(dds)
    normalized_counts <- counts(dds, normalized=TRUE)
    #write.table(raw_counts, file="raw_counts.txt", sep="\t", quote=F, col.names=NA)
    #write.table(normalized_counts, file="normalized_counts.txt", sep="\t", quote=F, col.names=NA)
    #convert bam to bigwig using deepTools by feeding inverse of DESeq’s size Factor
    estimSf <- function (cds){
        # Get the count matrix
        cts <- counts(cds)
        # Compute the geometric mean
        geomMean <- function(x) prod(x)^(1/length(x))
        # Compute the geometric mean over the line
        gm.mean  <-  apply(cts, 1, geomMean)
        # Zero values are set to NA (avoid subsequentcdsdivision by 0)
        gm.mean[gm.mean == 0] <- NA
        # Divide each line by its corresponding geometric mean
        # sweep(x, MARGIN, STATS, FUN = "-", check.margin = TRUE, ...)
        # MARGIN: 1 or 2 (line or columns)
        # STATS: a vector of length nrow(x) or ncol(x), depending on MARGIN
        # FUN: the function to be applied
        cts <- sweep(cts, 1, gm.mean, FUN="/")
        # Compute the median over the columns
        med <- apply(cts, 2, median, na.rm=TRUE)
        # Return the scaling factor
        return(med)
    }
    #https://dputhier.github.io/ASG/practicals/rnaseq_diff_Snf2/rnaseq_diff_Snf2.html
    #http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#data-transformations-and-visualization
    #https://hbctraining.github.io/DGE_workshop/lessons/02_DGE_count_normalization.html
    #https://hbctraining.github.io/DGE_workshop/lessons/04_DGE_DESeq2_analysis.html
    #https://genviz.org/module-04-expression/0004/02/01/DifferentialExpression/
    #DESeq2’s median of ratios [1]
    #EdgeR’s trimmed mean of M values (TMM) [2]
    #http://www.nathalievialaneix.eu/doc/html/TP1_normalization.html  #very good website!
    test_normcount <- sweep(raw_counts, 2, sizeFactors(dds), "/")
    sum(test_normcount != normalized_counts)
  10. Select the differentially expressed genes

    #https://galaxyproject.eu/posts/2020/08/22/three-steps-to-galaxify-your-tool/
    #https://www.biostars.org/p/282295/
    #https://www.biostars.org/p/335751/
    #> dds$condition
    #[1] AUM   AUM   AUM   Urine Urine Urine MHB   MHB   MHB
    #Levels: AUM MHB Urine
    #CONSOLE: mkdir star_salmon/degenes
    
    setwd("degenes")
    #---- relevel to control ----
    dds$condition <- relevel(dds$condition, "MHB")
    dds = DESeq(dds, betaPrior=FALSE)
    resultsNames(dds)
    clist <- c("AUM_vs_MHB","Urine_vs_MHB")
    
    for (i in clist) {
      contrast = paste("condition", i, sep="_")
      res = results(dds, name=contrast)
      res <- res[!is.na(res$log2FoldChange),]
      res_df <- as.data.frame(res)
    
      write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(i, "all.txt", sep="-"))
      up <- subset(res_df, padj<=0.05 & log2FoldChange>=1.35)
      down <- subset(res_df, padj<=0.05 & log2FoldChange<=-1.35)
      write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(i, "up.txt", sep="-"))
      write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(i, "down.txt", sep="-"))
    }
    
    # -- Under host-env --
    grep -P "\tgene\t" CP059040.gff > CP059040_gene.gff
    python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040_gene.gff AUM_vs_MHB-all.txt AUM_vs_MHB-all.csv
    python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040_gene.gff AUM_vs_MHB-up.txt AUM_vs_MHB-up.csv
    python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040_gene.gff AUM_vs_MHB-down.txt AUM_vs_MHB-down.csv
    python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040_gene.gff Urine_vs_MHB-all.txt Urine_vs_MHB-all.csv
    python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040_gene.gff Urine_vs_MHB-up.txt Urine_vs_MHB-up.csv
    python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040_gene.gff Urine_vs_MHB-down.txt Urine_vs_MHB-down.csv
    
    #for i in AUM_vs_MHB Urine_vs_MHB; do
    #  echo "contrast = paste(\"condition\", \"${i}\", sep=\"_\")"
    #  echo "res = results(dds, name=contrast)"
    #  echo "res <- res[!is.na(res$log2FoldChange),]"
    #  echo "res_df <- as.data.frame(res)"
    #  #selectLab = selectLab_italics,
    #  echo "png(\"${i}.png\",width=1200, height=1000)"
    #  #legendPosition = 'right',legendLabSize = 12,  arrowheads = FALSE,
    #  #subtitle=expression(~Delta*\"$(echo $i | cut -d'_' -f1) versus \" *~Delta*\"$(echo $i | cut -d'_' -f3)\"))"
    #  echo "EnhancedVolcano(res, lab = rownames(res),x = 'log2FoldChange',y = 'padj', pCutoff=5e-2, FCcutoff=1.2, title='', subtitleLabSize = 18, pointSize = 3.0, labSize = 5.0, colAlpha=1, legendIconSize = 4.0, drawConnectors = TRUE, widthConnectors = 0.5, colConnectors = 'black', subtitle=expression(\"$(echo $i | cut -d'_' -f1) versus $(echo $i | cut -d'_' -f3)\"))"
    #  echo "dev.off()"
    #done
    
    res <- read.csv("AUM_vs_MHB-all.csv")
    # Replace empty GeneName with modified GeneID
    res$GeneName <- ifelse(
      res$GeneName == "" | is.na(res$GeneName),
      gsub("gene-", "", res$GeneID),
      res$GeneName
    )
    duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]
    #print(duplicated_genes)
    # [1] "bfr"  "lipA" "ahpF" "pcaF" "alr"  "pcaD" "cydB" "lpdA" "pgaC" "ppk1"
    #[11] "pcaF" "tuf"  "galE" "murI" "yccS" "rrf"  "rrf"  "arsB" "ptsP" "umuD"
    #[21] "map"  "pgaB" "rrf"  "rrf"  "rrf"  "pgaD" "uraH" "benE"
    #res[res$GeneName == "bfr", ]
    
    #1st_strategy First occurrence is kept and Subsequent duplicates are removed
    #res <- res[!duplicated(res$GeneName), ]
    #2nd_strategy keep the row with the smallest padj value for each GeneName
    res <- res %>%
      group_by(GeneName) %>%
      slice_min(padj, with_ties = FALSE) %>%
      ungroup()
    res <- as.data.frame(res)
    # Sort res first by padj (ascending) and then by log2FoldChange (descending)
    res <- res[order(res$padj, -res$log2FoldChange), ]
    
    # Assuming res is your dataframe and already processed
    # Filter up-regulated genes: log2FoldChange > 2 and padj < 1e-2
    up_regulated <- res[res$log2FoldChange > 2 & res$padj < 1e-2, ]
    # Filter down-regulated genes: log2FoldChange < -2 and padj < 1e-2
    down_regulated <- res[res$log2FoldChange < -2 & res$padj < 1e-2, ]
    # Create a new workbook
    wb <- createWorkbook()
    # Add the complete dataset as the first sheet
    addWorksheet(wb, "Complete_Data")
    writeData(wb, "Complete_Data", res)
    # Add the up-regulated genes as the second sheet
    addWorksheet(wb, "Up_Regulated")
    writeData(wb, "Up_Regulated", up_regulated)
    # Add the down-regulated genes as the third sheet
    addWorksheet(wb, "Down_Regulated")
    writeData(wb, "Down_Regulated", down_regulated)
    # Save the workbook to a file
    saveWorkbook(wb, "Gene_Expression_AUM_vs_MHB.xlsx", overwrite = TRUE)
    
    # Set the 'GeneName' column as row.names
    rownames(res) <- res$GeneName
    # Drop the 'GeneName' column since it's now the row names
    res$GeneName <- NULL
    head(res)
    
    ## Ensure the data frame matches the expected format
    ## For example, it should have columns: log2FoldChange, padj, etc.
    #res <- as.data.frame(res)
    ## Remove rows with NA in log2FoldChange (if needed)
    #res <- res[!is.na(res$log2FoldChange),]
    
    # Replace padj = 0 with a small value
    res$padj[res$padj == 0] <- 1e-150
    
    #library(EnhancedVolcano)
    # Assuming res is already sorted and processed
    png("AUM_vs_MHB.png", width=1200, height=2000)
    #max.overlaps = 10
    EnhancedVolcano(res,
                    lab = rownames(res),
                    x = 'log2FoldChange',
                    y = 'padj',
                    pCutoff = 1e-2,
                    FCcutoff = 2,
                    title = '',
                    subtitleLabSize = 18,
                    pointSize = 3.0,
                    labSize = 5.0,
                    colAlpha = 1,
                    legendIconSize = 4.0,
                    drawConnectors = TRUE,
                    widthConnectors = 0.5,
                    colConnectors = 'black',
                    subtitle = expression("AUM versus MHB"))
    dev.off()
    
    res <- read.csv("Urine_vs_MHB-all.csv")
    # Replace empty GeneName with modified GeneID
    res$GeneName <- ifelse(
      res$GeneName == "" | is.na(res$GeneName),
      gsub("gene-", "", res$GeneID),
      res$GeneName
    )
    duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]
    
    res <- res %>%
      group_by(GeneName) %>%
      slice_min(padj, with_ties = FALSE) %>%
      ungroup()
    res <- as.data.frame(res)
    # Sort res first by padj (ascending) and then by log2FoldChange (descending)
    res <- res[order(res$padj, -res$log2FoldChange), ]
    
    # Assuming res is your dataframe and already processed
    # Filter up-regulated genes: log2FoldChange > 2 and padj < 1e-2
    up_regulated <- res[res$log2FoldChange > 2 & res$padj < 1e-2, ]
    # Filter down-regulated genes: log2FoldChange < -2 and padj < 1e-2
    down_regulated <- res[res$log2FoldChange < -2 & res$padj < 1e-2, ]
    # Create a new workbook
    wb <- createWorkbook()
    # Add the complete dataset as the first sheet
    addWorksheet(wb, "Complete_Data")
    writeData(wb, "Complete_Data", res)
    # Add the up-regulated genes as the second sheet
    addWorksheet(wb, "Up_Regulated")
    writeData(wb, "Up_Regulated", up_regulated)
    # Add the down-regulated genes as the third sheet
    addWorksheet(wb, "Down_Regulated")
    writeData(wb, "Down_Regulated", down_regulated)
    # Save the workbook to a file
    saveWorkbook(wb, "Gene_Expression_Urine_vs_MHB.xlsx", overwrite = TRUE)
    
    # Set the 'GeneName' column as row.names
    rownames(res) <- res$GeneName
    # Drop the 'GeneName' column since it's now the row names
    res$GeneName <- NULL
    head(res)
    
    ## Ensure the data frame matches the expected format
    ## For example, it should have columns: log2FoldChange, padj, etc.
    #res <- as.data.frame(res)
    ## Remove rows with NA in log2FoldChange (if needed)
    #res <- res[!is.na(res$log2FoldChange),]
    
    # Replace padj = 0 with a small value
    res$padj[res$padj == 0] <- 1e-305
    
    #library(EnhancedVolcano)
    # Assuming res is already sorted and processed
    png("Urine_vs_MHB.png", width=1200, height=2000)
    #max.overlaps = 10
    EnhancedVolcano(res,
                    lab = rownames(res),
                    x = 'log2FoldChange',
                    y = 'padj',
                    pCutoff = 1e-2,
                    FCcutoff = 2,
                    title = '',
                    subtitleLabSize = 18,
                    pointSize = 3.0,
                    labSize = 5.0,
                    colAlpha = 1,
                    legendIconSize = 4.0,
                    drawConnectors = TRUE,
                    widthConnectors = 0.5,
                    colConnectors = 'black',
                    subtitle = expression("Urine versus MHB"))
    dev.off()
  11. Report

    Attached are the results of the analysis.
    
    In the Urine_vs_MHB comparison, we identified a total of 259 up-regulated genes (log2FoldChange > 2 and padj < 1e-2) and 138 down-regulated genes (log2FoldChange < -2 and padj < 1e-2) (please refer to the attached volcano plot and Excel files). Notably, the following genes have a p-adjusted value of 0, indicating very high confidence in their differential expression. The bas-series genes (basA, basB, basC, basD, basE, basJ) are particularly prominent:
    
    GeneName    GeneID  baseMean    log2FoldChange  lfcSE   stat    pvalue  padj
    basJ    gene-H0N29_05120    11166.90    11.42   0.30    38.27   0   0
    basE    gene-H0N29_05085    12006.52    10.45   0.23    45.76   0   0
    basD    gene-H0N29_05080    12217.80    10.15   0.24    42.42   0   0
    bauA    gene-H0N29_05070    25280.68    9.55    0.19    51.48   0   0
    basA    gene-H0N29_05040    9750.68 9.02    0.18    48.90   0   0
    basC    gene-H0N29_05075    5034.14 8.58    0.21    40.07   0   0
    H0N29_08320 gene-H0N29_08320    4935.78 7.87    0.20    40.01   0   0
    barB    gene-H0N29_05105    5187.29 7.81    0.18    43.39   0   0
    H0N29_09380 gene-H0N29_09380    3477.26 7.41    0.19    38.91   0   0
    H0N29_13950 gene-H0N29_13950    13959.05    6.85    0.15    45.70   0   0
    H0N29_10825 gene-H0N29_10825    3664.70 6.44    0.17    37.59   0   0
    H0N29_10790 gene-H0N29_10790    2574.12 6.41    0.17    37.86   0   0
    H0N29_10010 gene-H0N29_10010    9376.84 -8.14   0.19    -43.70  0   0
    
    In the AUM_vs_MHB comparison, we identified a total of 149 up-regulated genes (log2FoldChange > 2 and padj < 1e-2) and 65 down-regulated genes (log2FoldChange < -2 and padj < 1e-2) (please refer to the attached volcano plot and Excel files). The following genes also show a p-adjusted value of 0, indicating very high confidence in their differential expression:
    
    GeneName    GeneID  baseMean    log2FoldChange  lfcSE   stat    pvalue  padj
    putA    gene-H0N29_09870    36100.24    -7.25   0.15    -49.78  0   0
    H0N29_10010 gene-H0N29_10010    9376.84 -7.43   0.18    -41.96  0   0
    
    To ensure proper visualization, I replaced the padj = 0 values with small numbers: 1e-305 for Urine_vs_MHB and 1e-150 for AUM_vs_MHB.
    
    We have now identified the significantly expressed genes. If you would like any further analysis based on these genes or need additional plots, please let me know.
  12. (TODO) clustering the genes and draw heatmap

    for i in HSV.d2_vs_control HSV.d4_vs_control HSV.d6_vs_control HSV.d8_vs_control HSV.d4_vs_HSV.d2 HSV.d6_vs_HSV.d2 HSV.d8_vs_HSV.d2 HSV.d6_vs_HSV.d4 HSV.d8_vs_HSV.d4 HSV.d8_vs_HSV.d6; do echo "cut -d',' -f1-1 ${i}-up_annotated.txt > ${i}-up.id"; echo "cut -d',' -f1-1 ${i}-down_annotated.txt > ${i}-down.id"; done
    cat *.id | sort -u > ids
    #add Gene_Id in the first line, delete the ""
    GOI <- read.csv("ids")$Gene_Id  #4647
    RNASeq.NoCellLine <- assay(rld)
    #install.packages("gplots")
    library("gplots")
    #clustering methods: "ward.D", "ward.D2", "single", "complete", "average" (= UPGMA), "mcquitty" (= WPGMA), "median" (= WPGMC) or "centroid" (= UPGMC).  pearson or spearman
    datamat = RNASeq.NoCellLine[GOI, ]
    #datamat = RNASeq.NoCellLine
    write.csv(as.data.frame(datamat), file ="gene_expressions.txt")
    constant_rows <- apply(datamat, 1, function(row) var(row) == 0)
    if(any(constant_rows)) {
      cat("Removing", sum(constant_rows), "constant rows.\n")
      datamat <- datamat[!constant_rows, ]
    }
    hr <- hclust(as.dist(1-cor(t(datamat), method="pearson")), method="complete")
    hc <- hclust(as.dist(1-cor(datamat, method="spearman")), method="complete")
    mycl = cutree(hr, h=max(hr$height)/1.05)
    mycol = c("YELLOW", "BLUE", "ORANGE", "MAGENTA", "CYAN", "RED", "GREEN", "MAROON", "LIGHTBLUE", "PINK", "MAGENTA", "LIGHTCYAN", "LIGHTRED", "LIGHTGREEN");
    mycol = mycol[as.vector(mycl)]
    #png("DEGs_heatmap.png", width=900, height=800)
    #cex.lab=10, labRow="",
    png("DEGs_heatmap.png", width=800, height=1000)
    heatmap.2(as.matrix(datamat),Rowv=as.dendrogram(hr),Colv = NA, dendrogram = 'row',labRow="",
                scale='row',trace='none',col=bluered(75), cexCol=1.8,
                RowSideColors = mycol, margins=c(10,2), cexRow=1.5, srtCol=30, lhei = c(1, 8), lwid=c(2, 8))  #rownames(datamat)
    #heatmap.2(datamat, Rowv=as.dendrogram(hr), col=bluered(75), scale="row", RowSideColors=mycol, trace="none", margin=c(5,5), sepwidth=c(0,0), dendrogram = 'row', Colv = 'false', density.info='none', labRow="", srtCol=30, lhei=c(0.1,2))
    dev.off()
    #### cluster members #####
    write.csv(names(subset(mycl, mycl == '1')),file='cluster1_YELLOW.txt')
    write.csv(names(subset(mycl, mycl == '2')),file='cluster2_DARKBLUE.txt')
    write.csv(names(subset(mycl, mycl == '3')),file='cluster3_DARKORANGE.txt')
    write.csv(names(subset(mycl, mycl == '4')),file='cluster4.txt')
    #~/Tools/csv2xls-0.4/csv_to_xls.py cluster*.txt -d',' -o DEGs_heatmap_cluster_members.xls
    ~/Tools/csv2xls-0.4/csv_to_xls.py \
    significant_gene_expressions.txt \
    -d',' -o DEGs_heatmap_expression_data.xls;
    #### cluster members (advanced) #####
    subset_1<-names(subset(mycl, mycl == '1'))
    data <- as.data.frame(datamat[rownames(datamat) %in% subset_1, ])  #2575
    subset_2<-names(subset(mycl, mycl == '2'))
    data <- as.data.frame(datamat[rownames(datamat) %in% subset_2, ])  #1855
    subset_3<-names(subset(mycl, mycl == '3'))
    data <- as.data.frame(datamat[rownames(datamat) %in% subset_3, ])  #217
    subset_4<-names(subset(mycl, mycl == '4'))
    data <- as.data.frame(datamat[rownames(datamat) %in% subset_4, ])  #
    subset_5<-names(subset(mycl, mycl == '5'))
    data <- as.data.frame(datamat[rownames(datamat) %in% subset_5, ])  #
    # Initialize an empty data frame for the annotated data
    annotated_data <- data.frame()
    # Determine total number of genes
    total_genes <- length(rownames(data))
    # Loop through each gene to annotate
    for (i in 1:total_genes) {
        gene <- rownames(data)[i]
        result <- getBM(attributes = c('ensembl_gene_id', 'external_gene_name', 'gene_biotype', 'entrezgene_id', 'chromosome_name', 'start_position', 'end_position', 'strand', 'description'),
                        filters = 'ensembl_gene_id',
                        values = gene,
                        mart = ensembl)
        # If multiple rows are returned, take the first one
        if (nrow(result) > 1) {
            result <- result[1, ]
        }
        # Check if the result is empty
        if (nrow(result) == 0) {
            result <- data.frame(ensembl_gene_id = gene,
                                external_gene_name = NA,
                                gene_biotype = NA,
                                entrezgene_id = NA,
                                chromosome_name = NA,
                                start_position = NA,
                                end_position = NA,
                                strand = NA,
                                description = NA)
        }
        # Transpose expression values
        expression_values <- t(data.frame(t(data[gene, ])))
        colnames(expression_values) <- colnames(data)
        # Combine gene information and expression data
        combined_result <- cbind(result, expression_values)
        # Append to the final dataframe
        annotated_data <- rbind(annotated_data, combined_result)
        # Print progress every 100 genes
        if (i %% 100 == 0) {
            cat(sprintf("Processed gene %d out of %d\n", i, total_genes))
        }
    }
    # Save the annotated data to a new CSV file
    #write.csv(annotated_data, "cluster1_YELLOW.csv", row.names=FALSE)
    write.csv(annotated_data, "cluster2_DARKBLUE.csv", row.names=FALSE)
    write.csv(annotated_data, "cluster3_DARKORANGE.csv", row.names=FALSE)
    write.csv(annotated_data, "cluster4_DARKMAGENTA.csv", row.names=FALSE)
    write.csv(annotated_data, "cluster5_DARKCYAN.csv", row.names=FALSE)
    #~/Tools/csv2xls-0.4/csv_to_xls.py cluster*.csv -d',' -o DEGs_heatmap_clusters.xls

Kostenart von Nebenkostenabrechnung

  • Betriebskosten

    • Abwasser / Entwässerung

    • Allgemeinstrom

    • Antenne

    • Breitbandkabel

    • Dachrinnenreinigung

    • Deichabgabe

    • Gartenpflege

    • Gebäudereinigung

    • Gebäudeversicherung

    • Haftpflichtversicherung

    • Hausbeleuchtung

    • Hauswart

    • Heizkosten

    • Heizungsanlagen

    • Kaltwasser

    • Müllabfuhr

    • Niederschlagswasser

    • Recycling

    • Schornsteinreinigung

    • Straßenreinigung

    • Ungezieferbekämpfung

    • Warmwasser

    • Warmwasserversorgungsanlagen

    • Wartung Aufzug

    • Wartung Brandschutz /Rauchwarnmelder

    • Wartung Lüftungsanlagen

    • Wartung Tiefgarage

    • Wäscheeinrichtungen

    • Winterdiest

    • Sonstige Kostenart hinzufügen

    • Kosten für Dichtigkeitsprüfung nicht umlegbar In einigen Bundesländern ist vorgeschrieben, dass Abwasserleitungen auf Dichtigkeit zu überprüfen sind. Die hierdurch entstehenden Kosten sind ebenfalls nicht umlagefähig.

    • Schornsteinfegerkosten: Gibt es keine Zentralheizung und keine zentrale Warmwasserversorgung, dann sind die Schornsteinfegerkosten über die “kalten Betriebskosten” abzurechnen. Sie können dann nach Zahl der Wohnungen oder nach anteiliger Wohnfläche umgelegt werden.

    • Müllabfuhr: Letztlich kommt es zunächst auf den Mietvertrag an, was Mieter und Vermieter vereinbart haben. Finden Sie nichts dazu im Mietvertrag, wird sich Ihr Vermieter an der gesetzlichen Regelung orientieren. Die Umlage erfolgt dann anhand der Wohnfläche Ihrer Wohnung, sofern die Müllmenge nicht konkret erfasst wird.

    • Gebäudeversicherung: Die Kosten für die Gebäudeversicherung dürfen Immobilieneigentümer auf die Mieter umlegen. Bei Mehrfamilienhäusern erfolgt die Umlage meist je nach Anteil der einzelnen Wohnungen an der gesamten Wohnfläche. Vermieter müssen beim Abschluss einer Gebäudeversicherung darauf achten, dass die Prämie nicht überhöht ist.

    • Grundsteuer: Zahlung der Grundsteuer als Betriebskosten für die Wohnung Üblich ist, dass die Grundsteuer nach der Wohnfläche der einzelnen Wohnungen (wenn im Mietvertrag keine andere Verteilung vereinbart ist) verteilt wird. Der Vermieter nutzt hierfür den Grundsteuerbescheid.

  • Direkte Kosten

    • Grundsteuer
    • Sonstige direkte Kosten