Author Archives: gene_x

Analysis of the RNA binding protein (RBP) motifs for RNA-Seq and miRNAs (v3)

There are several alternative R packages and tools to perform motif enrichment analysis for RNA-binding proteins (RBPs), beyond PWMEnrich::motifEnrichment(). Here are the most notable ones:

Tool / Package Enrichment Custom Motifs CLI or R? RNA-specific? Notes
PWMEnrich R Tried (see pipeline.v1-block3)
RBPmap ❌ (uses own db) Web/CLI Tried RBPmap, but it is too slow
Biostrings/TFBSTools ❌ (only scanning) R ATtRACT+Biostrings/TFBSTools (tried, pipeline.v1-block3)
rmap ✅ (CLIP-based) R
Homer CLI ⚠ RNA optional
MEME (AME, FIMO) Web/CLI ⚠ Generic Finally using ATtRACT+FIMO, AME has BUG, not runnable
#It was suggested to me to use “RBPmap” or “GraphProt” for this analysis.
  1. Get 3UTR.fasta, 5UTR.fasta, CDS.fasta and transcripts.fasta

            mRNA Transcript
    ┌────────────┬────────────┬────────────┐
    │   5′ UTR   │     CDS    │   3′ UTR   │
    └────────────┴────────────┴────────────┘
    ↑            ↑            ↑            ↑
    Start        Start        Stop         End
    of           Codon       Codon        of
    Transcript                             Transcript
    
    ✅ Option 1: Use GENCODE and python scripts (CHOSEN!)
    
    #Input: up- and down-, all-regulated files
    ~/DATA/Data_Ute/Data_RNA-Seq_MKL-1+WaGa/results_2025_1/degenes/MKL-1_wt.EV_vs_parental-up.txt    #20086
    ~/DATA/Data_Ute/Data_RNA-Seq_MKL-1+WaGa/results_2025_1/degenes/MKL-1_wt.EV_vs_parental-down.txt  #634
    ~/DATA/Data_Ute/Data_RNA-Seq_MKL-1+WaGa/results_2025_1/degenes/WaGa_wt.EV_vs_parental-up.txt     #23832
    ~/DATA/Data_Ute/Data_RNA-Seq_MKL-1+WaGa/results_2025_1/degenes/WaGa_wt.EV_vs_parental-down.txt   #375
    
    #Filter the gene lists to protein_coding genes before extracting 3' UTRs, because:
    #1. Only protein_coding genes have well-annotated 3' UTRs.
    #   A 3' UTR is the region after the CDS (coding sequence) and before the poly-A tail.
    #   Non-coding RNAs (e.g., lncRNA, snoRNA, miRNA precursors) have no CDS and therefore no canonical 3' UTR.
    #2. In GENCODE, most UTR annotations are provided only for transcripts of gene_type = "protein_coding".
    
    cd ~/DATA/Data_Ute/RBPs_analysis/extract_3UTR_5UTR_CDS_transcript
    grep ",\"protein_coding\"," ~/DATA/Data_Ute/Data_RNA-Seq_MKL-1+WaGa/results_2025_1/degenes/MKL-1_wt.EV_vs_parental-up.txt > MKL-1_wt.EV_vs_parental-up_protein_coding.txt
    grep ",\"protein_coding\"," ~/DATA/Data_Ute/Data_RNA-Seq_MKL-1+WaGa/results_2025_1/degenes/MKL-1_wt.EV_vs_parental-down.txt > MKL-1_wt.EV_vs_parental-down_protein_coding.txt
    grep ",\"protein_coding\"," ~/DATA/Data_Ute/Data_RNA-Seq_MKL-1+WaGa/results_2025_1/degenes/WaGa_wt.EV_vs_parental-up.txt > WaGa_wt.EV_vs_parental-up_protein_coding.txt
    grep ",\"protein_coding\"," ~/DATA/Data_Ute/Data_RNA-Seq_MKL-1+WaGa/results_2025_1/degenes/WaGa_wt.EV_vs_parental-down.txt > WaGa_wt.EV_vs_parental-down_protein_coding.txt
    
    #Visit and Download: GENCODE FTP site https://www.gencodegenes.org/human/
        * GTF annotation file (e.g., gencode.v48.annotation.gtf.gz)
        * Corresponding genome FASTA (e.g., GRCh38.primary_assembly.genome.fa.gz)
    wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_48/gencode.v48.annotation.gtf.gz
    wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_48/GRCh38.primary_assembly.genome.fa.gz
    gunzip gencode.v48.annotation.gtf.gz
    gunzip GRCh38.primary_assembly.genome.fa.gz
    
    python extract_transcript_parts.py MKL-1_wt.EV_vs_parental-down_protein_coding.txt ~/REFs/gencode.v48.annotation.gtf ~/REFs/GRCh38.primary_assembly.genome.fa MKL-1_down
    python extract_transcript_parts.py MKL-1_wt.EV_vs_parental-up_protein_coding.txt ~/REFs/gencode.v48.annotation.gtf ~/REFs/GRCh38.primary_assembly.genome.fa MKL-1_up  #5988
    python extract_transcript_parts.py WaGa_wt.EV_vs_parental-down_protein_coding.txt ~/REFs/gencode.v48.annotation.gtf ~/REFs/GRCh38.primary_assembly.genome.fa WaGa_down  #93
    python extract_transcript_parts.py WaGa_wt.EV_vs_parental-up_protein_coding.txt ~/REFs/gencode.v48.annotation.gtf ~/REFs/GRCh38.primary_assembly.genome.fa WaGa_up  #6538
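
For orientation, here is a minimal sketch of the core step such an extraction script has to perform. It is not the actual extract_transcript_parts.py. It assumes GENCODE-style GTF records where UTR exons carry the generic feature type "UTR", so each one has to be assigned to the 5′ or 3′ side by comparing it with the CDS and the strand. The function names are hypothetical.

```python
# Sketch only: classify the UTR intervals of one transcript as 5' or 3'
# and splice out the sequence. Coordinates are 1-based and inclusive, as in GTF.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def classify_utrs(utr_intervals, cds_intervals, strand):
    """Split UTR intervals into (five_prime, three_prime) lists."""
    cds_start = min(start for start, _ in cds_intervals)
    cds_end = max(end for _, end in cds_intervals)
    five_prime, three_prime = [], []
    for start, end in sorted(utr_intervals):
        before_cds = end < cds_start   # left of the CDS on the genome
        after_cds = start > cds_end    # right of the CDS on the genome
        if not (before_cds or after_cds):
            continue                   # ignore anything overlapping the CDS
        # On the + strand "before" means 5'; on the - strand it means 3'.
        if before_cds == (strand == "+"):
            five_prime.append((start, end))
        else:
            three_prime.append((start, end))
    return five_prime, three_prime


def spliced_sequence(chrom_seq, intervals, strand):
    """Concatenate the intervals in genomic order; flip for the - strand."""
    seq = "".join(chrom_seq[start - 1:end] for start, end in sorted(intervals))
    return reverse_complement(seq) if strand == "-" else seq
```

The same intervals give a different 3′ UTR depending on the strand, which is why the strand column of the GTF cannot be ignored.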
    
    ✅ Option 2-5 see at the end!
  2. Why 3′ UTR?

    🧬 miRNA, RBP, or translation/post-transcriptional regulation
    ➡️ Use 3' UTR sequences
    
    Because:
    
    Most miRNA binding and many RBP motifs are located in the 3' UTR.
    
    It’s the primary region for mRNA stability, localization, and translation regulation.
    
    🧠 Example: You're looking for binding enrichment of miRNAs or RNA-binding proteins (PUM, HuR, etc.)
    ✅ Input = 3UTR.fasta
    
    🧪 If you're testing RBPs related to:
    - Translation initiation, upstream ORFs, or 5' cap interaction:
    ➡️ Use 5' UTR
    
    - Coding mutations, protein-level motifs, or translational efficiency:
    ➡️ Use CDS
    
    - General transcriptome-wide motif search (no preference):
    ➡️ Use transcripts, or test all regions separately to localize signal
  3. Recommended Workflow with RBPmap https://rbpmap.technion.ac.il (Too slow!)

    RBPmap itself does not compute enrichment p-values or FDR; it's a motif scanning tool.
    
    To get statistically meaningful RBP enrichments, combine RBPmap with custom permutation testing or Fisher’s exact test + multiple testing correction.
    
        1. Prepare foreground (target) and background sequences
    
            Extract 3′ UTRs of:
    
            📉 Downregulated mRNAs (foreground) — likely targeted by upregulated miRNAs
    
            ⚪ A control set of 3′ UTRs — e.g., non-differentially expressed protein-coding genes
    
                grep ",\"protein_coding\"," ~/DATA/Data_Ute/Data_RNA-Seq_MKL-1+WaGa/results_2025_1/degenes/MKL-1_wt.EV_vs_parental-all.txt > MKL-1_wt.EV_vs_parental-all_protein_coding.txt
                grep ",\"protein_coding\"," ~/DATA/Data_Ute/Data_RNA-Seq_MKL-1+WaGa/results_2025_1/degenes/WaGa_wt.EV_vs_parental-all.txt > WaGa_wt.EV_vs_parental-all_protein_coding.txt
    
                cut -d',' -f1 MKL-1_wt.EV_vs_parental-all_protein_coding.txt | sort > all_genes.txt  #19239
                cut -d',' -f1 MKL-1_wt.EV_vs_parental-up_protein_coding.txt | sort > up_genes.txt  #5988
                cut -d',' -f1 MKL-1_wt.EV_vs_parental-down_protein_coding.txt | sort > down_genes.txt  #112
                cat up_genes.txt down_genes.txt | sort | uniq > regulated_genes.txt
                comm -23 all_genes.txt regulated_genes.txt > background_genes.txt
                grep -Ff background_genes.txt MKL-1_wt.EV_vs_parental-all_protein_coding.txt > MKL-1_wt.EV_vs_parental-background_protein_coding.txt  #13139
    
                cut -d',' -f1 WaGa_wt.EV_vs_parental-all_protein_coding.txt | sort > all_genes.txt  #19239
                cut -d',' -f1 WaGa_wt.EV_vs_parental-up_protein_coding.txt | sort > up_genes.txt  #6538
                cut -d',' -f1 WaGa_wt.EV_vs_parental-down_protein_coding.txt | sort > down_genes.txt  #93
                cat up_genes.txt down_genes.txt | sort | uniq > regulated_genes.txt
                comm -23 all_genes.txt regulated_genes.txt > background_genes.txt
                grep -Ff background_genes.txt WaGa_wt.EV_vs_parental-all_protein_coding.txt > WaGa_wt.EV_vs_parental-background_protein_coding.txt  #12608
    
                python extract_transcript_parts.py MKL-1_wt.EV_vs_parental-background_protein_coding.txt ~/REFs/gencode.v48.annotation.gtf ~/REFs/GRCh38.primary_assembly.genome.fa MKL-1_background
                python extract_transcript_parts.py WaGa_wt.EV_vs_parental-background_protein_coding.txt ~/REFs/gencode.v48.annotation.gtf ~/REFs/GRCh38.primary_assembly.genome.fa WaGa_background
    
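                foreground.fasta: your target (foreground) sequences, e.g., the 3′ UTRs of down-regulated genes.
                background.fasta: your background (control) sequences, e.g., the 3′ UTRs of genes that are not significantly differentially expressed.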
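
The background selection above is a set difference, which can be sketched in Python as follows. This is an illustration rather than part of the pipeline, and the function names are hypothetical. It compares the ID column exactly, whereas `grep -Ff` matches each ID as a fixed substring anywhere in the line.

```python
# Sketch only: background = all genes minus up- and down-regulated genes.
def background_ids(all_ids, up_ids, down_ids):
    """Return the sorted IDs that are in neither the up nor the down list."""
    regulated = set(up_ids) | set(down_ids)
    return sorted(set(all_ids) - regulated)


def select_rows(rows, keep_ids, id_column=0):
    """Keep rows (lists of fields) whose ID column is exactly in keep_ids."""
    keep = set(keep_ids)
    return [row for row in rows if row[id_column] in keep]
```

With exact matching, an ID such as G1 does not also select a row whose ID is G10.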
    
        2. Run RBPmap separately on each set (six runs in total)
    
            * Submit both sets of UTRs to RBPmap.
            * Use the same settings (e.g., “human genome”, “high stringency”, "Apply conservation filter" etc.)
            * Choose all RBPs
            * Download motif match outputs for both sets
    
        3. Count motif hits per RBP in each set
    
            You now have:
            For each RBP:
            a: number of target 3′ UTRs with a motif match
            b: number of background 3′ UTRs with a motif match
            c: total number of target 3′ UTRs
            d: total number of background 3′ UTRs
    
        4. Perform Fisher’s Exact Test per RBP
    
            For each RBP, construct a 2x2 table:

                                    Motif present   Motif absent
            Foreground (targets)    a               c - a
            Background              b               d - b
    
        5. Adjust p-values for multiple testing
        Use Benjamini-Hochberg (FDR) correction (e.g., in Python or R) across all RBPs tested.
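
Steps 4 and 5 can be sketched with the standard library alone. This is a minimal illustration using the a, b, c, d counts defined in step 3, not the code used for the results in this post. In practice scipy.stats.fisher_exact and statsmodels' multipletests do the same job. The function names are hypothetical, and the test shown is the one-sided (enrichment) variant.

```python
from math import comb


def fisher_enrichment_p(a, b, c, d):
    """One-sided Fisher's exact test for enrichment in the foreground.

    a of c foreground sequences and b of d background sequences have a hit.
    Returns P(X >= a) under the hypergeometric null.
    """
    hits = a + b                 # sequences with a motif match overall
    total = c + d                # all sequences
    upper = min(c, hits)         # largest possible foreground hit count
    tail = sum(comb(hits, k) * comb(total - hits, c - k)
               for k in range(a, upper + 1))
    return tail / comb(total, c)


def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Run fisher_enrichment_p once per RBP, then pass the whole list of p-values to benjamini_hochberg so the correction covers every RBP tested.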
    
        6. ✅ Summary

            Step                                      Tool
            Prepare database of RNA-binding motifs    ATtRACT
            3′ UTR extraction                         extract_transcript_parts.py
            Motif scan                                RBPmap or FIMO
            Count motif hits                          Your own parser (Python or R)
            Fisher’s exact test                       scipy.stats or fisher.test()
            FDR correction                            multipletests() or p.adjust()
    
        python rbp_enrichment.py rbpmap_downregulated.tsv rbpmap_background.tsv rbp_enrichment_results.csv
  4. Quick Drop-In Plan (RBPmap Alternative with FIMO for motif scan)

    1. [ATtRACT + FIMO (MEME suite)]
    
        ATtRACT: Database of RNA-binding motifs.
        FIMO: Fast and scriptable motif scanning tool.
    
        #Download RBP motifs (PWM) from ATtRACT DB; Convert to MEME format (if needed); Use FIMO to scan UTR sequences
    
        grep "Homo_sapiens" ATtRACT_db.txt > attract_human.txt
    
        #cut -f12 attract_human.txt | sort | uniq > valid_ids.txt
    
        python filter_short_fasta.py ../Data_RNA-Seq_MKL-1+WaGa/motif_analysis/MKL-1_background.3UTR.fasta MKL-1_background.filtered.3UTR.fasta
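        ✅ Filtering complete: total sequences = 70650
        🧹 Removed too-short sequences (<16 nt): 1760
        🟢 Kept valid sequences (≥16 nt): 68890
        📁 New background file saved as: MKL-1_background.filtered.3UTR.fasta
        # Check how many sequences are in the background file:
        grep -c "^>" MKL-1_background.filtered.3UTR.fasta
        68890
        # Check the total number of background sequences with FIMO hits: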
        cut -f3 fimo_background_MKL-1_background/fimo.tsv | sort | uniq | wc -l
        67841
        python filter_short_fasta.py ../Data_RNA-Seq_MKL-1+WaGa/motif_analysis/MKL-1_up.3UTR.fasta MKL-1_up.filtered.3UTR.fasta
        python filter_short_fasta.py ../Data_RNA-Seq_MKL-1+WaGa/motif_analysis/MKL-1_down.3UTR.fasta MKL-1_down.filtered.3UTR.fasta
        python filter_short_fasta.py ../Data_RNA-Seq_MKL-1+WaGa/motif_analysis/WaGa_background.3UTR.fasta WaGa_background.filtered.3UTR.fasta
        python filter_short_fasta.py ../Data_RNA-Seq_MKL-1+WaGa/motif_analysis/WaGa_up.3UTR.fasta WaGa_up.filtered.3UTR.fasta
        python filter_short_fasta.py ../Data_RNA-Seq_MKL-1+WaGa/motif_analysis/WaGa_down.3UTR.fasta WaGa_down.filtered.3UTR.fasta
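
The length filter itself is simple. A minimal sketch of what a script like filter_short_fasta.py might do is shown below. The real script is not reproduced here, and the function names are hypothetical. The 16 nt cutoff is taken from the log output above.

```python
# Sketch only: drop FASTA records shorter than a minimum length.
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_short(records, min_len=16):
    """Return (kept, removed_count) for records with len(seq) >= min_len."""
    kept, removed = [], 0
    for header, seq in records:
        if len(seq) >= min_len:
            kept.append((header, seq))
        else:
            removed += 1
    return kept, removed
```

Multi-line sequences are joined before measuring, so the cutoff applies to the full sequence rather than to a single line.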
    
        python convert_attract_pwm_to_meme.py
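
For reference, a conversion of this kind can be sketched as below. It is not the actual convert_attract_pwm_to_meme.py. It assumes the ATtRACT pwm.txt layout is a ">MATRIX_ID<TAB>length" header followed by one row of four probabilities (A, C, G, T) per position. The uniform background and nsites= 20 are arbitrary placeholders, and the function names are hypothetical.

```python
# Sketch only: turn ATtRACT-style PWM text into minimal MEME motif format.
def parse_pwm(lines):
    """Return {matrix_id: [[pA, pC, pG, pT], ...]} from pwm.txt-style lines."""
    motifs, current = {}, None
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0].startswith(">"):
            current = fields[0][1:]
            motifs[current] = []
        elif current is not None:
            motifs[current].append([float(x) for x in fields])
    return motifs


def to_meme(motifs):
    """Render the motifs as a MEME-format string (uniform background)."""
    out = ["MEME version 4", "", "ALPHABET= ACGT", "", "strands: +", "",
           "Background letter frequencies",
           "A 0.25 C 0.25 G 0.25 T 0.25", ""]
    for motif_id, rows in motifs.items():
        out.append(f"MOTIF {motif_id}")
        # nsites is a placeholder; the source PWMs carry no site counts.
        out.append("letter-probability matrix: "
                   f"alength= 4 w= {len(rows)} nsites= 20 E= 0")
        out.extend(" ".join(f"{p:.6f}" for p in row) for row in rows)
        out.append("")
    return "\n".join(out)
```

Writing the returned string to attract_human.meme would give a file that fimo can read, provided the input layout assumption holds.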
    
        fimo --thresh 1e-4 --oc fimo_foreground_MKL-1_down attract_human.meme ../Data_RNA-Seq_MKL-1+WaGa/motif_analysis/MKL-1_down.3UTR.fasta
        fimo --thresh 1e-4 --oc fimo_foreground_MKL-1_up attract_human.meme ../Data_RNA-Seq_MKL-1+WaGa/motif_analysis/MKL-1_up.3UTR.fasta
        fimo --thresh 1e-4 --oc fimo_background_MKL-1_background attract_human.meme ../Data_RNA-Seq_MKL-1+WaGa/motif_analysis/MKL-1_background.3UTR.fasta
        fimo --thresh 1e-4 --oc fimo_foreground_WaGa_down attract_human.meme ../Data_RNA-Seq_MKL-1+WaGa/motif_analysis/WaGa_down.3UTR.fasta
        fimo --thresh 1e-4 --oc fimo_foreground_WaGa_up attract_human.meme ../Data_RNA-Seq_MKL-1+WaGa/motif_analysis/WaGa_up.3UTR.fasta
        fimo --thresh 1e-4 --oc fimo_background_WaGa_background attract_human.meme ../Data_RNA-Seq_MKL-1+WaGa/motif_analysis/WaGa_background.3UTR.fasta
    
        #Explanation of the FIMO (Find Individual Motif Occurrences) output table. FIMO scans sequences for statistically significant matches to known motifs (e.g., RNA or DNA binding sites).

        Column              Meaning
        motif_id            ID of the motif, as defined in the .meme file
        motif_alt_id        Alternative ID or name for the motif (may be blank or unused)
        sequence_name       Name of the sequence where the motif was found (e.g., gene|transcript|region)
        start               Start position (1-based) of the motif match within the sequence
        stop                End position of the motif match
        strand              Strand on which the motif was found: + (forward) or - (reverse)
        score               Motif match score; higher scores indicate better matches
        p-value             Statistical significance of the match (lower is more significant)
        q-value             Adjusted p-value (False Discovery Rate corrected)
        matched_sequence    The actual sequence in the input that matches the motif
    
        ✅ Example Interpretation
        1338 ENSG00000134871|ENST00000714397|3UTR 103 114 + 23.4126 5.96e-08 0.111 GGAGAGAAGGGA
        motif_id: 1338 — a numeric ID from your motif file
        sequence_name: ENSG00000134871|ENST00000714397|3UTR — refers to the gene, transcript, and region (3′ UTR)
        start–stop: 103–114 — the motif occurs from position 103 to 114
        strand: + — found on the positive strand
        score: 23.41 — high score means strong motif match
        p-value: 5.96e-08 — very statistically significant
        q-value: 0.111 — FDR-corrected p-value
        matched_sequence: GGAGAGAAGGGA — the actual sequence match in the UTR
    
        💡 Tips
        You can map motif_id to RBP (RNA-binding protein) names using an annotation file like ATtRACT_db.txt.
        Typically, q-value < 0.05 is considered significant.
        Duplicate matches in different transcripts of the same gene may occur and are valid.
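
The motif_id to RBP name mapping mentioned in the tips can be sketched as follows. It assumes that ATtRACT_db.txt is tab-separated with a header line, the gene name in column 1 and the matrix ID in column 12 (consistent with the `cut -f12` command above). The function name is hypothetical.

```python
# Sketch only: build a motif_id -> {RBP names} lookup from ATtRACT_db.txt lines.
def load_motif_names(db_lines, name_col=0, matrix_col=11):
    """Map each matrix ID to the set of gene names that use it."""
    motif_to_rbp = {}
    for line_no, line in enumerate(db_lines):
        fields = line.rstrip("\n").split("\t")
        if line_no == 0 or len(fields) <= matrix_col:
            continue  # skip the header and malformed lines
        motif_to_rbp.setdefault(fields[matrix_col], set()).add(fields[name_col])
    return motif_to_rbp
```

A set is used as the value because the same matrix ID can appear on several database rows.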
    
        🧠 In most biological contexts:
            * Counting a motif as present multiple times because it's in several transcripts can inflate significance.
            * If you're using Fisher's exact test (as in enrichment), this transcript-level duplication can bias results.
    
        ⚠️ Caveat: If you're studying isoform-specific regulation, then transcript-level data may be valuable and shouldn't be collapsed. But for most general RBP enrichment or gene expression studies, the gene-level collapse is preferred.
    
        #Keep only one match per gene (based on Ensembl Gene ID like ENSG00000134871) for each RBP motif, even if multiple transcripts have hits.
        #python filter_fimo_best_per_gene.py --input fimo_foreground/fimo.tsv --output fimo_foreground/fimo.filtered.tsv
        convert_gtf_to_Gene_annotation_TSV_file.py  #generate gene_annotation.tsv
        python filter_fimo_best_per_gene_annotated.py \
        --input fimo_foreground_MKL-1_down/fimo.tsv \
        --annot Homo_sapiens.GRCh38.gene_annotation.tsv \
        --output_filtered fimo_foreground_MKL-1_down/fimo.filtered.tsv \
        --output_annotated fimo_foreground_MKL-1_down/fimo.filtered.annotated.tsv
        #21559
        python filter_fimo_best_per_gene_annotated.py \
        --input fimo_foreground_MKL-1_up/fimo.tsv \
        --annot Homo_sapiens.GRCh38.gene_annotation.tsv \
        --output_filtered fimo_foreground_MKL-1_up/fimo.filtered.tsv \
        --output_annotated fimo_foreground_MKL-1_up/fimo.filtered.annotated.tsv
        #(736661 rows)
        python filter_fimo_best_per_gene_annotated.py \
        --input fimo_background_MKL-1_background/fimo.tsv \
        --annot Homo_sapiens.GRCh38.gene_annotation.tsv \
        --output_filtered fimo_background_MKL-1_background/fimo.filtered.tsv \
        --output_annotated fimo_background_MKL-1_background/fimo.filtered.annotated.tsv
        #(1869075 rows)
        python filter_fimo_best_per_gene_annotated.py \
        --input fimo_foreground_WaGa_down/fimo.tsv \
        --annot Homo_sapiens.GRCh38.gene_annotation.tsv \
        --output_filtered fimo_foreground_WaGa_down/fimo.filtered.tsv \
        --output_annotated fimo_foreground_WaGa_down/fimo.filtered.annotated.tsv
        #(20364 rows)
        python filter_fimo_best_per_gene_annotated.py \
        --input fimo_foreground_WaGa_up/fimo.tsv \
        --annot Homo_sapiens.GRCh38.gene_annotation.tsv \
        --output_filtered fimo_foreground_WaGa_up/fimo.filtered.tsv \
        --output_annotated fimo_foreground_WaGa_up/fimo.filtered.annotated.tsv
        #(805634 rows)
        python filter_fimo_best_per_gene_annotated.py \
        --input fimo_background_WaGa_background/fimo.tsv \
        --annot Homo_sapiens.GRCh38.gene_annotation.tsv \
        --output_filtered fimo_background_WaGa_background/fimo.filtered.tsv \
        --output_annotated fimo_background_WaGa_background/fimo.filtered.annotated.tsv
        #(1811615 rows)
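
The gene-level collapse can be sketched like this. It is an illustration, not the actual filter_fimo_best_per_gene_annotated.py. It assumes that sequence names look like ENSG...|ENST...|3UTR, as in the example row above, and that "best" means the lowest p-value. The function name is hypothetical.

```python
import csv


def best_hit_per_gene(fimo_lines):
    """Keep the lowest p-value FIMO hit per (motif_id, gene) pair.

    fimo_lines: iterable of lines from fimo.tsv, including the header.
    Blank lines and trailing '#' comment lines are skipped.
    """
    data = (ln for ln in fimo_lines if ln.strip() and not ln.startswith("#"))
    best = {}
    for row in csv.DictReader(data, delimiter="\t"):
        gene = row["sequence_name"].split("|")[0]   # gene ID before first '|'
        key = (row["motif_id"], gene)
        if key not in best or float(row["p-value"]) < float(best[key]["p-value"]):
            best[key] = row
    return list(best.values())
```

After this step each motif contributes at most one row per gene, so several transcripts of one gene no longer inflate the counts.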
    
        python run_enrichment.py \
            --attract ATtRACT_db.txt \
            --fimo_fg fimo_foreground_MKL-1_up/fimo.filtered.tsv \
            --fimo_bg fimo_background_MKL-1_background/fimo.filtered.tsv \
            --output rbp_enrichment_MKL-1_up.csv \
            --strategy inclusive
        python run_enrichment.py \
            --attract ATtRACT_db.txt \
            --fimo_fg fimo_foreground_MKL-1_down/fimo.filtered.tsv \
            --fimo_bg fimo_background_MKL-1_background/fimo.filtered.tsv \
            --output rbp_enrichment_MKL-1_down.csv
        python run_enrichment.py \
            --attract ATtRACT_db.txt \
            --fimo_fg fimo_foreground_WaGa_up/fimo.filtered.tsv \
            --fimo_bg fimo_background_WaGa_background/fimo.filtered.tsv \
            --output rbp_enrichment_WaGa_up.csv
        python run_enrichment.py \
            --attract ATtRACT_db.txt \
            --fimo_fg fimo_foreground_WaGa_down/fimo.filtered.tsv \
            --fimo_bg fimo_background_WaGa_background/fimo.filtered.tsv \
            --output rbp_enrichment_WaGa_down.csv
    
        python plot_volcano.py --csv rbp_enrichment_MKL-1_up.csv --output MKL-1_volcano_up.pdf --title "Upregulated MKL-1"
        python plot_rbp_heatmap.py \
        --csvs rbp_enrichment_MKL-1_up.csv rbp_enrichment_MKL-1_down.csv \
        --labels Upregulated Downregulated \
        --output MKL-1_rbp_enrichment_heatmap.pdf
    
        #Column         Meaning
        #a              Number of unique foreground UTRs hit by the RBP
        #b              Number of unique background UTRs hit by the RBP
        #c              Total number of foreground UTRs
        #d              Total number of background UTRs
        #p_value, fdr   From Fisher's exact test on enrichment
    
        #-- Get all genes the number 1621 refers to --
        #AGO2,1621,5050,5732,12987,1.0,1.0   #MKL-1_up
        #motif_ids are 414 and 399
        grep "^414" fimo.filtered.annotated.tsv > AGO2.txt
        grep "^399" fimo.filtered.annotated.tsv >> AGO2.txt
        cut -d$'\t' -f11 AGO2.txt | sort -u > AGO2_uniq.txt
        wc -l AGO2_uniq.txt
        #1621 AGO2_uniq.txt
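
The same count can be obtained for every RBP at once. The sketch below mirrors the grep/sort -u steps above by taking the union of genes over all motifs of an RBP. The function name is hypothetical and the input is assumed to be (motif_id, gene_id) pairs already parsed from the filtered FIMO table.

```python
from collections import defaultdict


def genes_per_rbp(hits, motif_to_rbp):
    """Return {rbp: set of gene IDs} pooled over all motifs of that RBP.

    hits: iterable of (motif_id, gene_id) pairs.
    motif_to_rbp: {motif_id: iterable of RBP names}.
    """
    genes = defaultdict(set)
    for motif_id, gene_id in hits:
        for rbp in motif_to_rbp.get(motif_id, ()):
            genes[rbp].add(gene_id)
    return dict(genes)
```

The size of each set is the value a (or b for the background) in the enrichment table.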
    
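        #Tool   Function                              Focus                                             Use case
        FIMO    Locates exact motif occurrences       Where each motif occurs                           Finding specific binding sites
        AME     Tests for motif enrichment            Which motifs are enriched in a set of sequences   Checking whether a motif occurs significantly more often

        If you are doing RBP enrichment analysis after differential expression, consider scanning with FIMO first and then using your own code + Fisher’s exact test to do the equivalent of AME, or run AME directly.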
    
        # Generate attract_human.meme including the gene names!
        #python generate_named_meme.py pwm.txt attract_human.txt
        python generate_attract_human_meme.py pwm.txt ATtRACT_db.txt
    
        #ERROR during running ame --> DEBUG!
        #--control ../Data_RNA-Seq_MKL-1+WaGa/motif_analysis/WaGa_all.3UTR.fasta \
        ame --control --shuffle-- \
        --oc ame_out \
        --scoring avg \
        --method fisher --verbose 5 ../Data_RNA-Seq_MKL-1+WaGa/motif_analysis/WaGa_down.3UTR.fasta attract_human.meme
    
    2. GraphProt2 (ALTERNATIVE_TODO)
    
        ML-based tool using sequence + structure
    
        Pre-trained models for many RBPs
    
        ✅ Advantages:
    
        Local, GPU/CPU supported
    
        More biologically realistic (includes structure)
  5. miRNA motif analysis using ATtRACT + FIMO

    ✅ Goal
    
        * Extract the sequences of the differentially expressed miRNAs
        * Generate a background set
        * Run RBP enrichment (e.g., with RBPmap or FIMO)
        * Get p-adjusted enrichment stats (e.g., Fisher + BH)
    
        Input_1. DE results (differential expression file from smallRNA-seq)
            #Input: up- and down-, all-regulated files
            #~/DATA/Data_Ute/Data_Ute_smallRNA_7/summaries_exo7/miRNAs/EV_vs_parental-up.txt  #83
            #~/DATA/Data_Ute/Data_Ute_smallRNA_7/summaries_exo7/miRNAs/EV_vs_parental-down.txt  #34
            #~/DATA/Data_Ute/Data_Ute_smallRNA_7/summaries_exo7/miRNAs/EV_vs_parental-all.txt  #1304
            ~/DATA/Data_Ute/Data_Ute_smallRNA_7/summaries_exo7/miRNAs/untreated_vs_parental_cells-up.txt  #66
            ~/DATA/Data_Ute/Data_Ute_smallRNA_7/summaries_exo7/miRNAs/untreated_vs_parental_cells-down.txt  #38
            ~/DATA/Data_Ute/Data_Ute_smallRNA_7/summaries_exo7/miRNAs/untreated_vs_parental_cells-all.txt  #1304
            #Format: 1st column = miRNA ID (e.g., hsa-miR-21-5p), optionally with other stats.
    
        Input_2. Reference FASTA (Reference sequences from miRBase or GENCODE)
            #From miRBase: https://mirbase.org/download/  https://mirbase.org/download/CURRENT/
            ##miRBase_v21
            #mature.fa.gz → contains mature miRNA sequences
            #hairpin.fa.gz → for pre-miRNAs
    
            cp ~/DATA/Data_Ute/Data_Ute_smallRNA_7/summaries_exo7/miRNAs/untreated_vs_parental_cells-*.txt .
            #"hsa-miR-3180|hsa-miR-3180-3p"
            #>hsa-miR-3180 MIMAT0018178 Homo sapiens miR-3180
            #UGGGGCGGAGCUUCCGGAG
            #>hsa-miR-3180-3p MIMAT0015058 Homo sapiens miR-3180-3p
            #UGGGGCGGAGCUUCCGGAGGCC
    
        5.1 (Optional, not used!)
    
            #python extract_miRNA_fasta.py EV_vs_parental-up.txt mature_v21.fa up_mature_miRNAs.fa --unmatched up_mature_unmatched.txt  #84+0
            #python extract_miRNA_fasta.py EV_vs_parental-up.txt hairpin_v21.fa up_precursor_miRNAs.fa --unmatched up_precursor_unmatched.txt  #0
            #python extract_miRNA_fasta.py EV_vs_parental-down.txt mature_v21.fa down_mature_miRNAs.fa --unmatched down_mature_unmatched.txt  #34+0
            #python extract_miRNA_fasta.py EV_vs_parental-down.txt hairpin_v21.fa down_precursor_miRNAs.fa --unmatched down_precursor_unmatched.txt  #0
            #python extract_miRNA_fasta.py EV_vs_parental-all.txt mature_v21.fa all_mature_miRNAs.fa --unmatched all_mature_unmatched.txt         #1304+16
            #python extract_miRNA_fasta.py EV_vs_parental-all.txt hairpin_v21.fa all_precursor_miRNAs.fa --unmatched all_precursor_unmatched.txt  #16
            python extract_miRNA_fasta.py untreated_vs_parental_cells-up.txt mature_v21.fa up_mature_miRNAs.fa --unmatched up_mature_unmatched.txt  #67+0
            python extract_miRNA_fasta.py untreated_vs_parental_cells-up.txt hairpin_v21.fa up_precursor_miRNAs.fa --unmatched up_precursor_unmatched.txt  #0
            python extract_miRNA_fasta.py untreated_vs_parental_cells-down.txt mature_v21.fa down_mature_miRNAs.fa --unmatched down_mature_unmatched.txt  #38+0
            python extract_miRNA_fasta.py untreated_vs_parental_cells-down.txt hairpin_v21.fa down_precursor_miRNAs.fa --unmatched down_precursor_unmatched.txt  #0
            python extract_miRNA_fasta.py untreated_vs_parental_cells-all.txt mature_v21.fa all_mature_miRNAs.fa --unmatched all_mature_unmatched.txt         #1304+16
            python extract_miRNA_fasta.py untreated_vs_parental_cells-all.txt hairpin_v21.fa all_precursor_miRNAs.fa --unmatched all_precursor_unmatched.txt  #16
    
        5.2 (Advanced)
            Extract Sequences + Background Set
    
            Inputs:
                * up_miRNA.txt and down_miRNA.txt: DE results (first column = miRNA name, e.g., hsa-miR-21-5p)
                * mature.fa or hairpin.fa from miRBase
    
            Outputs:
                * mirna_up.fa
                * mirna_down.fa
                * mirna_background.fa
    
            #Use all remaining miRNAs as background:
            python prepare_miRNA_sets.py untreated_vs_parental_cells-up.txt untreated_vs_parental_cells-down.txt mature_v21.fa mirna --full-background
            mv mirna_background.fa mirna_full-background.fa
            #Use a random-subset background. Note that the generated background contains max(n_up, n_down) records; in this case that is the size of the up set (84 records):
            python prepare_miRNA_sets.py untreated_vs_parental_cells-up.txt untreated_vs_parental_cells-down.txt mature_v21.fa mirna
            # grep ">" mature_v21.fa | wc -l  #35828
            # grep ">" mirna_full-background.fa | wc -l  #35710-->35723
            # grep ">" mirna_up.fa | wc -l  #84
            # grep ">" mirna_down.fa | wc -l  #34
            # grep ">" mirna_background.fa | wc -l  #84-->67
            # #35,710 + 84 + 34 = 35,828
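
The two background modes can be sketched as below. This is a guess at the logic based on the counts above, not the actual prepare_miRNA_sets.py. The function name, the seed parameter and the exact sampling rule are assumptions.

```python
import random


def background_mirnas(all_ids, up_ids, down_ids, full_background=False, seed=0):
    """Return background miRNA IDs that are in neither the up nor the down set.

    full_background=True keeps every remaining miRNA. Otherwise a random
    subset of size max(len(up), len(down)) is drawn with a fixed seed.
    """
    up, down = set(up_ids), set(down_ids)
    rest = sorted(i for i in set(all_ids) if i not in up and i not in down)
    if full_background:
        return rest
    size = min(len(rest), max(len(up), len(down)))
    return sorted(random.Random(seed).sample(rest, size))
```

Fixing the seed makes the random subset reproducible between runs.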
    
        🔬 What You Can Do Next

        Goal                                     Tool                              Input
        RBP motif enrichment in pre-miRNAs       RBPmap, FIMO, AME                 up_precursor_miRNAs.fa
        Motif comparison (up vs down miRNAs)     DREME, MEME, HOMER                Up/down mature miRNAs
        Build background for enrichment          Random subset of other miRNAs     Filtered from hairpin.fa
    
        fimo --thresh 1e-4 --oc fimo_mirna_down attract_human.meme mirna_down.fa
        fimo --thresh 1e-4 --oc fimo_mirna_up attract_human.meme mirna_up.fa
        fimo --thresh 1e-4 --oc fimo_mirna_full-background attract_human.meme mirna_full-background.fa
        fimo --thresh 1e-4 --oc fimo_mirna_background attract_human.meme mirna_background.fa
        #END
    
        python filter_fimo_best_per_gene_annotated.py \
        --input fimo_mirna_down/fimo.tsv \
        --annot Homo_sapiens.GRCh38.gene_annotation.tsv \
        --output_filtered fimo_mirna_down/fimo.filtered.tsv \
        --output_annotated fimo_mirna_down/fimo.filtered.annotated.tsv  #21
        python filter_fimo_best_per_gene_annotated.py \
        --input fimo_mirna_up/fimo.tsv \
        --annot Homo_sapiens.GRCh38.gene_annotation.tsv \
        --output_filtered fimo_mirna_up/fimo.filtered.tsv \
        --output_annotated fimo_mirna_up/fimo.filtered.annotated.tsv  #48
        python filter_fimo_best_per_gene_annotated.py \
        --input fimo_mirna_full-background/fimo.tsv \
        --annot Homo_sapiens.GRCh38.gene_annotation.tsv \
        --output_filtered fimo_mirna_full-background/fimo.filtered.tsv \
        --output_annotated fimo_mirna_full-background/fimo.filtered.annotated.tsv  #896
        python filter_fimo_best_per_gene_annotated.py \
        --input fimo_mirna_background/fimo.tsv \
        --annot Homo_sapiens.GRCh38.gene_annotation.tsv \
        --output_filtered fimo_mirna_background/fimo.filtered.tsv \
        --output_annotated fimo_mirna_background/fimo.filtered.annotated.tsv  #57
    
        python run_enrichment_miRNAs.py \
            --attract ATtRACT_db.txt \
            --fimo_fg fimo_mirna_up/fimo.filtered.tsv \
            --fimo_bg fimo_mirna_full-background/fimo.filtered.tsv \
            --output rbp_enrichment_mirna_up.csv \
            --strategy inclusive
        python run_enrichment_miRNAs.py \
            --attract ATtRACT_db.txt \
            --fimo_fg fimo_mirna_down/fimo.filtered.tsv \
            --fimo_bg fimo_mirna_full-background/fimo.filtered.tsv \
            --output rbp_enrichment_mirna_down.csv \
            --strategy inclusive
        #python run_enrichment_miRNAs.py \
        #    --attract ATtRACT_db.txt \
        #    --fimo_fg fimo_mirna_up/fimo.filtered.tsv \
        #    --fimo_bg fimo_mirna_background/fimo.filtered.tsv \
        #    --output rbp_enrichment_mirna_up_on_subset-background.csv \
        #    --strategy inclusive
        #python run_enrichment_miRNAs.py \
        #    --attract ATtRACT_db.txt \
        #    --fimo_fg fimo_mirna_down/fimo.filtered.tsv \
        #    --fimo_bg fimo_mirna_background/fimo.filtered.tsv \
        #    --output rbp_enrichment_mirna_down_on_subset-background.csv \
        #    --strategy inclusive
    
        #FXR2   1 (hsa-miR-92b-5p)  1   1   118 0.0168067226890756  0.365546218487395
        #ORB2   1 (hsa-miR-4748)    1   1   118 0.0168067226890756  0.365546218487395
    
        #-- Get the miRNAs behind the single FXR2 and ORB2 hits --
        grep "^FXR2" ATtRACT_db.txt
        #motif_id is M020_0.6
        grep "^M020_0.6" fimo_mirna_up/fimo.filtered.annotated.tsv > FXR2.txt
        grep "^M020_0.6" fimo_mirna_up/fimo.filtered.annotated.tsv
        #cut -d$'\t' -f11 AGO2.txt | sort -u > AGO2_uniq.txt
        #wc -l AGO2_uniq.txt (1621 records)
    
        grep "^ORB2" ATtRACT_db.txt
        grep "^M120_0.6" fimo_mirna_up/fimo.filtered.annotated.tsv
  6. RBP Enrichment from RBPmap Results (NOT implemented!)

        🔹 Use RBPmap output (typically CSV or TSV)
        🔹 Compare hit counts in input vs background
        🔹 Perform Fisher’s exact test + Benjamini-Hochberg correction
        🔹 Plot significantly enriched RBPs

        📁 Requirements
        You’ll need:
    
        File    Description
        rbpmap_up.tsv   RBPmap result file for upregulated set
        rbpmap_background.tsv   RBPmap result file for background set
    
        📝 These should have columns like:

            * Motif Name or Protein
            * Sequence Name or Sequence ID
    
        python analyze_rbpmap_enrichment.py rbpmap_up.tsv rbpmap_background.tsv enriched_up.csv enriched_up_plot.png
    
        ✅ Output (illustrative format; this step was not run)
        enriched_up.csv

        RBP       FG_hits   BG_hits   pval     padj    enriched
        ELAVL1    24        2         0.0001   0.003   ✅
        HNRNPA1   15        10        0.04     0.06    ❌

        enriched_up_plot.png
        Barplot showing top significant RBPs (lowest FDR)
    
  7. The planned pipeline is very reasonable and scientifically sound for RBP (RNA-binding protein) enrichment using miRNAs as input for FIMO motif scanning and downstream analysis.

    Here’s a breakdown of why your steps make sense, plus a few important suggestions:
    
    ✅ YOUR PIPELINE LOGIC (EXCELLENT)
    1. Motif Scanning with FIMO
    
    fimo --thresh 1e-4 --oc fimo_mirna_* attract_human.meme mirna_*.fa
    ✔️ Uses ATtRACT motifs to scan miRNA sequences.
    
    ✔️ Applies a reasonable p-value threshold (1e-4) to detect motif hits.
    
    📌 If you want more sensitivity (but more noise), you could try 1e-3; for higher stringency, 1e-5.
    
    2. Filter FIMO hits to best per gene and annotate
    
    python filter_fimo_best_per_gene_annotated.py --input ... --annot ... --output ...
    ✔️ Filters redundant hits.
    
    ✔️ Annotates with gene symbols or other metadata — essential for interpretability.
    
    3. RBP Enrichment
    
    python run_enrichment_miRNAs.py --fimo_fg ... --fimo_bg ... --attract ... --output ...
    ✔️ Uses foreground (up/down-regulated) and background (non-differential) miRNAs.
    
    ✔️ The --strategy inclusive is typical and includes all motif hits, not just strongest per gene or region.
    
    ✔️ Running up and down separately = ✅ best practice.
    
    🔍 MINOR IMPROVEMENTS / SUGGESTIONS
    ✅ Use the same background for both up and down analyses:
    You already are — good!
    
    This ensures your enrichment comparison is balanced and interpretable.
    
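    ⚠️ Check motif directionality:
    RNA is single-stranded, but FIMO scans both strands by default, so hits reported on the - strand are reverse-complement matches that do not occur in the RNA itself. Run fimo with --norc, or keep only + strand hits, when scanning mature miRNA or UTR sequences.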
    
    ✅ Consider including summary stats:
    How many motifs hit in up/down/background?
    
    How many unique RBPs?
    
    This helps interpret enrichment p-values.
    
    🧪 Statistical validation (optional but helpful):
    If you have many miRNAs (e.g. >30), consider bootstrapping enrichment from random subsets of the background to assess robustness.
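One way to do such a robustness check is sketched below. It is a hedged illustration rather than part of the pipeline: the function name empirical_p, the number of iterations and the toy background are assumptions. For one RBP, it repeatedly draws background subsets of the same size as the foreground and asks how often a random subset contains at least as many motif-positive sequences as the foreground.

```python
import random

def empirical_p(fg_hits, fg_total, bg_has_motif, n_iter=1000, seed=0):
    """Empirical p-value for observing >= fg_hits motif-positive sequences
    in a random background subset of size fg_total.

    bg_has_motif: list of booleans, one per background sequence.
    """
    rng = random.Random(seed)
    size = min(fg_total, len(bg_has_motif))
    at_least = 0
    for _ in range(n_iter):
        # sample without replacement, like picking a random "foreground"
        if sum(rng.sample(bg_has_motif, size)) >= fg_hits:
            at_least += 1
    # add-one correction so the estimate is never exactly zero
    return (at_least + 1) / (n_iter + 1)

# Toy background: 10 of 100 sequences carry the motif (made-up numbers)
background = [True] * 10 + [False] * 90

p_enriched = empirical_p(fg_hits=8, fg_total=10, bg_has_motif=background)
p_trivial = empirical_p(fg_hits=0, fg_total=10, bg_has_motif=background)
print(p_enriched, p_trivial)
```

An RBP whose Fisher p-value is small but whose empirical p-value is not would be worth a second look.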
    
    ✅ FINAL VERDICT
    Your pipeline is:
    
    ✔️ Technically correct
    
    ✔️ Biologically appropriate
    
    ✔️ Reproducible and clear
    
    🧠 Well-structured for miRNA-RBP motif enrichment
  8. Why background differs between RNA-seq and miRNA motif analysis:

    In RNA-seq differential expression analysis, especially for gene-level enrichment (e.g., GO or pathway analysis), the background typically includes all genes detected or expressed, not just a random subset. The logic is:
    
    ✅ “If a gene was testable (i.e., expressed above a threshold), it should be part of the universe we consider when evaluating enrichment.”
    
    This is because the statistical model assumes:
    
    Your universe is all genes tested.
    
    You're asking: Are my differentially expressed genes overrepresented in some category, compared to all testable genes?
    
    🔬 Why a random subset background was used in the miRNA motif pipeline (e.g., RBP enrichment):
    In the prepare_miRNA_sets.py script, the subset was used for one reason only:
    
    📉 To balance the number of background sequences relative to the foreground (up/down) — especially for statistical tests like Fisher's exact test.
    
    But this is optional and not required for correctness. A larger background improves statistical power, but must match distributional properties (e.g., GC content, sequence length) to avoid bias.
    
    ✅ So: Should you use all non-differentially expressed miRNAs as background?
    Yes, that is often more appropriate, assuming:
    
    You have a full set of detectable/testable miRNAs (e.g., all from the FASTA).
    
    You exclude the foreground (up/down) sets.
    
    The background matches the general properties of the foreground set.
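The three conditions above can be sketched in a few lines. This is a hypothetical helper, not the prepare_miRNA_sets.py script: it takes all sequences as a dict, removes the foreground, and keeps only background sequences whose length lies within the foreground length range; gc_fraction can be used to compare GC content between the two sets. All IDs and sequences are made up.

```python
def gc_fraction(seq):
    """Fraction of G/C bases in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0

def build_background(all_seqs, foreground_ids):
    """Return all non-foreground sequences within the foreground length range.

    all_seqs: dict mapping sequence ID -> sequence string.
    """
    fg = set(foreground_ids) & set(all_seqs)
    if not fg:
        raise ValueError("no foreground IDs found in all_seqs")
    lengths = [len(all_seqs[i]) for i in fg]
    lo, hi = min(lengths), max(lengths)
    return {i: s for i, s in all_seqs.items()
            if i not in fg and lo <= len(s) <= hi}

# Made-up sequences: 'a' and 'b' are the foreground (lengths 20 and 22)
seqs = {
    "a": "ACGU" * 5,
    "b": "ACGU" * 5 + "AC",
    "c": "GGCCAU" * 3 + "AC",   # length 20, kept
    "d": "ACGU",                # too short, dropped
    "e": "A" * 21,              # length 21, kept
    "f": "A" * 30,              # too long, dropped
}
bg = build_background(seqs, ["a", "b"])
print(sorted(bg), [round(gc_fraction(s), 2) for s in bg.values()])
```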
  9. RBP enrichment via FIMO (an alternative description; the workflow is the same as in point 4)

    1. Collect the 3′ UTR sequences: Use the 3UTR.fasta file generated earlier, filtered to protein-coding and downregulated genes.
    
    2. Prepare Motif Database (MEME format)
    
        * ATtRACT: https://attract.cnic.es
        * RBPDB: http://rbpdb.ccbr.utoronto.ca
        * Ray2013 (CISBP-RNA motifs) — available via MEME Suite
        * [RBPmap motifs (if downloadable)]
        #Example format: rbp_motifs.meme
    
    2. Run FIMO to Scan for RBP Motifs (Similar to RBPmap)
    
        fimo --oc fimo_up rbp_motifs.meme mirna_up.fa
        fimo --oc fimo_down rbp_motifs.meme mirna_down.fa
        fimo --oc fimo_background rbp_motifs.meme mirna_background.fa
        #This produces fimo.tsv in each output folder.
    
    3. Run RBP motif enrichment with AME (Analysis of Motif Enrichment) from the MEME Suite. Note that FIMO followed by run_enrichment.py is roughly equivalent to AME; running AME directly, however, returned an error:
    
        ame \
        --control control_3UTRs.fasta \
        --oc ame_out \
        --scoring avg \
        --method fisher \
        3UTR.fasta \
        rbp_motifs.meme
    
        Where:
    
        * 3UTR.fasta = your downregulated genes’ 3′ UTRs
        * control_3UTRs.fasta = background UTRs (e.g., random protein-coding genes not downregulated)
        * rbp_motifs.meme = motif file from RBPDB or Ray2013
    
    4. Interpret Results: Output includes RBP motifs enriched in your downregulated mRNAs' 3′ UTRs.
    
        You can then link enriched RBPs to known interactions with your upregulated miRNAs, or explore their regulatory roles.
    
    5. ✅ Bonus: Predict Which mRNAs Are Targets of Your miRNAs
    
        Use tools like: miRanda, TargetScan, miRDB
    
        Then intersect predicted targets with your downregulated genes to identify likely functional interactions.
    
    6. Summary
    
        Goal                    Input                                                 Tool / Approach
        RBP enrichment          3UTR.fasta of downregulated genes                     AME with RBP motifs
        Background/control      3′ UTRs from non-differential or upregulated genes
        Link miRNA to targets   Use TargetScan / miRanda                              Intersect with down genes
    
  10. Other options to get sequences of 3UTR, 5UTR, CDS and mRNA transcripts

    ✅ Option 2: Use Ensembl BioMart (web-based, no coding) --> takes too long!
    
        Go to Ensembl BioMart https://www.ensembl.org/biomart/martview/7b826bcbd0cec79021977f8dc12a8f61
    
        Select:
    
        Database: Ensembl Genes
        Dataset: Homo sapiens genes (GRCh38 or latest)
    
        Click on “Filters” → expand Region or Gene to limit your selection (optional).
        Click on “Attributes”:
        Under Sequences, check:
        Sequences
        3' UTR sequences
    
        Optionally add gene IDs, transcript IDs, etc.
    
        Click “Results” to view/download the FASTA of 3' UTRs.
    
    ✅ Option 3: Use GENCODE (precompiled annotations) and gffread
    
        Use a tool like gffread (from the Cufflinks or gffread package) to extract 3' UTRs:
    
            #gffread gencode.v44.annotation.gtf -g GRCh38.primary_assembly.genome.fa -w all_utrs.fa -U
            #gffread -w three_prime_utrs.fa -g GRCh38.fa -x cds.fa -y proteins.fa -U -F gencode.gtf
    
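            # Note: GENCODE GTF files label both UTR types as "UTR"; "three_prime_utr" is the feature name in Ensembl GTF files, so check the third column before grepping.
            grep -P "\tthree_prime_utr\t" gencode.v48.annotation.gtf > three_prime_utrs.gtf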
            gtf2bed < three_prime_utrs.gtf > three_prime_utrs.bed
            bedtools getfasta -fi GRCh38.primary_assembly.genome.fa -bed three_prime_utrs.bed -name -s > three_prime_utrs.fa
    
            gffread gencode.v48.annotation.gtf -g GRCh38.primary_assembly.genome.fa -U -w all_with_utrs.fa
    
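        Note: in gffread, -w writes the spliced transcript sequences (UTRs included) and -x writes the CDS; -U discards single-exon transcripts rather than extracting UTRs, so 3' UTRs must be derived afterwards (e.g. from the CDS coordinates).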
    
    ✅ Option 4: Use Bioconductor in R (UCSC-ID, not suitable!)
    
        # Install if not already installed
        if (!requireNamespace("BiocManager", quietly = TRUE))
            install.packages("BiocManager")
        BiocManager::install("GenomicFeatures")
        BiocManager::install("txdbmaker")
        #sudo apt-get update
        #sudo apt-get install libmariadb-dev
        #(optional)sudo apt-get install libmysqlclient-dev
        install.packages("RMariaDB")
    
        # Load library
        library(GenomicFeatures)
    
        # Create TxDb object for human genome
        txdb <- txdbmaker::makeTxDbFromUCSC(genome="hg38", tablename="refGene")
    
        # Extract 3' UTRs by transcript
        utr3 <- threeUTRsByTranscript(txdb, use.names=TRUE)
    
        # View or export as needed
    
    ✅ Option 5: Extract 3′ UTRs Using UCSC Table Browser (GUI method)
        🔗 Website:
        UCSC Table Browser
    
        🔹 Step-by-Step Instructions
        1. Set the basic parameters:
        Clade: Mammal
    
        Genome: Human
    
        Assembly: GRCh38/hg38
    
        Group: Genes and Gene Predictions
    
        Track: GENCODE v44 (or latest)
    
        Table: knownGene or wgEncodeGencodeBasicV44
    
        Choose knownGene for RefSeq-like models or wgEncodeGencodeBasicV44 for GENCODE
    
        2. Region:
        Select: genome (default)
    
        3. Output format:
        Select: sequence
    
        4. Click "get output"
        🔹 Sequence Retrieval Options:
        On the next page (after clicking "get output"), you’ll see sequence options.
    
        Configure as follows:
        ✅ Output format: FASTA
    
        ✅ Which part of the gene: Select only
        → UTRs → 3' UTR only
    
        ✅ Header options: choose if you want gene name,
  11. ⚡️ Bonus: Combine with miRNA-mRNA predictions

    Once you have RBPs enriched in downregulated mRNAs, you can intersect:
        * Which RBPs overlap miRNA binding regions (e.g., via CLIPdb or POSTAR)
        * Check if miRNAs and RBPs compete or co-bind
    This can lead to identifying miRNA-RBP regulatory modules.
  12. Reports

Please find attached the results of the RNA-binding protein (RBP) enrichment analysis using FIMO and the ATtRACT motif database, along with a brief description of the procedures used for both the 3′ UTR-based analysis (RNA-seq) and the miRNA-based analysis (small RNA-seq).

    1. RBP Motif Enrichment from RNA-seq (3′ UTRs)

    We focused on 3′ UTRs, as they are key regulatory regions for RBPs. Sequences shorter than 16 nucleotides were excluded. Using FIMO (from the MEME suite) with motifs from the ATtRACT database, we scanned both foreground and background 3′ UTR sets to identify motif occurrences.

    Foreground: Differentially expressed transcripts (e.g., MKL-1 up/down, WaGa up/down)
    Background: All non-differentially expressed transcripts

    Analysis: Fisher’s exact test was used to assess motif enrichment; p-values were adjusted using the Benjamini–Hochberg method.

    Output files (RNA-seq):

        * rbp_enrichment_MKL-1_down.xlsx / .png
        * rbp_enrichment_MKL-1_up.xlsx / .png
        * rbp_enrichment_WaGa_down.xlsx / .png
        * rbp_enrichment_WaGa_up.xlsx / .png

    2. RBP Motif Enrichment from Small RNA-seq (miRNAs)

    This analysis focused on differentially expressed miRNAs, using mature miRNA sequences from miRBase. We scanned for RBP binding motifs within these sequences using FIMO and assessed motif enrichment relative to background sets.

    Foreground: DE miRNAs (up/down) from small RNA-seq comparisons
    Background: All other miRNAs from miRBase

    Analysis: FIMO was used with --thresh 1e-4, followed by annotation and filtering. Enrichment was assessed using Fisher’s test + BH correction.

    Output files (miRNAs):

        * rbp_enrichment_mirna_down.xlsx
        * rbp_enrichment_mirna_up.xlsx

    How to Interpret the Numbers
    Each row in the result tables represents one RBP and its enrichment statistics:

    a: foreground genes/sequences with the motif
    b: background genes/sequences with the motif
    c: total number of foreground genes/sequences
    d: total number of background genes/sequences

    These values are used to compute p-values and FDRs.
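How a, b, c and d turn into a p-value and an FDR can be sketched as follows. This is an illustration under stated assumptions, not the code of run_enrichment.py: it assumes a one-sided Fisher's exact test for enrichment in the foreground on the 2x2 table [[a, c - a], [b, d - b]] (whether the script uses a one- or two-sided test is not stated here), followed by Benjamini-Hochberg adjustment. The function names and the counts are made up.

```python
from math import comb

def fisher_greater(a, b, c, d):
    """One-sided Fisher's exact test for enrichment in the foreground.

    a: foreground sequences with the motif, b: background with the motif,
    c: total foreground sequences,          d: total background sequences.
    """
    N, K, n = c + d, a + b, c          # population, motif-positive, drawn
    upper = min(K, n)
    # hypergeometric tail: P(X >= a)
    tail = sum(comb(K, k) * comb(N - K, n - k) for k in range(a, upper + 1))
    return tail / comb(N, n)

def benjamini_hochberg(pvals):
    """Return BH-adjusted p-values in the original order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):       # walk from largest to smallest p
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Made-up counts for three hypothetical RBPs: (a, b, c, d)
counts = {"RBP_A": (3, 1, 4, 4), "RBP_B": (2, 2, 4, 4), "RBP_C": (0, 3, 4, 4)}
pvals = [fisher_greater(*v) for v in counts.values()]
fdrs = benjamini_hochberg(pvals)
for name, p, q in zip(counts, pvals, fdrs):
    print(name, round(p, 4), round(q, 4))
```

For real 3′ UTR sets with thousands of sequences the same arithmetic applies; a library routine such as a Fisher test from a statistics package would simply be faster.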

    For example, in rbp_enrichment_MKL-1_up.xlsx, AGO2 has a = 1621, meaning FIMO detected AGO2 motifs in 1,621 genes in the MKL-1 upregulated set. These genes are listed in AGO2_uniq.txt.

    Similarly, for the miRNA analysis (e.g., rbp_enrichment_mirna_up.xlsx and rbp_enrichment_mirna_down.xlsx), the numbers represent counts of unique miRNAs with at least one significant motif hit. As examples, I calculated the detailed membership for FXR2 and ORB2.

Post-processing of DAMIAN results

  1. Prepare input raw data

    # -- Ringversuch --
    ~/DATA/Data_Damian/241213_VH00358_120_AAG523FM5_Ringversuch
    ln ../241213_VH00358_120_AAG523FM5_Ringversuch/p20579/01_RV1_DNA_S1_R1_001.fastq.gz RV1_DNA_R1.fastq.gz
    ln ../241213_VH00358_120_AAG523FM5_Ringversuch/p20579/01_RV1_DNA_S1_R2_001.fastq.gz RV1_DNA_R2.fastq.gz
    ln ../241213_VH00358_120_AAG523FM5_Ringversuch/p20580/02_RV2_DNA_S2_R1_001.fastq.gz RV2_DNA_R1.fastq.gz
    ln ../241213_VH00358_120_AAG523FM5_Ringversuch/p20580/02_RV2_DNA_S2_R2_001.fastq.gz RV2_DNA_R2.fastq.gz
    ln ../241213_VH00358_120_AAG523FM5_Ringversuch/p20581/03_RV3_DNA_S3_R1_001.fastq.gz RV3_DNA_R1.fastq.gz
    ln ../241213_VH00358_120_AAG523FM5_Ringversuch/p20581/03_RV3_DNA_S3_R2_001.fastq.gz RV3_DNA_R2.fastq.gz
    ln ../241213_VH00358_120_AAG523FM5_Ringversuch/p20582/04_RV4_DNA_S4_R1_001.fastq.gz RV4_DNA_R1.fastq.gz
    ln ../241213_VH00358_120_AAG523FM5_Ringversuch/p20582/04_RV4_DNA_S4_R2_001.fastq.gz RV4_DNA_R2.fastq.gz
    ln ../241213_VH00358_120_AAG523FM5_Ringversuch/p20583/05_RV5_DNA_S5_R1_001.fastq.gz RV5_DNA_R1.fastq.gz
    ln ../241213_VH00358_120_AAG523FM5_Ringversuch/p20583/05_RV5_DNA_S5_R2_001.fastq.gz RV5_DNA_R2.fastq.gz
    ln ../241213_VH00358_120_AAG523FM5_Ringversuch/p20584/06_RV6_DNA_S6_R1_001.fastq.gz RV6_DNA_R1.fastq.gz
    ln ../241213_VH00358_120_AAG523FM5_Ringversuch/p20584/06_RV6_DNA_S6_R2_001.fastq.gz RV6_DNA_R2.fastq.gz
    ln ../241213_VH00358_120_AAG523FM5_Ringversuch/p20585/07_RV1_RNA_S7_R1_001.fastq.gz RV1_RNA_R1.fastq.gz
    ln ../241213_VH00358_120_AAG523FM5_Ringversuch/p20585/07_RV1_RNA_S7_R2_001.fastq.gz RV1_RNA_R2.fastq.gz
    ln ../241213_VH00358_120_AAG523FM5_Ringversuch/p20586/08_RV2_RNA_S8_R1_001.fastq.gz RV2_RNA_R1.fastq.gz
    ln ../241213_VH00358_120_AAG523FM5_Ringversuch/p20586/08_RV2_RNA_S8_R2_001.fastq.gz RV2_RNA_R2.fastq.gz
    ln ../241213_VH00358_120_AAG523FM5_Ringversuch/p20587/09_RV3_RNA_S9_R1_001.fastq.gz RV3_RNA_R1.fastq.gz
    ln ../241213_VH00358_120_AAG523FM5_Ringversuch/p20587/09_RV3_RNA_S9_R2_001.fastq.gz RV3_RNA_R2.fastq.gz
    ln ../241213_VH00358_120_AAG523FM5_Ringversuch/p20588/10_RV4_RNA_S10_R1_001.fastq.gz RV4_RNA_R1.fastq.gz
    ln ../241213_VH00358_120_AAG523FM5_Ringversuch/p20588/10_RV4_RNA_S10_R2_001.fastq.gz RV4_RNA_R2.fastq.gz
    ln ../241213_VH00358_120_AAG523FM5_Ringversuch/p20589/11_RV5_RNA_S11_R1_001.fastq.gz RV5_RNA_R1.fastq.gz
    ln ../241213_VH00358_120_AAG523FM5_Ringversuch/p20589/11_RV5_RNA_S11_R2_001.fastq.gz RV5_RNA_R2.fastq.gz
    ln ../241213_VH00358_120_AAG523FM5_Ringversuch/p20590/12_RV6_RNA_S12_R1_001.fastq.gz RV6_RNA_R1.fastq.gz
    ln ../241213_VH00358_120_AAG523FM5_Ringversuch/p20590/12_RV6_RNA_S12_R2_001.fastq.gz RV6_RNA_R2.fastq.gz
  2. Prepare virus database and select 8 representatives for the eight given viruses from the database

    # -- Download all genomes --
    # enterovirus D68
    # HSV-1
    # HSV-2
    # Influenza A H1N1
    # Cytomegalovirus AD169 (Human herpesvirus 5, HHV-5, commonly known as cytomegalovirus, CMV)
    # Influenza A H3N2
    # Monkeypox
    # HIV-1
    
    esearch -db nucleotide -query "txid42789[Organism:exp]" | efetch -format fasta -email j.huang@uke.de > genome_42789_ncbi.fasta
    python ~/Scripts/filter_fasta.py genome_42789_ncbi.fasta complete_42789_ncbi.fasta    #899
    esearch -db nucleotide -query "txid10298[Organism:exp]" | efetch -format fasta -email j.huang@uke.de > genome_10298_ncbi.fasta
    python ~/Scripts/filter_fasta.py genome_10298_ncbi.fasta complete_10298_ncbi.fasta    #162
    esearch -db nucleotide -query "txid10310[Organism:exp]" | efetch -format fasta -email j.huang@uke.de > genome_10310_ncbi.fasta
    python ~/Scripts/filter_fasta.py genome_10310_ncbi.fasta complete_10310_ncbi.fasta    #33
    esearch -db nucleotide -query "txid1323429[Organism:exp]" | efetch -format fasta -email j.huang@uke.de > genome_1323429_ncbi.fasta
    python ~/Scripts/filter_fasta2.py genome_1323429_ncbi.fasta complete_1323429_ncbi.fasta    #465
    esearch -db nucleotide -query "txid10360[Organism:exp]" | efetch -format fasta -email j.huang@uke.de > genome_10360_ncbi.fasta
    python ~/Scripts/filter_fasta2.py genome_10360_ncbi.fasta complete_10360_ncbi.fasta    #1
    esearch -db nucleotide -query "txid41857[Organism:exp]" | efetch -format fasta -email j.huang@uke.de > genome_41857_ncbi.fasta
    python ~/Scripts/filter_fasta2.py genome_41857_ncbi.fasta complete_41857_ncbi.fasta    #120
    esearch -db nucleotide -query "txid10244[Organism:exp]" | efetch -format fasta -email j.huang@uke.de > genome_10244_ncbi.fasta
    python ~/Scripts/filter_fasta.py genome_10244_ncbi.fasta complete_10244_ncbi.fasta    #2525
    esearch -db nucleotide -query "txid11676[Organism:exp]" | efetch -format fasta -email j.huang@uke.de > genome_11676_ncbi.fasta
    python ~/Scripts/filter_fasta.py genome_11676_ncbi.fasta complete_11676_ncbi.fasta    #485995-->7416
    
    # ---- Alternatively, using ENA instead to download the genomes ----
    # https://www.ebi.ac.uk/ena/browser/view/11676 (1138065 records)
    # #Click "Sequence" and download "Counts" (1132648) and "Taxon descendants count" (1138065) if there is enough time! Download date: 09.04.2025.
    # python ~/Scripts/filter_fasta.py  ena_11676_sequence.fasta complete_11676_ena.fasta  #1138065-->????
    
    # Virus Name    NCBI TaxID
    # ------------------------
    # Enterovirus D68   42789                             >PQ895337.1 Enterovirus D68 isolate SH2024-25870
    # HSV-1 (Herpes Simplex Virus 1)    10298             >PQ569920.1 Human alphaherpesvirus 1 isolate MacIntyre, complete genome
    # HSV-2 (Herpes Simplex Virus 2)    10310             >OM370995.1 Human alphaherpesvirus 2 strain G, complete genome
    
        samtools faidx complete_42789_ncbi.fasta PQ895337.1 > Enterovirus_D68_isolate_SH2024-25870.fasta
        samtools faidx complete_10298_ncbi.fasta PQ569920.1 > HSV-1_isolate_MacIntyre.fasta
        samtools faidx complete_10310_ncbi.fasta OM370995.1 > HSV-2_strain_G.fasta
    
    # Influenza A virus (H1N1)  1323429
    # The Influenza A virus (H1N1) genome is composed of eight single-stranded negative-sense RNA segments, and the total genome size is approximately 13,500 nucleotides (13.5 kb).
    # Segment  Gene  Protein product(s)              Approx. length (nt)
    # 1        PB2   Polymerase basic 2              ~2,341
    # 2        PB1   Polymerase basic 1, PB1-F2      ~2,341
    # 3        PA    Polymerase acidic               ~2,233
    # 4        HA    Hemagglutinin                   ~1,778
    # 5        NP    Nucleoprotein                   ~1,565
    # 6        NA    Neuraminidase                   ~1,413
    # 7        M     Matrix proteins (M1, M2)        ~1,027
    # 8        NS    Nonstructural (NS1, NS2)        ~890
    
    # >LC662544.1 Influenza A virus (H1N1) A/PR/8/34 NEP, NS1 genes for nonstructural protein 2, nonstructural protein 1, complete cds
    # >LC662543.1 Influenza A virus (H1N1) A/PR/8/34 M2, M1 genes for matrix protein 2, matrix protein 1, complete cds
    # >LC662542.1 Influenza A virus (H1N1) A/PR/8/34 NA gene for neuraminidase, complete cds
    # >LC662541.1 Influenza A virus (H1N1) A/PR/8/34 NP gene for nucleoprotein, complete cds
    # >LC662540.1 Influenza A virus (H1N1) A/PR/8/34 HA gene for haemagglutinin, complete cds
    # >LC662539.1 Influenza A virus (H1N1) A/PR/8/34 PA, PA-X genes for polymerase PA, PA-X protein, complete cds
    # >LC662538.1 Influenza A virus (H1N1) A/PR/8/34 PB1, PB1-F2 genes for polymerase PB1, PB1-F2 protein, complete cds
    # >LC662537.1 Influenza A virus (H1N1) A/PR/8/34 PB2 gene for polymerase PB2, complete cds
    
        samtools faidx complete_1323429_ncbi.fasta LC662537.1 > H1N1_A-PR-8-34_PB2.fasta
        samtools faidx complete_1323429_ncbi.fasta LC662538.1 > H1N1_A-PR-8-34_PB1.fasta
        samtools faidx complete_1323429_ncbi.fasta LC662539.1 > H1N1_A-PR-8-34_PA.fasta
        samtools faidx complete_1323429_ncbi.fasta LC662540.1 > H1N1_A-PR-8-34_HA.fasta
        samtools faidx complete_1323429_ncbi.fasta LC662541.1 > H1N1_A-PR-8-34_NP.fasta
        samtools faidx complete_1323429_ncbi.fasta LC662542.1 > H1N1_A-PR-8-34_NA.fasta
        samtools faidx complete_1323429_ncbi.fasta LC662543.1 > H1N1_A-PR-8-34_M.fasta
        samtools faidx complete_1323429_ncbi.fasta LC662544.1 > H1N1_A-PR-8-34_NS.fasta
    
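    # Human cytomegalovirus AD169   10360: >X17403.1 Human cytomegalovirus strain AD169, complete genome
        samtools faidx complete_10360_ncbi.fasta X17403.1 > Human_cytomegalovirus_strain_AD169.fasta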
    
    # Influenza A virus (H3N2)  41857
    
    # >LC817411.1 Influenza A virus H3N2 A_Fukushima_OR808_2023 RNA, seqment 8, complete sequence
    # >LC817410.1 Influenza A virus H3N2 A_Fukushima_OR808_2023 RNA, seqment 7, complete sequence
    # >LC817409.1 Influenza A virus H3N2 A_Fukushima_OR808_2023 RNA, seqment 6, complete sequence
    # >LC817408.1 Influenza A virus H3N2 A_Fukushima_OR808_2023 RNA, seqment 5, complete sequence
    # >LC817407.1 Influenza A virus H3N2 A_Fukushima_OR808_2023 RNA, seqment 4, complete sequence
    # >LC817406.1 Influenza A virus H3N2 A_Fukushima_OR808_2023 RNA, seqment 3, complete sequence
    # >LC817405.1 Influenza A virus H3N2 A_Fukushima_OR808_2023 RNA, seqment 2, complete sequence
    # >LC817404.1 Influenza A virus H3N2 A_Fukushima_OR808_2023 RNA, seqment 1, complete sequence
    
        samtools faidx complete_41857_ncbi.fasta LC817404.1 > H3N2_A-Fukushima-OR808-2023_PB2.fasta
        samtools faidx complete_41857_ncbi.fasta LC817405.1 > H3N2_A-Fukushima-OR808-2023_PB1.fasta
        samtools faidx complete_41857_ncbi.fasta LC817406.1 > H3N2_A-Fukushima-OR808-2023_PA.fasta
        samtools faidx complete_41857_ncbi.fasta LC817407.1 > H3N2_A-Fukushima-OR808-2023_HA.fasta
        samtools faidx complete_41857_ncbi.fasta LC817408.1 > H3N2_A-Fukushima-OR808-2023_NP.fasta
        samtools faidx complete_41857_ncbi.fasta LC817409.1 > H3N2_A-Fukushima-OR808-2023_NA.fasta
        samtools faidx complete_41857_ncbi.fasta LC817410.1 > H3N2_A-Fukushima-OR808-2023_M.fasta
        samtools faidx complete_41857_ncbi.fasta LC817411.1 > H3N2_A-Fukushima-OR808-2023_NS.fasta
    
    # Monkeypox virus   10244: >OP689666.1 Monkeypox virus isolate MPXV/Germany/2022/RKI513, complete genome
        samtools faidx complete_10244_ncbi.fasta OP689666.1 > Monkeypox_isolate_MPXV-Germany-2022-RKI513.fasta
    
    # Human immunodeficiency virus 1    11676: >AJ866558.1 Human immunodeficiency virus 1 complete genome, isolate 01IC-PCI127
        samtools faidx complete_11676_ncbi.fasta AJ866558.1 >  HIV-1_isolate_01IC-PCI127.fasta
    
    # -- Selected genomes saved in the fasta-files --
    # Enterovirus_D68_isolate_SH2024-25870.fasta
    # HSV-1_isolate_MacIntyre.fasta
    # HSV-2_strain_G.fasta
    # H1N1_A-PR-8-34_PB2.fasta
    # H1N1_A-PR-8-34_PB1.fasta
    # H1N1_A-PR-8-34_PA.fasta
    # H1N1_A-PR-8-34_HA.fasta
    # H1N1_A-PR-8-34_NP.fasta
    # H1N1_A-PR-8-34_NA.fasta
    # H1N1_A-PR-8-34_M.fasta
    # H1N1_A-PR-8-34_NS.fasta
    # Human_cytomegalovirus_strain_AD169.fasta
    # H3N2_A-Fukushima-OR808-2023_PB2.fasta
    # H3N2_A-Fukushima-OR808-2023_PB1.fasta
    # H3N2_A-Fukushima-OR808-2023_PA.fasta
    # H3N2_A-Fukushima-OR808-2023_HA.fasta
    # H3N2_A-Fukushima-OR808-2023_NP.fasta
    # H3N2_A-Fukushima-OR808-2023_NA.fasta
    # H3N2_A-Fukushima-OR808-2023_M.fasta
    # H3N2_A-Fukushima-OR808-2023_NS.fasta
    # Monkeypox_isolate_MPXV-Germany-2022-RKI513.fasta
    # HIV-1_isolate_01IC-PCI127.fasta
  3. (Optional) Run the first round of vrap (--virus=viruses_selected.fasta)

    ln -s ~/Tools/vrap/ .
    mamba activate /home/jhuang/miniconda3/envs/vrap
    
    cd ~/DATA/Data_Damian/vrap_Ringversuch
    cat complete_42789_ncbi.fasta complete_10298_ncbi.fasta complete_10310_ncbi.fasta complete_1323429_ncbi.fasta complete_10360_ncbi.fasta complete_41857_ncbi.fasta complete_10244_ncbi.fasta complete_11676_ncbi.fasta > viruses_selected.fasta
    
    #Run vrap (first round): set --virus to the taxon-specific database (e.g. viruses_selected.fasta) --> change virus_user_db --> specific_bacteria_user_db
    (vrap) for sample in RV1_DNA RV2_DNA RV3_DNA RV4_DNA RV5_DNA RV6_DNA  RV1_RNA RV2_RNA RV3_RNA RV4_RNA RV5_RNA RV6_RNA; do
        vrap/vrap.py  -1 ${sample}_R1.fastq.gz -2 ${sample}_R2.fastq.gz  -o vrap_${sample} --bt2idx=/home/jhuang/REFs/genome --host=/home/jhuang/REFs/genome.fa --virus=/home/jhuang/DATA/Data_Damian/vrap_Ringversuch/viruses_selected.fasta --nt=/mnt/nvme1n1p1/blast/nt --nr=/mnt/nvme1n1p1/blast/nr  -t 100 -l 200  -g
    done
  4. Run the second round of vrap (--host=${virus}.fasta)

    cat Enterovirus_D68_isolate_SH2024-25870.fasta HSV-1_isolate_MacIntyre.fasta HSV-2_strain_G.fasta H1N1_A-PR-8-34_PB2.fasta H1N1_A-PR-8-34_PB1.fasta H1N1_A-PR-8-34_PA.fasta H1N1_A-PR-8-34_HA.fasta H1N1_A-PR-8-34_NP.fasta H1N1_A-PR-8-34_NA.fasta H1N1_A-PR-8-34_M.fasta H1N1_A-PR-8-34_NS.fasta Human_cytomegalovirus_strain_AD169.fasta H3N2_A-Fukushima-OR808-2023_PB2.fasta H3N2_A-Fukushima-OR808-2023_PB1.fasta H3N2_A-Fukushima-OR808-2023_PA.fasta H3N2_A-Fukushima-OR808-2023_HA.fasta H3N2_A-Fukushima-OR808-2023_NP.fasta H3N2_A-Fukushima-OR808-2023_NA.fasta H3N2_A-Fukushima-OR808-2023_M.fasta H3N2_A-Fukushima-OR808-2023_NS.fasta Monkeypox_isolate_MPXV-Germany-2022-RKI513.fasta HIV-1_isolate_01IC-PCI127.fasta > viruses_representative.fasta
    
    # Run vrap (second round): select representative viruses from the Excel files generated in the previous step and pass them as --host
    (vrap) for sample in RV1_DNA RV2_DNA RV3_DNA RV4_DNA RV5_DNA RV6_DNA  RV1_RNA RV2_RNA RV3_RNA RV4_RNA RV5_RNA RV6_RNA; do
        vrap/vrap_until_bowtie2.py  -1 ${sample}_R1.fastq.gz -2 ${sample}_R2.fastq.gz  -o vrap_${sample}_on_representatives --host /home/jhuang/DATA/Data_Damian/vrap_Ringversuch/viruses_representative.fasta   -t 100 -l 200  --gbt2 --noblast
    done
  5. Generate the mapping statistics for the SAM files generated in the previous step

    for sample in RV1_DNA RV2_DNA RV3_DNA RV4_DNA RV5_DNA RV6_DNA  RV1_RNA RV2_RNA RV3_RNA RV4_RNA RV5_RNA RV6_RNA; do
        echo "-----${sample}_on_representatives------" >> LOG_mapping
        #cd vrap_${sample}_on_${virus}/bowtie
        cd vrap_${sample}_on_representatives/bowtie
        # Rename and convert SAM to BAM
        mv mapped mapped.sam 2>> ../../LOG_mapping
        samtools view -S -b mapped.sam > mapped.bam 2>> ../../LOG_mapping
        samtools sort mapped.bam -o mapped_sorted.bam 2>> ../../LOG_mapping
        samtools index mapped_sorted.bam 2>> ../../LOG_mapping
        # Write flagstat output to log (go up two levels to write correctly)
        samtools flagstat mapped_sorted.bam >> ../../LOG_mapping 2>&1
        cd ../..
    done
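The "mapped" lines quoted in the report can be pulled out of LOG_mapping with a small parser. The sketch below assumes the log layout produced by the loop above (a "-----sample------" header followed by samtools flagstat output); the function name mapped_per_sample and the numbers in the example log are made up.

```python
import re

HEADER = re.compile(r"^-+(.+?)-+$")                      # -----sample------
MAPPED = re.compile(r"^(\d+) \+ (\d+) mapped \((\S+) :")  # flagstat 'mapped' line

def mapped_per_sample(log_text):
    """Return {sample: (mapped_reads, percent_string)} from a LOG_mapping text."""
    result = {}
    sample = None
    for line in log_text.splitlines():
        line = line.strip()
        header = HEADER.match(line)
        if header:
            sample = header.group(1)
            continue
        mapped = MAPPED.match(line)
        if mapped and sample is not None:
            result[sample] = (int(mapped.group(1)), mapped.group(3))
    return result

# Made-up log excerpt in the expected layout (not real results)
example_log = """\
-----RV1_DNA_on_representatives------
1000 + 0 in total (QC-passed reads + QC-failed reads)
250 + 0 mapped (25.00% : N/A)
250 + 0 primary mapped (25.00% : N/A)
-----RV2_DNA_on_representatives------
1000 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 mapped (0.00% : N/A)
"""

stats = mapped_per_sample(example_log)
for name, (reads, pct) in stats.items():
    print(name, reads, pct)
```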
    
    #Draw coverage plots for the representative isolates that were found in the first round (see Excel file).
    samtools depth -m 0 -a mapped_sorted.bam > coverage.txt
    grep "PQ895337.1" coverage.txt > PQ895337_coverage.txt
    grep "PQ569920.1" coverage.txt > PQ569920_coverage.txt
    
            import pandas as pd
            import matplotlib.pyplot as plt
    
            # Load coverage data
            df = pd.read_csv("PQ895337_coverage.txt", sep="\t", header=None, names=["chr", "pos", "coverage"])
    
            # Plot
            plt.figure(figsize=(10,4))
            plt.plot(df["pos"], df["coverage"], color="blue", linewidth=0.5)
            plt.xlabel("Genomic Position")
            plt.ylabel("Coverage Depth")
            plt.title("BAM Coverage Plot")
            plt.show()
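Before plotting, it can help to summarize coverage.txt per reference, so that only references with reads are plotted. The following is a minimal sketch that assumes the three-column, tab-separated output of samtools depth -a (reference, position, depth); the function name coverage_summary, the min_depth parameter and the example lines are illustrative.

```python
from collections import defaultdict

def coverage_summary(depth_lines, min_depth=1):
    """Per-reference mean depth and breadth (fraction of positions >= min_depth)."""
    positions = defaultdict(int)
    depth_sum = defaultdict(int)
    covered = defaultdict(int)
    for line in depth_lines:
        if not line.strip():
            continue
        ref, _pos, depth = line.rstrip("\n").split("\t")
        depth = int(depth)
        positions[ref] += 1
        depth_sum[ref] += depth
        if depth >= min_depth:
            covered[ref] += 1
    return {
        ref: {
            "positions": positions[ref],
            "mean_depth": depth_sum[ref] / positions[ref],
            "breadth": covered[ref] / positions[ref],
        }
        for ref in positions
    }

# Made-up depth lines for two hypothetical references
example = [
    "refA\t1\t0", "refA\t2\t2", "refA\t3\t4", "refA\t4\t6",
    "refB\t1\t0", "refB\t2\t0",
]
summary = coverage_summary(example)
for ref, s in summary.items():
    print(ref, s["positions"], s["mean_depth"], s["breadth"])
```

With the real file, the input would be the lines of coverage.txt, e.g. coverage_summary(open("coverage.txt")).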
  6. Report

    Subject: Mapping Results and Selected Reference Genomes
    
    Dear XXXX,
    
    Please find below the results. For each of the viruses you sent me, a representative isolate has been selected, as listed below:
    
    Selected Reference Isolates:
    
        Enterovirus D68:
            PQ895337.1 – Enterovirus D68 isolate SH2024-25870
    
        HSV-1 (Herpes Simplex Virus 1):
            PQ569920.1 – Human alphaherpesvirus 1 isolate MacIntyre, complete genome
    
        HSV-2 (Herpes Simplex Virus 2):
            OM370995.1 – Human alphaherpesvirus 2 strain G, complete genome
    
        Influenza A virus (H1N1):
    
            LC662537.1 – Influenza A virus (H1N1) A/PR/8/34 PB2 gene for polymerase PB2, complete cds
            LC662538.1 – Influenza A virus (H1N1) A/PR/8/34 PB1, PB1-F2 genes for polymerase PB1, PB1-F2 protein, complete cds
            LC662539.1 – Influenza A virus (H1N1) A/PR/8/34 PA, PA-X genes for polymerase PA, PA-X protein, complete cds
            LC662540.1 – Influenza A virus (H1N1) A/PR/8/34 HA gene for haemagglutinin, complete cds
            LC662541.1 – Influenza A virus (H1N1) A/PR/8/34 NP gene for nucleoprotein, complete cds
            LC662542.1 – Influenza A virus (H1N1) A/PR/8/34 NA gene for neuraminidase, complete cds
            LC662543.1 – Influenza A virus (H1N1) A/PR/8/34 M2, M1 genes for matrix protein 2, matrix protein 1, complete cds
            LC662544.1 – Influenza A virus (H1N1) A/PR/8/34 NEP, NS1 genes for nonstructural protein 2, nonstructural protein 1, complete cds
    
        Cytomegalovirus (strain AD169):
            X17403.1 – Human cytomegalovirus strain AD169, complete genome
    
        Influenza A virus (H3N2):
    
            LC817404.1 – Influenza A virus H3N2 A_Fukushima_OR808_2023 PB2 gene, complete sequence
            LC817405.1 – Influenza A virus H3N2 A_Fukushima_OR808_2023 PB1 gene, complete sequence
            LC817406.1 – Influenza A virus H3N2 A_Fukushima_OR808_2023 PA gene, complete sequence
            LC817407.1 – Influenza A virus H3N2 A_Fukushima_OR808_2023 HA gene, complete sequence
            LC817408.1 – Influenza A virus H3N2 A_Fukushima_OR808_2023 NP gene, complete sequence
            LC817409.1 – Influenza A virus H3N2 A_Fukushima_OR808_2023 NA gene, complete sequence
            LC817410.1 – Influenza A virus H3N2 A_Fukushima_OR808_2023 M gene, complete sequence
            LC817411.1 – Influenza A virus H3N2 A_Fukushima_OR808_2023 NS gene, complete sequence
    
        Monkeypox virus:
            OP689666.1 – Isolate MPXV/Germany/2022/RKI513, complete genome
    
        Human Immunodeficiency Virus 1 (HIV-1):
            AJ866558.1 – Isolate 01IC-PCI127, complete genome
    
    Mapping Results:
    
    Then, we mapped the paired-end reads from 12 samples of the Ringversuch project against the reference genomes listed above. The following are the mapping statistics. Coverage plots are attached for each case where reads map to the reference genome (see attachments).
    
    Mapping statistics:
    
        RV1_DNA_on_Enterovirus_D68_isolate_SH2024-25870: 0 + 0 mapped (0.00% : N/A)
        RV2_DNA_on_Enterovirus_D68_isolate_SH2024-25870: 0 + 0 mapped (0.00% : N/A)
        RV3_DNA_on_Enterovirus_D68_isolate_SH2024-25870: 0 + 0 mapped (0.00% : N/A)
        RV4_DNA_on_Enterovirus_D68_isolate_SH2024-25870: 0 + 0 mapped (0.00% : N/A)
        RV5_DNA_on_Enterovirus_D68_isolate_SH2024-25870: 0 + 0 mapped (0.00% : N/A)
        RV6_DNA_on_Enterovirus_D68_isolate_SH2024-25870: 0 + 0 mapped (0.00% : N/A)
        RV1_RNA_on_Enterovirus_D68_isolate_SH2024-25870: 0 + 0 mapped (0.00% : N/A)
        RV2_RNA_on_Enterovirus_D68_isolate_SH2024-25870: 0 + 0 mapped (0.00% : N/A)
        RV3_RNA_on_Enterovirus_D68_isolate_SH2024-25870: 0 + 0 mapped (0.00% : N/A)
        RV4_RNA_on_Enterovirus_D68_isolate_SH2024-25870: 0 + 0 mapped (0.00% : N/A)
        RV5_RNA_on_Enterovirus_D68_isolate_SH2024-25870: 0 + 0 mapped (0.00% : N/A)
        RV6_RNA_on_Enterovirus_D68_isolate_SH2024-25870: 0 + 0 mapped (0.00% : N/A)

Variant calling for Data_Pietschmann_229ECoronavirus_Mutations_2025 (via docker own_viral_ngs)

  1. Input data:

    ln -s ../raw_data_2024/hCoV229E_Rluc_R1.fastq.gz hCoV229E_Rluc_R1.fastq.gz
    ln -s ../raw_data_2024/hCoV229E_Rluc_R2.fastq.gz hCoV229E_Rluc_R2.fastq.gz
    ln -s ../raw_data_2024/p10_DMSO_R1.fastq.gz p10_DMSO_R1.fastq.gz
    ln -s ../raw_data_2024/p10_DMSO_R2.fastq.gz p10_DMSO_R2.fastq.gz
    ln -s ../raw_data_2024/p10_K22_R1.fastq.gz p10_K22_R1.fastq.gz
    ln -s ../raw_data_2024/p10_K22_R2.fastq.gz p10_K22_R2.fastq.gz
    ln -s ../raw_data_2024/p10_K7523_R1.fastq.gz p10_K7523_R1.fastq.gz
    ln -s ../raw_data_2024/p10_K7523_R2.fastq.gz p10_K7523_R2.fastq.gz
    ln -s ../raw_data_2025/250506_VH00358_136_AAG3YJ5M5/p20606/p16_DMSO_S29_R1_001.fastq.gz p16_DMSO_R1.fastq.gz
    ln -s ../raw_data_2025/250506_VH00358_136_AAG3YJ5M5/p20606/p16_DMSO_S29_R2_001.fastq.gz p16_DMSO_R2.fastq.gz
    ln -s ../raw_data_2025/250506_VH00358_136_AAG3YJ5M5/p20607/p16_K22_S30_R1_001.fastq.gz p16_K22_R1.fastq.gz
    ln -s ../raw_data_2025/250506_VH00358_136_AAG3YJ5M5/p20607/p16_K22_S30_R2_001.fastq.gz p16_K22_R2.fastq.gz
    ln -s ../raw_data_2025/250506_VH00358_136_AAG3YJ5M5/p20608/p16_X7523_S31_R1_001.fastq.gz p16_X7523_R1.fastq.gz
    ln -s ../raw_data_2025/250506_VH00358_136_AAG3YJ5M5/p20608/p16_X7523_S31_R2_001.fastq.gz p16_X7523_R2.fastq.gz
  2. Call variants using snippy

    ln -s ~/Tools/bacto/db/ .;
    ln -s ~/Tools/bacto/envs/ .;
    ln -s ~/Tools/bacto/local/ .;
    cp ~/Tools/bacto/Snakefile .;
    cp ~/Tools/bacto/bacto-0.1.json .;
    cp ~/Tools/bacto/cluster.json .;
    
    #download PP810610.gb from GenBank
    mv ~/Downloads/sequence\(2\).gb db/PP810610.gb
    
    #setting the following in bacto-0.1.json
        "fastqc": false,
        "taxonomic_classifier": false,
        "assembly": true,
        "typing_ariba": false,
        "typing_mlst": true,
        "pangenome": true,
        "variants_calling": true,
        "phylogeny_fasttree": true,
        "phylogeny_raxml": true,
        "recombination": false, (due to gubbins-error set false)
        "genus": "Alphacoronavirus",
        "kingdom": "Viruses",
        "species": "Human coronavirus 229E",
        "mykrobe": {
            "species": "corona"
        },
        "reference": "db/PP810610.gb"
    
    mamba activate /home/jhuang/miniconda3/envs/bengal3_ac3
    (bengal3_ac3) /home/jhuang/miniconda3/envs/snakemake_4_3_1/bin/snakemake --printshellcmds
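The manual edits to bacto-0.1.json listed in this step could also be applied with a small script. This is only a sketch: it assumes the config is a single JSON object whose top-level keys match the ones shown above, and the helper name `update_config` is hypothetical (it is not part of bacto).

```python
import json
from pathlib import Path

# Settings copied from the step above; adjust them for other projects.
OVERRIDES = {
    "fastqc": False,
    "taxonomic_classifier": False,
    "assembly": True,
    "typing_ariba": False,
    "typing_mlst": True,
    "pangenome": True,
    "variants_calling": True,
    "phylogeny_fasttree": True,
    "phylogeny_raxml": True,
    "recombination": False,  # disabled because of the gubbins error
    "genus": "Alphacoronavirus",
    "kingdom": "Viruses",
    "species": "Human coronavirus 229E",
    "mykrobe": {"species": "corona"},
    "reference": "db/PP810610.gb",
}


def update_config(path, overrides):
    """Hypothetical helper: merge `overrides` into a JSON config file in place."""
    path = Path(path)
    config = json.loads(path.read_text())
    config.update(overrides)  # keys not listed in `overrides` are kept as they are
    path.write_text(json.dumps(config, indent=4) + "\n")
    return config
```

Calling `update_config("bacto-0.1.json", OVERRIDES)` in the project directory would rewrite the file; keeping a backup copy first is probably wise.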
  3. Summarize all SNPs and Indels from the snippy result directory.

    #Output: snippy/summary_snps_indels.csv
    # IMPORTANT: adapt the array "isolates" in the script to the current sample names; it has the form isolates = ["AYE-S", "AYE-Q", "AYE-WT on Tig4", "AYE-craA on Tig4", "AYE-craA-1 on Cm200", "AYE-craA-2 on Cm200"]
    python3 ~/Scripts/summarize_snippy_res.py snippy
    cd snippy
    #grep -v "None,,,,,,None,None" summary_snps_indels.csv > summary_snps_indels_.csv
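The script summarize_snippy_res.py is not shown here, so the following is only a sketch of what such a summary could look like, not the script itself. It assumes each sample has its own subdirectory under snippy/ containing snippy's tab-separated snps.tab with at least the columns CHROM, POS, TYPE, REF and ALT; the function names are hypothetical.

```python
import csv
from pathlib import Path


def collect_variants(snippy_dir):
    """Return {(chrom, pos, ref, alt): {sample: type}} from all */snps.tab files."""
    table = {}
    for tab in sorted(Path(snippy_dir).glob("*/snps.tab")):
        sample = tab.parent.name
        with tab.open(newline="") as handle:
            for row in csv.DictReader(handle, delimiter="\t"):
                key = (row["CHROM"], int(row["POS"]), row["REF"], row["ALT"])
                table.setdefault(key, {})[sample] = row["TYPE"]
    return table


def write_summary(table, samples, out_csv):
    """Write one row per variant; a cell holds ALT if the sample reports it, else REF.

    Caveat: a sample without a call may simply lack coverage at that position,
    so REF cells are an assumption, not evidence of the reference allele.
    """
    with open(out_csv, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["CHROM", "POS", "REF", "ALT", "TYPE"] + list(samples))
        for (chrom, pos, ref, alt), calls in sorted(table.items()):
            var_type = next(iter(calls.values()))
            cells = [alt if s in calls else ref for s in samples]
            writer.writerow([chrom, pos, ref, alt, var_type] + cells)
```

A call such as `write_summary(collect_variants("snippy"), sample_names, "snippy/summary_snps_indels.csv")` would produce a table comparable in shape to the one used in the later steps.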
  4. Call variants using spandx (almost the same results as those from viral-ngs!)

    mamba activate /home/jhuang/miniconda3/envs/spandx
    mkdir ~/miniconda3/envs/spandx/share/snpeff-5.1-2/data/PP810610
    cp PP810610.gb  ~/miniconda3/envs/spandx/share/snpeff-5.1-2/data/PP810610/genes.gbk
    vim ~/miniconda3/envs/spandx/share/snpeff-5.1-2/snpEff.config
    /home/jhuang/miniconda3/envs/spandx/bin/snpEff build PP810610    #-d
    ~/Scripts/genbank2fasta.py PP810610.gb
    mv PP810610.gb_converted.fna PP810610.fasta    #rename the header "PP810610.1 xxxxx" to "PP810610" in the FASTA file
    ln -s /home/jhuang/Tools/spandx/ spandx
    (spandx) nextflow run spandx/main.nf --fastq "trimmed/*_P_{1,2}.fastq" --ref PP810610.fasta --annotation --database PP810610 -resume
    
    # Rerun SNP_matrix.sh due to the error ERROR_CHROMOSOME_NOT_FOUND in the variants annotation
    cd Outputs/Master_vcf
    (spandx) cp -r ../../snippy/hCoV229E_Rluc/reference .
    (spandx) cp ../../spandx/bin/SNP_matrix.sh ./
    #Note that ${variant_genome_path}=PP810610 in the following command, but the variable is no longer used once the command has been replaced.
    #In SNP_matrix.sh, replace
    #    snpEff eff -no-downstream -no-intergenic -ud 100 -formatEff -v ${variant_genome_path} out.vcf > out.annotated.vcf
    #with
    #    /home/jhuang/miniconda3/envs/bengal3_ac3/bin/snpEff eff -no-downstream -no-intergenic -ud 100 -formatEff -c reference/snpeff.config -dataDir . ref out.vcf > out.annotated.vcf
    (spandx) bash SNP_matrix.sh PP810610 .
  5. Calling inter-host variants by merging the results from snippy and spandx (manually!)

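    # Inter-host variants: a virus carries different genetic variants in two different hosts; these variants may be related to the host immune response, the disease presentation, or the mode of viral transmission.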
    cp All_SNPs_indels_annotated.txt All_SNPs_indels_annotated_backup.txt
    vim All_SNPs_indels_annotated.txt
    
    #Build one grep command per position listed in the file "ids", e.g.:
    #    grep "$(echo -e '\t')353$(echo -e '\t')" All_SNPs_indels_annotated.txt >> All_SNPs_indels_annotated_.txt
    #To generate these commands from the list of positions in an editor:
    #    replace \n with " All_SNPs_indels_annotated.txt >> All_SNPs_indels_annotated_.txt\ngrep "
    #    replace grep " with grep "$(echo -e '\t')
    #    replace " All_ with $(echo -e '\t')" All_
    
    # Potential intra-host variants: 10871, 19289, 23435.
    CHROM   POS     REF     ALT     TYPE    hCoV229E_Rluc_trimmed   p10_DMSO_trimmed        p10_K22_trimmed p10_K7523_trimmed       p16_DMSO_trimmed        p16_K22_trimmed p16_X7523_trimmed       Effect  Impact  Functional_Class        Codon_change    Protein_and_nucleotide_change   Amino_Acid_Length       Gene_name       Biotype
    PP810610        1464    T       C       SNP     C       C       C       C       C       C       C       missense_variant        MODERATE        MISSENSE        gTt/gCt p.Val416Ala/c.1247T>C   6757    CDS_1   protein_coding
    PP810610        1699    C       T       SNP     T       T       T       T       T       T       T       synonymous_variant      LOW     SILENT  gtC/gtT p.Val494Val/c.1482C>T   6757    CDS_1   protein_coding
    PP810610        6691    C       T       SNP     T       T       T       T       T       T       T       synonymous_variant      LOW     SILENT  tgC/tgT p.Cys2158Cys/c.6474C>T  6757    CDS_1   protein_coding
    PP810610        6919    C       G       SNP     G       G       G       G       G       G       G       synonymous_variant      LOW     SILENT  ggC/ggG p.Gly2234Gly/c.6702C>G  6757    CDS_1   protein_coding
    PP810610        7294    T       A       SNP     A       A       A       A       A       A       A       missense_variant        MODERATE        MISSENSE        agT/agA p.Ser2359Arg/c.7077T>A  6757    CDS_1   protein_coding
    * PP810610       10871   C       T       SNP     C       C/T     T       C/T     C/T     T       C/T     missense_variant        MODERATE        MISSENSE        Ctt/Ttt p.Leu3552Phe/c.10654C>T 6757    CDS_1   protein_coding
    PP810610        14472   T       C       SNP     C       C       C       C       C       C       C       missense_variant        MODERATE        MISSENSE        aTg/aCg p.Met4752Thr/c.14255T>C 6757    CDS_1   protein_coding
    PP810610        15458   T       C       SNP     C       C       C       C       C       C       C       synonymous_variant      LOW     SILENT  Ttg/Ctg p.Leu5081Leu/c.15241T>C 6757    CDS_1   protein_coding
    PP810610        16035   C       A       SNP     A       A       A       A       A       A       A       stop_gained     HIGH    NONSENSE        tCa/tAa p.Ser5273*/c.15818C>A   6757    CDS_1   protein_coding
    PP810610        17430   T       C       SNP     C       C       C       C       C       C       C       missense_variant        MODERATE        MISSENSE        tTa/tCa p.Leu5738Ser/c.17213T>C 6757    CDS_1   protein_coding
    * PP810610       19289   G       T       SNP     G       G       T       G       G       G/T     G       missense_variant        MODERATE        MISSENSE        Gtt/Ttt p.Val6358Phe/c.19072G>T 6757    CDS_1   protein_coding
    PP810610        21183   T       G       SNP     G       G       G       G       G       G       G       missense_variant        MODERATE        MISSENSE        tTt/tGt p.Phe230Cys/c.689T>G    1173    CDS_2   protein_coding
    PP810610        22636   T       G       SNP     G       G       G       G       G       G       G       missense_variant        MODERATE        MISSENSE        aaT/aaG p.Asn714Lys/c.2142T>G   1173    CDS_2   protein_coding
    PP810610        23022   T       C       SNP     C       C       C       C       C       C       C       missense_variant        MODERATE        MISSENSE        tTa/tCa p.Leu843Ser/c.2528T>C   1173    CDS_2   protein_coding
    * PP810610       23435   C       T       SNP     C       C       T       C/T     C       C/T     C/T     missense_variant        MODERATE        MISSENSE        Ctt/Ttt p.Leu981Phe/c.2941C>T   1173    CDS_2   protein_coding
    PP810610        24512   C       T       SNP     T       T       T       T       T       T       T       missense_variant        MODERATE        MISSENSE        Ctc/Ttc p.Leu36Phe/c.106C>T     88      CDS_4   protein_coding
    PP810610        24781   C       T       SNP     T       T       T       T       T       T       T       missense_variant        MODERATE        MISSENSE        aCt/aTt p.Thr36Ile/c.107C>T     77      CDS_5   protein_coding
    PP810610        25163   C       T       SNP     T       T       T       T       T       T       T       missense_variant        MODERATE        MISSENSE        Ctt/Ttt p.Leu82Phe/c.244C>T     225     CDS_6   protein_coding
    PP810610        25264   C       T       SNP     T       T       T       T       T       T       T       synonymous_variant      LOW     SILENT  gtC/gtT p.Val115Val/c.345C>T    225     CDS_6   protein_coding
    PP810610        26838   G       T       SNP     T       T       T       T       T       T       T
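The rows marked with "*" in the table above are the ones where the samples do not all share the same genotype. A sketch of how such rows could be flagged automatically is given below. It assumes the layout shown above: five fixed columns (CHROM, POS, REF, ALT, TYPE) followed by seven sample columns, separated by whitespace; the function name is hypothetical.

```python
def variable_positions(lines, n_fixed=5, n_samples=7):
    """Return the positions whose sample genotypes are not all identical."""
    hits = []
    for line in lines:
        # Drop a leading "* " marker if the row was already flagged by hand.
        fields = line.lstrip("* ").split()
        if len(fields) < n_fixed + n_samples or fields[0] == "CHROM":
            continue  # skip the header and incomplete rows
        genotypes = fields[n_fixed:n_fixed + n_samples]
        if len(set(genotypes)) > 1:
            hits.append(int(fields[1]))
    return hits
```

Applied to the table above, this should return the three starred positions (10871, 19289, 23435), since every other row has one identical call across all seven samples.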
  6. Calling intra-host variants using viral-ngs

    # Intra-host variants: a single host is infected with a virus, but several different viral variants may coexist in different cells or organs of that host.
    
    #How to run and debug the viral-ngs docker?
    # ---- DEBUG_2025_1: using docker instead ----
    mkdir viralngs; cd viralngs
    ln -s ~/Tools/viral-ngs_docker/Snakefile Snakefile
    ln -s  ~/Tools/viral-ngs_docker/bin bin
    cp  ~/DATA_D/Data_Pietschmann_229ECoronavirus_Mutations_2024/refsel.acids refsel.acids
    cp  ~/DATA_D/Data_Pietschmann_229ECoronavirus_Mutations_2024/lastal.acids lastal.acids
    cp  ~/DATA_D/Data_Pietschmann_229ECoronavirus_Mutations_2024/config.yaml config.yaml
    cp  ~/DATA_D/Data_Pietschmann_229ECoronavirus_Mutations_2024/samples-runs.txt samples-runs.txt
    cp  ~/DATA_D/Data_Pietschmann_229ECoronavirus_Mutations_2024/samples-depletion.txt samples-depletion.txt
    cp  ~/DATA_D/Data_Pietschmann_229ECoronavirus_Mutations_2024/samples-metagenomics.txt samples-metagenomics.txt
    cp  ~/DATA_D/Data_Pietschmann_229ECoronavirus_Mutations_2024/samples-assembly.txt samples-assembly.txt
    cp  ~/DATA_D/Data_Pietschmann_229ECoronavirus_Mutations_2024/samples-assembly-failures.txt samples-assembly-failures.txt
    # Adapt the samples-*.txt files to the current samples
    
    mkdir viralngs/data
    mkdir viralngs/data/00_raw
    
    mkdir bams
    ref_fa="PP810610.fasta"
    bwa index "${ref_fa}"    # index the reference once, before the loop
    #for sample in hCoV229E_Rluc p10_DMSO p10_K22; do
    for sample in p10_K7523 p16_DMSO p16_K22 p16_X7523; do
        bwa mem -M -t 16 "${ref_fa}" trimmed/${sample}_trimmed_P_1.fastq trimmed/${sample}_trimmed_P_2.fastq | samtools view -bS - > bams/${sample}_genome_alignment.bam
    done
    
    conda activate viral-ngs4
    #for sample in hCoV229E_Rluc p10_DMSO p10_K22; do
    #for sample in p10_K7523 p16_DMSO p16_K22 p16_X7523; do
    for sample in p16_K22; do
        picard AddOrReplaceReadGroups I=bams/${sample}_genome_alignment.bam O=~/DATA_D/Data_Pietschmann_229ECoronavirus_Mutations_2025/viralngs/data/00_raw/${sample}.bam SORT_ORDER=coordinate CREATE_INDEX=true RGPL=illumina RGID=$sample RGSM=$sample RGLB=standard RGPU=$sample VALIDATION_STRINGENCY=LENIENT; \
    done
    conda deactivate
    
    # -- ! First, empty samples-assembly.txt so that only the depletion step is run!
    docker run -it -v /mnt/md1/DATA_D/Data_Pietschmann_229ECoronavirus_Mutations_2025/viralngs:/work -v /home/jhuang/Tools/viral-ngs_docker:/home/jhuang/Tools/viral-ngs_docker -v /home/jhuang/REFs:/home/jhuang/REFs -v /home/jhuang/Tools/GenomeAnalysisTK-3.6:/home/jhuang/Tools/GenomeAnalysisTK-3.6 -v /home/jhuang/Tools/novocraft_v3:/home/jhuang/Tools/novocraft_v3 -v /usr/local/bin/gatk:/usr/local/bin/gatk   own_viral_ngs bash
    cd /work
    snakemake --directory /work --printshellcmds --cores 40
    
    # -- ! Second, run the assembly steps manually
    # --> Iteratively add the unfinished assemblies to the list, replacing one sample at a time, and run "snakemake --directory /work --printshellcmds --cores 40"
    
        # # ---- NOTE that the following steps need rerun --> DOES NOT WORK, USE STRATEGY ABOVE ----
        # #for sample in p10_K22 p10_K7523; do
        # for sample in hCoV229E_Rluc p10_DMSO p10_K22 p10_K7523  p16_DMSO p16_K22 p16_X7523; do
        #     bin/read_utils.py merge_bams data/01_cleaned/${sample}.cleaned.bam tmp/01_cleaned/${sample}.cleaned.bam --picardOptions SORT_ORDER=queryname
        #     bin/read_utils.py rmdup_mvicuna_bam tmp/01_cleaned/${sample}.cleaned.bam data/01_per_sample/${sample}.cleaned.bam --JVMmemory 30g
        # done
        #
        # #Note that the error generated by nextflow is from the step gapfill_gap2seq!
        # for sample in hCoV229E_Rluc p10_DMSO p10_K22 p10_K7523  p16_DMSO p16_K22 p16_X7523; do
        #     bin/assembly.py assemble_spades data/01_per_sample/${sample}.taxfilt.bam /home/jhuang/REFs/viral_ngs_dbs/trim_clip/contaminants.fasta tmp/02_assembly/${sample}.assembly1-spades.fasta --nReads 10000000 --threads 15 --memLimitGb 12
        # done
        # for sample in hCoV229E_Rluc p10_DMSO p10_K22 p10_K7523  p16_DMSO p16_K22 p16_X7523; do
        # for sample in p10_K22 p10_K7523; do
        #     bin/assembly.py order_and_orient tmp/02_assembly/${sample}.assembly1-spades.fasta refsel_db/refsel.fasta tmp/02_assembly/${sample}.assembly2-scaffolded.fasta --min_pct_contig_aligned 0.05 --outAlternateContigs tmp/02_assembly/${sample}.assembly2-alternate_sequences.fasta --nGenomeSegments 1 --outReference tmp/02_assembly/${sample}.assembly2-scaffold_ref.fasta --threads 15
        # done
        #
        # for sample in hCoV229E_Rluc p10_DMSO p10_K22 p10_K7523  p16_DMSO p16_K22 p16_X7523; do
        #     bin/assembly.py gapfill_gap2seq tmp/02_assembly/${sample}.assembly2-scaffolded.fasta data/01_per_sample/${sample}.cleaned.bam tmp/02_assembly/${sample}.assembly2-gapfilled.fasta --memLimitGb 12 --maskErrors --randomSeed 0 --loglevel DEBUG
        # done
    
    #IMPORTANT: Rerun the following commands!
    for sample in hCoV229E_Rluc  p10_DMSO p10_K22 p10_K7523  p16_DMSO p16_K22 p16_X7523; do
    
        bin/assembly.py impute_from_reference tmp/02_assembly/${sample}.assembly2-gapfilled.fasta tmp/02_assembly/${sample}.assembly2-scaffold_ref.fasta tmp/02_assembly/${sample}.assembly3-modify.fasta --newName ${sample} --replaceLength 55 --minLengthFraction 0.05 --minUnambig 0.05 --index  --loglevel DEBUG
    done
    
        # for sample in hCoV229E_Rluc p10_DMSO p10_K22 p10_K7523  p16_DMSO p16_K22 p16_X7523; do
        #     bin/assembly.py refine_assembly tmp/02_assembly/${sample}.assembly3-modify.fasta data/01_per_sample/${sample}.cleaned.bam tmp/02_assembly/${sample}.assembly4-refined.fasta --outVcf tmp/02_assembly/${sample}.assembly3.vcf.gz --min_coverage 2 --novo_params '-r Random -l 20 -g 40 -x 20 -t 502' --threads 15  --loglevel DEBUG
        #     bin/assembly.py refine_assembly tmp/02_assembly/${sample}.assembly4-refined.fasta data/01_per_sample/${sample}.cleaned.bam data/02_assembly/${sample}.fasta --outVcf tmp/02_assembly/${sample}.assembly4.vcf.gz --min_coverage 3 --novo_params '-r Random -l 20 -g 40 -x 20 -t 100' --threads 15  --loglevel DEBUG
        # done
    
    # -- ! Third, fill in samples-assembly.txt completely and run "snakemake --directory /work --printshellcmds --cores 40"
  7. Merge intra- and inter-host variants, comparing the variants to the alignments of the assemblies to confirm their correctness.

    cat NC_001348.fasta viralngs/data/02_assembly/VZV_20S.fasta viralngs/data/02_assembly/VZV_60S.fasta > aligned_1.fasta
    mafft --clustalout aligned_1.fasta > aligned_1.aln
    #~/Scripts/convert_fasta_to_clustal.py aligned_1.fasta_orig aligned_1.aln
    ~/Scripts/convert_clustal_to_clustal.py aligned_1.aln aligned_1_.aln
    #manually delete the positions where all sequences have '-' in aligned_1_.aln
    ~/Scripts/check_sequence_differences.py aligned_1_.aln
    ~/Scripts/check_sequence_differences.py aligned_1_.aln > aligned_1.res
    grep -v " = n" aligned_1.res > aligned_1_.res
    
    cat NC_001348.fasta viralngs/tmp/02_assembly/VZV_20S.assembly4-refined.fasta viralngs/tmp/02_assembly/VZV_60S.assembly4-refined.fasta > aligned_1.fasta
    mafft --clustalout aligned_1.fasta > aligned_1.aln
    ~/Scripts/convert_clustal_to_clustal.py aligned_1.aln aligned_1_.aln
    ~/Scripts/check_sequence_differences.py aligned_1_.aln > aligned_1.res
    grep -v " = n" aligned_1.res > aligned_1_.res
    
    #Differences found at the following positions (150):
    Position 8956: OP297860.1 = A, HSV1_S1-1 = A, HSV-Klinik_S2-1 = G
    Position 8991: OP297860.1 = A, HSV1_S1-1 = A, HSV-Klinik_S2-1 = C
    Position 8992: OP297860.1 = T, HSV1_S1-1 = C, HSV-Klinik_S2-1 = C
    Position 8995: OP297860.1 = T, HSV1_S1-1 = T, HSV-Klinik_S2-1 = C
    Position 9190: OP297860.1 = T, HSV1_S1-1 = A, HSV-Klinik_S2-1 = T
    * Position 13659: OP297860.1 = G, HSV1_S1-1 = T, HSV-Klinik_S2-1 = G
    * Position 47969: OP297860.1 = C, HSV1_S1-1 = T, HSV-Klinik_S2-1 = C
    * Position 53691: OP297860.1 = G, HSV1_S1-1 = T, HSV-Klinik_S2-1 = G
    * Position 55501: OP297860.1 = T, HSV1_S1-1 = C, HSV-Klinik_S2-1 = C
    * Position 63248: OP297860.1 = G, HSV1_S1-1 = T, HSV-Klinik_S2-1 = G
    Position 63799: OP297860.1 = T, HSV1_S1-1 = C, HSV-Klinik_S2-1 = T
    * Position 64328: OP297860.1 = C, HSV1_S1-1 = A, HSV-Klinik_S2-1 = C
    Position 65179: OP297860.1 = T, HSV1_S1-1 = T, HSV-Klinik_S2-1 = C
    * Position 65225: OP297860.1 = G, HSV1_S1-1 = G, HSV-Klinik_S2-1 = A
    * Position 95302: OP297860.1 = C, HSV1_S1-1 = A, HSV-Klinik_S2-1 = C
    
    gunzip isnvs.annot.txt.gz
    ~/Scripts/filter_isnv.py isnvs.annot.txt 0.05
    cut -d$'\t' filtered_isnvs.annot.txt -f1-7
    chr     pos     sample  patient time    alleles iSNV_freq
    OP297860        13203   HSV1_S1 HSV1_S1         T,C,A   1.0
    OP297860        13203   HSV-Klinik_S2   HSV-Klinik_S2           T,C,A   1.0
    OP297860        13522   HSV1_S1 HSV1_S1         G,T     1.0
    OP297860        13522   HSV-Klinik_S2   HSV-Klinik_S2           G,T     0.008905554253573941
    OP297860        13659   HSV1_S1 HSV1_S1         G,T     1.0
    OP297860        13659   HSV-Klinik_S2   HSV-Klinik_S2           G,T     0.008383233532934131
    
    ~/Scripts/convert_clustal_to_fasta.py aligned_1_.aln aligned_1.fasta
    samtools faidx aligned_1.fasta
    samtools faidx aligned_1.fasta OP297860.1 > OP297860.1.fasta
    samtools faidx aligned_1.fasta HSV1_S1-1 > HSV1_S1-1.fasta
    samtools faidx aligned_1.fasta HSV-Klinik_S2-1 > HSV-Klinik_S2-1.fasta
    seqkit seq OP297860.1.fasta -w 70 > OP297860.1_w70.fasta
    diff OP297860.1_w70.fasta ../../refsel_db/refsel.fasta
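The position-by-position comparison in this step is done by check_sequence_differences.py, which is not shown. The sketch below illustrates the general idea on an aligned FASTA file (for example the output of convert_clustal_to_fasta.py). It is not the author's script; the function names are hypothetical, and the reported positions are alignment columns, which equal reference coordinates only if the reference row contains no gaps.

```python
def read_fasta(path):
    """Read a FASTA file into a {name: sequence} dict (insertion-ordered)."""
    records, name = {}, None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                name = line[1:].split()[0]
                records[name] = []
            elif name is not None and line:
                records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def column_differences(aligned):
    """Yield (1-based column, {name: base}) for columns where the sequences disagree."""
    names = list(aligned)
    length = len(aligned[names[0]])
    if any(len(aligned[n]) != length for n in names):
        raise ValueError("sequences are not aligned to the same length")
    for i in range(length):
        column = {n: aligned[n][i].upper() for n in names}
        if len(set(column.values())) > 1:
            yield i + 1, column
```

Iterating over `column_differences(read_fasta("aligned_1.fasta"))` and printing each column as "Position N: name = base, ..." would give a listing similar to the one shown above.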
  8. Consensus sequences of each and of all isolates

    cp data/02_assembly/*.fasta ./
    #for sample in 838_S1 840_S2 820_S3 828_S4 815_S5 834_S6 808_S7 811_S8 837_S9 768_S10 773_S11 767_S12 810_S13 814_S14 10121-16_S15 7510-15_S16 828-17_S17 8806-15_S18 9881-16_S19 8981-14_S20; do
    for sample in p953-84660-tsek p938-16972-nra p942-88507-nra p943-98523-nra p944-103323-nra p947-105565-nra p948-112830-nra; do
        mv ${sample}.fasta ${sample}.fa
        cat ${sample}.fa >> all.fa    # append only the new sample; all.fa cannot be both input and output
    done
    cat RSV_dedup.fa all.fa > RSV_all.fa
    mafft --adjustdirection RSV_all.fa > RSV_all.aln
    snp-sites RSV_all.aln -o RSV_all_.aln
  9. Download all Human alphaherpesvirus 3 (Varicella-zoster virus) genomes

    Human alphaherpesvirus 3
    acronym: HHV-3 VZV
    equivalent: Human herpes virus 3
    
    Human alphaherpesvirus 3 (Varicella-zoster virus)
        * Human herpesvirus 3 strain Dumas
        * Human herpesvirus 3 strain Oka vaccine
        * Human herpesvirus 3 VZV-32
    
    #Taxonomy ID: 10335
    esearch -db nucleotide -query "txid10335[Organism:exp]" | efetch -format fasta -email j.huang@uke.de > genome_10335_ncbi.fasta
    python ~/Scripts/filter_fasta.py genome_10335_ncbi.fasta complete_genome_10335_ncbi.fasta  #2041-->165
    # ---- Download related genomes from ENA ----
    https://www.ebi.ac.uk/ena/browser/view/10335
    #Click "Sequence" and download "Counts" (2003) and "Taxon descendants count" (2005), if there is enough time! Download date: 11.03.2025.
    python ~/Scripts/filter_fasta.py  ena_10335_sequence.fasta complete_genome_10335_ena_taxon_descendants_count.fasta  #2005-->153
    #python ~/Scripts/filter_fasta.py ena_10335_sequence_Counts.fasta complete_genome_10335_ena_Counts.fasta  #xxx, 5.8G
    https://www.ebi.ac.uk/ena/browser/view/10239
    https://www.ebi.ac.uk/ena/browser/view/2497569
    https://www.ebi.ac.uk/ena/browser/view/Taxon:2497569
    ena_10239_sequence.fasta
    esearch -db nucleotide -query "txid10239[Organism:exp]" | efetch -format fasta -email j.huang@uke.de > genome_10239_ncbi.fasta
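The filter_fasta.py script used in this step is not shown. One plausible way to reduce a download to complete genomes is to keep only the records whose header mentions "complete genome", as sketched below. This criterion is an assumption, not the author's confirmed logic, and the function name is hypothetical.

```python
def filter_complete_genomes(in_path, out_path, keyword="complete genome"):
    """Copy records whose header contains `keyword`; return (total, kept)."""
    total = kept = 0
    keep = False
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.startswith(">"):
                total += 1
                # Case-insensitive match on the header line (assumed criterion).
                keep = keyword.lower() in line.lower()
                kept += keep
            if keep:
                dst.write(line)
    return total, kept
```

The returned pair corresponds to the before and after counts noted in the comments above (for example 2041 and 165), although the real script may use a different criterion such as a minimum sequence length.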
  10. Using Multi-CAR for scaffolding the contigs (If not useful, choose another scaffolding tool, e.g. https://github.com/malonge/RagTag)

     All contigs over 500 bp were successfully scaffolded to the graft genome using Multi-CAR (13), resulting in a chromosomal assembly of 4,506,689 bp.
     https://genome.cs.nthu.edu.tw/Multi-CAR/
     https://github.com/ablab-nthu/Multi-CSAR
  11. Using the bowtie of vrap to map the reads on ref_genome/reference.fasta (The reference refers to the closest related genome found from the list generated by vrap)

    (vrap) vrap/vrap.py  -1 trimmed/VZV_20S_trimmed_P_1.fastq -2 trimmed/VZV_20S_trimmed_P_2.fastq  -o VZV_20S_on_X04370 --host /home/jhuang/DATA/Data_Huang_Human_herpesvirus_3/X04370.fasta   -t 100 -l 200  -g
    cd bowtie
    mv mapped mapped.sam
    samtools view -S -b mapped.sam > mapped.bam
    samtools sort mapped.bam -o mapped_sorted.bam
    samtools index mapped_sorted.bam
    samtools view -H mapped_sorted.bam
    samtools flagstat mapped_sorted.bam
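To collect the mapping rate of many samples, the "mapped" line of the `samtools flagstat` output can be parsed. The sketch below works on the text output only; the function name is hypothetical, and it assumes the usual "N + M mapped (X% : Y)" line format.

```python
import re

# Matches e.g. "150 + 0 mapped (75.00% : N/A)" but not the "primary mapped" line.
MAPPED_RE = re.compile(r"^(\d+) \+ (\d+) mapped \(([\d.]+%|N/A) : ([\d.]+%|N/A)\)")


def parse_mapped(flagstat_text):
    """Return (QC-passed mapped reads, percentage or None) from flagstat output."""
    for line in flagstat_text.splitlines():
        match = MAPPED_RE.match(line.strip())
        if match:
            passed = int(match.group(1))
            pct = match.group(3)
            return passed, (None if pct == "N/A" else float(pct.rstrip("%")))
    raise ValueError("no 'mapped' line found in flagstat output")
```

The text can be obtained, for instance, with `subprocess.run(["samtools", "flagstat", "mapped_sorted.bam"], capture_output=True, text=True, check=True).stdout`.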
  12. Show the bw on IGV

  13. Reports

    diff data/02_assembly/2040_04.fasta tmp/02_assembly/2040_04.assembly4-refined.fasta
    
    diff data/02_assembly/2040_04.fasta tmp/02_assembly/2040_04.assembly1-spades.fasta
    diff data/02_assembly/2040_04.fasta tmp/02_assembly/2040_04.assembly2-scaffolded.fasta
    diff data/02_assembly/2040_04.fasta tmp/02_assembly/2040_04.assembly2-gapfilled.fasta
    diff data/02_assembly/2040_04.fasta tmp/02_assembly/2040_04.assembly3-modify.fasta
    diff data/02_assembly/2040_04.fasta tmp/02_assembly/2040_04.assembly4-refined.fasta
    ./2040_04.assembly2-alternate_sequences.fasta
    ./2040_04.assembly2-scaffold_ref.fasta
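A plain `diff` on FASTA files also reports differences in headers and line wrapping. If only the sequence content matters, the intermediate assemblies can be compared after unwrapping, as in this sketch (function names are hypothetical; comparison is case-insensitive and ignores header text).

```python
def read_sequences(path):
    """Return the sequences of a FASTA file in order, unwrapped and uppercased."""
    seqs = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                seqs.append([])
            elif line and seqs:
                seqs[-1].append(line.upper())
    return ["".join(parts) for parts in seqs]


def same_sequences(path_a, path_b):
    """True if both files contain the same sequences in the same order."""
    return read_sequences(path_a) == read_sequences(path_b)
```

For example, `same_sequences("data/02_assembly/2040_04.fasta", "tmp/02_assembly/2040_04.assembly4-refined.fasta")` indicates whether the final assembly differs from the refined one in sequence rather than only in formatting.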

How to debug and build the Docker image own_viral_ngs?

    mkdir viralngs; cd viralngs
    ln -s ~/Tools/viral-ngs_docker/Snakefile Snakefile
    ln -s  ~/Tools/viral-ngs_docker/bin bin
    cp  ~/Tools/viral-ngs_docker/refsel.acids refsel.acids
    cp  ~/Tools/viral-ngs_docker/lastal.acids lastal.acids
    cp  ~/Tools/viral-ngs_docker/config.yaml config.yaml
    cp  ~/Tools/viral-ngs_docker/samples-runs.txt samples-runs.txt
    cp  ~/Tools/viral-ngs_docker/samples-depletion.txt samples-depletion.txt
    cp  ~/Tools/viral-ngs_docker/samples-metagenomics.txt samples-metagenomics.txt
    cp  ~/Tools/viral-ngs_docker/samples-assembly.txt samples-assembly.txt
    cp  ~/Tools/viral-ngs_docker/samples-assembly-failures.txt samples-assembly-failures.txt

    docker run -it -v /mnt/md1/DATA/Data_Huang_Human_herpesvirus_3/viralngs:/work -v /home/jhuang/Tools/viral-ngs_docker:/home/jhuang/Tools/viral-ngs_docker -v /home/jhuang/REFs:/home/jhuang/REFs -v /home/jhuang/Tools/GenomeAnalysisTK-3.6:/home/jhuang/Tools/GenomeAnalysisTK-3.6 -v /home/jhuang/Tools/novocraft_v3:/home/jhuang/Tools/novocraft_v3 -v /usr/local/bin/gatk:/usr/local/bin/gatk   own_viral_ngs bash
    cd /work
    snakemake --directory /work --printshellcmds --cores 40

    #BUG_1: FileNotFoundError: [Errno 2] No such file or directory: '/home/jhuang/Tools/samtools-1.9/samtools': '/home/jhuang/Tools/samtools-1.9/samtools'
    #DEBUG_1 (DEPRECATED):
            # - In docker install independent samtools
            conda create -n samtools-1.9-env samtools=1.9 -c bioconda -c conda-forge
            # - persist the modified container as an image; next time, run the own docker image
            docker ps
            #CONTAINER ID   IMAGE                              COMMAND   CREATED         STATUS         PORTS     NAMES
            #881a1ad6a990   quay.io/broadinstitute/viral-ngs   "bash"    8 minutes ago   Up 8 minutes             intelligent_yalow
            docker commit 881a1ad6a990 own_viral_ngs
            docker image ls
            docker run -it own_viral_ngs bash
            #Change the path to "/opt/miniconda/envs/samtools-1.9-env/bin/samtools" in /work/bin/tools/samtools.py
            #         If a tool other than samtools cannot be installed, use the same method as above to install it in own_viral_ngs!
    #DEBUG_1_BETTER_SIMPLE: TOOL_VERSION = '1.6' --> '1.9' in ~/Tools/viral-ngs_docker/bin/tools/samtools.py

    #BUG_2:
            bin/taxon_filter.py deplete data/00_raw/2040_04.bam tmp/01_cleaned/2040_04.raw.bam tmp/01_cleaned/2040_04.bmtagger_depleted.bam tmp/01_cleaned/2040_04.rmdup.bam data/01_cleaned/2040_04.cleaned.bam --bmtaggerDbs /home/jhuang/REFs/viral_ngs_dbs/bmtagger_dbs_remove/hg19 /home/jhuang/REFs/viral_ngs_dbs/bmtagger_dbs_remove/metagenomics_contaminants_v3 /home/jhuang/REFs/viral_ngs_dbs/bmtagger_dbs_remove/GRCh37.68_ncRNA-GRCh37.68_transcripts-HS_rRNA_mitRNA --blastDbs /home/jhuang/REFs/viral_ngs_dbs/blast_dbs_remove/hybsel_probe_adapters /home/jhuang/REFs/viral_ngs_dbs/blast_dbs_remove/metag_v3.ncRNA.mRNA.mitRNA.consensus --threads 15 --srprismMemory 14250 --JVMmemory 50g --loglevel DEBUG
            #2025-05-23 09:58:45,326 - __init__:445:_attempt_install - DEBUG - Currently installed version of blast: 2.7.1-h4422958_6
            #2025-05-23 09:58:45,327 - __init__:448:_attempt_install - DEBUG - Expected version of blast:            2.6.0
            #2025-05-23 09:58:45,327 - __init__:449:_attempt_install - DEBUG - Incorrect version of blast installed. Removing it...
    #DEBUG_2: TOOL_VERSION = "2.6.0" --> "2.7.1" in ~/Tools/viral-ngs_docker/bin/tools/blast.py

    #BUG_3:
            bin/read_utils.py bwamem_idxstats data/01_cleaned/1762_04.cleaned.bam /home/jhuang/REFs/viral_ngs_dbs/spikeins/ercc_spike-ins.fasta --outStats reports/spike_count/1762_04.spike_count.txt --minScoreToFilter 60 --loglevel DEBUG
    #DEBUG_3: TOOL_VERSION = "0.7.15" --> "0.7.17" in ~/Tools/viral-ngs_docker/bin/tools/bwa.py

    #BUG_4: FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/bin/trimmomatic': '/usr/local/bin/trimmomatic'
    #DEBUG_4: TOOL_VERSION = "0.36" --> "0.38" in ~/Tools/viral-ngs_docker/bin/tools/trimmomatic.py

    #BUG_5: FileNotFoundError: [Errno 2] No such file or directory: '/usr/bin/spades.py': '/usr/bin/spades.py'
    #DEBUG_5:  TOOL_VERSION = "0.36" --> "0.38" in ~/Tools/viral-ngs_docker/bin/tools/trimmomatic.py
    #                def install_and_get_path(self):
    #                        # the conda version wraps the jar file with a shell script
    #                        return 'trimmomatic'

    #BUG_6: bin/assembly.py order_and_orient tmp/02_assembly/2039_04.assembly1-spades.fasta refsel_db/refsel.fasta tmp/02_assembly/2039_04.assembly2-scaffolded.fasta --min_pct_contig_aligned 0.05 --outAlternateContigs tmp/02_assembly/2039_04.assembly2-alternate_sequences.fasta --nGenomeSegments 1 --outReference tmp/02_assembly/2039_04.assembly2-scaffold_ref.fasta --threads 15 --loglevel DEBUG
    2025-05-23 17:40:19,526 - __init__:445:_attempt_install - DEBUG - Currently installed version of mummer4: 4.0.0beta2-pl526hf484d3e_4
    2025-05-23 17:40:19,527 - __init__:448:_attempt_install - DEBUG - Expected version of mummer4:            4.0.0rc1
    2025-05-23 17:40:19,527 - __init__:449:_attempt_install - DEBUG - Incorrect version of mummer4 installed. Removing it..
    #DEBUG_6: TOOL_VERSION = "4.0.0rc1" --> "4.0.0beta2" in ~/Tools/viral-ngs_docker/bin/tools/mummer.py

    #BUG_7: bin/assembly.py order_and_orient tmp/02_assembly/2039_04.assembly1-spades.fasta refsel_db/refsel.fasta tmp/02_assembly/2039_04.assembly2-scaffolded.fasta --min_pct_contig_aligned 0.05 --outAlternateContigs tmp/02_assembly/2039_04.assembly2-alternate_sequences.fasta --nGenomeSegments 1 --outReference tmp/02_assembly/2039_04.assembly2-scaffold_ref.fasta --threads 15 --loglevel DEBUG
            File "bin/assembly.py", line 549, in
                base_counts = [sum([len(seg.seq.replace("N", "")) for seg in scaffold]) \
            AttributeError: 'Seq' object has no attribute 'replace'
    #DEBUG_7: base_counts = [sum([len(seg.seq.replace("N", "")) for seg in scaffold]) --> base_counts = [sum([len(seg.seq.ungap('N')) for seg in scaffold]) in ~/Tools/viral-ngs_docker/bin/assembly.py

    #BUG_8: bin/assembly.py refine_assembly tmp/02_assembly/1243_2.assembly3-modify.fasta data/01_per_sample/1243_2.cleaned.bam tmp/02_assembly/1243_2.assembly4-refined.fasta --outVcf tmp/02_assembly/1243_2.assembly3.vcf.gz --min_coverage 2 --novo_params '-r Random -l 20 -g 40 -x 20 -t 502' --threads 15 --loglevel DEBUG
            File "/work/bin/tools/gatk.py", line 75, in execute
            FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/bin/gatk': '/usr/local/bin/gatk'
    #DEBUG_8: -v /usr/local/bin/gatk:/usr/local/bin/gatk in 'docker run' and change default python in the script via a shebang; TOOL_VERSION = "3.8" --> "3.6" in ~/Tools/viral-ngs_docker/bin/tools/gatk.py

    #BUG_9: pyyaml is missing!
    #DEBUG_9: NO_ERROR if rerun!
            bin/assembly.py impute_from_reference tmp/02_assembly/2039_04.assembly2-gapfilled.fasta tmp/02_assembly/2039_04.assembly2-scaffold_ref.fasta tmp/02_assembly/2039_04.assembly3-modify.fasta --newName 2039_04 --replaceLength 55 --minLengthFraction 0.05 --minUnambig 0.05 --index --loglevel DEBUG
            #for sample in 2039_04 2040_04; do
            for sample in 1762_04 1243_2 875_04; do
                bin/assembly.py impute_from_reference tmp/02_assembly/${sample}.assembly2-gapfilled.fasta tmp/02_assembly/${sample}.assembly2-scaffold_ref.fasta tmp/02_assembly/${sample}.assembly3-modify.fasta --newName ${sample} --replaceLength 55 --minLengthFraction 0.05 --minUnambig 0.05 --index --loglevel DEBUG
            done

    #BUG_10: bin/reports.py consolidate_fastqc reports/fastqc/2039_04/align_to_self reports/fastqc/2040_04/align_to_self reports/fastqc/1762_04/align_to_self reports/fastqc/1243_2/align_to_self reports/fastqc/875_04/align_to_self reports/summary.fastqc.align_to_self.txt
    #DEBUG_10: File "bin/intrahost.py", line 527 and line 579 in merge_to_vcf
            # #MODIFIED_BACK
            samp_to_seqIndex[sampleName] = seq.seq.ungap('-')
            #samp_to_seqIndex[sampleName] = seq.seq.replace("-", "")

    #BUG_11: bin/interhost.py multichr_mafft ref_genome/reference.fasta data/02_assembly/2039_04.fasta data/02_assembly/2040_04.fasta data/02_assembly/1762_04.fasta data/02_assembly/1243_2.fasta data/02_assembly/875_04.fasta data/03_multialign_to_ref --ep 0.123 --maxiters 1000 --preservecase --localpair --outFilePrefix aligned --sampleNameListFile data/03_multialign_to_ref/sampleNameList.txt --threads 15 --loglevel DEBUG
            2025-05-26 15:04:19,014 - cmd:195:main_argparse - INFO - command: bin/interhost.py multichr_mafft inFastas=['ref_genome/reference.fasta', 'data/02_assembly/2039_04.fasta', 'data/02_assembly/2040_04.fasta', 'data/02_assembly/1762_04.fasta', 'data/02_assembly/1243_2.fasta', 'data/02_assembly/875_04.fasta'] localpair=True globalpair=None preservecase=True reorder=None gapOpeningPenalty=1.53 ep=0.123 verbose=False outputAsClustal=None maxiters=1000 outDirectory=data/03_multialign_to_ref outFilePrefix=aligned sampleRelationFile=None sampleNameListFile=data/03_multialign_to_ref/sampleNameList.txt threads=15 loglevel=DEBUG tmp_dir=/tmp tmp_dirKeep=False
            2025-05-26 15:04:19,014 - cmd:209:main_argparse - DEBUG - using tempDir: /tmp/tmp-interhost-multichr_mafft-nuws9mhp
            2025-05-26 15:04:21,085 - __init__:445:_attempt_install - DEBUG - Currently installed version of mafft: 7.402-0
            2025-05-26 15:04:21,085 - __init__:448:_attempt_install - DEBUG - Expected version of mafft: 7.221
            2025-05-26 15:04:21,085 - __init__:449:_attempt_install - DEBUG - Incorrect version of mafft installed. Removing it...
    #DEBUG_11: TOOL_VERSION = "7.221" --> "7.402" in ~/Tools/viral-ngs_docker/bin/tools/mafft.py
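Most of the fixes above come down to changing the `TOOL_VERSION` assignment in a wrapper under bin/tools/ so that it matches the version installed in the image. A sketch of how this edit could be scripted is shown below. The helper name `set_tool_version` is hypothetical, and it assumes the wrapper contains a simple assignment of the form `TOOL_VERSION = '1.6'` or `TOOL_VERSION = "1.6"` at the start of a line.

```python
import re

# An assignment such as: TOOL_VERSION = '1.6'   or   TOOL_VERSION = "2.6.0"
VERSION_RE = re.compile(
    r"""^([ \t]*TOOL_VERSION[ \t]*=[ \t]*)(['"])([^'"]*)\2""", re.MULTILINE
)


def set_tool_version(source, new_version):
    """Return (new_source, old_version); only the first assignment is replaced."""
    match = VERSION_RE.search(source)
    if match is None:
        raise ValueError("no TOOL_VERSION assignment found")
    prefix, quote, old = match.group(1), match.group(2), match.group(3)
    replacement = f"{prefix}{quote}{new_version}{quote}"
    return source[:match.start()] + replacement + source[match.end():], old
```

Reading a wrapper such as bin/tools/samtools.py, passing its text through `set_tool_version(text, "1.9")` and writing the result back would reproduce the DEBUG_1_BETTER_SIMPLE fix; the returned old version can be logged for reference.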

Processing Data_Tam_RNAseq_2025_LB_vs_Mac_ATCC19606

  1. Targets

    Could you please assist me with processing RNA-seq data? The reference genome is CP059040. I aim to analyze the data using PCA, a Venn diagram, and KEGG and GO annotation enrichment analysis.
    The samples are labeled as follows (where 'x' indicates the replicate number):
    
        LB-AB-x
        LB-IJ-x
        LB-W1-x
        LB-WT19606-x
        LB-Y1-x
        Mac-AB-x
        Mac-IJ-x
        Mac-W1-x
        Mac-WT19606-x
        Mac-Y1-x
  2. Download the raw data

    ./lnd login -u X101SC25015922-Z02-J002 -p m*********5
    ./lnd list
    ./lnd cp -d oss://  ./
    ./lnd cp oss://CP2024102300053 .  #Error
    ./lnd list oss://CP2024102300053
    ./lnd cp -d oss://CP2024102300053/H101SC25015922/RSMR00204 .
    #CP2024102300053/H101SC25015922/RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002
  3. Prepare raw data

    mkdir raw_data; cd raw_data
    
    ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/LB-AB-1/LB-AB-1_1.fq.gz LB-AB-r1_R1.fq.gz
    ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/LB-AB-1/LB-AB-1_2.fq.gz LB-AB-r1_R2.fq.gz
    ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/LB-AB-2/LB-AB-2_1.fq.gz LB-AB-r2_R1.fq.gz
    ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/LB-AB-2/LB-AB-2_2.fq.gz LB-AB-r2_R2.fq.gz
    ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/LB-AB-3/LB-AB-3_1.fq.gz LB-AB-r3_R1.fq.gz
    ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/LB-AB-3/LB-AB-3_2.fq.gz LB-AB-r3_R2.fq.gz
    
    ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/LB-IJ-1/LB-IJ-1_1.fq.gz LB-IJ-r1_R1.fq.gz
    ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/LB-IJ-1/LB-IJ-1_2.fq.gz LB-IJ-r1_R2.fq.gz
    ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/LB-IJ-2/LB-IJ-2_1.fq.gz LB-IJ-r2_R1.fq.gz
    ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/LB-IJ-2/LB-IJ-2_2.fq.gz LB-IJ-r2_R2.fq.gz
    ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/LB-IJ-4/LB-IJ-4_1.fq.gz LB-IJ-r4_R1.fq.gz
    ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/LB-IJ-4/LB-IJ-4_2.fq.gz LB-IJ-r4_R2.fq.gz
    
    ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/LB-W1-1/LB-W1-1_1.fq.gz LB-W1-r1_R1.fq.gz
    ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/LB-W1-1/LB-W1-1_2.fq.gz LB-W1-r1_R2.fq.gz
    ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/LB-W1-2/LB-W1-2_1.fq.gz LB-W1-r2_R1.fq.gz
    ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/LB-W1-2/LB-W1-2_2.fq.gz LB-W1-r2_R2.fq.gz
    ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/LB-W1-3/LB-W1-3_1.fq.gz LB-W1-r3_R1.fq.gz
    ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/LB-W1-3/LB-W1-3_2.fq.gz LB-W1-r3_R2.fq.gz
    
    ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/LB-WT19606-2/LB-WT19606-2_1.fq.gz LB-WT19606-r2_R1.fq.gz
    ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/LB-WT19606-2/LB-WT19606-2_2.fq.gz LB-WT19606-r2_R2.fq.gz
    ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/LB-WT19606-3/LB-WT19606-3_1.fq.gz LB-WT19606-r3_R1.fq.gz
    ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/LB-WT19606-3/LB-WT19606-3_2.fq.gz LB-WT19606-r3_R2.fq.gz
    ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/LB-WT19606-4/LB-WT19606-4_1.fq.gz LB-WT19606-r4_R1.fq.gz
    ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/LB-WT19606-4/LB-WT19606-4_2.fq.gz LB-WT19606-r4_R2.fq.gz
    
    ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/LB-Y1-2/LB-Y1-2_1.fq.gz LB-Y1-r2_R1.fq.gz
    ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/LB-Y1-2/LB-Y1-2_2.fq.gz LB-Y1-r2_R2.fq.gz
    ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/LB-Y1-3/LB-Y1-3_1.fq.gz LB-Y1-r3_R1.fq.gz
    ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/LB-Y1-3/LB-Y1-3_2.fq.gz LB-Y1-r3_R2.fq.gz
    ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/LB-Y1-4/LB-Y1-4_1.fq.gz LB-Y1-r4_R1.fq.gz
    ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/LB-Y1-4/LB-Y1-4_2.fq.gz LB-Y1-r4_R2.fq.gz
    
    ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/Mac-AB-1/Mac-AB-1_1.fq.gz Mac-AB-r1_R1.fq.gz
    ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/Mac-AB-1/Mac-AB-1_2.fq.gz Mac-AB-r1_R2.fq.gz
    ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/Mac-AB-2/Mac-AB-2_1.fq.gz Mac-AB-r2_R1.fq.gz
    ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/Mac-AB-2/Mac-AB-2_2.fq.gz Mac-AB-r2_R2.fq.gz
    ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/Mac-AB-3/Mac-AB-3_1.fq.gz Mac-AB-r3_R1.fq.gz
    ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/Mac-AB-3/Mac-AB-3_2.fq.gz Mac-AB-r3_R2.fq.gz
    
    ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/Mac-IJ-1/Mac-IJ-1_1.fq.gz Mac-IJ-r1_R1.fq.gz
    ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/Mac-IJ-1/Mac-IJ-1_2.fq.gz Mac-IJ-r1_R2.fq.gz
    ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/Mac-IJ-2/Mac-IJ-2_1.fq.gz Mac-IJ-r2_R1.fq.gz
    ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/Mac-IJ-2/Mac-IJ-2_2.fq.gz Mac-IJ-r2_R2.fq.gz
    ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/Mac-IJ-4/Mac-IJ-4_1.fq.gz Mac-IJ-r4_R1.fq.gz
    ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/Mac-IJ-4/Mac-IJ-4_2.fq.gz Mac-IJ-r4_R2.fq.gz
    
    ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/Mac-W1-1/Mac-W1-1_1.fq.gz Mac-W1-r1_R1.fq.gz
    ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/Mac-W1-1/Mac-W1-1_2.fq.gz Mac-W1-r1_R2.fq.gz
    ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/Mac-W1-2/Mac-W1-2_1.fq.gz Mac-W1-r2_R1.fq.gz
    ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/Mac-W1-2/Mac-W1-2_2.fq.gz Mac-W1-r2_R2.fq.gz
    ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/Mac-W1-3/Mac-W1-3_1.fq.gz Mac-W1-r3_R1.fq.gz
    ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/Mac-W1-3/Mac-W1-3_2.fq.gz Mac-W1-r3_R2.fq.gz
    
    ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/Mac-WT19606-2/Mac-WT19606-2_1.fq.gz Mac-WT19606-r2_R1.fq.gz
    ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/Mac-WT19606-2/Mac-WT19606-2_2.fq.gz Mac-WT19606-r2_R2.fq.gz
    ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/Mac-WT19606-3/Mac-WT19606-3_1.fq.gz Mac-WT19606-r3_R1.fq.gz
    ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/Mac-WT19606-3/Mac-WT19606-3_2.fq.gz Mac-WT19606-r3_R2.fq.gz
    ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/Mac-WT19606-4/Mac-WT19606-4_1.fq.gz Mac-WT19606-r4_R1.fq.gz
    ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/Mac-WT19606-4/Mac-WT19606-4_2.fq.gz Mac-WT19606-r4_R2.fq.gz
    
    ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/Mac-Y1-2/Mac-Y1-2_1.fq.gz Mac-Y1-r2_R1.fq.gz
    ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/Mac-Y1-2/Mac-Y1-2_2.fq.gz Mac-Y1-r2_R2.fq.gz
    ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/Mac-Y1-3/Mac-Y1-3_1.fq.gz Mac-Y1-r3_R1.fq.gz
    ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/Mac-Y1-3/Mac-Y1-3_2.fq.gz Mac-Y1-r3_R2.fq.gz
    ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/Mac-Y1-4/Mac-Y1-4_1.fq.gz Mac-Y1-r4_R1.fq.gz
    ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/Mac-Y1-4/Mac-Y1-4_2.fq.gz Mac-Y1-r4_R2.fq.gz
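The symlink commands above follow one pattern: the vendor name `<medium>-<strain>-<n>_<mate>.fq.gz` becomes `<medium>-<strain>-r<n>_R<mate>.fq.gz`. A minimal Python sketch of that renaming, assuming the replicate numbers listed above; `link_plan` and `make_links` are hypothetical helper names, not part of any tool used here:

```python
from pathlib import Path

# Vendor directory layout, copied from the ln -s commands above.
RAW_ROOT = "../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData"

# Replicate numbers delivered per strain (identical for LB and Mac).
REPLICATES = {
    "AB": [1, 2, 3],
    "IJ": [1, 2, 4],
    "W1": [1, 2, 3],
    "WT19606": [2, 3, 4],
    "Y1": [2, 3, 4],
}


def link_plan(media=("LB", "Mac")):
    """Return (target, link_name) pairs mirroring the ln -s commands."""
    plan = []
    for medium in media:
        for strain, reps in REPLICATES.items():
            for rep in reps:
                vendor = f"{medium}-{strain}-{rep}"
                for mate in (1, 2):
                    target = f"{RAW_ROOT}/{vendor}/{vendor}_{mate}.fq.gz"
                    link = f"{medium}-{strain}-r{rep}_R{mate}.fq.gz"
                    plan.append((target, link))
    return plan


def make_links(dest_dir, plan):
    """Create the symlinks inside dest_dir, skipping links that already exist."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    for target, link in plan:
        path = dest / link
        if not path.is_symlink():
            path.symlink_to(target)


plan = link_plan()
print(len(plan), "links planned")
```

Calling `make_links("raw_data", plan)` from the project directory would then create the links; the relative targets resolve from inside `raw_data`, as with the `ln -s` commands.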
  4. Trim the reads with Trimmomatic (output directories trimmed and trimmed_unpaired)

    mkdir trimmed trimmed_unpaired;
    for sample_id in LB-AB-r1 LB-AB-r2 LB-AB-r3  LB-IJ-r1 LB-IJ-r2 LB-IJ-r4  LB-W1-r1 LB-W1-r2 LB-W1-r3  LB-WT19606-r2 LB-WT19606-r3 LB-WT19606-r4  LB-Y1-r2 LB-Y1-r3 LB-Y1-r4    Mac-AB-r1 Mac-AB-r2 Mac-AB-r3  Mac-IJ-r1 Mac-IJ-r2 Mac-IJ-r4  Mac-W1-r1 Mac-W1-r2 Mac-W1-r3  Mac-WT19606-r2 Mac-WT19606-r3 Mac-WT19606-r4  Mac-Y1-r2 Mac-Y1-r3 Mac-Y1-r4; do
            java -jar /home/jhuang/Tools/Trimmomatic-0.36/trimmomatic-0.36.jar PE -threads 100 raw_data/${sample_id}_R1.fq.gz raw_data/${sample_id}_R2.fq.gz trimmed/${sample_id}_R1.fq.gz trimmed_unpaired/${sample_id}_R1.fq.gz trimmed/${sample_id}_R2.fq.gz trimmed_unpaired/${sample_id}_R2.fq.gz ILLUMINACLIP:/home/jhuang/Tools/Trimmomatic-0.36/adapters/TruSeq3-PE-2.fa:2:30:10:8:TRUE LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36 AVGQUAL:20;
    done 2> trimmomatic_pe.log;
  5. Preparing samplesheet.csv

    sample,fastq_1,fastq_2,strandedness
    LB-AB-r1,LB-AB-r1_R1.fq.gz,LB-AB-r1_R2.fq.gz,auto
    LB-AB-r2,LB-AB-r2_R1.fq.gz,LB-AB-r2_R2.fq.gz,auto
    LB-AB-r3,LB-AB-r3_R1.fq.gz,LB-AB-r3_R2.fq.gz,auto
    LB-IJ-r1,LB-IJ-r1_R1.fq.gz,LB-IJ-r1_R2.fq.gz,auto
    LB-IJ-r2,LB-IJ-r2_R1.fq.gz,LB-IJ-r2_R2.fq.gz,auto
    LB-IJ-r4,LB-IJ-r4_R1.fq.gz,LB-IJ-r4_R2.fq.gz,auto
    LB-W1-r1,LB-W1-r1_R1.fq.gz,LB-W1-r1_R2.fq.gz,auto
    LB-W1-r2,LB-W1-r2_R1.fq.gz,LB-W1-r2_R2.fq.gz,auto
    LB-W1-r3,LB-W1-r3_R1.fq.gz,LB-W1-r3_R2.fq.gz,auto
    LB-WT19606-r2,LB-WT19606-r2_R1.fq.gz,LB-WT19606-r2_R2.fq.gz,auto
    LB-WT19606-r3,LB-WT19606-r3_R1.fq.gz,LB-WT19606-r3_R2.fq.gz,auto
    LB-WT19606-r4,LB-WT19606-r4_R1.fq.gz,LB-WT19606-r4_R2.fq.gz,auto
    LB-Y1-r2,LB-Y1-r2_R1.fq.gz,LB-Y1-r2_R2.fq.gz,auto
    LB-Y1-r3,LB-Y1-r3_R1.fq.gz,LB-Y1-r3_R2.fq.gz,auto
    LB-Y1-r4,LB-Y1-r4_R1.fq.gz,LB-Y1-r4_R2.fq.gz,auto
    Mac-AB-r1,Mac-AB-r1_R1.fq.gz,Mac-AB-r1_R2.fq.gz,auto
    Mac-AB-r2,Mac-AB-r2_R1.fq.gz,Mac-AB-r2_R2.fq.gz,auto
    Mac-AB-r3,Mac-AB-r3_R1.fq.gz,Mac-AB-r3_R2.fq.gz,auto
    Mac-IJ-r1,Mac-IJ-r1_R1.fq.gz,Mac-IJ-r1_R2.fq.gz,auto
    Mac-IJ-r2,Mac-IJ-r2_R1.fq.gz,Mac-IJ-r2_R2.fq.gz,auto
    Mac-IJ-r4,Mac-IJ-r4_R1.fq.gz,Mac-IJ-r4_R2.fq.gz,auto
    Mac-W1-r1,Mac-W1-r1_R1.fq.gz,Mac-W1-r1_R2.fq.gz,auto
    Mac-W1-r2,Mac-W1-r2_R1.fq.gz,Mac-W1-r2_R2.fq.gz,auto
    Mac-W1-r3,Mac-W1-r3_R1.fq.gz,Mac-W1-r3_R2.fq.gz,auto
    Mac-WT19606-r2,Mac-WT19606-r2_R1.fq.gz,Mac-WT19606-r2_R2.fq.gz,auto
    Mac-WT19606-r3,Mac-WT19606-r3_R1.fq.gz,Mac-WT19606-r3_R2.fq.gz,auto
    Mac-WT19606-r4,Mac-WT19606-r4_R1.fq.gz,Mac-WT19606-r4_R2.fq.gz,auto
    Mac-Y1-r2,Mac-Y1-r2_R1.fq.gz,Mac-Y1-r2_R2.fq.gz,auto
    Mac-Y1-r3,Mac-Y1-r3_R1.fq.gz,Mac-Y1-r3_R2.fq.gz,auto
    Mac-Y1-r4,Mac-Y1-r4_R1.fq.gz,Mac-Y1-r4_R2.fq.gz,auto
    
    #mv trimmed/* .
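The samplesheet above is regular enough to generate rather than type by hand. The following is a sketch only, assuming the same sample names and the four-column header shown above; `samplesheet_rows` and `render_samplesheet` are hypothetical helper names:

```python
import csv
import io

# Replicate numbers per strain, as used in the samplesheet above.
REPLICATES = {
    "AB": [1, 2, 3],
    "IJ": [1, 2, 4],
    "W1": [1, 2, 3],
    "WT19606": [2, 3, 4],
    "Y1": [2, 3, 4],
}


def samplesheet_rows(media=("LB", "Mac"), strandedness="auto"):
    """Build one row per sample: sample, fastq_1, fastq_2, strandedness."""
    rows = []
    for medium in media:
        for strain, reps in REPLICATES.items():
            for rep in reps:
                sample = f"{medium}-{strain}-r{rep}"
                rows.append(
                    [sample, f"{sample}_R1.fq.gz", f"{sample}_R2.fq.gz", strandedness]
                )
    return rows


def render_samplesheet(rows):
    """Return the samplesheet as CSV text with the header used above."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["sample", "fastq_1", "fastq_2", "strandedness"])
    writer.writerows(rows)
    return buf.getvalue()


sheet = render_samplesheet(samplesheet_rows())
print(sheet, end="")
```

Writing `sheet` to `samplesheet.csv` should reproduce the listing above, with the FASTQ paths still relative to the directory from which the pipeline is launched.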
  6. nextflow run

    #Example1: http://xgenes.com/article/article-content/157/prepare-virus-gtf-for-nextflow-run/
    #docker pull nfcore/rnaseq
    ln -s /home/jhuang/Tools/nf-core-rnaseq-3.12.0/ rnaseq
    
    # ---- SUCCESSFUL with directly downloaded gff3 and fasta from NCBI using docker after replacing 'CDS' with 'exon' ----
    (host_env) /usr/local/bin/nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results    --fasta "/home/jhuang/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine/CP059040.fasta" --gff "/home/jhuang/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine/CP059040_m.gff"        -profile docker -resume  --max_cpus 55 --max_memory 512.GB --max_time 2400.h    --save_align_intermeds --save_unaligned --save_reference    --aligner 'star_salmon'    --gtf_group_features 'gene_id'  --gtf_extra_attributes 'gene_name' --featurecounts_group_type 'gene_biotype' --featurecounts_feature_type 'transcript'
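The comment above mentions replacing 'CDS' with 'exon' in the NCBI GFF3 before the run. How that was done is not shown here; one cautious way, sketched below, is to change only the feature-type column (column 3) so that attribute text containing "CDS" is left alone. The function name `cds_to_exon` and the toy records are illustrative assumptions, not real CP059040 annotation:

```python
def cds_to_exon(gff_text):
    """Rewrite the feature type (column 3) from CDS to exon.

    Comment lines, blank lines and all other columns are left unchanged.
    """
    out = []
    for line in gff_text.splitlines():
        if line.startswith("#") or not line.strip():
            out.append(line)
            continue
        fields = line.split("\t")
        if len(fields) == 9 and fields[2] == "CDS":
            fields[2] = "exon"
        out.append("\t".join(fields))
    return "\n".join(out) + "\n"


# Toy records for illustration only (IDs are made up).
example = (
    "##gff-version 3\n"
    "CP059040.1\tGenbank\tgene\t1\t1398\t.\t+\t.\tID=gene-TOY_0001;Name=toyA\n"
    "CP059040.1\tGenbank\tCDS\t1\t1398\t.\t+\t0\tID=cds-TOY_0001;Parent=gene-TOY_0001;note=CDS-like\n"
)
converted = cds_to_exon(example)
print(converted, end="")
```

Restricting the change to column 3 avoids the side effects a global text substitution could have on the attribute column.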
  7. Import data and pca-plot

    #mamba activate r_env
    
    #install.packages("ggfun")
    # Import the required libraries
    library("AnnotationDbi")
    library("clusterProfiler")
    library("ReactomePA")
    library(gplots)
    library(tximport)
    library(DESeq2)
    #library("org.Hs.eg.db")
    library(dplyr)
    library(tidyverse)
    #install.packages("devtools")
    #devtools::install_version("gtable", version = "0.3.0")
    library("RColorBrewer")
    #install.packages("ggrepel")
    library("ggrepel")
    # install.packages("openxlsx")
    library(openxlsx)
    library(EnhancedVolcano)
    library(edgeR)
    
    setwd("~/DATA/Data_Tam_RNAseq_2025_LB_vs_Mac_ATCC19606/results/star_salmon")
    # Define paths to your Salmon output quantification files
    
    files <- c("LB-AB_r1" = "./LB-AB-r1/quant.sf",
            "LB-AB_r2" = "./LB-AB-r2/quant.sf",
            "LB-AB_r3" = "./LB-AB-r3/quant.sf",
            "LB-IJ_r1" = "./LB-IJ-r1/quant.sf",
            "LB-IJ_r2" = "./LB-IJ-r2/quant.sf",
            "LB-IJ_r4" = "./LB-IJ-r4/quant.sf",
            "LB-W1_r1" = "./LB-W1-r1/quant.sf",
            "LB-W1_r2" = "./LB-W1-r2/quant.sf",
            "LB-W1_r3" = "./LB-W1-r3/quant.sf",
            "LB-WT19606_r2" = "./LB-WT19606-r2/quant.sf",
            "LB-WT19606_r3" = "./LB-WT19606-r3/quant.sf",
            "LB-WT19606_r4" = "./LB-WT19606-r4/quant.sf",
            "LB-Y1_r2" = "./LB-Y1-r2/quant.sf",
            "LB-Y1_r3" = "./LB-Y1-r3/quant.sf",
            "LB-Y1_r4" = "./LB-Y1-r4/quant.sf",
            "Mac-AB_r1" = "./Mac-AB-r1/quant.sf",
            "Mac-AB_r2" = "./Mac-AB-r2/quant.sf",
            "Mac-AB_r3" = "./Mac-AB-r3/quant.sf",
            "Mac-IJ_r1" = "./Mac-IJ-r1/quant.sf",
            "Mac-IJ_r2" = "./Mac-IJ-r2/quant.sf",
            "Mac-IJ_r4" = "./Mac-IJ-r4/quant.sf",
            "Mac-W1_r1" = "./Mac-W1-r1/quant.sf",
            "Mac-W1_r2" = "./Mac-W1-r2/quant.sf",
            "Mac-W1_r3" = "./Mac-W1-r3/quant.sf",
            "Mac-WT19606_r2" = "./Mac-WT19606-r2/quant.sf",
            "Mac-WT19606_r3" = "./Mac-WT19606-r3/quant.sf",
            "Mac-WT19606_r4" = "./Mac-WT19606-r4/quant.sf",
            "Mac-Y1_r2" = "./Mac-Y1-r2/quant.sf",
            "Mac-Y1_r3" = "./Mac-Y1-r3/quant.sf",
            "Mac-Y1_r4" = "./Mac-Y1-r4/quant.sf")
    
    # Import the transcript abundance data with tximport
    txi <- tximport(files, type = "salmon", txIn = TRUE, txOut = TRUE)
    # Define the replicates and condition of the samples
    #replicate <- factor(c("r1", "r2", "r3", "r1", "r2", "r3", "r1", "r2", "r3"))
    #AdeABC efflux pump: adeA encodes the membrane fusion protein and adeB the RND transporter (adeC encodes the outer membrane factor); the pump contributes to multidrug resistance.
    #AdeIJK efflux pump: adeI encodes the membrane fusion protein, adeJ the RND transporter, and adeK the outer membrane factor.
    condition <- factor(c("LB-AB","LB-AB","LB-AB", "LB-IJ","LB-IJ","LB-IJ", "LB-W1","LB-W1","LB-W1","LB-WT19606","LB-WT19606","LB-WT19606","LB-Y1","LB-Y1","LB-Y1","Mac-AB","Mac-AB","Mac-AB","Mac-IJ","Mac-IJ","Mac-IJ","Mac-W1","Mac-W1","Mac-W1","Mac-WT19606","Mac-WT19606","Mac-WT19606","Mac-Y1","Mac-Y1","Mac-Y1"))
    # Define the colData for DESeq2
    colData <- data.frame(condition=condition, row.names=names(files))
    
    # ------------------------
    # 1️⃣ Setup and input files
    # ------------------------
    
    # Read in transcript-to-gene mapping
    tx2gene <- read.table("salmon_tx2gene.tsv", header=FALSE, stringsAsFactors=FALSE)
    colnames(tx2gene) <- c("transcript_id", "gene_id", "gene_name")
    
    # Prepare tx2gene for gene-level summarization (remove gene_name if needed)
    tx2gene_geneonly <- tx2gene[, c("transcript_id", "gene_id")]
    
    # -------------------------------
    # 2️⃣ Transcript-level counts
    # -------------------------------
    # Create DESeqDataSet directly from tximport (transcript-level)
    dds_tx <- DESeqDataSetFromTximport(txi, colData=colData, design=~condition)
    write.csv(counts(dds_tx), file="transcript_counts.csv")
    
    # --------------------------------
    # 3️⃣ Gene-level summarization
    # --------------------------------
    # Re-import Salmon data summarized at gene level
    txi_gene <- tximport(files, type="salmon", tx2gene=tx2gene_geneonly, txOut=FALSE)
    
    # Create DESeqDataSet for gene-level counts
    #dds <- DESeqDataSetFromTximport(txi_gene, colData=colData, design=~condition+replicate)
    dds <- DESeqDataSetFromTximport(txi_gene, colData=colData, design=~condition)
    
    # --------------------------------
    # 4️⃣ Raw counts table (with gene names)
    # --------------------------------
    # Extract raw gene-level counts
    counts_data <- as.data.frame(counts(dds, normalized=FALSE))
    counts_data$gene_id <- rownames(counts_data)
    
    # Add gene names
    tx2gene_unique <- unique(tx2gene[, c("gene_id", "gene_name")])
    counts_data <- merge(counts_data, tx2gene_unique, by="gene_id", all.x=TRUE)
    
    # Reorder columns: gene_id, gene_name, then counts
    count_cols <- setdiff(colnames(counts_data), c("gene_id", "gene_name"))
    counts_data <- counts_data[, c("gene_id", "gene_name", count_cols)]
    
    # --------------------------------
    # 5️⃣ Calculate CPM
    # --------------------------------
    library(edgeR)
    library(openxlsx)
    
    # Prepare count matrix for CPM calculation
    count_matrix <- as.matrix(counts_data[, !(colnames(counts_data) %in% c("gene_id", "gene_name"))])
    
    # Calculate CPM
    #cpm_matrix <- cpm(count_matrix, normalized.lib.sizes=FALSE)
    total_counts <- colSums(count_matrix)
    cpm_matrix <- t(t(count_matrix) / total_counts) * 1e6
    cpm_matrix <- as.data.frame(cpm_matrix)
    
    # Add gene_id and gene_name back to CPM table
    cpm_counts <- cbind(counts_data[, c("gene_id", "gene_name")], cpm_matrix)
    
    # --------------------------------
    # 6️⃣ Save outputs
    # --------------------------------
    write.csv(counts_data, "gene_raw_counts.csv", row.names=FALSE)
    write.xlsx(counts_data, "gene_raw_counts.xlsx", row.names=FALSE)
    write.xlsx(cpm_counts, "gene_cpm_counts.xlsx", row.names=FALSE)
    
    # -- (Optional) Save the rlog-transformed counts --
    dim(counts(dds))
    head(counts(dds), 10)
    rld <- rlogTransformation(dds)
    rlog_counts <- assay(rld)
    write.xlsx(as.data.frame(rlog_counts), "gene_rlog_transformed_counts.xlsx")
    
    # ---- (Optional for NACHREIHEN) split the factors media and strain from condition (for the comparison Mac vs LB) ----
    # AdeIJK vs. AdeABC Efflux Pumps
    #     * AdeIJK is the "housekeeping" pump — always active, broadly expressed, contributing to background resistance.
    #     * AdeABC is the "emergency" pump — induced under stress or mutations, more potent in contributing to clinical multidrug resistance.
    #LB = Luria-Bertani broth (a standard rich growth medium)
    #Mac = MacConkey agar or broth (selective for Gram-negative bacteria)
    # Possible column names for the two factors:
    # - Growth medium: Media, Condition, or GrowthMedium
    # - Bacterial strain/genotype: Strain, Isolate, Genotype, or SampleType
    media <- factor(c("LB","LB","LB", "LB","LB","LB", "LB","LB","LB","LB","LB","LB","LB","LB","LB","Mac","Mac","Mac","Mac","Mac","Mac","Mac","Mac","Mac","Mac","Mac","Mac","Mac","Mac","Mac"))
    strain <- factor(c("AB","AB","AB", "IJ","IJ","IJ", "W1","W1","W1","WT19606","WT19606","WT19606","Y1","Y1","Y1","AB","AB","AB","IJ","IJ","IJ","W1","W1","W1","WT19606","WT19606","WT19606","Y1","Y1","Y1"))
    # Define the colData for DESeq2
    colData <- data.frame(media=media, strain=strain, row.names=names(files))
    # -- transcript-level count data (x2) --
    # Create DESeqDataSet object
    dds <- DESeqDataSetFromTximport(txi, colData=colData, design=~media+strain)
    #write.csv(counts(dds), file="transcript_counts_media_strain.csv")  #check correctness, it should be identical to transcript_counts.csv
    # -- gene-level count data (x2) --
    # Read in the tx2gene map from salmon_tx2gene.tsv
    tx2gene <- read.table("salmon_tx2gene.tsv", header=FALSE, stringsAsFactors=FALSE)
    # Set the column names
    colnames(tx2gene) <- c("transcript_id", "gene_id", "gene_name")
    # Remove the gene_name column if not needed
    tx2gene <- tx2gene[,1:2]
    # Import and summarize the Salmon data with tximport
    txi <- tximport(files, type = "salmon", tx2gene = tx2gene, txOut = FALSE)
    # Continue with the DESeq2 workflow as before...
    colData <- data.frame(media=media, strain=strain, row.names=names(files))
    dds <- DESeqDataSetFromTximport(txi, colData=colData, design=~media+strain)
    #dds <- dds[rowSums(counts(dds) > 3) > 2, ]    #3796->????
    #write.csv(counts(dds, normalized=FALSE), file="gene_counts_media_strain.csv")  #check correctness, it should be identical to gene_counts.csv
    # ---- (Optional for NACHREIHEN) END ----
    
    # -- pca --
    png("pca2.png", 1200, 800)
    plotPCA(rld, intgroup=c("condition"))
    dev.off()
    # -- heatmap --
    png("heatmap2.png", 1200, 800)
    distsRL <- dist(t(assay(rld)))
    mat <- as.matrix(distsRL)
    hc <- hclust(distsRL)
    hmcol <- colorRampPalette(brewer.pal(9,"GnBu"))(100)
    heatmap.2(mat, Rowv=as.dendrogram(hc),symm=TRUE, trace="none",col = rev(hmcol), margin=c(13, 13))
    dev.off()
    
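    # -- pca_media_strain --
    # Requires an rld computed from the dds whose colData contains media and strain (optional block above):
    # rld <- rlogTransformation(dds)
    png("pca_media.png", 1200, 800)
    plotPCA(rld, intgroup=c("media"))
    dev.off()
    png("pca_strain.png", 1200, 800)
    plotPCA(rld, intgroup=c("strain"))
    dev.off()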
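The CPM calculation in this step divides every gene count by the total count of its sample (the library size) and multiplies by one million. A small worked sketch on a toy matrix, with rows as genes and columns as samples; the numbers are invented for illustration and `cpm` is a hypothetical helper name:

```python
def cpm(matrix):
    """Counts per million for a genes x samples matrix (list of rows)."""
    n_samples = len(matrix[0])
    # Library size = column sum for each sample.
    totals = [sum(row[j] for row in matrix) for j in range(n_samples)]
    return [
        [row[j] / totals[j] * 1e6 for j in range(n_samples)]
        for row in matrix
    ]


# Toy counts: 3 genes, 2 samples (each sample has 100 reads in total).
toy = [
    [10, 0],
    [30, 50],
    [60, 50],
]
result = cpm(toy)
for row in result:
    print([round(v) for v in row])
```

Each column of the result sums to one million, which is a quick sanity check for the real `gene_cpm_counts.xlsx` table. A sample with zero total counts would raise a division error in this sketch.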
  8. (Optional; ERROR --> needs to be debugged!) Estimate size factors and dispersion values

    #Size Factors: These are used to normalize the read counts across different samples. The size factor for a sample accounts for differences in sequencing depth (i.e., the total number of reads) and other technical biases between samples. After normalization with size factors, the counts should be comparable across samples. DESeq2 computes them with the median-of-ratios method: for each sample, the median over genes of the ratio between the gene's count and the gene's geometric mean across all samples, assuming that most genes are not differentially expressed.
    #Dispersion: This refers to the variability or spread of gene expression measurements. In RNA-seq data analysis, each gene has its own dispersion value, which reflects how much the counts for that gene vary between different samples, more than what would be expected just due to the Poisson variation inherent in counting. Dispersion is important for accurately modeling the data and for detecting differentially expressed genes.
    #So in summary, size factors are specific to samples (used to make counts comparable across samples), and dispersion values are specific to genes (reflecting variability in gene expression).
    
    sizeFactors(dds)
    #NULL
    # Estimate size factors
    dds <- estimateSizeFactors(dds)
    # Estimate dispersions
    dds <- estimateDispersions(dds)
    #> sizeFactors(dds)
    
    #control_r1 control_r2  HSV.d2_r1  HSV.d2_r2  HSV.d4_r1  HSV.d4_r2  HSV.d6_r1
    #2.3282468  2.0251928  1.8036883  1.3767551  0.9341929  1.0911693  0.5454526
    #HSV.d6_r2  HSV.d8_r1  HSV.d8_r2
    #0.4604461  0.5799834  0.6803681
    
    # (DEBUG) sizeFactors(dds) is still NULL after estimateSizeFactors()
    # A dds built with DESeqDataSetFromTximport carries an avgTxLength assay. In that case estimateSizeFactors()
    # stores gene-by-sample normalization factors (see normalizationFactors(dds)) and leaves sizeFactors(dds) as NULL.
    # To get one size factor per sample, drop avgTxLength and re-estimate:
    assays(dds)$avgTxLength <- NULL
    dds <- estimateSizeFactors(dds)
    sizeFactors(dds)
    # The following attempt to keep avgTxLength is disabled: 'use' is not an argument of estimateSizeFactors().
    #dds <- estimateSizeFactors(dds, controlGenes = NULL, use = FALSE)
    #sizeFactors(dds)
    
    # With the virus data alone, the following BUG occurred:
    #Still NULL --> BUG --> use the manual size-factor calculation (see estimSf below)!
    #                    HeLa_TO_r1                      HeLa_TO_r2
    #                    0.9978755                       1.1092227
    data.frame(genes = rownames(dds), dispersions = dispersions(dds))
    
    #Given the raw counts, the control_r1 and control_r2 samples seem to have a much lower sequencing depth (total read count) than the other samples. Therefore, when normalization methods are applied, the normalization factors for these control samples will be relatively high, boosting the normalized counts.
    #1/0.9978755=1.002129023
    #1/1.1092227=0.901532217
    # (shell, deepTools)
    #bamCoverage --bam ../markDuplicates/${sample}Aligned.sortedByCoord.out.bam -o ${sample}_norm.bw --binSize 10 --scaleFactor  --effectiveGenomeSize 2864785220
    bamCoverage --bam ../markDuplicates/HeLa_TO_r1Aligned.sortedByCoord.out.markDups.bam -o HeLa_TO_r1.bw --binSize 10 --scaleFactor 1.002129023 --effectiveGenomeSize 2864785220
    bamCoverage --bam ../markDuplicates/HeLa_TO_r2Aligned.sortedByCoord.out.markDups.bam -o HeLa_TO_r2.bw --binSize 10 --scaleFactor 0.901532217 --effectiveGenomeSize 2864785220
    
    raw_counts <- counts(dds)
    normalized_counts <- counts(dds, normalized=TRUE)
    #write.table(raw_counts, file="raw_counts.txt", sep="\t", quote=F, col.names=NA)
    #write.table(normalized_counts, file="normalized_counts.txt", sep="\t", quote=F, col.names=NA)
    #convert bam to bigwig using deepTools by feeding inverse of DESeq’s size Factor
    estimSf <- function (cds){
        # Get the count matrix
        cts <- counts(cds)
        # Geometric mean of a vector
        geomMean <- function(x) prod(x)^(1/length(x))
        # Compute the geometric mean of each row (gene)
        gm.mean  <-  apply(cts, 1, geomMean)
        # Zero values are set to NA (avoids a subsequent division by 0)
        gm.mean[gm.mean == 0] <- NA
        # Divide each row by its corresponding geometric mean
        # sweep(x, MARGIN, STATS, FUN = "-", check.margin = TRUE, ...)
        # MARGIN: 1 or 2 (rows or columns)
        # STATS: a vector of length nrow(x) or ncol(x), depending on MARGIN
        # FUN: the function to be applied
        cts <- sweep(cts, 1, gm.mean, FUN="/")
        # Compute the median of each column (sample)
        med <- apply(cts, 2, median, na.rm=TRUE)
        # Return the size factors
        return(med)
    }
    #https://dputhier.github.io/ASG/practicals/rnaseq_diff_Snf2/rnaseq_diff_Snf2.html
    #http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#data-transformations-and-visualization
    #https://hbctraining.github.io/DGE_workshop/lessons/02_DGE_count_normalization.html
    #https://hbctraining.github.io/DGE_workshop/lessons/04_DGE_DESeq2_analysis.html
    #https://genviz.org/module-04-expression/0004/02/01/DifferentialExpression/
    #DESeq2’s median of ratios [1]
    #EdgeR’s trimmed mean of M values (TMM) [2]
    #http://www.nathalievialaneix.eu/doc/html/TP1_normalization.html  #very good website!
    test_normcount <- sweep(raw_counts, 2, sizeFactors(dds), "/")
    sum(test_normcount != normalized_counts)
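The manual `estimSf` function above implements the median-of-ratios idea, and the bamCoverage scale factors are the reciprocals of the resulting size factors. The sketch below restates both steps in Python on a toy matrix; it is an illustration under those assumptions, not DESeq2's implementation, and `size_factors` and `scale_factor` are hypothetical names:

```python
import math
from statistics import median


def size_factors(matrix):
    """Median-of-ratios size factors for a genes x samples count matrix."""
    n_samples = len(matrix[0])
    ratios = [[] for _ in range(n_samples)]
    for row in matrix:
        # A zero count makes the geometric mean zero; such genes are skipped,
        # like the NA step in estimSf.
        if any(c == 0 for c in row):
            continue
        geo_mean = math.exp(sum(math.log(c) for c in row) / len(row))
        for j, c in enumerate(row):
            ratios[j].append(c / geo_mean)
    return [median(r) for r in ratios]


def scale_factor(size_factor):
    """Reciprocal of a size factor, as passed to bamCoverage --scaleFactor."""
    return 1.0 / size_factor


# Toy counts: sample 2 is sequenced twice as deeply as sample 1.
toy = [
    [10, 20],
    [100, 200],
    [5, 10],
    [0, 7],  # skipped because of the zero
]
sf = size_factors(toy)
print([round(s, 4) for s in sf])

# The two size factors quoted above for HeLa_TO_r1 and HeLa_TO_r2.
print(round(scale_factor(0.9978755), 9), round(scale_factor(1.1092227), 9))
```

Working on logarithms for the geometric mean avoids the overflow that a plain product can hit when there are many samples with large counts.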
  9. Select the differentially expressed genes

    #https://galaxyproject.eu/posts/2020/08/22/three-steps-to-galaxify-your-tool/
    #https://www.biostars.org/p/282295/
    #https://www.biostars.org/p/335751/
    #> dds$condition
    #LB-AB       LB-IJ       LB-W1       LB-WT19606  LB-Y1       Mac-AB     Mac-IJ      Mac-W1      Mac-WT19606 Mac-Y1
    #CONSOLE: mkdir star_salmon/degenes
    
    setwd("degenes")
    #---- relevel to control: run ONE of the four alternatives below, then the export loop ----
    dds$condition <- relevel(dds$condition, "LB-WT19606")
    dds = DESeq(dds, betaPrior=FALSE)
    resultsNames(dds)
    clist <- c("LB.AB_vs_LB.WT19606","LB.IJ_vs_LB.WT19606","LB.W1_vs_LB.WT19606","LB.Y1_vs_LB.WT19606")
    
    dds$condition <- relevel(dds$condition, "Mac-WT19606")
    dds = DESeq(dds, betaPrior=FALSE)
    resultsNames(dds)
    clist <- c("Mac.AB_vs_Mac.WT19606","Mac.IJ_vs_Mac.WT19606","Mac.W1_vs_Mac.WT19606","Mac.Y1_vs_Mac.WT19606")
    
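    # - If the experiment focuses on bacterial growth, gene expression, or general behavior without selective pressure, LB is the better control.
    # - If you want to study bacterial behavior under selective pressure (for example, targeting Gram-negative bacteria, testing antibiotic resistance, or distinguishing lactose fermenters), MacConkey is the more suitable control.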
    dds$media <- relevel(dds$media, "LB")
    dds = DESeq(dds, betaPrior=FALSE)
    resultsNames(dds)
    clist <- c("Mac_vs_LB")
    
    dds$media <- relevel(dds$media, "Mac")
    dds = DESeq(dds, betaPrior=FALSE)
    resultsNames(dds)
    clist <- c("LB_vs_Mac")
    
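    for (i in clist) {
      # Use the "condition" prefix for the strain-vs-WT19606 comparisons and "media" for Mac_vs_LB / LB_vs_Mac
      #contrast = paste("condition", i, sep="_")
      contrast = paste("media", i, sep="_")
      res = results(dds, name=contrast)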
      res <- res[!is.na(res$log2FoldChange),]
      res_df <- as.data.frame(res)
    
      write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(i, "all.txt", sep="-"))
      up <- subset(res_df, padj<=0.05 & log2FoldChange>=2)
      down <- subset(res_df, padj<=0.05 & log2FoldChange<=-2)
      write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(i, "up.txt", sep="-"))
      write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(i, "down.txt", sep="-"))
    }
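The loop above keeps genes with padj <= 0.05 and splits them at log2FoldChange >= 2 (up) and <= -2 (down), sorting each list by effect size. The same selection logic is sketched here in Python on invented rows, with `None` standing in for R's NA; `split_de` is a hypothetical helper name:

```python
def split_de(rows, padj_max=0.05, lfc_min=2.0):
    """Split result rows into up- and down-regulated lists.

    rows: list of dicts with keys 'gene', 'log2FoldChange', 'padj'.
    Rows with a missing fold change or adjusted p-value are dropped.
    """
    tested = [
        r for r in rows
        if r["log2FoldChange"] is not None and r["padj"] is not None
    ]
    significant = [r for r in tested if r["padj"] <= padj_max]
    up = sorted(
        (r for r in significant if r["log2FoldChange"] >= lfc_min),
        key=lambda r: r["log2FoldChange"],
        reverse=True,
    )
    down = sorted(
        (r for r in significant if r["log2FoldChange"] <= -lfc_min),
        key=lambda r: abs(r["log2FoldChange"]),
        reverse=True,
    )
    return up, down


# Hypothetical toy rows, not results from this dataset.
toy_rows = [
    {"gene": "g1", "log2FoldChange": 3.1, "padj": 0.001},
    {"gene": "g2", "log2FoldChange": 2.0, "padj": 0.05},
    {"gene": "g3", "log2FoldChange": -4.2, "padj": 0.01},
    {"gene": "g4", "log2FoldChange": -2.5, "padj": 0.20},
    {"gene": "g5", "log2FoldChange": 5.0, "padj": None},
    {"gene": "g6", "log2FoldChange": -2.2, "padj": 0.03},
]
up, down = split_de(toy_rows)
print([r["gene"] for r in up], [r["gene"] for r in down])
```

Both thresholds are inclusive, matching the `<=` and `>=` comparisons in the R code.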
    
    # -- Under host-env --
    grep -P "\tgene\t" CP059040.gff > CP059040_gene.gff
    python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine/CP059040_gene.gff LB.AB_vs_LB.WT19606-all.txt LB.AB_vs_LB.WT19606-all.csv
    python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine/CP059040_gene.gff LB.AB_vs_LB.WT19606-up.txt LB.AB_vs_LB.WT19606-up.csv
    python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine/CP059040_gene.gff LB.AB_vs_LB.WT19606-down.txt LB.AB_vs_LB.WT19606-down.csv
    python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine/CP059040_gene.gff LB.IJ_vs_LB.WT19606-all.txt LB.IJ_vs_LB.WT19606-all.csv
    python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine/CP059040_gene.gff LB.IJ_vs_LB.WT19606-up.txt LB.IJ_vs_LB.WT19606-up.csv
    python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine/CP059040_gene.gff LB.IJ_vs_LB.WT19606-down.txt LB.IJ_vs_LB.WT19606-down.csv
    python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine/CP059040_gene.gff LB.W1_vs_LB.WT19606-all.txt LB.W1_vs_LB.WT19606-all.csv
    python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine/CP059040_gene.gff LB.W1_vs_LB.WT19606-up.txt LB.W1_vs_LB.WT19606-up.csv
    python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine/CP059040_gene.gff LB.W1_vs_LB.WT19606-down.txt LB.W1_vs_LB.WT19606-down.csv
    python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine/CP059040_gene.gff LB.Y1_vs_LB.WT19606-all.txt LB.Y1_vs_LB.WT19606-all.csv
    python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine/CP059040_gene.gff LB.Y1_vs_LB.WT19606-up.txt LB.Y1_vs_LB.WT19606-up.csv
    python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine/CP059040_gene.gff LB.Y1_vs_LB.WT19606-down.txt LB.Y1_vs_LB.WT19606-down.csv
    python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine/CP059040_gene.gff Mac.AB_vs_Mac.WT19606-all.txt Mac.AB_vs_Mac.WT19606-all.csv
    python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine/CP059040_gene.gff Mac.AB_vs_Mac.WT19606-up.txt Mac.AB_vs_Mac.WT19606-up.csv
    python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine/CP059040_gene.gff Mac.AB_vs_Mac.WT19606-down.txt Mac.AB_vs_Mac.WT19606-down.csv
    python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine/CP059040_gene.gff Mac.IJ_vs_Mac.WT19606-all.txt Mac.IJ_vs_Mac.WT19606-all.csv
    python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine/CP059040_gene.gff Mac.IJ_vs_Mac.WT19606-up.txt Mac.IJ_vs_Mac.WT19606-up.csv
    python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine/CP059040_gene.gff Mac.IJ_vs_Mac.WT19606-down.txt Mac.IJ_vs_Mac.WT19606-down.csv
    python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine/CP059040_gene.gff Mac.W1_vs_Mac.WT19606-all.txt Mac.W1_vs_Mac.WT19606-all.csv
    python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine/CP059040_gene.gff Mac.W1_vs_Mac.WT19606-up.txt Mac.W1_vs_Mac.WT19606-up.csv
    python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine/CP059040_gene.gff Mac.W1_vs_Mac.WT19606-down.txt Mac.W1_vs_Mac.WT19606-down.csv
    python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine/CP059040_gene.gff Mac.Y1_vs_Mac.WT19606-all.txt Mac.Y1_vs_Mac.WT19606-all.csv
    python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine/CP059040_gene.gff Mac.Y1_vs_Mac.WT19606-up.txt Mac.Y1_vs_Mac.WT19606-up.csv
    python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine/CP059040_gene.gff Mac.Y1_vs_Mac.WT19606-down.txt Mac.Y1_vs_Mac.WT19606-down.csv
    python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine/CP059040_gene.gff Mac_vs_LB-all.txt Mac_vs_LB-all.csv
    python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine/CP059040_gene.gff Mac_vs_LB-up.txt Mac_vs_LB-up.csv
    python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine/CP059040_gene.gff Mac_vs_LB-down.txt Mac_vs_LB-down.csv
    python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine/CP059040_gene.gff LB_vs_Mac-all.txt LB_vs_Mac-all.csv
    python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine/CP059040_gene.gff LB_vs_Mac-up.txt LB_vs_Mac-up.csv
    python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine/CP059040_gene.gff LB_vs_Mac-down.txt LB_vs_Mac-down.csv
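The calls above differ only in the comparison name and the gene set (all, up, down). A minimal sketch, assuming a POSIX shell and following the print-then-run style used elsewhere in these notes; the function name `print_replace_cmds` and the variable `GFF` are hypothetical and not part of the original scripts. Nothing is executed, the commands are only printed.

```shell
# Hypothetical helper: print one replace_gene_names.py call per comparison
# and per gene set (all, up, down). The commands are printed, not run.
GFF=/home/jhuang/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine/CP059040_gene.gff

print_replace_cmds() {
  # Arguments: one or more comparison prefixes, e.g. Mac_vs_LB
  for cmp in "$@"; do
    for subset in all up down; do
      echo "python3 ~/Scripts/replace_gene_names.py ${GFF} ${cmp}-${subset}.txt ${cmp}-${subset}.csv"
    done
  done
}

# Example: print the six commands for two of the comparisons
print_replace_cmds Mac_vs_LB LB_vs_Mac
```

Piping the output to `bash` would run the printed commands; reviewing them first keeps the step as transparent as the explicit list.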
    
    # ---- Mac_vs_LB ----
    res <- read.csv("Mac_vs_LB-all.csv")
    # Replace empty GeneName with modified GeneID
    res$GeneName <- ifelse(
      res$GeneName == "" | is.na(res$GeneName),
      gsub("gene-", "", res$GeneID),
      res$GeneName
    )
    duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]
    #print(duplicated_genes)
    # [1] "bfr"  "lipA" "ahpF" "pcaF" "alr"  "pcaD" "cydB" "lpdA" "pgaC" "ppk1"
    #[11] "pcaF" "tuf"  "galE" "murI" "yccS" "rrf"  "rrf"  "arsB" "ptsP" "umuD"
    #[21] "map"  "pgaB" "rrf"  "rrf"  "rrf"  "pgaD" "uraH" "benE"
    #res[res$GeneName == "bfr", ]
    
    # Strategy 1: keep the first occurrence of each GeneName and remove subsequent duplicates
    #res <- res[!duplicated(res$GeneName), ]
    # Strategy 2 (used here): keep the row with the smallest padj for each GeneName (needs dplyr)
    library(dplyr)
    res <- res %>%
      group_by(GeneName) %>%
      slice_min(padj, with_ties = FALSE) %>%
      ungroup()
    res <- as.data.frame(res)
    # Sort res first by padj (ascending) and then by log2FoldChange (descending)
    res <- res[order(res$padj, -res$log2FoldChange), ]
    
    # Select DEGs from the deduplicated, sorted table; which() drops rows where padj or log2FoldChange is NA
    # Filter up-regulated genes: log2FoldChange > 2 and padj < 1e-2
    up_regulated <- res[which(res$log2FoldChange > 2 & res$padj < 1e-2), ]
    # Filter down-regulated genes: log2FoldChange < -2 and padj < 1e-2
    down_regulated <- res[which(res$log2FoldChange < -2 & res$padj < 1e-2), ]
    # Create a new workbook (createWorkbook, addWorksheet, writeData and saveWorkbook come from openxlsx)
    library(openxlsx)
    wb <- createWorkbook()
    # Add the complete dataset as the first sheet
    addWorksheet(wb, "Complete_Data")
    writeData(wb, "Complete_Data", res)
    # Add the up-regulated genes as the second sheet
    addWorksheet(wb, "Up_Regulated")
    writeData(wb, "Up_Regulated", up_regulated)
    # Add the down-regulated genes as the third sheet
    addWorksheet(wb, "Down_Regulated")
    writeData(wb, "Down_Regulated", down_regulated)
    # Save the workbook to a file
    saveWorkbook(wb, "Gene_Expression_Mac_vs_LB.xlsx", overwrite = TRUE)
    
    # Set the 'GeneName' column as row.names
    rownames(res) <- res$GeneName
    # Drop the 'GeneName' column since it's now the row names
    res$GeneName <- NULL
    head(res)
    
    ## Ensure the data frame matches the expected format
    ## For example, it should have columns: log2FoldChange, padj, etc.
    #res <- as.data.frame(res)
    ## Remove rows with NA in log2FoldChange (if needed)
    #res <- res[!is.na(res$log2FoldChange),]
    
    # Replace padj = 0 with a small value
    res$padj[res$padj == 0] <- 1e-150
    
    #library(EnhancedVolcano)
    # Assuming res is already sorted and processed
    png("Mac_vs_LB.png", width=1200, height=2000)
    #max.overlaps = 10
    EnhancedVolcano(res,
                    lab = rownames(res),
                    x = 'log2FoldChange',
                    y = 'padj',
                    pCutoff = 1e-2,
                    FCcutoff = 2,
                    title = '',
                    subtitleLabSize = 18,
                    pointSize = 3.0,
                    labSize = 5.0,
                    colAlpha = 1,
                    legendIconSize = 4.0,
                    drawConnectors = TRUE,
                    widthConnectors = 0.5,
                    colConnectors = 'black',
                    subtitle = expression("Mac versus LB"))
    dev.off()
    
    # ---- LB.AB_vs_LB.WT19606 ----
    res <- read.csv("LB.AB_vs_LB.WT19606-all.csv")
    # Replace empty GeneName with modified GeneID
    res$GeneName <- ifelse(
      res$GeneName == "" | is.na(res$GeneName),
      gsub("gene-", "", res$GeneID),
      res$GeneName
    )
    duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]
    
    res <- res %>%
      group_by(GeneName) %>%
      slice_min(padj, with_ties = FALSE) %>%
      ungroup()
    res <- as.data.frame(res)
    # Sort res first by padj (ascending) and then by log2FoldChange (descending)
    res <- res[order(res$padj, -res$log2FoldChange), ]
    
    # Select DEGs from the deduplicated, sorted table; which() drops rows where padj or log2FoldChange is NA
    # Filter up-regulated genes: log2FoldChange > 2 and padj < 1e-2
    up_regulated <- res[which(res$log2FoldChange > 2 & res$padj < 1e-2), ]
    # Filter down-regulated genes: log2FoldChange < -2 and padj < 1e-2
    down_regulated <- res[which(res$log2FoldChange < -2 & res$padj < 1e-2), ]
    # Create a new workbook
    wb <- createWorkbook()
    # Add the complete dataset as the first sheet
    addWorksheet(wb, "Complete_Data")
    writeData(wb, "Complete_Data", res)
    # Add the up-regulated genes as the second sheet
    addWorksheet(wb, "Up_Regulated")
    writeData(wb, "Up_Regulated", up_regulated)
    # Add the down-regulated genes as the third sheet
    addWorksheet(wb, "Down_Regulated")
    writeData(wb, "Down_Regulated", down_regulated)
    # Save the workbook to a file
    saveWorkbook(wb, "Gene_Expression_LB.AB_vs_LB.WT19606.xlsx", overwrite = TRUE)
    
    # Set the 'GeneName' column as row.names
    rownames(res) <- res$GeneName
    # Drop the 'GeneName' column since it's now the row names
    res$GeneName <- NULL
    head(res)
    
    ## Ensure the data frame matches the expected format
    ## For example, it should have columns: log2FoldChange, padj, etc.
    #res <- as.data.frame(res)
    ## Remove rows with NA in log2FoldChange (if needed)
    #res <- res[!is.na(res$log2FoldChange),]
    
    # Replace padj = 0 with a small value
    res$padj[res$padj == 0] <- 1e-12
    
    #library(EnhancedVolcano)
    # Assuming res is already sorted and processed
    png("LB.AB_vs_LB.WT19606.png", width=1200, height=1200)
    #max.overlaps = 10
    EnhancedVolcano(res,
                    lab = rownames(res),
                    x = 'log2FoldChange',
                    y = 'padj',
                    pCutoff = 1e-2,
                    FCcutoff = 2,
                    title = '',
                    subtitleLabSize = 18,
                    pointSize = 3.0,
                    labSize = 5.0,
                    colAlpha = 1,
                    legendIconSize = 4.0,
                    drawConnectors = TRUE,
                    widthConnectors = 0.5,
                    colConnectors = 'black',
                    subtitle = expression("LB.AB versus LB.WT19606"))
    dev.off()
    
    # ---- LB.IJ_vs_LB.WT19606 ----
    res <- read.csv("LB.IJ_vs_LB.WT19606-all.csv")
    # Replace empty GeneName with modified GeneID
    res$GeneName <- ifelse(
      res$GeneName == "" | is.na(res$GeneName),
      gsub("gene-", "", res$GeneID),
      res$GeneName
    )
    duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]
    
    res <- res %>%
      group_by(GeneName) %>%
      slice_min(padj, with_ties = FALSE) %>%
      ungroup()
    res <- as.data.frame(res)
    # Sort res first by padj (ascending) and then by log2FoldChange (descending)
    res <- res[order(res$padj, -res$log2FoldChange), ]
    
    # Select DEGs from the deduplicated, sorted table; which() drops rows where padj or log2FoldChange is NA
    # Filter up-regulated genes: log2FoldChange > 2 and padj < 1e-2
    up_regulated <- res[which(res$log2FoldChange > 2 & res$padj < 1e-2), ]
    # Filter down-regulated genes: log2FoldChange < -2 and padj < 1e-2
    down_regulated <- res[which(res$log2FoldChange < -2 & res$padj < 1e-2), ]
    # Create a new workbook
    wb <- createWorkbook()
    # Add the complete dataset as the first sheet
    addWorksheet(wb, "Complete_Data")
    writeData(wb, "Complete_Data", res)
    # Add the up-regulated genes as the second sheet
    addWorksheet(wb, "Up_Regulated")
    writeData(wb, "Up_Regulated", up_regulated)
    # Add the down-regulated genes as the third sheet
    addWorksheet(wb, "Down_Regulated")
    writeData(wb, "Down_Regulated", down_regulated)
    # Save the workbook to a file
    saveWorkbook(wb, "Gene_Expression_LB.IJ_vs_LB.WT19606.xlsx", overwrite = TRUE)
    
    # Set the 'GeneName' column as row.names
    rownames(res) <- res$GeneName
    # Drop the 'GeneName' column since it's now the row names
    res$GeneName <- NULL
    head(res)
    
    ## Ensure the data frame matches the expected format
    ## For example, it should have columns: log2FoldChange, padj, etc.
    #res <- as.data.frame(res)
    ## Remove rows with NA in log2FoldChange (if needed)
    #res <- res[!is.na(res$log2FoldChange),]
    
    # Replace padj = 0 with a small value
    res$padj[res$padj == 0] <- 1e-12
    
    #library(EnhancedVolcano)
    # Assuming res is already sorted and processed
    png("LB.IJ_vs_LB.WT19606.png", width=1200, height=1200)
    #max.overlaps = 10
    EnhancedVolcano(res,
                    lab = rownames(res),
                    x = 'log2FoldChange',
                    y = 'padj',
                    pCutoff = 1e-2,
                    FCcutoff = 2,
                    title = '',
                    subtitleLabSize = 18,
                    pointSize = 3.0,
                    labSize = 5.0,
                    colAlpha = 1,
                    legendIconSize = 4.0,
                    drawConnectors = TRUE,
                    widthConnectors = 0.5,
                    colConnectors = 'black',
                    subtitle = expression("LB.IJ versus LB.WT19606"))
    dev.off()
    
    # ---- LB.W1_vs_LB.WT19606 ----
    res <- read.csv("LB.W1_vs_LB.WT19606-all.csv")
    # Replace empty GeneName with modified GeneID
    res$GeneName <- ifelse(
      res$GeneName == "" | is.na(res$GeneName),
      gsub("gene-", "", res$GeneID),
      res$GeneName
    )
    duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]
    
    res <- res %>%
      group_by(GeneName) %>%
      slice_min(padj, with_ties = FALSE) %>%
      ungroup()
    res <- as.data.frame(res)
    # Sort res first by padj (ascending) and then by log2FoldChange (descending)
    res <- res[order(res$padj, -res$log2FoldChange), ]
    
    # Select DEGs from the deduplicated, sorted table; which() drops rows where padj or log2FoldChange is NA
    # Filter up-regulated genes: log2FoldChange > 2 and padj < 1e-2
    up_regulated <- res[which(res$log2FoldChange > 2 & res$padj < 1e-2), ]
    # Filter down-regulated genes: log2FoldChange < -2 and padj < 1e-2
    down_regulated <- res[which(res$log2FoldChange < -2 & res$padj < 1e-2), ]
    # Create a new workbook
    wb <- createWorkbook()
    # Add the complete dataset as the first sheet
    addWorksheet(wb, "Complete_Data")
    writeData(wb, "Complete_Data", res)
    # Add the up-regulated genes as the second sheet
    addWorksheet(wb, "Up_Regulated")
    writeData(wb, "Up_Regulated", up_regulated)
    # Add the down-regulated genes as the third sheet
    addWorksheet(wb, "Down_Regulated")
    writeData(wb, "Down_Regulated", down_regulated)
    # Save the workbook to a file
    saveWorkbook(wb, "Gene_Expression_LB.W1_vs_LB.WT19606.xlsx", overwrite = TRUE)
    
    # Set the 'GeneName' column as row.names
    rownames(res) <- res$GeneName
    # Drop the 'GeneName' column since it's now the row names
    res$GeneName <- NULL
    head(res)
    
    ## Ensure the data frame matches the expected format
    ## For example, it should have columns: log2FoldChange, padj, etc.
    #res <- as.data.frame(res)
    ## Remove rows with NA in log2FoldChange (if needed)
    #res <- res[!is.na(res$log2FoldChange),]
    
    # Replace padj = 0 with a small value
    res$padj[res$padj == 0] <- 1e-12
    
    #library(EnhancedVolcano)
    # Assuming res is already sorted and processed
    png("LB.W1_vs_LB.WT19606.png", width=1200, height=1200)
    #max.overlaps = 10
    EnhancedVolcano(res,
                    lab = rownames(res),
                    x = 'log2FoldChange',
                    y = 'padj',
                    pCutoff = 1e-2,
                    FCcutoff = 2,
                    title = '',
                    subtitleLabSize = 18,
                    pointSize = 3.0,
                    labSize = 5.0,
                    colAlpha = 1,
                    legendIconSize = 4.0,
                    drawConnectors = TRUE,
                    widthConnectors = 0.5,
                    colConnectors = 'black',
                    subtitle = expression("LB.W1 versus LB.WT19606"))
    dev.off()
    
    # ---- LB.Y1_vs_LB.WT19606 ----
    res <- read.csv("LB.Y1_vs_LB.WT19606-all.csv")
    # Replace empty GeneName with modified GeneID
    res$GeneName <- ifelse(
      res$GeneName == "" | is.na(res$GeneName),
      gsub("gene-", "", res$GeneID),
      res$GeneName
    )
    duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]
    
    res <- res %>%
      group_by(GeneName) %>%
      slice_min(padj, with_ties = FALSE) %>%
      ungroup()
    res <- as.data.frame(res)
    # Sort res first by padj (ascending) and then by log2FoldChange (descending)
    res <- res[order(res$padj, -res$log2FoldChange), ]
    
    # Select DEGs from the deduplicated, sorted table; which() drops rows where padj or log2FoldChange is NA
    # Filter up-regulated genes: log2FoldChange > 2 and padj < 1e-2
    up_regulated <- res[which(res$log2FoldChange > 2 & res$padj < 1e-2), ]
    # Filter down-regulated genes: log2FoldChange < -2 and padj < 1e-2
    down_regulated <- res[which(res$log2FoldChange < -2 & res$padj < 1e-2), ]
    # Create a new workbook
    wb <- createWorkbook()
    # Add the complete dataset as the first sheet
    addWorksheet(wb, "Complete_Data")
    writeData(wb, "Complete_Data", res)
    # Add the up-regulated genes as the second sheet
    addWorksheet(wb, "Up_Regulated")
    writeData(wb, "Up_Regulated", up_regulated)
    # Add the down-regulated genes as the third sheet
    addWorksheet(wb, "Down_Regulated")
    writeData(wb, "Down_Regulated", down_regulated)
    # Save the workbook to a file
    saveWorkbook(wb, "Gene_Expression_LB.Y1_vs_LB.WT19606.xlsx", overwrite = TRUE)
    
    # Set the 'GeneName' column as row.names
    rownames(res) <- res$GeneName
    # Drop the 'GeneName' column since it's now the row names
    res$GeneName <- NULL
    head(res)
    
    ## Ensure the data frame matches the expected format
    ## For example, it should have columns: log2FoldChange, padj, etc.
    #res <- as.data.frame(res)
    ## Remove rows with NA in log2FoldChange (if needed)
    #res <- res[!is.na(res$log2FoldChange),]
    
    # Replace padj = 0 with a small value
    res$padj[res$padj == 0] <- 1e-12
    
    #library(EnhancedVolcano)
    # Assuming res is already sorted and processed
    png("LB.Y1_vs_LB.WT19606.png", width=1200, height=1200)
    #max.overlaps = 10
    EnhancedVolcano(res,
                    lab = rownames(res),
                    x = 'log2FoldChange',
                    y = 'padj',
                    pCutoff = 1e-2,
                    FCcutoff = 2,
                    title = '',
                    subtitleLabSize = 18,
                    pointSize = 3.0,
                    labSize = 5.0,
                    colAlpha = 1,
                    legendIconSize = 4.0,
                    drawConnectors = TRUE,
                    widthConnectors = 0.5,
                    colConnectors = 'black',
                    subtitle = expression("LB.Y1 versus LB.WT19606"))
    dev.off()
    
    # ---- Mac.AB_vs_Mac.WT19606 ----
    res <- read.csv("Mac.AB_vs_Mac.WT19606-all.csv")
    # Replace empty GeneName with modified GeneID
    res$GeneName <- ifelse(
      res$GeneName == "" | is.na(res$GeneName),
      gsub("gene-", "", res$GeneID),
      res$GeneName
    )
    duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]
    
    res <- res %>%
      group_by(GeneName) %>%
      slice_min(padj, with_ties = FALSE) %>%
      ungroup()
    res <- as.data.frame(res)
    # Sort res first by padj (ascending) and then by log2FoldChange (descending)
    res <- res[order(res$padj, -res$log2FoldChange), ]
    
    # Select DEGs from the deduplicated, sorted table; which() drops rows where padj or log2FoldChange is NA
    # Filter up-regulated genes: log2FoldChange > 2 and padj < 1e-2
    up_regulated <- res[which(res$log2FoldChange > 2 & res$padj < 1e-2), ]
    # Filter down-regulated genes: log2FoldChange < -2 and padj < 1e-2
    down_regulated <- res[which(res$log2FoldChange < -2 & res$padj < 1e-2), ]
    # Create a new workbook
    wb <- createWorkbook()
    # Add the complete dataset as the first sheet
    addWorksheet(wb, "Complete_Data")
    writeData(wb, "Complete_Data", res)
    # Add the up-regulated genes as the second sheet
    addWorksheet(wb, "Up_Regulated")
    writeData(wb, "Up_Regulated", up_regulated)
    # Add the down-regulated genes as the third sheet
    addWorksheet(wb, "Down_Regulated")
    writeData(wb, "Down_Regulated", down_regulated)
    # Save the workbook to a file
    saveWorkbook(wb, "Gene_Expression_Mac.AB_vs_Mac.WT19606.xlsx", overwrite = TRUE)
    
    # Set the 'GeneName' column as row.names
    rownames(res) <- res$GeneName
    # Drop the 'GeneName' column since it's now the row names
    res$GeneName <- NULL
    head(res)
    
    ## Ensure the data frame matches the expected format
    ## For example, it should have columns: log2FoldChange, padj, etc.
    #res <- as.data.frame(res)
    ## Remove rows with NA in log2FoldChange (if needed)
    #res <- res[!is.na(res$log2FoldChange),]
    
    # Replace padj = 0 with a small value
    res$padj[res$padj == 0] <- 1e-12
    
    #library(EnhancedVolcano)
    # Assuming res is already sorted and processed
    png("Mac.AB_vs_Mac.WT19606.png", width=1200, height=1200)
    #max.overlaps = 10
    EnhancedVolcano(res,
                    lab = rownames(res),
                    x = 'log2FoldChange',
                    y = 'padj',
                    pCutoff = 1e-2,
                    FCcutoff = 2,
                    title = '',
                    subtitleLabSize = 18,
                    pointSize = 3.0,
                    labSize = 5.0,
                    colAlpha = 1,
                    legendIconSize = 4.0,
                    drawConnectors = TRUE,
                    widthConnectors = 0.5,
                    colConnectors = 'black',
                    subtitle = expression("Mac.AB versus Mac.WT19606"))
    dev.off()
    
    # ---- Mac.IJ_vs_Mac.WT19606 ----
    res <- read.csv("Mac.IJ_vs_Mac.WT19606-all.csv")
    # Replace empty GeneName with modified GeneID
    res$GeneName <- ifelse(
      res$GeneName == "" | is.na(res$GeneName),
      gsub("gene-", "", res$GeneID),
      res$GeneName
    )
    duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]
    
    res <- res %>%
      group_by(GeneName) %>%
      slice_min(padj, with_ties = FALSE) %>%
      ungroup()
    res <- as.data.frame(res)
    # Sort res first by padj (ascending) and then by log2FoldChange (descending)
    res <- res[order(res$padj, -res$log2FoldChange), ]
    
    # Select DEGs from the deduplicated, sorted table; which() drops rows where padj or log2FoldChange is NA
    # Filter up-regulated genes: log2FoldChange > 2 and padj < 1e-2
    up_regulated <- res[which(res$log2FoldChange > 2 & res$padj < 1e-2), ]
    # Filter down-regulated genes: log2FoldChange < -2 and padj < 1e-2
    down_regulated <- res[which(res$log2FoldChange < -2 & res$padj < 1e-2), ]
    # Create a new workbook
    wb <- createWorkbook()
    # Add the complete dataset as the first sheet
    addWorksheet(wb, "Complete_Data")
    writeData(wb, "Complete_Data", res)
    # Add the up-regulated genes as the second sheet
    addWorksheet(wb, "Up_Regulated")
    writeData(wb, "Up_Regulated", up_regulated)
    # Add the down-regulated genes as the third sheet
    addWorksheet(wb, "Down_Regulated")
    writeData(wb, "Down_Regulated", down_regulated)
    # Save the workbook to a file
    saveWorkbook(wb, "Gene_Expression_Mac.IJ_vs_Mac.WT19606.xlsx", overwrite = TRUE)
    
    # Set the 'GeneName' column as row.names
    rownames(res) <- res$GeneName
    # Drop the 'GeneName' column since it's now the row names
    res$GeneName <- NULL
    head(res)
    
    ## Ensure the data frame matches the expected format
    ## For example, it should have columns: log2FoldChange, padj, etc.
    #res <- as.data.frame(res)
    ## Remove rows with NA in log2FoldChange (if needed)
    #res <- res[!is.na(res$log2FoldChange),]
    
    # Replace padj = 0 with a small value
    res$padj[res$padj == 0] <- 1e-12
    
    #library(EnhancedVolcano)
    # Assuming res is already sorted and processed
    png("Mac.IJ_vs_Mac.WT19606.png", width=1200, height=1200)
    #max.overlaps = 10
    EnhancedVolcano(res,
                    lab = rownames(res),
                    x = 'log2FoldChange',
                    y = 'padj',
                    pCutoff = 1e-2,
                    FCcutoff = 2,
                    title = '',
                    subtitleLabSize = 18,
                    pointSize = 3.0,
                    labSize = 5.0,
                    colAlpha = 1,
                    legendIconSize = 4.0,
                    drawConnectors = TRUE,
                    widthConnectors = 0.5,
                    colConnectors = 'black',
                    subtitle = expression("Mac.IJ versus Mac.WT19606"))
    dev.off()
    
    # ---- Mac.W1_vs_Mac.WT19606 ----
    res <- read.csv("Mac.W1_vs_Mac.WT19606-all.csv")
    # Replace empty GeneName with modified GeneID
    res$GeneName <- ifelse(
      res$GeneName == "" | is.na(res$GeneName),
      gsub("gene-", "", res$GeneID),
      res$GeneName
    )
    duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]
    
    res <- res %>%
      group_by(GeneName) %>%
      slice_min(padj, with_ties = FALSE) %>%
      ungroup()
    res <- as.data.frame(res)
    # Sort res first by padj (ascending) and then by log2FoldChange (descending)
    res <- res[order(res$padj, -res$log2FoldChange), ]
    
    # Select DEGs from the deduplicated, sorted table; which() drops rows where padj or log2FoldChange is NA
    # Filter up-regulated genes: log2FoldChange > 2 and padj < 1e-2
    up_regulated <- res[which(res$log2FoldChange > 2 & res$padj < 1e-2), ]
    # Filter down-regulated genes: log2FoldChange < -2 and padj < 1e-2
    down_regulated <- res[which(res$log2FoldChange < -2 & res$padj < 1e-2), ]
    # Create a new workbook
    wb <- createWorkbook()
    # Add the complete dataset as the first sheet
    addWorksheet(wb, "Complete_Data")
    writeData(wb, "Complete_Data", res)
    # Add the up-regulated genes as the second sheet
    addWorksheet(wb, "Up_Regulated")
    writeData(wb, "Up_Regulated", up_regulated)
    # Add the down-regulated genes as the third sheet
    addWorksheet(wb, "Down_Regulated")
    writeData(wb, "Down_Regulated", down_regulated)
    # Save the workbook to a file
    saveWorkbook(wb, "Gene_Expression_Mac.W1_vs_Mac.WT19606.xlsx", overwrite = TRUE)
    
    # Set the 'GeneName' column as row.names
    rownames(res) <- res$GeneName
    # Drop the 'GeneName' column since it's now the row names
    res$GeneName <- NULL
    head(res)
    
    ## Ensure the data frame matches the expected format
    ## For example, it should have columns: log2FoldChange, padj, etc.
    #res <- as.data.frame(res)
    ## Remove rows with NA in log2FoldChange (if needed)
    #res <- res[!is.na(res$log2FoldChange),]
    
    # Replace padj = 0 with a small value
    res$padj[res$padj == 0] <- 1e-12
    
    #library(EnhancedVolcano)
    # Assuming res is already sorted and processed
    png("Mac.W1_vs_Mac.WT19606.png", width=1200, height=1200)
    #max.overlaps = 10
    EnhancedVolcano(res,
                    lab = rownames(res),
                    x = 'log2FoldChange',
                    y = 'padj',
                    pCutoff = 1e-2,
                    FCcutoff = 2,
                    title = '',
                    subtitleLabSize = 18,
                    pointSize = 3.0,
                    labSize = 5.0,
                    colAlpha = 1,
                    legendIconSize = 4.0,
                    drawConnectors = TRUE,
                    widthConnectors = 0.5,
                    colConnectors = 'black',
                    subtitle = expression("Mac.W1 versus Mac.WT19606"))
    dev.off()
    
    # ---- Mac.Y1_vs_Mac.WT19606 ----
    res <- read.csv("Mac.Y1_vs_Mac.WT19606-all.csv")
    # Replace empty GeneName with modified GeneID
    res$GeneName <- ifelse(
      res$GeneName == "" | is.na(res$GeneName),
      gsub("gene-", "", res$GeneID),
      res$GeneName
    )
    duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]
    
    res <- res %>%
      group_by(GeneName) %>%
      slice_min(padj, with_ties = FALSE) %>%
      ungroup()
    res <- as.data.frame(res)
    # Sort res first by padj (ascending) and then by log2FoldChange (descending)
    res <- res[order(res$padj, -res$log2FoldChange), ]
    
    # Select DEGs from the deduplicated, sorted table; which() drops rows where padj or log2FoldChange is NA
    # Filter up-regulated genes: log2FoldChange > 2 and padj < 1e-2
    up_regulated <- res[which(res$log2FoldChange > 2 & res$padj < 1e-2), ]
    # Filter down-regulated genes: log2FoldChange < -2 and padj < 1e-2
    down_regulated <- res[which(res$log2FoldChange < -2 & res$padj < 1e-2), ]
    # Create a new workbook
    wb <- createWorkbook()
    # Add the complete dataset as the first sheet
    addWorksheet(wb, "Complete_Data")
    writeData(wb, "Complete_Data", res)
    # Add the up-regulated genes as the second sheet
    addWorksheet(wb, "Up_Regulated")
    writeData(wb, "Up_Regulated", up_regulated)
    # Add the down-regulated genes as the third sheet
    addWorksheet(wb, "Down_Regulated")
    writeData(wb, "Down_Regulated", down_regulated)
    # Save the workbook to a file
    saveWorkbook(wb, "Gene_Expression_Mac.Y1_vs_Mac.WT19606.xlsx", overwrite = TRUE)
    
    # Set the 'GeneName' column as row.names
    rownames(res) <- res$GeneName
    # Drop the 'GeneName' column since it's now the row names
    res$GeneName <- NULL
    head(res)
    
    ## Ensure the data frame matches the expected format
    ## For example, it should have columns: log2FoldChange, padj, etc.
    #res <- as.data.frame(res)
    ## Remove rows with NA in log2FoldChange (if needed)
    #res <- res[!is.na(res$log2FoldChange),]
    
    # Replace padj = 0 with a small value
    res$padj[res$padj == 0] <- 1e-12
    
    #library(EnhancedVolcano)
    # Assuming res is already sorted and processed
    png("Mac.Y1_vs_Mac.WT19606.png", width=1200, height=1200)
    #max.overlaps = 10
    EnhancedVolcano(res,
                    lab = rownames(res),
                    x = 'log2FoldChange',
                    y = 'padj',
                    pCutoff = 1e-2,
                    FCcutoff = 2,
                    title = '',
                    subtitleLabSize = 18,
                    pointSize = 3.0,
                    labSize = 5.0,
                    colAlpha = 1,
                    legendIconSize = 4.0,
                    drawConnectors = TRUE,
                    widthConnectors = 0.5,
                    colConnectors = 'black',
                    subtitle = expression("Mac.Y1 versus Mac.WT19606"))
    dev.off()
    
    #TODO: annotate the Gene_Expression_xxx_vs_yyy.xlsx
  10. Cluster the genes and draw the heatmap

    #http://xgenes.com/article/article-content/150/draw-venn-diagrams-using-matplotlib/
    #http://xgenes.com/article/article-content/276/go-terms-for-s-epidermidis/
    
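    # Save the IDs of the up- and down-regulated genes to -up.id and -down.id files.
    # The loop only prints the cut commands; paste them into the shell (or pipe the loop output to bash) to run them.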
    for i in Mac_vs_LB LB.AB_vs_LB.WT19606 LB.IJ_vs_LB.WT19606 LB.W1_vs_LB.WT19606 LB.Y1_vs_LB.WT19606 Mac.AB_vs_Mac.WT19606 Mac.IJ_vs_Mac.WT19606 Mac.W1_vs_Mac.WT19606 Mac.Y1_vs_Mac.WT19606; do
      echo "cut -d',' -f1-1 ${i}-up.txt > ${i}-up.id";
      echo "cut -d',' -f1-1 ${i}-down.txt > ${i}-down.id";
    done
    #5 LB.AB_vs_LB.WT19606-down.id
    #20 LB.AB_vs_LB.WT19606-up.id
    #64 LB.IJ_vs_LB.WT19606-down.id
    #69 LB.IJ_vs_LB.WT19606-up.id
    #23 LB.W1_vs_LB.WT19606-down.id
    #97 LB.W1_vs_LB.WT19606-up.id
    #9 LB.Y1_vs_LB.WT19606-down.id
    #20 LB.Y1_vs_LB.WT19606-up.id
    #20 Mac.AB_vs_Mac.WT19606-down.id
    #29 Mac.AB_vs_Mac.WT19606-up.id
    #65 Mac.IJ_vs_Mac.WT19606-down.id
    #197 Mac.IJ_vs_Mac.WT19606-up.id
    #359 Mac_vs_LB-down.id
    #308 Mac_vs_LB-up.id
    #290 Mac.W1_vs_Mac.WT19606-down.id
    #343 Mac.W1_vs_Mac.WT19606-up.id
    #75 Mac.Y1_vs_Mac.WT19606-down.id
    #0 Mac.Y1_vs_Mac.WT19606.png-down.id
    #0 Mac.Y1_vs_Mac.WT19606.png-up.id
    #68 Mac.Y1_vs_Mac.WT19606-up.id
    #2061 total
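The ID extraction can also be run directly instead of printing the commands first. A minimal sketch, assuming the `-up.txt` and `-down.txt` tables are comma-separated with the gene ID in the first column (as the `cut -d','` calls above imply); the function name `extract_ids` is hypothetical. It skips missing inputs, so a stray prefix does not leave empty `.id` files behind. The demo uses a throwaway table, not real data.

```shell
# Hypothetical helper: write column 1 of <prefix>-up.txt and <prefix>-down.txt
# to <prefix>-up.id and <prefix>-down.id. Missing input files are skipped.
extract_ids() {
  prefix=$1
  for direction in up down; do
    if [ -f "${prefix}-${direction}.txt" ]; then
      cut -d',' -f1 "${prefix}-${direction}.txt" > "${prefix}-${direction}.id"
    else
      echo "skipping ${prefix}-${direction}.txt (not found)" >&2
    fi
  done
}

# Demo on a small made-up table in a temporary directory
tmpdir=$(mktemp -d)
printf '"","baseMean"\n"gene-A",10\n"gene-B",20\n' > "${tmpdir}/demo-up.txt"
extract_ids "${tmpdir}/demo"
cat "${tmpdir}/demo-up.id"
```

The header entry and the double quotes are kept as they are, so the resulting `.id` files still need the cleanup described for the merged `ids` file.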
    
    cat *.id | sort -u > ids
    # Manually edit the file "ids": delete the header entry ("GeneName"), remove the double quotes,
    # and add "Gene_Id" as the first line. GeneID rather than GeneName is used as the index, since the .txt files contain only GeneID.
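The manual cleanup of `ids` could be scripted. A minimal sketch, assuming the header entries in the `.id` files are either empty (`""`) or `GeneName`; that assumption, the function name `build_ids` and the demo file names are hypothetical and should be checked against the real files.

```shell
# Hypothetical helper: merge .id files into one list with a Gene_Id header.
# Usage: build_ids <output file> <id file> [<id file> ...]
build_ids() {
  out=$1
  shift
  {
    echo "Gene_Id"
    # Strip double quotes, drop empty lines and "GeneName" header lines, deduplicate
    cat "$@" | tr -d '"' | awk '$0 != "" && $0 != "GeneName"' | sort -u
  } > "$out"
}

# Demo with two made-up .id files in a temporary directory
tmpdir=$(mktemp -d)
printf '""\n"gene-A"\n"gene-B"\n' > "${tmpdir}/x-up.id"
printf '""\n"gene-B"\n"gene-C"\n' > "${tmpdir}/x-down.id"
build_ids "${tmpdir}/ids" "${tmpdir}"/*.id
cat "${tmpdir}/ids"
```

The output file has a single `Gene_Id` column, which is the layout that `read.csv("ids")$Gene_Id` expects.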
    GOI <- read.csv("ids")$Gene_Id    #1329
    RNASeq.NoCellLine <- assay(rld)
    #install.packages("gplots")
    library("gplots")
    # hclust linkage methods: "ward.D", "ward.D2", "single", "complete", "average" (= UPGMA), "mcquitty" (= WPGMA), "median" (= WPGMC) or "centroid" (= UPGMC); correlation method for cor(): "pearson" or "spearman"
    datamat = RNASeq.NoCellLine[GOI, ]
    #datamat = RNASeq.NoCellLine
    write.csv(as.data.frame(datamat), file ="DEGs_heatmap_expression_data.txt")
    
    constant_rows <- apply(datamat, 1, function(row) var(row) == 0)
    if(any(constant_rows)) {
      cat("Removing", sum(constant_rows), "constant rows.\n")
      datamat <- datamat[!constant_rows, ]
    }
    hr <- hclust(as.dist(1-cor(t(datamat), method="pearson")), method="complete")
    hc <- hclust(as.dist(1-cor(datamat, method="spearman")), method="complete")
    mycl = cutree(hr, h=max(hr$height)/1.15)
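    # One colour per cluster; "LIGHTRED" is not a valid R colour name and "MAGENTA" was listed twice, so both entries are replaced
    mycol = c("YELLOW", "BLUE", "ORANGE", "MAGENTA", "CYAN", "RED", "GREEN", "MAROON", "LIGHTBLUE", "PINK", "PURPLE", "LIGHTCYAN", "SALMON", "LIGHTGREEN")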
    mycol = mycol[as.vector(mycl)]
    #png("DEGs_heatmap.png", width=900, height=800)
    #cex.lab=10, labRow="",
    png("DEGs_heatmap.png", width=1200, height=1000)
    heatmap.2(as.matrix(datamat),Rowv=as.dendrogram(hr),Colv = NA, dendrogram = 'row',labRow="",
                scale='row',trace='none',col=bluered(75), cexCol=1.8,
                RowSideColors = mycol, margins=c(10,2), cexRow=1.5, srtCol=30, lhei = c(1, 8), lwid=c(2, 8))  #rownames(datamat)
    #heatmap.2(datamat, Rowv=as.dendrogram(hr), col=bluered(75), scale="row", RowSideColors=mycol, trace="none", margin=c(5,5), sepwidth=c(0,0), dendrogram = 'row', Colv = 'false', density.info='none', labRow="", srtCol=30, lhei=c(0.1,2))
    dev.off()
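    #### cluster members #####
    # Note: the file names use DARK* colour labels, while mycol draws clusters 2-5 in BLUE, ORANGE, MAGENTA and CYAN.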
    write.csv(names(subset(mycl, mycl == '1')),file='cluster1_YELLOW.txt')
    write.csv(names(subset(mycl, mycl == '2')),file='cluster2_DARKBLUE.txt')
    write.csv(names(subset(mycl, mycl == '3')),file='cluster3_DARKORANGE.txt')
    write.csv(names(subset(mycl, mycl == '4')),file='cluster4_DARKMAGENTA.txt')
    write.csv(names(subset(mycl, mycl == '5')),file='cluster5_DARKCYAN.txt')
    #~/Tools/csv2xls-0.4/csv_to_xls.py cluster*.txt -d',' -o DEGs_heatmap_cluster_members.xls
    #~/Tools/csv2xls-0.4/csv_to_xls.py DEGs_heatmap_expression_data.txt -d',' -o DEGs_heatmap_expression_data.xls;
    
    #### (NOT_WORKING) cluster members with annotations. This does not work for this bacterium: it is not a model species, so mart=ensembl cannot be used. #####
    # Each assignment below overwrites data, so run one subset_N/data pair at a time before the annotation loop.
    subset_1<-names(subset(mycl, mycl == '1'))
    data <- as.data.frame(datamat[rownames(datamat) %in% subset_1, ])  #2575
    subset_2<-names(subset(mycl, mycl == '2'))
    data <- as.data.frame(datamat[rownames(datamat) %in% subset_2, ])  #1855
    subset_3<-names(subset(mycl, mycl == '3'))
    data <- as.data.frame(datamat[rownames(datamat) %in% subset_3, ])  #217
    subset_4<-names(subset(mycl, mycl == '4'))
    data <- as.data.frame(datamat[rownames(datamat) %in% subset_4, ])  #
    subset_5<-names(subset(mycl, mycl == '5'))
    data <- as.data.frame(datamat[rownames(datamat) %in% subset_5, ])  #
    # Initialize an empty data frame for the annotated data
    annotated_data <- data.frame()
    # Determine total number of genes
    total_genes <- length(rownames(data))
    # Loop through each gene to annotate
    for (i in 1:total_genes) {
        gene <- rownames(data)[i]
        result <- getBM(attributes = c('ensembl_gene_id', 'external_gene_name', 'gene_biotype', 'entrezgene_id', 'chromosome_name', 'start_position', 'end_position', 'strand', 'description'),
                        filters = 'ensembl_gene_id',
                        values = gene,
                        mart = ensembl)
        # If multiple rows are returned, take the first one
        if (nrow(result) > 1) {
            result <- result[1, ]
        }
        # Check if the result is empty
        if (nrow(result) == 0) {
            result <- data.frame(ensembl_gene_id = gene,
                                external_gene_name = NA,
                                gene_biotype = NA,
                                entrezgene_id = NA,
                                chromosome_name = NA,
                                start_position = NA,
                                end_position = NA,
                                strand = NA,
                                description = NA)
        }
        # Transpose expression values
        expression_values <- t(data.frame(t(data[gene, ])))
        colnames(expression_values) <- colnames(data)
        # Combine gene information and expression data
        combined_result <- cbind(result, expression_values)
        # Append to the final dataframe
        annotated_data <- rbind(annotated_data, combined_result)
        # Print progress every 100 genes
        if (i %% 100 == 0) {
            cat(sprintf("Processed gene %d out of %d\n", i, total_genes))
        }
    }
    # Save the annotated data to a new CSV file; run only the line that matches the cluster held in 'data'
    write.csv(annotated_data, "cluster1_YELLOW.csv", row.names=FALSE)
    write.csv(annotated_data, "cluster2_DARKBLUE.csv", row.names=FALSE)
    write.csv(annotated_data, "cluster3_DARKORANGE.csv", row.names=FALSE)
    write.csv(annotated_data, "cluster4_DARKMAGENTA.csv", row.names=FALSE)
    write.csv(annotated_data, "cluster5_DARKCYAN.csv", row.names=FALSE)
    #~/Tools/csv2xls-0.4/csv_to_xls.py cluster*.csv -d',' -o DEGs_heatmap_clusters.xls
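
The clustering and per-cluster export above can be condensed into a loop. The following is a minimal sketch on a simulated matrix; the matrix `toy`, the fixed two-cluster cut (`k = 2` instead of the height cut used above), and the output file names in a temporary directory are assumptions for illustration, not values from the analysis.

```r
set.seed(1)
# Simulated expression matrix: 6 genes x 4 samples with two opposite patterns
toy <- rbind(
  matrix(rep(c(1, 2, 8, 9), 3), nrow = 3, byrow = TRUE),
  matrix(rep(c(9, 8, 2, 1), 3), nrow = 3, byrow = TRUE)
) + matrix(rnorm(24, sd = 0.1), nrow = 6)
rownames(toy) <- paste0("gene", 1:6)
colnames(toy) <- paste0("sample", 1:4)

# Drop constant rows first: their correlation with other rows is undefined (NA)
toy <- toy[apply(toy, 1, sd) > 0, , drop = FALSE]

# Row clustering on Pearson correlation distance, as in the heatmap code
hr_toy <- hclust(as.dist(1 - cor(t(toy), method = "pearson")), method = "complete")
cl <- cutree(hr_toy, k = 2)  # assumption: fixed number of clusters

# Write one member file per cluster instead of one write.csv() call per cluster
out_dir <- tempdir()
for (k in sort(unique(cl))) {
  members <- names(cl)[cl == k]
  writeLines(members, file.path(out_dir, sprintf("cluster%d_members.txt", k)))
}
split(names(cl), cl)
```

Looping over `unique(cl)` keeps the export in step with however many clusters the cut produces, so no file is written for a cluster that does not exist.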

KEGG and GO annotations in non-model organisms

https://www.biobam.com/functional-analysis/

Blast2GO_workflow

  1. Assign KEGG and GO Terms (see diagram above)

    Since your organism is non-model, standard R databases (org.Hs.eg.db, etc.) won’t work. You’ll need to manually retrieve KEGG and GO annotations.

    Option 1 (KEGG Terms): EggNOG-mapper, based on orthology and phylogenies

    EggNOG-mapper assigns both KEGG Orthology (KO) IDs and GO terms.
    
    Install EggNOG-mapper:
    
        mamba create -n eggnog_env python=3.8 eggnog-mapper -c conda-forge -c bioconda  #eggnog-mapper_2.1.12
        mamba activate eggnog_env
    
    Run annotation:
    
        #diamond makedb --in eggnog6.prots.faa -d eggnog_proteins.dmnd
        mkdir /home/jhuang/mambaforge/envs/eggnog_env/lib/python3.8/site-packages/data/
        download_eggnog_data.py --dbname eggnog.db -y --data_dir /home/jhuang/mambaforge/envs/eggnog_env/lib/python3.8/site-packages/data/
        #NOT_WORKING: emapper.py -i CP059040_gene.fasta -o eggnog_dmnd_out --cpu 60 -m diamond[hmmer,mmseqs] --dmnd_db /home/jhuang/REFs/eggnog_data/data/eggnog_proteins.dmnd
        python ~/Scripts/update_fasta_header.py CP059040_protein_.fasta CP059040_protein.fasta
        emapper.py -i CP059040_protein.fasta -o eggnog_out --cpu 60 --resume
        #----> result eggnog_out.emapper.annotations: contains KEGG, GO, and other functional annotations.
        #----> Example seed ortholog ID 470.IX87_14445:
        #    * 470 is the NCBI taxonomy ID of the source organism (Acinetobacter baumannii).
        #    * IX87_14445 is the locus tag of a specific gene or protein within that genome.
    
    Extract KEGG KO IDs from the KEGG_ko column of eggnog_out.emapper.annotations.
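
A minimal sketch of that extraction in base R follows. It assumes the annotation table has already been read into a data frame with the columns `query` and `KEGG_ko` (the column names used later in this post); the gene IDs and KO numbers below are placeholder values, not results.

```r
# Toy stand-in for the emapper annotation table; real files have many more columns
eggnog_toy <- data.frame(
  query   = c("H0N29_00005", "H0N29_00010", "H0N29_00015"),
  KEGG_ko = c("ko:K02313", "-", "ko:K02338,ko:K03629"),
  stringsAsFactors = FALSE
)

# Keep annotated genes only ("-" marks a missing annotation)
annotated <- eggnog_toy[!is.na(eggnog_toy$KEGG_ko) & eggnog_toy$KEGG_ko != "-", ]

# Split comma-separated KO lists and strip the "ko:" prefix
ko_split <- strsplit(annotated$KEGG_ko, ",", fixed = TRUE)
gene2ko <- data.frame(
  query = rep(annotated$query, lengths(ko_split)),
  KO    = sub("^ko:", "", unlist(ko_split)),
  stringsAsFactors = FALSE
)
gene2ko
```

The long one-row-per-KO layout makes it easy to pass `unique(gene2ko$KO)` to an enrichment function and to map enriched KO IDs back to gene IDs afterwards.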

    Option 2 (GO Terms from ‘Blast2GO 5 Basic’, saved in blast2go_annot.annot): Using Blast/Diamond + Blast2GO_GUI based on sequence alignment + GO mapping

    • ‘Load protein sequences’ (Tags: NONE, generated columns: Nr, SeqName) –>
    • Button ‘blast’ (Tags: BLASTED, generated columns: Description, Length, #Hits, e-Value, sim mean),
    • Button ‘mapping’ (Tags: MAPPED, generated columns: #GO, GO IDs, GO Names), “Mapping finished – Please proceed now to annotation.”
    • Button ‘annot’ (Tags: ANNOTATED, generated columns: Enzyme Codes, Enzyme Names), “Annotation finished.”

      • Parameter ‘Annotation CutOff’: the Blast2GO annotation rule seeks the most specific GO annotations at a given level of reliability. An annotation score is calculated for each candidate GO term from the sequence similarity of the BLAST hit, the evidence code of the source GO term, and the position of the term in the Gene Ontology hierarchy. The cutoff selects, for each GO branch, the most specific GO term whose score lies above this value.
      • Parameter ‘GO Weight’: a value added to the annotation score of a more general GO term for each of its more specific source GO terms. General terms that summarise many source terms (those directly associated with the BLAST hits) therefore receive a higher annotation score.

      or blast2go_cli_v1.5.1 (NOT_USED)

      #https://help.biobam.com/space/BCD/2250407989/Installation
      #see ~/Scripts/blast2go_pipeline.sh
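
To make the interplay of ‘Annotation CutOff’ and ‘GO Weight’ concrete, here is a small worked sketch. It assumes the additive form of the score described in the Blast2GO publication (a direct term, the best similarity weighted by evidence code, plus an abstraction term that grows with the number of more specific source terms); all numbers, including the evidence-code weights and parameter values, are hypothetical and should be checked against the Blast2GO documentation.

```r
# Hypothetical BLAST hits supporting one candidate GO term
hits <- data.frame(
  similarity = c(82, 74, 60),     # hit similarity in percent
  ec_weight  = c(0.7, 1.0, 0.8)   # assumed evidence-code weights
)
n_source_terms <- 3   # assumed number of GO terms unified at this node
go_weight      <- 5   # assumed value of the 'GO Weight' parameter
cutoff         <- 55  # assumed value of the 'Annotation CutOff' parameter

# Direct term: best similarity after weighting by evidence code
direct_term <- max(hits$similarity * hits$ec_weight)
# Abstraction term: bonus for summarising several more specific terms
abstraction_term <- (n_source_terms - 1) * go_weight

annotation_score <- direct_term + abstraction_term
passes_cutoff <- annotation_score >= cutoff
c(score = annotation_score, passes = passes_cutoff)
```

Under these assumptions a higher ‘GO Weight’ favours general terms, while a higher ‘Annotation CutOff’ makes every annotation harder to obtain.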

    Option 3 (GO Terms from ‘Blast2GO 5 Basic’, saved in blast2go_annot.annot2): Interpro based protein families / domains –> Button interpro

    • Button ‘interpro’ (Tags: INTERPRO, generated columns: InterPro IDs, InterPro GO IDs, InterPro GO Names) –> “InterProScan Finished – You can now merge the obtained GO Annotations.”

    MERGE the results of InterPro GO IDs (Option 3) to GO IDs (Option 2) and generate final GO IDs

    • Button ‘interpro’/’Merge InterProScan GOs to Annotation’ –> “Merge (add and validate) all GO terms retrieved via InterProScan to the already existing GO annotation.” –> “Finished merging GO terms from InterPro with annotations. Maybe you want to run ANNEX (Annotation Augmentation).”
    #• Button ‘annot’/’ANNEX’ –> “ANNEX finished. Maybe you want to do the next step: Enzyme Code Mapping.”

        #-- before merging (blast2go_annot.annot) --
        #H0N29_18790 GO:0004842 ankyrin repeat domain-containing protein
        #H0N29_18790 GO:0085020
        #-- after merging (blast2go_annot.annot2) --
        #H0N29_18790 GO:0031436 ankyrin repeat domain-containing protein
        #H0N29_18790 GO:0070531
        #H0N29_18790 GO:0004842
        #H0N29_18790 GO:0005515
        #H0N29_18790 GO:0085020
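
The effect of the merge on H0N29_18790 can be reproduced as a plain set union. This is only a sketch: the split of terms between the BLAST-based and InterPro-based sets is inferred from the before/after listing above, and the validation step that Blast2GO performs during merging is not modelled.

```r
# GO IDs for H0N29_18790 before merging (BLAST-based annotation)
annot_blast <- data.frame(
  GeneID = "H0N29_18790",
  Term   = c("GO:0004842", "GO:0085020"),
  stringsAsFactors = FALSE
)
# Terms assumed to come from InterProScan (present only after merging)
annot_interpro <- data.frame(
  GeneID = "H0N29_18790",
  Term   = c("GO:0031436", "GO:0070531", "GO:0005515"),
  stringsAsFactors = FALSE
)

# Union per gene: stack both tables and drop duplicated (GeneID, Term) pairs
annot_merged <- unique(rbind(annot_blast, annot_interpro))

# Terms gained through the merge
added_terms <- setdiff(annot_merged$Term, annot_blast$Term)
added_terms
```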

    Option 4 (NOT_USED): RFAM for non-coding RNA

    Option 5 (NOT_USED): PSORTb for subcellular localizations

    Option 6 (NOT_USED): KAAS (KEGG Automatic Annotation Server)

    • Go to KAAS
    • Upload your FASTA file.
    • Select an appropriate gene set.
    • Download the KO assignments.
  2. Find the Closest KEGG Organism Code (NOT_USED)

    Since your species isn’t directly in KEGG, use a closely related organism.

    • Check available KEGG organisms:

      library(clusterProfiler)
      library(KEGGREST)
      
      kegg_organisms <- keggList("organism")
      
      # Pick the closest relative (e.g., zebrafish "dre" for fish, Arabidopsis "ath" for plants).
      
      # Search for Acinetobacter in the list
      grep("Acinetobacter", kegg_organisms, ignore.case = TRUE, value = TRUE)
      # Gammaproteobacteria
      #Extract KO IDs from the eggnog results for  "Acinetobacter baumannii strain ATCC 19606"
  3. Find the Closest KEGG Organism for a Non-Model Species

    If your organism is not in KEGG, search for the closest relative:

        grep("fish", kegg_organisms, ignore.case = TRUE, value = TRUE)  # Example search

    For KEGG pathway enrichment in non-model species, use “ko” instead of a species code (this call is integrated into step 4):

        kegg_enrich <- enrichKEGG(gene = gene_list, organism = "ko")  # "ko" = KEGG Orthology
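
Enrichment functions such as enrichKEGG() and enricher() perform an over-representation analysis: for each pathway or term, the p-value is the hypergeometric probability of seeing at least the observed number of hits by chance. The sketch below reproduces that calculation for a single term in base R; all four counts are made-up numbers for illustration.

```r
N <- 3000  # genes in the background (universe)
M <- 40    # background genes annotated to the term
n <- 150   # genes in the query list (e.g. up-regulated genes)
k <- 8     # query genes annotated to the term

# P(X >= k): probability of k or more annotated genes in a random list of size n
p_value <- phyper(k - 1, M, N - M, n, lower.tail = FALSE)

# The same test expressed as a one-sided Fisher's exact test on the 2x2 table
tab <- matrix(c(k, n - k, M - k, N - M - n + k), nrow = 2)
p_fisher <- fisher.test(tab, alternative = "greater")$p.value

c(hypergeometric = p_value, fisher = p_fisher)
```

Because the background size N enters the calculation directly, the choice of universe (all annotated genes versus all tested genes) changes the resulting p-values.
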
  4. Perform KEGG and GO Enrichment in R (under dir ~/DATA/Data_Tam_RNAseq_2025_LB_vs_Mac_ATCC19606/results/star_salmon/degenes)

        #BiocManager::install("GO.db")
        #BiocManager::install("AnnotationDbi")
    
        # Load required libraries
        library(openxlsx)  # For Excel file handling
        library(dplyr)     # For data manipulation
        library(tidyr)
        library(stringr)
        library(clusterProfiler)  # For KEGG and GO enrichment analysis
        #library(org.Hs.eg.db)  # Replace with appropriate organism database
        library(GO.db)
        library(AnnotationDbi)
    
        setwd("~/DATA/Data_Tam_RNAseq_2025_LB_vs_Mac_ATCC19606/results/star_salmon/degenes")
        # PREPARING go_terms and ec_terms: annot_* file: cut -f1-2 -d$'\t' blast2go_annot.annot2 > blast2go_annot.annot2_
        # Step 1: Load the blast2go annotation file with a check for missing columns
        annot_df <- read.table("/home/jhuang/b2gWorkspace_Tam_RNAseq_2024/blast2go_annot.annot2_",
                            header = FALSE, sep = "\t", stringsAsFactors = FALSE, fill = TRUE)
    
        # After `cut -f1-2` the file has exactly 2 columns; keep only those in case fill = TRUE padded extra ones
        annot_df <- annot_df[, 1:2]
        colnames(annot_df) <- c("GeneID", "Term")
        # Step 2: Filter and aggregate GO and EC terms as before
        go_terms <- annot_df %>%
        filter(grepl("^GO:", Term)) %>%
        group_by(GeneID) %>%
        summarize(GOs = paste(Term, collapse = ","), .groups = "drop")
        ec_terms <- annot_df %>%
        filter(grepl("^EC:", Term)) %>%
        group_by(GeneID) %>%
        summarize(EC = paste(Term, collapse = ","), .groups = "drop")
    
        # Load the results
        #res <- read.csv("Mac_vs_LB-all.csv")     #up307, down358
        #res <- read.csv("LB.AB_vs_LB.WT19606-all.csv")     #up307, down358
        #res <- read.csv("LB.IJ_vs_LB.WT19606-all.csv")     #up307, down358
        #res <- read.csv("LB.W1_vs_LB.WT19606-all.csv")     #up307, down358
        #res <- read.csv("LB.Y1_vs_LB.WT19606-all.csv")     #up307, down358
        #res <- read.csv("Mac.AB_vs_Mac.WT19606-all.csv")     #up307, down358
        #res <- read.csv("Mac.IJ_vs_Mac.WT19606-all.csv")     #up307, down358
        #res <- read.csv("Mac.W1_vs_Mac.WT19606-all.csv")     #up307, down358
        res <- read.csv("Mac.Y1_vs_Mac.WT19606-all.csv")     #up307, down358
    
        # Replace empty GeneName with modified GeneID
        res$GeneName <- ifelse(
            res$GeneName == "" | is.na(res$GeneName),
            gsub("gene-", "", res$GeneID),
            res$GeneName
        )
    
        # Remove duplicated genes by selecting the gene with the smallest padj
        duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]
    
        res <- res %>%
        group_by(GeneName) %>%
        slice_min(padj, with_ties = FALSE) %>%
        ungroup()
    
        res <- as.data.frame(res)
        # Sort res first by padj (ascending) and then by log2FoldChange (descending)
        res <- res[order(res$padj, -res$log2FoldChange), ]
        # Read eggnog annotations
        eggnog_data <- read.delim("~/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine_ATCC19606/eggnog_out.emapper.annotations.txt", header = TRUE, sep = "\t")
        # Remove the "gene-" prefix from GeneID in res to match eggnog 'query' format
        res$GeneID <- gsub("gene-", "", res$GeneID)
        # Merge eggnog data with res based on GeneID
        res <- res %>% left_join(eggnog_data, by = c("GeneID" = "query"))
    
        # Merge with the res dataframe
        # Perform the left joins and rename columns
        res_updated <- res %>%
        left_join(go_terms, by = "GeneID") %>%
        left_join(ec_terms, by = "GeneID") %>% dplyr::select(-EC.x, -GOs.x) %>% dplyr::rename(EC = EC.y, GOs = GOs.y)
    
        # Filter up-regulated genes (which() drops rows where padj or log2FoldChange is NA)
        up_regulated <- res_updated[which(res_updated$log2FoldChange > 2 & res_updated$padj < 0.01), ]
        # Filter down-regulated genes
        down_regulated <- res_updated[which(res_updated$log2FoldChange < -2 & res_updated$padj < 0.01), ]
    
        # Create a new workbook
        wb <- createWorkbook()
        # Add the complete dataset as the first sheet (with annotations)
        addWorksheet(wb, "Complete_Data")
        writeData(wb, "Complete_Data", res_updated)
        # Add the up-regulated genes as the second sheet (with annotations)
        addWorksheet(wb, "Up_Regulated")
        writeData(wb, "Up_Regulated", up_regulated)
        # Add the down-regulated genes as the third sheet (with annotations)
        addWorksheet(wb, "Down_Regulated")
        writeData(wb, "Down_Regulated", down_regulated)
        # Save the workbook to a file
        saveWorkbook(wb, "Gene_Expression_with_Annotations_Urine_vs_MHB.xlsx", overwrite = TRUE)
    
        # Set GeneName as row names after the join
        rownames(res_updated) <- res_updated$GeneName
        res_updated <- res_updated %>% dplyr::select(-GeneName)
        ## Set the 'GeneName' column as row.names
        #rownames(res_updated) <- res_updated$GeneName
        ## Drop the 'GeneName' column since it's now the row names
        #res_updated$GeneName <- NULL
        # -- BREAK_1 --
    
        # ---- Perform KEGG enrichment analysis (up_regulated) ----
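        gene_list_kegg_up <- up_regulated$KEGG_ko
        gene_list_kegg_up <- gsub("ko:", "", gene_list_kegg_up)
        # A gene can carry several KOs ("K00001,K00002") or none ("-"): split the lists and drop placeholders
        gene_list_kegg_up <- unique(unlist(strsplit(gene_list_kegg_up[!is.na(gene_list_kegg_up)], ",")))
        gene_list_kegg_up <- gene_list_kegg_up[gene_list_kegg_up != "-"]
        kegg_enrichment_up <- enrichKEGG(gene = gene_list_kegg_up, organism = 'ko')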
        # -- convert the GeneID (Kxxxxxx) to the true GeneID --
        # Step 0: Create KEGG to GeneID mapping
        kegg_to_geneid_up <- up_regulated %>%
        dplyr::select(KEGG_ko, GeneID) %>%
        filter(!is.na(KEGG_ko)) %>%  # Remove missing KEGG KO entries
        mutate(KEGG_ko = str_remove(KEGG_ko, "ko:"))  # Remove 'ko:' prefix if present
        # Step 1: Clean KEGG_ko values (separate multiple KEGG IDs)
        kegg_to_geneid_clean <- kegg_to_geneid_up %>%
        mutate(KEGG_ko = str_remove_all(KEGG_ko, "ko:")) %>%  # Remove 'ko:' prefixes
        separate_rows(KEGG_ko, sep = ",") %>%  # Ensure each KEGG ID is on its own row
        filter(KEGG_ko != "-") %>%  # Remove invalid KEGG IDs ("-")
        distinct()  # Remove any duplicate mappings
        # Step 2.1: Expand geneID column in kegg_enrichment_up
        expanded_kegg <- kegg_enrichment_up %>%
        as.data.frame() %>%
        separate_rows(geneID, sep = "/") %>%  # Split multiple KEGG IDs (Kxxxxx)
        left_join(kegg_to_geneid_clean, by = c("geneID" = "KEGG_ko"), relationship = "many-to-many") %>%  # Explicitly handle many-to-many
        distinct() %>%  # Remove duplicate matches
        group_by(ID) %>%
        summarise(across(everything(), ~ paste(unique(na.omit(.)), collapse = "/")), .groups = "drop")  # Re-collapse results
        #dplyr::glimpse(expanded_kegg)
        # Step 3.1: Replace geneID column in the original dataframe
        kegg_enrichment_up_df <- as.data.frame(kegg_enrichment_up)
        # Remove old geneID column and merge new one
        kegg_enrichment_up_df <- kegg_enrichment_up_df %>%
        dplyr::select(-geneID) %>%  # Remove old geneID column
        left_join(expanded_kegg %>% dplyr::select(ID, GeneID), by = "ID") %>%  # Merge new GeneID column
        dplyr::rename(geneID = GeneID)  # Rename column back to geneID
    
        # ---- Perform KEGG enrichment analysis (down_regulated) ----
        # Step 1: Extract KEGG KO terms from down-regulated genes
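        gene_list_kegg_down <- down_regulated$KEGG_ko
        gene_list_kegg_down <- gsub("ko:", "", gene_list_kegg_down)
        # A gene can carry several KOs ("K00001,K00002") or none ("-"): split the lists and drop placeholders
        gene_list_kegg_down <- unique(unlist(strsplit(gene_list_kegg_down[!is.na(gene_list_kegg_down)], ",")))
        gene_list_kegg_down <- gene_list_kegg_down[gene_list_kegg_down != "-"]
        # Step 2: Perform KEGG enrichment analysis
        kegg_enrichment_down <- enrichKEGG(gene = gene_list_kegg_down, organism = 'ko')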
        # --- Convert KEGG gene IDs (Kxxxxxx) to actual GeneIDs ---
        # Step 3: Create KEGG to GeneID mapping from down_regulated dataset
        kegg_to_geneid_down <- down_regulated %>%
        dplyr::select(KEGG_ko, GeneID) %>%
        filter(!is.na(KEGG_ko)) %>%  # Remove missing KEGG KO entries
        mutate(KEGG_ko = str_remove(KEGG_ko, "ko:"))  # Remove 'ko:' prefix if present
        # -- BREAK_2 --
    
        # Step 4: Clean KEGG_ko values (handle multiple KEGG IDs)
        kegg_to_geneid_down_clean <- kegg_to_geneid_down %>%
        mutate(KEGG_ko = str_remove_all(KEGG_ko, "ko:")) %>%  # Remove 'ko:' prefixes
        separate_rows(KEGG_ko, sep = ",") %>%  # Ensure each KEGG ID is on its own row
        filter(KEGG_ko != "-") %>%  # Remove invalid KEGG IDs ("-")
        distinct()  # Remove duplicate mappings
        # Step 5: Expand geneID column in kegg_enrichment_down
        expanded_kegg_down <- kegg_enrichment_down %>%
        as.data.frame() %>%
        separate_rows(geneID, sep = "/") %>%  # Split multiple KEGG IDs (Kxxxxx)
        left_join(kegg_to_geneid_down_clean, by = c("geneID" = "KEGG_ko"), relationship = "many-to-many") %>%  # Handle many-to-many mappings
        distinct() %>%  # Remove duplicate matches
        group_by(ID) %>%
        summarise(across(everything(), ~ paste(unique(na.omit(.)), collapse = "/")), .groups = "drop")  # Re-collapse results
        # Step 6: Replace geneID column in the original kegg_enrichment_down dataframe
        kegg_enrichment_down_df <- as.data.frame(kegg_enrichment_down) %>%
        dplyr::select(-geneID) %>%  # Remove old geneID column
        left_join(expanded_kegg_down %>% dplyr::select(ID, GeneID), by = "ID") %>%  # Merge new GeneID column
        dplyr::rename(geneID = GeneID)  # Rename column back to geneID
        # View the updated dataframe
        head(kegg_enrichment_down_df)
    
        # Create a new workbook
        wb <- createWorkbook()
        # Save enrichment results to the workbook
        addWorksheet(wb, "KEGG_Enrichment_Up")
        writeData(wb, "KEGG_Enrichment_Up", as.data.frame(kegg_enrichment_up_df))
        # Save enrichment results to the workbook
        addWorksheet(wb, "KEGG_Enrichment_Down")
        writeData(wb, "KEGG_Enrichment_Down", as.data.frame(kegg_enrichment_down_df))
    
        # Define gene list (up-regulated genes)
        gene_list_go_up <- up_regulated$GeneID  # Extract the 149 up-regulated genes
        gene_list_go_down <- down_regulated$GeneID  # Extract the 65 down-regulated genes
    
        # Define background gene set (all genes in res)
        background_genes <- res_updated$GeneID  # Extract the 3646 background genes
    
        # Prepare GO annotation data from res
        go_annotation <- res_updated[, c("GOs","GeneID")]  # Extract relevant columns
        go_annotation <- go_annotation %>%
        tidyr::separate_rows(GOs, sep = ",")  # Split multiple GO terms into separate rows
        # -- BREAK_3 --
    
        go_enrichment_up <- enricher(
            gene = gene_list_go_up,                # Up-regulated genes
            TERM2GENE = go_annotation,       # Custom GO annotation
            pvalueCutoff = 0.05,             # Significance threshold
            pAdjustMethod = "BH",
            universe = background_genes      # Define the background gene set
        )
        go_enrichment_up <- as.data.frame(go_enrichment_up)
    
        go_enrichment_down <- enricher(
            gene = gene_list_go_down,                # Down-regulated genes
            TERM2GENE = go_annotation,       # Custom GO annotation
            pvalueCutoff = 0.05,             # Significance threshold
            pAdjustMethod = "BH",
            universe = background_genes      # Define the background gene set
        )
        go_enrichment_down <- as.data.frame(go_enrichment_down)
    
        ## Remove the 'p.adjust' column since no adjusted methods have been applied --> In this version we have used pvalue filtering (see above)!
        #go_enrichment_up <- go_enrichment_up[, !names(go_enrichment_up) %in% "p.adjust"]
        # Update the Description column with the term descriptions
        go_enrichment_up$Description <- sapply(go_enrichment_up$ID, function(go_id) {
        # Using select to get the term description
        term <- tryCatch({
            AnnotationDbi::select(GO.db, keys = go_id, columns = "TERM", keytype = "GOID")
        }, error = function(e) {
            message(paste("Error for GO term:", go_id))  # Print which GO ID caused the error
            return(data.frame(TERM = NA))  # In case of error, return NA
        })
    
        if (nrow(term) > 0) {
            return(term$TERM)
        } else {
            return(NA)  # If no description found, return NA
        }
        })
        ## Print the updated data frame
        #print(go_enrichment_up)
    
        ## Remove the 'p.adjust' column since no adjusted methods have been applied --> In this version we have used pvalue filtering (see above)!
        #go_enrichment_down <- go_enrichment_down[, !names(go_enrichment_down) %in% "p.adjust"]
        # Update the Description column with the term descriptions
        go_enrichment_down$Description <- sapply(go_enrichment_down$ID, function(go_id) {
        # Using select to get the term description
        term <- tryCatch({
            AnnotationDbi::select(GO.db, keys = go_id, columns = "TERM", keytype = "GOID")
        }, error = function(e) {
            message(paste("Error for GO term:", go_id))  # Print which GO ID caused the error
            return(data.frame(TERM = NA))  # In case of error, return NA
        })
    
        if (nrow(term) > 0) {
            return(term$TERM)
        } else {
            return(NA)  # If no description found, return NA
        }
        })
    
        addWorksheet(wb, "GO_Enrichment_Up")
        writeData(wb, "GO_Enrichment_Up", as.data.frame(go_enrichment_up))
    
        addWorksheet(wb, "GO_Enrichment_Down")
        writeData(wb, "GO_Enrichment_Down", as.data.frame(go_enrichment_down))
    
        # Save the workbook with enrichment results
        saveWorkbook(wb, "KEGG_and_GO_Enrichments_Urine_vs_MHB.xlsx", overwrite = TRUE)
    
        #Error for GO term: GO:0006807: replace "GO:0006807 obsolete nitrogen compound metabolic process"
        #obsolete nitrogen compound metabolic process #https://www.ebi.ac.uk/QuickGO/term/GO:0006807
        #TODO: mark cells in yellow if p.adjust <= 0.05 in the GO enrichment sheets!
    
        #mv KEGG_and_GO_Enrichments_Urine_vs_MHB.xlsx KEGG_and_GO_Enrichments_Mac_vs_LB.xlsx
        #Mac_vs_LB
        #LB.AB_vs_LB.WT19606
        #LB.IJ_vs_LB.WT19606
        #LB.W1_vs_LB.WT19606
        #LB.Y1_vs_LB.WT19606
        #Mac.AB_vs_Mac.WT19606
        #Mac.IJ_vs_Mac.WT19606
        #Mac.W1_vs_Mac.WT19606
        #Mac.Y1_vs_Mac.WT19606
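
The TODO about highlighting significant GO terms could be handled with openxlsx cell styles. The following is a minimal sketch on a toy table; the data frame `go_toy`, the workbook `wb_toy`, and the output file in a temporary directory are assumptions for illustration, and the same pattern would be applied to the real enrichment sheets before saveWorkbook().

```r
library(openxlsx)

# Toy stand-in for a GO enrichment table (placeholder IDs and values)
go_toy <- data.frame(
  ID       = c("GO:0000001", "GO:0000002", "GO:0000003"),
  pvalue   = c(0.001, 0.020, 0.040),
  p.adjust = c(0.003, 0.030, 0.060),
  stringsAsFactors = FALSE
)

wb_toy <- createWorkbook()
addWorksheet(wb_toy, "GO_Enrichment_Up")
writeData(wb_toy, "GO_Enrichment_Up", go_toy)

# Sheet row 1 holds the header, so data row i sits in sheet row i + 1
rows_to_mark <- which(go_toy$p.adjust <= 0.05) + 1
padj_col <- which(names(go_toy) == "p.adjust")

yellow <- createStyle(fgFill = "yellow")
if (length(rows_to_mark) > 0) {
  addStyle(wb_toy, "GO_Enrichment_Up", style = yellow,
           rows = rows_to_mark, cols = padj_col, gridExpand = TRUE)
}

out_file <- file.path(tempdir(), "GO_highlight_demo.xlsx")
saveWorkbook(wb_toy, out_file, overwrite = TRUE)
```
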
  5. (DEBUG) Draw the Venn diagram to compare the total DEGs across AUM, Urine, and MHB, irrespective of up- or down-regulation.

            library(openxlsx)
    
            # Function to read and clean gene ID files
            read_gene_ids <- function(file_path) {
                # Read the gene IDs from the file
                gene_ids <- readLines(file_path)

                # Remove any quotes and trim whitespace
                gene_ids <- gsub('"', '', gene_ids)  # Remove quotes
                gene_ids <- trimws(gene_ids)  # Trim whitespace

                # Remove empty entries or NAs
                gene_ids <- gene_ids[gene_ids != "" & !is.na(gene_ids)]

                return(gene_ids)
            }
    
            # Example list of LB files with both -up.id and -down.id for each condition
            lb_files_up <- c("LB.AB_vs_LB.WT19606-up.id", "LB.IJ_vs_LB.WT19606-up.id",
                            "LB.W1_vs_LB.WT19606-up.id", "LB.Y1_vs_LB.WT19606-up.id")
            lb_files_down <- c("LB.AB_vs_LB.WT19606-down.id", "LB.IJ_vs_LB.WT19606-down.id",
                            "LB.W1_vs_LB.WT19606-down.id", "LB.Y1_vs_LB.WT19606-down.id")
    
            # Combine both up and down files for each condition
            lb_files <- c(lb_files_up, lb_files_down)
    
            # Read gene IDs for each file in LB group
            #lb_degs <- setNames(lapply(lb_files, read_gene_ids), gsub("-(up|down).id", "", lb_files))
            lb_degs <- setNames(lapply(lb_files, read_gene_ids), make.unique(gsub("-(up|down).id", "", lb_files)))
    
            lb_degs_ <- list()
            combined_set <- c(lb_degs[["LB.AB_vs_LB.WT19606"]], lb_degs[["LB.AB_vs_LB.WT19606.1"]])
            #unique_combined_set <- unique(combined_set)
            lb_degs_$AB <- combined_set
            combined_set <- c(lb_degs[["LB.IJ_vs_LB.WT19606"]], lb_degs[["LB.IJ_vs_LB.WT19606.1"]])
            lb_degs_$IJ <- combined_set
            combined_set <- c(lb_degs[["LB.W1_vs_LB.WT19606"]], lb_degs[["LB.W1_vs_LB.WT19606.1"]])
            lb_degs_$W1 <- combined_set
            combined_set <- c(lb_degs[["LB.Y1_vs_LB.WT19606"]], lb_degs[["LB.Y1_vs_LB.WT19606.1"]])
            lb_degs_$Y1 <- combined_set
    
            # Example list of Mac files with both -up.id and -down.id for each condition
            mac_files_up <- c("Mac.AB_vs_Mac.WT19606-up.id", "Mac.IJ_vs_Mac.WT19606-up.id",
                            "Mac.W1_vs_Mac.WT19606-up.id", "Mac.Y1_vs_Mac.WT19606-up.id")
            mac_files_down <- c("Mac.AB_vs_Mac.WT19606-down.id", "Mac.IJ_vs_Mac.WT19606-down.id",
                            "Mac.W1_vs_Mac.WT19606-down.id", "Mac.Y1_vs_Mac.WT19606-down.id")
    
            # Combine both up and down files for each condition in Mac group
            mac_files <- c(mac_files_up, mac_files_down)
    
            # Read gene IDs for each file in Mac group
            mac_degs <- setNames(lapply(mac_files, read_gene_ids), make.unique(gsub("-(up|down).id", "", mac_files)))
    
            mac_degs_ <- list()
            combined_set <- c(mac_degs[["Mac.AB_vs_Mac.WT19606"]], mac_degs[["Mac.AB_vs_Mac.WT19606.1"]])
            mac_degs_$AB <- combined_set
            combined_set <- c(mac_degs[["Mac.IJ_vs_Mac.WT19606"]], mac_degs[["Mac.IJ_vs_Mac.WT19606.1"]])
            mac_degs_$IJ <- combined_set
            combined_set <- c(mac_degs[["Mac.W1_vs_Mac.WT19606"]], mac_degs[["Mac.W1_vs_Mac.WT19606.1"]])
            mac_degs_$W1 <- combined_set
            combined_set <- c(mac_degs[["Mac.Y1_vs_Mac.WT19606"]], mac_degs[["Mac.Y1_vs_Mac.WT19606.1"]])
            mac_degs_$Y1 <- combined_set
    
            # Function to clean sheet names to ensure no sheet name exceeds 31 characters
            truncate_sheet_name <- function(names_list) {
                sapply(names_list, function(name) {
                    if (nchar(name) > 31) {
                        return(substr(name, 1, 31))  # Truncate sheet name to 31 characters
                    }
                    return(name)
                })
            }
    
            # Assuming lb_degs_ is already a list of gene sets (LB.AB, LB.IJ, etc.)
    
            # Define intersections between different conditions for LB
            inter_lb_ab_ij <- intersect(lb_degs_$AB, lb_degs_$IJ)
            inter_lb_ab_w1 <- intersect(lb_degs_$AB, lb_degs_$W1)
            inter_lb_ab_y1 <- intersect(lb_degs_$AB, lb_degs_$Y1)
            inter_lb_ij_w1 <- intersect(lb_degs_$IJ, lb_degs_$W1)
            inter_lb_ij_y1 <- intersect(lb_degs_$IJ, lb_degs_$Y1)
            inter_lb_w1_y1 <- intersect(lb_degs_$W1, lb_degs_$Y1)
    
            # Define intersections between three conditions for LB
            inter_lb_ab_ij_w1 <- Reduce(intersect, list(lb_degs_$AB, lb_degs_$IJ, lb_degs_$W1))
            inter_lb_ab_ij_y1 <- Reduce(intersect, list(lb_degs_$AB, lb_degs_$IJ, lb_degs_$Y1))
            inter_lb_ab_w1_y1 <- Reduce(intersect, list(lb_degs_$AB, lb_degs_$W1, lb_degs_$Y1))
            inter_lb_ij_w1_y1 <- Reduce(intersect, list(lb_degs_$IJ, lb_degs_$W1, lb_degs_$Y1))
    
            # Define intersection between all four conditions for LB
            inter_lb_ab_ij_w1_y1 <- Reduce(intersect, list(lb_degs_$AB, lb_degs_$IJ, lb_degs_$W1, lb_degs_$Y1))
    
            # Now remove the intersected genes from each original set for LB
            venn_list_lb <- list()
    
            # For LB.AB, remove genes that are also in other conditions
            venn_list_lb[["LB.AB_only"]] <- setdiff(lb_degs_$AB, union(inter_lb_ab_ij, union(inter_lb_ab_w1, inter_lb_ab_y1)))
    
            # For LB.IJ, remove genes that are also in other conditions
            venn_list_lb[["LB.IJ_only"]] <- setdiff(lb_degs_$IJ, union(inter_lb_ab_ij, union(inter_lb_ij_w1, inter_lb_ij_y1)))
    
            # For LB.W1, remove genes that are also in other conditions
            venn_list_lb[["LB.W1_only"]] <- setdiff(lb_degs_$W1, union(inter_lb_ab_w1, union(inter_lb_ij_w1, inter_lb_w1_y1)))
    
            # For LB.Y1, remove genes that are also in other conditions
            venn_list_lb[["LB.Y1_only"]] <- setdiff(lb_degs_$Y1, union(inter_lb_ab_y1, union(inter_lb_ij_y1, inter_lb_w1_y1)))
    
            # Add the intersections for LB (same as before)
            venn_list_lb[["LB.AB_AND_LB.IJ"]] <- inter_lb_ab_ij
            venn_list_lb[["LB.AB_AND_LB.W1"]] <- inter_lb_ab_w1
            venn_list_lb[["LB.AB_AND_LB.Y1"]] <- inter_lb_ab_y1
            venn_list_lb[["LB.IJ_AND_LB.W1"]] <- inter_lb_ij_w1
            venn_list_lb[["LB.IJ_AND_LB.Y1"]] <- inter_lb_ij_y1
            venn_list_lb[["LB.W1_AND_LB.Y1"]] <- inter_lb_w1_y1
    
            # Define intersections between three conditions for LB
            venn_list_lb[["LB.AB_AND_LB.IJ_AND_LB.W1"]] <- inter_lb_ab_ij_w1
            venn_list_lb[["LB.AB_AND_LB.IJ_AND_LB.Y1"]] <- inter_lb_ab_ij_y1
            venn_list_lb[["LB.AB_AND_LB.W1_AND_LB.Y1"]] <- inter_lb_ab_w1_y1
            venn_list_lb[["LB.IJ_AND_LB.W1_AND_LB.Y1"]] <- inter_lb_ij_w1_y1
    
            # Define intersection between all four conditions for LB
            venn_list_lb[["LB.AB_AND_LB.IJ_AND_LB.W1_AND_LB.Y1"]] <- inter_lb_ab_ij_w1_y1
    
            # Assuming mac_degs_ is already a list of gene sets (Mac.AB, Mac.IJ, etc.)
    
            # Define intersections between different conditions
            inter_mac_ab_ij <- intersect(mac_degs_$AB, mac_degs_$IJ)
            inter_mac_ab_w1 <- intersect(mac_degs_$AB, mac_degs_$W1)
            inter_mac_ab_y1 <- intersect(mac_degs_$AB, mac_degs_$Y1)
            inter_mac_ij_w1 <- intersect(mac_degs_$IJ, mac_degs_$W1)
            inter_mac_ij_y1 <- intersect(mac_degs_$IJ, mac_degs_$Y1)
            inter_mac_w1_y1 <- intersect(mac_degs_$W1, mac_degs_$Y1)
    
            # Define intersections between three conditions
            inter_mac_ab_ij_w1 <- Reduce(intersect, list(mac_degs_$AB, mac_degs_$IJ, mac_degs_$W1))
            inter_mac_ab_ij_y1 <- Reduce(intersect, list(mac_degs_$AB, mac_degs_$IJ, mac_degs_$Y1))
            inter_mac_ab_w1_y1 <- Reduce(intersect, list(mac_degs_$AB, mac_degs_$W1, mac_degs_$Y1))
            inter_mac_ij_w1_y1 <- Reduce(intersect, list(mac_degs_$IJ, mac_degs_$W1, mac_degs_$Y1))
    
            # Define intersection between all four conditions
            inter_mac_ab_ij_w1_y1 <- Reduce(intersect, list(mac_degs_$AB, mac_degs_$IJ, mac_degs_$W1, mac_degs_$Y1))
    
            # Now remove the intersected genes from each original set
            venn_list_mac <- list()
    
            # For Mac.AB, remove genes that are also in other conditions
            venn_list_mac[["Mac.AB_only"]] <- setdiff(mac_degs_$AB, union(inter_mac_ab_ij, union(inter_mac_ab_w1, inter_mac_ab_y1)))
    
            # For Mac.IJ, remove genes that are also in other conditions
            venn_list_mac[["Mac.IJ_only"]] <- setdiff(mac_degs_$IJ, union(inter_mac_ab_ij, union(inter_mac_ij_w1, inter_mac_ij_y1)))
    
            # For Mac.W1, remove genes that are also in other conditions
            venn_list_mac[["Mac.W1_only"]] <- setdiff(mac_degs_$W1, union(inter_mac_ab_w1, union(inter_mac_ij_w1, inter_mac_w1_y1)))
    
            # For Mac.Y1, remove genes that are also in other conditions
            venn_list_mac[["Mac.Y1_only"]] <- setdiff(mac_degs_$Y1, union(inter_mac_ab_y1, union(inter_mac_ij_y1, inter_mac_w1_y1)))
    
            # Add the intersections (same as before)
            venn_list_mac[["Mac.AB_AND_Mac.IJ"]] <- inter_mac_ab_ij
            venn_list_mac[["Mac.AB_AND_Mac.W1"]] <- inter_mac_ab_w1
            venn_list_mac[["Mac.AB_AND_Mac.Y1"]] <- inter_mac_ab_y1
            venn_list_mac[["Mac.IJ_AND_Mac.W1"]] <- inter_mac_ij_w1
            venn_list_mac[["Mac.IJ_AND_Mac.Y1"]] <- inter_mac_ij_y1
            venn_list_mac[["Mac.W1_AND_Mac.Y1"]] <- inter_mac_w1_y1
    
            # Define intersections between three conditions
            venn_list_mac[["Mac.AB_AND_Mac.IJ_AND_Mac.W1"]] <- inter_mac_ab_ij_w1
            venn_list_mac[["Mac.AB_AND_Mac.IJ_AND_Mac.Y1"]] <- inter_mac_ab_ij_y1
            venn_list_mac[["Mac.AB_AND_Mac.W1_AND_Mac.Y1"]] <- inter_mac_ab_w1_y1
            venn_list_mac[["Mac.IJ_AND_Mac.W1_AND_Mac.Y1"]] <- inter_mac_ij_w1_y1
    
            # Define intersection between all four conditions
            venn_list_mac[["Mac.AB_AND_Mac.IJ_AND_Mac.W1_AND_Mac.Y1"]] <- inter_mac_ab_ij_w1_y1
    
            # Save the gene IDs to Excel for further inspection (optional)
            write.xlsx(lb_degs, file = "LB_DEGs.xlsx")
            write.xlsx(mac_degs, file = "Mac_DEGs.xlsx")
    
            # Clean sheet names and write the Venn intersection sets for LB and Mac groups into Excel files
            write.xlsx(venn_list_lb, file = "Venn_LB_Genes_Intersect.xlsx", sheetName = truncate_sheet_name(names(venn_list_lb)), rowNames = FALSE)
            write.xlsx(venn_list_mac, file = "Venn_Mac_Genes_Intersect.xlsx", sheetName = truncate_sheet_name(names(venn_list_mac)), rowNames = FALSE)
    
            # Venn Diagram for LB group
            venn1 <- ggvenn(lb_degs_,
                            fill_color = c("skyblue", "tomato", "gold", "orchid"),
                            stroke_size = 0.4,
                            set_name_size = 5)
            ggsave("Venn_LB_Genes.png", plot = venn1, width = 7, height = 7, dpi = 300)
    
            # Venn Diagram for Mac group
            venn2 <- ggvenn(mac_degs_,
                            fill_color = c("lightgreen", "slateblue", "plum", "orange"),
                            stroke_size = 0.4,
                            set_name_size = 5)
            ggsave("Venn_Mac_Genes.png", plot = venn2, width = 7, height = 7, dpi = 300)
    
            cat("✅ All Venn intersection sets exported to Excel successfully.\n")
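
The lists above are written out by hand for four conditions, and the shared sets are plain intersections (for example, Mac.AB_AND_Mac.IJ still contains genes that also occur in W1 or Y1). As a cross-check, here is a minimal Python sketch that derives the mutually exclusive Venn regions for any number of sets. The function name `exclusive_regions` and the toy gene IDs are made up for illustration and are not part of the original analysis.

```python
from itertools import combinations

def exclusive_regions(sets):
    """Map each combination of set names to the genes found in exactly those sets."""
    names = list(sets)
    regions = {}
    for k in range(1, len(names) + 1):
        for combo in combinations(names, k):
            # Genes present in every set of the combination ...
            inside = set.intersection(*(sets[n] for n in combo))
            # ... minus genes that also occur in any set outside the combination
            others = [sets[n] for n in names if n not in combo]
            outside = set().union(*others)
            regions["_AND_".join(combo)] = inside - outside
    return regions

# Toy gene IDs (made up) for the four conditions
degs = {
    "AB": {"g1", "g2", "g3"},
    "IJ": {"g2", "g3", "g4"},
    "W1": {"g3", "g5"},
    "Y1": {"g3", "g5", "g6"},
}
regions = exclusive_regions(degs)
for name, genes in regions.items():
    if genes:
        print(name, sorted(genes))
```

Because the regions are disjoint, their sizes add up to the size of the union of all sets, which is a quick sanity check against the numbers shown in the ggvenn plot.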

How to correlate RNA-seq Data with Mass Spectrometry Proteomics Data?

Correlating RNA-seq data with mass spectrometry (MS)-based proteomics data is a powerful way to link transcript-level expression with protein-level abundance. Here’s a step-by-step outline of how to approach it:

  1. Preprocessing and Normalization

    For RNA-Seq data:

    • Obtain gene-level expression data, usually as raw counts, TPM (transcripts per million), or FPKM (fragments per kilobase of transcript per million mapped reads).
    • Normalize the data (e.g., using DESeq2’s variance stabilizing transformation (VST) or edgeR’s TMM normalization).

    For MS proteomics data:

    • Quantify protein abundances, often using spectral counts, iBAQ, LFQ intensities, or other measures.
    • Log-transform the data if needed to stabilize variance.
  2. Data Mapping and Integration

    • Gene/Protein Mapping: Use gene symbols, Ensembl IDs, or UniProt IDs to map transcript-level data (RNA-seq) to protein-level data (MS). Be cautious of differences in annotation – e.g., some genes might have multiple protein isoforms.

    • Common Identifiers:

      • Convert all IDs to a common identifier (e.g., gene symbols or Ensembl IDs).
      • Remove entries without matching pairs to ensure one-to-one correspondence.
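
The mapping step can be sketched as follows. This is a minimal example, assuming the proteomics rows carry protein accessions and that a protein-to-gene lookup table is already available (for instance exported from an ID-mapping service). The function name `one_to_one_pairs` and all IDs are hypothetical placeholders.

```python
from collections import defaultdict

def one_to_one_pairs(protein_ids, protein_to_gene, rna_genes):
    """Return {gene: protein_id} for genes matched by exactly one protein entry."""
    gene_to_proteins = defaultdict(list)
    for pid in protein_ids:
        gene = protein_to_gene.get(pid)
        # Skip proteins without a gene mapping or without RNA-seq data
        if gene is not None and gene in rna_genes:
            gene_to_proteins[gene].append(pid)
    # Keep only unambiguous gene/protein pairs
    return {g: p[0] for g, p in gene_to_proteins.items() if len(p) == 1}

# Placeholder IDs for illustration only
protein_to_gene = {
    "PROT_A": "GENE_1",
    "PROT_B": "GENE_2",  # two protein entries map to GENE_2 (e.g., isoforms)
    "PROT_C": "GENE_2",
    "PROT_D": "GENE_9",  # GENE_9 is absent from the RNA-seq table
}
pairs = one_to_one_pairs(
    ["PROT_A", "PROT_B", "PROT_C", "PROT_D"],
    protein_to_gene,
    {"GENE_1", "GENE_2", "GENE_3"},
)
print(pairs)
```

Dropping ambiguous genes is the simplest choice; summing or averaging isoform intensities per gene is an alternative if losing those genes is not acceptable.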
  3. Data Filtering

    • Filter out lowly expressed genes/proteins or those not reliably detected in both datasets.
    • Optionally, keep only genes/proteins of interest or those with high coverage.
  4. Correlation Analysis

    • For each matched gene/protein pair, calculate correlation (usually Pearson or Spearman) across the samples.

      Steps:

      • Construct a table with rows as genes/proteins and columns as samples.
      • For each row, you’ll have two vectors:
        • RNA expression (e.g., normalized RNA counts)
        • Protein abundance (e.g., log-transformed LFQ intensity)
      • Calculate:

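        from scipy.stats import pearsonr, spearmanr
        
        # Placeholders: fill in the values for one gene, in the same sample order
        rna_vector = [...]      # normalized RNA expression, one value per sample
        protein_vector = [...]  # log-transformed protein abundance, one value per sample
        
        # Each call returns (coefficient, p-value); only the coefficient is kept
        pearson_corr, _ = pearsonr(rna_vector, protein_vector)
        spearman_corr, _ = spearmanr(rna_vector, protein_vector)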
  5. Visualize and Interpret

    • Plot scatter plots of RNA vs protein levels for:

      • All genes/proteins together (aggregate view)
      • Selected genes of interest
    • Plot correlation coefficients:

      • Histogram of all gene/protein correlations
      • Heatmap if you have sample-wise data
    • Assess overall agreement:

      • Typically, moderate correlation (~0.3–0.6) is observed in many studies.
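
The two views mentioned above answer different questions: a gene-wise coefficient is computed across samples for one gene, while a sample-wise coefficient is computed across genes within one sample. The sketch below illustrates both with a small hand-written Pearson helper so it runs without extra packages (`scipy.stats.pearsonr` would give the same coefficient). All values are made up.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences; None if either is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    return sxy / sqrt(sxx * syy)

# Toy log2 values (made up): rows are genes, columns are samples S1..S3
rna = {"geneA": [1.0, 2.0, 3.0], "geneB": [5.0, 5.0, 5.0], "geneC": [2.0, 4.0, 1.0]}
protein = {"geneA": [2.0, 4.0, 6.0], "geneB": [1.0, 2.0, 3.0], "geneC": [3.0, 1.0, 4.0]}

# Gene-wise: one coefficient per gene, computed across samples
gene_wise = {g: pearson(rna[g], protein[g]) for g in rna}

# Sample-wise: one coefficient per sample, computed across genes
genes = sorted(rna)
sample_wise = [
    pearson([rna[g][s] for g in genes], [protein[g][s] for g in genes])
    for s in range(3)
]
print(gene_wise)
print(sample_wise)
```

A gene with constant RNA or protein values (geneB here) has no defined correlation, so it is reported as `None` rather than as zero.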
  6. Consider Batch Effects and Biological Variability

    • If the datasets come from different experiments or platforms, consider batch correction methods (e.g., ComBat from the sva R package).

    • Be mindful that:

      • Post-transcriptional regulation affects how well mRNA levels correlate with protein levels.
      • Some genes/proteins might show no correlation due to translational regulation, stability, etc.
  7. Summary Workflow

    ✅ Preprocess & normalize both datasets
    ✅ Map genes/proteins to common IDs
    ✅ Filter to shared, high-quality data
    ✅ Calculate correlations
    ✅ Visualize and interpret

  8. Python script that walks through the key steps of correlating RNA-seq data with proteomics data:

    import pandas as pd
    import numpy as np
    from scipy.stats import pearsonr, spearmanr
    import matplotlib.pyplot as plt
    import seaborn as sns
    
    # --- Step 1: Load your data ---
    
    # Example: CSVs with genes/proteins as rows, samples as columns
    rna_data = pd.read_csv('rna_seq_data.csv', index_col=0)  # genes x samples
    protein_data = pd.read_csv('proteomics_data.csv', index_col=0)  # proteins x samples
    
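    # --- Step 2: Map genes to proteins (assuming same identifiers) ---
    
    # Filter to common genes/proteins and to samples present in both tables,
    # so that each RNA value is paired with the protein value of the same sample.
    # This assumes both files use the same sample names as column headers.
    common_genes = rna_data.index.intersection(protein_data.index)
    common_samples = rna_data.columns.intersection(protein_data.columns)
    rna_data_filtered = rna_data.loc[common_genes, common_samples]
    protein_data_filtered = protein_data.loc[common_genes, common_samples]
    
    print(f"Number of common genes/proteins: {len(common_genes)}")
    print(f"Number of common samples: {len(common_samples)}")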
    
    # --- Step 3: Log transform if needed (optional) ---
    
    rna_data_log = np.log2(rna_data_filtered + 1)
    protein_data_log = np.log2(protein_data_filtered + 1)
    
    # --- Step 4: Calculate gene-wise correlations across samples ---
    
    pearson_corrs = []
    spearman_corrs = []
    
    for gene in common_genes:
        rna_vector = rna_data_log.loc[gene]
        protein_vector = protein_data_log.loc[gene]
    
        pearson_corr, _ = pearsonr(rna_vector, protein_vector)
        spearman_corr, _ = spearmanr(rna_vector, protein_vector)
    
        pearson_corrs.append(pearson_corr)
        spearman_corrs.append(spearman_corr)
    
    # Save results
    correlation_df = pd.DataFrame({
        'Gene': common_genes,
        'Pearson': pearson_corrs,
        'Spearman': spearman_corrs
    })
    correlation_df.to_csv('gene_protein_correlations.csv', index=False)
    print("Saved gene-wise correlation data to 'gene_protein_correlations.csv'")
    
    # --- Step 5: Visualize the correlation distributions ---
    
    sns.histplot(correlation_df['Pearson'], bins=30, kde=True, color='skyblue')
    plt.xlabel('Pearson Correlation')
    plt.title('Distribution of Pearson Correlations (RNA vs Protein)')
    plt.show()
    
    sns.histplot(correlation_df['Spearman'], bins=30, kde=True, color='salmon')
    plt.xlabel('Spearman Correlation')
    plt.title('Distribution of Spearman Correlations (RNA vs Protein)')
    plt.show()
    
    # --- Step 6: Scatter plot for a selected gene/protein ---
    
    example_gene = common_genes[0]  # change to your gene of interest
    plt.scatter(rna_data_log.loc[example_gene], protein_data_log.loc[example_gene])
    plt.xlabel('Log2 RNA Expression')
    plt.ylabel('Log2 Protein Abundance')
    plt.title(f'RNA vs Protein for {example_gene}')
    plt.grid(True)
    plt.show()
    
    # Key Notes:
    #✅ Replace the filenames (rna_seq_data.csv and proteomics_data.csv) with your actual files.
    #✅ The script expects rows to be genes/proteins and columns to be samples.
    #✅ Modify or add steps if you have different normalization needs (e.g., DESeq2 normalization).
  9. R script that covers the same steps as above:

    # --- Load libraries ---
    library(ggplot2)
    library(dplyr)
    
    # --- Step 1: Load your data ---
    # Example: CSVs with genes/proteins as rows, samples as columns
    rna_data <- read.csv("rna_seq_data.csv", row.names = 1)
    protein_data <- read.csv("proteomics_data.csv", row.names = 1)
    
    # --- Step 2: Find common genes/proteins ---
    common_genes <- intersect(rownames(rna_data), rownames(protein_data))
    rna_data_filtered <- rna_data[common_genes, ]
    protein_data_filtered <- protein_data[common_genes, ]
    
    cat("Number of common genes/proteins:", length(common_genes), "\n")
    
    # --- Step 3: Log-transform if needed (optional) ---
    rna_data_log <- log2(rna_data_filtered + 1)
    protein_data_log <- log2(protein_data_filtered + 1)
    
    # --- Step 4: Calculate gene-wise correlations across samples ---
    pearson_corrs <- numeric(length(common_genes))
    spearman_corrs <- numeric(length(common_genes))
    
    for (i in seq_along(common_genes)) {
      gene <- common_genes[i]
      rna_vector <- as.numeric(rna_data_log[gene, ])
      protein_vector <- as.numeric(protein_data_log[gene, ])
    
      pearson_corrs[i] <- cor(rna_vector, protein_vector, method = "pearson")
      spearman_corrs[i] <- cor(rna_vector, protein_vector, method = "spearman")
    }
    
    # Save the results
    correlation_df <- data.frame(
      Gene = common_genes,
      Pearson = pearson_corrs,
      Spearman = spearman_corrs
    )
    
    write.csv(correlation_df, "gene_protein_correlations.csv", row.names = FALSE)
    cat("Saved gene-wise correlation data to 'gene_protein_correlations.csv'\n")
    
    # --- Step 5: Visualize the correlation distributions ---
    ggplot(correlation_df, aes(x = Pearson)) +
      geom_histogram(bins = 30, fill = "skyblue", color = "black") +
      labs(title = "Distribution of Pearson Correlations (RNA vs Protein)",
           x = "Pearson Correlation", y = "Frequency") +
      theme_minimal()
    
    ggplot(correlation_df, aes(x = Spearman)) +
      geom_histogram(bins = 30, fill = "salmon", color = "black") +
      labs(title = "Distribution of Spearman Correlations (RNA vs Protein)",
           x = "Spearman Correlation", y = "Frequency") +
      theme_minimal()
    
    # --- Step 6: Scatter plot for a selected gene/protein ---
    example_gene <- common_genes[1]  # change this to your gene of interest
    df_example <- data.frame(
      RNA = as.numeric(rna_data_log[example_gene, ]),
      Protein = as.numeric(protein_data_log[example_gene, ])
    )
    
    ggplot(df_example, aes(x = RNA, y = Protein)) +
      geom_point() +
      labs(title = paste("RNA vs Protein for", example_gene),
           x = "Log2 RNA Expression", y = "Log2 Protein Abundance") +
      theme_minimal() +
      geom_smooth(method = "lm", se = FALSE, color = "red")
    
    # Key Notes:
    #✅ Replace "rna_seq_data.csv" and "proteomics_data.csv" with your real file names.
    #✅ Rows: genes/proteins, columns: samples.
    #✅ Change example_gene to any gene of interest for plotting.

All tools and services of BV-BRC

Genomics

  • Genome Assembly
  • Genome Annotation
  • Comprehensive Genome Analysis (B)
  • BLAST
  • Primer Design
  • Similar Genome Finder
  • Genome Alignment
  • Variation Analysis
  • Tn-Seq Analysis

Phylogenomics

  • Bacterial Genome Tree
  • Viral Genome Tree
  • Gene/Protein Tree

Protein Tools

  • MSA and SNP Analysis
  • Meta-CATS
  • Proteome Comparison
  • Protein Family Sorter
  • Comparative Systems
  • Docking

Metagenomics

  • Taxonomic Classification
  • Metagenomic Binning
  • Metagenomic Read Mapping

Transcriptomics

  • RNA-Seq Analysis
  • Expression Import

Utilities

  • Fastq Utilities
  • ID Mapper

Viral Tools

  • SARS-CoV-2 Genome Analysis
  • SARS-CoV-2 Wastewater Analysis
  • Influenza Sequence Submission
  • Influenza HA Subtype Conversion
  • Subspecies Classification
  • Viral Assembly

Outbreak Tracker

  • Measles 2025
  • Mpox 2024
  • Influenza H5N1 2024
  • SARS-CoV-2

DAMIAN Post-processing for Flavivirus and FSME

  1. Prepare input raw data

    ~/DATA/Data_DAMIAN_Post-processing_Flavivirus_and_FSME
    
    ln ./240621_M03701_0312_000000000-GHL9N/p20534/7448_7501_S0_R1_001.fastq.gz p20534_7448_R1.fastq.gz
    ln ./240621_M03701_0312_000000000-GHL9N/p20534/7448_7501_S0_R2_001.fastq.gz p20534_7448_R2.fastq.gz
  2. Prepare virus database and select 8 representatives for the eight given viruses from the database

    # -- Download genomes --
    # ---- Date is 13.06.2025. ----
    #Taxonomy ID: 3044782
    #The genus Orthoflavivirus (formerly Flavivirus) comprises enveloped viruses with a positive-sense single-stranded RNA genome; they are transmitted to birds and mammals by arthropod vectors (ticks and mosquitoes).
    #The genus belongs to the family Flaviviridae and includes several well-known viruses, such as:
    #        * Dengue virus
    #        * Zika virus
    #        * West Nile virus
    #        * Yellow fever virus
    #        * Tick-borne encephalitis virus (TBEV / FSME virus)
    
    esearch -db nucleotide -query "txid3044782[Organism:exp]" | efetch -format fasta -email j.huang@uke.de > genome_3044782_ncbi.fasta
    python ~/Scripts/filter_fasta.py genome_3044782_ncbi.fasta complete_genome_3044782_ncbi.fasta  #96579-->9431
    #https://www.ebi.ac.uk/ena/browser/view/3044782
    
    #Download FSME (tick-borne encephalitis virus) genomes
    esearch -db nucleotide -query "txid11084[Organism:exp]" | efetch -format fasta -email j.huang@uke.de > genome_11084_ncbi.fasta
    python ~/Scripts/filter_fasta.py genome_11084_ncbi.fasta complete_genome_11084_ncbi.fasta  #3426-->219
    #https://www.ebi.ac.uk/ena/browser/view/11084
    
    samtools faidx complete_genome_11084_ncbi.fasta PV626569.1 > PV626569.fasta
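
The script `~/Scripts/filter_fasta.py` is not shown in this post. The record counts (96579 to 9431 and 3426 to 219) suggest that it keeps only complete genomes. A hypothetical sketch of such a filter, assuming the selection is based on the phrase "complete genome" in the FASTA header (the real script may use different criteria), could look like this:

```python
import io

def filter_complete_genomes(in_handle, out_handle, phrase="complete genome"):
    """Copy FASTA records whose header contains `phrase`; return (total, kept)."""
    total = kept = 0
    keep = False
    for line in in_handle:
        if line.startswith(">"):
            total += 1
            keep = phrase in line.lower()
            kept += keep
        if keep:
            out_handle.write(line)
    return total, kept

# Tiny in-memory example with made-up records
fasta = io.StringIO(
    ">ACC1.1 Example virus isolate X, complete genome\nACGT\nACGT\n"
    ">ACC2.1 Example virus NS5 gene, partial cds\nGGCC\n"
)
out = io.StringIO()
counts = filter_complete_genomes(fasta, out)
print(counts)
```

With real files, the two handles would come from `open("genome_3044782_ncbi.fasta")` and `open("complete_genome_3044782_ncbi.fasta", "w")`.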
  3. Run the second round of vrap (--host ${virus}.fasta)

    #cat FluB_PB1.fasta FluB_PB2.fasta FluB_PA.fasta FluB_HA.fasta FluB_NP.fasta FluB_NB_NA.fasta FluB_M1_BM2.fasta FluB_NEP_NS1.fasta > FluB.fasta
    
    # Run vrap (second round): select some representative viruses from the Excel files generated in the previous step and pass them as --host
    (vrap) for sample in p20534_7448; do
        vrap/vrap_until_bowtie2.py  -1 ${sample}_R1.fastq.gz -2 ${sample}_R2.fastq.gz  -o vrap_${sample}_on_FSME --host /home/jhuang/DATA/Data_DAMIAN_Post-processing_Flavivirus_and_FSME/PV626569.fasta   -t 100 -l 200  --gbt2 --noblast
    done
    
    (vrap) for sample in p20534_7448; do
        vrap/vrap_until_bowtie2.py  -1 ${sample}_R1.fastq.gz -2 ${sample}_R2.fastq.gz  -o vrap_${sample}_on_Flavivirus --host /home/jhuang/DATA/Data_DAMIAN_Post-processing_Flavivirus_and_FSME/complete_genome_3044782_ncbi.fasta   -t 100 -l 200  --gbt2 --noblast
    done
  4. Generate the mapping statistics for the SAM files generated in the previous step

    for sample in p20534_7448; do
        echo "-----${sample}_on_representatives------" >> LOG_mapping
        #cd vrap_${sample}_on_${virus}/bowtie
        cd vrap_${sample}_on_Flavivirus/bowtie
        # Rename and convert SAM to BAM
        mv mapped mapped.sam 2>> ../../LOG_mapping
        samtools view -S -b mapped.sam > mapped.bam 2>> ../../LOG_mapping
        samtools sort mapped.bam -o mapped_sorted.bam 2>> ../../LOG_mapping
        samtools index mapped_sorted.bam 2>> ../../LOG_mapping
        # Write flagstat output to log (go up two levels to write correctly)
        samtools flagstat mapped_sorted.bam >> ../../LOG_mapping 2>&1
        #samtools idxstats mapped_sorted.bam >> ../../LOG_mapping 2>&1
        cd ../..
    done
    
    (bakta) jhuang@WS-2290C:/mnt/md1/DATA/Data_DAMIAN_Post-processing_Flavivirus_and_FSME/vrap_p20534_7448_on_FSME/bowtie$ samtools flagstat mapped_sorted.bam
    7836046 + 0 in total (QC-passed reads + QC-failed reads)
    7836046 + 0 primary
    0 + 0 secondary
    0 + 0 supplementary
    0 + 0 duplicates
    0 + 0 primary duplicates
    0 + 0 mapped (0.00% : N/A)
    0 + 0 primary mapped (0.00% : N/A)
    5539082 + 0 paired in sequencing
    2769541 + 0 read1
    2769541 + 0 read2
    0 + 0 properly paired (0.00% : N/A)
    0 + 0 with itself and mate mapped
    0 + 0 singletons (0.00% : N/A)
    0 + 0 with mate mapped to a different chr
    0 + 0 with mate mapped to a different chr (mapQ>=5)
    
    (bakta) jhuang@WS-2290C:/mnt/md1/DATA/Data_DAMIAN_Post-processing_Flavivirus_and_FSME/vrap_p20534_7448_on_Flavivirus/bowtie$ samtools flagstat mapped_sorted.bam
    7836234 + 0 in total (QC-passed reads + QC-failed reads)
    7836234 + 0 primary
    0 + 0 secondary
    0 + 0 supplementary
    0 + 0 duplicates
    0 + 0 primary duplicates
    52 + 0 mapped (0.00% : N/A)
    52 + 0 primary mapped (0.00% : N/A)
    5539458 + 0 paired in sequencing
    2769729 + 0 read1
    2769729 + 0 read2
    0 + 0 properly paired (0.00% : N/A)
    4 + 0 with itself and mate mapped
    13 + 0 singletons (0.00% : N/A)
    0 + 0 with mate mapped to a different chr
    0 + 0 with mate mapped to a different chr (mapQ>=5)
    
    samtools view -F 4 mapped_sorted.bam > mapped_reads.sam
    awk '{print $3}' mapped_reads.sam | sort | uniq -c
    52 KY766069.1 Zika virus isolate Pf13/251013-18, complete genome
    
    # ------------------ DEBUG ----------------------
    samtools idxstats mapped_sorted.bam | cut -f 1
    
    for ref in PV424649.1 PV424650.1 PV424648.1 PV424643.1 PV424646.1 PV424645.1 PV424644.1 PV424647.1; do
        echo "Reference: $ref"
        samtools view -b mapped_sorted.bam "$ref" | samtools flagstat -
    done
    
    When I run samtools flagstat mapped_sorted.bam
    
    49572521 + 0 in total (QC-passed reads + QC-failed reads)
    0 + 0 secondary
    0 + 0 supplementary
    0 + 0 duplicates
    1169 + 0 mapped (0.00% : N/A)
    38247374 + 0 paired in sequencing
    19123687 + 0 read1
    19123687 + 0 read2
    884 + 0 properly paired (0.00% : N/A)
    934 + 0 with itself and mate mapped
    227 + 0 singletons (0.00% : N/A)
    0 + 0 with mate mapped to a different chr
    0 + 0 with mate mapped to a different chr (mapQ>=5)
    
    However, when I run for ref in PV424649.1 PV424650.1 PV424648.1 PV424643.1 PV424646.1 PV424645.1 PV424644.1 PV424647.1; do
            echo "Reference: $ref"
            samtools view -b mapped_sorted.bam "$ref" | samtools flagstat -
            done
    
    Reference: PV424647.1
    83 + 0 in total (QC-passed reads + QC-failed reads)
    0 + 0 secondary
    0 + 0 supplementary
    0 + 0 duplicates
    72 + 0 mapped (86.75% : N/A)
    82 + 0 paired in sequencing
    41 + 0 read1
    41 + 0 read2
    56 + 0 properly paired (68.29% : N/A)
    60 + 0 with itself and mate mapped
    11 + 0 singletons (13.41% : N/A)
    0 + 0 with mate mapped to a different chr
    0 + 0 with mate mapped to a different chr (mapQ>=5)
    
    I also want the per-reference output to report the same total as "samtools flagstat mapped_sorted.bam". How?
    
    samtools view -b mapped_sorted.bam PV424649.1
    
    for ref in PV424649.1 PV424650.1 PV424648.1 PV424643.1 PV424646.1 PV424645.1 PV424644.1 PV424647.1; do
    echo "Reference: $ref"
    samtools view -h mapped_sorted.bam | grep -E "^@|$ref" | samtools view -Sb - | samtools flagstat -
    done
    
    # ---- DEBUG END ----
    
    #Draw coverage plots for the representative isolates found in the first round (see Excel file).
    samtools depth -m 0 -a mapped_sorted.bam > coverage.txt
    #grep "PV424649.1" coverage.txt > FluB_PB1_coverage.txt
    #grep "PV424650.1" coverage.txt > FluB_PB2_coverage.txt
    #grep "PV424648.1" coverage.txt > FluB_PA_coverage.txt
    #grep "PV424643.1" coverage.txt > FluB_HA_coverage.txt
    #grep "PV424646.1" coverage.txt > FluB_NP_coverage.txt
    #grep "PV424645.1" coverage.txt > FluB_NB_NA_coverage.txt
    #grep "PV424644.1" coverage.txt > FluB_M1_BM2_coverage.txt
    #grep "PV424647.1" coverage.txt > FluB_NEP_NS1_coverage.txt
    
            import pandas as pd
            import matplotlib.pyplot as plt
    
            # Load coverage data
            df = pd.read_csv("coverage.txt", sep="\t", header=None, names=["chr", "pos", "coverage"])
    
            # Plot
            plt.figure(figsize=(10,4))
            plt.plot(df["pos"], df["coverage"], color="blue", linewidth=0.5)
            plt.xlabel("Genomic Position")
            plt.ylabel("Coverage Depth")
            plt.title("BAM Coverage Plot")
            plt.show()
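
On the question in the DEBUG notes: `samtools view -b mapped_sorted.bam "$ref"` returns only the reads placed on that reference, so `flagstat` on that subset cannot report the whole-file total. One way to show both numbers is to take the per-reference counts from `samtools idxstats` (tab-separated columns: reference name, length, mapped, unmapped; the last line `*` holds the unplaced reads) and relate them to the sum over all lines. The sketch below parses such output. The function name and the example text are made up, and the counts include secondary or supplementary alignments if the BAM contains any.

```python
def per_reference_counts(idxstats_text):
    """Parse `samtools idxstats` output into (total_reads, {ref: mapped_reads})."""
    total = 0
    mapped_by_ref = {}
    for line in idxstats_text.strip().splitlines():
        ref, _length, mapped, unmapped = line.split("\t")
        total += int(mapped) + int(unmapped)
        if ref != "*":
            mapped_by_ref[ref] = int(mapped)
    return total, mapped_by_ref

# Made-up idxstats output for illustration
example = "refA\t1000\t72\t11\nrefB\t2000\t0\t0\n*\t0\t0\t917\n"
total, mapped = per_reference_counts(example)
for ref, n in mapped.items():
    print(f"{ref}: {n} mapped of {total} total ({100 * n / total:.2f}%)")
```

The input text can be produced with `samtools idxstats mapped_sorted.bam > idxstats.txt` and read back with `open("idxstats.txt").read()`.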
  5. Report

    Subject: Mapping Results for Flavivirus and FSME
    
    I have re-analyzed sample P20534 (7448) with a focus on Flaviviruses and FSME.
    
    Using curated reference sets from NCBI (Taxonomy ID 3044782 for Flavivirus, comprising 9,431 complete genomes—see attached flavivirus_names.txt for details; and Taxonomy ID 11084 for FSME, with 219 complete genomes), I performed targeted mapping. The key findings are summarized below:
    
        * Total reads: 7,836,234
        * Mapped to Flavivirus: 52 reads
          All 52 reads mapped specifically to Zika virus (KY766069.1, complete genome of isolate Pf13/251013-18)
        * Mapped to FSME: No significant hits detected
    
    Please find attached a coverage plot for the Zika virus genome (KY766069).
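
With only 52 mapped reads, a coverage plot may be easier to interpret next to two numbers: the fraction of the genome covered (breadth) and the mean depth. The sketch below computes both from `samtools depth -a` output (tab-separated: reference, position, depth) for one reference. The function name and the example lines are made up, and it assumes the reference name in the first column is the bare accession (for example `KY766069.1`).

```python
def coverage_summary(depth_lines, ref):
    """Summarize `samtools depth -a` lines (ref, pos, depth) for one reference."""
    positions = covered = depth_sum = 0
    for line in depth_lines:
        chrom, _pos, depth = line.rstrip("\n").split("\t")
        if chrom != ref:
            continue
        positions += 1
        depth_sum += int(depth)
        covered += int(depth) > 0
    if positions == 0:
        return None  # reference not present in the depth file
    return {
        "positions": positions,
        "breadth": covered / positions,
        "mean_depth": depth_sum / positions,
    }

# Made-up depth lines for illustration
lines = ["refA\t1\t0\n", "refA\t2\t3\n", "refA\t3\t1\n", "refA\t4\t0\n", "refB\t1\t5\n"]
summary = coverage_summary(lines, "refA")
print(summary)
```

For the real data, the lines would come from the `coverage.txt` file written in step 4, for example `with open("coverage.txt") as fh: coverage_summary(fh, "KY766069.1")`.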
  6. Preparing a database containing all representative viruses from NCBI Virus (https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/; all 18,708 records downloaded on 26.05.2025)

    # ------------ Manually update the internal viral databases --------------
    ##https://www.ebi.ac.uk/ena/browser/view/10239
    #esearch -db nucleotide -query "txid10239[Organism:exp]" | efetch -format fasta -email j.huang@uke.de > genome_10239_ncbi.fasta
    #esearch -db protein -query "txid11520[Organism:exp]" | efetch -format fasta -email j.huang@uke.de > protein_11520_ncbi.fasta
    #mv ~/Tools/vrap/database/viral_db/nucleotide.fa ~/Tools/vrap/database/viral_db/nucleotide_Human_alphaherpesvirus_1.fa
    #mv ~/Tools/vrap/database/viral_db/protein.fa ~/Tools/vrap/database/viral_db/protein_Human_alphaherpesvirus_1.fa
    #cp genome_11520_ncbi.fasta ~/Tools/vrap/database/viral_db/nucleotide.fa
    #cp protein_11520_ncbi.fasta ~/Tools/vrap/database/viral_db/protein.fa
    #cd ~/Tools/vrap/database/viral_db
    #~/Tools/vrap/external_tools/blast/makeblastdb -in nucleotide.fa -dbtype nucl -parse_seqids -out virus_nucleotide
    #~/Tools/vrap/external_tools/blast/makeblastdb -in protein.fa -dbtype prot -parse_seqids -out virus_protein
    #vrap/vrap_noassembly.py  -1 AW005486_R1.fastq.gz -2 AW005486_R2.fastq.gz -o vrap_AW005486_on_InfluB  --bt2idx=/home/jhuang/REFs/genome  --host=/home/jhuang/REFs/genome.fa --virus=/mnt/md1/DATA/Data_Nicole_DAMIAN_Post-processing_Pathoprobe_FluB_Links/complete_11520_ncbi.fasta  -t 20 -l 200  -g
    
    # ----------- Three databases ----------
    #db is [virus_user_db]
    /home/jhuang/Tools/vrap/external_tools/blast/makeblastdb -in /mnt/md1/DATA/Data_Nicole_DAMIAN_Post-processing_Pathoprobe_FluB_Links/vrap_AW005486_on_InfluB/blast/custom_viral_seq.fa -dbtype nucl -parse_seqids -out /mnt/md1/DATA/Data_Nicole_DAMIAN_Post-processing_Pathoprobe_FluB_Links/vrap_AW005486_on_InfluB/blast/db/virus >> /mnt/md1/DATA/Data_Nicole_DAMIAN_Post-processing_Pathoprobe_FluB_Links/vrap_AW005486_on_InfluB/vrap.log 2>> /mnt/md1/DATA/Data_Nicole_DAMIAN_Post-processing_Pathoprobe_FluB_Links/vrap_AW005486_on_InfluB/vrap.log
    
    #db is ~/Tools/vrap/database/viral_db/nucleotide.fa  [Human alphaherpesvirus 1] [virus_nt_db]
    /home/jhuang/Tools/vrap/external_tools/blast/blastn -num_threads 20 -query /mnt/md1/DATA/Data_Nicole_DAMIAN_Post-processing_Pathoprobe_FluB_Links/vrap_AW005486_on_InfluB/vrap_contig.fasta -db "/home/jhuang/Tools/vrap/database/viral_db/viral_nucleotide /mnt/md1/DATA/Data_Nicole_DAMIAN_Post-processing_Pathoprobe_FluB_Links/vrap_AW005486_on_InfluB/blast/db/virus"  -evalue 1e-4 -outfmt "6 qseqid qstart qend sstart send evalue length pident sseqid stitle qcovs qcovhsp sacc slen qlen" -max_target_seqs 1 -out /mnt/md1/DATA/Data_Nicole_DAMIAN_Post-processing_Pathoprobe_FluB_Links/vrap_AW005486_on_InfluB/blast/blastn.csv >> /mnt/md1/DATA/Data_Nicole_DAMIAN_Post-processing_Pathoprobe_FluB_Links/vrap_AW005486_on_InfluB/vrap.log
    Warning: [blastn] Examining 5 or more matches is recommended
    
    #db is ~/Tools/vrap/database/viral_db/protein.fa [Human alphaherpesvirus 1] [virus_aa_db]
    /home/jhuang/Tools/vrap/external_tools/blast/blastx -num_threads 20 -query /mnt/md1/DATA/Data_Nicole_DAMIAN_Post-processing_Pathoprobe_FluB_Links/vrap_AW005486_on_InfluB/blast/blastn.fa -db "/home/jhuang/Tools/vrap/database/viral_db/viral_protein"  -evalue 1e-6 -outfmt "6 qseqid qstart qend sstart send evalue length pident sseqid stitle qcovs qcovhsp sacc slen qlen" -max_target_seqs 1 -out /mnt/md1/DATA/Data_Nicole_DAMIAN_Post-processing_Pathoprobe_FluB_Links/vrap_AW005486_on_InfluB/blast/blastx.csv >> /mnt/md1/DATA/Data_Nicole_DAMIAN_Post-processing_Pathoprobe_FluB_Links/vrap_AW005486_on_InfluB/vrap.log
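
The `blastn.csv` and `blastx.csv` files written by the commands above are tab-separated despite the `.csv` extension, because `-outfmt 6` produces tabular output without a header line. A minimal sketch for reading them into dictionaries, with the column names taken from the `-outfmt` string, is shown below. The function name and the example row are made up.

```python
import csv
import io

# Column order copied from the -outfmt string used in the commands above
COLUMNS = ("qseqid qstart qend sstart send evalue length pident "
           "sseqid stitle qcovs qcovhsp sacc slen qlen").split()

def read_blast_table(handle):
    """Yield one dict per hit from a tab-separated BLAST outfmt 6 table."""
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) != len(COLUMNS):
            continue  # skip blank or malformed lines
        hit = dict(zip(COLUMNS, row))
        hit["evalue"] = float(hit["evalue"])
        hit["pident"] = float(hit["pident"])
        hit["qcovs"] = int(hit["qcovs"])
        yield hit

# Made-up row for illustration only
example = io.StringIO(
    "contig_1\t1\t500\t100\t599\t1e-120\t500\t98.6\tACC1.1\t"
    "Example virus, complete genome\t95\t95\tACC1\t10000\t526\n"
)
hits = list(read_blast_table(example))
print(hits[0]["sseqid"], hits[0]["pident"], hits[0]["qcovs"])
```

Only three fields are converted to numbers here; the remaining fields stay as strings and can be converted the same way when needed.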

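What to do about a leaking conservatory? Butyl tape is the best sealing choice

TODO: From now on, use only butyl tape to seal the joints between the glass edges and the metal on the Wintergarten!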

HSButyl FoilBand https://www.hsbutyl.com/de/product/foilband-butyl-tape/

Das Premium Glasfalz Silikon OTTOSEAL S120 310ml Alle Farben (Transparent) Innen- und Außen https://www.ebay.de/itm/266565929135?var=566331892689

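  1. Butyl tape overview:

    • Material: made of butyl rubber; a soft, self-adhesive, waterproof sealing tape.

    • Features: ✅ Strong adhesion: suitable for glass, metal, and similar materials ✅ Waterproof and leak-proof: excellent sealing ✅ Resistant to high and low temperatures and to UV: suitable for outdoor use ✅ Non-toxic and odorless: environmentally friendly ✅ Easy to apply: can be cut to size and applied by hand, clean and tidy

    • Typical uses:

      • Sealing roof glazing and window gaps
      • Waterproofing repairs on motorhomes, refrigerated trucks, and metal roofs
      • Sealing pipes and vents
  2. Bitumen tape overview:

    • Material: made of modified bitumen with an aluminium foil surface; commonly used for large-area waterproofing of building roofs.
    • Features:

      ✅ Strong waterproofing ❌ Messy, strong odor ❌ Softens and runs in summer, staining the glass ❌ Hardens in winter and loses tack ❌ Not suitable for sealing narrow glass joints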

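✅ Which is better for sealing conservatory glass joints?

Criterion                 Butyl tape (recommended)                                   Bitumen tape (not recommended)
Compatibility with glass  ✅ Very good                                               ❌ Stains glass easily, poor adhesion
UV resistance             ✅ Excellent, suitable for long-term conservatory use      ❌ Ages and melts easily
Appearance                ✅ Tidy, available in grey or black                        ❌ Black and rough, hard to clean
Temperature tolerance     ✅ Does not melt in heat or crack in cold                  ❌ Melts at high temperatures, hardens at low temperatures
Ease of installation      ✅ Self-adhesive, easy to apply, tidy                      ❌ Dirties hands, application is cumbersome
Odor and environment      ✅ Odorless, environmentally friendly                      ❌ Strong bitumen smell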

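3 Usage recommendations:

* Clean the glass and frame surfaces (clean, dry, free of oil)
* Apply the butyl tape and press it firmly along the joint
* Optional: add a layer of reflective aluminium foil tape or a sealing strip on the outside for better sun protection and durability
* Pressing the tape down with a rubber roller gives better results

4 Search keywords for buying in Germany:

* Butyl-Dichtband
* UV-beständig
* Für Glas und Aluminium
* Recommended brands: Tesa, Sika, 3M, etc.
* Where to buy: Amazon.de, OBI, Bauhaus, Hornbach

✅ Conclusion: To seal the gaps in the Wintergarten roof glazing, choose butyl tape and do not use bitumen tape. Butyl tape is cleaner, more durable, and more environmentally friendly, and long-term use does not affect the appearance of the glass or the integrity of the seal.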

Variant calling (inter-host + intra-host) for Data_Pietschmann_229ECoronavirus_Mutations_2025 (via docker own_viral_ngs) v2

  1. Input data:

    ln -s ../raw_data_2024/hCoV229E_Rluc_R1.fastq.gz hCoV229E_Rluc_R1.fastq.gz
    ln -s ../raw_data_2024/hCoV229E_Rluc_R2.fastq.gz hCoV229E_Rluc_R2.fastq.gz
    ln -s ../raw_data_2024/p10_DMSO_R1.fastq.gz p10_DMSO_R1.fastq.gz
    ln -s ../raw_data_2024/p10_DMSO_R2.fastq.gz p10_DMSO_R2.fastq.gz
    ln -s ../raw_data_2024/p10_K22_R1.fastq.gz p10_K22_R1.fastq.gz
    ln -s ../raw_data_2024/p10_K22_R2.fastq.gz p10_K22_R2.fastq.gz
    ln -s ../raw_data_2024/p10_K7523_R1.fastq.gz p10_K7523_R1.fastq.gz
    ln -s ../raw_data_2024/p10_K7523_R2.fastq.gz p10_K7523_R2.fastq.gz
    ln -s ../raw_data_2025/250506_VH00358_136_AAG3YJ5M5/p20606/p16_DMSO_S29_R1_001.fastq.gz p16_DMSO_R1.fastq.gz
    ln -s ../raw_data_2025/250506_VH00358_136_AAG3YJ5M5/p20606/p16_DMSO_S29_R2_001.fastq.gz p16_DMSO_R2.fastq.gz
    ln -s ../raw_data_2025/250506_VH00358_136_AAG3YJ5M5/p20607/p16_K22_S30_R1_001.fastq.gz p16_K22_R1.fastq.gz
    ln -s ../raw_data_2025/250506_VH00358_136_AAG3YJ5M5/p20607/p16_K22_S30_R2_001.fastq.gz p16_K22_R2.fastq.gz
    ln -s ../raw_data_2025/250506_VH00358_136_AAG3YJ5M5/p20608/p16_X7523_S31_R1_001.fastq.gz p16_X7523_R1.fastq.gz
    ln -s ../raw_data_2025/250506_VH00358_136_AAG3YJ5M5/p20608/p16_X7523_S31_R2_001.fastq.gz p16_X7523_R2.fastq.gz
  2. Call variants using snippy

    ln -s ~/Tools/bacto/db/ .;
    ln -s ~/Tools/bacto/envs/ .;
    ln -s ~/Tools/bacto/local/ .;
    cp ~/Tools/bacto/Snakefile .;
    cp ~/Tools/bacto/bacto-0.1.json .;
    cp ~/Tools/bacto/cluster.json .;
    
    #download PP810610.gb from GenBank
    mv ~/Downloads/sequence\(2\).gb db/PP810610.gb
    
    #setting the following in bacto-0.1.json
        "fastqc": false,
        "taxonomic_classifier": false,
        "assembly": true,
        "typing_ariba": false,
        "typing_mlst": true,
        "pangenome": true,
        "variants_calling": true,
        "phylogeny_fasttree": true,
        "phylogeny_raxml": true,
        "recombination": false, (due to gubbins-error set false)
        "genus": "Alphacoronavirus",
        "kingdom": "Viruses",
        "species": "Human coronavirus 229E",
        "mykrobe": {
            "species": "corona"
        },
        "reference": "db/PP810610.gb"
    
    mamba activate /home/jhuang/miniconda3/envs/bengal3_ac3
    (bengal3_ac3) /home/jhuang/miniconda3/envs/snakemake_4_3_1/bin/snakemake --printshellcmds
  3. Summarize all SNPs and Indels from the snippy result directory.

    #Output: snippy/summary_snps_indels.csv
    # IMPORTANT_ADAPT the array isolates = ["AYE-S", "AYE-Q", "AYE-WT on Tig4", "AYE-craA on Tig4", "AYE-craA-1 on Cm200", "AYE-craA-2 on Cm200"]
    python3 ~/Scripts/summarize_snippy_res.py snippy
    cd snippy
    #grep -v "None,,,,,,None,None" summary_snps_indels.csv > summary_snps_indels_.csv
  4. Call variants using spandx (the results are almost the same as those from viral-ngs!)

    mamba activate /home/jhuang/miniconda3/envs/spandx
    mkdir ~/miniconda3/envs/spandx/share/snpeff-5.1-2/data/PP810610
    cp PP810610.gb  ~/miniconda3/envs/spandx/share/snpeff-5.1-2/data/PP810610/genes.gbk
    vim ~/miniconda3/envs/spandx/share/snpeff-5.1-2/snpEff.config
    /home/jhuang/miniconda3/envs/spandx/bin/snpEff build PP810610    #-d
    ~/Scripts/genbank2fasta.py PP810610.gb
    mv PP810610.gb_converted.fna PP810610.fasta    #rename "NC_001348.1 xxxxx" to "NC_001348" in the fasta-file
    ln -s /home/jhuang/Tools/spandx/ spandx
    (spandx) nextflow run spandx/main.nf --fastq "trimmed/*_P_{1,2}.fastq" --ref PP810610.fasta --annotation --database PP810610 -resume
    
    # Rerun SNP_matrix.sh due to the error ERROR_CHROMOSOME_NOT_FOUND in the variants annotation
    cd Outputs/Master_vcf
    (spandx) cp -r ../../snippy/hCoV229E_Rluc/reference .
    (spandx) cp ../../spandx/bin/SNP_matrix.sh ./
    #Note that ${variant_genome_path}=NC_001348 in the following command, but it was not used after command replacement.
    #In SNP_matrix.sh, change
    #  snpEff eff -no-downstream -no-intergenic -ud 100 -formatEff -v ${variant_genome_path} out.vcf > out.annotated.vcf
    #to
    #  /home/jhuang/miniconda3/envs/bengal3_ac3/bin/snpEff eff -no-downstream -no-intergenic -ud 100 -formatEff -c reference/snpeff.config -dataDir . ref out.vcf > out.annotated.vcf
    (spandx) bash SNP_matrix.sh PP810610 .
  5. Calling inter-host variants by merging the results from snippy+spandx (Manually!)

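    # Inter-host variants: genetic differences in a virus between two individuals (hosts); these differences may be related to the host's immune response, the disease presentation, or the mode of viral transmission.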
    cp All_SNPs_indels_annotated.txt All_SNPs_indels_annotated_backup.txt
    vim All_SNPs_indels_annotated.txt
    
    #in the file ids: grep "$(echo -e '\t')353$(echo -e '\t')" All_SNPs_indels_annotated.txt >> All_SNPs_indels_annotated_.txt
    #Replace \n with " All_SNPs_indels_annotated.txt >> All_SNPs_indels_annotated_.txt\ngrep "
    #Replace grep " --> grep "$(echo -e '\t')
    #Replace " All_ --> $(echo -e '\t')" All_
    
    # Potential intra-host variants: 10871, 19289, 23435.
    CHROM   POS     REF     ALT     TYPE    hCoV229E_Rluc_trimmed   p10_DMSO_trimmed        p10_K22_trimmed p10_K7523_trimmed       p16_DMSO_trimmed        p16_K22_trimmed p16_X7523_trimmed       Effect  Impact  Functional_Class        Codon_change    Protein_and_nucleotide_change   Amino_Acid_Length       Gene_name       Biotype
    PP810610        1464    T       C       SNP     C       C       C       C       C       C       C       missense_variant        MODERATE        MISSENSE        gTt/gCt p.Val416Ala/c.1247T>C   6757    CDS_1   protein_coding
    PP810610        1699    C       T       SNP     T       T       T       T       T       T       T       synonymous_variant      LOW     SILENT  gtC/gtT p.Val494Val/c.1482C>T   6757    CDS_1   protein_coding
    PP810610        6691    C       T       SNP     T       T       T       T       T       T       T       synonymous_variant      LOW     SILENT  tgC/tgT p.Cys2158Cys/c.6474C>T  6757    CDS_1   protein_coding
    PP810610        6919    C       G       SNP     G       G       G       G       G       G       G       synonymous_variant      LOW     SILENT  ggC/ggG p.Gly2234Gly/c.6702C>G  6757    CDS_1   protein_coding
    PP810610        7294    T       A       SNP     A       A       A       A       A       A       A       missense_variant        MODERATE        MISSENSE        agT/agA p.Ser2359Arg/c.7077T>A  6757    CDS_1   protein_coding
    * PP810610       10871   C       T       SNP     C       C/T     T       C/T     C/T     T       C/T     missense_variant        MODERATE        MISSENSE        Ctt/Ttt p.Leu3552Phe/c.10654C>T 6757    CDS_1   protein_coding
    PP810610        14472   T       C       SNP     C       C       C       C       C       C       C       missense_variant        MODERATE        MISSENSE        aTg/aCg p.Met4752Thr/c.14255T>C 6757    CDS_1   protein_coding
    PP810610        15458   T       C       SNP     C       C       C       C       C       C       C       synonymous_variant      LOW     SILENT  Ttg/Ctg p.Leu5081Leu/c.15241T>C 6757    CDS_1   protein_coding
    PP810610        16035   C       A       SNP     A       A       A       A       A       A       A       stop_gained     HIGH    NONSENSE        tCa/tAa p.Ser5273*/c.15818C>A   6757    CDS_1   protein_coding
    PP810610        17430   T       C       SNP     C       C       C       C       C       C       C       missense_variant        MODERATE        MISSENSE        tTa/tCa p.Leu5738Ser/c.17213T>C 6757    CDS_1   protein_coding
    * PP810610       19289   G       T       SNP     G       G       T       G       G       G/T     G       missense_variant        MODERATE        MISSENSE        Gtt/Ttt p.Val6358Phe/c.19072G>T 6757    CDS_1   protein_coding
    PP810610        21183   T       G       SNP     G       G       G       G       G       G       G       missense_variant        MODERATE        MISSENSE        tTt/tGt p.Phe230Cys/c.689T>G    1173    CDS_2   protein_coding
    PP810610        22636   T       G       SNP     G       G       G       G       G       G       G       missense_variant        MODERATE        MISSENSE        aaT/aaG p.Asn714Lys/c.2142T>G   1173    CDS_2   protein_coding
    PP810610        23022   T       C       SNP     C       C       C       C       C       C       C       missense_variant        MODERATE        MISSENSE        tTa/tCa p.Leu843Ser/c.2528T>C   1173    CDS_2   protein_coding
    * PP810610       23435   C       T       SNP     C       C       T       C/T     C       C/T     C/T     missense_variant        MODERATE        MISSENSE        Ctt/Ttt p.Leu981Phe/c.2941C>T   1173    CDS_2   protein_coding
    PP810610        24512   C       T       SNP     T       T       T       T       T       T       T       missense_variant        MODERATE        MISSENSE        Ctc/Ttc p.Leu36Phe/c.106C>T     88      CDS_4   protein_coding
    PP810610        24781   C       T       SNP     T       T       T       T       T       T       T       missense_variant        MODERATE        MISSENSE        aCt/aTt p.Thr36Ile/c.107C>T     77      CDS_5   protein_coding
    PP810610        25163   C       T       SNP     T       T       T       T       T       T       T       missense_variant        MODERATE        MISSENSE        Ctt/Ttt p.Leu82Phe/c.244C>T     225     CDS_6   protein_coding
    PP810610        25264   C       T       SNP     T       T       T       T       T       T       T       synonymous_variant      LOW     SILENT  gtC/gtT p.Val115Val/c.345C>T    225     CDS_6   protein_coding
    PP810610        26838   G       T       SNP     T       T       T       T       T       T       T
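
The starred rows in the table above were picked by eye. As a cross-check, here is a minimal Python sketch that flags every position whose per-sample calls are not all identical or contain a "/". It assumes a tab-separated table with the five leading columns CHROM, POS, REF, ALT, TYPE followed by seven genotype columns, and no leading "* " marker. The function name `discordant_positions` and its parameters are hypothetical, not part of snippy or spandx.

```python
import csv
import io


def discordant_positions(table_text, n_fixed=5, n_samples=7):
    """Return POS values whose per-sample calls differ or contain a '/'.

    n_fixed:   number of leading columns (CHROM, POS, REF, ALT, TYPE).
    n_samples: number of genotype columns that follow them.
    Both defaults are assumptions matching the table layout shown above.
    """
    reader = csv.reader(io.StringIO(table_text), delimiter="\t")
    next(reader)  # skip the header line
    hits = []
    for row in reader:
        if len(row) < n_fixed + n_samples:
            continue  # skip truncated or empty lines
        calls = row[n_fixed:n_fixed + n_samples]
        if len(set(calls)) > 1 or any("/" in call for call in calls):
            hits.append(int(row[1]))
    return hits


# Three rows copied from the table above (annotation columns left out).
HEADER = ["CHROM", "POS", "REF", "ALT", "TYPE",
          "hCoV229E_Rluc_trimmed", "p10_DMSO_trimmed", "p10_K22_trimmed",
          "p10_K7523_trimmed", "p16_DMSO_trimmed", "p16_K22_trimmed",
          "p16_X7523_trimmed"]
ROWS = [
    ["PP810610", "1464", "T", "C", "SNP", "C", "C", "C", "C", "C", "C", "C"],
    ["PP810610", "10871", "C", "T", "SNP", "C", "C/T", "T", "C/T", "C/T", "T", "C/T"],
    ["PP810610", "19289", "G", "T", "SNP", "G", "G", "T", "G", "G", "G/T", "G"],
]
EXAMPLE = "\n".join("\t".join(row) for row in [HEADER] + ROWS)

print(discordant_positions(EXAMPLE))
```

On the real file, the text of All_SNPs_indels_annotated.txt would be passed in instead of `EXAMPLE`.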
  6. Calling intra-host variants using viral-ngs

    # Intra-host variants: a single person is infected with a virus, but several different viral variants may coexist in different cells or organs within that person.
    
    #How to run and debug the viral-ngs docker?
    # ---- DEBUG_2025_1: using docker instead ----
    mkdir viralngs; cd viralngs
    ln -s ~/Tools/viral-ngs_docker/Snakefile Snakefile
    ln -s  ~/Tools/viral-ngs_docker/bin bin
    cp  ~/DATA_D/Data_Pietschmann_229ECoronavirus_Mutations_2024/refsel.acids refsel.acids
    cp  ~/DATA_D/Data_Pietschmann_229ECoronavirus_Mutations_2024/lastal.acids lastal.acids
    cp  ~/DATA_D/Data_Pietschmann_229ECoronavirus_Mutations_2024/config.yaml config.yaml
    cp  ~/DATA_D/Data_Pietschmann_229ECoronavirus_Mutations_2024/samples-runs.txt samples-runs.txt
    cp  ~/DATA_D/Data_Pietschmann_229ECoronavirus_Mutations_2024/samples-depletion.txt samples-depletion.txt
    cp  ~/DATA_D/Data_Pietschmann_229ECoronavirus_Mutations_2024/samples-metagenomics.txt samples-metagenomics.txt
    cp  ~/DATA_D/Data_Pietschmann_229ECoronavirus_Mutations_2024/samples-assembly.txt samples-assembly.txt
    cp  ~/DATA_D/Data_Pietschmann_229ECoronavirus_Mutations_2024/samples-assembly-failures.txt samples-assembly-failures.txt
    # Adapt the samples-*.txt files
    
    mkdir viralngs/data
    mkdir viralngs/data/00_raw
    
    mkdir bams
    ref_fa="PP810610.fasta"
    bwa index ${ref_fa}    # index the reference once, outside the loop
    #for sample in hCoV229E_Rluc p10_DMSO p10_K22; do
    for sample in p10_K7523 p16_DMSO p16_K22 p16_X7523; do
        bwa mem -M -t 16 ${ref_fa} trimmed/${sample}_trimmed_P_1.fastq trimmed/${sample}_trimmed_P_2.fastq | samtools view -bS - > bams/${sample}_genome_alignment.bam
    done
    
    conda activate viral-ngs4
    #for sample in hCoV229E_Rluc p10_DMSO p10_K22; do
    #for sample in p10_K7523 p16_DMSO p16_K22 p16_X7523; do
    for sample in p16_K22; do
        picard AddOrReplaceReadGroups I=bams/${sample}_genome_alignment.bam O=~/DATA_D/Data_Pietschmann_229ECoronavirus_Mutations_2025/viralngs/data/00_raw/${sample}.bam SORT_ORDER=coordinate CREATE_INDEX=true RGPL=illumina RGID=$sample RGSM=$sample RGLB=standard RGPU=$sample VALIDATION_STRINGENCY=LENIENT; \
    done
    conda deactivate
    
    # -- ! First, leave samples-assembly.txt empty so that only the depletion steps are run!
    docker run -it -v /mnt/md1/DATA_D/Data_Pietschmann_229ECoronavirus_Mutations_2025/viralngs:/work -v /home/jhuang/Tools/viral-ngs_docker:/home/jhuang/Tools/viral-ngs_docker -v /home/jhuang/REFs:/home/jhuang/REFs -v /home/jhuang/Tools/GenomeAnalysisTK-3.6:/home/jhuang/Tools/GenomeAnalysisTK-3.6 -v /home/jhuang/Tools/novocraft_v3:/home/jhuang/Tools/novocraft_v3 -v /usr/local/bin/gatk:/usr/local/bin/gatk   own_viral_ngs bash
    cd /work
    snakemake --directory /work --printshellcmds --cores 40
    
    # -- ! Second, run the assembly steps manually
    # --> Iteratively add the unfinished assemblies to the list, replacing one each time, and run "snakemake --directory /work --printshellcmds --cores 40"
    
        # # ---- NOTE that the following steps need rerun --> DOES NOT WORK, USE STRATEGY ABOVE ----
        # #for sample in p10_K22 p10_K7523; do
        # for sample in hCoV229E_Rluc p10_DMSO p10_K22 p10_K7523  p16_DMSO p16_K22 p16_X7523; do
        #     bin/read_utils.py merge_bams data/01_cleaned/${sample}.cleaned.bam tmp/01_cleaned/${sample}.cleaned.bam --picardOptions SORT_ORDER=queryname
        #     bin/read_utils.py rmdup_mvicuna_bam tmp/01_cleaned/${sample}.cleaned.bam data/01_per_sample/${sample}.cleaned.bam --JVMmemory 30g
        # done
        #
        # #Note that the error generated by nextflow is from the step gapfill_gap2seq!
        # for sample in hCoV229E_Rluc p10_DMSO p10_K22 p10_K7523  p16_DMSO p16_K22 p16_X7523; do
        #     bin/assembly.py assemble_spades data/01_per_sample/${sample}.taxfilt.bam /home/jhuang/REFs/viral_ngs_dbs/trim_clip/contaminants.fasta tmp/02_assembly/${sample}.assembly1-spades.fasta --nReads 10000000 --threads 15 --memLimitGb 12
        # done
        # for sample in hCoV229E_Rluc p10_DMSO p10_K22 p10_K7523  p16_DMSO p16_K22 p16_X7523; do
        # for sample in p10_K22 p10_K7523; do
        #     bin/assembly.py order_and_orient tmp/02_assembly/${sample}.assembly1-spades.fasta refsel_db/refsel.fasta tmp/02_assembly/${sample}.assembly2-scaffolded.fasta --min_pct_contig_aligned 0.05 --outAlternateContigs tmp/02_assembly/${sample}.assembly2-alternate_sequences.fasta --nGenomeSegments 1 --outReference tmp/02_assembly/${sample}.assembly2-scaffold_ref.fasta --threads 15
        # done
        #
        # for sample in hCoV229E_Rluc p10_DMSO p10_K22 p10_K7523  p16_DMSO p16_K22 p16_X7523; do
        #     bin/assembly.py gapfill_gap2seq tmp/02_assembly/${sample}.assembly2-scaffolded.fasta data/01_per_sample/${sample}.cleaned.bam tmp/02_assembly/${sample}.assembly2-gapfilled.fasta --memLimitGb 12 --maskErrors --randomSeed 0 --loglevel DEBUG
        # done
    
    #IMPORTANT: Rerun the following commands!
    for sample in hCoV229E_Rluc  p10_DMSO p10_K22 p10_K7523  p16_DMSO p16_K22 p16_X7523; do
        bin/assembly.py impute_from_reference tmp/02_assembly/${sample}.assembly2-gapfilled.fasta tmp/02_assembly/${sample}.assembly2-scaffold_ref.fasta tmp/02_assembly/${sample}.assembly3-modify.fasta --newName ${sample} --replaceLength 55 --minLengthFraction 0.05 --minUnambig 0.05 --index  --loglevel DEBUG
    done
    
        # for sample in hCoV229E_Rluc p10_DMSO p10_K22 p10_K7523  p16_DMSO p16_K22 p16_X7523; do
        #     bin/assembly.py refine_assembly tmp/02_assembly/${sample}.assembly3-modify.fasta data/01_per_sample/${sample}.cleaned.bam tmp/02_assembly/${sample}.assembly4-refined.fasta --outVcf tmp/02_assembly/${sample}.assembly3.vcf.gz --min_coverage 2 --novo_params '-r Random -l 20 -g 40 -x 20 -t 502' --threads 15  --loglevel DEBUG
        #     bin/assembly.py refine_assembly tmp/02_assembly/${sample}.assembly4-refined.fasta data/01_per_sample/${sample}.cleaned.bam data/02_assembly/${sample}.fasta --outVcf tmp/02_assembly/${sample}.assembly4.vcf.gz --min_coverage 3 --novo_params '-r Random -l 20 -g 40 -x 20 -t 100' --threads 15  --loglevel DEBUG
        # done
    
    # -- ! Third, fill in samples-assembly.txt completely and run "snakemake --directory /work --printshellcmds --cores 40"
    
    # ---------------------------- List of BUGs in the docker pipeline, mostly due to version incompatibilities ----------------------------
    #BUG_1: FileNotFoundError: [Errno 2] No such file or directory: '/home/jhuang/Tools/samtools-1.9/samtools': '/home/jhuang/Tools/samtools-1.9/samtools'
    #DEBUG_1 (DEPRECATED):
            # - In docker install independent samtools
            conda create -n samtools-1.9-env samtools=1.9 -c bioconda -c conda-forge
            # - persist the modified container as an image, so that the own image can be run next time
            docker ps
            #CONTAINER ID   IMAGE                              COMMAND   CREATED         STATUS         PORTS     NAMES
            #881a1ad6a990   quay.io/broadinstitute/viral-ngs   "bash"    8 minutes ago   Up 8 minutes             intelligent_yalow
            docker commit 881a1ad6a990 own_viral_ngs
            docker image ls
            docker run -it own_viral_ngs bash
            #Change the path as "/opt/miniconda/envs/samtools-1.9-env/bin/samtools" in /work/bin/tools/samtools.py
            #         If a tool other than samtools cannot be installed, use the same method as above to install it in own_viral_ngs!
    #DEBUG_1_BETTER_SIMPLE: TOOL_VERSION = '1.6' --> '1.9' in ~/Tools/viral-ngs_docker/bin/tools/samtools.py
    
    #BUG_2:
            bin/taxon_filter.py deplete data/00_raw/2040_04.bam tmp/01_cleaned/2040_04.raw.bam tmp/01_cleaned/2040_04.bmtagger_depleted.bam tmp/01_cleaned/2040_04.rmdup.bam data/01_cleaned/2040_04.cleaned.bam --bmtaggerDbs /home/jhuang/REFs/viral_ngs_dbs/bmtagger_dbs_remove/hg19 /home/jhuang/REFs/viral_ngs_dbs/bmtagger_dbs_remove/metagenomics_contaminants_v3 /home/jhuang/REFs/viral_ngs_dbs/bmtagger_dbs_remove/GRCh37.68_ncRNA-GRCh37.68_transcripts-HS_rRNA_mitRNA --blastDbs /home/jhuang/REFs/viral_ngs_dbs/blast_dbs_remove/hybsel_probe_adapters /home/jhuang/REFs/viral_ngs_dbs/blast_dbs_remove/metag_v3.ncRNA.mRNA.mitRNA.consensus --threads 15 --srprismMemory 14250 --JVMmemory 50g --loglevel DEBUG
            #2025-05-23 09:58:45,326 - __init__:445:_attempt_install - DEBUG - Currently installed version of blast: 2.7.1-h4422958_6
            #2025-05-23 09:58:45,327 - __init__:448:_attempt_install - DEBUG - Expected version of blast:            2.6.0
            #2025-05-23 09:58:45,327 - __init__:449:_attempt_install - DEBUG - Incorrect version of blast installed. Removing it...
    #DEBUG_2: TOOL_VERSION = "2.6.0" --> "2.7.1" in ~/Tools/viral-ngs_docker/bin/tools/blast.py
    
    #BUG_3:
            bin/read_utils.py bwamem_idxstats data/01_cleaned/1762_04.cleaned.bam /home/jhuang/REFs/viral_ngs_dbs/spikeins/ercc_spike-ins.fasta --outStats reports/spike_count/1762_04.spike_count.txt --minScoreToFilter 60 --loglevel DEBUG
    #DEBUG_3: TOOL_VERSION = "0.7.15" --> "0.7.17" in ~/Tools/viral-ngs_docker/bin/tools/bwa.py
    
    #BUG_4: FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/bin/trimmomatic': '/usr/local/bin/trimmomatic'
    #DEBUG_4: TOOL_VERSION = "0.36" --> "0.38" in ~/Tools/viral-ngs_docker/bin/tools/trimmomatic.py
    
    #BUG_5: FileNotFoundError: [Errno 2] No such file or directory: '/usr/bin/spades.py': '/usr/bin/spades.py'
    #DEBUG_5:  TOOL_VERSION = "0.36" --> "0.38" in ~/Tools/viral-ngs_docker/bin/tools/trimmomatic.py
    #                def install_and_get_path(self):
    #                        # the conda version wraps the jar file with a shell script
    #                        return 'trimmomatic'
    
    #BUG_6: bin/assembly.py order_and_orient tmp/02_assembly/2039_04.assembly1-spades.fasta refsel_db/refsel.fasta tmp/02_assembly/2039_04.assembly2-scaffolded.fasta --min_pct_contig_aligned 0.05 --outAlternateContigs tmp/02_assembly/2039_04.assembly2-alternate_sequences.fasta --nGenomeSegments 1 --outReference tmp/02_assembly/2039_04.assembly2-scaffold_ref.fasta --threads 15 --loglevel DEBUG
            #2025-05-23 17:40:19,526 - __init__:445:_attempt_install - DEBUG - Currently installed version of mummer4: 4.0.0beta2-pl526hf484d3e_4
            #2025-05-23 17:40:19,527 - __init__:448:_attempt_install - DEBUG - Expected version of mummer4:            4.0.0rc1
            #2025-05-23 17:40:19,527 - __init__:449:_attempt_install - DEBUG - Incorrect version of mummer4 installed. Removing it..
    #DEBUG_6:  TOOL_VERSION = "4.0.0rc1" --> "4.0.0beta2" in ~/Tools/viral-ngs_docker/bin/tools/mummer.py
    
    #BUG_7: bin/assembly.py order_and_orient tmp/02_assembly/2039_04.assembly1-spades.fasta refsel_db/refsel.fasta tmp/02_assembly/2039_04.assembly2-scaffolded.fasta --min_pct_contig_aligned 0.05 --outAlternateContigs tmp/02_assembly/2039_04.assembly2-alternate_sequences.fasta --nGenomeSegments 1 --outReference tmp/02_assembly/2039_04.assembly2-scaffold_ref.fasta --threads 15 --loglevel DEBUG
            File "bin/assembly.py", line 549, in 
              base_counts = [sum([len(seg.seq.replace("N", "")) for seg in scaffold]) \
            AttributeError: 'Seq' object has no attribute 'replace'
    #DEBUG_7: base_counts = [sum([len(seg.seq.replace("N", "")) for seg in scaffold]) --> base_counts = [sum([len(seg.seq.ungap('N')) for seg in scaffold]) in ~/Tools/viral-ngs_docker/bin/assembly.py
    
    #BUG_8: bin/assembly.py refine_assembly tmp/02_assembly/1243_2.assembly3-modify.fasta data/01_per_sample/1243_2.cleaned.bam tmp/02_assembly/1243_2.assembly4-refined.fasta --outVcf tmp/02_assembly/1243_2.assembly3.vcf.gz --min_coverage 2 --novo_params '-r Random -l 20 -g 40 -x 20 -t 502' --threads 15 --loglevel DEBUG
            File "/work/bin/tools/gatk.py", line 75, in execute
            FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/bin/gatk': '/usr/local/bin/gatk'
    #DEBUG_8: -v /usr/local/bin/gatk:/usr/local/bin/gatk in 'docker run' and change default python in the script via a shebang; TOOL_VERSION = "3.8" --> "3.6" in ~/Tools/viral-ngs_docker/bin/tools/gatk.py
    
    #BUG_9: pyyaml is missing!
    #DEBUG_9: NO_ERROR if rerun!
            bin/assembly.py impute_from_reference tmp/02_assembly/2039_04.assembly2-gapfilled.fasta tmp/02_assembly/2039_04.assembly2-scaffold_ref.fasta tmp/02_assembly/2039_04.assembly3-modify.fasta --newName 2039_04 --replaceLength 55 --minLengthFraction 0.05 --minUnambig 0.05 --index --loglevel DEBUG
            #for sample in 2039_04 2040_04; do
            for sample in 1762_04 1243_2 875_04; do
                bin/assembly.py impute_from_reference tmp/02_assembly/${sample}.assembly2-gapfilled.fasta tmp/02_assembly/${sample}.assembly2-scaffold_ref.fasta tmp/02_assembly/${sample}.assembly3-modify.fasta --newName ${sample} --replaceLength 55 --minLengthFraction 0.05 --minUnambig 0.05 --index --loglevel DEBUG
            done
    
    #BUG_10: bin/reports.py consolidate_fastqc reports/fastqc/2039_04/align_to_self reports/fastqc/2040_04/align_to_self reports/fastqc/1762_04/align_to_self reports/fastqc/1243_2/align_to_self reports/fastqc/875_04/align_to_self reports/summary.fastqc.align_to_self.txt
    #DEBUG_10: File "bin/intrahost.py", line 527 and line 579 in merge_to_vcf
            #
            #MODIFIED_BACK
            samp_to_seqIndex[sampleName] = seq.seq.ungap('-')
            #samp_to_seqIndex[sampleName] = seq.seq.replace("-", "")
    
    #BUG_11: bin/interhost.py multichr_mafft ref_genome/reference.fasta data/02_assembly/2039_04.fasta data/02_assembly/2040_04.fasta data/02_assembly/1762_04.fasta data/02_assembly/1243_2.fasta data/02_assembly/875_04.fasta data/03_multialign_to_ref --ep 0.123 --maxiters 1000 --preservecase --localpair --outFilePrefix aligned --sampleNameListFile data/03_multialign_to_ref/sampleNameList.txt --threads 15 --loglevel DEBUG
            #2025-05-26 15:04:19,014 - cmd:195:main_argparse - INFO - command: bin/interhost.py multichr_mafft inFastas=['ref_genome/reference.fasta', 'data/02_assembly/2039_04.fasta', 'data/02_assembly/2040_04.fasta', 'data/02_assembly/1762_04.fasta', 'data/02_assembly/1243_2.fasta', 'data/02_assembly/875_04.fasta'] localpair=True globalpair=None preservecase=True reorder=None gapOpeningPenalty=1.53 ep=0.123 verbose=False outputAsClustal=None maxiters=1000 outDirectory=data/03_multialign_to_ref outFilePrefix=aligned sampleRelationFile=None sampleNameListFile=data/03_multialign_to_ref/sampleNameList.txt threads=15 loglevel=DEBUG tmp_dir=/tmp tmp_dirKeep=False
            #2025-05-26 15:04:19,014 - cmd:209:main_argparse - DEBUG - using tempDir: /tmp/tmp-interhost-multichr_mafft-nuws9mhp
            #2025-05-26 15:04:21,085 - __init__:445:_attempt_install - DEBUG - Currently installed version of mafft: 7.402-0
            #2025-05-26 15:04:21,085 - __init__:448:_attempt_install - DEBUG - Expected version of mafft: 7.221
            #2025-05-26 15:04:21,085 - __init__:449:_attempt_install - DEBUG - Incorrect version of mafft installed. Removing it...
    #DEBUG_11: TOOL_VERSION = "7.221" --> "7.402" in ~/Tools/viral-ngs_docker/bin/tools/mafft.py
    
    #BUG_12: bin/interhost.py snpEff data/04_intrahost/isnvs.vcf.gz PP810610.1 data/04_intrahost/isnvs.annot.vcf.gz j.huang@uke.de --loglevel DEBUG
            #2025-06-10 13:14:07,526 - __init__:445:_attempt_install - DEBUG - Currently installed version of snpeff: 4.3.1t-3
            #2025-06-10 13:14:07,527 - __init__:448:_attempt_install - DEBUG - Expected version of snpeff: 4.1l
    #DEBUG_12: -v /usr/local/bin/gatk:/usr/local/bin/gatk in 'docker run' and change default python in the script via a shebang; TOOL_VERSION = "4.1l" --> "4.3.1t" in ~/Tools/viral-ngs_docker/bin/tools/snpeff.py
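
Most of the bugs above follow one pattern: the DEBUG log reports an installed version and an expected version, and the fix is to edit TOOL_VERSION in the matching bin/tools/*.py. The following Python sketch collects all such mismatches from a saved log in one pass. It assumes the log wording shown above ("Currently installed version of X: ..." and "Expected version of X: ...") and assumes that everything after the first hyphen of the installed version is a conda build string. The function name `version_mismatches` is hypothetical and not part of viral-ngs.

```python
import re

# Patterns taken from the wording of the DEBUG lines quoted above.
INSTALLED = re.compile(r"Currently installed version of (\S+):\s+(\S+)")
EXPECTED = re.compile(r"Expected version of (\S+):\s+(\S+)")


def version_mismatches(log_text):
    """Map tool name -> (installed, expected) for every mismatching tool."""
    installed, expected = {}, {}
    for line in log_text.splitlines():
        match = INSTALLED.search(line)
        if match:
            # Assumption: drop the conda build string after the first hyphen.
            installed[match.group(1)] = match.group(2).split("-")[0]
            continue
        match = EXPECTED.search(line)
        if match:
            expected[match.group(1)] = match.group(2)
    return {
        tool: (installed[tool], want)
        for tool, want in expected.items()
        if tool in installed and installed[tool] != want
    }


# Log lines copied from BUG_2 and BUG_6 above.
LOG = """\
2025-05-23 09:58:45,326 - __init__:445:_attempt_install - DEBUG - Currently installed version of blast: 2.7.1-h4422958_6
2025-05-23 09:58:45,327 - __init__:448:_attempt_install - DEBUG - Expected version of blast:            2.6.0
2025-05-23 17:40:19,526 - __init__:445:_attempt_install - DEBUG - Currently installed version of mummer4: 4.0.0beta2-pl526hf484d3e_4
2025-05-23 17:40:19,527 - __init__:448:_attempt_install - DEBUG - Expected version of mummer4:            4.0.0rc1
"""

for tool, (have, want) in version_mismatches(LOG).items():
    print(f"{tool}: set TOOL_VERSION from {want!r} to {have!r}")
```

Each reported pair corresponds to one TOOL_VERSION edit of the kind listed in the DEBUG_n notes.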
  7. Comparing intra- and inter-host variants, and checking the variants against the alignments of the assemblies to confirm their correctness.

    From step 5, only 3 inter-host variants were confirmed: 10871, 19289, and 23435.
    
    PP810610    10871   hCoV229E_Rluc   hCoV229E_Rluc       C,T 0.0057070386810399495   0.011348936781066188    1.0 missense_variant    10654C>T    Leu3552Phe  3552    6758    Gene_217_20492  XBA84229.1
    PP810610    10871   p10_DMSO    p10_DMSO        C,T 0.0577716643741403  0.10886819833916395 1.0 missense_variant    10654C>T    Leu3552Phe  3552    6758    Gene_217_20492  XBA84229.1
    PP810610    10871   p10_K22 p10_K22     C,T 1.0 0.0 1.0 missense_variant    10654C>T    Leu3552Phe  3552    6758    Gene_217_20492  XBA84229.1
    PP810610    10871   p10_K7523   p10_K7523       C,T 0.8228321896444167  0.2915587546587828  1.0 missense_variant    10654C>T    Leu3552Phe  3552    6758    Gene_217_20492  XBA84229.1
    PP810610    10871   p16_DMSO    p16_DMSO        C,T 0.02927088877062267 0.05682820768240093 1.0 missense_variant    10654C>T    Leu3552Phe  3552    6758    Gene_217_20492  XBA84229.1
    PP810610    10871   p16_K22 p16_K22     C,T 0.9911209766925638  0.017600372505084394    1.0 missense_variant    10654C>T    Leu3552Phe  3552    6758    Gene_217_20492  XBA84229.1
    PP810610    10871   p16_X7523   p16_X7523       C,T 0.8776699029126214  0.21473088886794223 1.0 missense_variant    10654C>T    Leu3552Phe  3552    6758    Gene_217_20492  XBA84229.1
    
    PP810610    19289   hCoV229E_Rluc   hCoV229E_Rluc       G,T 0.0 0.0 1.0 missense_variant    19073G>T    Gly6358Val  6358    6758    Gene_217_20492  XBA84229.1
    PP810610    19289   p10_DMSO    p10_DMSO        G,T 0.0 0.0 1.0 missense_variant    19073G>T    Gly6358Val  6358    6758    Gene_217_20492  XBA84229.1
    PP810610    19289   p10_K22 p10_K22     G,T 1.0 0.0 1.0 missense_variant    19073G>T    Gly6358Val  6358    6758    Gene_217_20492  XBA84229.1
    PP810610    19289   p10_K7523   p10_K7523       G,T 0.0 0.0 1.0 missense_variant    19073G>T    Gly6358Val  6358    6758    Gene_217_20492  XBA84229.1
    PP810610    19289   p16_DMSO    p16_DMSO        G,T 0.0 0.0 1.0 missense_variant    19073G>T    Gly6358Val  6358    6758    Gene_217_20492  XBA84229.1
    PP810610    19289   p16_K22 p16_K22     G,T 0.9884823848238482  0.02276991943361173 1.0 missense_variant    19073G>T    Gly6358Val  6358    6758    Gene_217_20492  XBA84229.1
    PP810610    19289   p16_X7523   p16_X7523       G,T 0.0 0.0 1.0 missense_variant    19073G>T    Gly6358Val  6358    6758    Gene_217_20492  XBA84229.1
    
    PP810610    23435   hCoV229E_Rluc   hCoV229E_Rluc       C,T 0.0 0.0 1.0 missense_variant    2941C>T Leu981Phe   981 1173    Gene_20494_24015    XBA84230.1
    PP810610    23435   p10_DMSO    p10_DMSO        C,T 0.031912415560214305    0.061788026586653055    1.0 missense_variant    2941C>T Leu981Phe   981 1173    Gene_20494_24015    XBA84230.1
    PP810610    23435   p10_K22 p10_K22     C,T 1.0 0.0 1.0 missense_variant    2941C>T Leu981Phe   981 1173    Gene_20494_24015    XBA84230.1
    PP810610    23435   p10_K7523   p10_K7523       C,T 0.8352090032154341  0.27526984832663026 1.0 missense_variant    2941C>T Leu981Phe   981 1173    Gene_20494_24015    XBA84230.1
    PP810610    23435   p16_DMSO    p16_DMSO        C,T 0.0 0.0 1.0 missense_variant    2941C>T Leu981Phe   981 1173    Gene_20494_24015    XBA84230.1
    PP810610    23435   p16_K22 p16_K22     C,T 0.958498023715415   0.07955912449811753 1.0 missense_variant    2941C>T Leu981Phe   981 1173    Gene_20494_24015    XBA84230.1
    PP810610    23435   p16_X7523   p16_X7523       C,T 0.13175164058556285 0.22878629157715102 1.0 missense_variant    2941C>T Leu981Phe   981 1173    Gene_20494_24015    XBA84230.1
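
The frequencies above can be read as three groups: absent or rare, mixed within the sample, and close to fixed. Here is a minimal Python sketch of that reading. The lower cut-off of 0.05 is the one used for variant_annot_0.05; the upper cut-off of 0.95 is an assumption chosen only for illustration. The names `classify` and `positions_passing` are hypothetical and not part of viral-ngs.

```python
def classify(freq, low=0.05, high=0.95):
    """Label one iSNV frequency (frequency of the second allele).

    low:  cut-off taken from the variant_annot_0.05 sheet.
    high: assumed cut-off for calling a variant close to fixed.
    """
    if freq < low:
        return "absent_or_rare"
    if freq > high:
        return "near_fixed"
    return "mixed"


def positions_passing(records, threshold=0.05):
    """Sorted positions at which at least one sample reaches the threshold."""
    return sorted({pos for pos, _sample, freq in records if freq >= threshold})


# (position, sample, iSNV frequency) triples copied from the rows above.
RECORDS = [
    (10871, "hCoV229E_Rluc", 0.0057070386810399495),
    (10871, "p10_DMSO", 0.0577716643741403),
    (19289, "p10_DMSO", 0.0),
    (19289, "p16_K22", 0.9884823848238482),
    (23435, "p16_DMSO", 0.0),
]

for pos, sample, freq in RECORDS:
    print(pos, sample, f"{freq:.3f}", classify(freq))
print(positions_passing(RECORDS))
```

A position is kept for a sheet as soon as one sample passes the threshold, and all samples at that position are then reported together.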
  8. Generate variant_annot.xls and coverages.xls

    sudo chown -R jhuang:jhuang data
    # -- generate isnvs_annot_complete__.txt, isnvs_annot_0.05.txt from ~/DATA/Data_Pietschmann_RSV_Probe3/data/04_intrahost
    cp isnvs.annot.txt isnvs.annot_complete.txt
    ~/Tools/csv2xls-0.4/csv_to_xls.py isnvs.annot_complete.txt -d$'\t' -o isnvs.annot_complete.xls
    #delete the columns patient, time, Hw and Hs and the header in the xls and save as txt file.
    
    awk '{printf "%.3f\n", $5}' isnvs.annot_complete.csv > f5
    cut -f1-4 isnvs.annot_complete.csv > f1_4
    cut -f6- isnvs.annot_complete.csv > f6_
    paste f1_4 f5 > f1_5
    paste f1_5 f6_ > isnvs_annot_complete_.txt
    #correct f5 in header of isnvs_annot_complete_.txt to iSNV_freq
    #header: chr    pos sample  alleles iSNV_freq   eff_type    eff_codon_dna   eff_aa  eff_aa_pos  eff_prot_len    eff_gene    eff_protein
    ~/Tools/csv2xls-0.4/csv_to_xls.py isnvs_annot_complete_.txt -d$'\t' -o variant_annot.xls
    
    #MANUALLY generate variant_annot_0.01.csv variant_annot_0.05.csv
    awk ' $5 >= 0.05 ' isnvs_annot_complete_.txt > 0.05.csv
    cut -f2 0.05.csv
    
    awk ' $5 >= 0.01 ' isnvs_annot_complete_.txt > 0.01.csv
    cut -f2 0.05.csv | uniq > ids_0.05
    cut -f2 0.01.csv | uniq > ids_0.01
    
    #Replace '\n' with '\\t" isnvs_annot_complete_.txt >> isnvs_annot_0.05.txt\ngrep -P "PP810610\\t' in ids_0.05 and then deleting the 'pos' line
    #Replace '\n' with '\\t" isnvs_annot_complete_.txt >> isnvs_annot_0.01.txt\ngrep -P "PP810610\\t'  in ids_0.01 and then deleting the 'pos' line
    #Run ids_0.05 and ids_0.01
    
    cp ../../Outputs/Master_vcf/All_SNPs_indels_annotated.txt hCoV229E_Rluc_variants
    # Delete the three records that are already reported in the intra-host results from hCoV229E_Rluc_variants: 10871, 19289, 23435.
    PP810610       10871   C       T       SNP     C       C/T     T       C/T     C/T     T       C/T     missense_variant        MODERATE        MISSENSE        Ctt/Ttt p.Leu3552Phe/c.10654C>T 6757    CDS_1   protein_coding
    PP810610       19289   G       T       SNP     G       G       T       G       G       G/T     G       missense_variant        MODERATE        MISSENSE        Gtt/Ttt p.Val6358Phe/c.19072G>T 6757    CDS_1   protein_coding
    PP810610       23435   C       T       SNP     C       C       T       C/T     C       C/T     C/T     missense_variant        MODERATE        MISSENSE        Ctt/Ttt p.Leu981Phe/c.2941C>T   1173    CDS_2   protein_coding
    
    ~/Tools/csv2xls-0.4/csv_to_xls.py isnvs_annot_0.05.txt isnvs_annot_0.01.txt hCoV229E_Rluc_variants -d$'\t' -o variant_annot.xls
    #Modify sheetname to variant_annot_0.05 and variant_annot_0.01 and add the header in Excel file.
    #Note in the complete list, Set 2024 is NOT a subset of Set 2025 because the element 26283 is in set 2024 but missing from set 2025.
    
    # -- calculate the coverage
    samtools depth ./data/02_align_to_self/hCoV229E_Rluc.mapped.bam > hCoV229E_Rluc_cov.txt
    samtools depth ./data/02_align_to_self/p10_DMSO.mapped.bam > p10_DMSO_cov.txt
    samtools depth ./data/02_align_to_self/p10_K22.mapped.bam > p10_K22_cov.txt
    samtools depth ./data/02_align_to_self/p10_K7523.mapped.bam > p10_K7523_cov.txt
    ~/Tools/csv2xls-0.4/csv_to_xls.py hCoV229E_Rluc_cov.txt p10_DMSO_cov.txt p10_K22_cov.txt p10_K7523_cov.txt -d$'\t' -o coverages.xls
    #Plot the coverage to check whether it is continuous.
    samtools depth ./data/02_align_to_self/p16_DMSO.mapped.bam > p16_DMSO_cov.txt
    samtools depth ./data/02_align_to_self/p16_K22.mapped.bam > p16_K22_cov.txt
    samtools depth ./data/02_align_to_self/p16_X7523.mapped.bam > p16_K7523_cov.txt
    ~/Tools/csv2xls-0.4/csv_to_xls.py p16_DMSO_cov.txt p16_K22_cov.txt p16_K7523_cov.txt -d$'\t' -o coverages_p16.xls
    
            # Load required packages
            library(ggplot2)
            library(dplyr)
    
            # Read the coverage data
            cov_data <- read.table("p16_K7523_cov.txt", header = FALSE, sep = "\t",
                            col.names = c("Chromosome", "Position", "Coverage"))
    
            # Create full position range for the given chromosome
            full_range <- data.frame(Position = seq(min(cov_data$Position), max(cov_data$Position)))
    
            # Merge with actual coverage data and fill missing positions with 0
            cov_full <- full_range %>%
            left_join(cov_data[, c("Position", "Coverage")], by = "Position") %>%
            mutate(Coverage = ifelse(is.na(Coverage), 0, Coverage))
    
            # Save the plot to PNG
            png("p16_K7523_coverage_filled.png", width = 1200, height = 600)
    
            ggplot(cov_full, aes(x = Position, y = Coverage)) +
            geom_line(color = "steelblue", size = 0.3) +
            labs(title = "Coverage Plot for p16_K7523 (Missing = 0)",
            x = "Genomic Position",
            y = "Coverage Depth") +
            theme_minimal() +
            theme(
            plot.title = element_text(hjust = 0.5),
            axis.text = element_text(size = 10),
            axis.title = element_text(size = 12)
            )
    
            dev.off()
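
Whether the coverage is continuous can also be checked without a plot. Without the -a option, samtools depth leaves out positions with zero depth, so gaps appear as missing lines. The Python sketch below lists the uncovered intervals from the three-column depth output (chromosome, position, depth). The function name `coverage_gaps` and the `genome_length` parameter are hypothetical. Unlike the R code above, which only spans the minimum to maximum observed position, it needs the reference length to be supplied.

```python
def coverage_gaps(depth_lines, genome_length):
    """Return (start, end) intervals (1-based, inclusive) with no coverage.

    depth_lines:   iterable of 'chrom<TAB>pos<TAB>depth' lines, as written
                   by `samtools depth`; missing positions count as depth 0.
    genome_length: length of the reference (an input the caller must supply).
    """
    covered = set()
    for line in depth_lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 3:
            continue  # skip empty or malformed lines
        if int(parts[2]) > 0:
            covered.add(int(parts[1]))

    gaps = []
    start = None
    for pos in range(1, genome_length + 1):
        if pos not in covered:
            if start is None:
                start = pos  # a new gap opens here
        elif start is not None:
            gaps.append((start, pos - 1))  # the gap closed at the previous position
            start = None
    if start is not None:
        gaps.append((start, genome_length))
    return gaps


# Toy input: positions 3-5 and 7 have no coverage on a 7-base reference.
EXAMPLE_LINES = [
    "PP810610\t1\t10",
    "PP810610\t2\t12",
    "PP810610\t5\t0",
    "PP810610\t6\t7",
]
print(coverage_gaps(EXAMPLE_LINES, genome_length=7))
```

For a real sample, the lines of a file such as p16_K7523_cov.txt would be passed in together with the length of PP810610.fasta. An empty result means the coverage is continuous.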
  9. (Optional) Consensus sequences of each and of all isolates

    cat PP810610.1.fa OZ035258.1.fa MZ712010.1.fa OK662398.1.fa OK625404.1.fa KF293664.1.fa NC_002645.1.fa > all.fa
    cp data/02_assembly/*.fasta ./
    for sample in hCoV229E_Rluc p10_DMSO p10_K22 p10_K7523; do
        mv ${sample}.fasta ${sample}.fa
        cat ${sample}.fa >> all.fa    # append the sample consensus to all.fa
    done
    
    cat RSV_dedup.fa all.fa > RSV_all.fa
    mafft --clustalout --adjustdirection RSV_all.fa > RSV_all.aln
    snp-sites RSV_all.aln -o RSV_all_.aln
  10. Report

    Please find attached the variant analysis results for Thomas. Variant frequencies in the new samples are highlighted in yellow.
    
    Although PP810610 is used as the reference, only differences observed in the samples p10_DMSO, p10_K22, p10_K7523, p16_DMSO, p16_K22, and p16_X7523 compared to hCoV229E_Rluc are reported in the sheets variant_annot_0.05 and variant_annot_0.01 (see variant_annot.xls). Variants already present in hCoV229E_Rluc are excluded from these sheets. In total, 17 mutations were found in hCoV229E_Rluc relative to PP810610, detailed in the sheet “hCoV229E_Rluc_variants” (see variant_annot.xls).
    
    ------ Explanation of iSNV_freq in the sheets variant_annot_0.05 and variant_annot_0.01 ------
    
    The iSNV_freq column shows the frequency of the second allele at each position. For example, at position 23435 on chr PP810610:
    
    chr               Position    Sample            Alleles    iSNV_freq
    PP810610    23435    hCoV229E_Rluc    C,T        0
    PP810610    23435    p10_DMSO           C,T       0.032
    PP810610    23435    p10_K22              C,T       0.995
    PP810610    23435    p10_K7523          C,T       0.835
    PP810610    23435    p16_DMSO          C,T        0
    PP810610    23435    p16_K22              C,T       0.958
    PP810610    23435    p16_X7523          C,T       0.132
    
    The second allele (T) frequencies are:
    0 (only C)
    0.032 (3.2% T)
    0.995 (99.5% T)
    0.835 (83.5% T)
    0 (only C)
    0.958 (95.8% T)
    0.132 (13.2% T)
    
    # --
    
    Regarding the mutation at position 19289 — you're absolutely right, and I had also noticed the discrepancy.
    
    In the 2024 analysis, I performed intra-host variant calling, which detects only those variants with frequencies strictly between 0% and 100% within a single sample. Since position 19289 showed 100% G in p10_DMSO, 100% T in p10_K22, and 100% G in p10_K7523, it was not identified as an intra-host variant at that time. Rather, it's a clear example of an inter-host variant — a fixed difference between samples.
    
    In the 2025 analysis, I again used intra-host variant calling. This time, the mutation at position 19289 in p16_K22 was detected at 98.8% T, which is below 100% and therefore meets the intra-host criterion, so it appears in the intra-host variant table.
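
The difference between the two cases can be illustrated with a minimal Python sketch. It assumes only the criterion described above: a variant counts as intra-host when its frequency lies strictly between 0 and 1. The function name `classify_variant` is hypothetical, and real variant callers typically apply further filters (for example coverage and quality) that are not modelled here.

```python
def classify_variant(freq):
    """Classify an allele frequency (0..1) within a single sample.

    Assumption: 'intra-host' means strictly between 0 and 1;
    0 or 1 means the position is fixed in that sample.
    """
    if not 0.0 <= freq <= 1.0:
        raise ValueError(f"frequency out of range: {freq}")
    if 0.0 < freq < 1.0:
        return "intra-host"
    return "fixed"


# T allele at position 19289
print(classify_variant(1.0))    # p10_K22, 2024 analysis (100% T)  -> fixed
print(classify_variant(0.988))  # p16_K22, 2025 analysis (98.8% T) -> intra-host
```
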
    
    After noticing this, I also ran a dedicated inter-host variant calling analysis, which specifically highlights differences between samples rather than within them. The results can be found in the third table ("hCoV229E_Rluc_variants") of the variant_annot.xls file I sent you previously. As you’ll see, all 17 positions are identical across the 7 samples, indicating that no additional inter-host variants were detected beyond what we had already observed.
    
    Lastly, please find the coverage data in the attached files.
    
    # --
    
    Just following up on the mutation at position 19289. By tweaking some settings in the inter-host variant calling, we can also detect variants at positions like 19289. However, in these results, a “/” indicates intra-host variants that require further validation through intra-host variant calling. The intra-host variant calling uses a more precise mapping strategy, enabling a more accurate estimation of allele frequencies.
    
    Here’s an example from the inter-host variant table showing the mutation at 19289 with the adjusted settings:
    
            CHROM       POS      REF    ALT    TYPE    hCoV229E_Rluc    p10_DMSO    p10_K22    p10_K7523    p16_DMSO    p16_K22    p16_X7523
            PP810610    19289    G      T      SNP     G                G           T          G            G           G/T        G
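
To pick out the samples that need follow-up with intra-host variant calling, the row above can be scanned for genotypes containing "/". This is a minimal Python sketch with the header and row copied from the table. The helper name `mixed_samples` and the assumption that the first five columns are annotation are illustrative only.

```python
header = ("CHROM POS REF ALT TYPE hCoV229E_Rluc p10_DMSO p10_K22 "
          "p10_K7523 p16_DMSO p16_K22 p16_X7523").split()
row = "PP810610 19289 G T SNP G G T G G G/T G".split()


def mixed_samples(header, row, n_annotation=5):
    """Return sample names whose genotype contains '/' (mixed call).

    n_annotation is the number of leading non-sample columns
    (CHROM, POS, REF, ALT, TYPE in the table above).
    """
    genotypes = dict(zip(header[n_annotation:], row[n_annotation:]))
    return [sample for sample, gt in genotypes.items() if "/" in gt]


print(mixed_samples(header, row))
```
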