Author Archives: gene_x

Which assemblies of S.epidermidis_1585_5179_HD05 have been submitted to NCBI?

TODO: after annotation check if the app-gene has been truncated in 5179-R1. The 5179 generates 180 kDa and 220 kDa (isoforms) while 5179-R1 generated only a 140 kDa protein.

# See the paper: Induction of Staphylococcus epidermidis biofilm formation via proteolytic processing of the accumulation-associated protein by staphylococcal and host proteases

#------------ 5179 ------------
# -- from unicycler (5179_bold) --
    total         16       4   2,524,342   2,469,173         2,469,173
        1          1       1   2,469,173   2,469,173         2,469,173     complete
        2          1       1      17,749      17,749            17,749     complete
        3          1       0       4,761       4,761             4,761   incomplete
        4          1       1       4,595       4,595             4,595     complete -->
        5          1       0       3,735       3,735             3,735   incomplete
        6          1       0       3,718       3,718             3,718   incomplete
        7          1       0       3,573       3,573             3,573   incomplete
        8          1       1       2,449       2,449             2,449     complete -->
        9          1       0       2,411       2,411             2,411   incomplete
       10          1       0       2,371       2,371             2,371   incomplete
       11          1       0       2,365       2,365             2,365   incomplete
       12          1       0       1,637       1,637             1,637   incomplete
       13          1       0       1,568       1,568             1,568   incomplete
       14          1       0       1,505       1,505             1,505   incomplete
       15          1       0       1,403       1,403             1,403   incomplete
       16          1       0       1,329       1,329             1,329   incomplete

      1   2,469,173    1.00x   UniRef90_Q5HJZ9       1,901,872   reverse     100.0%     100.0%
      2      17,749    2.27x   UniRef90_A0A0H2VIR3       4,771   forward      93.2%      99.7%
      4       4,595   10.19x   none found (Submitted!)
      8       2,449   17.14x   none found (Submitted!)

# -- from trycycler (Submitted after adding the contig 4 and 8 since longer!) --
cluster_001_consensus   2471445 23      70      71  (Submitted!)
cluster_002_consensus   17749   23      70      71  (Submitted!)

#------------ 5179-R1 ------------
# -- from unicycler (5179R1_normal) --
Component   Segments   Links   Length      N50         Longest segment   Status
    total          2       2   2,486,311   2,468,563         2,468,563
        1          1       1   2,468,563   2,468,563         2,468,563   complete
        2          1       1      17,748      17,748            17,748   complete

# -- from trycycler --
cluster_001_consensus   2470004 23      70      71 (Submitted!)
cluster_002_consensus   17748   23      70      71 (Submitted!)
luster_003 with only one contig
luster_004 with only one contig
luster_005 with only one contig

#makeblastdb -in 5179_trycycler_chr.fasta -dbtype nucl
#blastn -db 5179_trycycler_chr.fasta -query 2-16.fasta -out 2-16_vs_trycycler_1.blastn -evalue 0.00000000001 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1

#------------ 1585 ------------

# 1585_normal/unicycler/chrom_plasmids.fasta
# -- from unicycler (1585_normal) --
Segment   Length      Depth    Starting gene     Position    Strand    Identity   Coverage
      1   2,443,574    1.00x   UniRef90_Q5HJZ9   1,085,369   forward      99.3%     100.0%
      2       9,014    3.72x   none found
      7       2,344   11.44x   none found
      8       2,255    9.81x   none found
# after correction using polypolish and polca (see ~/DATA/Data_Holger_S.epidermidis_1585_5179_HD05/1585_normal/unicycler)
1       2443600 3       70      71 (Submitted!)
2       9014    2478515 70      71 (Submitted!)
7       2344    2487661 70      71 (Submitted!)
8       2255    2490042 70      71 (Submitted!)

# -- from trycycler --
cluster_001_consensus   2443176 23      70      71
cluster_002_consensus   9014    23      70      71
cluster_003_consensus   2255    23      70      71
cluster_004_consensus   2344    23      70      71

RNA-seq data analysis on S. epidermidis 1585

  1. prepare input reads and samplesheet

    ln -s ./F23A430001245_SEPiovzP/soapnuke/clean/EX_15_min_A/EX_15_min_A_1.fq.gz .
    ln -s ./F23A430001245_SEPiovzP/soapnuke/clean/EX_15_min_A/EX_15_min_A_2.fq.gz .
    ln -s ./F23A430001245_SEPiovzP/soapnuke/clean/EX_15_min_B/EX_15_min_B_1.fq.gz .
    ln -s ./F23A430001245_SEPiovzP/soapnuke/clean/EX_15_min_B/EX_15_min_B_2.fq.gz .
    ln -s ./F23A430001245_SEPiovzP/soapnuke/clean/EX_15_min_C/EX_15_min_C_1.fq.gz .
    ln -s ./F23A430001245_SEPiovzP/soapnuke/clean/EX_15_min_C/EX_15_min_C_2.fq.gz .
    
    ln -s ./F23A430001245_SEPiovzP/soapnuke/clean/EX_1h_A/EX_1h_A_1.fq.gz .
    ln -s ./F23A430001245_SEPiovzP/soapnuke/clean/EX_1h_A/EX_1h_A_2.fq.gz .
    ln -s ./F23A430001245_SEPiovzP/soapnuke/clean/EX_1h_B/EX_1h_B_1.fq.gz .
    ln -s ./F23A430001245_SEPiovzP/soapnuke/clean/EX_1h_B/EX_1h_B_2.fq.gz .
    ln -s ./F23A430001245_SEPiovzP/soapnuke/clean/EX_1h_C/EX_1h_C_1.fq.gz .
    ln -s ./F23A430001245_SEPiovzP/soapnuke/clean/EX_1h_C/EX_1h_C_2.fq.gz .
    
    ln -s ./F23A430001245_SEPiovzP/soapnuke/clean/EX_2h_A/EX_2h_A_1.fq.gz .
    ln -s ./F23A430001245_SEPiovzP/soapnuke/clean/EX_2h_A/EX_2h_A_2.fq.gz .
    ln -s ./F23A430001245_SEPiovzP/soapnuke/clean/EX_2h_B/EX_2h_B_1.fq.gz .
    ln -s ./F23A430001245_SEPiovzP/soapnuke/clean/EX_2h_B/EX_2h_B_2.fq.gz .
    ln -s ./F23A430001245_SEPiovzP/soapnuke/clean/EX_2h_C/EX_2h_C_1.fq.gz .
    ln -s ./F23A430001245_SEPiovzP/soapnuke/clean/EX_2h_C/EX_2h_C_2.fq.gz .
    
    ln -s ./F23A430001245_SEPiovzP/soapnuke/clean/EX_4h_A/EX_4h_A_1.fq.gz .
    ln -s ./F23A430001245_SEPiovzP/soapnuke/clean/EX_4h_A/EX_4h_A_2.fq.gz .
    ln -s ./F23A430001245_SEPiovzP/soapnuke/clean/EX_4h_B/EX_4h_B_1.fq.gz .
    ln -s ./F23A430001245_SEPiovzP/soapnuke/clean/EX_4h_B/EX_4h_B_2.fq.gz .
    ln -s ./F23A430001245_SEPiovzP/soapnuke/clean/EX_4h_C/EX_4h_C_1.fq.gz .
    ln -s ./F23A430001245_SEPiovzP/soapnuke/clean/EX_4h_C/EX_4h_C_2.fq.gz .
    
    ln -s ./F23A430001245_SEPiovzP/soapnuke/clean/EX_6h_A/EX_6h_A_1.fq.gz .
    ln -s ./F23A430001245_SEPiovzP/soapnuke/clean/EX_6h_A/EX_6h_A_2.fq.gz .
    ln -s ./F23A430001245-01_SEPzqcuP/soapnuke/clean/EX_6h_B/EX_6h_B_1.fq.gz .
    ln -s ./F23A430001245-01_SEPzqcuP/soapnuke/clean/EX_6h_B/EX_6h_B_2.fq.gz .
    ln -s ./F23A430001245-01_SEPzqcuP/soapnuke/clean/EX_6h_C/EX_6h_C_1.fq.gz .
    ln -s ./F23A430001245-01_SEPzqcuP/soapnuke/clean/EX_6h_C/EX_6h_C_2.fq.gz .
    
    ln -s ./F23A430001245_SEPiovzP/soapnuke/clean/EX_Day_4_A/EX_Day_4_A_1.fq.gz .
    ln -s ./F23A430001245_SEPiovzP/soapnuke/clean/EX_Day_4_A/EX_Day_4_A_2.fq.gz .
    ln -s ./F23A430001245_SEPiovzP/soapnuke/clean/EX_Day_4_B/EX_Day_4_B_1.fq.gz .
    ln -s ./F23A430001245_SEPiovzP/soapnuke/clean/EX_Day_4_B/EX_Day_4_B_2.fq.gz .
    ln -s ./F23A430001245_SEPiovzP/soapnuke/clean/EX_Day_4_C/EX_Day_4_C_1.fq.gz .
    ln -s ./F23A430001245_SEPiovzP/soapnuke/clean/EX_Day_4_C/EX_Day_4_C_2.fq.gz .
    
    sample,fastq_1,fastq_2,strandedness
    EX_15_min_A,EX_15_min_A_1.fq.gz,EX_15_min_A_2.fq.gz,auto
    EX_15_min_B,EX_15_min_B_1.fq.gz,EX_15_min_B_2.fq.gz,auto
    EX_15_min_C,EX_15_min_C_1.fq.gz,EX_15_min_C_2.fq.gz,auto
  2. Notes and Debugs

    #-- NOTE1: STAR cannot work simoutenously twice at the same time! --
    
    #-- NOTE2 --
    The GTF file might be corrupted!
      Stop at line : NC_019234.1    RefSeq  exon    1600    1977    .   +   0   ID "cds-HXG45_RS00010"; Parent "gene-HXG45_RS00010"; Note "incomplete%3B partial in the middle of a contig%3B missing N-terminus and C-terminus"; Ontology_term "GO:0015074"; end_range "1977,."; gbkey "CDS"; go_process "DNA integration|0015074||IEA"; inference "COORDINATES: similar to AA sequence:RefSeq:WP_010972061.1"; locus_tag "HXG45_RS00010"; partial "true"; product "DDE-type integrase/transposase/recombinase"; pseudo "true"; start_range ".,1600"; transl_table "11"
      Error Message: Cannot find gene_id!
    
    #cut -d$'\t' -f2 RP62A.gtf | sort -u
    cmsearch
    GeneMarkS-2+
    Protein Homology --> ProteinHomology
    RefSeq
    tRNAscan-SE
    
    #cut -d$'\t' -f3 RP62A.gtf | sort -u
    gene
    pseudogene
    
    CDS
    rRNA
    tRNA
    
    direct_repeat
    exon
    ncRNA
    binding_site
    region
    riboswitch
    RNase_P_RNA
    
    sequence_feature
    SRP_RNA
    tmRNA
    
    #-- BUG1: mamba update picard --
    #2.18.27-SNAPSHOT --> bioconda::picard=3.0.0
    conda install -c conda-forge mamba
    mamba update picard
      - ca-certificates   2023.7.22  hbcca054_0     conda-forge                    
      + ca-certificates  2023.11.17  hbcca054_0     conda-forge/linux-64     Cached
      - nextflow            20.04.1  hecda079_1     bioconda                       
      + nextflow            23.10.0  hdfd78af_0     bioconda/noarch          Cached
      - openjdk              11.0.1  h516909a_1016  conda-forge                    
      + openjdk              17.0.3  h58dac75_5     conda-forge/linux-64     Cached
      - picard                3.0.0  hdfd78af_0     bioconda                       
      + picard                3.1.1  hdfd78af_0     bioconda/noarch          Cached
    mamba update rsem
    
    #-- BUG2: mamba update gffread --
    cd /mnt/h1/jhuang/DATA/Data_Marc_RNA-seq_Sepidermidis/work/b4/703a2595f234c7866e49d9a033f943
    gffread \
          RP62A.gff \
          --keep-exon-attrs -F -T \
          -o RP62A.gtf
      - gffread   0.9.12  0           bioconda                    
      + gffread   0.12.7  hdcf5f25_3  bioconda/linux-64     Cached
    
    #-- NOTE: If you use nf-core/rnaseq for your analysis please cite: --
    #* The pipeline
    #  https://doi.org/10.5281/zenodo.1400710
    #* The nf-core framework
    #  https://doi.org/10.1038/s41587-020-0439-x
    #* Software dependencies
    #  https://github.com/nf-core/rnaseq/blob/master/CITATIONS.md
    
    #-- BUG3: under rnaseq_2021 open R --
    #using R host rather than conda R
    mamba remove r-base
    #-I"/home/jhuang/miniconda3/envs/rnaseq_2021/lib/R/include"
    if (!requireNamespace("BiocManager", quietly = TRUE))
        install.packages("BiocManager")
    BiocManager::install("SummarizedExperiment")
    BiocManager::install("tximport")
    #Error in citation("tximeta") : there is no package called ‘tximeta’
    #  Execution halted
    
    #sudo apt-get update
    #sudo apt-get install libcurl4-openssl-dev
    
    #BiocManager::install(version = "3.16")
    BiocManager::install(c("AnnotationDbi", "httr", "BiocFileCache", "biomaRt", "GenomicFeatures", "ensembldb", "curl", "AnnotationHub"))
    BiocManager::install(c("biomaRt", "GenomicFeatures", "ensembldb", "AnnotationHub", "tximeta"))
    BiocManager::install(c("tximeta"))
    
    #conda update -n base -c defaults conda
    #conda update --all
    
    #!/usr/bin/env Rscript
    /mnt/h1/jhuang/DATA/Data_Marc_RNA-seq_Sepidermidis/rnaseq/bin/salmon_tximport.r \
          NULL \
          salmon \
          salmon.merged
  3. prepare gtf file

    #  --gtf_extra_attributes        [string]  By default, the pipeline uses the `gene_name` field to obtain additional gene identifiers from the input GTF file when 
                                              running Salmon. [default: gene_name] 
    #  --gtf_group_features          [string]  Define the attribute type used to group features in the GTF file when running Salmon. [default: gene_id]
    #  --featurecounts_group_type    [string]  The attribute type used to group feature types in the GTF file when generating the biotype plot with 
                                              featureCounts. [default: gene_biotype] 
    #  --featurecounts_feature_type  [string]  By default, the pipeline assigns reads based on the 'exon' attribute within the GTF file. [default: exon]
    #--gtf_extra_attributes gene 
    #--gtf_group_features Parent --featurecounts_group_type gene_biotype --featurecounts_feature_type CDS (outdated since [CHANGE1,2,3]) 
    
    #under hamm
    conda activate rnaseq
    ln -s /home/jhuang/Tools/nf-core-rnaseq-3.12.0/ rnaseq
    
    # ------------ gff_to_gtf.py, then modify *.gtf file as follows -------------
    #Upload the scripts gff_to_gtf.py and modify_gtf.py.
    python3 gff_to_gtf.py RP62A.gff3 RP62A.gtf
    
    #[CHANGE1] ID in gene and Parent in CDS --> gene_id; ID in exon --> transcript_id; Name in gene --> gene_name
    #[CHANGE2] "Protein Homology" --> "RefSeq"
    #[CHANGE3] CDS --> exon
    #[CHANGE4?] --gtf_extra_attributes ||||gene|||| refers to ID=gene-SERP_RS00220;Name=noc;gbkey=Gene;||||gene=noc||||; "gene=" exists in both gene and exon. choose gene as gtf_extra_attributes! (see --gtf_extra_attributes gene in the next command)
    
    [1] python3 gff_to_gtf.py RP62A.gff3 RP62A.gtf
    [2] python3 modify_gtf.py RP62A.gtf RP62A_m.gtf
    [3] sed -i -e 's/gene_id "rna-/gene_id "gene-/g' RP62A_m.gtf
    #NOTE that "ProteinHomology exon" are from "ProteinHomology CDS" 
    [4] MANUALLY checking why two are missing, because the gene has two CDS/exons.
    #(rnaseq) jhuang@hamm:/mnt/h1/jhuang/DATA/Data_Marc_RNA-seq_Sepidermidis/results/genome$ grep ">cds" results/genome/genome.transcripts.fa  | sed 's/>//g' - | sort > CDS_2.txt
    #2415
    #(rnaseq) jhuang@hamm:/mnt/h1/jhuang/DATA/Data_Marc_RNA-seq_Sepidermidis/results/genome$ grep ">rna" genome.transcripts.fa  | wc -l
    #84
    #cut -d$'\t' -f9 CDS_1.txt | cut -d';' -f1 | sed 's/ID=//g' - | sort > CDS_1_.txt
    #2417
    #2268d2267
    #< cds-WP_010959142.1
    #2401d2399
    #< cds-WP_161384733.1
    
    NC_002976.3     RefSeq  gene    423814  424930  .       +       .       gene_id "gene-SERP_RS02220"; gene_name "prfB"; gbkey "Gene"; gene "prfB"; gene_biotype "protein_coding"; locus_tag "SERP_RS02220"; old_locus_tag "SE0421%2CSERP0421"
    NC_002976.3     ProteinHomology exon    423814  423885  .       +       0       transcript_id "cds-WP_010959142.1"; gene_id "gene-SERP_RS02220"; Dbxref "GenBank:WP_010959142.1"; Name "WP_010959142.1"; Note "programmed frameshift"; Ontology_term "GO:0006415,GO:0003747"; exception "ribosomal slippage"; gbkey "CDS"; gene "prfB"; go_function "translation release factor activity|0003747||IEA"; go_process "translational termination|0006415||IEA"; inference "COORDINATES: similar to AA sequence:RefSeq:WP_010959142.1"; locus_tag "SERP_RS02220"; product "peptide chain release factor 2"; protein_id "WP_010959142.1"; transl_table "11"
    NC_002976.3     ProteinHomology exon    423887  424930  .       +       0       transcript_id "cds-WP_010959142.1"; gene_id "gene-SERP_RS02220"; Dbxref "GenBank:WP_010959142.1"; Name "WP_010959142.1"; Note "programmed frameshift"; Ontology_term "GO:0006415,GO:0003747"; exception "ribosomal slippage"; gbkey "CDS"; gene "prfB"; go_function "translation release factor activity|0003747||IEA"; go_process "translational termination|0006415||IEA"; inference "COORDINATES: similar to AA sequence:RefSeq:WP_010959142.1"; locus_tag "SERP_RS02220"; product "peptide chain release factor 2"; protein_id "WP_010959142.1"; transl_table "11"
    NC_002976.3     RefSeq  gene    2256196 2256869 .       +       .       gene_id "gene-SERP_RS10980"; gene_name "SERP_RS10980"; gbkey "Gene"; gene_biotype "protein_coding"; locus_tag "SERP_RS10980"; old_locus_tag "SERP2219"
    NC_002976.3     ProteinHomology exon    2256196 2256385 .       +       0       transcript_id "cds-WP_161384733.1"; gene_id "gene-SERP_RS10980"; Dbxref "GenBank:WP_161384733.1"; Name "WP_161384733.1"; Note "programmed frameshift"; Ontology_term "GO:0004803"; exception "ribosomal slippage"; gbkey "CDS"; go_function "transposase activity|0004803||IEA"; inference "COORDINATES: similar to AA sequence:RefSeq:WP_064305258.1"; locus_tag "SERP_RS10980"; product "IS6-like element IS257 family transposase"; protein_id "WP_161384733.1"; transl_table "11"
    NC_002976.3     ProteinHomology exon    2256385 2256869 .       +       2       transcript_id "cds-WP_161384733.1"; gene_id "gene-SERP_RS10980"; Dbxref "GenBank:WP_161384733.1"; Name "WP_161384733.1"; Note "programmed frameshift"; Ontology_term "GO:0004803"; exception "ribosomal slippage"; gbkey "CDS"; go_function "transposase activity|0004803||IEA"; inference "COORDINATES: similar to AA sequence:RefSeq:WP_064305258.1"; locus_tag "SERP_RS10980"; product "IS6-like element IS257 family transposase"; protein_id "WP_161384733.1"; transl_table "11"
    
    --> "cds-WP_010959142.1" and "cds-WP_161384733.1" occur twice! 
    
    NC_002976.3     RefSeq  gene    423814  424930  .       +       .       gene_id "gene-SERP_RS02220"; gene_name "prfB"; gbkey "Gene"; gene "prfB"; gene_biotype "protein_coding"; locus_tag "SERP_RS02220"; old_locus_tag "SE0421%2CSERP0421"
    NC_002976.3     ProteinHomology exon    423814  424930  .       +       0       transcript_id "cds-WP_010959142.1"; gene_id "gene-SERP_RS02220"; Dbxref "GenBank:WP_010959142.1"; Name "WP_010959142.1"; Note "programmed frameshift"; Ontology_term "GO:0006415,GO:0003747"; exception "ribosomal slippage"; gbkey "CDS"; gene "prfB"; go_function "translation release factor activity|0003747||IEA"; go_process "translational termination|0006415||IEA"; inference "COORDINATES: similar to AA sequence:RefSeq:WP_010959142.1"; locus_tag "SERP_RS02220"; product "peptide chain release factor 2"; protein_id "WP_010959142.1"; transl_table "11"
    NC_002976.3     RefSeq  gene    2256196 2256869 .       +       .       gene_id "gene-SERP_RS10980"; gene_name "SERP_RS10980"; gbkey "Gene"; gene_biotype "protein_coding"; locus_tag "SERP_RS10980"; old_locus_tag "SERP2219"
    NC_002976.3     ProteinHomology exon    2256196 2256869 .       +       0       transcript_id "cds-WP_161384733.1"; gene_id "gene-SERP_RS10980"; Dbxref "GenBank:WP_161384733.1"; Name "WP_161384733.1"; Note "programmed frameshift"; Ontology_term "GO:0004803"; exception "ribosomal slippage"; gbkey "CDS"; go_function "transposase activity|0004803||IEA"; inference "COORDINATES: similar to AA sequence:RefSeq:WP_064305258.1"; locus_tag "SERP_RS10980"; product "IS6-like element IS257 family transposase"; protein_id "WP_161384733.1"; transl_table "11"
    
    --> "cds-WP_010959142.1" and "cds-WP_161384733.1" occur once! 
    
    ln -s /home/jhuang/Tools/nf-core-rnaseq-3.12.0/ rnaseq
    
    (rnaseq) nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results --fasta RP62A.fasta --gtf RP62A_m.gtf -profile test_full -resume --max_memory 100.GB --max_time 2400.h --save_reference --aligner star_salmon             --skip_rseqc --skip_dupradar --skip_preseq --skip_biotype_qc --skip_deseq2_qc --min_mapped_reads 0  --gtf_extra_attributes gene
    #grep ">" genome.transcripts.fa  | wc -l --> 2499  (2,529= CDSs (2,445) + RNA(84))
    
    #-- Ben project: human + bacterium + plasmid --
    .gff3
    python3 gff_to_gtf.py CP009367.gff3 CP009367.gtf
    python3 modify_gtf.py CP009367.gtf CP009367_m.gtf
    sed -i -e 's/gene_id "rna-/gene_id "gene-/g' CP009367_m.gtf
    
    python3 gff_to_gtf.py NC_019234.gff3 NC_019234.gtf
    python3 modify_gtf.py NC_019234.gtf NC_019234_m.gtf
    sed -i -e 's/gene_id "rna-/gene_id "gene-/g' NC_019234_m.gtf
    
    python3 genbank_to_gtf.py 1585.gb 1585.gtf
    python3 modify_gtf.py 1585.gtf 1585_m.gtf
    #python3 correct_gtf.py
    python3 transform_gtf.py 1585_m.gtf 1585_m_.gtf
    sed -i -e 's/transcript_id \"/transcript_id \"tx-/g' 1585_m_.gtf
    sed -i -e 's/exon_id \"/exon_id \"exon-/g' 1585_m_.gtf
    #BUG: why 7052 >= expected 7014
    
    #-- MANUALLY: gene positions out of boundary --
    #Transcript cds-WP_011100764.1 is out of chromosome NC_019234.1's boundary!
    NC_019234.1     RefSeq  gene    66549   67571   .       +       .       gene_id "gene-HXG45_RS00005"; gene_name "repA"; gbkey "Gene"; gene "repA"; gene_biotype "protein_coding"; locus_tag "HXG45_RS00005"; old_locus_tag "D743_p54"
    NC_019234.1     RefSeq  exon    66549   67571   .       +       0       transcript_id "cds-WP_011100764.1"; gene_id "gene-HXG45_RS00005"; Dbxref "GenBank:WP_011100764.1"; Name "WP_011100764.1"; gbkey "CDS"; gene "repA"; inference "COORDINATES: similar to AA sequence:RefSeq:WP_011100764.1"; locus_tag "HXG45_RS00005"; product "plasmid replication initiator RepA"; protein_id "WP_011100764.1"; transl_table "11"
    #-->
    NC_019234.1 RefSeq  gene    66549   66845   .   +   .   gene_id "gene-HXG45_RS00005_1"; gene_name "repA"; gbkey "Gene"; gene "repA"; gene_biotype "protein_coding"; locus_tag "HXG45_RS00005"; old_locus_tag "D743_p54"
    NC_019234.1 RefSeq  exon    66549   66845   .   +   0   transcript_id "cds-WP_011100764.1_1"; gene_id "gene-HXG45_RS00005_1"; Dbxref "GenBank:WP_011100764.1"; Name "WP_011100764.1"; gbkey "CDS"; gene "repA"; inference "COORDINATES: similar to AA sequence:RefSeq:WP_011100764.1"; locus_tag "HXG45_RS00005"; product "plasmid replication initiator RepA"; protein_id "WP_011100764.1"; transl_table "11"
    NC_019234.1 RefSeq  gene    1   726 .   +   .   gene_id "gene-HXG45_RS00005_2"; gene_name "repA"; gbkey "Gene"; gene "repA"; gene_biotype "protein_coding"; locus_tag "HXG45_RS00005"; old_locus_tag "D743_p54"
    NC_019234.1 RefSeq  exon    1   726 .   +   0   transcript_id "cds-WP_011100764.1_2"; gene_id "gene-HXG45_RS00005_2"; Dbxref "GenBank:WP_011100764.1"; Name "WP_011100764.1"; gbkey "CDS"; gene "repA"; inference "COORDINATES: similar to AA sequence:RefSeq:WP_011100764.1"; locus_tag "HXG45_RS00005"; product "plasmid replication initiator RepA"; protein_id "WP_011100764.1"; transl_table "11"
    
    (rnaseq) nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results_NC_019234 --fasta NC_019234.fasta --gtf NC_019234_m.gtf -profile test_full -resume --max_memory 100.GB --max_time 2400.h --save_reference --aligner star_salmon             --skip_rseqc --skip_dupradar --skip_preseq --skip_biotype_qc --skip_deseq2_qc --min_mapped_reads 0  --gtf_extra_attributes gene
    nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results_CP009367 --fasta CP009367.fasta --gtf CP009367_m.gtf -profile test_full -resume --max_memory 100.GB --max_time 2400.h --save_reference --aligner star_salmon             --skip_rseqc --skip_dupradar --skip_preseq --skip_biotype_qc --skip_deseq2_qc --min_mapped_reads 0  --gtf_extra_attributes gene
    nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results_GRCh38 --genome GRCh38 -profile test_full -resume --max_memory 100.GB --max_time 2400.h --save_align_intermeds --save_unaligned --aligner star_salmon --pseudo_aligner salmon
    nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results_1585 --fasta 1585.fasta --gtf 1585_m_.gtf -profile test_full -resume --max_memory 100.GB --max_time 2400.h --save_reference --aligner star_salmon --skip_rseqc --skip_dupradar --skip_preseq --skip_biotype_qc --skip_deseq2_qc --min_mapped_reads 0 
    
    #TODO: look if gene-HXG45_RS00010 exist!
    
    # -- check the intermediate results --
    #vim salmon.merged.gene_counts.tsv
    gene_id gene_name       EX_15_min_A
    gene-SERP_RS00195       rpmH    298
    gene-SERP_RS00200       rnpA    90
    gene-SERP_RS00205       mnmE    1733
    
    #vim salmon.merged.transcript_counts.tsv
    tx      gene_id EX_15_min_A
    cds-WP_000240855.1      gene-SERP_RS00195       298
    cds-WP_001832522.1      gene-SERP_RS00200       90
    cds-WP_001831768.1      gene-SERP_RS00205       1733
  4. gff_to_gtf.py (code)

    import sys
    
    def gff3_to_gtf(gff3_file, gtf_file):
        with open(gff3_file, 'r') as gff3, open(gtf_file, 'w') as gtf:
            for line in gff3:
                if line.startswith('#'):
                    continue  # Skip header lines
    
                parts = line.strip().split('\t')
                if len(parts) != 9:
                    continue  # Skip invalid lines
    
                # Extract fields from GFF3
                seqname, source, feature, start, end, score, strand, frame, attributes = parts
    
                # Convert attributes to GTF format
                attr_pairs = [f.strip() for f in attributes.split(';') if f.strip() != '']
                attr_str = '; '.join([f'{key} "{value}"' for part in attr_pairs for key, value in [part.split('=')]])
    
                # Write to GTF file
                gtf_line = f'{seqname}\t{source}\t{feature}\t{start}\t{end}\t{score}\t{strand}\t{frame}\t{attr_str};\n'
                gtf.write(gtf_line)
    
    if __name__ == "__main__":
        if len(sys.argv) != 3:
            print("Usage: python script.py <input.gff3> <output.gtf>")
            sys.exit(1)
    
        gff3_file = sys.argv[1]
        gtf_file = sys.argv[2]
        gff3_to_gtf(gff3_file, gtf_file)
  5. genbank_to_gtf.py (code)

    import argparse
    from Bio import SeqIO
    
    def genbank_to_gtf(genbank_file, gtf_file):
        with open(genbank_file, "r") as input_handle, open(gtf_file, "w") as output_handle:
            for record in SeqIO.parse(input_handle, "genbank"):
                for feature in record.features:
                    if feature.type == "gene" or feature.type == "CDS":
                        start = feature.location.start.position + 1
                        end = feature.location.end.position
                        strand = '+' if feature.location.strand == 1 else '-'
                        attributes = []
    
                        gene_id = None
                        if "gene_id" in feature.qualifiers:
                            gene_id = feature.qualifiers["gene_id"][0]
                        elif "gene" in feature.qualifiers:
                            gene_id = feature.qualifiers["gene"][0]
                        elif "locus_tag" in feature.qualifiers:
                            gene_id = feature.qualifiers["locus_tag"][0]
    
                        if gene_id is not None:
                            attributes.append(f'gene_id "{gene_id}"')
    
                        transcript_id = None
                        if "locus_tag" in feature.qualifiers:
                            transcript_id = feature.qualifiers["locus_tag"][0]
                            attributes.append(f'transcript_id "{transcript_id}"')
    
                        attribute_str = "; ".join(attributes) + ";"
                        gtf_line = f"{record.id}\t.\t{feature.type}\t{start}\t{end}\t.\t{strand}\t.\t{attribute_str}\n"
                        output_handle.write(gtf_line)
    
    def main():
        parser = argparse.ArgumentParser(description="Convert GenBank file to GTF format.")
        parser.add_argument("genbank_file", help="Input GenBank file")
        parser.add_argument("gtf_file", help="Output GTF file")
        args = parser.parse_args()
    
        genbank_to_gtf(args.genbank_file, args.gtf_file)
    
    if __name__ == "__main__":
        main()
  6. modify_gtf.py (code)

    import sys
    
    def modify_gtf(input_file, output_file):
        with open(input_file, 'r') as infile, open(output_file, 'w') as outfile:
            for line in infile:
                if line.startswith('#'):
                    outfile.write(line)  # Copy comment lines
                    continue
    
                parts = line.strip().split('\t')
                if len(parts) != 9:
                    continue  # Skip invalid lines
    
                record_type = parts[2]
                attributes = parts[8]
    
                # Change "Protein Homology" to "RefSeq"
                if parts[1] == "Protein Homology":
                    parts[1] = "ProteinHomology"
    
                new_attributes = []
                for attr in attributes.split(';'):
                    if attr.strip():
                        key_value = attr.strip().split(' ', 1)  # Split only on the first space
                        if len(key_value) == 2:
                            key, value = key_value
                            if (record_type == 'gene' or record_type == 'pseudogene') and key == 'ID':
                                key = 'gene_id'
                            elif (record_type == 'gene' or record_type == 'pseudogene') and key == 'Name':
                                key = 'gene_name'
                            elif (record_type == 'CDS' or record_type == 'tRNA' or record_type == 'rRNA' or record_type == 'SRP_RNA') and key == 'ID':
                                key = 'transcript_id'
                            elif (record_type == 'CDS' or record_type == 'tRNA' or record_type == 'rRNA' or record_type == 'SRP_RNA') and key == 'Parent':
                                key = 'gene_id'
                            elif (record_type == 'exon') and key == 'ID':
                                key = 'exon_id'
                            elif (record_type == 'exon') and key == 'Parent':
                                key = 'transcript_id'
                                new_attributes.append(f'gene_id {value}')
                            new_attributes.append(f'{key} {value}')
                        else:
                            new_attributes.append(attr.strip())
    
                # Change record type from CDS to exon
                if record_type == "CDS":
                    parts[2] = "exon"
    
                parts[8] = '; '.join(new_attributes)
                outfile.write('\t'.join(parts) + '\n')
    
    if __name__ == "__main__":
        if len(sys.argv) != 3:
            print("Usage: python modify_gtf.py 
    “) sys.exit(1) input_file = sys.argv[1] output_file = sys.argv[2] modify_gtf(input_file, output_file)
  7. correct_gtf.py (code)

    input_file = '1585_m.gtf'  # Replace with your input file path
    output_file = '1585_m_.gtf'  # Replace with your desired output file path
    
    with open(input_file, 'r') as infile, open(output_file, 'w') as outfile:
        for line in infile:
            if line.strip() and not line.startswith('#'):  # Skip empty or comment lines
                parts = line.strip().split('\t')
                if parts[2] == 'gene':  # Check if the feature type is 'gene'
                    attributes = parts[8].split(';')
                    attributes = [attr.strip() for attr in attributes if 'transcript_id' not in attr]  # Remove transcript_id for gene
                    parts[8] = '; '.join(attributes) + ';' if attributes else '.'
                    line = '\t'.join(parts) + '\n'
            outfile.write(line)
    
    print(f"Processed file saved as {output_file}")
  8. transform_gtf.py (code)

    import sys
    
    def transform_data(input_data):
        output_data = []
    
        for line in input_data:
            fields = line.split()
            entry_type = fields[2]
            attributes = ' '.join(fields[8:])
            attributes_dict = dict(item.split(' ', 1) for item in attributes.split('; '))
    
            gene_id = attributes_dict.get('gene_id').replace('"', '')
            transcript_id = attributes_dict.get('transcript_id').replace('"', '')
    
            if not gene_id.startswith("SF028"):
                gene_name = f'gene_name "{gene_id}"'
                gene_id = f'gene_id "{transcript_id}"'
            else:
                gene_name = ''
    
            if entry_type == "gene":
                gene_line = f"{fields[0]}\t{fields[1]}\tgene\t{fields[3]}\t{fields[4]}\t{fields[5]}\t{fields[6]}\t{fields[7]}\t{gene_id}; {gene_name}".strip()
                output_data.append(gene_line)
                transcript_line = f"{fields[0]}\t{fields[1]}\ttranscript\t{fields[3]}\t{fields[4]}\t{fields[5]}\t{fields[6]}\t{fields[7]}\t{gene_id}; transcript_id \"{transcript_id}\""
                output_data.append(transcript_line)
            elif entry_type == "exon":
                exon_line = f"{fields[0]}\t{fields[1]}\texon\t{fields[3]}\t{fields[4]}\t{fields[5]}\t{fields[6]}\t{fields[7]}\t{gene_id}; transcript_id \"{transcript_id}\"; exon_id \"{transcript_id}\""
                output_data.append(exon_line)
    
        return output_data
    
    def main(input_file, output_file):
        with open(input_file, 'r') as file:
            input_data = file.readlines()
    
        transformed_data = transform_data(input_data)
    
        with open(output_file, 'w') as file:
            for line in transformed_data:
                file.write(line + '\n')
    
    if __name__ == "__main__":
        if len(sys.argv) != 3:
            print("Usage: python script.py 
    “) sys.exit(1) input_file = sys.argv[1] output_file = sys.argv[2] main(input_file, output_file)

RNA-seq recalculate p-values on sT- and LT-transcripts

  1. goal

    #p604 and p601 to sT transcript
    #p602 and p600 to LT transcript
    #p605 and p600 to LTtr transcript (or LT transcript is also fine)
    
    p602+604 compared to sT transcript (p602and604_d3 vs. p601_d3?)
    p605+604 compared to sT transcript (p604and605_d9/d12 vs. p600and601_d9/d12?)
    p602+604 compared to LT transcript (p602and604_d3 vs. p601_d3?)
    p605+604 compared to LT transcript (p604and605_d9/d12 vs. p600and601_d9/d12?)
  2. under ~/DATA/Data_Denise_sT_PUBLISHING/Data_Denise_RNASeq_Merged_Corrected_28+2

    library("AnnotationDbi")
    library("clusterProfiler")
    library("ReactomePA")
    library(gplots)
    library(tximport)
    library(DESeq2)
    library("org.Hs.eg.db")
    
    #d.raw<- read.delim2("gene_counts_hg38+virus_on_LT_transcripts_28samples.csv",sep=",", header=TRUE, row.names=1)
    d.raw<- read.delim2("gene_counts_hg38+virus_on_sT_transcripts_28samples.csv",sep=",", header=TRUE, row.names=1)
    
    # [1] V_8_0_mock_DonorI             V_8_0_mock_DonorII           
    # [3] V_8_1_5_p601_d3_DonorII       V_8_1_5_p604_d3_DonorII      
    # [5] V_8_1_5_p601_d8_DonorII       V_8_1_5_p604_d8_DonorII      
    # [7] V_8_1_6_p601_d3_DonorI        V_8_1_6_p604_d3_DonorI       
    # [9] V_8_1_6_p601_d8_DonorI        V_8_1_6_p604_d8_DonorI       
    #[11] V_8_2_3_p600_d3_DonorII       V_8_2_3_p605_d3_DonorII      
    #[13] V_8_2_3_p600_d8_DonorII       V_8_2_3_p605_d8_DonorII      
    #[15] V_8_2_4_p600_d3_DonorI        V_8_2_4_p605_d3_DonorI       
    #[17] V_8_2_4_p600_d8_DonorI        V_8_2_4_p605_d8_DonorI       
    #[19] V_8_4_1_p602_d8_DonorII       V_8_4_1_p602_d8_DonorI       
    #[21] V_8_3_1_p600and601_d12_DonorI V_8_3_1_p604and605_d12_DonorI
    #[23] V_8_3_2_p600and601_d9_DonorII V_8_3_2_p604and605_d9_DonorII
    #[25] V_8_4_2_p602_d3_DonorI        V_8_4_2_p602_d3_DonorII      
    #[27] V_8_5_1_p602and604_d3_DonorI  V_8_5_2_p602and604_d3_DonorII
    
    col_order <- c("V_8_0_mock_DonorI","V_8_0_mock_DonorII","V_8_1_5_p601_d3_DonorII", "V_8_1_5_p604_d3_DonorII", "V_8_1_5_p601_d8_DonorII","V_8_1_5_p604_d8_DonorII",   "V_8_1_6_p601_d3_DonorI","V_8_1_6_p604_d3_DonorI","V_8_1_6_p601_d8_DonorI","V_8_1_6_p604_d8_DonorI",  "V_8_2_3_p600_d3_DonorII","V_8_2_3_p605_d3_DonorII","V_8_2_3_p600_d8_DonorII", "V_8_2_3_p605_d8_DonorII",  "V_8_2_4_p600_d3_DonorI","V_8_2_4_p605_d3_DonorI","V_8_2_4_p600_d8_DonorI","V_8_2_4_p605_d8_DonorI",  "V_8_4_1_p602_d8_DonorII","V_8_4_1_p602_d8_DonorI",  "V_8_3_1_p600and601_d12_DonorI", "V_8_3_1_p604and605_d12_DonorI","V_8_3_2_p600and601_d9_DonorII","V_8_3_2_p604and605_d9_DonorII",    "V_8_4_2_p602_d3_DonorI","V_8_4_2_p602_d3_DonorII",    "V_8_5_1_p602and604_d3_DonorI", "V_8_5_2_p602and604_d3_DonorII")
    identical(colnames(d.raw),col_order)
    
    #reordered.raw <- d.raw[,col_order]
    
    replicates = as.factor(c("untreated","untreated", "p601_d3","p604_d3", "p601_d8","p604_d8",   "p601_d3","p604_d3","p601_d8","p604_d8",  "p600_d3","p605_d3","p600_d8", "p605_d8",  "p600_d3","p605_d3","p600_d8","p605_d8",  "p602_d8","p602_d8",  "p600and601_d9d12", "p604and605_d9d12","p600and601_d9d12","p604and605_d9d12",     "p602_d3","p602_d3",    "p602and604_d3","p602and604_d3"))
    
    batch = as.factor(c("200420", "200420", "190927", "190927",    "190927", "190927", "190228", "190228",    "190228", "190228", "191008", "191008",    "191008", "191008", "190228", "190228",     "190228", "190228", "200817", "200817",       "200420", "200420", "200817", "200817",      "210302", "210302",  "210504","210504"))
    
    ids = as.factor(c("untreated_DonorI","untreated_DonorII", "p601_d3_DonorII","p604_d3_DonorII", "p601_d8_DonorII","p604_d8_DonorII",   "p601_d3_DonorI","p604_d3_DonorI","p601_d8_DonorI","p604_d8_DonorI",  "p600_d3_DonorII","p605_d3_DonorII","p600_d8_DonorII", "p605_d8_DonorII",  "p600_d3_DonorI","p605_d3_DonorI","p600_d8_DonorI","p605_d8_DonorI",  "p602_d8_DonorII","p602_d8_DonorI",  "p600and601_d12_DonorI", "p604and605_d12_DonorI","p600and601_d9_DonorII","p604and605_d9_DonorII",        "p602_d3_DonorI","p602_d3_DonorII",      "p602and604_d3_DonorI","p602and604_d3_DonorII"))
    
    donor = as.factor(c("DonorI","DonorII",  "DonorII","DonorII", "DonorII","DonorII",   "DonorI","DonorI","DonorI","DonorI",  "DonorII","DonorII","DonorII","DonorII",  "DonorI","DonorI","DonorI","DonorI",  "DonorII","DonorI",   "DonorI", "DonorI","DonorII","DonorII",        "DonorI","DonorII",    "DonorI","DonorII"))
    
    cData = data.frame(row.names=colnames(d.raw), replicates=replicates, donor=donor, batch=batch, ids=ids)
    #dds<-DESeqDataSetFromMatrix(countData=d.raw, colData=cData, design=~batch+replicates) # ERROR due to the factor 'batch'
    dds<-DESeqDataSetFromMatrix(countData=d.raw, colData=cData, design=~donor+replicates)
    
    sizeFactors(dds)
    >NULL
    dds <- estimateSizeFactors(dds)
    sizeFactors(dds)
    #            V_8_0_mock_DonorI            V_8_0_mock_DonorII 
    #                    0.6152727                     0.8401306 
    #      V_8_1_5_p601_d3_DonorII       V_8_1_5_p604_d3_DonorII 
    #                    1.0196629                     1.0293038 
    #      V_8_1_5_p601_d8_DonorII       V_8_1_5_p604_d8_DonorII 
    #                    0.8901261                     0.9062621 
    #       V_8_1_6_p601_d3_DonorI        V_8_1_6_p604_d3_DonorI 
    #                    1.3752853                     1.3504344 
    #       V_8_1_6_p601_d8_DonorI        V_8_1_6_p604_d8_DonorI 
    #                    1.2316027                     1.3085877 
    #      V_8_2_3_p600_d3_DonorII       V_8_2_3_p605_d3_DonorII 
    #                    0.9323440                     0.9381003 
    #      V_8_2_3_p600_d8_DonorII       V_8_2_3_p605_d8_DonorII 
    #                    0.7179836                     0.9400712 
    #       V_8_2_4_p600_d3_DonorI        V_8_2_4_p605_d3_DonorI 
    #                    1.3953243                     1.4836134 
    #       V_8_2_4_p600_d8_DonorI        V_8_2_4_p605_d8_DonorI 
    #                    1.2135489                     1.8270271 
    #      V_8_4_1_p602_d8_DonorII        V_8_4_1_p602_d8_DonorI 
    #                    1.5281322                     1.1579240 
    #V_8_3_1_p600and601_d12_DonorI V_8_3_1_p604and605_d12_DonorI 
    #                    0.9895927                     0.8560524 
    #V_8_3_2_p600and601_d9_DonorII V_8_3_2_p604and605_d9_DonorII 
    #                    1.3462308                     1.1142589 
    #       V_8_4_2_p602_d3_DonorI       V_8_4_2_p602_d3_DonorII 
    #                    0.9675338                     1.1800615 
    # V_8_5_1_p602and604_d3_DonorI V_8_5_2_p602and604_d3_DonorII 
    #                    0.3940273                     0.3490526 
    
    rld <- rlogTransformation(dds)
    
    # -- before pca --
    png("pca.png", 1200, 800)
    plotPCA(rld, intgroup=c("replicates"))
    #plotPCA(rld, intgroup = c("replicates", "batch"))
    #plotPCA(rld, intgroup = c("replicates", "ids"))
    #plotPCA(rld, "batch")
    dev.off()
    
    # -- before heatmap --
    ## generate the pairwise comparison between samples
    library(gplots) 
    library("RColorBrewer")
    png("heatmap.png", 1200, 800)
    distsRL <- dist(t(assay(rld)))
    mat <- as.matrix(distsRL)
    #paste( rld$dex, rld$cell, sep="-" )
    #rownames(mat) <- colnames(mat) <- with(colData(dds),paste(replicates,batch, sep=":"))
    rownames(mat) <- colnames(mat) <- with(colData(dds),paste(replicates))
    hc <- hclust(distsRL)
    hmcol <- colorRampPalette(brewer.pal(9,"GnBu"))(100)
    heatmap.2(mat, Rowv=as.dendrogram(hc),symm=TRUE, trace="none",col = rev(hmcol), margin=c(13, 13))
    dev.off()
    
    #---- 3 ----
    dds$replicates <- relevel(dds$replicates, "p601_d3")
    dds = DESeq(dds, betaPrior=FALSE)
    resultsNames(dds)
    clist <- c("p602and604_d3_vs_p601_d3")
    
    dds$replicates <- relevel(dds$replicates, "p600and601_d9d12")
    dds = DESeq(dds, betaPrior=FALSE)
    resultsNames(dds)
    clist <- c("p604and605_d9d12_vs_p600and601_d9d12")
    
    for (i in clist) {
      contrast = paste("replicates", i, sep="_")
      res = results(dds, name=contrast)
      res <- res[!is.na(res$log2FoldChange),]
      geness <- AnnotationDbi::select(org.Hs.eg.db, keys = rownames(res), keytype = "SYMBOL", columns = c("ENSEMBL", "ENTREZID", "SYMBOL", "GENENAME"))
      geness <- geness[!duplicated(geness$SYMBOL), ]
      res$SYMBOL = rownames(res)
      rownames(geness) <- geness$SYMBOL
      identical(rownames(res), rownames(geness))
      res_df <- as.data.frame(res)
      geness_res <- merge(geness, res_df)
      dim(geness_res)
      write.csv(as.data.frame(geness_res[order(geness_res$pvalue),]), file = paste(i, "all.txt", sep="-"))
      #write.csv(res_df, file = paste(i, "background.txt", sep="_"))
      up <- subset(geness_res, padj<=0.05 & log2FoldChange>=1)
      down <- subset(geness_res, padj<=0.05 & log2FoldChange<=-1)
      write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(i, "up.txt", sep="-"))
      write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(i, "down.txt", sep="-"))
    }
    
    ~/Tools/csv2xls-0.4/csv_to_xls.py \
    p602and604_d3_vs_p601_d3-all.txt \
    p602and604_d3_vs_p601_d3-up.txt \
    p602and604_d3_vs_p601_d3-down.txt \
    -d$',' -o p602and604_d3_vs_p601_d3_on_LT.xls;
    
    ~/Tools/csv2xls-0.4/csv_to_xls.py \
    p604and605_d9d12_vs_p600and601_d9d12-all.txt \
    p604and605_d9d12_vs_p600and601_d9d12-up.txt \
    p604and605_d9d12_vs_p600and601_d9d12-down.txt \
    -d$',' -o p604and605_d9d12_vs_p600and601_d9d12_on_LT.xls;
    
    ~/Tools/csv2xls-0.4/csv_to_xls.py \
    p602and604_d3_vs_p601_d3-all.txt \
    p602and604_d3_vs_p601_d3-up.txt \
    p602and604_d3_vs_p601_d3-down.txt \
    -d$',' -o p602and604_d3_vs_p601_d3_on_sT.xls;
    
    ~/Tools/csv2xls-0.4/csv_to_xls.py \
    p604and605_d9d12_vs_p600and601_d9d12-all.txt \
    p604and605_d9d12_vs_p600and601_d9d12-up.txt \
    p604and605_d9d12_vs_p600and601_d9d12-down.txt \
    -d$',' -o p604and605_d9d12_vs_p600and601_d9d12_on_sT.xls;
  3. goal

    p783 compared to LT transcript (p783_d8 vs. p600_d8?)
  4. Under ~/DATA/Data_Denise_LT_K331A_RNASeq/Data_Denise_RNASeq_Merged_Corrected_8

    d.raw<- read.delim2("gene_name_gene_counts_hg38+virus.csv",sep=",", header=TRUE, row.names=1)
    library(dplyr)
    # Remove duplicate rows based on gene_name
    d.raw.unique <- d.raw %>% distinct(gene_name, .keep_all = TRUE)
    # Convert to a traditional data frame
    d.raw.df <- as.data.frame(d.raw.unique)
    # Set gene_name as row names
    rownames(d.raw.df) <- d.raw.df$gene_name
    # Remove the now redundant gene_name column
    d.raw.df$gene_name <- NULL
    
    #d.summarized <- d.raw %>% group_by(gene_name) %>% summarize(across(everything(), sum, na.rm = TRUE))
    #rownames(d.raw.df) <- d.raw.df$gene_name
    ## Remove the now redundant gene_name column
    #d.raw.df$gene_name <- NULL
    
    replicates = as.factor(c("p600_d8","p600_d8", "p602_d8","p602_d8",  "p605_d8","p605_d8",    "p783_d8","p783_d8"))
    ids = as.factor(c("p600_d8_DonorI","p600_d8_DonorII", "p602_d8_DonorI","p602_d8_DonorII",  "p605_d8_DonorI","p605_d8_DonorII",    "p783_d8_DonorI","p783_d8_DonorII"))
    donor = as.factor(c("DonorI","DonorII",   "DonorI","DonorII",   "DonorI","DonorII",    "DonorI","DonorII"))
    cData = data.frame(row.names=colnames(d.raw.df), replicates=replicates, donor=donor, ids=ids)
    #dds<-DESeqDataSetFromMatrix(countData=d.raw.df, colData=cData, design=~batch+replicates) # ERROR due to the factor 'batch'
    dds<-DESeqDataSetFromMatrix(countData=d.raw.df, colData=cData, design=~donor+replicates)
    
    sizeFactors(dds)
    >NULL
    dds <- estimateSizeFactors(dds)
    sizeFactors(dds)
    
    #  control_DI  control_DII        LT_DI       LT_DII      LTtr_DI     LTtr_DII 
    #   0.9393201    0.5490296    0.9066729    1.1874750    1.4258546    0.7305989 
    # LT_K331A_DI LT_K331A_DII 
    #   1.4819629    1.2744066 
    
    rld <- rlogTransformation(dds)
    
    dds$replicates <- relevel(dds$replicates, "p600_d8")
    dds = DESeq(dds, betaPrior=FALSE)
    resultsNames(dds)
    clist <- c("p783_d8_vs_p600_d8")
    
    for (i in clist) {
      contrast = paste("replicates", i, sep="_")
      res = results(dds, name=contrast)
      res <- res[!is.na(res$log2FoldChange),]
      geness <- AnnotationDbi::select(org.Hs.eg.db, keys = rownames(res), keytype = "SYMBOL", columns = c("ENSEMBL", "ENTREZID", "SYMBOL", "GENENAME"))
      geness <- geness[!duplicated(geness$SYMBOL), ]
      res$SYMBOL = rownames(res)
      rownames(geness) <- geness$SYMBOL
      identical(rownames(res), rownames(geness))
      res_df <- as.data.frame(res)
      geness_res <- merge(geness, res_df)
      dim(geness_res)
      write.csv(as.data.frame(geness_res[order(geness_res$pvalue),]), file = paste(i, "all.txt", sep="-"))
      #write.csv(res_df, file = paste(i, "background.txt", sep="_"))
      up <- subset(geness_res, padj<=0.05 & log2FoldChange>=1)
      down <- subset(geness_res, padj<=0.05 & log2FoldChange<=-1)
      write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(i, "up.txt", sep="-"))
      write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(i, "down.txt", sep="-"))
    }
    
    ~/Tools/csv2xls-0.4/csv_to_xls.py \
    p783_d8_vs_p600_d8-all.txt \
    p783_d8_vs_p600_d8-up.txt \
    p783_d8_vs_p600_d8-down.txt \
    -d$',' -o p783_d8_vs_p600_d8_on_LT.xls;
  5. Deduplication is not performed until UMI-tools are used.

    • UMI-tools dedup

      • After extracting the UMI information from the read sequence (see UMI-tools extract), the second step in the removal of UMI barcodes involves deduplicating the reads based on both mapping and UMI barcode information using the UMI-tools dedup command.

      • This will generate a filtered BAM file after the removal of PCR duplicates.

      • picard MarkDuplicates

      • Unless you are using UMIs it is not possible to establish whether the fragments you have sequenced from your sample were derived via true biological duplication (i.e. sequencing independent template fragments) or as a result of PCR biases introduced during the library preparation.

      • By default, the pipeline uses picard MarkDuplicates to mark the duplicate reads identified amongst the alignments to allow you to guage the overall level of duplication in your samples.

      • However, for RNA-seq data it is not recommended to physically remove duplicate reads from the alignments (unless you are using UMIs) because you expect a significant level of true biological duplication that arises from the same fragments being sequenced from for example highly expressed genes.

      • This step will be skipped automatically when using the –with_umi option or explicitly via the –skip_markduplicates parameter.

        compare the STAR bam-files with markDuplicates bam-files. They are the same, since in the markDuplicates step, the duplicated reads were only marked, not physically removed! jhuang@hamburg:~/DATA/Data_Denise_sT_PUBLISHING/Data_Denise_RNASeq/results/STAR$ samtools flagstat ./V_8_4_2_p602_d3_DonorIIAligned.sortedByCoord.out.bam 19968762 + 0 in total (QC-passed reads + QC-failed reads) 2323677 + 0 secondary 0 + 0 supplementary 0 + 0 duplicates 19968762 + 0 mapped (100.00% : N/A) jhuang@hamburg:~/DATA/Data_Denise_sT_PUBLISHING/Data_Denise_RNASeq/results/markDuplicates$ samtools flagstat V_8_4_2_p602_d3_DonorIIAligned.sortedByCoord.out.markDups.bam 19968762 + 0 in total (QC-passed reads + QC-failed reads) 2323677 + 0 secondary 0 + 0 supplementary 15013903 + 0 duplicates 19968762 + 0 mapped (100.00% : N/A)

  6. goal

    #| p600and601_d9_DonorII | GFP+mCherry control | Day 9 | 
    #| p600and601_d12_DonorI | GFP+mCherry control | Day 12 | 
    #-->| p604and605_d9_DonorII | sT+LTtr | Day 9 | 
    #-->| p604and605_d12_DonorI | sT+LTtr | Day 12 | 
    #-->| p602and604_d3_DonorI  | sT+LT | Day 3 | 
    #-->| p602and604_d3_DonorII | sT+LT | Day 3 |
    p602+604 d3 versus p602 d3 on sT+LT transcript
    p602+p604 d3 versus p604 d3 on sT+LT transcript
    p604+605 d9/12 versus p604 d8 on sT+LT transcript
    p604+605 d9/12 versus p605 d8 on sT+LT transcript
  7. under ~/DATA/Data_Denise_sT_PUBLISHING/Data_Denise_RNASeq_Merged_Corrected_28+2

    library("AnnotationDbi")
    library("clusterProfiler")
    library("ReactomePA")
    library(gplots)
    library(tximport)
    library(DESeq2)
    library("org.Hs.eg.db")
    #d.raw<- read.delim2("gene_counts_hg38+virus_on_LT_transcripts_28samples.csv",sep=",", header=TRUE, row.names=1)
    #d.raw<- read.delim2("gene_counts_hg38+virus_on_sT_transcripts_28samples.csv",sep=",", header=TRUE, row.names=1)
    d.raw<- read.delim2("gene_counts_hg38+virus_on_sT+LT_transcripts_28samples.csv",sep=",", header=TRUE, row.names=1)
    # [1] V_8_0_mock_DonorI             V_8_0_mock_DonorII           
    # [3] V_8_1_5_p601_d3_DonorII       V_8_1_5_p604_d3_DonorII      
    # [5] V_8_1_5_p601_d8_DonorII       V_8_1_5_p604_d8_DonorII      
    # [7] V_8_1_6_p601_d3_DonorI        V_8_1_6_p604_d3_DonorI       
    # [9] V_8_1_6_p601_d8_DonorI        V_8_1_6_p604_d8_DonorI       
    #[11] V_8_2_3_p600_d3_DonorII       V_8_2_3_p605_d3_DonorII      
    #[13] V_8_2_3_p600_d8_DonorII       V_8_2_3_p605_d8_DonorII      
    #[15] V_8_2_4_p600_d3_DonorI        V_8_2_4_p605_d3_DonorI       
    #[17] V_8_2_4_p600_d8_DonorI        V_8_2_4_p605_d8_DonorI       
    #[19] V_8_4_1_p602_d8_DonorII       V_8_4_1_p602_d8_DonorI       
    #[21] V_8_3_1_p600and601_d12_DonorI V_8_3_1_p604and605_d12_DonorI
    #[23] V_8_3_2_p600and601_d9_DonorII V_8_3_2_p604and605_d9_DonorII
    #[25] V_8_4_2_p602_d3_DonorI        V_8_4_2_p602_d3_DonorII      
    #[27] V_8_5_1_p602and604_d3_DonorI  V_8_5_2_p602and604_d3_DonorII
    col_order <- c("V_8_0_mock_DonorI","V_8_0_mock_DonorII","V_8_1_5_p601_d3_DonorII", "V_8_1_5_p604_d3_DonorII", "V_8_1_5_p601_d8_DonorII","V_8_1_5_p604_d8_DonorII",   "V_8_1_6_p601_d3_DonorI","V_8_1_6_p604_d3_DonorI","V_8_1_6_p601_d8_DonorI","V_8_1_6_p604_d8_DonorI",  "V_8_2_3_p600_d3_DonorII","V_8_2_3_p605_d3_DonorII","V_8_2_3_p600_d8_DonorII", "V_8_2_3_p605_d8_DonorII",  "V_8_2_4_p600_d3_DonorI","V_8_2_4_p605_d3_DonorI","V_8_2_4_p600_d8_DonorI","V_8_2_4_p605_d8_DonorI",  "V_8_4_1_p602_d8_DonorII","V_8_4_1_p602_d8_DonorI",  "V_8_3_1_p600and601_d12_DonorI", "V_8_3_1_p604and605_d12_DonorI","V_8_3_2_p600and601_d9_DonorII","V_8_3_2_p604and605_d9_DonorII",    "V_8_4_2_p602_d3_DonorI","V_8_4_2_p602_d3_DonorII",    "V_8_5_1_p602and604_d3_DonorI", "V_8_5_2_p602and604_d3_DonorII")
    identical(colnames(d.raw),col_order)
    #reordered.raw <- d.raw[,col_order]
    replicates = as.factor(c("untreated","untreated", "p601_d3","p604_d3", "p601_d8","p604_d8",   "p601_d3","p604_d3","p601_d8","p604_d8",  "p600_d3","p605_d3","p600_d8", "p605_d8",  "p600_d3","p605_d3","p600_d8","p605_d8",  "p602_d8","p602_d8",  "p600and601_d9d12", "p604and605_d9d12","p600and601_d9d12","p604and605_d9d12",     "p602_d3","p602_d3",    "p602and604_d3","p602and604_d3"))
    batch = as.factor(c("200420", "200420", "190927", "190927",    "190927", "190927", "190228", "190228",    "190228", "190228", "191008", "191008",    "191008", "191008", "190228", "190228",     "190228", "190228", "200817", "200817",       "200420", "200420", "200817", "200817",      "210302", "210302",  "210504","210504"))
    ids = as.factor(c("untreated_DonorI","untreated_DonorII", "p601_d3_DonorII","p604_d3_DonorII", "p601_d8_DonorII","p604_d8_DonorII",   "p601_d3_DonorI","p604_d3_DonorI","p601_d8_DonorI","p604_d8_DonorI",  "p600_d3_DonorII","p605_d3_DonorII","p600_d8_DonorII", "p605_d8_DonorII",  "p600_d3_DonorI","p605_d3_DonorI","p600_d8_DonorI","p605_d8_DonorI",  "p602_d8_DonorII","p602_d8_DonorI",  "p600and601_d12_DonorI", "p604and605_d12_DonorI","p600and601_d9_DonorII","p604and605_d9_DonorII",        "p602_d3_DonorI","p602_d3_DonorII",      "p602and604_d3_DonorI","p602and604_d3_DonorII"))
    donor = as.factor(c("DonorI","DonorII",  "DonorII","DonorII", "DonorII","DonorII",   "DonorI","DonorI","DonorI","DonorI",  "DonorII","DonorII","DonorII","DonorII",  "DonorI","DonorI","DonorI","DonorI",  "DonorII","DonorI",   "DonorI", "DonorI","DonorII","DonorII",        "DonorI","DonorII",    "DonorI","DonorII"))
    cData = data.frame(row.names=colnames(d.raw), replicates=replicates, donor=donor, batch=batch, ids=ids)
    #dds<-DESeqDataSetFromMatrix(countData=d.raw, colData=cData, design=~batch+replicates) # ERROR due to the factor 'batch'
    dds<-DESeqDataSetFromMatrix(countData=d.raw, colData=cData, design=~donor+replicates)
    sizeFactors(dds)
    >NULL
    dds <- estimateSizeFactors(dds)
    sizeFactors(dds)
    #            V_8_0_mock_DonorI            V_8_0_mock_DonorII 
    #                    0.6152727                     0.8401306 
    #      V_8_1_5_p601_d3_DonorII       V_8_1_5_p604_d3_DonorII 
    #                    1.0196629                     1.0293038 
    #      V_8_1_5_p601_d8_DonorII       V_8_1_5_p604_d8_DonorII 
    #                    0.8901261                     0.9062621 
    #       V_8_1_6_p601_d3_DonorI        V_8_1_6_p604_d3_DonorI 
    #                    1.3752853                     1.3504344 
    #       V_8_1_6_p601_d8_DonorI        V_8_1_6_p604_d8_DonorI 
    #                    1.2316027                     1.3085877 
    #      V_8_2_3_p600_d3_DonorII       V_8_2_3_p605_d3_DonorII 
    #                    0.9323440                     0.9381003 
    #      V_8_2_3_p600_d8_DonorII       V_8_2_3_p605_d8_DonorII 
    #                    0.7179836                     0.9400712 
    #       V_8_2_4_p600_d3_DonorI        V_8_2_4_p605_d3_DonorI 
    #                    1.3953243                     1.4836134 
    #       V_8_2_4_p600_d8_DonorI        V_8_2_4_p605_d8_DonorI 
    #                    1.2135489                     1.8270271 
    #      V_8_4_1_p602_d8_DonorII        V_8_4_1_p602_d8_DonorI 
    #                    1.5281322                     1.1579240 
    #V_8_3_1_p600and601_d12_DonorI V_8_3_1_p604and605_d12_DonorI 
    #                    0.9895927                     0.8560524 
    #V_8_3_2_p600and601_d9_DonorII V_8_3_2_p604and605_d9_DonorII 
    #                    1.3462308                     1.1142589 
    #       V_8_4_2_p602_d3_DonorI       V_8_4_2_p602_d3_DonorII 
    #                    0.9675338                     1.1800615 
    # V_8_5_1_p602and604_d3_DonorI V_8_5_2_p602and604_d3_DonorII 
    #                    0.3940273                     0.3490526
    rld <- rlogTransformation(dds)
    # -- before pca --
    png("pca.png", 1200, 800)
    plotPCA(rld, intgroup=c("replicates"))
    #plotPCA(rld, intgroup = c("replicates", "batch"))
    #plotPCA(rld, intgroup = c("replicates", "ids"))
    #plotPCA(rld, "batch")
    dev.off()
    # -- before heatmap --
    ## generate the pairwise comparison between samples
    library(gplots) 
    library("RColorBrewer")
    png("heatmap.png", 1200, 800)
    distsRL <- dist(t(assay(rld)))
    mat <- as.matrix(distsRL)
    #paste( rld$dex, rld$cell, sep="-" )
    #rownames(mat) <- colnames(mat) <- with(colData(dds),paste(replicates,batch, sep=":"))
    rownames(mat) <- colnames(mat) <- with(colData(dds),paste(replicates))
    hc <- hclust(distsRL)
    hmcol <- colorRampPalette(brewer.pal(9,"GnBu"))(100)
    heatmap.2(mat, Rowv=as.dendrogram(hc),symm=TRUE, trace="none",col = rev(hmcol), margin=c(13, 13))
    dev.off()
    
    #---- 3 ----
    dds$replicates <- relevel(dds$replicates, "p602_d3")
    dds = DESeq(dds, betaPrior=FALSE)
    resultsNames(dds)
    clist <- c("p602and604_d3_vs_p602_d3")
    dds$replicates <- relevel(dds$replicates, "p604_d3")
    dds = DESeq(dds, betaPrior=FALSE)
    resultsNames(dds)
    clist <- c("p602and604_d3_vs_p604_d3")
    dds$replicates <- relevel(dds$replicates, "p604_d8")
    dds = DESeq(dds, betaPrior=FALSE)
    resultsNames(dds)
    clist <- c("p604and605_d9d12_vs_p604_d8")
    dds$replicates <- relevel(dds$replicates, "p605_d8")
    dds = DESeq(dds, betaPrior=FALSE)
    resultsNames(dds)
    clist <- c("p604and605_d9d12_vs_p605_d8")
    for (i in clist) {
    contrast = paste("replicates", i, sep="_")
    res = results(dds, name=contrast)
    res <- res[!is.na(res$log2FoldChange),]
    geness <- AnnotationDbi::select(org.Hs.eg.db, keys = rownames(res), keytype = "SYMBOL", columns = c("ENSEMBL", "ENTREZID", "SYMBOL", "GENENAME"))
    geness <- geness[!duplicated(geness$SYMBOL), ]
    res$SYMBOL = rownames(res)
    rownames(geness) <- geness$SYMBOL
    identical(rownames(res), rownames(geness))
    res_df <- as.data.frame(res)
    geness_res <- merge(geness, res_df)
    dim(geness_res)
    write.csv(as.data.frame(geness_res[order(geness_res$pvalue),]), file = paste(i, "all.txt", sep="-"))
    #write.csv(res_df, file = paste(i, "background.txt", sep="_"))
    up <- subset(geness_res, padj<=0.05 & log2FoldChange>=1)
    down <- subset(geness_res, padj<=0.05 & log2FoldChange<=-1)
    write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(i, "up.txt", sep="-"))
    write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(i, "down.txt", sep="-"))
    }
    ~/Tools/csv2xls-0.4/csv_to_xls.py \
    p602and604_d3_vs_p602_d3-all.txt \
    p602and604_d3_vs_p602_d3-up.txt \
    p602and604_d3_vs_p602_d3-down.txt \
    -d$',' -o p602and604_d3_vs_p602_d3.xls;
    ~/Tools/csv2xls-0.4/csv_to_xls.py \
    p602and604_d3_vs_p604_d3-all.txt \
    p602and604_d3_vs_p604_d3-up.txt \
    p602and604_d3_vs_p604_d3-down.txt \
    -d$',' -o p602and604_d3_vs_p604_d3.xls;
    ~/Tools/csv2xls-0.4/csv_to_xls.py \
    p604and605_d9d12_vs_p604_d8-all.txt \
    p604and605_d9d12_vs_p604_d8-up.txt \
    p604and605_d9d12_vs_p604_d8-down.txt \
    -d$',' -o p604and605_d9d12_vs_p604_d8.xls;
    ~/Tools/csv2xls-0.4/csv_to_xls.py \
    p604and605_d9d12_vs_p605_d8-all.txt \
    p604and605_d9d12_vs_p605_d8-up.txt \
    p604and605_d9d12_vs_p605_d8-down.txt \
    -d$',' -o p604and605_d9d12_vs_p605_d8.xls;

RNA-seq data analysis (R-part) for S. epidermidis 1585

  1. Data input and generate PCA plot

    # Import the required libraries
    library("AnnotationDbi")
    library("clusterProfiler")
    library("ReactomePA")
    library(gplots)
    library(tximport)
    library(DESeq2)
    setwd("/home/jhuang/DATA/Data_Marc_RNA-seq_Sepidermidis/results_1585/star_salmon")
    # Define paths to your Salmon output quantification files
    files <- c("15m_A" = "./EX_15_min_A/quant.sf",
              "15m_B" = "./EX_15_min_B/quant.sf",
              "15m_C" = "./EX_15_min_C/quant.sf",
              "1h_A" = "./EX_1h_A/quant.sf",
              "1h_B" = "./EX_1h_B/quant.sf",
              "1h_C" = "./EX_1h_C/quant.sf",
              "2h_A" = "./EX_2h_A/quant.sf",
              "2h_B" = "./EX_2h_B/quant.sf",
              "2h_C" = "./EX_2h_C/quant.sf",
              "4h_A" = "./EX_4h_A/quant.sf",
              "4h_B" = "./EX_4h_B/quant.sf",
              "4h_C" = "./EX_4h_C/quant.sf",
              "6h_A" = "./EX_6h_A/quant.sf",
              "6h_B" = "./EX_6h_B/quant.sf",
              "6h_C" = "./EX_6h_C/quant.sf",
              "4d_A" = "./EX_Day_4_A/quant.sf",
              "4d_B" = "./EX_Day_4_B/quant.sf",
              "4d_C" = "./EX_Day_4_C/quant.sf")
    # Import the transcript abundance data with tximport
    txi <- tximport(files, type = "salmon", txIn = TRUE, txOut = TRUE)
    # Define the replicates and condition of the samples
    replicate <- factor(c("A", "B", "C", "A", "B", "C", "A", "B", "C", "A", "B", "C", "A", "B", "C", "A", "B", "C"))
    condition <- factor(c("15m", "15m", "15m", "1h", "1h", "1h", "2h", "2h", "2h", "4h", "4h", "4h", "6h", "6h", "6h", "4d", "4d", "4d"))
    # Define the colData for DESeq2
    colData <- data.frame(condition=condition, replicate=replicate, row.names=names(files))
    # Create DESeqDataSet object
    dds <- DESeqDataSetFromTximport(txi, colData=colData, design=~condition)
    # In the context of your new code which is using tximport and DESeq2, you don't necessarily need this step. The reason is that DESeq2 performs its own filtering of low-count genes during the normalization and differential expression steps.
    # Filter data to retain only genes with more than 2 counts > 3 across all samples
    # dds <- dds[rowSums(counts(dds) > 3) > 2, ]
    # Output raw count data to a CSV file
    write.csv(counts(dds), file="transcript_counts.csv")
    # -- gene-level count data --
    # Read in the tx2gene map from salmon_tx2gene.tsv
    #tx2gene <- read.csv("salmon_tx2gene.tsv", sep="\t", header=FALSE)
    tx2gene <- read.table("salmon_tx2gene.tsv", header=FALSE, stringsAsFactors=FALSE)
    # Set the column names
    colnames(tx2gene) <- c("transcript_id", "gene_id", "gene_name")
    # Remove the gene_name column if not needed
    tx2gene <- tx2gene[,1:2]
    # Import and summarize the Salmon data with tximport
    txi <- tximport(files, type = "salmon", tx2gene = tx2gene, txOut = FALSE)
    # Continue with the DESeq2 workflow as before...
    colData <- data.frame(condition=condition, replicate=replicate, row.names=names(files))
    dds <- DESeqDataSetFromTximport(txi, colData=colData, design=~condition)
    #dds <- dds[rowSums(counts(dds) > 3) > 2, ]    #60605-->26543
    write.csv(counts(dds, normalized=FALSE), file="gene_counts.csv")
    #~/Tools/csv2xls-0.4/csv_to_xls.py gene_counts.csv -d',' -o gene_counts.xls
    
    #TODO: why a lot of reads were removed due to the too_short?
    #STAR --runThreadN 4 --genomeDir /path/to/GenomeDir --readFilesIn /path/to/read1.fastq /path/to/read2.fastq --outFilterMatchNmin 50 --outSAMtype BAM SortedByCoordinate --outFileNamePrefix /path/to/output
    dim(counts(dds))
    head(counts(dds), 10)
    #DEBUG: DESeq should not used here!?
    #TODO_NEXT_WEEK: rerun without fistly DESeq(dds) to compare if the results is the same to process_1 
    #dds <- DESeq(dds)
    rld <- rlogTransformation(dds)
    
    # draw simple pca and heatmap
    library(gplots) 
    library("RColorBrewer")
    #mat <- assay(rld)
    #mm <- model.matrix(~condition, colData(rld))
    #mat <- limma::removeBatchEffect(mat, batch=rld$batch, design=mm)
    #assay(rld) <- mat
    # -- pca --
    png("pca.png", 1200, 800)
    plotPCA(rld, intgroup=c("condition"))
    dev.off()
    # -- heatmap --
    png("heatmap.png", 1200, 800)
    distsRL <- dist(t(assay(rld)))
    mat <- as.matrix(distsRL)
    hc <- hclust(distsRL)
    hmcol <- colorRampPalette(brewer.pal(9,"GnBu"))(100)
    heatmap.2(mat, Rowv=as.dendrogram(hc),symm=TRUE, trace="none",col = rev(hmcol), margin=c(13, 13))
    dev.off()
  2. Analysis on differentially expressed genes

    setwd("/home/jhuang/DATA/Data_Marc_RNA-seq_Sepidermidis/results_1585/star_salmon/degenes")
    
    dds$condition <- relevel(dds$condition, "15m")
    dds = DESeq(dds, betaPrior=FALSE)
    resultsNames(dds)
    clist <- c("1h_vs_15m", "2h_vs_15m", "4h_vs_15m", "6h_vs_15m",  "4d_vs_15m")
    
    for (i in clist) {
      contrast = paste("condition", i, sep="_")
      res = results(dds, name=contrast)
      res <- res[!is.na(res$log2FoldChange),]
      res_df <- as.data.frame(res)
      write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(i, "all.txt", sep="-"))
      up <- subset(res_df, padj<=0.05 & log2FoldChange>=2)
      down <- subset(res_df, padj<=0.05 & log2FoldChange<=-2)
      write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(i, "up.txt", sep="-"))
      write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(i, "down.txt", sep="-"))
    }
    
    library("EnhancedVolcano")
    for i in 1h_vs_15m 2h_vs_15m 4h_vs_15m 6h_vs_15m 4d_vs_15m; do
      echo "res <- read.csv(file = paste(\"${i}\", \"all.txt\", sep=\"-\"), row.names = 1)"
      echo "res_df <- as.data.frame(res)"
      echo "png(\"${i}.png\",width=800, height=600)"
      #legendPosition = 'right',legendLabSize = 12,  arrowheads = FALSE,
      #echo "EnhancedVolcano(res, lab = rownames(res),x = 'log2FoldChange',y = 'padj', pCutoff=5e-2, FCcutoff=1.2, title='', subtitleLabSize = 18, pointSize = 3.0, labSize = 5.0, colAlpha=1, legendIconSize = 4.0, drawConnectors = TRUE, widthConnectors = 0.5, colConnectors = 'black', subtitle=expression(~Delta*\"$(echo $i | cut -d'_' -f1) versus \" *~Delta*\"$(echo $i | cut -d'_' -f3)\"))"
      echo "EnhancedVolcano(res, lab = rownames(res),x = 'log2FoldChange',y = 'padj', pCutoff=5e-2, FCcutoff=1.0, title='', subtitleLabSize = 18, pointSize = 3.0, labSize = 5.0, colAlpha=1, legendIconSize = 4.0, drawConnectors = TRUE, widthConnectors = 0.5, colConnectors = 'black', subtitle=expression(\"$(echo $i | cut -d'_' -f1) versus $(echo $i | cut -d'_' -f3)\"))"
      echo "dev.off()"
    done
    
    res <- read.csv(file = paste("1h_vs_15m", "all.txt", sep="-"), row.names = 1)
    res_df <- as.data.frame(res)
    png("1h_vs_15m.png",width=1000, height=800)
    EnhancedVolcano(res, lab = rownames(res),x = 'log2FoldChange',y = 'padj', ylim = c(0, 1), pCutoff=5e-2, FCcutoff=2.0, title='', subtitleLabSize = 18, pointSize = 3.0, labSize = 5.0, colAlpha=1, legendIconSize = 4.0, drawConnectors = TRUE, widthConnectors = 0.5, colConnectors = 'black', subtitle=expression("1h versus 15m"))
    dev.off()
    
    res <- read.csv(file = paste("2h_vs_15m", "all.txt", sep="-"), row.names = 1)
    res_df <- as.data.frame(res)
    png("2h_vs_15m.png",width=1000, height=1000)
    EnhancedVolcano(res, lab = rownames(res),x = 'log2FoldChange',y = 'padj', ylim = c(0, 4), pCutoff=5e-2, FCcutoff=2.0, title='', subtitleLabSize = 18, pointSize = 3.0, labSize = 5.0, colAlpha=1, legendIconSize = 4.0, drawConnectors = TRUE, widthConnectors = 0.5, colConnectors = 'black', subtitle=expression("2h versus 15m"))
    dev.off()
    
    res <- read.csv(file = paste("4h_vs_15m", "all.txt", sep="-"), row.names = 1)
    res_df <- as.data.frame(res)
    png("4h_vs_15m.png",width=1000, height=1200)
    EnhancedVolcano(res, lab = rownames(res),x = 'log2FoldChange',y = 'padj', ylim = c(0, 5.5), pCutoff=5e-2, FCcutoff=2.0, title='', subtitleLabSize = 18, pointSize = 3.0, labSize = 5.0, colAlpha=1, legendIconSize = 4.0, drawConnectors = TRUE, widthConnectors = 0.5, colConnectors = 'black', subtitle=expression("4h versus 15m"))
    dev.off()
    
    res <- read.csv(file = paste("6h_vs_15m", "all.txt", sep="-"), row.names = 1)
    res_df <- as.data.frame(res)
    png("6h_vs_15m.png",width=1000, height=1600)
    EnhancedVolcano(res, lab = rownames(res),x = 'log2FoldChange',y = 'padj', ylim = c(0, 17), pCutoff=5e-2, FCcutoff=2.0, title='', subtitleLabSize = 18, pointSize = 3.0, labSize = 5.0, colAlpha=1, legendIconSize = 4.0, drawConnectors = TRUE, widthConnectors = 0.5, colConnectors = 'black', subtitle=expression("6h versus 15m"))
    dev.off()
    
    res <- read.csv(file = paste("4d_vs_15m", "all.txt", sep="-"), row.names = 1)
    res_df <- as.data.frame(res)
    png("4d_vs_15m.png",width=1000, height=1600)
    EnhancedVolcano(res, lab = rownames(res),x = 'log2FoldChange',y = 'padj', pCutoff=5e-2, FCcutoff=2.0, title='', subtitleLabSize = 18, pointSize = 3.0, labSize = 5.0, colAlpha=1, legendIconSize = 4.0, drawConnectors = TRUE, widthConnectors = 0.5, colConnectors = 'black', subtitle=expression("4d versus 15m"))
    dev.off()
    
    #under DIR degenes under KONSOLE
    for i in 1h_vs_15m 2h_vs_15m 4h_vs_15m 6h_vs_15m 4d_vs_15m; do 
    echo "~/Tools/csv2xls-0.4/csv_to_xls.py ${i}-all.txt ${i}-up.txt ${i}-down.txt -d$',' -o ${i}.xls;"; 
    done
  3. Clustering the genes and draw heatmap

    for i in 1h_vs_15m 2h_vs_15m 4h_vs_15m 6h_vs_15m 4d_vs_15m; do
    echo "cut -d',' -f1-1 ${i}-up.txt > ${i}-up.id"; echo "cut -d',' -f1-1 ${i}-down.txt > ${i}-down.id"; done
    cat *.id | sort -u > ids
    #add Gene_Id in the first line, delete the ""
    GOI <- read.csv("ids")$Gene_Id
    RNASeq.NoCellLine <- assay(rld)
    #install.packages("gplots")
    library("gplots")
    #clustering methods: "ward.D", "ward.D2", "single", "complete", "average" (= UPGMA), "mcquitty" (= WPGMA), "median" (= WPGMC) or "centroid" (= UPGMC).  pearson or spearman
    datamat = RNASeq.NoCellLine[GOI, ]
    #datamat = RNASeq.NoCellLine
    write.csv(as.data.frame(datamat), file ="gene_expressions.txt")
    constant_rows <- apply(datamat, 1, function(row) var(row) == 0)
    if(any(constant_rows)) {
      cat("Removing", sum(constant_rows), "constant rows.\n")
      datamat <- datamat[!constant_rows, ]
    }
    hr <- hclust(as.dist(1-cor(t(datamat), method="pearson")), method="complete")
    hc <- hclust(as.dist(1-cor(datamat, method="spearman")), method="complete")
    mycl = cutree(hr, h=max(hr$height)/1.4)
    mycol = c("YELLOW", "DARKBLUE", "DARKORANGE", "MAGENTA", "CYAN", "GREEN", "RED", "MAROON", "LIGHTBLUE", "PINK", "MAGENTA", "LIGHTCYAN", "LIGHTGREEN"); #"LIGHTRED", 
    mycol = mycol[as.vector(mycl)]
    #png("DEGs_heatmap.png", width=900, height=800)
    #cex.lab=10, labRow="",
    png("DEGs_heatmap.png", width=900, height=1000)
    #heatmap.2(as.matrix(datamat),Rowv=as.dendrogram(hr),Colv = NA, dendrogram = 'row',
    #            scale='row',trace='none',col=bluered(75), 
    #            RowSideColors = mycol, margins=c(10,20), cexRow=1.5, srtCol=45, lhei = c(2, 8))  #rownames(datamat)  
    # cexCol, cexRow
    heatmap.2(datamat, Rowv=as.dendrogram(hr), col=bluered(75), scale="row", RowSideColors=mycol, trace="none", margin=c(5,2), sepwidth=c(0,0), dendrogram = 'row', Colv = 'false', density.info='none', labRow="", srtCol=40, lhei=c(1,7), cexCol=2)
    dev.off()
    
    #### cluster members #####
    write.csv(names(subset(mycl, mycl == '1')),file='cluster1_YELLOW.txt')
    write.csv(names(subset(mycl, mycl == '2')),file='cluster2_DARKBLUE.txt') 
    write.csv(names(subset(mycl, mycl == '3')),file='cluster3_DARKORANGE.txt')  
    write.csv(names(subset(mycl, mycl == '4')),file='cluster4_MAGENTA.txt')
    write.csv(names(subset(mycl, mycl == '5')),file='cluster5_CYAN.txt')
    write.csv(names(subset(mycl, mycl == '6')),file='cluster6_GREEN.txt')
    ~/Tools/csv2xls-0.4/csv_to_xls.py cluster*.txt -d',' -o DEGs_heatmap_cluster_members.xls
    #~/Tools/csv2xls-0.4/csv_to_xls.py significant_gene_expressions.txt -d',' -o DEGs_heatmap_expression_data.xls;
  4. Generate Excel files from Genbank-file

    python3 /home/jhuang/Scripts/gb_to_excel.py 1585.gb 1585.xlsx
    #insert C./P. (Chrom or Plasmid1, Plasmid2, Plasmid3) as the second chrom in the 1585.xlsx.

Preparing GTF file for nextflow rnaseq regarding S. epidermidis 1585

  1. In the example RP62A

    #[IMPORTANT FINAL STEPS]
    [1] python3 gff_to_gtf.py RP62A.gff3 RP62A.gtf or python3 ~/Scripts/genbank_to_gtf.py 1585.gb 1585.gtf
    
    [2] python3 modify_gtf.py RP62A.gtf RP62A_m.gtf
        python3 modify_gtf.py 1585.gtf 1585_m.gtf  #replace CDS with exon!
    
        gene_biotype "protein_coding"; 
    [3] sed -i -e 's/gene_id "rna-/gene_id "gene-/g' RP62A_m.gtf
    
    #  --gtf_extra_attributes        [string]  By default, the pipeline uses the `gene_name` field to obtain additional gene identifiers from the input GTF file when 
                                              running Salmon. [default: gene_name] 
    #  --gtf_group_features          [string]  Define the attribute type used to group features in the GTF file when running Salmon. [default: gene_id]
    #  --featurecounts_group_type    [string]  The attribute type used to group feature types in the GTF file when generating the biotype plot with 
                                              featureCounts. [default: gene_biotype] 
    #  --featurecounts_feature_type  [string]  By default, the pipeline assigns reads based on the 'exon' attribute within the GTF file. [default: exon]
    
    #--gtf_extra_attributes gene 
    #--gtf_group_features Parent --featurecounts_group_type gene_biotype --featurecounts_feature_type CDS (outdated since [CHANGE1,2,3]) 
    
    # ------------ gff_to_gtf.py, then modify *.gtf file as follows -------------
    #Upload the scripts gff_to_gtf.py and modify_gtf.py.
    python3 gff_to_gtf.py RP62A.gff3 RP62A.gtf
    
    #[CHANGE1] ID in gene and Parent in CDS --> gene_id; ID in exon --> transcript_id; Name in gene --> gene_name
    #[CHANGE2] "Protein Homology" --> "RefSeq"
    #[CHANGE3] CDS --> exon
    #[CHANGE4?] --gtf_extra_attributes ||||gene|||| refers to ID=gene-SERP_RS00220;Name=noc;gbkey=Gene;||||gene=noc||||; "gene=" exists in both gene and exon. choose gene as gtf_extra_attributes! (see --gtf_extra_attributes gene in the next command)
  2. In the example of 1585 wildtype genome 1585.gb (CP143516-CP143519)

                Features Annotated                :: Gene; CDS; rRNA; tRNA; ncRNA
                Genes (total)                     :: 2,332 (CDS + RNA)
                CDSs (total)                      :: 2,240 (CDS)  grep "CDS" 1585.gtf | wc -l
                Genes (coding)                    :: 2,183 (Pseudo has also CDS?)
                CDSs (with protein)               :: 2,183
    
                Genes (RNA)                       :: 92  
                rRNAs                             :: 7, 6, 6 (5S, 16S, 23S)  grep "rRNA" 1585.gtf | wc -l
                complete rRNAs                    :: 7, 6, 6 (5S, 16S, 23S)
                tRNAs                             :: 69  grep "tRNA" 1585.gtf | wc -l
                ncRNAs                            :: 4  grep "ncRNA" 1585.gtf | wc -l
    
                Pseudo Genes (total)              :: 57
                CDSs (without protein)            :: 57
                Pseudo Genes (ambiguous residues) :: 0 of 57
                Pseudo Genes (frameshifted)       :: 33 of 57
                Pseudo Genes (incomplete)         :: 30 of 57
                Pseudo Genes (internal stop)      :: 25 of 57
                Pseudo Genes (multiple problems)  :: 21 of 57
    
        gene            730216..731766
                        /locus_tag="V0R30_03330"
        rRNA            730216..731766
                        /locus_tag="V0R30_03330"
    
    #Ziel all ncRNA, rRNA, tmRNA, tRNA as exon, the original types set as gene_biotype "protein_coding"; gene_biotype "tRNA"; ...
    cut -d$'\t' -f3 1585_m.gtf | sort -u
    exon
    #gene
    ncRNA
    rRNA
    tmRNA            
    tRNA
    #Ausführung: STEP0: one record (tmRNA) lacking the "tmRNA" row, manually corrected! 
    #                Transfer-messenger RNA (tmRNA, 10Sa RNA, ssrA) is bacterial RNA having both tRNA and mRNA properties
    #            STEP1: python3 genbank_to_gtf.py 1585.gb 1585.gtf
    #            STEP2: python3 modify_gtf.py 1585.gtf 1585_m.gtf  #"CDS" --> "exon"
    #            STEP3: replace transcript_id " --> transcript_id "tx-
    #            STEP4: ergänzen all lines as gene_id; gene_name; transcript_id; and add all lines annotated with "gene" at the end with 'gene_biotype "protein_coding";'
    #                   using "python3 add_geneid_genename_genebiotype.py 1585_m.gtf 1585_m_.gtf
    
    #DEBUG ERROR Stop at line : CP143516       biopython       exon    1       1356    .       +       .       gene_id "dnaA"; gene_name "dnaA"; transcript_id "tx-V0R30_00005"
    #    Error Message: Cannot find transcript_id!
    #            STEP5:     Process features of types Gene, CDS, rRNA, tRNA, ncRNA, and tmRNA; For rRNA, tRNA, tmRNA, and ncRNA features, it will add an additional line with exon_id following the pattern described.   with "python3 add_exonid_to_exon_tRNA_rRNA_tmRNA_ncRNA.py 1585_m_.gtf 1585_m__.gtf"
    
    #            STEP6: MANUALLY modify the lines ncRNA (3), tmRNA (1), rRNA (19), tRNA (69): tRNA --> exon, 'gene_biotype "tRNA";';
    #                   with "grep "tRNA" 1585_m_.gtf | wc -l" check they are correctly changed!
    
    #RESULTS:
    #CP143516        biopython       gene    1       1356    .       +       .       gene_id "dnaA"; gene_name "dnaA"; transcript_id "tx-V0R30_00005"; gene_biotype "protein_coding"
    #CP143516        biopython       exon    1       1356    .       +       .       gene_id "dnaA"; gene_name "dnaA"; transcript_id "tx-V0R30_00005"; exon_id "exon-V0R30_00005";
    #...
    
    #/home/jhuang/DATA/Data_Marc_RNA-seq_Sepidermidis/results_1585/genome/rsem
    #grep ">" genome.transcripts.fa  | wc -l
    #2240 only exon level, but not tRNA, rRNA, tmRNA, ncRNA records!
    
    #            STEP7 (Optional) To include also the 92 records in the results, we can change the annotations records of rRNA, tRNA, tmRNA, ncRNA to exon, so that we can also the records!
    #                see 1585_m___.gtf
    #                grep "exon" 1585_m___.gtf | wc -l
    #                2332
    #                grep ">" /home/jhuang/DATA/Data_Marc_RNA-seq_Sepidermidis/results_1585/genome/genome.transcripts.fa | wc -l
    #                grep ">" /home/jhuang/DATA/Data_Marc_RNA-seq_Sepidermidis/results_1585/genome/rsem/genome.transcripts.fa | wc -l
    #                2332
    
    #For example exon-V0R30_00090 is a tRNA.
    #>tx-V0R30_00090
    #GGCTCCTTGGTCAAGCGGTTAAGACACCGCCCTTTCACGGCGGTAACACGGGTTCGAGTCCCGTAGGAGTCACCA
    
    #We can also filter the count number in the downstream analysis from the big matrix!
    #tmRNA: V0R30_09330; ncRNA: V0R30_05655, V0R30_06520, V0R30_11060
    >tx-V0R30_05655
    AGTCATCTGTGGTGTTCGTAAGTTTGCTTTTTATTTGGGCCTAACACTCTTTGATCAGGGAGCCCAATAGGTTTTCTCGCAGCGCACACGCCTCTATAGGAGGACTTGCAAAACGAGAAACAGGGCACCCACCTGTATATAGCAGGCCGAATGATCAAGCTATTTATAACTACGGCATCAACGGACTCTAT
    >tx-V0R30_06520
    TATTTCGGGTAATCGCTATAGCAATATAGAGGAAAGTCCATGCTCACACAATCTGAGATGATTGTAGTGTTCGTGCTTGATGAAACAATAAATCAAGGCATTAATTTGACGGCAATGAAATAACCTAAGTCATTGGATATGGTTAGAATAGTTTGAAAGTGCCACAGTGACGTAGCTTTTATAGAAATATAAAAGGTGGAACGCGGTAAACCCCTCGAGTGAGCAATCCAAATTGGGTAGGAGCACTTGTTTAGCGGAATTCAACGTATAGACGAGACGATTTTTACGCGAAAGTAAAAATATGTAGACAGATGGTTACCACCGACGTACCAGTGTAACTAGTACACATTATGAGTACAACGGAACAGAACATGGCTTACAGAA
    >tx-V0R30_11060
    CCGTGCTAGGTGGGGAGGTAGCGGTTCCCTGTACTCGAAATCCGCTTTATGCGAGACTTAATTCCTTTGTTGAGGGCGTATTTTTGTGAAGTCTGCCCGAAGCACGTAGTGTTTGAAGATTTCGGTCCTATGCAATATGAACCCATGAACCATGTCAGGTCCTGACGGAAGCAGCATTAAGTGGATCCTCATATGTGCCGTAGGGTAGCCGAGATTTAGCTGTCGACTTCGGTAACGTTTATGATATGCGTTCGATACAAAGGTGCATGG
    
    nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results_1585 --fasta 1585.fasta --gtf 1585_m___.gtf -profile test_full -resume --max_memory 200.GB --max_time 2400.h --save_reference --aligner star_salmon   --skip_rseqc --skip_dupradar --skip_preseq --skip_biotype_qc --skip_deseq2_qc --min_mapped_reads 0

The structural proteome of eukaryotic viruses

Original text please find under https://www.biorxiv.org/content/10.1101/2024.01.22.576744v1.full

  1. The structural proteome of eukaryotic viruses /ˈproʊ.ti.oʊm/ (Fig. 1 + Supplementary Fig. 1)

Fig1.jpg Sup_Fig1.jpg

    [ColabFold is advanced version of AlphaFold (https://steineggerlab.com/en/)]
    [Viral protein seqeunces --> structure prediction --> Seqeunce clustering using MMseqs2 --> Structural clustering using Foldseek]
    - Eukaryotic viral protein sequences --> Structure prediction (67715 Strucutures) using Colabfold --> Sequence clustering using MMseqs2 (70% Coverage, 20% identity) --> Structural clustering using Foldseek (70% Coverage and TMscore>=0.4)
    - This dataset includes a large diversity of viruses, including 4,463 species from 132 different viral families (Fig. 1B, C). 
    - Clusters are structurally consistent, as implementing DALI 24 to align cluster representatives to each member for clusters with at least 100 members yields a median cluster-average DALI Z score of 13.1 (Supplementary Fig. 1C). DALI Z-scores above 8 indicate
    two proteins are likely homologous 25.

    - We investigated how well this database represents viral diversity, and if it reconstitutes core viral hallmark genes. 
    - We grouped viral families into viral genome types based on their Baltimore classes26 with slight modifications – DNA viruses were split into large, medium, and small groupings based on their average genome length, while RNA viruses without single-stranded positive- or negative- sense genomes were grouped into the RNA (Other) category. 
    - Large double-stranded DNA (dsDNA) viruses have the most protein clusters per species and, despite constituting only 14 of the 132 viral families in the dataset, account for the majority of viral proteins (Fig. 1D, F). 
    - As expected, protein cluster count correlates strongly with genome size (Supplementary Fig. 1D). 
    - With their larger genomes, dsDNA viruses have the capacity to encode more auxiliary genes without sacrificing genome stability. 
    - RNA viruses make up a large fraction of the families present in the dataset, but a smaller fraction of the total proteins (Fig. 1E, F). 
    - Structural homology between viral families with a similar genome type is common, with large dsDNA viruses sharing many protein folds (Supplementary Fig. 1E).
    - As expected, the predominant protein clusters in the dataset as a whole (Supplementary Fig. 1F) and within each genome type (Fig. 1G) are largely involved in fundamental aspects of the viral life cycle. 
    - These include the single jellyroll fold, which comprises viral capsids ['kæpsid] and is present in viruses of all genome types. 
    - The double jellyroll fold also comprises viral capsids, although it is restricted to dsDNA viruses27. 
    - RNA viral families often encode nucleocapsids, responsible for packaging of viral RNA, and RNA-dependent RNA polymerases (RdRPs) responsible for genome replication. 
    - While the RdRP is universally conserved in RNA viruses, it is split amongst multiple protein clusters due to variation in protein length. 
    - In contrast, small dsDNA viruses such as papillomaviruses and polyomaviruses encode a viral replicase with conserved origin-binding and helicase domains. 
    -! Altogether, we find that our structural database successfully reconstitutes conserved viral proteins across diverse viral subtypes.
    - The pLDDT (predicted Local Distance Difference Test) value is a metric used in protein structure prediction, particularly by tools like AlphaFold, to assess the confidence in the predicted positions of amino acids in a protein structure. pLDDT scores range from 0 to 100, where higher scores indicate higher confidence in the accuracy of the predicted structure at specific positions. Scores above 90 are considered very high confidence, scores between 70 to 90 indicate confidence, scores between 50 to 70 are low confidence, and scores below 50 are very low confidence, suggesting that the model is uncertain about those regions of the protein structure.
    - Viral Protein Clusters ---Foldseek structural comparison---> Clustered Alphafold Database (human proteins)

    #viral protein cluster
    - We next investigated the taxonomic distribution of viral protein clusters. 
    - We conducted structural alignments of viral protein cluster representatives against 2.3 million cluster representatives from the entire Alphafold Database (AFDB)3 (Fig. 1H). 
    - For each virus protein cluster, we determined the last common ancestor of viruses encoding a cluster member. 
    - We found that 29% of protein clusters are present in multiple viral families, the majority of which are present in the Alphafold Database, suggesting that they are evolutionarily ancient (Fig. 1I). 
    - In addition, we found that 62% of viral proteins (or 55% of proteins from non-singleton clusters) are restricted to a single viral family and lack homologs in the AFDB (Fig. 1I). This shows that viral evolution generates substantial numbers of novel proteins that are absent from current structure
    databases.

2.Structural similarities between annotated and uncharacterized viral proteins (Fig. 2+Supplementary Fig. 2)

Fig2.jpg Sup_Fig2.jpg

    - We investigated the ability of structural alignments to identify relationships not apparent from protein sequence alone. 
    - We found that many representatives of sequence clusters are structurally similar despite low sequence similarity (Fig. 2A). --> the structure is more conserved than seqeunce level?
    - In fact, adding structural information to protein clustering efforts leads to more taxonomically diverse protein clusters, with significantly more viral families per cluster (Fig. 2B). : with the structure and seqeunce, we have much more viral families per cluster!
    - This is especially important for finding homology between proteins from divergent viruses, resulting in a substantial increase in protein clusters encompassing proteins encoded by viruses of different genome types (Fig. 2C). 51 vs 23
    - We asked if [structural alignments] can link poorly-annotated sequence clusters with those that are more annotated (Fig. 2D). We used the sequence-based classifier InterProScan28 to assign all proteins Pfam29, CDD30, and TIGRFAM31 classifications. 
    - Sequence clusters contain almost entirely annotated or entirely unannotated members, resulting in a bimodal distribution of sequence clusters (Fig. 2E). 
    - Of the proteins in clusters with more than one member, over 25% ( 5.2/(1.6+3.6+14.6)=0.26 ) of unannotated proteins are located in either an annotated sequence cluster or a protein cluster that contains an annotated sequence cluster (Fig. 2F).
    - Many protein clusters encompass a mixture of annotated and unannotated sequence clusters (Supplementary Fig. 2). Supplementary Fig. 2: Many unannotated proteins have structural homlogy to annotated protein clusters.
    - We find that these connections between sequence clusters are useful to determine putative functions of poorly characterized proteins across the virome. For example, while the single jellyroll fold is the most abundant protein cluster, many members of this cluster are not correctly annotated (Fig. 2G). Many other protein clusters include both annotated and unannotated sequence clusters, including clusters encoding enzymes such as nucleotide-phosphate kinases (Fig. 2H), NUDIX Hydrolases (Fig. 2I), DNA ligases (Fig. 2J), and nucleases (Fig. 2K). 
    - One cluster of note includes members that resemble the UL43 family of late herpesvirus proteins (Fig. 2L), which will be discussed later. 
    - Together, these results demonstrate that large scale clustering based on sequence plus predicted structure enables functional inference of poorly characterized viral proteins.

    #The Dali database is a structural classification based on precomputed all-against-all structural similarities within the PDB.
  1. Structural alignments suggest functions of human pathogen proteins ((Fig. 3)comparisions of viral (our databases) and non-viral proteins (Alphafold Databases) using Foldseek!)

Fig3.jpg

    Foldseek faster than TM-align and Dali!!!!!
    #- Foldseek is a tool designed for fast and sensitive comparison of large sets of protein structures. It uses a novel approach by encoding structures as sequences over a 20-state 3D interaction alphabet, enabling comparisons through sequence alignments. This method significantly speeds up structural comparisons, making Foldseek much faster than traditional structural aligners like TM-align and Dali, while maintaining high sensitivity. Foldseek is available as open-source software and also through a webserver, facilitating rapid and accurate protein structure searches
    #- Using Dali for structural comparison of proteins 

    - Unlike nucleotide or protein sequence, structural features are often conserved over large evolutionary timescales. 
    - Thus, we investigated if alignment between predicted viral and non-viral protein structures can offer insight into the function of poorly-annotated proteins encoded by human pathogens.
    - To do this, we used Foldseek (for comparisisons) to align [our virus protein structure database!!!!] with the initial release of the Alphafold Database, which contains over 300,000 proteins from 21 organisms across eukaryotes, bacteria, and archaea2 (Fig. 3A). 
    - This revealed pervasive structural homology between viral and non-viral proteins, with high structural similarity in the face of low amino acid identity (Fig. 3B).
    - Ultimately, 14,531 predicted viral proteins have an alignment to a member of the Alphafold database, with the majority of alignments being against proteins encoded by eukaryotes (Fig. 3C). 
    - These alignments include proteins that are unannotated but are encoded by human pathogens. 
    - To reduce rates of false negatives, we conducted a series of alignments using DALI24, which is slower than Foldseek but substantially more sensitive. 
    - First, we found that a set of proteins encoded by poxviruses are structurally similar to the auto-inhibitory domain of mammalian gasdermins (Fig. 3D)32. 
    -* In the context of "auto-inhibitory domain of mammalian gasdermins," the term "domain" refers to a specific part of a protein that has a distinct structure and function. In proteins like gasdermins, which are involved in cell death pathways, an auto-inhibitory domain acts as a regulatory segment. This domain can prevent the protein from acting until certain conditions are met, effectively inhibiting the protein's activity to regulate processes such as inflammation and cell death until it is appropriate for the protein to become active.

    - Similarly, several poxvirus proteins are structurally homologous to the human galactosyltransferase COLGALT1, thought to enable virus binding to
    surface glycosaminoglycans during viral entry (Fig. 3E)33. 
    - Next, we found that human herpesviruses proteins, including the protein BMRF2 from Epstein Barr herpesvirus (EBV) and Varicella zoster virus (VZV), share structural similarity with the human equilibrative nucleoside transporter ENT4 (Fig. 3F). 
    - EBV conducts substantial remodeling of host cell metabolism during viral infection34, and this finding suggests a potential metabolic role in addition to BMRF2 involvement in viral attachment35. 
    - In addition, transport of antiviral nucleoside analogues such as valacyclovir are mediated by nucleoside transporters36,37, raising questions about the
    interplay between this protein and valacyclovir during VZV infection. These proteins belong to a cluster of proteins similar to the UL43 family of late herpesvirus proteins, some of which are unannotated (Fig. 2J). 
    - In addition, we observed structural homology of Poxvirus C4-like proteins with eukaryotic dioxygenases (Fig. 3G). Vaccinia virus C4 is notable for antagonizing several innate immune pathways. C4 directly binds the pattern recognition receptor DNA-PK, blocking DNA binding and immune signaling through that pathway38. In addition, C4 inhibits NF-κB signaling downstream at or downstream of the IKK complex, but the mechanism of this inhibition is unknown39. Future studies are required to determine if its dioxygenase-like fold is involved in its innate immune antagonism. 
    - Altogether, these findings illustrate the ubiquity of structural homology between viral and non-viral proteins and show that this homology can be used to predict potential functions of poorly characterized viral proteins.
  1. Horizontal gene transfer creates taxonomically-diverse protein clusters (Supplementary Fig. 3)

Sup_Fig3.jpg

    -* What does "domain" in protein structure mean? While alpha-helices and beta-sheets are elements of secondary structure within a protein, a domain is a higher order of structure that often consists of multiple secondary structure elements arranged in a specific configuration. Domains can serve various functions, such as catalytic activity, binding to other molecules, or regulatory roles, and a single protein can contain multiple domains each performing different functions.

    - While we found that some protein clusters contain members encoded by viruses of different genome types, the evolutionary origin of such conservation is unclear. 
    - Many of these protein clusters are predominantly encoded by viruses of a single genome type but expressed in a small minority of viruses of a different genome type (Supplementary Fig. 3A). 
    - This observation is consistent with virus-virus or host-virus horizontal gene transfer.
    - To explore this possibility, we conducted Blast40 searches of sequence cluster representatives against viral- and non-viral protein databases and constructed phylogenetic trees of the top hits. 
    - We found that nucleoside-phosphate kinases in cluster 28 show a polyphyletic distribution with homologs in different viruses showing amino acid similarity to distinct sets of non-viral proteins (Supplementary Fig. 3B). Homalodisca vitripennis reovirus
    - There is a similar pattern with HrpA/B-like helicases in Cluster 55, with helicases in different viral families showing amino acid similarity to distinct sets of non-viral organisms (Supplementary Fig. 3C). 
    - These patterns are consistent with horizontal gene transfer from non-viral hosts. 
    - In contrast, other taxonomically distributed protein clusters such as cluster 56 (encoding parvovirus Rep proteins with homologs in some human herpesviruses) and cluster 735 (encoding a hemagglutinin lineage present in baculoviruses and some orthomyxoviruses) display a monophyletic taxonomic distribution consistent with horizontal gene transfer between viruses (Supplementary Fig. 3D, E). 
    - These data suggest that many protein clusters that contain proteins from viruses of different genome types arise from horizontal gene transfer, both from viral and non-viral sources.
  1. Structural alignments identify shared functional domains (Supplementary Fig. 4)

Sup_Fig4.jpg

    - We constructed protein clusters with a strict 70% coverage requirement, leaving open the possibility that individual domains can be identified through structure comparison3. 
    - We reasoned that protein domains present within multiple protein clusters may have particular biological importance. 
    - We used DALI to conduct all-by-all alignments of the representatives of all protein clusters having more than one member. 
    - This revealed substantial protein similarity with many alignments having Z scores greater than 8, indicating high confidence of structural homology25
    (Supplementary Fig. 4A). 
    - Protein clusters ultimately fall into a network of shared domains (Supplementary Fig. 4B). 
    - Here, distinct domains are often shared across protein clusters in context with various combinations of other domains, which can be seen with domains involved in interaction with the cytoskeleton (Supplementary Fig. 4C) and in metabolism (Supplementary Fig. 4D,E) in eukaryotic viruses and phage.
  1. Structural homology reveals phosphodiesterases that degrade 2’3’ cGAMP (e.g. LigT-like phophodiesterases (Fig. 4 + Supplementary Fig. 5))

Fig4.jpg Sup_Fig5.jpg

    Many aspects of eukaryotic and prokaryotic immunity have a shared origin41. One set of related
    pathways are the mammalian cGAS-STING and OAS pathways and prokaryotic
    Cyclic-oligonucleotide-based anti-phage signaling systems (CBASS). In both cases, a protein
    sensor detects a viral cue and generates a nucleotide second messenger, which activates a
    downstream antiviral effector (Fig. 4A). In the case of the cGAS (cyclic GMP-AMP synthase)
    pathway, cGAS recognizes cytoplasmic double-stranded DNA and generates 2’3’ cyclic
    GMP-AMP (2’3’ cGAMP). Many cGAS/DncV-like nucleotidyltransferases (CD-NTases) in
    prokaryotic CBASS’ make a similar second messenger, 3’3’ cGAMP, in response to viral cues42.
    In contrast, OAS (oligoadenylate synthase) recognizes double-stranded RNA and generates
    linear 2’5’ oligoadenylates (2’5’ OA)43,44. In prokaryotes, phage T4 encodes the ligT-like PDE
    anti-CBASS protein 1 (Acb1), which degrades 3’3’ cGAMP and a variety of other cyclic
    nucleotide substrates including 2’3’ cGAMP45.
    In eukaryotes, several RNA viruses encode PDEs that degrade 2’5’ OA46. Interestingly,
    we find that these PDEs have a ligT-like fold similar to Acb1. Given the conserved use of
    ligT-like PDEs in viral anti-immunity, we investigated their distribution and phylogeny. Structural
    searches revealed many different branches of ligT-like PDEs are present in eukaryotic viruses
    (Fig. 4B). Notably, there are multiple independent branches of ligT-like PDEs in RNA viruses.
    Linage A betacoronaviruses and Toroviruses share a clade of PDEs that is similar to the PDEs
    present in Rotaviruses. Surprisingly, lineage C betacoronaviruses contain a distinct branch of
    PDEs (Fig. 4A)47. This suggests that there were two independent PDE acquisition events within
    the betacoronavirus genus, showing the strong selective pressure for betacoronaviruses to
    evade the OAS pathway. We find that some large DNA viruses also contain ligT-like PDEs.
    Despite the extreme amino acid variability across the ligT-like PDE tree there is near-universal
    conservation of the two catalytic histidines (Fig. 4C), with the exception of the Mimivirus ligT-like
    branch.
    The presence of ligT-like PDEs in large DNA viruses raises the question of whether they
    have an anti-immune function. While the RNA-sensing OAS pathway is commonly targeted by
    ligT-like PDEs of RNA viruses, there is likely less pressure for large DNA viruses to target OAS.
    Thus, we tested whether ligT-like PDEs encoded by large DNA viruses have activity against 2’3’
    cGAMP. To address this question, we generated a synthetic STING circuit in 293T cells48,49 (Fig.
    4D). Here, STING can be activated by treatment with 2’3’ cGAMP or the non-nucleotide STING
    agonist diABZI50, which will lead to expression of firefly luciferase in a STING-dependent
    manner. We expect that a viral PDE that targets 2’3’ cGAMP should be able to inhibit 2’3’
    cGAMP- but not diABZI-mediated STING activity. Testing representative PDEs from each
    branch revealed that while PDEs encoded by RNA viruses and other large DNA viruses have
    only limited activity against 2’3’ cGAMP, PDEs encoded by avian poxviruses have very potent
    activity against 2’3’ cGAMP (Supplementary Fig. 5A). We found that the ligT-like PDEs encoded
    by Pigeonpox and Canarypox very potently restrict STING signaling stimulated by 2’3’ cGAMP
    but have limited activity against diABZI-mediated STING signaling (Fig. 4E). Furthermore,
    mutation of the catalytic histidines substantially reduces activity (Fig. 4E). Avian poxvirues are
    notable for their lack of Poxin51, the other 2’3’ cGAMP phosphodiesterase encoded by
    poxviruses, showing the strong selective pressure for poxviruses to evade cGAS-STING
    immunity. Altogether, we have leveraged structure homology to discover a novel mechanism of
    2’3’ cGAMP degradation by eukaryotic viruses and find that cGAMP targeting by ligT-like PDEs
    is a pan-viral mechanism of anti-immunity.

BCR (B-cell receptor) and TCR (T-cell receptor) sequencing

BCR (B-cell receptor) sequencing and TCR (T-cell receptor) sequencing are molecular techniques used to analyze the genetic diversity of B-cell and T-cell receptors within an immune response. These receptors are crucial for the adaptive immune system’s ability to recognize and bind to specific antigens (foreign substances or pathogens), leading to an immune response. Each B-cell or T-cell has a unique receptor that recognizes a specific antigen, and the vast diversity of these receptors allows the immune system to respond to a wide range of pathogens. BCR Sequencing

BCR sequencing focuses on the variable regions of the B-cell receptor, which determine its antigen specificity. BCRs are membrane-bound antibodies produced by B cells. Through BCR sequencing, researchers can identify the diversity of B-cell populations and understand their role in immune responses, including those against infections, in autoimmune diseases, and in cancer. It also helps in studying the clonal expansion of B cells and the development of antibody repertoires in response to vaccines or infections. TCR Sequencing

TCR sequencing analyzes the genetic sequences encoding the T-cell receptor. T cells are critical for cell-mediated immunity, recognizing peptides presented by the major histocompatibility complex (MHC) on the surface of other cells. TCR sequencing allows for the examination of the diversity and clonality of T-cell populations within a sample. This is important for understanding T-cell responses in various contexts, including infection, autoimmunity, vaccine responses, and tumor immunity.

B cells and T cells are essential components of the adaptive immune system, both equipped with receptors to recognize antigens. However, they serve different functions based on their receptor types and the mechanisms through which they mediate immune responses:

  • B Cells and B-cell Receptors (BCRs)

    • Antibody Production: Once activated, B cells can differentiate into plasma cells that produce antibodies. These antibodies are specific to the antigen that stimulated the B cell and can directly neutralize pathogens or tag them for destruction by other parts of the immune system.
    • Antigen Presentation: B cells can also act as antigen-presenting cells (APCs). They process antigens and present them on their surface to T cells, helping to activate T cells.
    • Memory: Some B cells become memory B cells after an infection, providing long-term immunity by responding more rapidly and effectively upon subsequent exposures to the same antigen.
  • T Cells and T-cell Receptors (TCRs)

    • Helper T Cells (CD4+ T Cells): These cells assist the immune response by secreting cytokines that regulate the activity of other immune cells, including B cells, cytotoxic T cells, and macrophages. They play a pivotal role in activating and directing the adaptive immune response.
    • Cytotoxic T Cells (CD8+ T Cells): These cells can directly kill cells infected by pathogens, such as virus-infected cells and cancer cells, by recognizing antigens presented on the surface of these cells.
    • Memory T Cells: After an immune response, some T cells become memory T cells, providing long-term protection by responding more quickly and effectively to subsequent exposures to the same antigen.

In summary, B cells primarily function by producing antibodies to neutralize foreign invaders, while T cells operate through direct cell killing and regulating the activities of other immune cells. The cooperation between these two cell types is crucial for mounting an effective immune response.

B细胞和T细胞都是适应性免疫系统的重要组成部分,都配备有识别抗原的受体。然而,它们基于其受体类型和介导免疫反应的机制具有不同的功能:

  • B细胞和B细胞受体(BCRs)

    • 抗体产生:激活后,B细胞可以分化成产生抗体的浆细胞。抗体特异性地针对刺激B细胞的抗原,它们可以直接中和病原体或将它们标记以供免疫系统的其他部分销毁。
    • 抗原呈递:B细胞也可以作为抗原呈递细胞(APCs)的功能。它们处理抗原并将其呈递在其表面给T细胞,帮助激活T细胞。
    • 记忆:一些B细胞在感染后成为记忆B细胞,通过对同一抗原的后续暴露反应更快、更有效提供长期免疫。
  • T细胞和T细胞受体(TCRs)

    • 辅助T细胞(CD4+ T细胞):这些细胞通过分泌细胞因子来支持免疫反应,这些细胞因子调节包括B细胞、细胞毒性T细胞和巨噬细胞在内的其他免疫细胞的活动。它们在激活和指导适应性免疫反应中起着关键作用。
    • 细胞毒性T细胞(CD8+ T细胞):这些细胞能够直接杀死被病原体感染的细胞,如病毒感染的细胞和癌细胞。它们通过识别这些细胞表面的抗原来实现这一点。
    • 记忆T细胞:在免疫反应后,一些T细胞变成记忆T细胞,为体提供长期保护,对同一抗原的再次暴露作出更快速和更有效的反应。

总的来说,B细胞主要通过产生抗体来中和外来入侵者,而T细胞则通过直接细胞杀伤和调节免疫反应的其他细胞来发挥作用。这两种细胞类型的合作对于形成有效的免疫反应至关重要。

Differences Between Molecular and Genomic Biomarkers

The terms “molecular biomarker” and “genomic biomarker” refer to specific types of biomarkers used in the field of biomedicine for diagnosis, prognosis, and monitoring the response to treatment, among other applications. Both fall under the broader category of biomarkers, which are measurable indicators of some biological state or condition, but they differ in what they measure and represent.

  • Molecular Biomarker

    A molecular biomarker refers to any molecule measured in tissues, cells, or bodily fluids that is a sign of a normal or abnormal process, or of a condition or disease. Molecular biomarkers can include a wide range of molecule types, such as:

    • Proteins/peptides: Changes in expression levels, modifications, or activity can serve as markers.
    • Lipids: Variations in lipid profiles can indicate metabolic states or diseases.
    • Metabolites: Small molecule byproducts of metabolic processes can indicate physiological or pathological states.
    • RNA: Variations in RNA expression levels, including mRNA and non-coding RNAs (like miRNA), can reflect gene expression changes in response to a disease or treatment.

      Molecular biomarkers thus encompass a broad range of biological molecules that reflect the state of biological processes or diseases.

  • Genomic Biomarker

    A genomic biomarker specifically refers to a DNA sequence or genetic variation that provides information on an individual’s characteristics, condition, disease risk, or response to treatment. Genomic biomarkers include:

    • Single nucleotide polymorphisms (SNPs): Single base-pair variations in the genome that can be associated with disease risk, drug response, or other traits.
    • Gene mutations: Alterations in DNA sequence that can be benign, or lead to disease.
    • Copy number variations (CNVs): Changes in the number of copies of a particular gene, which can affect gene expression levels and contribute to various diseases.
    • Epigenetic markers: Changes in gene expression or cellular phenotype caused by mechanisms other than changes in the DNA sequence itself, such as DNA methylation and histone modification.

      Genomic biomarkers are directly related to an individual’s genetic information and provide insights into genetic predispositions, disease risks, and how an individual might respond to a drug or a treatment based on their genetic makeup.

The main difference between molecular and genomic biomarkers lies in their nature and the level at which they operate. Molecular biomarkers can be any type of molecule that indicates something about the state of the organism, often reflecting the effect of genetic variations, environmental factors, and their interactions. Genomic biomarkers, on the other hand, are specifically related to the genetic information (DNA or epigenetic modifications) of an individual, providing insights into genetic predispositions and the potential for disease or response to treatment. Both types of biomarkers are crucial for personalized medicine, allowing for tailored treatments based on individual molecular and genetic profiles.

Bioinformatics Screening of Potential Biomarkers from mRNA Expression Profiles to Discover Drug Targets and Agents for Cervical Cancer https://www.mdpi.com/1422-0067/23/7/3968

Preparing a GTF file from GenBank for bacterial RNA-seq analysis, using the example of WA

Preparing a GTF file from GenBank for bacterial RNA-seq analysis, using the example of WA

  1. In the example Yersinia enterocolitica strain WA

    #[IMPORTANT FINAL STEPS]
    [STEP1] genbank2fasta.py WA.gb
    [STEP2] python3 ~/Scripts/genbank_to_gtf.py WA.gb WA.gtf  #(1col or 3cols)
    [STEP3] python3 ~/Scripts/modify_gtf.py WA.gtf WA_m.gtf   #(1col or 3cols) "CDS" --> "exon"
  2. In the example of 1585 wildtype genome WA_m.gtf

    #Ziel all ncRNA, rRNA, tmRNA, tRNA as exon, the original types set as gene_biotype "protein_coding"; gene_biotype "tRNA"; ...
    #cut -d$'\t' -f3 WA_m.gtf | sort -u
    #exon
    #gene
    #rRNA
    #tRNA
    #Gene; CDS (renamed exon=4121); rRNA (22); tRNA (84); ncRNA
    [STEP4] replace transcript_id " --> transcript_id "tx- under kate
    [STEP5] python3 ~/Scripts/add_geneid_genename_genebiotype.py WA_m.gtf WA_m_.gtf  #(4cols+3cols) ergänzen all lines as gene_id; gene_name; transcript_id; and add all lines annotated with "gene" at the end with 'gene_biotype "protein_coding";'
        #NO ERROR in WA case.
        #DEBUG ERROR Stop at line : CP143516       biopython       exon    1       1356    .       +       .       gene_id "dnaA"; gene_name "dnaA"; transcript_id "tx-V0R30_00005"
        #    Error Message: Cannot find transcript_id!
    [STEP6] python3 ~/Scripts/add_exonid_to_exon_tRNA_rRNA_tmRNA_ncRNA.py WA_m_.gtf WA_m__.gtf  # (4cols+4cols) Process features of types Gene, CDS, rRNA, tRNA, ncRNA, and tmRNA; For rRNA, tRNA, tmRNA, and ncRNA features, it will add an additional line with exon_id following the pattern described.
    [STEP7] cp WA_m__.gtf WA_m___.gtf; MANUALLY modify the lines tRNA --> exon and 'gene_biotype "tRNA";' 
          #for ncRNA (3), tmRNA (1), rRNA (19), tRNA (69) in 1585_m___.gtf
          #for rRNA (22), tRNA (84) in WA_m___.gtf                            
          #grep "gene_biotype \"tRNA\"" WA_m___.gtf | wc -l  #CHECK they are correctly changed!
    
    [RESULTS]
    #CP143516        biopython       gene    1       1356    .       +       .       gene_id "dnaA"; gene_name "dnaA"; transcript_id "tx-V0R30_00005"; gene_biotype "protein_coding"
    #CP143516        biopython       exon    1       1356    .       +       .       gene_id "dnaA"; gene_name "dnaA"; transcript_id "tx-V0R30_00005"; exon_id "exon-V0R30_00005";
    #...
    #[Intermediate RESULTS]
    #/home/jhuang/DATA/Data_Marc_RNA-seq_Sepidermidis/results_1585/genome/rsem
    #grep ">" genome.transcripts.fa  | wc -l
    #2240 only exon level, but not tRNA, rRNA, tmRNA, ncRNA records!
    
    [STEP8 (Optional, not used in the example)] To include also the 92 records in the results, we can change the annotations records of rRNA, tRNA, tmRNA, ncRNA to exon, so that we can also the records!
    #                see WA_m___.gtf
    #                grep "exon" WA_m___.gtf | wc -l
    #                2332
    #                grep ">" /home/jhuang/DATA/Data_Marc_RNA-seq_Sepidermidis/results_1585/genome/genome.transcripts.fa | wc -l
    #                grep ">" /home/jhuang/DATA/Data_Marc_RNA-seq_Sepidermidis/results_1585/genome/rsem/genome.transcripts.fa | wc -l
    #                2332
    #                4227
    #For example exon-V0R30_00090 is a tRNA.
    #>tx-V0R30_00090
    #GGCTCCTTGGTCAAGCGGTTAAGACACCGCCCTTTCACGGCGGTAACACGGGTTCGAGTCCCGTAGGAGTCACCA
    #We can also filter the count number in the downstream analysis from the big matrix!
    #tmRNA: V0R30_09330; ncRNA: V0R30_05655, V0R30_06520, V0R30_11060
    >tx-V0R30_05655
    AGTCATCTGTGGTGTTCGTAAGTTTGCTTTTTATTTGGGCCTAACACTCTTTGATCAGGGAGCCCAATAGGTTTTCTCGCAGCGCACACGCCTCTATAGGAGGACTTGCAAAACGAGAAACAGGGCACCCACCTGTATATAGCAGGCCGAATGATCAAGCTATTTATAACTACGGCATCAACGGACTCTAT
    >tx-V0R30_06520
    TATTTCGGGTAATCGCTATAGCAATATAGAGGAAAGTCCATGCTCACACAATCTGAGATGATTGTAGTGTTCGTGCTTGATGAAACAATAAATCAAGGCATTAATTTGACGGCAATGAAATAACCTAAGTCATTGGATATGGTTAGAATAGTTTGAAAGTGCCACAGTGACGTAGCTTTTATAGAAATATAAAAGGTGGAACGCGGTAAACCCCTCGAGTGAGCAATCCAAATTGGGTAGGAGCACTTGTTTAGCGGAATTCAACGTATAGACGAGACGATTTTTACGCGAAAGTAAAAATATGTAGACAGATGGTTACCACCGACGTACCAGTGTAACTAGTACACATTATGAGTACAACGGAACAGAACATGGCTTACAGAA
    >tx-V0R30_11060
    CCGTGCTAGGTGGGGAGGTAGCGGTTCCCTGTACTCGAAATCCGCTTTATGCGAGACTTAATTCCTTTGTTGAGGGCGTATTTTTGTGAAGTCTGCCCGAAGCACGTAGTGTTTGAAGATTTCGGTCCTATGCAATATGAACCCATGAACCATGTCAGGTCCTGACGGAAGCAGCATTAAGTGGATCCTCATATGTGCCGTAGGGTAGCCGAGATTTAGCTGTCGACTTCGGTAACGTTTATGATATGCGTTCGATACAAAGGTGCATGG
  3. nextflow running

    nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results_WA --fasta WA.fasta --gtf WA_m___.gtf -profile test_full -resume --max_memory 126.GB --max_time 2400.h --save_reference --aligner star_salmon   --skip_rseqc --skip_dupradar --skip_preseq --skip_biotype_qc --skip_deseq2_qc --min_mapped_reads 0
    
    nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results_GRCh38 --genome GRCh38   -profile test_full -resume --max_memory 126.GB --max_time 2400.h --save_reference --aligner star_salmon  --skip_deseq2_qc
  4. [TODO] generate 1585 Excel table, we need DNA-sequence, we can use the DNA seqeunce from results_WA/genome/genome.transcripts.fa in the gb_to_excel.py

    python3 ~/Scripts/gb_to_excel.py 1585.gb 1585.xlsx
    #insert C./P. (Chrom or Plasmid1, Plasmid2, Plasmid3) as the second chrom in the 1585.xlsx.
  5. GO enrichment analysis of non-model bacterial RNA-seq_Sepidermidis

    ~/Scripts/1_filter_interproscan_out.py
    ~/Scripts/2_split_proteins_with_and_without_GO_terms3.py
    ~/Scripts/3_generate_fasta_not_annotated_by_interpro.py
    
    1. Use Custom Annotations for Enrichment Analysis
    
    For GO term enrichment analysis without standard database support, you can create a custom set of annotations if you have the necessary GO term associations for your genes. This involves manually mapping your genes to GO terms based on your experimental data or literature research.
    2. GO Term Enrichment without biomaRt
    
    If you already have GO term associations for your genes or can obtain them through other means, you can proceed with GO term enrichment analysis using clusterProfiler or similar tools by creating a custom OrgDb or using the enricher function directly with your custom GO term mappings.
    
    Here’s a simplified example using clusterProfiler with custom GO term mappings:
    
    library(clusterProfiler)
    
    # Assuming you have a named vector of genes and corresponding GO terms
    genes <- c("gene1", "gene2", "gene3", ...) # Your gene identifiers
    go_terms <- c("GO:0008150", "GO:0006355", "GO:0004672", ...) # Corresponding GO terms for these genes
    
    # Create a named list where names are GO terms and elements are vectors of gene IDs associated with each GO term
    go_list <- split(genes, go_terms)
    
    # Your list of interesting genes (e.g., those differentially expressed)
    interesting_genes <- c("gene1", "gene2", ...)
    
    # Perform enrichment analysis
    enrichment_results <- enricher(interesting_genes,
                                  TERM2GENE = go_list,
                                  pAdjustMethod = "BH",
                                  pvalueCutoff = 0.05,
                                  qvalueCutoff = 0.2)
    
    # View the enrichment results
    print(enrichment_results)
    
    #Install and Run InterProScan.
    #https://interproscan-docs.readthedocs.io/en/latest/HowToRun.html
    python3 setup.py -f interproscan.properties
    ./interproscan.sh -i 1585_CDS.fasta -f tsv -goterms -pa
    
    #InterProScan's remote option also supports additional output formats and analyses, such as producing graphical overviews (-svg for SVG output) or specific types of analyses only (-appl TIGRFAM,Pfam to limit the search to certain databases). Check the InterProScan documentation for a full list of options and more detailed explanations.
    
    https://interproscan-docs.readthedocs.io/en/latest/UserDocs.html#obtaining-a-copy-of-interproscan
    
    -dp,--disable-precalc                                     Optional.  Disables use of the precalculated match lookup
                                                              service.  All match calculations will be run locally.
    
    InterProScan test run
    
    This distribution of InterProScan provides a set of protein test sequences, which you can use to check how InterProScan behaves on your system. First, if you have not yet run the initialisation script run the following command:
    
    python3 setup.py -f interproscan.properties
    
    This command will press and index the hmm models to prepare them into a format used by hmmscan. This command need only be run once.
    
    You can then run the following two test case commands:
    
    ./interproscan.sh -i test_all_appl.fasta -f tsv -dp
    ./interproscan.sh -i test_all_appl.fasta -f tsv
    
    ./interproscan.sh -i 1585_CDS_test.fasta -f tsv -goterms -pa
    #./interproscan.sh -i 1585_CDS_test.fasta -f TSV,XML -goterms
    
    awk -F'\t' '{if(NR>1) print $1, $14}' 1585_CDS_test.fasta.tsv1 > filtered_GO_terms.txt
    
    #4d_vs_15m-up.id
    #4d_vs_15m-down.id
    
    # Install and load required packages
    #install.packages("rentrez")
    #install.packages("KEGGREST")
    #install.packages("dplyr") # for data manipulation
    #install.packages("ggplot2") # for visualization
    
    library(rentrez)
    library(KEGGREST)
    library(dplyr)
    library(ggplot2)
    
    # Read GenBank file
    genbank_file <- "../../../1585.gb"
    genbank_data <- readLines(genbank_file)
    
    # Extract accession number from the GenBank file
    accession <- grep("^ACCESSION", genbank_data, value = TRUE)
    accession <- gsub("^ACCESSION\\s+", "", accession)
    accession <- trimws(accession)
    
    # Function to retrieve sequence information from GenBank file
    get_sequence_from_genbank <- function(genbank_file) {
      # Read GenBank file
      genbank_data <- readLines(genbank_file)
    
      # Find sequence lines starting with "ORIGIN"
      origin_line <- grep("^ORIGIN", genbank_data)
    
      # Extract sequence lines
      sequence_lines <- genbank_data[(origin_line + 1):length(genbank_data)]
    
      # Remove non-sequence characters and concatenate into a single sequence
      sequence <- gsub("[0-9 ]", "", sequence_lines)
      sequence <- paste(sequence, collapse = "")
    
      return(sequence)
    }
    
    # GenBank file
    genbank_file <- "1585.gb"
    
    # Retrieve sequence from GenBank file
    sequence <- get_sequence_from_genbank(genbank_file)
    
    # Perform KEGG pathway enrichment analysis
    kegg_pathways <- keggList("pathway")
    
    # Toy example: Randomly select 100 genes and map to KEGG pathways
    set.seed(123)
    genes <- sample(1:1000, 100)
    
    # Mapping genes to KEGG pathways
    mapped_pathways <- sapply(genes, function(gene) {
      kegg_gene <- keggGet("link", gene)
      if (length(kegg_gene) > 0) {
        pathway_ids <- kegg_gene$ENTRY %>% unique()
        pathway_names <- kegg_pathways[kegg_pathways$ID %in% pathway_ids, "NAME"]
        if (length(pathway_names) > 0) {
          return(paste(pathway_names, collapse = "; "))
        } else {
          return(NA)
        }
      } else {
        return(NA)
      }
    })
    
    # Filter out NAs
    mapped_pathways <- mapped_pathways[!is.na(mapped_pathways)]
    
    # Count pathway occurrences
    pathway_counts <- table(unlist(strsplit(mapped_pathways, "; ")))
    
    # Sort pathways by occurrence
    pathway_counts <- sort(pathway_counts, decreasing = TRUE)
    
    # Top 10 enriched pathways
    top_pathways <- head(pathway_counts, 10)
    
    # Plot the top enriched pathways
    barplot(top_pathways, main = "Top 10 Enriched KEGG Pathways",
            xlab = "Pathway", ylab = "Frequency")

Virulence Plasmid Absence in Y. enterocolitica 1Aa Subcluster

ggtree_and_gheatmap_198

ggtree_and_gheatmap_198

Pathogenicity of Yersinia enterocolitica biotype 1A https://academic.oup.com/femspd/article/38/2/127/531753

Abstract

  - Yersinia enterocolitica strains of biotype 1A lack the known virulence determinants of strains in other categories, including the Yersinia virulence plasmid (pYV), and several chromosomal markers of pathogenicity. 
  - For this reason, and also because Y. enterocolitica strains of biotype 1A are frequently isolated from the environment or asymptomatic individuals, these bacteria are often assumed to be avirulent.
  - On the other hand, there is a considerable body of clinical, epidemiological and experimental evidence to indicate that at least some strains of Y. enterocolitica biotype 1A are able to cause gastrointestinal symptoms which resemble those caused by pYV-bearing strains.
  - The availability of a number of experimental systems, including cell culture and animal models of infection, provides an opportunity to identify and characterise the essential virulence determinants of biotype 1A strains.
  1. Introduction

    • The genus Yersinia encompasses a heterogeneous collection of facultatively anaerobic bacteria that belong to the family Enterobacteriaceae.

    • Of the 11 species within this genus [1], only three, Y. pestis, Y. pseudotuberculosis and Y. enterocolitica, are regarded as pathogenic for humans.

    • The virulence of bacteria in all three of these species correlates with carriage of a highly conserved plasmid, termed pYV [2], such that Yersinia spp. which lack this plasmid are considered avirulent.

    • Of the three pYV-bearing species of Yersinia, Y. enterocolitica is the most heterogeneous, being divisible into approximately 30 distinct serotypes (on the basis of antigenic variation in cell wall lipopolysaccharide), and six biotypes (on the basis of variations in biochemical behaviour) (Tables 1 and 2).

    • Of the six biotypes, biotype 1A is the most heterogeneous, and encompasses a wide range of serotypes (Table 2), of which serotypes O:5, O:6,30, O:6,31, O:7,8, O:10, as well as O-non-typable strains, are isolated most often (Table 3).

    • The diversity of biotype 1A strains is also evident when individual isolates are examined by techniques, such as ribotyping or pulsed-field gel electrophoresis, which show that strains of the same serotype, e.g. serotype O:5, show considerable genetic diversity, whereas pYV-bearing serotypes, including O:3 and O:9, are relatively well conserved [32,33].

    • One common feature of individual biotype 1A strains of Y. enterocolitica, however, is that they never carry pYV.

    • For this reason, and also because biotype 1A strains are frequently isolated from the environment, strains of this biotype are often said to be avirulent.

    • In this article, we review the epidemiological and experimental data regarding the pathogenicity of Y. enterocolitica biotype 1A and provide evidence that some strains of this biotype are able to cause disease by mechanisms that are independent of pYV or other known virulence determinants of Y. enterocolitica.

  2. Virulence determinants of pYV-bearing strains of Y. enterocolitica

    • Apart from pYV itself, pYV-bearing strains of Y. enterocolitica require a number of chromosomally encoded factors to express full virulence.

    • Some of these factors are restricted to pYV-bearing bacteria, whereas others occur more widely. -[NOTE that translocation and secretion are different: translocation is yop from bacteria to host cell, secretion is yop from bacteria to environment.]

    • Factors that are mostly limited to pYV-bearing strains of Y. enterocolitica include invasin, an outer membrane protein that is required for efficient translocation of bacteria across the intestinal epithelium [34]; Ail, another outer membrane protein that may contribute to adhesion, invasion and resistance to complement-mediated lysis [35]; Yst-a, a heat-stable enterotoxin that may contribute to the pathogenesis of diarrhoea associated with acute yersiniosis [36,37]; and Myf, a fimbrial antigen and putative adhesin [38].

    • [NOTE that we can find the high-pathogenicity island by comparing 1A with 1B.]

    • In addition, strains of biotype 1B, which are particularly virulent for humans and laboratory animals, carry a high-pathogenicity island which facilitates the uptake and utilisation of iron by bacterial cells, and hence may promote their growth under iron-limiting conditions in host tissues [39].

    • Virulence-associated determinants of pYV-bearing Y. enterocolitica that also occur in pYV-negative strains include cell surface lipopolysaccharide and SodA (a superoxide dismutase), which appear to facilitate bacterial survival in tissues [40,41], as well as urease, which enhances bacterial resistance to stomach acid and may also play a role in nitrogen assimilation [42,43].

    • pYV functions mainly as an anti-host genome that permits the bacteria which carry it to resist phagocytosis and complement-mediated lysis, thus allowing them to proliferate extracellularly in tissues.

    • The genes carried on pYV include those for: (i) an outer membrane protein adhesin, YadA; (ii) a type III protein secretory apparatus which translocates effector proteins, known as Yops, from the bacterial cell to the cytoplasm of susceptible host cells; (iii) at least six distinct anti-host effector Yops; and (iv) factors involved in the regulation of Yop biosynthesis, secretion and translocation [2].

    • The contribution of pYV-encoded factors, in particular YadA and the Yop effectors, to bacterial virulence has been established in a large number of experiments.

    • Strains of yersiniae which lack pYV are susceptible to killing by complement and polymorphonuclear leukocytes, although they are able to persist in macrophages and non-professional phagocytic cells, and cause short-lived infections which typically are asymptomatic [44].

  3. Evidence favouring the lack of virulence of biotype 1A strains (正)

    • Biotype 1A strains of Y. enterocolitica are often considered to be non-pathogenic primarily because they do not possess the virulence-associated determinants of pYV-bearing strains.

    • For example, pYV, which is considered indispensable for yersinia virulence, is never found in strains of biotype 1A [45].

    • In addition, biotype 1A strains typically lack Ail, Myf, the ysa type III secretion system or the high-pathogenicity island, and only occasionally produce Yst-a [38,45^50].

    • Another line of evidence that is taken to indicate the avirulence of biotype 1A strains of Y. enterocolitica is their relatively high prevalence in the environment and healthy animals.

    • Indeed, biotype 1A strains are ubiquitous, inhabiting a wide variety of environmental niches such as soil and various sources of water, including streams, lakes, water wells, and wastewater [7,16,51^53].

    • They are also frequently isolated from foods, including various vegetables and animal products, such as pork, poultry, packaged meat, seafoods, raw milk and pasteurised dairy products [16,54^60].

    • Biotype 1A yersiniae are also found in a vast array of animals, including birds, fish, various insects, frogs and a wide range of mammals, including cattle, sheep, pigs and rodents [7,16,61].

    • In most cases, animals that harbour biotype 1A strains are asymptomatic, thus giving support to the concept that these bacteria are avirulent commensals.

    • There are also some epidemiological data to suggest that Y. enterocolitica biotype 1A strains are not pathogenic for humans.

    • For example, two large studies in Belgium, involving the microbiological investigation of more than 24 000 faecal samples over a period of almost 16 years, revealed that infection with biotype 1A yersiniae was not associated with gastrointestinal symptoms and that biotype 1A strains were more frequent amongst healthy subjects than in those who were being investigated for gastrointestinal complaints [62,63].

    • The relative avirulence of some biotype 1A isolates is supported by the results of several studies of experimental infection of animals.

    • In contrast to pYV-bearing strains, which may cause a fatal infection in susceptible strains of mice, especially if the bacteria carry the high-pathogenicity island or the mice are pre-treated with iron or microbial siderophores [64,65], biotype 1A yersiniae are intrinsically avirulent for mice.

    • Pai et al. [66] and Une [67] infected rabbits perorally with different biotype 1A strains from raw fish (serotype O:6,30) and pig intestine (serotype O:5), respectively, and concluded that these bacteria were avirulent.

    • Similar findings were reported by Robins-Browne et al. [68] who found that gnotobiotic piglets, inoculated perorally with a biotype 1A strain of serotype O:5, which was originally isolated from milk, rapidly cleared the bacteria without developing any clinical or pathological evidence of disease.

    • Interestingly, however, Corbel et al. [69] reported that a biotype 1A strain (serotype O:6) which was isolated from the liver of an aborted lamb, caused placentitis and abortion when inoculated into pregnant ewes.

    • This finding suggests that animals may differ in their susceptibility to infection with biotype 1A yersiniae or that the pathogenicity of different biotype 1A strains may vary.

  4. Evidence supporting the pathogenicity of biotype 1A strains of Y. enterocolitica (反)

    • Despite the lack of identifiable virulence determinants in Y. enterocolitica strains of biotype 1A, these bacteria are frequently isolated from humans with gastrointestinal diseases as indicated in Table 3.
    • Most of the studies listed in Table 3 are uncontrolled clinical observations, however, and therefore cannot be taken to indicate a causative association between Y. enterocolitica strains of biotype 1A and gastrointestinal complaints.
    • In one study, however, Ebringer et al. [10] showed a significant association between infection with Y. enterocolitica biotype 1A and the occurrence of diarrhoea or other gastrointestinal symptoms.
    • Moreover, in a prospective case-control study of infants with diarrhoea in Santiago, Chile, Morris et al. [28] found Y. enterocolitica in 1.9% of patients and 0.6% of age-matched controls (P = 0.05).
    • Biotype 1A strains were isolated from seven children with diarrhoea, but not from any asymptomatic children (P = 0.02).
    • Additional evidence of the clinical significance of biotype 1A yersiniae is that some patients infected with these bacteria show a speci¢c antibody response to the infecting strain [15,20].
    • Although most reports of biotype 1A-associated disease refer to cases of sporadic infection, at least two outbreaks of gastrointestinal infection due to biotype 1A yersiniae have been reported.
    • Ratnam et al. [18] in Canada described a nosocomial outbreak of diarrhoea that was attributed to a strain of biotype 1A, serogroup O:5, which affected nine hospitalised patients, and was evidently acquired from two patients who had been hospitalised for intermittent diarrhoea.
    • More recently, Greenwood et al. [23] obtained Y. enterocolitica biotype 1A, serotype O:10, from 19 paediatric inpatients in a district general hospital in England over a 3-month period.
    • Shortly afterwards, Y. enterocolitica biotype 1A, serotype O:6,30 was isolated from a further 17 hospitalised children in 1 month.
    • The likely source of the O:6,30 strain was pasteurised milk that had been contaminated with a small quantity of raw milk [70].
    • Published reports of infection with biotype 1A Y. enterocolitica are likely to under-represent the true prevalence of these infections, because many diagnostic laboratories will choose to disregard the potential clinical significance of these strains and not report them.
    • In addition, laboratory techniques that are used to isolate Y. enterocolitica from faeces are not designed to isolate biotype 1A strains, and may even select against them.
    • For example, because enrichment of cultures at 4°C is known to increase the likelihood of isolating biotype 1A Y. enterocolitica from faecal samples [8,14,18,19,62,71,72], some authorities have advised against the use of cold enrichment to reduce the chances of isolating ‘non-pathogenic’ strains [63,73].
  5. Clinical manifestations of infections with biotype 1A Y. enterocolitica

    • Infection with pYV-bearing strains of Y. enterocolitica may have protean manifestations depending on the age and individual susceptibility of the host.

    • Among the more common presenting features are diarrhoea, especially in young children, enterocolitis, pseudo-appendicitis due to terminal ileitis, acute mesenteric lymphadenitis, pharyngitis and post-infection autoimmune sequelae, such as reactive arthritis and erythema nodosum.

    • Infrequent, but potentially life-threatening complications include visceral abscesses, pneumonia, intussusception, endocarditis and septicaemia [74].

    • By contrast, infections with biotype 1A strains are generally limited to the gastrointestinal tract with diarrhoea and abdominal pain the commonest symptoms.

    • Nevertheless, some cases associated with reactive arthritis and other autoimmune complications more usually associated with pYV-positive strains of Y. enterocolitica have been described [10,20].

    • Other symptoms of infection with biotype 1A yersiniae include fever, nausea, vomiting and malaise [17,18,25,28,75,76].

    • Infection may also manifest as colitis, terminal ileitis and pseudo-appendicitis [17,76].

    • A commonly reported feature of biotype 1A-associated diarrhoea is that the infection may persist for several weeks or months, sometimes involving cyclical periods of acute disease and remission [17-19,25,76].

    • Infection with biotype 1A strains is said to be frequent in all age groups, in contrast to pYV-bearing strains which are more frequent in children [6,13,17,25].

    • There is also evidence to suggest that infections with biotype 1A yersiniae are more common in individuals who are predisposed to infections in general [18,25], possibly attesting to the relatively low virulence of at least some biotype 1A strains.

    • In general, however, acute infection with biotype 1A yersiniae tends to mimic infection with pYV-bearing Y. enterocolitica in terms of clinical manifestations and severity.

    • A study by Burnens et al. [75] in Switzerland demonstrated that the duration and severity of infection associated with biotype 1A yersiniae was indistinguishable from that due to classical virulent biotypes.

    • Similar observations have been made in patients from England, Chile and Spain [26,28,77].

    • In contrast to pYV-bearing Y. enterocolitica, however, biotype 1A strains seldom give rise to extraintestinal infections or autoimmune sequelae.

    • Nevertheless, biotype 1A strains have been isolated from abdominal exudates, wounds, sputum, urine, a labial ulcer and the gall bladder [7,12,19], and have been linked to cases of reactive arthritis [10].

    • Presumably because they are susceptible to complement-mediated lysis, biotype 1A yersiniae rarely cause septicaemia and have only once been associated with a human death [21].

    • In this case, a strain of serotype O:7,8 was isolated post mortem from the spleen and small intestine of a patient in Bangladesh, but Shigella boydii was also isolated from the large intestine of this patient [21].

    • Patients who are convalescing from biotype 1A-induced gastroenteritis generally display low or undetectable titres of circulating antibodies to these bacteria [8].

    • This has resulted in these strains being overlooked as aetiological agents, because pYV-bearing Y. enterocolitica, in particular serotype O:3, frequently evoke serum agglutinating antibody titres of greater than 1 in 200.

    • However, low titres of antibody have also been found in patients with proven infection due to pYV-bearing Y. enterocolitica, including serotypes O:5,27, O:1,2,3 and O:21 [8,27], indicating that antibody titre does not necessarily correlate with the virulence of the infecting strain.

    • Interestingly, Fletcher et al. [78] found that levels of secretory IgA antibodies in faeces directed against the patient’s own strain were similar in individuals infected with serotype O:3 yersiniae and biotype 1A strains.

  6. Pathogenesis (6.1. Enterotoxins)

    • Little is known of the pathogenic mechanisms of biotype 1A strains.
    • Endoscopic examination of patients infected with these bacteria has revealed no obvious ulcerative or inflammatory changes [76], suggesting that the symptoms may be due to secreted toxins.
    • When first isolated from clinical samples, most pYV-bearing strains of Y. enterocolitica secrete a low-molecular-mass, heat-stable enterotoxin, known as Yst-a [79].
    • This toxin is active in infant mice, and may contribute to the production of diarrhoea by these bacteria [36].
    • Although biotype 1A yersiniae seldom produce Yst-a, more than 80% of strains carry the ystB gene, which encodes a closely related, mouse-reactive toxin, known as Yst-b (Table 4) [50,80].
    • The observation that many of the bacterial strains which carry the ystB gene do not produce detectable amounts of enterotoxin indicates that ystB is not always expressed in vitro.
    • A similar situation pertains to pYV-bearing Y. enterocolitica, which, after repeated passage or prolonged storage, frequently lose the ability to produce Yst-a.
    • This phenomenon is not caused by mutation of the ystA gene, but is due to silencing of this gene by YmoA (Yersinia modulator) [81].
    • As biotype 1A yersiniae also carry the ymoA gene (unpublished data), it is conceivable that in these bacteria YmoA is able to silence ystB.
    • However, it is also possible that ystB is non-functional in some biotype 1A strains.
    • One biotype 1A strain of serogroup O:6 that was isolated from a child with diarrhoea, secretes a novel heat-stable enterotoxin termed YST II, which is also mouse-reactive, but has a distinct mechanism of action compared with other heat-stable enterotoxins [82].
    • The gene encoding YST II has not been identi¢ed, and hence the prevalence of YST II production amongst biotype 1A strains in general has not been elucidated.
  7. Pathogenesis (6.2. Fimbrial adhesins)

    • Another potential virulence factor of biotype 1A yersiniae are ¢mbriae, which take various forms in these bacteria.
    • One of these, designated MR/Y-HA, is 8 nm in diameter, agglutinates erythrocytes of 10 di¡erent animal species in the presence of mannose, and is expressed in vitro at low temperatures, but not at 37°C [83].
    • A second type of ¢mbria, designated MR/K-like HA, is 4 nm in diameter and mediates mannose-resistant haemagglutination of chicken erythrocytes, but not erythrocytes from a variety of other species [83].
    • Expression of these fimbriae in vitro occurs only after serial passaging of bacteria for at least 7 days.
    • Moreover, as they do not mediate adherence of bacteria to cultured epithelial cells [84], their contribution to the pathogenesis of infection with biotype 1A strains is unknown.
    • Some strains of Y. enterocolitica produce a ¢mbrial adhesin, named Myf (for mucoid Yersinia fibrillae), because it bestows a mucoid appearance on bacterial colonies which express it [38].
    • Myf are narrow flexible fimbriae which resemble CS3, an essential colonisation factor of some human strains of enterotoxigenic Escherichia coli [85].
    • MyfA, the major structural subunit of Myf, shows some homology to the PapG protein of pyelonephritis-associated strains of E. coli, and is 44% identical at the DNA level to the pH6 antigen of Y. pseudotuberculosis and Y. pestis, which also has a ¢brillar structure and mediates thermoinducible binding of Y. pseudotuberculosis to tissue culture cells [86,87].
    • As with Yst-a, however, myf genes occur in a minority of biotype 1A strains [50] (Table 4), and their contribution to the virulence of these bacteria is unknown.
  8. Pathogenesis (6.3. Interaction with epithelial cells and macrophages)

    • Preliminary observations in our laboratory indicated that a biotype 1A isolate, which was obtained from a child with diarrhoea [28,82], was able to resist killing by murine peritoneal macrophages in vitro.

    • We also showed that this was not a uniform characteristic of biotype 1A Y. enterocolitica, as another serotype O:6 strain, which was originally isolated from milk [68], was killed by macrophages under the same experimental conditions.

    • As biotype 1A strains of Y. enterocolitica comprise a heterogeneous group of bacteria which occupy a wide range of environmental niches, we postulated there may be a pathogenic subgroup of these bacteria which cannot be readily identified because they lack the known virulence determinants of pYV-bearing strains.

    • To learn more about the possible virulence mechanisms of biotype 1A yersiniae, we divided the biotype 1A strains in our culture collection into two groups.

    • One group comprised strains isolated from patients with gastrointestinal symptoms consistent with yersiniosis, and therefore was hypothesised to include mostly virulent strains.

    • The second group comprised strains isolated from environmental sources, which were postulated to be saprophytic, and incapable of causing disease.

    • Initially, we examined the two groups of bacteria for their carriage of known or suspected virulence-associated genes of Y. enterocolitica, including ail, myfA, ystA, ystB and virulence genes borne on pYV [50].

    • The results of this investigation showed that apart from ystB, which was detected in more than 80% of strains, fewer than 15% of biotype 1A strains carried sequences that were homologous to virulence-associated genes of pYV-bearing bacteria (Table 4).

    • We also found that the frequency of these genes was similar in strains of clinical and environmental origin.

    • The ability of Y. enterocolitica to invade epithelial cells is an important correlate of virulence [46,91].

    • One reason that strains of biotype 1A have been considered to be avirulent is that they invade tissue culture cells to a lesser extent than pYV-bearing strains [88,89], although, paradoxically, pYV itself may retard cell invasion via the effects of translocated Yops on cytoskeletal proteins [2].

    • We have confirmed that biotype 1A strains penetrate HEp-2 cells, a continuous line of human-derived epithelial cells, less effectively than pYV-cured strains [50]. Interestingly, however, we observed that strains of clinical origin invaded HEp-2 cells, Chinese hamster ovary cells and T84 cells, a polarised epithelial cell line of intestinal origin, to a significantly greater extent than strains obtained from other sources (Fig. 1) [50].

    • Electron microscopic examination of HEp-2 and bone marrow-derived macrophages incubated with clinical isolates of Y. enterocolitica biotype 1A revealed a novel pattern of invasion, in that some cells contained large numbers of internalised bacteria which were sametimes located within vacuoles, but most cells were spared (Figs. 2A, 3A and 3B).

    • By contrast, cells infected with biotype 1A strains of environmental origin contained no more than two (often degraded) bacteria per cell, with the great majority of cells being uninfected (Fig. 2B and 3C).

    • Clinical isolates of Y. enterocolitica strains of biotypes 1B and 2, which bore classical chromosomally encoded virulence markers, but were cured of pYV, invaded cells in a more even distribution than that observed with biotype 1A yersiniae (Fig. 2C).

    • These findings suggest that biotype 1A strains of Y. enterocolitica may invade cells by a novel mechanism which differs from that employed by pYV-bearing strains.

    • In keeping with our original observation that a biotype 1A strain of clinical origin was more resistant to killing by macrophages than an environmental isolate, we subsequently showed that relative resistance to macrophage-induced killing is a characteristic of clinical biotype 1A strains in general, although there is considerable overlap between individual clinical and environmental strains [90].

    • Interestingly, we also found that clinical strains are able to replicate within epithelial cells and macrophages and then escape from these cells, whereas environmental strains are not [90].

    • The mechanism of bacterial escape is unknown, but insofar as it does not involve host cell lysis, it appears to resemble exocytosis [90].

    • Recently, we have used transposon mutagenesis in an attempt to identify the bacterial genes required for escape.

    • Preliminary results of this work indicate that bacteria require intact cell wall lipopolysaccharide to replicate within host cells before they can escape.

    • We have also used genomic subtractive hybridisation to determine genetic differences between a clinical and an environmental strain of Y. enterocolitica biotype 1A, with the aim of identifying putative virulence-associated genes in the clinical strain [91].

    • This strategy yielded 54 DNA fragments that were speci¢c for the clinical strain. Of these, 33 sequences had signi¢cant matches with known proteins in the GenBank database, such as a flagellar hook-associated protein, an adenine methyltransferase and an autotransporter protein.

    • Interestingly, four fragments exhibited homology to three insecticidal toxin complex (tc) genes that were ¢rst identi¢ed in Photorhabdus luminescens [92] and are also present in Y. pestis [93,94].

    • The contribution of these genes to the virulence of Y. enterocolitica is not known.

  9. Pathogenesis (6.4. Animal models of infection)

    • Studies to identify the essential virulence determinants of pYV-bearing Y. enterocolitica have been facilitated by access to animal models of infection, such as guinea pigs or mice pre-treated with iron and desferrioxamine [65,95].
    • The latter is a microbial siderophore that permits pYV-bearing strains of Y. enterocolitica, which lack the high-pathogenicity island [39], to grow under iron-limiting conditions in vitro and in vivo [65].
    • Until recently, studies to identify virulence determinants of biotype 1A yersiniae were hampered by the lack of a suitable animal model.
    • We have shown, however, that mice inoculated perorally with 10 9 colony-forming units of biotype 1A Y. enterocolitica may excrete the bacteria for up to 10 days (Fig. 4).
    • Importantly, strains of clinical origin colonise both the small and large intestine of mice for signi¢cantly longer periods than strains obtained from water or milk [50].
    • The extent and duration of bacterial excretion in the faeces in this model were enhanced somewhat by pre-treating the mice with iron and desferriox-amine.
  10. Summary and conclusions

    • Despite the absence of recognised virulence determinants in biotype 1A strains of Y. enterocolitica, there is considerable clinical, epidemiological and experimental evidence to indicate that at least some of these strains are able to cause gastrointestinal disease in humans that clinically resembles yersiniosis caused by pYV-bearing Y. enterocolitica strains of other biotypes. In this respect, Y. enterocolitica may resemble enterovirulent strains of E. coli, which are classified into distinct pathotypes that differ from each other in terms of their specific virulence determinants [96].
    • Our observations that biotype 1A strains of clinical origin di¡er signi¢cantly from non-clinical isolates in terms of their greater capacity to (1) penetrate cultured epithelial cells, (2) survive within macrophages, (3) exocytose from epithelial cells and macrophages and (4) colonise the intestinal tract of orally inoculated mice for prolonged periods provide a variety of convenient models in which to assay the pathogenicity of these bacteria.
    • These models could be used to identify and characterise the virulence determinants of biotype 1A yersiniae more precisely.
    • Once these factors are identified, it will be important to determine if they also contribute to the virulence of pYV-bearing strains of Y. enterocolitica, as well as pYV-negative strains of biotypes other than 1A (Table 2), and even other Yersinia species, such as Y. frederiksenii and Y. bercovieri, which have been associated with symptomatic infections in humans, and which also lack pYV [97].