Author Archives: gene_x

cfDNA Sequencing: Technological Approaches and Bioinformatic Issues

Cell free circulating DNA (cfDNA) refers to DNA fragments present outside of cells in body fluids such as plasma, urine, and cerebrospinal fluid (CSF). CfDNA was first identified in 1948 from plasma of healthy individuals [1]. Afterward, studies showed that the quantity of this cfDNA in the blood was increased under pathological conditions such as auto-immune diseases [2] but also cancers [3]. In 1989, Philippe Anker and Maurice Stroun, from the University of Geneva, demonstrated that this cfDNA from cancer patients carries the characteristics of the DNA from tumoral cells [4]. Next, using the recently developed technique of PCR, David Sidransky and his team found the same mutations of TP53 in bladder tumoral samples and urine pellets from patients [5]. Then, the research and identification of genomic anomalies specific of a cancer type in the circulating DNA, such as NRAS and KRAS mutations or HER-2 amplifications [6,7,8], started to expand, and for the first time, the term of circulating tumor DNA (ctDNA) appeared.

Since the highlighting of this circulating DNA of tumoral origin, technological developments in molecular biology, from quantitative and digital PCR to Next Generation Sequencing, turned it into a powerful liquid biopsy tool. At the era of precision medicine, it seems crucial to identify molecular alterations that will be able to guide the therapeutic management of patients. As tumors release DNA in the blood or other body fluids such as urine, this circulating tumoral DNA, containing the molecular characteristics of the tumor, can be collected with a simple body fluid sample. Since it is minimally invasive, this liquid biopsy is easily repeatable during follow up and in case of relapse. It is also of major interest in some particular cancers where a tumoral biopsy is difficult to obtain such as primary central nervous system lymphoma [9] or cancer subtypes with tissue biopsy containing very little tumoral cells such as Hodgkin lymphoma (HL) for which Reed–Sternbeg cells represent only 0.1 to 2% of the tumoral mass [10,11]. In these particular conditions and malignancies, the sequencing of ctDNA in body fluids could serve as a surrogate for a tumor biopsy. Other body fluids than blood are often used according to the localization of the tumor, such as urine for bladder cancers or cerebrospinal fluid for cerebral tumors [9,12] but blood is the body fluid most often used in studies.

In blood, average cfDNA concentration in healthy individuals can range between 0 and 100 ng/mL of plasma with an average of 30 ng/mL of plasma and is significantly higher in blood of cancer patients, varying between 0 and 1000 ng/mL, with an average of 180 ng/mL [13]. This concentration is correlated with the stage of the cancer, increasing with higher stages, and the size of the tumor. Circulating DNA of tumoral origin represents from 0.01 to more than 90% of the total cell free DNA found in blood [14]. In different types of cancers, a large scale ctDNA sequencing study has shown an association between ctDNA levels and mutational tumor burden [15]. Moreover, given the spatial heterogeneity observed in tumor tissue, ctDNA analysis can determine the complete molecular landscape of a patient’s tumor and give supplementary information on drug targetable alterations and resistant variants [16]. ctDNA kinetics during follow up is correlated with prognosis, as a drastic reduction in its level after treatment is associated with better prognosis, whereas an increase usually means the evolution of drug resistant clones and an ultimate therapeutic failure [17,18,19,20].

Detection of ctDNA during MRD follow up to predict early relapse and at diagnosis in early stages of cancer continues to be a challenge, as the fraction of tumoral DNA contents in total circulating DNA may be <0.01% [21,22]. The development of sequencing technologies being more and more sensitive allows the detection of alterations present in cfDNA at very low variant allele frequencies (VAF), not only for mutational profiling at diagnosis but also for the early detection of disease recurrence and monitoring for therapy response. However, several parameters can affect the sensitivity of ctDNA detection. First, adequate handling of the blood sample, from blood collection to the quality control of the cfDNA extracted, is crucial in analysis. Next, an important step is the choice of the biomarker (s) and the sequencing technology used to detect it. Then, bioinformatic analysis, using error suppression algorithms, is the ultimate tool to discriminate the true variant from false positives.

无细胞循环DNA(cfDNA)指的是体液中细胞外的DNA碎片,如血浆、尿液和脑脊液(CSF)。cfDNA最早在1948年从健康个体的血浆中被发现[1]。此后,研究表明,这种cfDNA在血液中的数量在如自身免疫性疾病[2]等病理状态下增加,以及癌症[3]。1989年,日内瓦大学的Philippe Anker和Maurice Stroun展示了癌症患者的cfDNA携带了肿瘤细胞DNA的特征[4]。接下来,使用新开发的PCR技术,David Sidransky及其团队在膀胱肿瘤样本和患者的尿沉渣中发现了相同的TP53突变[5]。然后,对循环DNA中特定癌症类型的基因组异常的研究和识别,如NRAS和KRAS突变或HER-2扩增[6,7,8]开始扩展,首次出现了循环肿瘤DNA(ctDNA)这一术语。

自从突出了这种肿瘤来源的循环DNA以来,分子生物学的技术发展,从定量和数字PCR到下一代测序,使其成为了一个强大的液体活检工具。在精准医学时代,识别能够指导患者治疗管理的分子改变似乎至关重要。由于肿瘤释放DNA到血液或其他体液如尿液,这种含有肿瘤分子特征的循环肿瘤DNA可以通过简单的体液样本收集。由于它是微创的,这种液体活检在随访和复发时容易重复。它在某些难以获得肿瘤活检的特定癌症中也非常重要,如原发性中枢神经系统淋巴瘤[9]或组织活检中肿瘤细胞很少的癌症亚型,如霍奇金淋巴瘤(HL),其Reed–Sternbeg细胞仅占肿瘤质量的0.1到2%[10,11]。在这些特殊条件和恶性病中,体液中的ctDNA测序可以作为肿瘤活检的替代品。除了血液外,根据肿瘤的位置,常常使用其他体液,如膀胱癌的尿液或脑瘤的脑脊液[9,12],但血液是研究中最常用的体液。

在血液中,健康个体的平均cfDNA浓度可以在0至100 ng/mL血浆之间,平均为30 ng/mL血浆,而癌症患者的血液中则显著更高,变化在0至1000 ng/mL,平均为180 ng/mL [13]。这个浓度与癌症的阶段相关,随着阶段的提高和肿瘤大小的增加而增加。肿瘤来源的循环DNA占血液中发现的总无细胞DNA的0.01%到90%以上[14]。在不同类型的癌症中,一个大规模ctDNA测序研究显示ctDNA水平与突变肿瘤负担之间存在关联[15]。此外,鉴于在肿瘤组织中观察到的空间异质性,ctDNA分析可以确定患者肿瘤的完整分子景观,并提供关于可药物靶向的改变和耐药变异的补充信息[16]。随访期间ctDNA动态与预后相关,治疗后其水平的急剧减少与更好的预后相关,而增加通常意味着耐药克隆的发展和最终的治疗失败[17,18,19,20]。

在MRD随访期间检测ctDNA以预测早期复发,在癌症早期阶段诊断中继续是一个挑战,因为总循环DNA中的肿瘤DNA比例可能小于0.01%[21,22]。越来越敏感的测序技术的发展允许在非常低的变异等位基因频率(VAF)下检测cfDNA中存在的改变,不仅用于诊断时的突变分析,也用于疾病复发的早期检测和治疗响应的监测。然而,几个参数可能影响ctDNA检测的敏感性。首先,从采血到提取的cfDNA的质量控制,血液样本的适当处理在分析中至关重要。接下来,选择生物标志物和用于检测它的测序技术是一个重要步骤。然后,使用错误抑制算法的生物信息学分析是区分真变异和假阳性的最终工具。

  1. using DAMIAN to analyse the cfDNA sequencing data

    cd ~/Tools/damian/databases/blast
    
    damian.rb --host human3 --type dna -1 ./231114_VH00358_62_AACYCYWM5_cfDNA/p20428/neg_control_S2_R1_001.fastq.gz -2 ./231114_VH00358_62_AACYCYWM5_cfDNA/p20428/neg_control_S2_R2_001.fastq.gz --sample neg_control_S2_megablast --blastn never --blastp never --min_contiglength 200 --threads 55 --force
    damian_report.rb
    zip -r neg_control_S2_megablast.zip neg_control_S2_megablast/
    echo -e "xxxx" | mutt -a "./neg_control_S2_megablast.zip" -s "New results from DAMIAN" -- "xxx@xxx.com"
    
    damian.rb --host human3 --type dna -1 ./231114_VH00358_62_AACYCYWM5_cfDNA/p20429/635724976_S_aureus_epidermidis_S3_R1_001.fastq.gz -2 ./231114_VH00358_62_AACYCYWM5_cfDNA/p20429/635724976_S_aureus_epidermidis_S3_R2_001.fastq.gz --sample 635724976_S_aureus_epidermidis_S3_megablast --blastn never --blastp never --min_contiglength 200 --threads 55 --force
    damian_report.rb
    zip -r 635724976_S_aureus_epidermidis_S3_megablast.zip 635724976_S_aureus_epidermidis_S3_megablast/
    echo -e "xxxx" | mutt -a "./635724976_S_aureus_epidermidis_S3_megablast.zip" -s "New results from DAMIAN" -- "xxx@xxx.com"
    
    damian.rb --host human3 --type dna -1 ./231114_VH00358_62_AACYCYWM5_cfDNA/p20430/635290002_CMV_S4_R1_001.fastq.gz -2 ./231114_VH00358_62_AACYCYWM5_cfDNA/p20430/635290002_CMV_S4_R2_001.fastq.gz --sample 635290002_CMV_S4_megablast --blastn never --blastp never --min_contiglength 200 --threads 55 --force
    damian_report.rb
    zip -r 635290002_CMV_S4_megablast.zip 635290002_CMV_S4_megablast/
    echo -e "xxxx " | mutt -a "./635290002_CMV_S4_megablast.zip" -s "New results from DAMIAN" -- "xxx@xxx.com"
    
    damian.rb --host human3 --type dna -1 ./231114_VH00358_62_AACYCYWM5_cfDNA/p20431/635850623_EBV_S5_R1_001.fastq.gz -2 ./231114_VH00358_62_AACYCYWM5_cfDNA/p20431/635850623_EBV_S5_R2_001.fastq.gz --sample 635850623_EBV_S5_megablast --blastn never --blastp never --min_contiglength 200 --threads 55 --force
    damian_report.rb
    zip -r 635850623_EBV_S5_megablast.zip 635850623_EBV_S5_megablast/
    echo -e "xxxx " | mutt -a "./635850623_EBV_S5_megablast.zip" -s "New results from DAMIAN" -- "xxx@xxx.com"
    
    damian.rb --host human3 --type dna -1 ./231114_VH00358_62_AACYCYWM5_cfDNA/p20427/635031018_E_faecium_S1_R1_001.fastq.gz -2 ./231114_VH00358_62_AACYCYWM5_cfDNA/p20427/635031018_E_faecium_S1_R2_001.fastq.gz --sample 635031018_E_faecium_S1_megablast --blastn never --blastp never --min_contiglength 200 --threads 55 --force
    damian_report.rb
    zip -r 635031018_E_faecium_S1_megablast.zip 635031018_E_faecium_S1_megablast/
    echo -e "xxxx" | mutt -a "./635031018_E_faecium_S1_megablast.zip" -s "New results from DAMIAN" -- "xxx@xxx.com"
    
    #END
  2. using vrap to analyse the cfDNA sequencing data

    conda activate vrap
    cd vrap_outputs
    ln -s ~/Tools/vrap .
    
    vrap/vrap.py -1 ../231114_VH00358_62_AACYCYWM5_cfDNA/p20427/635031018_E_faecium_S1_R1_001.fastq.gz -2 ../231114_VH00358_62_AACYCYWM5_cfDNA/p20427/635031018_E_faecium_S1_R2_001.fastq.gz -o E_faecium_S1_vrap_out --host /home/jhuang/REFs/genome.fa -n /mnt/h1/jhuang/blast/nt -a /mnt/h1/jhuang/blast/nr -t 40 -l 200
    vrap/vrap.py -1 ../231114_VH00358_62_AACYCYWM5_cfDNA/p20428/neg_control_S2_R1_001.fastq.gz -2 ../231114_VH00358_62_AACYCYWM5_cfDNA/p20428/neg_control_S2_R2_001.fastq.gz -o neg_control_S2_vrap_out --host /home/jhuang/REFs/genome.fa -n /mnt/h1/jhuang/blast/nt -a /mnt/h1/jhuang/blast/nr -t 40 -l 200
    vrap/vrap.py -1 ../231114_VH00358_62_AACYCYWM5_cfDNA/p20429/635724976_S_aureus_epidermidis_S3_R1_001.fastq.gz -2 ../231114_VH00358_62_AACYCYWM5_cfDNA/p20429/635724976_S_aureus_epidermidis_S3_R2_001.fastq.gz -o S_aureus_epidermidis_S3_vrap_out --host /home/jhuang/REFs/genome.fa -n /mnt/h1/jhuang/blast/nt -a /mnt/h1/jhuang/blast/nr -t 40 -l 200
    
    #txid10358 (https://www.ncbi.nlm.nih.gov/nuccore/?term=Cytomegalovirus)
    #txid10376 https://www.ncbi.nlm.nih.gov/nuccore/?term=Epstein-Barr-Virus
    sed -i -e 's/txid10239/txid10358/g' vrap/download_db.py
    sed -i -e 's/retmax=100000/retmax=10000000/g' vrap/download_db.py
    vrap/vrap.py -u
    vrap/vrap.py -1 ../231114_VH00358_62_AACYCYWM5_cfDNA/p20430/635290002_CMV_S4_R1_001.fastq.gz -2 ../231114_VH00358_62_AACYCYWM5_cfDNA/p20430/635290002_CMV_S4_R2_001.fastq.gz -o CMV_S4_vrap_out --host /home/jhuang/REFs/genome.fa -n /mnt/h1/jhuang/blast/nt -a /mnt/h1/jhuang/blast/nr -t 40 -l 200 #[--virus=Cytomegalovirus.fasta]
    mv vrap/database/viral_db vrap/database/viral_db_CMV
    sed -i -e 's/txid10358/txid10376/g' vrap/download_db.py
    vrap/vrap.py -1 ../231114_VH00358_62_AACYCYWM5_cfDNA/p20431/635850623_EBV_S5_R1_001.fastq.gz -2 ../231114_VH00358_62_AACYCYWM5_cfDNA/p20431/635850623_EBV_S5_R2_001.fastq.gz -o EBV_S5_vrap_out --host /home/jhuang/REFs/genome.fa -n /mnt/h1/jhuang/blast/nt -a /mnt/h1/jhuang/blast/nr -t 40 -l 200 #[--virus=Epstein-Barr-Virus.fasta]
    mv vrap/database/viral_db vrap/database/viral_db_EBV
    
    mv vrap/database/viral_db_orig vrap/database/viral_db
    vrap/vrap.py -1 ../231114_VH00358_62_AACYCYWM5_cfDNA/p20430/635290002_CMV_S4_R1_001.fastq.gz -2 ../231114_VH00358_62_AACYCYWM5_cfDNA/p20430/635290002_CMV_S4_R2_001.fastq.gz -o CMV_S4_vrap_out_host_CMV --host vrap/database/viral_db_CMV/nucleotide.fa -n /mnt/h1/jhuang/blast/nt -a /mnt/h1/jhuang/blast/nr -t 40 -l 200
    vrap/vrap.py -1 ../231114_VH00358_62_AACYCYWM5_cfDNA/p20431/635850623_EBV_S5_R1_001.fastq.gz -2 ../231114_VH00358_62_AACYCYWM5_cfDNA/p20431/635850623_EBV_S5_R2_001.fastq.gz -o EBV_S5_vrap_out_host_EBV --host vrap/database/viral_db_EBV/nucleotide.fa -n /mnt/h1/jhuang/blast/nt -a /mnt/h1/jhuang/blast/nr -t 40 -l 200
    #show samtools flagstat mapped and screen of mapped on IGV, the bam and fasta files to her.  
    #END

Which assemblies of S.epidermidis_1585_5179_HD05 have been submitted to NCBI?

TODO: after annotation check if the app-gene has been truncated in 5179-R1. The 5179 generates 180 kDa and 220 kDa (isoforms) while 5179-R1 generated only a 140 kDa protein.

# See the paper: Induction of Staphylococcus epidermidis biofilm formation via proteolytic processing of the accumulation-associated protein by staphylococcal and host proteases

#------------ 5179 ------------
# -- from unicycler (5179_bold) --
    total         16       4   2,524,342   2,469,173         2,469,173
        1          1       1   2,469,173   2,469,173         2,469,173     complete
        2          1       1      17,749      17,749            17,749     complete
        3          1       0       4,761       4,761             4,761   incomplete
        4          1       1       4,595       4,595             4,595     complete -->
        5          1       0       3,735       3,735             3,735   incomplete
        6          1       0       3,718       3,718             3,718   incomplete
        7          1       0       3,573       3,573             3,573   incomplete
        8          1       1       2,449       2,449             2,449     complete -->
        9          1       0       2,411       2,411             2,411   incomplete
       10          1       0       2,371       2,371             2,371   incomplete
       11          1       0       2,365       2,365             2,365   incomplete
       12          1       0       1,637       1,637             1,637   incomplete
       13          1       0       1,568       1,568             1,568   incomplete
       14          1       0       1,505       1,505             1,505   incomplete
       15          1       0       1,403       1,403             1,403   incomplete
       16          1       0       1,329       1,329             1,329   incomplete

      1   2,469,173    1.00x   UniRef90_Q5HJZ9       1,901,872   reverse     100.0%     100.0%
      2      17,749    2.27x   UniRef90_A0A0H2VIR3       4,771   forward      93.2%      99.7%
      4       4,595   10.19x   none found (Submitted!)
      8       2,449   17.14x   none found (Submitted!)

# -- from trycycler (Submitted after adding the contig 4 and 8 since longer!) --
cluster_001_consensus   2471445 23      70      71  (Submitted!)
cluster_002_consensus   17749   23      70      71  (Submitted!)

#------------ 5179-R1 ------------
# -- from unicycler (5179R1_normal) --
Component   Segments   Links   Length      N50         Longest segment   Status
    total          2       2   2,486,311   2,468,563         2,468,563
        1          1       1   2,468,563   2,468,563         2,468,563   complete
        2          1       1      17,748      17,748            17,748   complete

# -- from trycycler --
cluster_001_consensus   2470004 23      70      71 (Submitted!)
cluster_002_consensus   17748   23      70      71 (Submitted!)
luster_003 with only one contig
luster_004 with only one contig
luster_005 with only one contig

#makeblastdb -in 5179_trycycler_chr.fasta -dbtype nucl
#blastn -db 5179_trycycler_chr.fasta -query 2-16.fasta -out 2-16_vs_trycycler_1.blastn -evalue 0.00000000001 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1

#------------ 1585 ------------

# 1585_normal/unicycler/chrom_plasmids.fasta
# -- from unicycler (1585_normal) --
Segment   Length      Depth    Starting gene     Position    Strand    Identity   Coverage
      1   2,443,574    1.00x   UniRef90_Q5HJZ9   1,085,369   forward      99.3%     100.0%
      2       9,014    3.72x   none found
      7       2,344   11.44x   none found
      8       2,255    9.81x   none found
# after correction using polypolish and polca (see ~/DATA/Data_Holger_S.epidermidis_1585_5179_HD05/1585_normal/unicycler)
1       2443600 3       70      71 (Submitted!)
2       9014    2478515 70      71 (Submitted!)
7       2344    2487661 70      71 (Submitted!)
8       2255    2490042 70      71 (Submitted!)

# -- from trycycler --
cluster_001_consensus   2443176 23      70      71
cluster_002_consensus   9014    23      70      71
cluster_003_consensus   2255    23      70      71
cluster_004_consensus   2344    23      70      71

RNA-seq data analysis on S. epidermidis 1585

  1. prepare input reads and samplesheet

    ln -s ./F23A430001245_SEPiovzP/soapnuke/clean/EX_15_min_A/EX_15_min_A_1.fq.gz .
    ln -s ./F23A430001245_SEPiovzP/soapnuke/clean/EX_15_min_A/EX_15_min_A_2.fq.gz .
    ln -s ./F23A430001245_SEPiovzP/soapnuke/clean/EX_15_min_B/EX_15_min_B_1.fq.gz .
    ln -s ./F23A430001245_SEPiovzP/soapnuke/clean/EX_15_min_B/EX_15_min_B_2.fq.gz .
    ln -s ./F23A430001245_SEPiovzP/soapnuke/clean/EX_15_min_C/EX_15_min_C_1.fq.gz .
    ln -s ./F23A430001245_SEPiovzP/soapnuke/clean/EX_15_min_C/EX_15_min_C_2.fq.gz .
    
    ln -s ./F23A430001245_SEPiovzP/soapnuke/clean/EX_1h_A/EX_1h_A_1.fq.gz .
    ln -s ./F23A430001245_SEPiovzP/soapnuke/clean/EX_1h_A/EX_1h_A_2.fq.gz .
    ln -s ./F23A430001245_SEPiovzP/soapnuke/clean/EX_1h_B/EX_1h_B_1.fq.gz .
    ln -s ./F23A430001245_SEPiovzP/soapnuke/clean/EX_1h_B/EX_1h_B_2.fq.gz .
    ln -s ./F23A430001245_SEPiovzP/soapnuke/clean/EX_1h_C/EX_1h_C_1.fq.gz .
    ln -s ./F23A430001245_SEPiovzP/soapnuke/clean/EX_1h_C/EX_1h_C_2.fq.gz .
    
    ln -s ./F23A430001245_SEPiovzP/soapnuke/clean/EX_2h_A/EX_2h_A_1.fq.gz .
    ln -s ./F23A430001245_SEPiovzP/soapnuke/clean/EX_2h_A/EX_2h_A_2.fq.gz .
    ln -s ./F23A430001245_SEPiovzP/soapnuke/clean/EX_2h_B/EX_2h_B_1.fq.gz .
    ln -s ./F23A430001245_SEPiovzP/soapnuke/clean/EX_2h_B/EX_2h_B_2.fq.gz .
    ln -s ./F23A430001245_SEPiovzP/soapnuke/clean/EX_2h_C/EX_2h_C_1.fq.gz .
    ln -s ./F23A430001245_SEPiovzP/soapnuke/clean/EX_2h_C/EX_2h_C_2.fq.gz .
    
    ln -s ./F23A430001245_SEPiovzP/soapnuke/clean/EX_4h_A/EX_4h_A_1.fq.gz .
    ln -s ./F23A430001245_SEPiovzP/soapnuke/clean/EX_4h_A/EX_4h_A_2.fq.gz .
    ln -s ./F23A430001245_SEPiovzP/soapnuke/clean/EX_4h_B/EX_4h_B_1.fq.gz .
    ln -s ./F23A430001245_SEPiovzP/soapnuke/clean/EX_4h_B/EX_4h_B_2.fq.gz .
    ln -s ./F23A430001245_SEPiovzP/soapnuke/clean/EX_4h_C/EX_4h_C_1.fq.gz .
    ln -s ./F23A430001245_SEPiovzP/soapnuke/clean/EX_4h_C/EX_4h_C_2.fq.gz .
    
    ln -s ./F23A430001245_SEPiovzP/soapnuke/clean/EX_6h_A/EX_6h_A_1.fq.gz .
    ln -s ./F23A430001245_SEPiovzP/soapnuke/clean/EX_6h_A/EX_6h_A_2.fq.gz .
    ln -s ./F23A430001245-01_SEPzqcuP/soapnuke/clean/EX_6h_B/EX_6h_B_1.fq.gz .
    ln -s ./F23A430001245-01_SEPzqcuP/soapnuke/clean/EX_6h_B/EX_6h_B_2.fq.gz .
    ln -s ./F23A430001245-01_SEPzqcuP/soapnuke/clean/EX_6h_C/EX_6h_C_1.fq.gz .
    ln -s ./F23A430001245-01_SEPzqcuP/soapnuke/clean/EX_6h_C/EX_6h_C_2.fq.gz .
    
    ln -s ./F23A430001245_SEPiovzP/soapnuke/clean/EX_Day_4_A/EX_Day_4_A_1.fq.gz .
    ln -s ./F23A430001245_SEPiovzP/soapnuke/clean/EX_Day_4_A/EX_Day_4_A_2.fq.gz .
    ln -s ./F23A430001245_SEPiovzP/soapnuke/clean/EX_Day_4_B/EX_Day_4_B_1.fq.gz .
    ln -s ./F23A430001245_SEPiovzP/soapnuke/clean/EX_Day_4_B/EX_Day_4_B_2.fq.gz .
    ln -s ./F23A430001245_SEPiovzP/soapnuke/clean/EX_Day_4_C/EX_Day_4_C_1.fq.gz .
    ln -s ./F23A430001245_SEPiovzP/soapnuke/clean/EX_Day_4_C/EX_Day_4_C_2.fq.gz .
    
    sample,fastq_1,fastq_2,strandedness
    EX_15_min_A,EX_15_min_A_1.fq.gz,EX_15_min_A_2.fq.gz,auto
    EX_15_min_B,EX_15_min_B_1.fq.gz,EX_15_min_B_2.fq.gz,auto
    EX_15_min_C,EX_15_min_C_1.fq.gz,EX_15_min_C_2.fq.gz,auto
  2. Notes and Debugs

    #-- NOTE1: STAR cannot work simoutenously twice at the same time! --
    
    #-- NOTE2 --
    The GTF file might be corrupted!
      Stop at line : NC_019234.1    RefSeq  exon    1600    1977    .   +   0   ID "cds-HXG45_RS00010"; Parent "gene-HXG45_RS00010"; Note "incomplete%3B partial in the middle of a contig%3B missing N-terminus and C-terminus"; Ontology_term "GO:0015074"; end_range "1977,."; gbkey "CDS"; go_process "DNA integration|0015074||IEA"; inference "COORDINATES: similar to AA sequence:RefSeq:WP_010972061.1"; locus_tag "HXG45_RS00010"; partial "true"; product "DDE-type integrase/transposase/recombinase"; pseudo "true"; start_range ".,1600"; transl_table "11"
      Error Message: Cannot find gene_id!
    
    #cut -d$'\t' -f2 RP62A.gtf | sort -u
    cmsearch
    GeneMarkS-2+
    Protein Homology --> ProteinHomology
    RefSeq
    tRNAscan-SE
    
    #cut -d$'\t' -f3 RP62A.gtf | sort -u
    gene
    pseudogene
    
    CDS
    rRNA
    tRNA
    
    direct_repeat
    exon
    ncRNA
    binding_site
    region
    riboswitch
    RNase_P_RNA
    
    sequence_feature
    SRP_RNA
    tmRNA
    
    #-- BUG1: mamba update picard --
    #2.18.27-SNAPSHOT --> bioconda::picard=3.0.0
    conda install -c conda-forge mamba
    mamba update picard
      - ca-certificates   2023.7.22  hbcca054_0     conda-forge                    
      + ca-certificates  2023.11.17  hbcca054_0     conda-forge/linux-64     Cached
      - nextflow            20.04.1  hecda079_1     bioconda                       
      + nextflow            23.10.0  hdfd78af_0     bioconda/noarch          Cached
      - openjdk              11.0.1  h516909a_1016  conda-forge                    
      + openjdk              17.0.3  h58dac75_5     conda-forge/linux-64     Cached
      - picard                3.0.0  hdfd78af_0     bioconda                       
      + picard                3.1.1  hdfd78af_0     bioconda/noarch          Cached
    mamba update rsem
    
    #-- BUG2: mamba update gffread --
    cd /mnt/h1/jhuang/DATA/Data_Marc_RNA-seq_Sepidermidis/work/b4/703a2595f234c7866e49d9a033f943
    gffread \
          RP62A.gff \
          --keep-exon-attrs -F -T \
          -o RP62A.gtf
      - gffread   0.9.12  0           bioconda                    
      + gffread   0.12.7  hdcf5f25_3  bioconda/linux-64     Cached
    
    #-- NOTE: If you use nf-core/rnaseq for your analysis please cite: --
    #* The pipeline
    #  https://doi.org/10.5281/zenodo.1400710
    #* The nf-core framework
    #  https://doi.org/10.1038/s41587-020-0439-x
    #* Software dependencies
    #  https://github.com/nf-core/rnaseq/blob/master/CITATIONS.md
    
    #-- BUG3: under rnaseq_2021 open R --
    #using R host rather than conda R
    mamba remove r-base
    #-I"/home/jhuang/miniconda3/envs/rnaseq_2021/lib/R/include"
    if (!requireNamespace("BiocManager", quietly = TRUE))
        install.packages("BiocManager")
    BiocManager::install("SummarizedExperiment")
    BiocManager::install("tximport")
    #Error in citation("tximeta") : there is no package called ‘tximeta’
    #  Execution halted
    
    #sudo apt-get update
    #sudo apt-get install libcurl4-openssl-dev
    
    #BiocManager::install(version = "3.16")
    BiocManager::install(c("AnnotationDbi", "httr", "BiocFileCache", "biomaRt", "GenomicFeatures", "ensembldb", "curl", "AnnotationHub"))
    BiocManager::install(c("biomaRt", "GenomicFeatures", "ensembldb", "AnnotationHub", "tximeta"))
    BiocManager::install(c("tximeta"))
    
    #conda update -n base -c defaults conda
    #conda update --all
    
    #!/usr/bin/env Rscript
    /mnt/h1/jhuang/DATA/Data_Marc_RNA-seq_Sepidermidis/rnaseq/bin/salmon_tximport.r \
          NULL \
          salmon \
          salmon.merged
  3. prepare gtf file

    #  --gtf_extra_attributes        [string]  By default, the pipeline uses the `gene_name` field to obtain additional gene identifiers from the input GTF file when 
                                              running Salmon. [default: gene_name] 
    #  --gtf_group_features          [string]  Define the attribute type used to group features in the GTF file when running Salmon. [default: gene_id]
    #  --featurecounts_group_type    [string]  The attribute type used to group feature types in the GTF file when generating the biotype plot with 
                                              featureCounts. [default: gene_biotype] 
    #  --featurecounts_feature_type  [string]  By default, the pipeline assigns reads based on the 'exon' attribute within the GTF file. [default: exon]
    #--gtf_extra_attributes gene 
    #--gtf_group_features Parent --featurecounts_group_type gene_biotype --featurecounts_feature_type CDS (outdated since [CHANGE1,2,3]) 
    
    #under hamm
    conda activate rnaseq
    ln -s /home/jhuang/Tools/nf-core-rnaseq-3.12.0/ rnaseq
    
    # ------------ gff_to_gtf.py, then modify *.gtf file as follows -------------
    #Upload the scripts gff_to_gtf.py and modify_gtf.py.
    python3 gff_to_gtf.py RP62A.gff3 RP62A.gtf
    
    #[CHANGE1] ID in gene and Parent in CDS --> gene_id; ID in exon --> transcript_id; Name in gene --> gene_name
    #[CHANGE2] "Protein Homology" --> "RefSeq"
    #[CHANGE3] CDS --> exon
    #[CHANGE4?] --gtf_extra_attributes ||||gene|||| refers to ID=gene-SERP_RS00220;Name=noc;gbkey=Gene;||||gene=noc||||; "gene=" exists in both gene and exon. choose gene as gtf_extra_attributes! (see --gtf_extra_attributes gene in the next command)
    
    [1] python3 gff_to_gtf.py RP62A.gff3 RP62A.gtf
    [2] python3 modify_gtf.py RP62A.gtf RP62A_m.gtf
    [3] sed -i -e 's/gene_id "rna-/gene_id "gene-/g' RP62A_m.gtf
    #NOTE that "ProteinHomology exon" are from "ProteinHomology CDS" 
    [4] MANUALLY checking why two are missing, because the gene has two CDS/exons.
    #(rnaseq) jhuang@hamm:/mnt/h1/jhuang/DATA/Data_Marc_RNA-seq_Sepidermidis/results/genome$ grep ">cds" results/genome/genome.transcripts.fa  | sed 's/>//g' - | sort > CDS_2.txt
    #2415
    #(rnaseq) jhuang@hamm:/mnt/h1/jhuang/DATA/Data_Marc_RNA-seq_Sepidermidis/results/genome$ grep ">rna" genome.transcripts.fa  | wc -l
    #84
    #cut -d$'\t' -f9 CDS_1.txt | cut -d';' -f1 | sed 's/ID=//g' - | sort > CDS_1_.txt
    #2417
    #2268d2267
    #< cds-WP_010959142.1
    #2401d2399
    #< cds-WP_161384733.1
    
    NC_002976.3     RefSeq  gene    423814  424930  .       +       .       gene_id "gene-SERP_RS02220"; gene_name "prfB"; gbkey "Gene"; gene "prfB"; gene_biotype "protein_coding"; locus_tag "SERP_RS02220"; old_locus_tag "SE0421%2CSERP0421"
    NC_002976.3     ProteinHomology exon    423814  423885  .       +       0       transcript_id "cds-WP_010959142.1"; gene_id "gene-SERP_RS02220"; Dbxref "GenBank:WP_010959142.1"; Name "WP_010959142.1"; Note "programmed frameshift"; Ontology_term "GO:0006415,GO:0003747"; exception "ribosomal slippage"; gbkey "CDS"; gene "prfB"; go_function "translation release factor activity|0003747||IEA"; go_process "translational termination|0006415||IEA"; inference "COORDINATES: similar to AA sequence:RefSeq:WP_010959142.1"; locus_tag "SERP_RS02220"; product "peptide chain release factor 2"; protein_id "WP_010959142.1"; transl_table "11"
    NC_002976.3     ProteinHomology exon    423887  424930  .       +       0       transcript_id "cds-WP_010959142.1"; gene_id "gene-SERP_RS02220"; Dbxref "GenBank:WP_010959142.1"; Name "WP_010959142.1"; Note "programmed frameshift"; Ontology_term "GO:0006415,GO:0003747"; exception "ribosomal slippage"; gbkey "CDS"; gene "prfB"; go_function "translation release factor activity|0003747||IEA"; go_process "translational termination|0006415||IEA"; inference "COORDINATES: similar to AA sequence:RefSeq:WP_010959142.1"; locus_tag "SERP_RS02220"; product "peptide chain release factor 2"; protein_id "WP_010959142.1"; transl_table "11"
    NC_002976.3     RefSeq  gene    2256196 2256869 .       +       .       gene_id "gene-SERP_RS10980"; gene_name "SERP_RS10980"; gbkey "Gene"; gene_biotype "protein_coding"; locus_tag "SERP_RS10980"; old_locus_tag "SERP2219"
    NC_002976.3     ProteinHomology exon    2256196 2256385 .       +       0       transcript_id "cds-WP_161384733.1"; gene_id "gene-SERP_RS10980"; Dbxref "GenBank:WP_161384733.1"; Name "WP_161384733.1"; Note "programmed frameshift"; Ontology_term "GO:0004803"; exception "ribosomal slippage"; gbkey "CDS"; go_function "transposase activity|0004803||IEA"; inference "COORDINATES: similar to AA sequence:RefSeq:WP_064305258.1"; locus_tag "SERP_RS10980"; product "IS6-like element IS257 family transposase"; protein_id "WP_161384733.1"; transl_table "11"
    NC_002976.3     ProteinHomology exon    2256385 2256869 .       +       2       transcript_id "cds-WP_161384733.1"; gene_id "gene-SERP_RS10980"; Dbxref "GenBank:WP_161384733.1"; Name "WP_161384733.1"; Note "programmed frameshift"; Ontology_term "GO:0004803"; exception "ribosomal slippage"; gbkey "CDS"; go_function "transposase activity|0004803||IEA"; inference "COORDINATES: similar to AA sequence:RefSeq:WP_064305258.1"; locus_tag "SERP_RS10980"; product "IS6-like element IS257 family transposase"; protein_id "WP_161384733.1"; transl_table "11"
    
    --> "cds-WP_010959142.1" and "cds-WP_161384733.1" occur twice! 
    
    NC_002976.3     RefSeq  gene    423814  424930  .       +       .       gene_id "gene-SERP_RS02220"; gene_name "prfB"; gbkey "Gene"; gene "prfB"; gene_biotype "protein_coding"; locus_tag "SERP_RS02220"; old_locus_tag "SE0421%2CSERP0421"
    NC_002976.3     ProteinHomology exon    423814  424930  .       +       0       transcript_id "cds-WP_010959142.1"; gene_id "gene-SERP_RS02220"; Dbxref "GenBank:WP_010959142.1"; Name "WP_010959142.1"; Note "programmed frameshift"; Ontology_term "GO:0006415,GO:0003747"; exception "ribosomal slippage"; gbkey "CDS"; gene "prfB"; go_function "translation release factor activity|0003747||IEA"; go_process "translational termination|0006415||IEA"; inference "COORDINATES: similar to AA sequence:RefSeq:WP_010959142.1"; locus_tag "SERP_RS02220"; product "peptide chain release factor 2"; protein_id "WP_010959142.1"; transl_table "11"
    NC_002976.3     RefSeq  gene    2256196 2256869 .       +       .       gene_id "gene-SERP_RS10980"; gene_name "SERP_RS10980"; gbkey "Gene"; gene_biotype "protein_coding"; locus_tag "SERP_RS10980"; old_locus_tag "SERP2219"
    NC_002976.3     ProteinHomology exon    2256196 2256869 .       +       0       transcript_id "cds-WP_161384733.1"; gene_id "gene-SERP_RS10980"; Dbxref "GenBank:WP_161384733.1"; Name "WP_161384733.1"; Note "programmed frameshift"; Ontology_term "GO:0004803"; exception "ribosomal slippage"; gbkey "CDS"; go_function "transposase activity|0004803||IEA"; inference "COORDINATES: similar to AA sequence:RefSeq:WP_064305258.1"; locus_tag "SERP_RS10980"; product "IS6-like element IS257 family transposase"; protein_id "WP_161384733.1"; transl_table "11"
    
    --> "cds-WP_010959142.1" and "cds-WP_161384733.1" occur once! 
    
    ln -s /home/jhuang/Tools/nf-core-rnaseq-3.12.0/ rnaseq
    
    (rnaseq) nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results --fasta RP62A.fasta --gtf RP62A_m.gtf -profile test_full -resume --max_memory 100.GB --max_time 2400.h --save_reference --aligner star_salmon             --skip_rseqc --skip_dupradar --skip_preseq --skip_biotype_qc --skip_deseq2_qc --min_mapped_reads 0  --gtf_extra_attributes gene
    #grep ">" genome.transcripts.fa  | wc -l --> 2499  (2,529= CDSs (2,445) + RNA(84))
    
    #-- Ben project: human + bacterium + plasmid --
    .gff3
    python3 gff_to_gtf.py CP009367.gff3 CP009367.gtf
    python3 modify_gtf.py CP009367.gtf CP009367_m.gtf
    sed -i -e 's/gene_id "rna-/gene_id "gene-/g' CP009367_m.gtf
    
    python3 gff_to_gtf.py NC_019234.gff3 NC_019234.gtf
    python3 modify_gtf.py NC_019234.gtf NC_019234_m.gtf
    sed -i -e 's/gene_id "rna-/gene_id "gene-/g' NC_019234_m.gtf
    
    python3 genbank_to_gtf.py 1585.gb 1585.gtf
    python3 modify_gtf.py 1585.gtf 1585_m.gtf
    #python3 correct_gtf.py
    python3 transform_gtf.py 1585_m.gtf 1585_m_.gtf
    sed -i -e 's/transcript_id \"/transcript_id \"tx-/g' 1585_m_.gtf
    sed -i -e 's/exon_id \"/exon_id \"exon-/g' 1585_m_.gtf
    #BUG: why 7052 >= expected 7014
    
    #-- MANUALLY: gene positions out of boundary --
    #Transcript cds-WP_011100764.1 is out of chromosome NC_019234.1's boundary!
    NC_019234.1     RefSeq  gene    66549   67571   .       +       .       gene_id "gene-HXG45_RS00005"; gene_name "repA"; gbkey "Gene"; gene "repA"; gene_biotype "protein_coding"; locus_tag "HXG45_RS00005"; old_locus_tag "D743_p54"
    NC_019234.1     RefSeq  exon    66549   67571   .       +       0       transcript_id "cds-WP_011100764.1"; gene_id "gene-HXG45_RS00005"; Dbxref "GenBank:WP_011100764.1"; Name "WP_011100764.1"; gbkey "CDS"; gene "repA"; inference "COORDINATES: similar to AA sequence:RefSeq:WP_011100764.1"; locus_tag "HXG45_RS00005"; product "plasmid replication initiator RepA"; protein_id "WP_011100764.1"; transl_table "11"
    #-->
    NC_019234.1 RefSeq  gene    66549   66845   .   +   .   gene_id "gene-HXG45_RS00005_1"; gene_name "repA"; gbkey "Gene"; gene "repA"; gene_biotype "protein_coding"; locus_tag "HXG45_RS00005"; old_locus_tag "D743_p54"
    NC_019234.1 RefSeq  exon    66549   66845   .   +   0   transcript_id "cds-WP_011100764.1_1"; gene_id "gene-HXG45_RS00005_1"; Dbxref "GenBank:WP_011100764.1"; Name "WP_011100764.1"; gbkey "CDS"; gene "repA"; inference "COORDINATES: similar to AA sequence:RefSeq:WP_011100764.1"; locus_tag "HXG45_RS00005"; product "plasmid replication initiator RepA"; protein_id "WP_011100764.1"; transl_table "11"
    NC_019234.1 RefSeq  gene    1   726 .   +   .   gene_id "gene-HXG45_RS00005_2"; gene_name "repA"; gbkey "Gene"; gene "repA"; gene_biotype "protein_coding"; locus_tag "HXG45_RS00005"; old_locus_tag "D743_p54"
    NC_019234.1 RefSeq  exon    1   726 .   +   0   transcript_id "cds-WP_011100764.1_2"; gene_id "gene-HXG45_RS00005_2"; Dbxref "GenBank:WP_011100764.1"; Name "WP_011100764.1"; gbkey "CDS"; gene "repA"; inference "COORDINATES: similar to AA sequence:RefSeq:WP_011100764.1"; locus_tag "HXG45_RS00005"; product "plasmid replication initiator RepA"; protein_id "WP_011100764.1"; transl_table "11"
    
    (rnaseq) nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results_NC_019234 --fasta NC_019234.fasta --gtf NC_019234_m.gtf -profile test_full -resume --max_memory 100.GB --max_time 2400.h --save_reference --aligner star_salmon             --skip_rseqc --skip_dupradar --skip_preseq --skip_biotype_qc --skip_deseq2_qc --min_mapped_reads 0  --gtf_extra_attributes gene
    nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results_CP009367 --fasta CP009367.fasta --gtf CP009367_m.gtf -profile test_full -resume --max_memory 100.GB --max_time 2400.h --save_reference --aligner star_salmon             --skip_rseqc --skip_dupradar --skip_preseq --skip_biotype_qc --skip_deseq2_qc --min_mapped_reads 0  --gtf_extra_attributes gene
    nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results_GRCh38 --genome GRCh38 -profile test_full -resume --max_memory 100.GB --max_time 2400.h --save_align_intermeds --save_unaligned --aligner star_salmon --pseudo_aligner salmon
    nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results_1585 --fasta 1585.fasta --gtf 1585_m_.gtf -profile test_full -resume --max_memory 100.GB --max_time 2400.h --save_reference --aligner star_salmon --skip_rseqc --skip_dupradar --skip_preseq --skip_biotype_qc --skip_deseq2_qc --min_mapped_reads 0 
    
    #TODO: look if gene-HXG45_RS00010 exist!
    
    # -- check the intermediate results --
    #vim salmon.merged.gene_counts.tsv
    gene_id gene_name       EX_15_min_A
    gene-SERP_RS00195       rpmH    298
    gene-SERP_RS00200       rnpA    90
    gene-SERP_RS00205       mnmE    1733
    
    #vim salmon.merged.transcript_counts.tsv
    tx      gene_id EX_15_min_A
    cds-WP_000240855.1      gene-SERP_RS00195       298
    cds-WP_001832522.1      gene-SERP_RS00200       90
    cds-WP_001831768.1      gene-SERP_RS00205       1733
  4. gff_to_gtf.py (code)

    import sys
    
    def gff3_to_gtf(gff3_file, gtf_file):
        with open(gff3_file, 'r') as gff3, open(gtf_file, 'w') as gtf:
            for line in gff3:
                if line.startswith('#'):
                    continue  # Skip header lines
    
                parts = line.strip().split('\t')
                if len(parts) != 9:
                    continue  # Skip invalid lines
    
                # Extract fields from GFF3
                seqname, source, feature, start, end, score, strand, frame, attributes = parts
    
                # Convert attributes to GTF format
                attr_pairs = [f.strip() for f in attributes.split(';') if f.strip() != '']
                attr_str = '; '.join([f'{key} "{value}"' for part in attr_pairs for key, value in [part.split('=')]])
    
                # Write to GTF file
                gtf_line = f'{seqname}\t{source}\t{feature}\t{start}\t{end}\t{score}\t{strand}\t{frame}\t{attr_str};\n'
                gtf.write(gtf_line)
    
    if __name__ == "__main__":
        if len(sys.argv) != 3:
            print("Usage: python script.py <input.gff3> <output.gtf>")
            sys.exit(1)
    
        gff3_file = sys.argv[1]
        gtf_file = sys.argv[2]
        gff3_to_gtf(gff3_file, gtf_file)
  5. genbank_to_gtf.py (code)

    import argparse
    from Bio import SeqIO
    
    def genbank_to_gtf(genbank_file, gtf_file):
        with open(genbank_file, "r") as input_handle, open(gtf_file, "w") as output_handle:
            for record in SeqIO.parse(input_handle, "genbank"):
                for feature in record.features:
                    if feature.type == "gene" or feature.type == "CDS":
                        start = feature.location.start.position + 1
                        end = feature.location.end.position
                        strand = '+' if feature.location.strand == 1 else '-'
                        attributes = []
    
                        gene_id = None
                        if "gene_id" in feature.qualifiers:
                            gene_id = feature.qualifiers["gene_id"][0]
                        elif "gene" in feature.qualifiers:
                            gene_id = feature.qualifiers["gene"][0]
                        elif "locus_tag" in feature.qualifiers:
                            gene_id = feature.qualifiers["locus_tag"][0]
    
                        if gene_id is not None:
                            attributes.append(f'gene_id "{gene_id}"')
    
                        transcript_id = None
                        if "locus_tag" in feature.qualifiers:
                            transcript_id = feature.qualifiers["locus_tag"][0]
                            attributes.append(f'transcript_id "{transcript_id}"')
    
                        attribute_str = "; ".join(attributes) + ";"
                        gtf_line = f"{record.id}\t.\t{feature.type}\t{start}\t{end}\t.\t{strand}\t.\t{attribute_str}\n"
                        output_handle.write(gtf_line)
    
    def main():
        parser = argparse.ArgumentParser(description="Convert GenBank file to GTF format.")
        parser.add_argument("genbank_file", help="Input GenBank file")
        parser.add_argument("gtf_file", help="Output GTF file")
        args = parser.parse_args()
    
        genbank_to_gtf(args.genbank_file, args.gtf_file)
    
    if __name__ == "__main__":
        main()
  6. modify_gtf.py (code)

    import sys
    
    def modify_gtf(input_file, output_file):
        with open(input_file, 'r') as infile, open(output_file, 'w') as outfile:
            for line in infile:
                if line.startswith('#'):
                    outfile.write(line)  # Copy comment lines
                    continue
    
                parts = line.strip().split('\t')
                if len(parts) != 9:
                    continue  # Skip invalid lines
    
                record_type = parts[2]
                attributes = parts[8]
    
                # Change "Protein Homology" to "RefSeq"
                if parts[1] == "Protein Homology":
                    parts[1] = "ProteinHomology"
    
                new_attributes = []
                for attr in attributes.split(';'):
                    if attr.strip():
                        key_value = attr.strip().split(' ', 1)  # Split only on the first space
                        if len(key_value) == 2:
                            key, value = key_value
                            if (record_type == 'gene' or record_type == 'pseudogene') and key == 'ID':
                                key = 'gene_id'
                            elif (record_type == 'gene' or record_type == 'pseudogene') and key == 'Name':
                                key = 'gene_name'
                            elif (record_type == 'CDS' or record_type == 'tRNA' or record_type == 'rRNA' or record_type == 'SRP_RNA') and key == 'ID':
                                key = 'transcript_id'
                            elif (record_type == 'CDS' or record_type == 'tRNA' or record_type == 'rRNA' or record_type == 'SRP_RNA') and key == 'Parent':
                                key = 'gene_id'
                            elif (record_type == 'exon') and key == 'ID':
                                key = 'exon_id'
                            elif (record_type == 'exon') and key == 'Parent':
                                key = 'transcript_id'
                                new_attributes.append(f'gene_id {value}')
                            new_attributes.append(f'{key} {value}')
                        else:
                            new_attributes.append(attr.strip())
    
                # Change record type from CDS to exon
                if record_type == "CDS":
                    parts[2] = "exon"
    
                parts[8] = '; '.join(new_attributes)
                outfile.write('\t'.join(parts) + '\n')
    
    if __name__ == "__main__":
        if len(sys.argv) != 3:
            print("Usage: python modify_gtf.py 
    “) sys.exit(1) input_file = sys.argv[1] output_file = sys.argv[2] modify_gtf(input_file, output_file)
  7. correct_gtf.py (code)

    input_file = '1585_m.gtf'  # Replace with your input file path
    output_file = '1585_m_.gtf'  # Replace with your desired output file path
    
    with open(input_file, 'r') as infile, open(output_file, 'w') as outfile:
        for line in infile:
            if line.strip() and not line.startswith('#'):  # Skip empty or comment lines
                parts = line.strip().split('\t')
                if parts[2] == 'gene':  # Check if the feature type is 'gene'
                    attributes = parts[8].split(';')
                    attributes = [attr.strip() for attr in attributes if 'transcript_id' not in attr]  # Remove transcript_id for gene
                    parts[8] = '; '.join(attributes) + ';' if attributes else '.'
                    line = '\t'.join(parts) + '\n'
            outfile.write(line)
    
    print(f"Processed file saved as {output_file}")
  8. transform_gtf.py (code)

    import sys
    
    def transform_data(input_data):
        output_data = []
    
        for line in input_data:
            fields = line.split()
            entry_type = fields[2]
            attributes = ' '.join(fields[8:])
            attributes_dict = dict(item.split(' ', 1) for item in attributes.split('; '))
    
            gene_id = attributes_dict.get('gene_id').replace('"', '')
            transcript_id = attributes_dict.get('transcript_id').replace('"', '')
    
            if not gene_id.startswith("SF028"):
                gene_name = f'gene_name "{gene_id}"'
                gene_id = f'gene_id "{transcript_id}"'
            else:
                gene_name = ''
    
            if entry_type == "gene":
                gene_line = f"{fields[0]}\t{fields[1]}\tgene\t{fields[3]}\t{fields[4]}\t{fields[5]}\t{fields[6]}\t{fields[7]}\t{gene_id}; {gene_name}".strip()
                output_data.append(gene_line)
                transcript_line = f"{fields[0]}\t{fields[1]}\ttranscript\t{fields[3]}\t{fields[4]}\t{fields[5]}\t{fields[6]}\t{fields[7]}\t{gene_id}; transcript_id \"{transcript_id}\""
                output_data.append(transcript_line)
            elif entry_type == "exon":
                exon_line = f"{fields[0]}\t{fields[1]}\texon\t{fields[3]}\t{fields[4]}\t{fields[5]}\t{fields[6]}\t{fields[7]}\t{gene_id}; transcript_id \"{transcript_id}\"; exon_id \"{transcript_id}\""
                output_data.append(exon_line)
    
        return output_data
    
    def main(input_file, output_file):
        with open(input_file, 'r') as file:
            input_data = file.readlines()
    
        transformed_data = transform_data(input_data)
    
        with open(output_file, 'w') as file:
            for line in transformed_data:
                file.write(line + '\n')
    
    if __name__ == "__main__":
        if len(sys.argv) != 3:
            print("Usage: python script.py 
    “) sys.exit(1) input_file = sys.argv[1] output_file = sys.argv[2] main(input_file, output_file)

RNA-seq recalculate p-values on sT- and LT-transcripts

  1. goal

    #p604 and p601 to sT transcript
    #p602 and p600 to LT transcript
    #p605 and p600 to LTtr transcript (or LT transcript is also fine)
    
    p602+604 compared to sT transcript (p602and604_d3 vs. p601_d3?)
    p605+604 compared to sT transcript (p604and605_d9/d12 vs. p600and601_d9/d12?)
    p602+604 compared to LT transcript (p602and604_d3 vs. p601_d3?)
    p605+604 compared to LT transcript (p604and605_d9/d12 vs. p600and601_d9/d12?)
  2. under ~/DATA/Data_Denise_sT_PUBLISHING/Data_Denise_RNASeq_Merged_Corrected_28+2

    library("AnnotationDbi")
    library("clusterProfiler")
    library("ReactomePA")
    library(gplots)
    library(tximport)
    library(DESeq2)
    library("org.Hs.eg.db")
    
    #d.raw<- read.delim2("gene_counts_hg38+virus_on_LT_transcripts_28samples.csv",sep=",", header=TRUE, row.names=1)
    d.raw<- read.delim2("gene_counts_hg38+virus_on_sT_transcripts_28samples.csv",sep=",", header=TRUE, row.names=1)
    
    # [1] V_8_0_mock_DonorI             V_8_0_mock_DonorII           
    # [3] V_8_1_5_p601_d3_DonorII       V_8_1_5_p604_d3_DonorII      
    # [5] V_8_1_5_p601_d8_DonorII       V_8_1_5_p604_d8_DonorII      
    # [7] V_8_1_6_p601_d3_DonorI        V_8_1_6_p604_d3_DonorI       
    # [9] V_8_1_6_p601_d8_DonorI        V_8_1_6_p604_d8_DonorI       
    #[11] V_8_2_3_p600_d3_DonorII       V_8_2_3_p605_d3_DonorII      
    #[13] V_8_2_3_p600_d8_DonorII       V_8_2_3_p605_d8_DonorII      
    #[15] V_8_2_4_p600_d3_DonorI        V_8_2_4_p605_d3_DonorI       
    #[17] V_8_2_4_p600_d8_DonorI        V_8_2_4_p605_d8_DonorI       
    #[19] V_8_4_1_p602_d8_DonorII       V_8_4_1_p602_d8_DonorI       
    #[21] V_8_3_1_p600and601_d12_DonorI V_8_3_1_p604and605_d12_DonorI
    #[23] V_8_3_2_p600and601_d9_DonorII V_8_3_2_p604and605_d9_DonorII
    #[25] V_8_4_2_p602_d3_DonorI        V_8_4_2_p602_d3_DonorII      
    #[27] V_8_5_1_p602and604_d3_DonorI  V_8_5_2_p602and604_d3_DonorII
    
    col_order <- c("V_8_0_mock_DonorI","V_8_0_mock_DonorII","V_8_1_5_p601_d3_DonorII", "V_8_1_5_p604_d3_DonorII", "V_8_1_5_p601_d8_DonorII","V_8_1_5_p604_d8_DonorII",   "V_8_1_6_p601_d3_DonorI","V_8_1_6_p604_d3_DonorI","V_8_1_6_p601_d8_DonorI","V_8_1_6_p604_d8_DonorI",  "V_8_2_3_p600_d3_DonorII","V_8_2_3_p605_d3_DonorII","V_8_2_3_p600_d8_DonorII", "V_8_2_3_p605_d8_DonorII",  "V_8_2_4_p600_d3_DonorI","V_8_2_4_p605_d3_DonorI","V_8_2_4_p600_d8_DonorI","V_8_2_4_p605_d8_DonorI",  "V_8_4_1_p602_d8_DonorII","V_8_4_1_p602_d8_DonorI",  "V_8_3_1_p600and601_d12_DonorI", "V_8_3_1_p604and605_d12_DonorI","V_8_3_2_p600and601_d9_DonorII","V_8_3_2_p604and605_d9_DonorII",    "V_8_4_2_p602_d3_DonorI","V_8_4_2_p602_d3_DonorII",    "V_8_5_1_p602and604_d3_DonorI", "V_8_5_2_p602and604_d3_DonorII")
    identical(colnames(d.raw),col_order)
    
    #reordered.raw <- d.raw[,col_order]
    
    replicates = as.factor(c("untreated","untreated", "p601_d3","p604_d3", "p601_d8","p604_d8",   "p601_d3","p604_d3","p601_d8","p604_d8",  "p600_d3","p605_d3","p600_d8", "p605_d8",  "p600_d3","p605_d3","p600_d8","p605_d8",  "p602_d8","p602_d8",  "p600and601_d9d12", "p604and605_d9d12","p600and601_d9d12","p604and605_d9d12",     "p602_d3","p602_d3",    "p602and604_d3","p602and604_d3"))
    
    batch = as.factor(c("200420", "200420", "190927", "190927",    "190927", "190927", "190228", "190228",    "190228", "190228", "191008", "191008",    "191008", "191008", "190228", "190228",     "190228", "190228", "200817", "200817",       "200420", "200420", "200817", "200817",      "210302", "210302",  "210504","210504"))
    
    ids = as.factor(c("untreated_DonorI","untreated_DonorII", "p601_d3_DonorII","p604_d3_DonorII", "p601_d8_DonorII","p604_d8_DonorII",   "p601_d3_DonorI","p604_d3_DonorI","p601_d8_DonorI","p604_d8_DonorI",  "p600_d3_DonorII","p605_d3_DonorII","p600_d8_DonorII", "p605_d8_DonorII",  "p600_d3_DonorI","p605_d3_DonorI","p600_d8_DonorI","p605_d8_DonorI",  "p602_d8_DonorII","p602_d8_DonorI",  "p600and601_d12_DonorI", "p604and605_d12_DonorI","p600and601_d9_DonorII","p604and605_d9_DonorII",        "p602_d3_DonorI","p602_d3_DonorII",      "p602and604_d3_DonorI","p602and604_d3_DonorII"))
    
    donor = as.factor(c("DonorI","DonorII",  "DonorII","DonorII", "DonorII","DonorII",   "DonorI","DonorI","DonorI","DonorI",  "DonorII","DonorII","DonorII","DonorII",  "DonorI","DonorI","DonorI","DonorI",  "DonorII","DonorI",   "DonorI", "DonorI","DonorII","DonorII",        "DonorI","DonorII",    "DonorI","DonorII"))
    
    cData = data.frame(row.names=colnames(d.raw), replicates=replicates, donor=donor, batch=batch, ids=ids)
    #dds<-DESeqDataSetFromMatrix(countData=d.raw, colData=cData, design=~batch+replicates) # ERROR due to the factor 'batch'
    dds<-DESeqDataSetFromMatrix(countData=d.raw, colData=cData, design=~donor+replicates)
    
    sizeFactors(dds)
    >NULL
    dds <- estimateSizeFactors(dds)
    sizeFactors(dds)
    #            V_8_0_mock_DonorI            V_8_0_mock_DonorII 
    #                    0.6152727                     0.8401306 
    #      V_8_1_5_p601_d3_DonorII       V_8_1_5_p604_d3_DonorII 
    #                    1.0196629                     1.0293038 
    #      V_8_1_5_p601_d8_DonorII       V_8_1_5_p604_d8_DonorII 
    #                    0.8901261                     0.9062621 
    #       V_8_1_6_p601_d3_DonorI        V_8_1_6_p604_d3_DonorI 
    #                    1.3752853                     1.3504344 
    #       V_8_1_6_p601_d8_DonorI        V_8_1_6_p604_d8_DonorI 
    #                    1.2316027                     1.3085877 
    #      V_8_2_3_p600_d3_DonorII       V_8_2_3_p605_d3_DonorII 
    #                    0.9323440                     0.9381003 
    #      V_8_2_3_p600_d8_DonorII       V_8_2_3_p605_d8_DonorII 
    #                    0.7179836                     0.9400712 
    #       V_8_2_4_p600_d3_DonorI        V_8_2_4_p605_d3_DonorI 
    #                    1.3953243                     1.4836134 
    #       V_8_2_4_p600_d8_DonorI        V_8_2_4_p605_d8_DonorI 
    #                    1.2135489                     1.8270271 
    #      V_8_4_1_p602_d8_DonorII        V_8_4_1_p602_d8_DonorI 
    #                    1.5281322                     1.1579240 
    #V_8_3_1_p600and601_d12_DonorI V_8_3_1_p604and605_d12_DonorI 
    #                    0.9895927                     0.8560524 
    #V_8_3_2_p600and601_d9_DonorII V_8_3_2_p604and605_d9_DonorII 
    #                    1.3462308                     1.1142589 
    #       V_8_4_2_p602_d3_DonorI       V_8_4_2_p602_d3_DonorII 
    #                    0.9675338                     1.1800615 
    # V_8_5_1_p602and604_d3_DonorI V_8_5_2_p602and604_d3_DonorII 
    #                    0.3940273                     0.3490526 
    
    rld <- rlogTransformation(dds)
    
    # -- before pca --
    png("pca.png", 1200, 800)
    plotPCA(rld, intgroup=c("replicates"))
    #plotPCA(rld, intgroup = c("replicates", "batch"))
    #plotPCA(rld, intgroup = c("replicates", "ids"))
    #plotPCA(rld, "batch")
    dev.off()
    
    # -- before heatmap --
    ## generate the pairwise comparison between samples
    library(gplots) 
    library("RColorBrewer")
    png("heatmap.png", 1200, 800)
    distsRL <- dist(t(assay(rld)))
    mat <- as.matrix(distsRL)
    #paste( rld$dex, rld$cell, sep="-" )
    #rownames(mat) <- colnames(mat) <- with(colData(dds),paste(replicates,batch, sep=":"))
    rownames(mat) <- colnames(mat) <- with(colData(dds),paste(replicates))
    hc <- hclust(distsRL)
    hmcol <- colorRampPalette(brewer.pal(9,"GnBu"))(100)
    heatmap.2(mat, Rowv=as.dendrogram(hc),symm=TRUE, trace="none",col = rev(hmcol), margin=c(13, 13))
    dev.off()
    
    #---- 3 ----
    dds$replicates <- relevel(dds$replicates, "p601_d3")
    dds = DESeq(dds, betaPrior=FALSE)
    resultsNames(dds)
    clist <- c("p602and604_d3_vs_p601_d3")
    
    dds$replicates <- relevel(dds$replicates, "p600and601_d9d12")
    dds = DESeq(dds, betaPrior=FALSE)
    resultsNames(dds)
    clist <- c("p604and605_d9d12_vs_p600and601_d9d12")
    
    for (i in clist) {
      contrast = paste("replicates", i, sep="_")
      res = results(dds, name=contrast)
      res <- res[!is.na(res$log2FoldChange),]
      geness <- AnnotationDbi::select(org.Hs.eg.db, keys = rownames(res), keytype = "SYMBOL", columns = c("ENSEMBL", "ENTREZID", "SYMBOL", "GENENAME"))
      geness <- geness[!duplicated(geness$SYMBOL), ]
      res$SYMBOL = rownames(res)
      rownames(geness) <- geness$SYMBOL
      identical(rownames(res), rownames(geness))
      res_df <- as.data.frame(res)
      geness_res <- merge(geness, res_df)
      dim(geness_res)
      write.csv(as.data.frame(geness_res[order(geness_res$pvalue),]), file = paste(i, "all.txt", sep="-"))
      #write.csv(res_df, file = paste(i, "background.txt", sep="_"))
      up <- subset(geness_res, padj<=0.05 & log2FoldChange>=1)
      down <- subset(geness_res, padj<=0.05 & log2FoldChange<=-1)
      write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(i, "up.txt", sep="-"))
      write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(i, "down.txt", sep="-"))
    }
    
    ~/Tools/csv2xls-0.4/csv_to_xls.py \
    p602and604_d3_vs_p601_d3-all.txt \
    p602and604_d3_vs_p601_d3-up.txt \
    p602and604_d3_vs_p601_d3-down.txt \
    -d$',' -o p602and604_d3_vs_p601_d3_on_LT.xls;
    
    ~/Tools/csv2xls-0.4/csv_to_xls.py \
    p604and605_d9d12_vs_p600and601_d9d12-all.txt \
    p604and605_d9d12_vs_p600and601_d9d12-up.txt \
    p604and605_d9d12_vs_p600and601_d9d12-down.txt \
    -d$',' -o p604and605_d9d12_vs_p600and601_d9d12_on_LT.xls;
    
    ~/Tools/csv2xls-0.4/csv_to_xls.py \
    p602and604_d3_vs_p601_d3-all.txt \
    p602and604_d3_vs_p601_d3-up.txt \
    p602and604_d3_vs_p601_d3-down.txt \
    -d$',' -o p602and604_d3_vs_p601_d3_on_sT.xls;
    
    ~/Tools/csv2xls-0.4/csv_to_xls.py \
    p604and605_d9d12_vs_p600and601_d9d12-all.txt \
    p604and605_d9d12_vs_p600and601_d9d12-up.txt \
    p604and605_d9d12_vs_p600and601_d9d12-down.txt \
    -d$',' -o p604and605_d9d12_vs_p600and601_d9d12_on_sT.xls;
  3. goal

    p783 compared to LT transcript (p783_d8 vs. p600_d8?)
  4. Under ~/DATA/Data_Denise_LT_K331A_RNASeq/Data_Denise_RNASeq_Merged_Corrected_8

    d.raw<- read.delim2("gene_name_gene_counts_hg38+virus.csv",sep=",", header=TRUE, row.names=1)
    library(dplyr)
    # Remove duplicate rows based on gene_name
    d.raw.unique <- d.raw %>% distinct(gene_name, .keep_all = TRUE)
    # Convert to a traditional data frame
    d.raw.df <- as.data.frame(d.raw.unique)
    # Set gene_name as row names
    rownames(d.raw.df) <- d.raw.df$gene_name
    # Remove the now redundant gene_name column
    d.raw.df$gene_name <- NULL
    
    #d.summarized <- d.raw %>% group_by(gene_name) %>% summarize(across(everything(), sum, na.rm = TRUE))
    #rownames(d.raw.df) <- d.raw.df$gene_name
    ## Remove the now redundant gene_name column
    #d.raw.df$gene_name <- NULL
    
    replicates = as.factor(c("p600_d8","p600_d8", "p602_d8","p602_d8",  "p605_d8","p605_d8",    "p783_d8","p783_d8"))
    ids = as.factor(c("p600_d8_DonorI","p600_d8_DonorII", "p602_d8_DonorI","p602_d8_DonorII",  "p605_d8_DonorI","p605_d8_DonorII",    "p783_d8_DonorI","p783_d8_DonorII"))
    donor = as.factor(c("DonorI","DonorII",   "DonorI","DonorII",   "DonorI","DonorII",    "DonorI","DonorII"))
    cData = data.frame(row.names=colnames(d.raw.df), replicates=replicates, donor=donor, ids=ids)
    #dds<-DESeqDataSetFromMatrix(countData=d.raw.df, colData=cData, design=~batch+replicates) # ERROR due to the factor 'batch'
    dds<-DESeqDataSetFromMatrix(countData=d.raw.df, colData=cData, design=~donor+replicates)
    
    sizeFactors(dds)
    >NULL
    dds <- estimateSizeFactors(dds)
    sizeFactors(dds)
    
    #  control_DI  control_DII        LT_DI       LT_DII      LTtr_DI     LTtr_DII 
    #   0.9393201    0.5490296    0.9066729    1.1874750    1.4258546    0.7305989 
    # LT_K331A_DI LT_K331A_DII 
    #   1.4819629    1.2744066 
    
    rld <- rlogTransformation(dds)
    
    dds$replicates <- relevel(dds$replicates, "p600_d8")
    dds = DESeq(dds, betaPrior=FALSE)
    resultsNames(dds)
    clist <- c("p783_d8_vs_p600_d8")
    
    for (i in clist) {
      contrast = paste("replicates", i, sep="_")
      res = results(dds, name=contrast)
      res <- res[!is.na(res$log2FoldChange),]
      geness <- AnnotationDbi::select(org.Hs.eg.db, keys = rownames(res), keytype = "SYMBOL", columns = c("ENSEMBL", "ENTREZID", "SYMBOL", "GENENAME"))
      geness <- geness[!duplicated(geness$SYMBOL), ]
      res$SYMBOL = rownames(res)
      rownames(geness) <- geness$SYMBOL
      identical(rownames(res), rownames(geness))
      res_df <- as.data.frame(res)
      geness_res <- merge(geness, res_df)
      dim(geness_res)
      write.csv(as.data.frame(geness_res[order(geness_res$pvalue),]), file = paste(i, "all.txt", sep="-"))
      #write.csv(res_df, file = paste(i, "background.txt", sep="_"))
      up <- subset(geness_res, padj<=0.05 & log2FoldChange>=1)
      down <- subset(geness_res, padj<=0.05 & log2FoldChange<=-1)
      write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(i, "up.txt", sep="-"))
      write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(i, "down.txt", sep="-"))
    }
    
    ~/Tools/csv2xls-0.4/csv_to_xls.py \
    p783_d8_vs_p600_d8-all.txt \
    p783_d8_vs_p600_d8-up.txt \
    p783_d8_vs_p600_d8-down.txt \
    -d$',' -o p783_d8_vs_p600_d8_on_LT.xls;
  5. Deduplication is not performed until UMI-tools are used.

    • UMI-tools dedup

      • After extracting the UMI information from the read sequence (see UMI-tools extract), the second step in the removal of UMI barcodes involves deduplicating the reads based on both mapping and UMI barcode information using the UMI-tools dedup command.

      • This will generate a filtered BAM file after the removal of PCR duplicates.

      • picard MarkDuplicates

      • Unless you are using UMIs it is not possible to establish whether the fragments you have sequenced from your sample were derived via true biological duplication (i.e. sequencing independent template fragments) or as a result of PCR biases introduced during the library preparation.

      • By default, the pipeline uses picard MarkDuplicates to mark the duplicate reads identified amongst the alignments to allow you to guage the overall level of duplication in your samples.

      • However, for RNA-seq data it is not recommended to physically remove duplicate reads from the alignments (unless you are using UMIs) because you expect a significant level of true biological duplication that arises from the same fragments being sequenced from for example highly expressed genes.

      • This step will be skipped automatically when using the –with_umi option or explicitly via the –skip_markduplicates parameter.

        compare the STAR bam-files with markDuplicates bam-files. They are the same, since in the markDuplicates step, the duplicated reads were only marked, not physically removed! jhuang@hamburg:~/DATA/Data_Denise_sT_PUBLISHING/Data_Denise_RNASeq/results/STAR$ samtools flagstat ./V_8_4_2_p602_d3_DonorIIAligned.sortedByCoord.out.bam 19968762 + 0 in total (QC-passed reads + QC-failed reads) 2323677 + 0 secondary 0 + 0 supplementary 0 + 0 duplicates 19968762 + 0 mapped (100.00% : N/A) jhuang@hamburg:~/DATA/Data_Denise_sT_PUBLISHING/Data_Denise_RNASeq/results/markDuplicates$ samtools flagstat V_8_4_2_p602_d3_DonorIIAligned.sortedByCoord.out.markDups.bam 19968762 + 0 in total (QC-passed reads + QC-failed reads) 2323677 + 0 secondary 0 + 0 supplementary 15013903 + 0 duplicates 19968762 + 0 mapped (100.00% : N/A)

  6. goal

    #| p600and601_d9_DonorII | GFP+mCherry control | Day 9 | 
    #| p600and601_d12_DonorI | GFP+mCherry control | Day 12 | 
    #-->| p604and605_d9_DonorII | sT+LTtr | Day 9 | 
    #-->| p604and605_d12_DonorI | sT+LTtr | Day 12 | 
    #-->| p602and604_d3_DonorI  | sT+LT | Day 3 | 
    #-->| p602and604_d3_DonorII | sT+LT | Day 3 |
    p602+604 d3 versus p602 d3 on sT+LT transcript
    p602+p604 d3 versus p604 d3 on sT+LT transcript
    p604+605 d9/12 versus p604 d8 on sT+LT transcript
    p604+605 d9/12 versus p605 d8 on sT+LT transcript
  7. under ~/DATA/Data_Denise_sT_PUBLISHING/Data_Denise_RNASeq_Merged_Corrected_28+2

    library("AnnotationDbi")
    library("clusterProfiler")
    library("ReactomePA")
    library(gplots)
    library(tximport)
    library(DESeq2)
    library("org.Hs.eg.db")
    #d.raw<- read.delim2("gene_counts_hg38+virus_on_LT_transcripts_28samples.csv",sep=",", header=TRUE, row.names=1)
    #d.raw<- read.delim2("gene_counts_hg38+virus_on_sT_transcripts_28samples.csv",sep=",", header=TRUE, row.names=1)
    d.raw<- read.delim2("gene_counts_hg38+virus_on_sT+LT_transcripts_28samples.csv",sep=",", header=TRUE, row.names=1)
    # [1] V_8_0_mock_DonorI             V_8_0_mock_DonorII           
    # [3] V_8_1_5_p601_d3_DonorII       V_8_1_5_p604_d3_DonorII      
    # [5] V_8_1_5_p601_d8_DonorII       V_8_1_5_p604_d8_DonorII      
    # [7] V_8_1_6_p601_d3_DonorI        V_8_1_6_p604_d3_DonorI       
    # [9] V_8_1_6_p601_d8_DonorI        V_8_1_6_p604_d8_DonorI       
    #[11] V_8_2_3_p600_d3_DonorII       V_8_2_3_p605_d3_DonorII      
    #[13] V_8_2_3_p600_d8_DonorII       V_8_2_3_p605_d8_DonorII      
    #[15] V_8_2_4_p600_d3_DonorI        V_8_2_4_p605_d3_DonorI       
    #[17] V_8_2_4_p600_d8_DonorI        V_8_2_4_p605_d8_DonorI       
    #[19] V_8_4_1_p602_d8_DonorII       V_8_4_1_p602_d8_DonorI       
    #[21] V_8_3_1_p600and601_d12_DonorI V_8_3_1_p604and605_d12_DonorI
    #[23] V_8_3_2_p600and601_d9_DonorII V_8_3_2_p604and605_d9_DonorII
    #[25] V_8_4_2_p602_d3_DonorI        V_8_4_2_p602_d3_DonorII      
    #[27] V_8_5_1_p602and604_d3_DonorI  V_8_5_2_p602and604_d3_DonorII
    col_order <- c("V_8_0_mock_DonorI","V_8_0_mock_DonorII","V_8_1_5_p601_d3_DonorII", "V_8_1_5_p604_d3_DonorII", "V_8_1_5_p601_d8_DonorII","V_8_1_5_p604_d8_DonorII",   "V_8_1_6_p601_d3_DonorI","V_8_1_6_p604_d3_DonorI","V_8_1_6_p601_d8_DonorI","V_8_1_6_p604_d8_DonorI",  "V_8_2_3_p600_d3_DonorII","V_8_2_3_p605_d3_DonorII","V_8_2_3_p600_d8_DonorII", "V_8_2_3_p605_d8_DonorII",  "V_8_2_4_p600_d3_DonorI","V_8_2_4_p605_d3_DonorI","V_8_2_4_p600_d8_DonorI","V_8_2_4_p605_d8_DonorI",  "V_8_4_1_p602_d8_DonorII","V_8_4_1_p602_d8_DonorI",  "V_8_3_1_p600and601_d12_DonorI", "V_8_3_1_p604and605_d12_DonorI","V_8_3_2_p600and601_d9_DonorII","V_8_3_2_p604and605_d9_DonorII",    "V_8_4_2_p602_d3_DonorI","V_8_4_2_p602_d3_DonorII",    "V_8_5_1_p602and604_d3_DonorI", "V_8_5_2_p602and604_d3_DonorII")
    identical(colnames(d.raw),col_order)
    #reordered.raw <- d.raw[,col_order]
    replicates = as.factor(c("untreated","untreated", "p601_d3","p604_d3", "p601_d8","p604_d8",   "p601_d3","p604_d3","p601_d8","p604_d8",  "p600_d3","p605_d3","p600_d8", "p605_d8",  "p600_d3","p605_d3","p600_d8","p605_d8",  "p602_d8","p602_d8",  "p600and601_d9d12", "p604and605_d9d12","p600and601_d9d12","p604and605_d9d12",     "p602_d3","p602_d3",    "p602and604_d3","p602and604_d3"))
    batch = as.factor(c("200420", "200420", "190927", "190927",    "190927", "190927", "190228", "190228",    "190228", "190228", "191008", "191008",    "191008", "191008", "190228", "190228",     "190228", "190228", "200817", "200817",       "200420", "200420", "200817", "200817",      "210302", "210302",  "210504","210504"))
    ids = as.factor(c("untreated_DonorI","untreated_DonorII", "p601_d3_DonorII","p604_d3_DonorII", "p601_d8_DonorII","p604_d8_DonorII",   "p601_d3_DonorI","p604_d3_DonorI","p601_d8_DonorI","p604_d8_DonorI",  "p600_d3_DonorII","p605_d3_DonorII","p600_d8_DonorII", "p605_d8_DonorII",  "p600_d3_DonorI","p605_d3_DonorI","p600_d8_DonorI","p605_d8_DonorI",  "p602_d8_DonorII","p602_d8_DonorI",  "p600and601_d12_DonorI", "p604and605_d12_DonorI","p600and601_d9_DonorII","p604and605_d9_DonorII",        "p602_d3_DonorI","p602_d3_DonorII",      "p602and604_d3_DonorI","p602and604_d3_DonorII"))
    donor = as.factor(c("DonorI","DonorII",  "DonorII","DonorII", "DonorII","DonorII",   "DonorI","DonorI","DonorI","DonorI",  "DonorII","DonorII","DonorII","DonorII",  "DonorI","DonorI","DonorI","DonorI",  "DonorII","DonorI",   "DonorI", "DonorI","DonorII","DonorII",        "DonorI","DonorII",    "DonorI","DonorII"))
    cData = data.frame(row.names=colnames(d.raw), replicates=replicates, donor=donor, batch=batch, ids=ids)
    #dds<-DESeqDataSetFromMatrix(countData=d.raw, colData=cData, design=~batch+replicates) # ERROR due to the factor 'batch'
    dds<-DESeqDataSetFromMatrix(countData=d.raw, colData=cData, design=~donor+replicates)
    sizeFactors(dds)
    >NULL
    dds <- estimateSizeFactors(dds)
    sizeFactors(dds)
    #            V_8_0_mock_DonorI            V_8_0_mock_DonorII 
    #                    0.6152727                     0.8401306 
    #      V_8_1_5_p601_d3_DonorII       V_8_1_5_p604_d3_DonorII 
    #                    1.0196629                     1.0293038 
    #      V_8_1_5_p601_d8_DonorII       V_8_1_5_p604_d8_DonorII 
    #                    0.8901261                     0.9062621 
    #       V_8_1_6_p601_d3_DonorI        V_8_1_6_p604_d3_DonorI 
    #                    1.3752853                     1.3504344 
    #       V_8_1_6_p601_d8_DonorI        V_8_1_6_p604_d8_DonorI 
    #                    1.2316027                     1.3085877 
    #      V_8_2_3_p600_d3_DonorII       V_8_2_3_p605_d3_DonorII 
    #                    0.9323440                     0.9381003 
    #      V_8_2_3_p600_d8_DonorII       V_8_2_3_p605_d8_DonorII 
    #                    0.7179836                     0.9400712 
    #       V_8_2_4_p600_d3_DonorI        V_8_2_4_p605_d3_DonorI 
    #                    1.3953243                     1.4836134 
    #       V_8_2_4_p600_d8_DonorI        V_8_2_4_p605_d8_DonorI 
    #                    1.2135489                     1.8270271 
    #      V_8_4_1_p602_d8_DonorII        V_8_4_1_p602_d8_DonorI 
    #                    1.5281322                     1.1579240 
    #V_8_3_1_p600and601_d12_DonorI V_8_3_1_p604and605_d12_DonorI 
    #                    0.9895927                     0.8560524 
    #V_8_3_2_p600and601_d9_DonorII V_8_3_2_p604and605_d9_DonorII 
    #                    1.3462308                     1.1142589 
    #       V_8_4_2_p602_d3_DonorI       V_8_4_2_p602_d3_DonorII 
    #                    0.9675338                     1.1800615 
    # V_8_5_1_p602and604_d3_DonorI V_8_5_2_p602and604_d3_DonorII 
    #                    0.3940273                     0.3490526
    rld <- rlogTransformation(dds)
    # -- before pca --
    png("pca.png", 1200, 800)
    plotPCA(rld, intgroup=c("replicates"))
    #plotPCA(rld, intgroup = c("replicates", "batch"))
    #plotPCA(rld, intgroup = c("replicates", "ids"))
    #plotPCA(rld, "batch")
    dev.off()
    # -- before heatmap --
    ## generate the pairwise comparison between samples
    library(gplots) 
    library("RColorBrewer")
    png("heatmap.png", 1200, 800)
    distsRL <- dist(t(assay(rld)))
    mat <- as.matrix(distsRL)
    #paste( rld$dex, rld$cell, sep="-" )
    #rownames(mat) <- colnames(mat) <- with(colData(dds),paste(replicates,batch, sep=":"))
    rownames(mat) <- colnames(mat) <- with(colData(dds),paste(replicates))
    hc <- hclust(distsRL)
    hmcol <- colorRampPalette(brewer.pal(9,"GnBu"))(100)
    heatmap.2(mat, Rowv=as.dendrogram(hc),symm=TRUE, trace="none",col = rev(hmcol), margin=c(13, 13))
    dev.off()
    
    #---- 3 ----
    dds$replicates <- relevel(dds$replicates, "p602_d3")
    dds = DESeq(dds, betaPrior=FALSE)
    resultsNames(dds)
    clist <- c("p602and604_d3_vs_p602_d3")
    dds$replicates <- relevel(dds$replicates, "p604_d3")
    dds = DESeq(dds, betaPrior=FALSE)
    resultsNames(dds)
    clist <- c("p602and604_d3_vs_p604_d3")
    dds$replicates <- relevel(dds$replicates, "p604_d8")
    dds = DESeq(dds, betaPrior=FALSE)
    resultsNames(dds)
    clist <- c("p604and605_d9d12_vs_p604_d8")
    dds$replicates <- relevel(dds$replicates, "p605_d8")
    dds = DESeq(dds, betaPrior=FALSE)
    resultsNames(dds)
    clist <- c("p604and605_d9d12_vs_p605_d8")
    for (i in clist) {
    contrast = paste("replicates", i, sep="_")
    res = results(dds, name=contrast)
    res <- res[!is.na(res$log2FoldChange),]
    geness <- AnnotationDbi::select(org.Hs.eg.db, keys = rownames(res), keytype = "SYMBOL", columns = c("ENSEMBL", "ENTREZID", "SYMBOL", "GENENAME"))
    geness <- geness[!duplicated(geness$SYMBOL), ]
    res$SYMBOL = rownames(res)
    rownames(geness) <- geness$SYMBOL
    identical(rownames(res), rownames(geness))
    res_df <- as.data.frame(res)
    geness_res <- merge(geness, res_df)
    dim(geness_res)
    write.csv(as.data.frame(geness_res[order(geness_res$pvalue),]), file = paste(i, "all.txt", sep="-"))
    #write.csv(res_df, file = paste(i, "background.txt", sep="_"))
    up <- subset(geness_res, padj<=0.05 & log2FoldChange>=1)
    down <- subset(geness_res, padj<=0.05 & log2FoldChange<=-1)
    write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(i, "up.txt", sep="-"))
    write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(i, "down.txt", sep="-"))
    }
    ~/Tools/csv2xls-0.4/csv_to_xls.py \
    p602and604_d3_vs_p602_d3-all.txt \
    p602and604_d3_vs_p602_d3-up.txt \
    p602and604_d3_vs_p602_d3-down.txt \
    -d$',' -o p602and604_d3_vs_p602_d3.xls;
    ~/Tools/csv2xls-0.4/csv_to_xls.py \
    p602and604_d3_vs_p604_d3-all.txt \
    p602and604_d3_vs_p604_d3-up.txt \
    p602and604_d3_vs_p604_d3-down.txt \
    -d$',' -o p602and604_d3_vs_p604_d3.xls;
    ~/Tools/csv2xls-0.4/csv_to_xls.py \
    p604and605_d9d12_vs_p604_d8-all.txt \
    p604and605_d9d12_vs_p604_d8-up.txt \
    p604and605_d9d12_vs_p604_d8-down.txt \
    -d$',' -o p604and605_d9d12_vs_p604_d8.xls;
    ~/Tools/csv2xls-0.4/csv_to_xls.py \
    p604and605_d9d12_vs_p605_d8-all.txt \
    p604and605_d9d12_vs_p605_d8-up.txt \
    p604and605_d9d12_vs_p605_d8-down.txt \
    -d$',' -o p604and605_d9d12_vs_p605_d8.xls;

RNA-seq data analysis (R-part) for S. epidermidis 1585

  1. Data input and generate PCA plot

    # Import the required libraries
    library("AnnotationDbi")
    library("clusterProfiler")
    library("ReactomePA")
    library(gplots)
    library(tximport)
    library(DESeq2)
    setwd("/home/jhuang/DATA/Data_Marc_RNA-seq_Sepidermidis/results_1585/star_salmon")
    # Define paths to your Salmon output quantification files
    files <- c("15m_A" = "./EX_15_min_A/quant.sf",
              "15m_B" = "./EX_15_min_B/quant.sf",
              "15m_C" = "./EX_15_min_C/quant.sf",
              "1h_A" = "./EX_1h_A/quant.sf",
              "1h_B" = "./EX_1h_B/quant.sf",
              "1h_C" = "./EX_1h_C/quant.sf",
              "2h_A" = "./EX_2h_A/quant.sf",
              "2h_B" = "./EX_2h_B/quant.sf",
              "2h_C" = "./EX_2h_C/quant.sf",
              "4h_A" = "./EX_4h_A/quant.sf",
              "4h_B" = "./EX_4h_B/quant.sf",
              "4h_C" = "./EX_4h_C/quant.sf",
              "6h_A" = "./EX_6h_A/quant.sf",
              "6h_B" = "./EX_6h_B/quant.sf",
              "6h_C" = "./EX_6h_C/quant.sf",
              "4d_A" = "./EX_Day_4_A/quant.sf",
              "4d_B" = "./EX_Day_4_B/quant.sf",
              "4d_C" = "./EX_Day_4_C/quant.sf")
    # Import the transcript abundance data with tximport
    txi <- tximport(files, type = "salmon", txIn = TRUE, txOut = TRUE)
    # Define the replicates and condition of the samples
    replicate <- factor(c("A", "B", "C", "A", "B", "C", "A", "B", "C", "A", "B", "C", "A", "B", "C", "A", "B", "C"))
    condition <- factor(c("15m", "15m", "15m", "1h", "1h", "1h", "2h", "2h", "2h", "4h", "4h", "4h", "6h", "6h", "6h", "4d", "4d", "4d"))
    # Define the colData for DESeq2
    colData <- data.frame(condition=condition, replicate=replicate, row.names=names(files))
    # Create DESeqDataSet object
    dds <- DESeqDataSetFromTximport(txi, colData=colData, design=~condition)
    # In the context of your new code which is using tximport and DESeq2, you don't necessarily need this step. The reason is that DESeq2 performs its own filtering of low-count genes during the normalization and differential expression steps.
    # Filter data to retain only genes with more than 2 counts > 3 across all samples
    # dds <- dds[rowSums(counts(dds) > 3) > 2, ]
    # Output raw count data to a CSV file
    write.csv(counts(dds), file="transcript_counts.csv")
    # -- gene-level count data --
    # Read in the tx2gene map from salmon_tx2gene.tsv
    #tx2gene <- read.csv("salmon_tx2gene.tsv", sep="\t", header=FALSE)
    tx2gene <- read.table("salmon_tx2gene.tsv", header=FALSE, stringsAsFactors=FALSE)
    # Set the column names
    colnames(tx2gene) <- c("transcript_id", "gene_id", "gene_name")
    # Remove the gene_name column if not needed
    tx2gene <- tx2gene[,1:2]
    # Import and summarize the Salmon data with tximport
    txi <- tximport(files, type = "salmon", tx2gene = tx2gene, txOut = FALSE)
    # Continue with the DESeq2 workflow as before...
    colData <- data.frame(condition=condition, replicate=replicate, row.names=names(files))
    dds <- DESeqDataSetFromTximport(txi, colData=colData, design=~condition)
    #dds <- dds[rowSums(counts(dds) > 3) > 2, ]    #60605-->26543
    write.csv(counts(dds, normalized=FALSE), file="gene_counts.csv")
    #~/Tools/csv2xls-0.4/csv_to_xls.py gene_counts.csv -d',' -o gene_counts.xls
    
    #TODO: why a lot of reads were removed due to the too_short?
    #STAR --runThreadN 4 --genomeDir /path/to/GenomeDir --readFilesIn /path/to/read1.fastq /path/to/read2.fastq --outFilterMatchNmin 50 --outSAMtype BAM SortedByCoordinate --outFileNamePrefix /path/to/output
    dim(counts(dds))
    head(counts(dds), 10)
    #DEBUG: DESeq should not used here!?
    #TODO_NEXT_WEEK: rerun without fistly DESeq(dds) to compare if the results is the same to process_1 
    #dds <- DESeq(dds)
    rld <- rlogTransformation(dds)
    
    # draw simple pca and heatmap
    library(gplots) 
    library("RColorBrewer")
    #mat <- assay(rld)
    #mm <- model.matrix(~condition, colData(rld))
    #mat <- limma::removeBatchEffect(mat, batch=rld$batch, design=mm)
    #assay(rld) <- mat
    # -- pca --
    png("pca.png", 1200, 800)
    plotPCA(rld, intgroup=c("condition"))
    dev.off()
    # -- heatmap --
    png("heatmap.png", 1200, 800)
    distsRL <- dist(t(assay(rld)))
    mat <- as.matrix(distsRL)
    hc <- hclust(distsRL)
    hmcol <- colorRampPalette(brewer.pal(9,"GnBu"))(100)
    heatmap.2(mat, Rowv=as.dendrogram(hc),symm=TRUE, trace="none",col = rev(hmcol), margin=c(13, 13))
    dev.off()
  2. Analysis on differentially expressed genes

    setwd("/home/jhuang/DATA/Data_Marc_RNA-seq_Sepidermidis/results_1585/star_salmon/degenes")
    
    dds$condition <- relevel(dds$condition, "15m")
    dds = DESeq(dds, betaPrior=FALSE)
    resultsNames(dds)
    clist <- c("1h_vs_15m", "2h_vs_15m", "4h_vs_15m", "6h_vs_15m",  "4d_vs_15m")
    
    for (i in clist) {
      contrast = paste("condition", i, sep="_")
      res = results(dds, name=contrast)
      res <- res[!is.na(res$log2FoldChange),]
      res_df <- as.data.frame(res)
      write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(i, "all.txt", sep="-"))
      up <- subset(res_df, padj<=0.05 & log2FoldChange>=2)
      down <- subset(res_df, padj<=0.05 & log2FoldChange<=-2)
      write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(i, "up.txt", sep="-"))
      write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(i, "down.txt", sep="-"))
    }
    
    library("EnhancedVolcano")
    for i in 1h_vs_15m 2h_vs_15m 4h_vs_15m 6h_vs_15m 4d_vs_15m; do
      echo "res <- read.csv(file = paste(\"${i}\", \"all.txt\", sep=\"-\"), row.names = 1)"
      echo "res_df <- as.data.frame(res)"
      echo "png(\"${i}.png\",width=800, height=600)"
      #legendPosition = 'right',legendLabSize = 12,  arrowheads = FALSE,
      #echo "EnhancedVolcano(res, lab = rownames(res),x = 'log2FoldChange',y = 'padj', pCutoff=5e-2, FCcutoff=1.2, title='', subtitleLabSize = 18, pointSize = 3.0, labSize = 5.0, colAlpha=1, legendIconSize = 4.0, drawConnectors = TRUE, widthConnectors = 0.5, colConnectors = 'black', subtitle=expression(~Delta*\"$(echo $i | cut -d'_' -f1) versus \" *~Delta*\"$(echo $i | cut -d'_' -f3)\"))"
      echo "EnhancedVolcano(res, lab = rownames(res),x = 'log2FoldChange',y = 'padj', pCutoff=5e-2, FCcutoff=1.0, title='', subtitleLabSize = 18, pointSize = 3.0, labSize = 5.0, colAlpha=1, legendIconSize = 4.0, drawConnectors = TRUE, widthConnectors = 0.5, colConnectors = 'black', subtitle=expression(\"$(echo $i | cut -d'_' -f1) versus $(echo $i | cut -d'_' -f3)\"))"
      echo "dev.off()"
    done
    
    res <- read.csv(file = paste("1h_vs_15m", "all.txt", sep="-"), row.names = 1)
    res_df <- as.data.frame(res)
    png("1h_vs_15m.png",width=1000, height=800)
    EnhancedVolcano(res, lab = rownames(res),x = 'log2FoldChange',y = 'padj', ylim = c(0, 1), pCutoff=5e-2, FCcutoff=2.0, title='', subtitleLabSize = 18, pointSize = 3.0, labSize = 5.0, colAlpha=1, legendIconSize = 4.0, drawConnectors = TRUE, widthConnectors = 0.5, colConnectors = 'black', subtitle=expression("1h versus 15m"))
    dev.off()
    
    res <- read.csv(file = paste("2h_vs_15m", "all.txt", sep="-"), row.names = 1)
    res_df <- as.data.frame(res)
    png("2h_vs_15m.png",width=1000, height=1000)
    EnhancedVolcano(res, lab = rownames(res),x = 'log2FoldChange',y = 'padj', ylim = c(0, 4), pCutoff=5e-2, FCcutoff=2.0, title='', subtitleLabSize = 18, pointSize = 3.0, labSize = 5.0, colAlpha=1, legendIconSize = 4.0, drawConnectors = TRUE, widthConnectors = 0.5, colConnectors = 'black', subtitle=expression("2h versus 15m"))
    dev.off()
    
    res <- read.csv(file = paste("4h_vs_15m", "all.txt", sep="-"), row.names = 1)
    res_df <- as.data.frame(res)
    png("4h_vs_15m.png",width=1000, height=1200)
    EnhancedVolcano(res, lab = rownames(res),x = 'log2FoldChange',y = 'padj', ylim = c(0, 5.5), pCutoff=5e-2, FCcutoff=2.0, title='', subtitleLabSize = 18, pointSize = 3.0, labSize = 5.0, colAlpha=1, legendIconSize = 4.0, drawConnectors = TRUE, widthConnectors = 0.5, colConnectors = 'black', subtitle=expression("4h versus 15m"))
    dev.off()
    
    res <- read.csv(file = paste("6h_vs_15m", "all.txt", sep="-"), row.names = 1)
    res_df <- as.data.frame(res)
    png("6h_vs_15m.png",width=1000, height=1600)
    EnhancedVolcano(res, lab = rownames(res),x = 'log2FoldChange',y = 'padj', ylim = c(0, 17), pCutoff=5e-2, FCcutoff=2.0, title='', subtitleLabSize = 18, pointSize = 3.0, labSize = 5.0, colAlpha=1, legendIconSize = 4.0, drawConnectors = TRUE, widthConnectors = 0.5, colConnectors = 'black', subtitle=expression("6h versus 15m"))
    dev.off()
    
    res <- read.csv(file = paste("4d_vs_15m", "all.txt", sep="-"), row.names = 1)
    res_df <- as.data.frame(res)
    png("4d_vs_15m.png",width=1000, height=1600)
    EnhancedVolcano(res, lab = rownames(res),x = 'log2FoldChange',y = 'padj', pCutoff=5e-2, FCcutoff=2.0, title='', subtitleLabSize = 18, pointSize = 3.0, labSize = 5.0, colAlpha=1, legendIconSize = 4.0, drawConnectors = TRUE, widthConnectors = 0.5, colConnectors = 'black', subtitle=expression("4d versus 15m"))
    dev.off()
    
    #under DIR degenes under KONSOLE
    for i in 1h_vs_15m 2h_vs_15m 4h_vs_15m 6h_vs_15m 4d_vs_15m; do 
    echo "~/Tools/csv2xls-0.4/csv_to_xls.py ${i}-all.txt ${i}-up.txt ${i}-down.txt -d$',' -o ${i}.xls;"; 
    done
  3. Clustering the genes and draw heatmap

    for i in 1h_vs_15m 2h_vs_15m 4h_vs_15m 6h_vs_15m 4d_vs_15m; do
    echo "cut -d',' -f1-1 ${i}-up.txt > ${i}-up.id"; echo "cut -d',' -f1-1 ${i}-down.txt > ${i}-down.id"; done
    cat *.id | sort -u > ids
    #add Gene_Id in the first line, delete the ""
    GOI <- read.csv("ids")$Gene_Id
    RNASeq.NoCellLine <- assay(rld)
    #install.packages("gplots")
    library("gplots")
    #clustering methods: "ward.D", "ward.D2", "single", "complete", "average" (= UPGMA), "mcquitty" (= WPGMA), "median" (= WPGMC) or "centroid" (= UPGMC).  pearson or spearman
    datamat = RNASeq.NoCellLine[GOI, ]
    #datamat = RNASeq.NoCellLine
    write.csv(as.data.frame(datamat), file ="gene_expressions.txt")
    constant_rows <- apply(datamat, 1, function(row) var(row) == 0)
    if(any(constant_rows)) {
      cat("Removing", sum(constant_rows), "constant rows.\n")
      datamat <- datamat[!constant_rows, ]
    }
    hr <- hclust(as.dist(1-cor(t(datamat), method="pearson")), method="complete")
    hc <- hclust(as.dist(1-cor(datamat, method="spearman")), method="complete")
    mycl = cutree(hr, h=max(hr$height)/1.4)
    mycol = c("YELLOW", "DARKBLUE", "DARKORANGE", "MAGENTA", "CYAN", "GREEN", "RED", "MAROON", "LIGHTBLUE", "PINK", "MAGENTA", "LIGHTCYAN", "LIGHTGREEN"); #"LIGHTRED", 
    mycol = mycol[as.vector(mycl)]
    #png("DEGs_heatmap.png", width=900, height=800)
    #cex.lab=10, labRow="",
    png("DEGs_heatmap.png", width=900, height=1000)
    #heatmap.2(as.matrix(datamat),Rowv=as.dendrogram(hr),Colv = NA, dendrogram = 'row',
    #            scale='row',trace='none',col=bluered(75), 
    #            RowSideColors = mycol, margins=c(10,20), cexRow=1.5, srtCol=45, lhei = c(2, 8))  #rownames(datamat)  
    # cexCol, cexRow
    heatmap.2(datamat, Rowv=as.dendrogram(hr), col=bluered(75), scale="row", RowSideColors=mycol, trace="none", margin=c(5,2), sepwidth=c(0,0), dendrogram = 'row', Colv = 'false', density.info='none', labRow="", srtCol=40, lhei=c(1,7), cexCol=2)
    dev.off()
    
    #### cluster members #####
    write.csv(names(subset(mycl, mycl == '1')),file='cluster1_YELLOW.txt')
    write.csv(names(subset(mycl, mycl == '2')),file='cluster2_DARKBLUE.txt') 
    write.csv(names(subset(mycl, mycl == '3')),file='cluster3_DARKORANGE.txt')  
    write.csv(names(subset(mycl, mycl == '4')),file='cluster4_MAGENTA.txt')
    write.csv(names(subset(mycl, mycl == '5')),file='cluster5_CYAN.txt')
    write.csv(names(subset(mycl, mycl == '6')),file='cluster6_GREEN.txt')
    ~/Tools/csv2xls-0.4/csv_to_xls.py cluster*.txt -d',' -o DEGs_heatmap_cluster_members.xls
    #~/Tools/csv2xls-0.4/csv_to_xls.py significant_gene_expressions.txt -d',' -o DEGs_heatmap_expression_data.xls;
  4. Generate Excel files from Genbank-file

    python3 /home/jhuang/Scripts/gb_to_excel.py 1585.gb 1585.xlsx
    #insert C./P. (Chrom or Plasmid1, Plasmid2, Plasmid3) as the second chrom in the 1585.xlsx.

Preparing GTF file for nextflow rnaseq regarding S. epidermidis 1585

  1. In the example RP62A

    #[IMPORTANT FINAL STEPS]
    [1] python3 gff_to_gtf.py RP62A.gff3 RP62A.gtf or python3 ~/Scripts/genbank_to_gtf.py 1585.gb 1585.gtf
    
    [2] python3 modify_gtf.py RP62A.gtf RP62A_m.gtf
        python3 modify_gtf.py 1585.gtf 1585_m.gtf  #replace CDS with exon!
    
        gene_biotype "protein_coding"; 
    [3] sed -i -e 's/gene_id "rna-/gene_id "gene-/g' RP62A_m.gtf
    
    #  --gtf_extra_attributes        [string]  By default, the pipeline uses the `gene_name` field to obtain additional gene identifiers from the input GTF file when 
                                              running Salmon. [default: gene_name] 
    #  --gtf_group_features          [string]  Define the attribute type used to group features in the GTF file when running Salmon. [default: gene_id]
    #  --featurecounts_group_type    [string]  The attribute type used to group feature types in the GTF file when generating the biotype plot with 
                                              featureCounts. [default: gene_biotype] 
    #  --featurecounts_feature_type  [string]  By default, the pipeline assigns reads based on the 'exon' attribute within the GTF file. [default: exon]
    
    #--gtf_extra_attributes gene 
    #--gtf_group_features Parent --featurecounts_group_type gene_biotype --featurecounts_feature_type CDS (outdated since [CHANGE1,2,3]) 
    
    # ------------ gff_to_gtf.py, then modify *.gtf file as follows -------------
    #Upload the scripts gff_to_gtf.py and modify_gtf.py.
    python3 gff_to_gtf.py RP62A.gff3 RP62A.gtf
    
    #[CHANGE1] ID in gene and Parent in CDS --> gene_id; ID in exon --> transcript_id; Name in gene --> gene_name
    #[CHANGE2] "Protein Homology" --> "RefSeq"
    #[CHANGE3] CDS --> exon
    #[CHANGE4?] --gtf_extra_attributes ||||gene|||| refers to ID=gene-SERP_RS00220;Name=noc;gbkey=Gene;||||gene=noc||||; "gene=" exists in both gene and exon. choose gene as gtf_extra_attributes! (see --gtf_extra_attributes gene in the next command)
  2. In the example of 1585 wildtype genome 1585.gb (CP143516-CP143519)

                Features Annotated                :: Gene; CDS; rRNA; tRNA; ncRNA
                Genes (total)                     :: 2,332 (CDS + RNA)
                CDSs (total)                      :: 2,240 (CDS)  grep "CDS" 1585.gtf | wc -l
                Genes (coding)                    :: 2,183 (Pseudo has also CDS?)
                CDSs (with protein)               :: 2,183
    
                Genes (RNA)                       :: 92  
                rRNAs                             :: 7, 6, 6 (5S, 16S, 23S)  grep "rRNA" 1585.gtf | wc -l
                complete rRNAs                    :: 7, 6, 6 (5S, 16S, 23S)
                tRNAs                             :: 69  grep "tRNA" 1585.gtf | wc -l
                ncRNAs                            :: 4  grep "ncRNA" 1585.gtf | wc -l
    
                Pseudo Genes (total)              :: 57
                CDSs (without protein)            :: 57
                Pseudo Genes (ambiguous residues) :: 0 of 57
                Pseudo Genes (frameshifted)       :: 33 of 57
                Pseudo Genes (incomplete)         :: 30 of 57
                Pseudo Genes (internal stop)      :: 25 of 57
                Pseudo Genes (multiple problems)  :: 21 of 57
    
        gene            730216..731766
                        /locus_tag="V0R30_03330"
        rRNA            730216..731766
                        /locus_tag="V0R30_03330"
    
    #Ziel all ncRNA, rRNA, tmRNA, tRNA as exon, the original types set as gene_biotype "protein_coding"; gene_biotype "tRNA"; ...
    cut -d$'\t' -f3 1585_m.gtf | sort -u
    exon
    #gene
    ncRNA
    rRNA
    tmRNA            
    tRNA
    #Ausführung: STEP0: one record (tmRNA) lacking the "tmRNA" row, manually corrected! 
    #                Transfer-messenger RNA (tmRNA, 10Sa RNA, ssrA) is bacterial RNA having both tRNA and mRNA properties
    #            STEP1: python3 genbank_to_gtf.py 1585.gb 1585.gtf
    #            STEP2: python3 modify_gtf.py 1585.gtf 1585_m.gtf  #"CDS" --> "exon"
    #            STEP3: replace transcript_id " --> transcript_id "tx-
    #            STEP4: ergänzen all lines as gene_id; gene_name; transcript_id; and add all lines annotated with "gene" at the end with 'gene_biotype "protein_coding";'
    #                   using "python3 add_geneid_genename_genebiotype.py 1585_m.gtf 1585_m_.gtf
    
    #DEBUG ERROR Stop at line : CP143516       biopython       exon    1       1356    .       +       .       gene_id "dnaA"; gene_name "dnaA"; transcript_id "tx-V0R30_00005"
    #    Error Message: Cannot find transcript_id!
    #            STEP5:     Process features of types Gene, CDS, rRNA, tRNA, ncRNA, and tmRNA; For rRNA, tRNA, tmRNA, and ncRNA features, it will add an additional line with exon_id following the pattern described.   with "python3 add_exonid_to_exon_tRNA_rRNA_tmRNA_ncRNA.py 1585_m_.gtf 1585_m__.gtf"
    
    #            STEP6: MANUALLY modify the lines ncRNA (3), tmRNA (1), rRNA (19), tRNA (69): tRNA --> exon, 'gene_biotype "tRNA";';
    #                   with "grep "tRNA" 1585_m_.gtf | wc -l" check they are correctly changed!
    
    #RESULTS:
    #CP143516        biopython       gene    1       1356    .       +       .       gene_id "dnaA"; gene_name "dnaA"; transcript_id "tx-V0R30_00005"; gene_biotype "protein_coding"
    #CP143516        biopython       exon    1       1356    .       +       .       gene_id "dnaA"; gene_name "dnaA"; transcript_id "tx-V0R30_00005"; exon_id "exon-V0R30_00005";
    #...
    
    #/home/jhuang/DATA/Data_Marc_RNA-seq_Sepidermidis/results_1585/genome/rsem
    #grep ">" genome.transcripts.fa  | wc -l
    #2240 only exon level, but not tRNA, rRNA, tmRNA, ncRNA records!
    
    #            STEP7 (Optional) To include also the 92 records in the results, we can change the annotations records of rRNA, tRNA, tmRNA, ncRNA to exon, so that we can also the records!
    #                see 1585_m___.gtf
    #                grep "exon" 1585_m___.gtf | wc -l
    #                2332
    #                grep ">" /home/jhuang/DATA/Data_Marc_RNA-seq_Sepidermidis/results_1585/genome/genome.transcripts.fa | wc -l
    #                grep ">" /home/jhuang/DATA/Data_Marc_RNA-seq_Sepidermidis/results_1585/genome/rsem/genome.transcripts.fa | wc -l
    #                2332
    
    #For example exon-V0R30_00090 is a tRNA.
    #>tx-V0R30_00090
    #GGCTCCTTGGTCAAGCGGTTAAGACACCGCCCTTTCACGGCGGTAACACGGGTTCGAGTCCCGTAGGAGTCACCA
    
    #We can also filter the count number in the downstream analysis from the big matrix!
    #tmRNA: V0R30_09330; ncRNA: V0R30_05655, V0R30_06520, V0R30_11060
    >tx-V0R30_05655
    AGTCATCTGTGGTGTTCGTAAGTTTGCTTTTTATTTGGGCCTAACACTCTTTGATCAGGGAGCCCAATAGGTTTTCTCGCAGCGCACACGCCTCTATAGGAGGACTTGCAAAACGAGAAACAGGGCACCCACCTGTATATAGCAGGCCGAATGATCAAGCTATTTATAACTACGGCATCAACGGACTCTAT
    >tx-V0R30_06520
    TATTTCGGGTAATCGCTATAGCAATATAGAGGAAAGTCCATGCTCACACAATCTGAGATGATTGTAGTGTTCGTGCTTGATGAAACAATAAATCAAGGCATTAATTTGACGGCAATGAAATAACCTAAGTCATTGGATATGGTTAGAATAGTTTGAAAGTGCCACAGTGACGTAGCTTTTATAGAAATATAAAAGGTGGAACGCGGTAAACCCCTCGAGTGAGCAATCCAAATTGGGTAGGAGCACTTGTTTAGCGGAATTCAACGTATAGACGAGACGATTTTTACGCGAAAGTAAAAATATGTAGACAGATGGTTACCACCGACGTACCAGTGTAACTAGTACACATTATGAGTACAACGGAACAGAACATGGCTTACAGAA
    >tx-V0R30_11060
    CCGTGCTAGGTGGGGAGGTAGCGGTTCCCTGTACTCGAAATCCGCTTTATGCGAGACTTAATTCCTTTGTTGAGGGCGTATTTTTGTGAAGTCTGCCCGAAGCACGTAGTGTTTGAAGATTTCGGTCCTATGCAATATGAACCCATGAACCATGTCAGGTCCTGACGGAAGCAGCATTAAGTGGATCCTCATATGTGCCGTAGGGTAGCCGAGATTTAGCTGTCGACTTCGGTAACGTTTATGATATGCGTTCGATACAAAGGTGCATGG
    
    nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results_1585 --fasta 1585.fasta --gtf 1585_m___.gtf -profile test_full -resume --max_memory 200.GB --max_time 2400.h --save_reference --aligner star_salmon   --skip_rseqc --skip_dupradar --skip_preseq --skip_biotype_qc --skip_deseq2_qc --min_mapped_reads 0

The structural proteome of eukaryotic viruses

Original text please find under https://www.biorxiv.org/content/10.1101/2024.01.22.576744v1.full

  1. The structural proteome of eukaryotic viruses /ˈproʊ.ti.oʊm/ (Fig. 1 + Supplementary Fig. 1)

Fig1.jpg Sup_Fig1.jpg

    [ColabFold is advanced version of AlphaFold (https://steineggerlab.com/en/)]
    [Viral protein seqeunces --> structure prediction --> Seqeunce clustering using MMseqs2 --> Structural clustering using Foldseek]
    - Eukaryotic viral protein sequences --> Structure prediction (67715 Strucutures) using Colabfold --> Sequence clustering using MMseqs2 (70% Coverage, 20% identity) --> Structural clustering using Foldseek (70% Coverage and TMscore>=0.4)
    - This dataset includes a large diversity of viruses, including 4,463 species from 132 different viral families (Fig. 1B, C). 
    - Clusters are structurally consistent, as implementing DALI 24 to align cluster representatives to each member for clusters with at least 100 members yields a median cluster-average DALI Z score of 13.1 (Supplementary Fig. 1C). DALI Z-scores above 8 indicate
    two proteins are likely homologous 25.

    - We investigated how well this database represents viral diversity, and if it reconstitutes core viral hallmark genes. 
    - We grouped viral families into viral genome types based on their Baltimore classes26 with slight modifications – DNA viruses were split into large, medium, and small groupings based on their average genome length, while RNA viruses without single-stranded positive- or negative- sense genomes were grouped into the RNA (Other) category. 
    - Large double-stranded DNA (dsDNA) viruses have the most protein clusters per species and, despite constituting only 14 of the 132 viral families in the dataset, account for the majority of viral proteins (Fig. 1D, F). 
    - As expected, protein cluster count correlates strongly with genome size (Supplementary Fig. 1D). 
    - With their larger genomes, dsDNA viruses have the capacity to encode more auxiliary genes without sacrificing genome stability. 
    - RNA viruses make up a large fraction of the families present in the dataset, but a smaller fraction of the total proteins (Fig. 1E, F). 
    - Structural homology between viral families with a similar genome type is common, with large dsDNA viruses sharing many protein folds (Supplementary Fig. 1E).
    - As expected, the predominant protein clusters in the dataset as a whole (Supplementary Fig. 1F) and within each genome type (Fig. 1G) are largely involved in fundamental aspects of the viral life cycle. 
    - These include the single jellyroll fold, which comprises viral capsids ['kæpsid] and is present in viruses of all genome types. 
    - The double jellyroll fold also comprises viral capsids, although it is restricted to dsDNA viruses27. 
    - RNA viral families often encode nucleocapsids, responsible for packaging of viral RNA, and RNA-dependent RNA polymerases (RdRPs) responsible for genome replication. 
    - While the RdRP is universally conserved in RNA viruses, it is split amongst multiple protein clusters due to variation in protein length. 
    - In contrast, small dsDNA viruses such as papillomaviruses and polyomaviruses encode a viral replicase with conserved origin-binding and helicase domains. 
    -! Altogether, we find that our structural database successfully reconstitutes conserved viral proteins across diverse viral subtypes.
    - The pLDDT (predicted Local Distance Difference Test) value is a metric used in protein structure prediction, particularly by tools like AlphaFold, to assess the confidence in the predicted positions of amino acids in a protein structure. pLDDT scores range from 0 to 100, where higher scores indicate higher confidence in the accuracy of the predicted structure at specific positions. Scores above 90 are considered very high confidence, scores between 70 to 90 indicate confidence, scores between 50 to 70 are low confidence, and scores below 50 are very low confidence, suggesting that the model is uncertain about those regions of the protein structure.
    - Viral Protein Clusters ---Foldseek structural comparison---> Clustered Alphafold Database (human proteins)

    #viral protein cluster
    - We next investigated the taxonomic distribution of viral protein clusters. 
    - We conducted structural alignments of viral protein cluster representatives against 2.3 million cluster representatives from the entire Alphafold Database (AFDB)3 (Fig. 1H). 
    - For each virus protein cluster, we determined the last common ancestor of viruses encoding a cluster member. 
    - We found that 29% of protein clusters are present in multiple viral families, the majority of which are present in the Alphafold Database, suggesting that they are evolutionarily ancient (Fig. 1I). 
    - In addition, we found that 62% of viral proteins (or 55% of proteins from non-singleton clusters) are restricted to a single viral family and lack homologs in the AFDB (Fig. 1I). This shows that viral evolution generates substantial numbers of novel proteins that are absent from current structure
    databases.

2.Structural similarities between annotated and uncharacterized viral proteins (Fig. 2+Supplementary Fig. 2)

Fig2.jpg Sup_Fig2.jpg

    - We investigated the ability of structural alignments to identify relationships not apparent from protein sequence alone. 
    - We found that many representatives of sequence clusters are structurally similar despite low sequence similarity (Fig. 2A). --> the structure is more conserved than seqeunce level?
    - In fact, adding structural information to protein clustering efforts leads to more taxonomically diverse protein clusters, with significantly more viral families per cluster (Fig. 2B). : with the structure and seqeunce, we have much more viral families per cluster!
    - This is especially important for finding homology between proteins from divergent viruses, resulting in a substantial increase in protein clusters encompassing proteins encoded by viruses of different genome types (Fig. 2C). 51 vs 23
    - We asked if [structural alignments] can link poorly-annotated sequence clusters with those that are more annotated (Fig. 2D). We used the sequence-based classifier InterProScan28 to assign all proteins Pfam29, CDD30, and TIGRFAM31 classifications. 
    - Sequence clusters contain almost entirely annotated or entirely unannotated members, resulting in a bimodal distribution of sequence clusters (Fig. 2E). 
    - Of the proteins in clusters with more than one member, over 25% ( 5.2/(1.6+3.6+14.6)=0.26 ) of unannotated proteins are located in either an annotated sequence cluster or a protein cluster that contains an annotated sequence cluster (Fig. 2F).
    - Many protein clusters encompass a mixture of annotated and unannotated sequence clusters (Supplementary Fig. 2). Supplementary Fig. 2: Many unannotated proteins have structural homlogy to annotated protein clusters.
    - We find that these connections between sequence clusters are useful to determine putative functions of poorly characterized proteins across the virome. For example, while the single jellyroll fold is the most abundant protein cluster, many members of this cluster are not correctly annotated (Fig. 2G). Many other protein clusters include both annotated and unannotated sequence clusters, including clusters encoding enzymes such as nucleotide-phosphate kinases (Fig. 2H), NUDIX Hydrolases (Fig. 2I), DNA ligases (Fig. 2J), and nucleases (Fig. 2K). 
    - One cluster of note includes members that resemble the UL43 family of late herpesvirus proteins (Fig. 2L), which will be discussed later. 
    - Together, these results demonstrate that large scale clustering based on sequence plus predicted structure enables functional inference of poorly characterized viral proteins.

    #The Dali database is a structural classification based on precomputed all-against-all structural similarities within the PDB.
  1. Structural alignments suggest functions of human pathogen proteins ((Fig. 3)comparisions of viral (our databases) and non-viral proteins (Alphafold Databases) using Foldseek!)

Fig3.jpg

    Foldseek faster than TM-align and Dali!!!!!
    #- Foldseek is a tool designed for fast and sensitive comparison of large sets of protein structures. It uses a novel approach by encoding structures as sequences over a 20-state 3D interaction alphabet, enabling comparisons through sequence alignments. This method significantly speeds up structural comparisons, making Foldseek much faster than traditional structural aligners like TM-align and Dali, while maintaining high sensitivity. Foldseek is available as open-source software and also through a webserver, facilitating rapid and accurate protein structure searches
    #- Using Dali for structural comparison of proteins 

    - Unlike nucleotide or protein sequence, structural features are often conserved over large evolutionary timescales. 
    - Thus, we investigated if alignment between predicted viral and non-viral protein structures can offer insight into the function of poorly-annotated proteins encoded by human pathogens.
    - To do this, we used Foldseek (for comparisisons) to align [our virus protein structure database!!!!] with the initial release of the Alphafold Database, which contains over 300,000 proteins from 21 organisms across eukaryotes, bacteria, and archaea2 (Fig. 3A). 
    - This revealed pervasive structural homology between viral and non-viral proteins, with high structural similarity in the face of low amino acid identity (Fig. 3B).
    - Ultimately, 14,531 predicted viral proteins have an alignment to a member of the Alphafold database, with the majority of alignments being against proteins encoded by eukaryotes (Fig. 3C). 
    - These alignments include proteins that are unannotated but are encoded by human pathogens. 
    - To reduce rates of false negatives, we conducted a series of alignments using DALI24, which is slower than Foldseek but substantially more sensitive. 
    - First, we found that a set of proteins encoded by poxviruses are structurally similar to the auto-inhibitory domain of mammalian gasdermins (Fig. 3D)32. 
    -* In the context of "auto-inhibitory domain of mammalian gasdermins," the term "domain" refers to a specific part of a protein that has a distinct structure and function. In proteins like gasdermins, which are involved in cell death pathways, an auto-inhibitory domain acts as a regulatory segment. This domain can prevent the protein from acting until certain conditions are met, effectively inhibiting the protein's activity to regulate processes such as inflammation and cell death until it is appropriate for the protein to become active.

    - Similarly, several poxvirus proteins are structurally homologous to the human galactosyltransferase COLGALT1, thought to enable virus binding to
    surface glycosaminoglycans during viral entry (Fig. 3E)33. 
    - Next, we found that human herpesviruses proteins, including the protein BMRF2 from Epstein Barr herpesvirus (EBV) and Varicella zoster virus (VZV), share structural similarity with the human equilibrative nucleoside transporter ENT4 (Fig. 3F). 
    - EBV conducts substantial remodeling of host cell metabolism during viral infection34, and this finding suggests a potential metabolic role in addition to BMRF2 involvement in viral attachment35. 
    - In addition, transport of antiviral nucleoside analogues such as valacyclovir are mediated by nucleoside transporters36,37, raising questions about the
    interplay between this protein and valacyclovir during VZV infection. These proteins belong to a cluster of proteins similar to the UL43 family of late herpesvirus proteins, some of which are unannotated (Fig. 2J). 
    - In addition, we observed structural homology of Poxvirus C4-like proteins with eukaryotic dioxygenases (Fig. 3G). Vaccinia virus C4 is notable for antagonizing several innate immune pathways. C4 directly binds the pattern recognition receptor DNA-PK, blocking DNA binding and immune signaling through that pathway38. In addition, C4 inhibits NF-κB signaling downstream at or downstream of the IKK complex, but the mechanism of this inhibition is unknown39. Future studies are required to determine if its dioxygenase-like fold is involved in its innate immune antagonism. 
    - Altogether, these findings illustrate the ubiquity of structural homology between viral and non-viral proteins and show that this homology can be used to predict potential functions of poorly characterized viral proteins.
  1. Horizontal gene transfer creates taxonomically-diverse protein clusters (Supplementary Fig. 3)

Sup_Fig3.jpg

    -* What does "domain" in protein structure mean? While alpha-helices and beta-sheets are elements of secondary structure within a protein, a domain is a higher order of structure that often consists of multiple secondary structure elements arranged in a specific configuration. Domains can serve various functions, such as catalytic activity, binding to other molecules, or regulatory roles, and a single protein can contain multiple domains each performing different functions.

    - While we found that some protein clusters contain members encoded by viruses of different genome types, the evolutionary origin of such conservation is unclear. 
    - Many of these protein clusters are predominantly encoded by viruses of a single genome type but expressed in a small minority of viruses of a different genome type (Supplementary Fig. 3A). 
    - This observation is consistent with virus-virus or host-virus horizontal gene transfer.
    - To explore this possibility, we conducted Blast40 searches of sequence cluster representatives against viral- and non-viral protein databases and constructed phylogenetic trees of the top hits. 
    - We found that nucleoside-phosphate kinases in cluster 28 show a polyphyletic distribution with homologs in different viruses showing amino acid similarity to distinct sets of non-viral proteins (Supplementary Fig. 3B). Homalodisca vitripennis reovirus
    - There is a similar pattern with HrpA/B-like helicases in Cluster 55, with helicases in different viral families showing amino acid similarity to distinct sets of non-viral organisms (Supplementary Fig. 3C). 
    - These patterns are consistent with horizontal gene transfer from non-viral hosts. 
    - In contrast, other taxonomically distributed protein clusters such as cluster 56 (encoding parvovirus Rep proteins with homologs in some human herpesviruses) and cluster 735 (encoding a hemagglutinin lineage present in baculoviruses and some orthomyxoviruses) display a monophyletic taxonomic distribution consistent with horizontal gene transfer between viruses (Supplementary Fig. 3D, E). 
    - These data suggest that many protein clusters that contain proteins from viruses of different genome types arise from horizontal gene transfer, both from viral and non-viral sources.
  1. Structural alignments identify shared functional domains (Supplementary Fig. 4)

Sup_Fig4.jpg

    - We constructed protein clusters with a strict 70% coverage requirement, leaving open the possibility that individual domains can be identified through structure comparison3. 
    - We reasoned that protein domains present within multiple protein clusters may have particular biological importance. 
    - We used DALI to conduct all-by-all alignments of the representatives of all protein clusters having more than one member. 
    - This revealed substantial protein similarity with many alignments having Z scores greater than 8, indicating high confidence of structural homology25
    (Supplementary Fig. 4A). 
    - Protein clusters ultimately fall into a network of shared domains (Supplementary Fig. 4B). 
    - Here, distinct domains are often shared across protein clusters in context with various combinations of other domains, which can be seen with domains involved in interaction with the cytoskeleton (Supplementary Fig. 4C) and in metabolism (Supplementary Fig. 4D,E) in eukaryotic viruses and phage.
  1. Structural homology reveals phosphodiesterases that degrade 2’3’ cGAMP (e.g. LigT-like phophodiesterases (Fig. 4 + Supplementary Fig. 5))

Fig4.jpg Sup_Fig5.jpg

    Many aspects of eukaryotic and prokaryotic immunity have a shared origin41. One set of related
    pathways are the mammalian cGAS-STING and OAS pathways and prokaryotic
    Cyclic-oligonucleotide-based anti-phage signaling systems (CBASS). In both cases, a protein
    sensor detects a viral cue and generates a nucleotide second messenger, which activates a
    downstream antiviral effector (Fig. 4A). In the case of the cGAS (cyclic GMP-AMP synthase)
    pathway, cGAS recognizes cytoplasmic double-stranded DNA and generates 2’3’ cyclic
    GMP-AMP (2’3’ cGAMP). Many cGAS/DncV-like nucleotidyltransferases (CD-NTases) in
    prokaryotic CBASS’ make a similar second messenger, 3’3’ cGAMP, in response to viral cues42.
    In contrast, OAS (oligoadenylate synthase) recognizes double-stranded RNA and generates
    linear 2’5’ oligoadenylates (2’5’ OA)43,44. In prokaryotes, phage T4 encodes the ligT-like PDE
    anti-CBASS protein 1 (Acb1), which degrades 3’3’ cGAMP and a variety of other cyclic
    nucleotide substrates including 2’3’ cGAMP45.
    In eukaryotes, several RNA viruses encode PDEs that degrade 2’5’ OA46. Interestingly,
    we find that these PDEs have a ligT-like fold similar to Acb1. Given the conserved use of
    ligT-like PDEs in viral anti-immunity, we investigated their distribution and phylogeny. Structural
    searches revealed many different branches of ligT-like PDEs are present in eukaryotic viruses
    (Fig. 4B). Notably, there are multiple independent branches of ligT-like PDEs in RNA viruses.
    Linage A betacoronaviruses and Toroviruses share a clade of PDEs that is similar to the PDEs
    present in Rotaviruses. Surprisingly, lineage C betacoronaviruses contain a distinct branch of
    PDEs (Fig. 4A)47. This suggests that there were two independent PDE acquisition events within
    the betacoronavirus genus, showing the strong selective pressure for betacoronaviruses to
    evade the OAS pathway. We find that some large DNA viruses also contain ligT-like PDEs.
    Despite the extreme amino acid variability across the ligT-like PDE tree there is near-universal
    conservation of the two catalytic histidines (Fig. 4C), with the exception of the Mimivirus ligT-like
    branch.
    The presence of ligT-like PDEs in large DNA viruses raises the question of whether they
    have an anti-immune function. While the RNA-sensing OAS pathway is commonly targeted by
    ligT-like PDEs of RNA viruses, there is likely less pressure for large DNA viruses to target OAS.
    Thus, we tested whether ligT-like PDEs encoded by large DNA viruses have activity against 2’3’
    cGAMP. To address this question, we generated a synthetic STING circuit in 293T cells48,49 (Fig.
    4D). Here, STING can be activated by treatment with 2’3’ cGAMP or the non-nucleotide STING
    agonist diABZI50, which will lead to expression of firefly luciferase in a STING-dependent
    manner. We expect that a viral PDE that targets 2’3’ cGAMP should be able to inhibit 2’3’
    cGAMP- but not diABZI-mediated STING activity. Testing representative PDEs from each
    branch revealed that while PDEs encoded by RNA viruses and other large DNA viruses have
    only limited activity against 2’3’ cGAMP, PDEs encoded by avian poxviruses have very potent
    activity against 2’3’ cGAMP (Supplementary Fig. 5A). We found that the ligT-like PDEs encoded
    by Pigeonpox and Canarypox very potently restrict STING signaling stimulated by 2’3’ cGAMP
    but have limited activity against diABZI-mediated STING signaling (Fig. 4E). Furthermore,
    mutation of the catalytic histidines substantially reduces activity (Fig. 4E). Avian poxvirues are
    notable for their lack of Poxin51, the other 2’3’ cGAMP phosphodiesterase encoded by
    poxviruses, showing the strong selective pressure for poxviruses to evade cGAS-STING
    immunity. Altogether, we have leveraged structure homology to discover a novel mechanism of
    2’3’ cGAMP degradation by eukaryotic viruses and find that cGAMP targeting by ligT-like PDEs
    is a pan-viral mechanism of anti-immunity.

BCR (B-cell receptor) and TCR (T-cell receptor) sequencing

BCR (B-cell receptor) sequencing and TCR (T-cell receptor) sequencing are molecular techniques used to analyze the genetic diversity of B-cell and T-cell receptors within an immune response. These receptors are crucial for the adaptive immune system’s ability to recognize and bind to specific antigens (foreign substances or pathogens), leading to an immune response. Each B-cell or T-cell has a unique receptor that recognizes a specific antigen, and the vast diversity of these receptors allows the immune system to respond to a wide range of pathogens. BCR Sequencing

BCR sequencing focuses on the variable regions of the B-cell receptor, which determine its antigen specificity. BCRs are membrane-bound antibodies produced by B cells. Through BCR sequencing, researchers can identify the diversity of B-cell populations and understand their role in immune responses, including those against infections, in autoimmune diseases, and in cancer. It also helps in studying the clonal expansion of B cells and the development of antibody repertoires in response to vaccines or infections. TCR Sequencing

TCR sequencing analyzes the genetic sequences encoding the T-cell receptor. T cells are critical for cell-mediated immunity, recognizing peptides presented by the major histocompatibility complex (MHC) on the surface of other cells. TCR sequencing allows for the examination of the diversity and clonality of T-cell populations within a sample. This is important for understanding T-cell responses in various contexts, including infection, autoimmunity, vaccine responses, and tumor immunity.

B cells and T cells are essential components of the adaptive immune system, both equipped with receptors to recognize antigens. However, they serve different functions based on their receptor types and the mechanisms through which they mediate immune responses:

  • B Cells and B-cell Receptors (BCRs)

    • Antibody Production: Once activated, B cells can differentiate into plasma cells that produce antibodies. These antibodies are specific to the antigen that stimulated the B cell and can directly neutralize pathogens or tag them for destruction by other parts of the immune system.
    • Antigen Presentation: B cells can also act as antigen-presenting cells (APCs). They process antigens and present them on their surface to T cells, helping to activate T cells.
    • Memory: Some B cells become memory B cells after an infection, providing long-term immunity by responding more rapidly and effectively upon subsequent exposures to the same antigen.
  • T Cells and T-cell Receptors (TCRs)

    • Helper T Cells (CD4+ T Cells): These cells assist the immune response by secreting cytokines that regulate the activity of other immune cells, including B cells, cytotoxic T cells, and macrophages. They play a pivotal role in activating and directing the adaptive immune response.
    • Cytotoxic T Cells (CD8+ T Cells): These cells can directly kill cells infected by pathogens, such as virus-infected cells and cancer cells, by recognizing antigens presented on the surface of these cells.
    • Memory T Cells: After an immune response, some T cells become memory T cells, providing long-term protection by responding more quickly and effectively to subsequent exposures to the same antigen.

In summary, B cells primarily function by producing antibodies to neutralize foreign invaders, while T cells operate through direct cell killing and regulating the activities of other immune cells. The cooperation between these two cell types is crucial for mounting an effective immune response.

B细胞和T细胞都是适应性免疫系统的重要组成部分,都配备有识别抗原的受体。然而,它们基于其受体类型和介导免疫反应的机制具有不同的功能:

  • B细胞和B细胞受体(BCRs)

    • 抗体产生:激活后,B细胞可以分化成产生抗体的浆细胞。抗体特异性地针对刺激B细胞的抗原,它们可以直接中和病原体或将它们标记以供免疫系统的其他部分销毁。
    • 抗原呈递:B细胞也可以作为抗原呈递细胞(APCs)的功能。它们处理抗原并将其呈递在其表面给T细胞,帮助激活T细胞。
    • 记忆:一些B细胞在感染后成为记忆B细胞,通过对同一抗原的后续暴露反应更快、更有效提供长期免疫。
  • T细胞和T细胞受体(TCRs)

    • 辅助T细胞(CD4+ T细胞):这些细胞通过分泌细胞因子来支持免疫反应,这些细胞因子调节包括B细胞、细胞毒性T细胞和巨噬细胞在内的其他免疫细胞的活动。它们在激活和指导适应性免疫反应中起着关键作用。
    • 细胞毒性T细胞(CD8+ T细胞):这些细胞能够直接杀死被病原体感染的细胞,如病毒感染的细胞和癌细胞。它们通过识别这些细胞表面的抗原来实现这一点。
    • 记忆T细胞:在免疫反应后,一些T细胞变成记忆T细胞,为体提供长期保护,对同一抗原的再次暴露作出更快速和更有效的反应。

总的来说,B细胞主要通过产生抗体来中和外来入侵者,而T细胞则通过直接细胞杀伤和调节免疫反应的其他细胞来发挥作用。这两种细胞类型的合作对于形成有效的免疫反应至关重要。

Differences Between Molecular and Genomic Biomarkers

The terms “molecular biomarker” and “genomic biomarker” refer to specific types of biomarkers used in the field of biomedicine for diagnosis, prognosis, and monitoring the response to treatment, among other applications. Both fall under the broader category of biomarkers, which are measurable indicators of some biological state or condition, but they differ in what they measure and represent.

  • Molecular Biomarker

    A molecular biomarker refers to any molecule measured in tissues, cells, or bodily fluids that is a sign of a normal or abnormal process, or of a condition or disease. Molecular biomarkers can include a wide range of molecule types, such as:

    • Proteins/peptides: Changes in expression levels, modifications, or activity can serve as markers.
    • Lipids: Variations in lipid profiles can indicate metabolic states or diseases.
    • Metabolites: Small molecule byproducts of metabolic processes can indicate physiological or pathological states.
    • RNA: Variations in RNA expression levels, including mRNA and non-coding RNAs (like miRNA), can reflect gene expression changes in response to a disease or treatment.

      Molecular biomarkers thus encompass a broad range of biological molecules that reflect the state of biological processes or diseases.

  • Genomic Biomarker

    A genomic biomarker specifically refers to a DNA sequence or genetic variation that provides information on an individual’s characteristics, condition, disease risk, or response to treatment. Genomic biomarkers include:

    • Single nucleotide polymorphisms (SNPs): Single base-pair variations in the genome that can be associated with disease risk, drug response, or other traits.
    • Gene mutations: Alterations in DNA sequence that can be benign, or lead to disease.
    • Copy number variations (CNVs): Changes in the number of copies of a particular gene, which can affect gene expression levels and contribute to various diseases.
    • Epigenetic markers: Changes in gene expression or cellular phenotype caused by mechanisms other than changes in the DNA sequence itself, such as DNA methylation and histone modification.

      Genomic biomarkers are directly related to an individual’s genetic information and provide insights into genetic predispositions, disease risks, and how an individual might respond to a drug or a treatment based on their genetic makeup.

The main difference between molecular and genomic biomarkers lies in their nature and the level at which they operate. Molecular biomarkers can be any type of molecule that indicates something about the state of the organism, often reflecting the effect of genetic variations, environmental factors, and their interactions. Genomic biomarkers, on the other hand, are specifically related to the genetic information (DNA or epigenetic modifications) of an individual, providing insights into genetic predispositions and the potential for disease or response to treatment. Both types of biomarkers are crucial for personalized medicine, allowing for tailored treatments based on individual molecular and genetic profiles.

Bioinformatics Screening of Potential Biomarkers from mRNA Expression Profiles to Discover Drug Targets and Agents for Cervical Cancer https://www.mdpi.com/1422-0067/23/7/3968

Preparing a GTF file from GenBank for bacterial RNA-seq analysis, using the example of WA

Preparing a GTF file from GenBank for bacterial RNA-seq analysis, using the example of WA

  1. In the example Yersinia enterocolitica strain WA

    #[IMPORTANT FINAL STEPS]
    [STEP1] genbank2fasta.py WA.gb
    [STEP2] python3 ~/Scripts/genbank_to_gtf.py WA.gb WA.gtf  #(1col or 3cols)
    [STEP3] python3 ~/Scripts/modify_gtf.py WA.gtf WA_m.gtf   #(1col or 3cols) "CDS" --> "exon"
  2. In the example of 1585 wildtype genome WA_m.gtf

    #Ziel all ncRNA, rRNA, tmRNA, tRNA as exon, the original types set as gene_biotype "protein_coding"; gene_biotype "tRNA"; ...
    #cut -d$'\t' -f3 WA_m.gtf | sort -u
    #exon
    #gene
    #rRNA
    #tRNA
    #Gene; CDS (renamed exon=4121); rRNA (22); tRNA (84); ncRNA
    [STEP4] replace transcript_id " --> transcript_id "tx- under kate
    [STEP5] python3 ~/Scripts/add_geneid_genename_genebiotype.py WA_m.gtf WA_m_.gtf  #(4cols+3cols) ergänzen all lines as gene_id; gene_name; transcript_id; and add all lines annotated with "gene" at the end with 'gene_biotype "protein_coding";'
        #NO ERROR in WA case.
        #DEBUG ERROR Stop at line : CP143516       biopython       exon    1       1356    .       +       .       gene_id "dnaA"; gene_name "dnaA"; transcript_id "tx-V0R30_00005"
        #    Error Message: Cannot find transcript_id!
    [STEP6] python3 ~/Scripts/add_exonid_to_exon_tRNA_rRNA_tmRNA_ncRNA.py WA_m_.gtf WA_m__.gtf  # (4cols+4cols) Process features of types Gene, CDS, rRNA, tRNA, ncRNA, and tmRNA; For rRNA, tRNA, tmRNA, and ncRNA features, it will add an additional line with exon_id following the pattern described.
    [STEP7] cp WA_m__.gtf WA_m___.gtf; MANUALLY modify the lines tRNA --> exon and 'gene_biotype "tRNA";' 
          #for ncRNA (3), tmRNA (1), rRNA (19), tRNA (69) in 1585_m___.gtf
          #for rRNA (22), tRNA (84) in WA_m___.gtf                            
          #grep "gene_biotype \"tRNA\"" WA_m___.gtf | wc -l  #CHECK they are correctly changed!
    
    [RESULTS]
    #CP143516        biopython       gene    1       1356    .       +       .       gene_id "dnaA"; gene_name "dnaA"; transcript_id "tx-V0R30_00005"; gene_biotype "protein_coding"
    #CP143516        biopython       exon    1       1356    .       +       .       gene_id "dnaA"; gene_name "dnaA"; transcript_id "tx-V0R30_00005"; exon_id "exon-V0R30_00005";
    #...
    #[Intermediate RESULTS]
    #/home/jhuang/DATA/Data_Marc_RNA-seq_Sepidermidis/results_1585/genome/rsem
    #grep ">" genome.transcripts.fa  | wc -l
    #2240 only exon level, but not tRNA, rRNA, tmRNA, ncRNA records!
    
    [STEP8 (Optional, not used in the example)] To include also the 92 records in the results, we can change the annotations records of rRNA, tRNA, tmRNA, ncRNA to exon, so that we can also the records!
    #                see WA_m___.gtf
    #                grep "exon" WA_m___.gtf | wc -l
    #                2332
    #                grep ">" /home/jhuang/DATA/Data_Marc_RNA-seq_Sepidermidis/results_1585/genome/genome.transcripts.fa | wc -l
    #                grep ">" /home/jhuang/DATA/Data_Marc_RNA-seq_Sepidermidis/results_1585/genome/rsem/genome.transcripts.fa | wc -l
    #                2332
    #                4227
    #For example exon-V0R30_00090 is a tRNA.
    #>tx-V0R30_00090
    #GGCTCCTTGGTCAAGCGGTTAAGACACCGCCCTTTCACGGCGGTAACACGGGTTCGAGTCCCGTAGGAGTCACCA
    #We can also filter the count number in the downstream analysis from the big matrix!
    #tmRNA: V0R30_09330; ncRNA: V0R30_05655, V0R30_06520, V0R30_11060
    >tx-V0R30_05655
    AGTCATCTGTGGTGTTCGTAAGTTTGCTTTTTATTTGGGCCTAACACTCTTTGATCAGGGAGCCCAATAGGTTTTCTCGCAGCGCACACGCCTCTATAGGAGGACTTGCAAAACGAGAAACAGGGCACCCACCTGTATATAGCAGGCCGAATGATCAAGCTATTTATAACTACGGCATCAACGGACTCTAT
    >tx-V0R30_06520
    TATTTCGGGTAATCGCTATAGCAATATAGAGGAAAGTCCATGCTCACACAATCTGAGATGATTGTAGTGTTCGTGCTTGATGAAACAATAAATCAAGGCATTAATTTGACGGCAATGAAATAACCTAAGTCATTGGATATGGTTAGAATAGTTTGAAAGTGCCACAGTGACGTAGCTTTTATAGAAATATAAAAGGTGGAACGCGGTAAACCCCTCGAGTGAGCAATCCAAATTGGGTAGGAGCACTTGTTTAGCGGAATTCAACGTATAGACGAGACGATTTTTACGCGAAAGTAAAAATATGTAGACAGATGGTTACCACCGACGTACCAGTGTAACTAGTACACATTATGAGTACAACGGAACAGAACATGGCTTACAGAA
    >tx-V0R30_11060
    CCGTGCTAGGTGGGGAGGTAGCGGTTCCCTGTACTCGAAATCCGCTTTATGCGAGACTTAATTCCTTTGTTGAGGGCGTATTTTTGTGAAGTCTGCCCGAAGCACGTAGTGTTTGAAGATTTCGGTCCTATGCAATATGAACCCATGAACCATGTCAGGTCCTGACGGAAGCAGCATTAAGTGGATCCTCATATGTGCCGTAGGGTAGCCGAGATTTAGCTGTCGACTTCGGTAACGTTTATGATATGCGTTCGATACAAAGGTGCATGG
  3. nextflow running

    nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results_WA --fasta WA.fasta --gtf WA_m___.gtf -profile test_full -resume --max_memory 126.GB --max_time 2400.h --save_reference --aligner star_salmon   --skip_rseqc --skip_dupradar --skip_preseq --skip_biotype_qc --skip_deseq2_qc --min_mapped_reads 0
    
    nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results_GRCh38 --genome GRCh38   -profile test_full -resume --max_memory 126.GB --max_time 2400.h --save_reference --aligner star_salmon  --skip_deseq2_qc
  4. [TODO] generate 1585 Excel table, we need DNA-sequence, we can use the DNA seqeunce from results_WA/genome/genome.transcripts.fa in the gb_to_excel.py

    python3 ~/Scripts/gb_to_excel.py 1585.gb 1585.xlsx
    #insert C./P. (Chrom or Plasmid1, Plasmid2, Plasmid3) as the second chrom in the 1585.xlsx.
  5. GO enrichment analysis of non-model bacterial RNA-seq_Sepidermidis

    ~/Scripts/1_filter_interproscan_out.py
    ~/Scripts/2_split_proteins_with_and_without_GO_terms3.py
    ~/Scripts/3_generate_fasta_not_annotated_by_interpro.py
    
    1. Use Custom Annotations for Enrichment Analysis
    
    For GO term enrichment analysis without standard database support, you can create a custom set of annotations if you have the necessary GO term associations for your genes. This involves manually mapping your genes to GO terms based on your experimental data or literature research.
    2. GO Term Enrichment without biomaRt
    
    If you already have GO term associations for your genes or can obtain them through other means, you can proceed with GO term enrichment analysis using clusterProfiler or similar tools by creating a custom OrgDb or using the enricher function directly with your custom GO term mappings.
    
    Here’s a simplified example using clusterProfiler with custom GO term mappings:
    
    library(clusterProfiler)
    
    # Assuming you have a named vector of genes and corresponding GO terms
    genes <- c("gene1", "gene2", "gene3", ...) # Your gene identifiers
    go_terms <- c("GO:0008150", "GO:0006355", "GO:0004672", ...) # Corresponding GO terms for these genes
    
    # Create a named list where names are GO terms and elements are vectors of gene IDs associated with each GO term
    go_list <- split(genes, go_terms)
    
    # Your list of interesting genes (e.g., those differentially expressed)
    interesting_genes <- c("gene1", "gene2", ...)
    
    # Perform enrichment analysis
    enrichment_results <- enricher(interesting_genes,
                                  TERM2GENE = go_list,
                                  pAdjustMethod = "BH",
                                  pvalueCutoff = 0.05,
                                  qvalueCutoff = 0.2)
    
    # View the enrichment results
    print(enrichment_results)
    
    #Install and Run InterProScan.
    #https://interproscan-docs.readthedocs.io/en/latest/HowToRun.html
    python3 setup.py -f interproscan.properties
    ./interproscan.sh -i 1585_CDS.fasta -f tsv -goterms -pa
    
    #InterProScan's remote option also supports additional output formats and analyses, such as producing graphical overviews (-svg for SVG output) or specific types of analyses only (-appl TIGRFAM,Pfam to limit the search to certain databases). Check the InterProScan documentation for a full list of options and more detailed explanations.
    
    https://interproscan-docs.readthedocs.io/en/latest/UserDocs.html#obtaining-a-copy-of-interproscan
    
    -dp,--disable-precalc                                     Optional.  Disables use of the precalculated match lookup
                                                              service.  All match calculations will be run locally.
    
    InterProScan test run
    
    This distribution of InterProScan provides a set of protein test sequences, which you can use to check how InterProScan behaves on your system. First, if you have not yet run the initialisation script run the following command:
    
    python3 setup.py -f interproscan.properties
    
    This command will press and index the hmm models to prepare them into a format used by hmmscan. This command need only be run once.
    
    You can then run the following two test case commands:
    
    ./interproscan.sh -i test_all_appl.fasta -f tsv -dp
    ./interproscan.sh -i test_all_appl.fasta -f tsv
    
    ./interproscan.sh -i 1585_CDS_test.fasta -f tsv -goterms -pa
    #./interproscan.sh -i 1585_CDS_test.fasta -f TSV,XML -goterms
    
    awk -F'\t' '{if(NR>1) print $1, $14}' 1585_CDS_test.fasta.tsv1 > filtered_GO_terms.txt
    
    #4d_vs_15m-up.id
    #4d_vs_15m-down.id
    
    # Install and load required packages
    #install.packages("rentrez")
    #install.packages("KEGGREST")
    #install.packages("dplyr") # for data manipulation
    #install.packages("ggplot2") # for visualization
    
    library(rentrez)
    library(KEGGREST)
    library(dplyr)
    library(ggplot2)
    
    # Read GenBank file
    genbank_file <- "../../../1585.gb"
    genbank_data <- readLines(genbank_file)
    
    # Extract accession number from the GenBank file
    accession <- grep("^ACCESSION", genbank_data, value = TRUE)
    accession <- gsub("^ACCESSION\\s+", "", accession)
    accession <- trimws(accession)
    
    # Function to retrieve sequence information from GenBank file
    get_sequence_from_genbank <- function(genbank_file) {
      # Read GenBank file
      genbank_data <- readLines(genbank_file)
    
      # Find sequence lines starting with "ORIGIN"
      origin_line <- grep("^ORIGIN", genbank_data)
    
      # Extract sequence lines
      sequence_lines <- genbank_data[(origin_line + 1):length(genbank_data)]
    
      # Remove non-sequence characters and concatenate into a single sequence
      sequence <- gsub("[0-9 ]", "", sequence_lines)
      sequence <- paste(sequence, collapse = "")
    
      return(sequence)
    }
    
    # GenBank file
    genbank_file <- "1585.gb"
    
    # Retrieve sequence from GenBank file
    sequence <- get_sequence_from_genbank(genbank_file)
    
    # Perform KEGG pathway enrichment analysis
    kegg_pathways <- keggList("pathway")
    
    # Toy example: Randomly select 100 genes and map to KEGG pathways
    set.seed(123)
    genes <- sample(1:1000, 100)
    
    # Mapping genes to KEGG pathways
    mapped_pathways <- sapply(genes, function(gene) {
      kegg_gene <- keggGet("link", gene)
      if (length(kegg_gene) > 0) {
        pathway_ids <- kegg_gene$ENTRY %>% unique()
        pathway_names <- kegg_pathways[kegg_pathways$ID %in% pathway_ids, "NAME"]
        if (length(pathway_names) > 0) {
          return(paste(pathway_names, collapse = "; "))
        } else {
          return(NA)
        }
      } else {
        return(NA)
      }
    })
    
    # Filter out NAs
    mapped_pathways <- mapped_pathways[!is.na(mapped_pathways)]
    
    # Count pathway occurrences
    pathway_counts <- table(unlist(strsplit(mapped_pathways, "; ")))
    
    # Sort pathways by occurrence
    pathway_counts <- sort(pathway_counts, decreasing = TRUE)
    
    # Top 10 enriched pathways
    top_pathways <- head(pathway_counts, 10)
    
    # Plot the top enriched pathways
    barplot(top_pathways, main = "Top 10 Enriched KEGG Pathways",
            xlab = "Pathway", ylab = "Frequency")