Host and Viral Temporal Transcriptomics of VZV Infection in Human Skin Organoids (Data_Nina_RNAseq_2026)

draw_3D.py.zip

complete_deg_pipeline.R

  1. Preparing samplesheet.csv

         sample,fastq_1,fastq_2,strandedness
         control_r1,./20260527_AV243904_0078_B/01_negative_34_170_34_171_R1.fastq.gz,./20260527_AV243904_0078_B/01_negative_34_170_34_171_R2.fastq.gz,auto
         control_r2,./20260527_AV243904_0078_B/02_negative_34_182_34_206_R1.fastq.gz,./20260527_AV243904_0078_B/02_negative_34_182_34_206_R2.fastq.gz,auto
         control_r3,./20260527_AV243904_0078_B/03_negative_34_216_34_218_R1.fastq.gz,./20260527_AV243904_0078_B/03_negative_34_216_34_218_R2.fastq.gz,auto
         VZV.d10_r1,./20260527_AV243904_0078_B/04_VZV_d10_34_11_34_12_R1.fastq.gz,./20260527_AV243904_0078_B/04_VZV_d10_34_11_34_12_R2.fastq.gz,auto
         VZV.d10_r2,./20260527_AV243904_0078_B/05_VZV_d10_34_20_34_23_R1.fastq.gz,./20260527_AV243904_0078_B/05_VZV_d10_34_20_34_23_R2.fastq.gz,auto
         VZV.d10_r3,./20260527_AV243904_0078_B/06_VZV_d10_34_25_34_29_R1.fastq.gz,./20260527_AV243904_0078_B/06_VZV_d10_34_25_34_29_R2.fastq.gz,auto
         VZV.d15_r1,./20260527_AV243904_0078_B/07_VZV_d15_34_30_34_31_R1.fastq.gz,./20260527_AV243904_0078_B/07_VZV_d15_34_30_34_31_R2.fastq.gz,auto
         VZV.d15_r2,./20260527_AV243904_0078_B/08_VZV_d15_34_35_34_39_R1.fastq.gz,./20260527_AV243904_0078_B/08_VZV_d15_34_35_34_39_R2.fastq.gz,auto
         VZV.d15_r3,./20260527_AV243904_0078_B/09_VZV_d15_34_56_34_60_R1.fastq.gz,./20260527_AV243904_0078_B/09_VZV_d15_34_56_34_60_R2.fastq.gz,auto
         VZV.d20_r1,./20260527_AV243904_0078_B/10_VZV_d20_34_71_34_85_R1.fastq.gz,./20260527_AV243904_0078_B/10_VZV_d20_34_71_34_85_R2.fastq.gz,auto
         VZV.d20_r2,./20260527_AV243904_0078_B/11_VZV_d20_34_88_34_91_R1.fastq.gz,./20260527_AV243904_0078_B/11_VZV_d20_34_88_34_91_R2.fastq.gz,auto
         VZV.d20_r3,./20260527_AV243904_0078_B/12_VZV_d20_34_94_34_100_R1.fastq.gz,./20260527_AV243904_0078_B/12_VZV_d20_34_94_34_100_R2.fastq.gz,auto
  2. DO NOT RUN Trimmomatic before Nextflow.

         It is highly recommended to let the nf-core/rnaseq pipeline handle all trimming internally. Pre-trimming your data can break the pipeline's internal file tracking, interfere with the UMI extraction step, and make reproducibility harder.
         The nf-core/rnaseq pipeline has built-in support for both UMI extraction and read trimming, and it executes them in the correct order: it will extract the UMIs from the raw Read 1 first, and then trim the error-prone bases from Read 2.
         Here is how you should adjust your approach and your Nextflow command.
    
         不应该在运行 Nextflow 之前手动使用 Trimmomatic。
         强烈建议让 nf-core/rnaseq 流程在内部处理所有的修剪(trimming)工作。提前手动修剪数据可能会破坏流程内部的文件追踪机制,干扰 UMI 的提取步骤,并降低结果的可重复性。
         nf-core/rnaseq 流程内置了对 UMI 提取和 reads 修剪的支持,并且会按照正确的顺序执行:它会首先从原始的 Read 1 中提取 UMI,然后再修剪 Read 2 中容易出错的碱基。
  3. Nextflow running

         #yes, exactly, the Reads contain UMI: the first 12 nt of Read1.
         #The CORALL v2 Kit from Lexogen was used for library prep. According to the manufacturer it can additionally be beneficial for the mapping to trim the first 10 nt of Read2 as well, as this site is more error prone (due to priming they say).
         #If you need more information from my side, let me know!
    
         方法一:正则表达式法 (Regex) —— 你目前选择的方案
         根据你提供的文档,使用 regex 方法时,必须使用特定的命名捕获组。文档中明确指出:
         "The expected groups in the regex are: umi_n = UMI positions, where n can be any value (required)"
         这意味着,捕获 UMI 的组名必须是 umi_ 加上一个数字(例如 umi_1),而不能只是 umi。
         正确的正则表达式:'^(?P
    .{12}).*’ 原理解释: ^:匹配字符串的开头。 (?P .{12}):捕获前 12 个任意字符,并将其命名为 umi_1(这将作为 UMI 被提取并添加到 read name 中)。 .*:匹配并保留剩余的序列(文档说明:未被命名为 discard 的其他部分会被重新附加到 read 序列上)。 方法二:字符串法 (String) —— 更简单直观的替代方案 文档中提到: “string: This should be used where the barcodes are always in the same place in the read. N = UMI position” 因为你的 UMI 长度固定(12nt)且位置固定(永远在 Read 1 开头),你完全可以使用更简单的 string 方法,用 N 来代表 UMI 的位置。 对应的 Nextflow 参数: –umitools_extract_method string –umitools_bc_pattern NNNNNNNNNNNN (正好 12 个 N) # OLD_COMMANDS_DEPRECATED: under sage in the early running # ln -s /home/jhuang/Tools/nf-core-rnaseq-3.12.0/ rnaseq # nextflow run rnaseq/main.nf –input samplesheet.csv –outdir results_GRCh38 –genome GRCh38 –with_umi –umitools_extract_method regex –umitools_bc_pattern ‘^(?P.{12}).*’ -profile docker -resume –max_cpus 54 –max_memory 120.GB –max_time 2400.h –save_align_intermeds –save_unaligned –save_reference –aligner star_salmon –pseudo_aligner salmon –umitools_grouping_method unique # #Debug the following error: added “–minAssignedFrags 0 \\” to modules/nf-core/salmon/quant/main.nf option “salmon quant” and added “–min_mapped_reads 0” in the nextflow command #nextflow run nf-core/rnaseq -r 3.14.0 -profile docker \ nextflow run /home/jhuang/Tools/nf-core-rnaseq-3.12.0/main.nf –help nextflow run /home/jhuang/Tools/nf-core-rnaseq-3.12.0/main.nf -profile docker \ –input samplesheet.csv \ –outdir results_GRCh38 \ –genome GRCh38 \ –with_umi \ –umitools_extract_method regex \ –umitools_bc_pattern ‘^(?P .{12}).*’ \ –trimmer fastp \ –extra_fastp_args “–trim_front2 10” \ -resume \ –max_cpus 54 \ –max_memory 120.GB \ –max_time 2400.h \ –save_align_intermeds \ –save_unaligned \ –save_reference \ –aligner star_salmon \ –pseudo_aligner salmon \ –umitools_grouping_method unique 关键执行逻辑(让你完全放心): 第一步 (umi_tools):流程会读取原始的 Read 1,根据 ^(?P .{12}).* 提取前 12 个碱基作为 UMI,并将其添加到 read 的名称中(例如 @READ_ID_UMISEQUENCE)。此时 Read 1 的序列中不再包含这 12 个碱基。 第二步 (fastp):流程接着运行 fastp。由于你设置了 –clip_r2 10,fastp 会精准地切除 Read 2 的前 10 个碱基(解决 Lexogen CORALL v2 的引物错误问题)。 互不干扰:umi_tools 只处理 Read 1 的 UMI,fastp 负责全局的质控和 Read 2 的修剪。你绝对不需要在运行 Nextflow 之前手动使用 Trimmomatic。
  4. To verify that fastp has successfully trimmed the first 10 nucleotides from Read 2, you can use any of the following three methods.

         *(Note: First, ensure that the `FASTP` step in your Nextflow terminal has actually finished running and shows a green checkmark ✔).*
    
         ### Method: Manually Measure the Sequence Length (Raw Data Proof)
         Because you did not use the `--save_trimmed` flag, the trimmed FASTQ files are not saved in your `results/` folder. However, they still exist temporarily in the `work/` directory. You can measure the exact length of the first read to prove 10bp was cut.
    
         *Note: Since you used `--with_umi`, the input file for fastp is the output from UMI-tools, so the filename will likely contain `umi_extract` instead of `raw`.*
    
         1. Navigate into the `work/` directory for the `FASTP` step:
         ```bash
         cd work/fc/def4de*/
         cd work/a1/b2c3d4*/
         ```
         2. List the files to see the exact naming convention:
         ```bash
         ls -lh *2.fastq.gz
         ```
         *(You will likely see an input file like `*.umi_extract_2.fastq.gz` and an output file like `*.fastp.trimmed_2.fastq.gz` or `*.trim_2.fastq.gz`).*
         3. **Measure the length of the first sequence in the INPUT Read 2:**
         ```bash
         zcat *.umi_extract_1.fastq.gz | head -n 2 | tail -n 1 | awk '{print length}'
         zcat *.umi_extract_2.fastq.gz | head -n 2 | tail -n 1 | awk '{print length}'
         #142
         #154
         ```
         *(Let's say this outputs `150`)*
         4. **Measure the length of the first sequence in the OUTPUT Read 2:**
         ```bash
         zcat *.trimmed_2.fastq.gz | head -n 2 | tail -n 1 | awk '{print length}'
         zcat control_r3_1.fastp.fastq.gz | head -n 2 | tail -n 1 | awk '{print length}'
         zcat control_r3_2.fastp.fastq.gz | head -n 2 | tail -n 1 | awk '{print length}'
         #142
         #144
         ```
         *(This should output exactly `140`)*.
    
         If the output length is exactly 10 bases shorter than the input length, the trimming was 100% successful.
    
         ---
    
         ### 💡 Pro Tip for Future Runs
         If you want to easily inspect the trimmed FASTQ files in your `results/` folder without digging through the `work/` directory next time, simply add **`--save_trimmed`** to your `nextflow run` command. This will force the pipeline to copy the post-fastp FASTQ files to your main results directory.
  5. Nextflow running on virus

         # OLD_COMMANDS_DEPRECATED: under sage in the early running
         # The virus-referenced commands!
         # nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results_chrHsv1 --fasta chrHsv1_s17.fasta --gtf chrHsv1_s17.gtf --with_umi --umitools_extract_method regex --umitools_bc_pattern '^(?P.{12}).*' --umitools_dedup_stats -profile test_full -resume --max_memory 256.GB --max_time 2400.h --save_reference --aligner star_salmon --gtf_extra_attributes gene_id --gtf_group_features transcript_id --featurecounts_group_type gene_id --featurecounts_feature_type transcript --skip_rseqc --skip_dupradar --skip_preseq --skip_biotype_qc --skip_deseq2_qc --skip_multiqc --min_mapped_reads 0
    
             #This error occurs because your GenBank file was likely exported from SnapGene (as seen in the COMMENT section of the file). SnapGene sometimes formats the date in the LOCUS line as DD-MM-YYYY (e.g., 13-06-2025), but Biopython's parser strictly expects the standard NCBI GenBank format, which requires a 3-letter month abbreviation like DD-MMM-YYYY (e.g., 13-JUN-2025).
             sed -i 's/13-06-2025/13-JUN-2025/' BB1528_nanopore_consensus.gb
    
             python3 gb_to_fasta_gtf_v2.py BB1528_nanopore_consensus.gb BB1528_nanopore_consensus.fasta BB1528_nanopore_consensus.gtf BB1528    # Generate the last two files in the command line
             python fix_gtf_to_hsv1_format.py    # Generate BB1528_final.gtf
    
             nextflow run /home/jhuang/Tools/nf-core-rnaseq-3.12.0/main.nf -profile docker \
             --input samplesheet_virus.csv \
             --outdir results_BB1528 \
             --fasta BB1528_nanopore_consensus.fasta \
             --gtf BB1528_final.gtf \
             --with_umi \
             --umitools_extract_method regex \
             --umitools_bc_pattern '^(?P
    .{12}).*’ \ –trimmer fastp \ –extra_fastp_args “–trim_front2 10” \ -resume \ –max_cpus 54 \ –max_memory 120.GB \ –max_time 2400.h \ –save_align_intermeds \ –save_unaligned \ –save_reference \ –aligner star_salmon –pseudo_aligner salmon \ –gtf_extra_attributes gene_id \ –gtf_group_features transcript_id \ –featurecounts_group_type gene_id \ –featurecounts_feature_type transcript \ –skip_rseqc –skip_dupradar –skip_preseq –skip_biotype_qc –skip_deseq2_qc –skip_multiqc \ –umitools_grouping_method unique –min_mapped_reads 0 # —- DEBUG_LOG —- Good news: Your GTF and FASTA files are perfectly formatted! We can prove this from your log: [info] Index contained 78 targets This means Salmon successfully built the transcriptome index and found all 78 transcripts from your BB1528_final.gtf. The reference files are completely correct. What is causing the error? The error is purely a biological/sample-specific issue: [warning] salmon was only able to assign 9 fragments to transcripts in the index, but the minimum number of required assigned fragments (–minAssignedFrags) was 10. The nf-core/rnaseq pipeline automatically subsamples your reads (in this case, 1,000,000 fragments) to run a quick Salmon quantification and check the library strandedness (–libType=A). Because you are mapping to a specific viral reference (BB1528), it is highly likely that control_r3 is a negative control or a sample with an extremely low viral load. Out of the 1,000,000 subsampled reads, only 9 reads actually belonged to the virus. Salmon has a built-in safety threshold (–minAssignedFrags 10) that intentionally crashes the pipeline if fewer than 10 reads map to the transcriptome, to prevent “garbage” quantification of empty samples. Since 9 The pipeline is failing during the FASTQ_SUBSAMPLE_FQ_SALMON step in the three control samples! Later on, we can manually set all counts for all virus transcripts in the three control-conditions as 0.
  6. A small BUG in both nextflow runs: versions.yml

         vim /mnt/md1/DATA/Data_Nina_RNAseq_2026/work/f0/712c00bea105e90714e8df47d3579a/collated_versions.yml
         BUG_LINES:
         "NFCORE_RNASEQ:RNASEQ:UMITOOLS_PREPAREFORSALMON":
             umitools: Matplotlib created a temporary config/cache directory at /tmp/matplotlib-4ht1kmld because the default path (/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
         1.1.4
         SHOULD_BE:
         "NFCORE_RNASEQ:RNASEQ:UMITOOLS_PREPAREFORSALMON":
             umitools: 1.1.4
    
         ALTERNATIVE: Not correct the bug, copy the version log to mail directory:
             cp work_GRCh38_DEL/f0/712c00bea105e90714e8df47d3579a/collated_versions.yml results_GRCh38
             cp work/bf/4a1e941bc32b97de1bc8f4fd47df79/collated_versions.yml results_BB1528
  7. import data and pca-plot

         # Import the required libraries
         library("AnnotationDbi")
         library("clusterProfiler")
         library("ReactomePA")
         library(gplots)
         library(tximport)
         library(DESeq2)
         library("org.Hs.eg.db")
         library(dplyr)
         library(tidyverse)
    
         # ------ For human genome ------
         setwd("~/DATA/Data_Nina_RNAseq_2026/results_GRCh38/star_salmon")
         # Define paths to your Salmon output quantification files
         files <- c("control_r1" = "./control_r1/quant.sf",
                 "control_r2" = "./control_r2/quant.sf",
                 "control_r3" = "./control_r3/quant.sf",
                 "VZV.d10_r1" = "./VZV.d10_r1/quant.sf",
                 "VZV.d10_r2" = "./VZV.d10_r2/quant.sf",
                 "VZV.d10_r3" = "./VZV.d10_r3/quant.sf",
                 "VZV.d15_r1" = "./VZV.d15_r1/quant.sf",
                 "VZV.d15_r2" = "./VZV.d15_r2/quant.sf",
                 "VZV.d15_r3" = "./VZV.d15_r3/quant.sf",
                 "VZV.d20_r1" = "./VZV.d20_r1/quant.sf",
                 "VZV.d20_r2" = "./VZV.d20_r2/quant.sf",
                 "VZV.d20_r3" = "./VZV.d20_r3/quant.sf")
         # Import the transcript abundance data with tximport
         txi <- tximport(files, type = "salmon", txIn = TRUE, txOut = TRUE)
         # Define the replicates and condition of the samples
         replicate <- factor(c("r1", "r2", "r3", "r1", "r2", "r3", "r1", "r2", "r3", "r1", "r2", "r3"))
         condition <- factor(c("control", "control", "control", "VZV.d10", "VZV.d10", "VZV.d10", "VZV.d15", "VZV.d15", "VZV.d15", "VZV.d20", "VZV.d20", "VZV.d20"))
         # Define the colData for DESeq2
         colData <- data.frame(condition=condition, replicate=replicate, row.names=names(files))
    
         # -- transcript-level count data --
         # Create DESeqDataSet object
         dds <- DESeqDataSetFromTximport(txi, colData=colData, design=~condition)
         write.csv(counts(dds), file="transcript_counts.csv")
    
         # -- gene-level count data --
         # Read in the tx2gene map from salmon_tx2gene.tsv
         tx2gene <- read.table("salmon_tx2gene.tsv", header=FALSE, stringsAsFactors=FALSE)
         # Set the column names
         colnames(tx2gene) <- c("transcript_id", "gene_id", "gene_name")
         # Remove the gene_name column if not needed
         tx2gene <- tx2gene[,1:2]
         # Import and summarize the Salmon data with tximport
         txi <- tximport(files, type = "salmon", tx2gene = tx2gene, txOut = FALSE)
         # Continue with the DESeq2 workflow as before...
         colData <- data.frame(condition=condition, replicate=replicate, row.names=names(files))
         dds <- DESeqDataSetFromTximport(txi, colData=colData, design=~condition)
         #dds <- dds[rowSums(counts(dds) > 3) > 2, ]    #60605-->26543
         write.csv(counts(dds, normalized=FALSE), file="gene_counts.csv")
    
         # ------ For virus genome ------
         setwd("~/DATA/Data_Nina_RNAseq_2026/results_BB1528/star_salmon")
         # Define paths to your Salmon output quantification files
         files <- c("VZV.d10_r1" = "./VZV.d10_r1/quant.sf",
                 "VZV.d10_r2" = "./VZV.d10_r2/quant.sf",
                 "VZV.d10_r3" = "./VZV.d10_r3/quant.sf",
                 "VZV.d15_r1" = "./VZV.d15_r1/quant.sf",
                 "VZV.d15_r2" = "./VZV.d15_r2/quant.sf",
                 "VZV.d15_r3" = "./VZV.d15_r3/quant.sf",
                 "VZV.d20_r1" = "./VZV.d20_r1/quant.sf",
                 "VZV.d20_r2" = "./VZV.d20_r2/quant.sf",
                 "VZV.d20_r3" = "./VZV.d20_r3/quant.sf")
         # Import the transcript abundance data with tximport
         txi <- tximport(files, type = "salmon", txIn = TRUE, txOut = TRUE)
         # Define the replicates and condition of the samples
         replicate <- factor(c("r1", "r2", "r3", "r1", "r2", "r3", "r1", "r2", "r3"))
         condition <- factor(c("VZV.d10", "VZV.d10", "VZV.d10", "VZV.d15", "VZV.d15", "VZV.d15", "VZV.d20", "VZV.d20", "VZV.d20"))
         # Define the colData for DESeq2
         colData <- data.frame(condition=condition, replicate=replicate, row.names=names(files))
    
         # -- transcript-level count data --
         # Create DESeqDataSet object
         dds <- DESeqDataSetFromTximport(txi, colData=colData, design=~condition)
         write.csv(counts(dds), file="transcript_counts.csv")
    
         # -- gene-level count data --
         # Read in the tx2gene map from salmon_tx2gene.tsv
         tx2gene <- read.table("salmon_tx2gene.tsv", header=FALSE, stringsAsFactors=FALSE)
         # Set the column names
         colnames(tx2gene) <- c("transcript_id", "gene_id", "gene_name")
         # Remove the gene_name column if not needed
         tx2gene <- tx2gene[,1:2]
         # Import and summarize the Salmon data with tximport
         txi <- tximport(files, type = "salmon", tx2gene = tx2gene, txOut = FALSE)
         # Continue with the DESeq2 workflow as before...
         colData <- data.frame(condition=condition, replicate=replicate, row.names=names(files))
         dds <- DESeqDataSetFromTximport(txi, colData=colData, design=~condition)
         #dds <- dds[rowSums(counts(dds) > 3) > 2, ]    #60605-->26543
         write.csv(counts(dds, normalized=FALSE), file="gene_counts.csv")
    
         # MANUALLY set all counts for all virus transcripts in the three control-conditions ("control_r1","control_r2","control_r3") as 0 using Excel
    
         # ------ Merge the raw counts of human and microbe ------
         #cat ~/DATA/Data_Nina_RNAseq_2026/results_GRCh38/star_salmon/gene_counts.csv ~/DATA/Data_Nina_RNAseq_2026/results_BB1528/star_salmon/gene_counts.csv > merged_gene_counts.csv
         #DELETE the second line: "","control_r1","control_r2","control_r3","VZV.d10_r1","VZV.d10_r2","VZV.d10_r3","VZV.d15_r1","VZV.d15_r2","VZV.d15_r3","VZV.d20_r1","VZV.d20_r2","VZV.d20_r3"
    
         # 1. Remove the hidden ^M (Windows carriage returns) from both files
         sed -i 's/\r$//' ~/DATA/Data_Nina_RNAseq_2026/results_GRCh38/star_salmon/gene_counts.csv ~/DATA/Data_Nina_RNAseq_2026/results_BB1528/star_salmon/gene_counts.csv
         # 2. Extract the header from the first file to start the new merged file
         head -n 1 ~/DATA/Data_Nina_RNAseq_2026/results_GRCh38/star_salmon/gene_counts.csv > merged_gene_counts.csv
         # 3. Append the data from the first file (skipping its header)
         tail -n +2 ~/DATA/Data_Nina_RNAseq_2026/results_GRCh38/star_salmon/gene_counts.csv >> merged_gene_counts.csv
         # 4. Append the data from the second file (skipping its header)
         tail -n +2 ~/DATA/Data_Nina_RNAseq_2026/results_BB1528/star_salmon/gene_counts.csv >> merged_gene_counts.csv
    
         # Should be 60683 lines
         #~/Tools/csv2xls-0.4/csv_to_xls.py merged_gene_counts.csv -d',' -o raw_gene_counts.xls;
    
         # -- for merged analysis due to false normalization factors wenn alone analyzed on virus data --
         setwd("~/DATA/Data_Nina_RNAseq_2026/")
         d.raw <- read.csv("merged_gene_counts.csv", header=TRUE, row.names=1)
    
         # Re-define the replicates and condition for the merged counts
         rownames <- factor(c("control_r1","control_r2","control_r3","VZV.d10_r1","VZV.d10_r2","VZV.d10_r3","VZV.d15_r1","VZV.d15_r2","VZV.d15_r3","VZV.d20_r1","VZV.d20_r2","VZV.d20_r3"))
         replicate <- factor(c("r1", "r2", "r3", "r1", "r2", "r3", "r1", "r2", "r3", "r1", "r2", "r3"))
         condition <- factor(c("control", "control", "control", "VZV.d10", "VZV.d10", "VZV.d10", "VZV.d15", "VZV.d15", "VZV.d15", "VZV.d20", "VZV.d20", "VZV.d20"))
         colData <- data.frame(condition=condition, replicate=replicate, row.names=rownames)
    
         #CORRECTED: dds <- DESeqDataSetFromMatrix(countData=d.raw, colData=colData, design=~condition+replicate)
         #⚠️ A Note on the Design Formula: In the snippet, I used design = ~condition+replicate. In DESeq2, adding replicate (r1, r2, r3) to the design formula tells the model to treat "r1" as a globally systematic batch effect across all conditions. Unless the replicates were processed in distinct batches (e.g., all r1 on Monday, all r2 on Tuesday), it is standard to use design = ~ condition to simply measure the biological variance. Therefore, I have updated the script to use ~ condition.
         dds <- DESeqDataSetFromMatrix(countData=d.raw, colData=colData, design=~condition)
         dim(counts(dds))
         head(counts(dds), 10)
    
         rld <- rlogTransformation(dds)
    
         #We don't need to run DESeq(dds) before estimateSizeFactors(dds). In fact, the typical workflow in DESeq2 is the opposite: we usually run estimateSizeFactors(dds) (and other preprocessing functions) before running the main DESeq(dds) function.
         #The estimateSizeFactors function is used to calculate size factors for normalization, which corrects for differences in library size (i.e., the number of read counts) between samples. This normalization step is crucial to ensure that differences in gene expression aren't merely due to differences in sequencing depth between samples.
         #The DESeq function, on the other hand, performs the main differential expression analysis, comparing gene expression between different conditions or groups.
         #So, the typical workflow is:
         #  - Create the DESeqDataSet object.
         #  - Use estimateSizeFactors to normalize for library size.
         #  - (Optionally, estimate dispersion with estimateDispersions if not using the full DESeq function later.)
         #  - Use DESeq for the differential expression analysis.
         #  - However, it's worth noting that if you run the main DESeq function directly after creating the DESeqDataSet object, it will automatically perform the normalization (using estimateSizeFactors) and dispersion estimation steps for you. In that case, there's no need to run estimateSizeFactors separately before DESeq.
    
         # draw simple pca and heatmap
         library(gplots)
         library("RColorBrewer")
         #mat <- assay(rld)
         #mm <- model.matrix(~condition, colData(rld))
         #mat <- limma::removeBatchEffect(mat, batch=rld$batch, design=mm)
         #assay(rld) <- mat
         # -- pca --
         png("pca.png", 1200, 800)
         plotPCA(rld, intgroup=c("condition"))
         dev.off()
         # -- heatmap --
         png("heatmap.png", 1200, 800)
         distsRL <- dist(t(assay(rld)))
         mat <- as.matrix(distsRL)
         hc <- hclust(distsRL)
         hmcol <- colorRampPalette(brewer.pal(9,"GnBu"))(100)
         heatmap.2(mat, Rowv=as.dendrogram(hc),symm=TRUE, trace="none",col = rev(hmcol), margin=c(13, 13))
         dev.off()
  8. draw 3D PCA plots.

         library(gplots)
         library("RColorBrewer")
    
         library(ggplot2)
         data <- plotPCA(rld, intgroup=c("condition", "replicate"), returnData=TRUE)
         write.csv(data, file="plotPCA_data.csv")
         #calculate all PCs including PC3 with the following codes
         library(genefilter)
         ntop <- 500
         rv <- rowVars(assay(rld))
         select <- order(rv, decreasing = TRUE)[seq_len(min(ntop, length(rv)))]
         mat <- t( assay(rld)[select, ] )
         pc <- prcomp(mat)
         pc$x[,1:3]
         #df_pc <- data.frame(pc$x[,1:3])
         df_pc <- data.frame(pc$x)
         identical(rownames(data), rownames(df_pc)) #-->TRUE
    
         data$PC1 <- NULL
         data$PC2 <- NULL
         merged_df <- merge(data, df_pc, by = "row.names")
         #merged_df <- merged_df[, -1]
         row.names(merged_df) <- merged_df$Row.names
         merged_df$Row.names <- NULL  # remove the "name" column
         merged_df$name <- NULL
         merged_df <- merged_df[, c("PC1","PC2","PC3","PC4","PC5","PC6","PC7","PC8","PC9","PC10","PC11","PC12","group","condition","replicate")]
         write.csv(merged_df, file="merged_df_12PCs.csv")
         summary(pc)
         #                           PC1     PC2     PC3
         #Proportion of Variance  0.5969  0.1732 0.09444
    
         python3 draw_3D.py
         # Edit on the generated SVG-figure.
         #/usr/bin/convert PCA_3D.png -crop 2900x1600+250+700 PCA_3D_cropped.png
  9. (Optional) estimate size factors and dispersion values.

         #Size Factors: These are used to normalize the read counts across different samples. The size factor for a sample accounts for differences in sequencing depth (i.e., the total number of reads) and other technical biases between samples. After normalization with size factors, the counts should be comparable across samples. Size factors are usually calculated in a way that they reflect the median or mean ratio of gene expression levels between samples, assuming that most genes are not differentially expressed.
         #Dispersion: This refers to the variability or spread of gene expression measurements. In RNA-seq data analysis, each gene has its own dispersion value, which reflects how much the counts for that gene vary between different samples, more than what would be expected just due to the Poisson variation inherent in counting. Dispersion is important for accurately modeling the data and for detecting differentially expressed genes.
         #So in summary, size factors are specific to samples (used to make counts comparable across samples), and dispersion values are specific to genes (reflecting variability in gene expression).
    
         sizeFactors(dds)
         #NULL
         # Estimate size factors
         dds <- estimateSizeFactors(dds)
         # Estimate dispersions
         dds <- estimateDispersions(dds)
         #> sizeFactors(dds)
         #control_r1 control_r2 control_r3 VZV.d10_r1 VZV.d10_r2 VZV.d10_r3 VZV.d15_r1
         #1.2072182  1.1009757  0.8610196  1.1965615  0.8575169  1.1714809  0.8344160
         #VZV.d15_r2 VZV.d15_r3 VZV.d20_r1 VZV.d20_r2 VZV.d20_r3
         #0.5278583  1.1768657  1.1006353  1.1288974  1.2630530
    
         # If alone with virus data, the following BUG occured:
         #Still NULL --> BUG --> using manual calculation method for sizeFactor calculation!
                             HeLa_TO_r1                      HeLa_TO_r2
                             0.9978755                       1.1092227
         data.frame(genes = rownames(dds), dispersions = dispersions(dds))
    
         #Given the raw counts, the control_r1 and control_r2 samples seem to have a much lower sequencing depth (total read count) than the other samples. Therefore, when normalization methods are applied, the normalization factors for these control samples will be relatively high, boosting the normalized counts.
    
         1/0.9978755=1.002129023
         1/1.1092227=
    
         #bamCoverage --bam ../markDuplicates/${sample}Aligned.sortedByCoord.out.bam -o ${sample}_norm.bw --binSize 10 --scaleFactor  --effectiveGenomeSize 2864785220
         bamCoverage --bam ../markDuplicates/HeLa_TO_r1Aligned.sortedByCoord.out.markDups.bam -o HeLa_TO_r1.bw --binSize 10 --scaleFactor 1.002129023     --effectiveGenomeSize 2864785220
         bamCoverage --bam ../markDuplicates/HeLa_TO_r2Aligned.sortedByCoord.out.markDups.bam -o HeLa_TO_r2.bw --binSize 10 --scaleFactor  0.901532217        --effectiveGenomeSize 2864785220
    
         raw_counts <- counts(dds)
         normalized_counts <- counts(dds, normalized=TRUE)
         #write.table(raw_counts, file="raw_counts.txt", sep="\t", quote=F, col.names=NA)
         #write.table(normalized_counts, file="normalized_counts.txt", sep="\t", quote=F, col.names=NA)
    
         #convert bam to bigwig using deepTools by feeding inverse of DESeq’s size Factor
         estimSf <- function (cds){
             # Get the count matrix
             cts <- counts(cds)
    
             # Compute the geometric mean
             geomMean <- function(x) prod(x)^(1/length(x))
    
             # Compute the geometric mean over the line
             gm.mean  <-  apply(cts, 1, geomMean)
    
             # Zero values are set to NA (avoid subsequentcdsdivision by 0)
             gm.mean[gm.mean == 0] <- NA
    
             # Divide each line by its corresponding geometric mean
             # sweep(x, MARGIN, STATS, FUN = "-", check.margin = TRUE, ...)
             # MARGIN: 1 or 2 (line or columns)
             # STATS: a vector of length nrow(x) or ncol(x), depending on MARGIN
             # FUN: the function to be applied
             cts <- sweep(cts, 1, gm.mean, FUN="/")
    
             # Compute the median over the columns
             med <- apply(cts, 2, median, na.rm=TRUE)
    
             # Return the scaling factor
             return(med)
         }
         #https://dputhier.github.io/ASG/practicals/rnaseq_diff_Snf2/rnaseq_diff_Snf2.html
         #http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#data-transformations-and-visualization
         #https://hbctraining.github.io/DGE_workshop/lessons/02_DGE_count_normalization.html
         #https://hbctraining.github.io/DGE_workshop/lessons/04_DGE_DESeq2_analysis.html
         #https://genviz.org/module-04-expression/0004/02/01/DifferentialExpression/
         #DESeq2’s median of ratios [1]
         #EdgeR’s trimmed mean of M values (TMM) [2]
         #http://www.nathalievialaneix.eu/doc/html/TP1_normalization.html  #very good website!
         test_normcount <- sweep(raw_counts, 2, sizeFactors(dds), "/")
         sum(test_normcount != normalized_counts)
  10. Merged Host-Virus Counts -> DESeq2 -> DEGs of 6 Comparisons

         Rscript complete_deg_pipeline.R
    
         # Key changes made:
         #     1. Virus gene identification: Added detection of virus genes using pattern matching ("^ORF|cat_RNA|redF_RNA|repE_RNA|sopA_RNA|sopB_RNA|UL45_RNA")
         #     2. Updated significance cutoff: Changed from |log2FC| >= 1 to |log2FC| >= 2
         #     3. Color scheme:
         #        * Green for virus RNAs (all virus genes)
         #        * Dark green for significantly regulated virus genes
         #        * Red for significantly up-regulated host genes
         #        * Blue for significantly down-regulated host genes
         #        * Gray for non-significant host genes
         #     4. Enhanced legend: The legend now clearly distinguishes between virus RNAs and host genes with their regulation status
         #     5. Summary statistics: Added detailed output showing counts of virus vs host genes and their regulation status
    
         #  ================================================================================
         # 📊 FINAL SUMMARY OF ALL 6 COMPARISONS
         #                 name total  up down virus sig_total pct_sig
         # 03_VZV.d20_vs_control 17594 193  220    77       413     2.3
         # 02_VZV.d15_vs_control 17594 139   94    77       233     1.3
         # 05_VZV.d20_vs_VZV.d10 17594  35  114    77       149     0.8
         # 01_VZV.d10_vs_control 17594  89   20    77       109     0.6
         # 06_VZV.d20_vs_VZV.d15 17594  28   38    77        66     0.4
         # 04_VZV.d15_vs_VZV.d10 17594   6   13    77        19     0.1

Leave a Reply

Your email address will not be published. Required fields are marked *