-
Preparing samplesheet.csv
sample,fastq_1,fastq_2,strandedness control_r1,./20260527_AV243904_0078_B/01_negative_34_170_34_171_R1.fastq.gz,./20260527_AV243904_0078_B/01_negative_34_170_34_171_R2.fastq.gz,auto control_r2,./20260527_AV243904_0078_B/02_negative_34_182_34_206_R1.fastq.gz,./20260527_AV243904_0078_B/02_negative_34_182_34_206_R2.fastq.gz,auto control_r3,./20260527_AV243904_0078_B/03_negative_34_216_34_218_R1.fastq.gz,./20260527_AV243904_0078_B/03_negative_34_216_34_218_R2.fastq.gz,auto VZV.d10_r1,./20260527_AV243904_0078_B/04_VZV_d10_34_11_34_12_R1.fastq.gz,./20260527_AV243904_0078_B/04_VZV_d10_34_11_34_12_R2.fastq.gz,auto VZV.d10_r2,./20260527_AV243904_0078_B/05_VZV_d10_34_20_34_23_R1.fastq.gz,./20260527_AV243904_0078_B/05_VZV_d10_34_20_34_23_R2.fastq.gz,auto VZV.d10_r3,./20260527_AV243904_0078_B/06_VZV_d10_34_25_34_29_R1.fastq.gz,./20260527_AV243904_0078_B/06_VZV_d10_34_25_34_29_R2.fastq.gz,auto VZV.d15_r1,./20260527_AV243904_0078_B/07_VZV_d15_34_30_34_31_R1.fastq.gz,./20260527_AV243904_0078_B/07_VZV_d15_34_30_34_31_R2.fastq.gz,auto VZV.d15_r2,./20260527_AV243904_0078_B/08_VZV_d15_34_35_34_39_R1.fastq.gz,./20260527_AV243904_0078_B/08_VZV_d15_34_35_34_39_R2.fastq.gz,auto VZV.d15_r3,./20260527_AV243904_0078_B/09_VZV_d15_34_56_34_60_R1.fastq.gz,./20260527_AV243904_0078_B/09_VZV_d15_34_56_34_60_R2.fastq.gz,auto VZV.d20_r1,./20260527_AV243904_0078_B/10_VZV_d20_34_71_34_85_R1.fastq.gz,./20260527_AV243904_0078_B/10_VZV_d20_34_71_34_85_R2.fastq.gz,auto VZV.d20_r2,./20260527_AV243904_0078_B/11_VZV_d20_34_88_34_91_R1.fastq.gz,./20260527_AV243904_0078_B/11_VZV_d20_34_88_34_91_R2.fastq.gz,auto VZV.d20_r3,./20260527_AV243904_0078_B/12_VZV_d20_34_94_34_100_R1.fastq.gz,./20260527_AV243904_0078_B/12_VZV_d20_34_94_34_100_R2.fastq.gz,auto -
DO NOT RUN Trimmomatic before Nextflow.
It is highly recommended to let the nf-core/rnaseq pipeline handle all trimming internally. Pre-trimming your data can break the pipeline's internal file tracking, interfere with the UMI extraction step, and make reproducibility harder. The nf-core/rnaseq pipeline has built-in support for both UMI extraction and read trimming, and it executes them in the correct order: it will extract the UMIs from the raw Read 1 first, and then trim the error-prone bases from Read 2. Here is how you should adjust your approach and your Nextflow command. 不应该在运行 Nextflow 之前手动使用 Trimmomatic。 强烈建议让 nf-core/rnaseq 流程在内部处理所有的修剪(trimming)工作。提前手动修剪数据可能会破坏流程内部的文件追踪机制,干扰 UMI 的提取步骤,并降低结果的可重复性。 nf-core/rnaseq 流程内置了对 UMI 提取和 reads 修剪的支持,并且会按照正确的顺序执行:它会首先从原始的 Read 1 中提取 UMI,然后再修剪 Read 2 中容易出错的碱基。 -
Nextflow running
.{12}).*’ 原理解释: ^:匹配字符串的开头。 (?P .{12}):捕获前 12 个任意字符,并将其命名为 umi_1(这将作为 UMI 被提取并添加到 read name 中)。 .*:匹配并保留剩余的序列(文档说明:未被命名为 discard 的其他部分会被重新附加到 read 序列上)。 方法二:字符串法 (String) —— 更简单直观的替代方案 文档中提到: “string: This should be used where the barcodes are always in the same place in the read. N = UMI position” 因为你的 UMI 长度固定(12nt)且位置固定(永远在 Read 1 开头),你完全可以使用更简单的 string 方法,用 N 来代表 UMI 的位置。 对应的 Nextflow 参数: –umitools_extract_method string –umitools_bc_pattern NNNNNNNNNNNN (正好 12 个 N) # OLD_COMMANDS_DEPRECATED: under sage in the early running # ln -s /home/jhuang/Tools/nf-core-rnaseq-3.12.0/ rnaseq # nextflow run rnaseq/main.nf –input samplesheet.csv –outdir results_GRCh38 –genome GRCh38 –with_umi –umitools_extract_method regex –umitools_bc_pattern ‘^(?P.{12}).*’ -profile docker -resume –max_cpus 54 –max_memory 120.GB –max_time 2400.h –save_align_intermeds –save_unaligned –save_reference –aligner star_salmon –pseudo_aligner salmon –umitools_grouping_method unique # #Debug the following error: added “–minAssignedFrags 0 \\” to modules/nf-core/salmon/quant/main.nf option “salmon quant” and added “–min_mapped_reads 0” in the nextflow command #nextflow run nf-core/rnaseq -r 3.14.0 -profile docker \ nextflow run /home/jhuang/Tools/nf-core-rnaseq-3.12.0/main.nf –help nextflow run /home/jhuang/Tools/nf-core-rnaseq-3.12.0/main.nf -profile docker \ –input samplesheet.csv \ –outdir results_GRCh38 \ –genome GRCh38 \ –with_umi \ –umitools_extract_method regex \ –umitools_bc_pattern ‘^(?P .{12}).*’ \ –trimmer fastp \ –extra_fastp_args “–trim_front2 10” \ -resume \ –max_cpus 54 \ –max_memory 120.GB \ –max_time 2400.h \ –save_align_intermeds \ –save_unaligned \ –save_reference \ –aligner star_salmon \ –pseudo_aligner salmon \ –umitools_grouping_method unique 关键执行逻辑(让你完全放心): 第一步 (umi_tools):流程会读取原始的 Read 1,根据 ^(?P .{12}).* 提取前 12 个碱基作为 UMI,并将其添加到 read 的名称中(例如 @READ_ID_UMISEQUENCE)。此时 Read 1 的序列中不再包含这 12 个碱基。 第二步 (fastp):流程接着运行 fastp。由于你设置了 –clip_r2 10,fastp 会精准地切除 Read 2 的前 10 个碱基(解决 Lexogen CORALL v2 的引物错误问题)。 互不干扰:umi_tools 只处理 Read 1 的 UMI,fastp 负责全局的质控和 Read 2 的修剪。你绝对不需要在运行 Nextflow 之前手动使用 Trimmomatic。#yes, exactly, the Reads contain UMI: the first 12 nt of Read1. #The CORALL v2 Kit from Lexogen was used for library prep. According to the manufacturer it can additionally be beneficial for the mapping to trim the first 10 nt of Read2 as well, as this site is more error prone (due to priming they say). #If you need more information from my side, let me know! 方法一:正则表达式法 (Regex) —— 你目前选择的方案 根据你提供的文档,使用 regex 方法时,必须使用特定的命名捕获组。文档中明确指出: "The expected groups in the regex are: umi_n = UMI positions, where n can be any value (required)" 这意味着,捕获 UMI 的组名必须是 umi_ 加上一个数字(例如 umi_1),而不能只是 umi。 正确的正则表达式:'^(?P -
To verify that
fastphas successfully trimmed the first 10 nucleotides from Read 2, you can use any of the following three methods.*(Note: First, ensure that the `FASTP` step in your Nextflow terminal has actually finished running and shows a green checkmark ✔).* ### Method: Manually Measure the Sequence Length (Raw Data Proof) Because you did not use the `--save_trimmed` flag, the trimmed FASTQ files are not saved in your `results/` folder. However, they still exist temporarily in the `work/` directory. You can measure the exact length of the first read to prove 10bp was cut. *Note: Since you used `--with_umi`, the input file for fastp is the output from UMI-tools, so the filename will likely contain `umi_extract` instead of `raw`.* 1. Navigate into the `work/` directory for the `FASTP` step: ```bash cd work/fc/def4de*/ cd work/a1/b2c3d4*/ ``` 2. List the files to see the exact naming convention: ```bash ls -lh *2.fastq.gz ``` *(You will likely see an input file like `*.umi_extract_2.fastq.gz` and an output file like `*.fastp.trimmed_2.fastq.gz` or `*.trim_2.fastq.gz`).* 3. **Measure the length of the first sequence in the INPUT Read 2:** ```bash zcat *.umi_extract_1.fastq.gz | head -n 2 | tail -n 1 | awk '{print length}' zcat *.umi_extract_2.fastq.gz | head -n 2 | tail -n 1 | awk '{print length}' #142 #154 ``` *(Let's say this outputs `150`)* 4. **Measure the length of the first sequence in the OUTPUT Read 2:** ```bash zcat *.trimmed_2.fastq.gz | head -n 2 | tail -n 1 | awk '{print length}' zcat control_r3_1.fastp.fastq.gz | head -n 2 | tail -n 1 | awk '{print length}' zcat control_r3_2.fastp.fastq.gz | head -n 2 | tail -n 1 | awk '{print length}' #142 #144 ``` *(This should output exactly `140`)*. If the output length is exactly 10 bases shorter than the input length, the trimming was 100% successful. --- ### 💡 Pro Tip for Future Runs If you want to easily inspect the trimmed FASTQ files in your `results/` folder without digging through the `work/` directory next time, simply add **`--save_trimmed`** to your `nextflow run` command. This will force the pipeline to copy the post-fastp FASTQ files to your main results directory. -
Nextflow running on virus
.{12}).*’ \ –trimmer fastp \ –extra_fastp_args “–trim_front2 10” \ -resume \ –max_cpus 54 \ –max_memory 120.GB \ –max_time 2400.h \ –save_align_intermeds \ –save_unaligned \ –save_reference \ –aligner star_salmon –pseudo_aligner salmon \ –gtf_extra_attributes gene_id \ –gtf_group_features transcript_id \ –featurecounts_group_type gene_id \ –featurecounts_feature_type transcript \ –skip_rseqc –skip_dupradar –skip_preseq –skip_biotype_qc –skip_deseq2_qc –skip_multiqc \ –umitools_grouping_method unique –min_mapped_reads 0 # —- DEBUG_LOG —- Good news: Your GTF and FASTA files are perfectly formatted! We can prove this from your log: [info] Index contained 78 targets This means Salmon successfully built the transcriptome index and found all 78 transcripts from your BB1528_final.gtf. The reference files are completely correct. What is causing the error? The error is purely a biological/sample-specific issue: [warning] salmon was only able to assign 9 fragments to transcripts in the index, but the minimum number of required assigned fragments (–minAssignedFrags) was 10. The nf-core/rnaseq pipeline automatically subsamples your reads (in this case, 1,000,000 fragments) to run a quick Salmon quantification and check the library strandedness (–libType=A). Because you are mapping to a specific viral reference (BB1528), it is highly likely that control_r3 is a negative control or a sample with an extremely low viral load. Out of the 1,000,000 subsampled reads, only 9 reads actually belonged to the virus. Salmon has a built-in safety threshold (–minAssignedFrags 10) that intentionally crashes the pipeline if fewer than 10 reads map to the transcriptome, to prevent “garbage” quantification of empty samples. Since 9 The pipeline is failing during the FASTQ_SUBSAMPLE_FQ_SALMON step in the three control samples! Later on, we can manually set all counts for all virus transcripts in the three control-conditions as 0.# OLD_COMMANDS_DEPRECATED: under sage in the early running # The virus-referenced commands! # nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results_chrHsv1 --fasta chrHsv1_s17.fasta --gtf chrHsv1_s17.gtf --with_umi --umitools_extract_method regex --umitools_bc_pattern '^(?P.{12}).*' --umitools_dedup_stats -profile test_full -resume --max_memory 256.GB --max_time 2400.h --save_reference --aligner star_salmon --gtf_extra_attributes gene_id --gtf_group_features transcript_id --featurecounts_group_type gene_id --featurecounts_feature_type transcript --skip_rseqc --skip_dupradar --skip_preseq --skip_biotype_qc --skip_deseq2_qc --skip_multiqc --min_mapped_reads 0 #This error occurs because your GenBank file was likely exported from SnapGene (as seen in the COMMENT section of the file). SnapGene sometimes formats the date in the LOCUS line as DD-MM-YYYY (e.g., 13-06-2025), but Biopython's parser strictly expects the standard NCBI GenBank format, which requires a 3-letter month abbreviation like DD-MMM-YYYY (e.g., 13-JUN-2025). sed -i 's/13-06-2025/13-JUN-2025/' BB1528_nanopore_consensus.gb python3 gb_to_fasta_gtf_v2.py BB1528_nanopore_consensus.gb BB1528_nanopore_consensus.fasta BB1528_nanopore_consensus.gtf BB1528 # Generate the last two files in the command line python fix_gtf_to_hsv1_format.py # Generate BB1528_final.gtf nextflow run /home/jhuang/Tools/nf-core-rnaseq-3.12.0/main.nf -profile docker \ --input samplesheet_virus.csv \ --outdir results_BB1528 \ --fasta BB1528_nanopore_consensus.fasta \ --gtf BB1528_final.gtf \ --with_umi \ --umitools_extract_method regex \ --umitools_bc_pattern '^(?P -
A small BUG in both nextflow runs: versions.yml
vim /mnt/md1/DATA/Data_Nina_RNAseq_2026/work/f0/712c00bea105e90714e8df47d3579a/collated_versions.yml BUG_LINES: "NFCORE_RNASEQ:RNASEQ:UMITOOLS_PREPAREFORSALMON": umitools: Matplotlib created a temporary config/cache directory at /tmp/matplotlib-4ht1kmld because the default path (/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing. 1.1.4 SHOULD_BE: "NFCORE_RNASEQ:RNASEQ:UMITOOLS_PREPAREFORSALMON": umitools: 1.1.4 ALTERNATIVE: Not correct the bug, copy the version log to mail directory: cp work_GRCh38_DEL/f0/712c00bea105e90714e8df47d3579a/collated_versions.yml results_GRCh38 cp work/bf/4a1e941bc32b97de1bc8f4fd47df79/collated_versions.yml results_BB1528 -
import data and pca-plot
# Import the required libraries library("AnnotationDbi") library("clusterProfiler") library("ReactomePA") library(gplots) library(tximport) library(DESeq2) library("org.Hs.eg.db") library(dplyr) library(tidyverse) # ------ For human genome ------ setwd("~/DATA/Data_Nina_RNAseq_2026/results_GRCh38/star_salmon") # Define paths to your Salmon output quantification files files <- c("control_r1" = "./control_r1/quant.sf", "control_r2" = "./control_r2/quant.sf", "control_r3" = "./control_r3/quant.sf", "VZV.d10_r1" = "./VZV.d10_r1/quant.sf", "VZV.d10_r2" = "./VZV.d10_r2/quant.sf", "VZV.d10_r3" = "./VZV.d10_r3/quant.sf", "VZV.d15_r1" = "./VZV.d15_r1/quant.sf", "VZV.d15_r2" = "./VZV.d15_r2/quant.sf", "VZV.d15_r3" = "./VZV.d15_r3/quant.sf", "VZV.d20_r1" = "./VZV.d20_r1/quant.sf", "VZV.d20_r2" = "./VZV.d20_r2/quant.sf", "VZV.d20_r3" = "./VZV.d20_r3/quant.sf") # Import the transcript abundance data with tximport txi <- tximport(files, type = "salmon", txIn = TRUE, txOut = TRUE) # Define the replicates and condition of the samples replicate <- factor(c("r1", "r2", "r3", "r1", "r2", "r3", "r1", "r2", "r3", "r1", "r2", "r3")) condition <- factor(c("control", "control", "control", "VZV.d10", "VZV.d10", "VZV.d10", "VZV.d15", "VZV.d15", "VZV.d15", "VZV.d20", "VZV.d20", "VZV.d20")) # Define the colData for DESeq2 colData <- data.frame(condition=condition, replicate=replicate, row.names=names(files)) # -- transcript-level count data -- # Create DESeqDataSet object dds <- DESeqDataSetFromTximport(txi, colData=colData, design=~condition) write.csv(counts(dds), file="transcript_counts.csv") # -- gene-level count data -- # Read in the tx2gene map from salmon_tx2gene.tsv tx2gene <- read.table("salmon_tx2gene.tsv", header=FALSE, stringsAsFactors=FALSE) # Set the column names colnames(tx2gene) <- c("transcript_id", "gene_id", "gene_name") # Remove the gene_name column if not needed tx2gene <- tx2gene[,1:2] # Import and summarize the Salmon data with tximport txi <- tximport(files, type = "salmon", tx2gene = tx2gene, txOut = FALSE) # Continue with the DESeq2 workflow as before... colData <- data.frame(condition=condition, replicate=replicate, row.names=names(files)) dds <- DESeqDataSetFromTximport(txi, colData=colData, design=~condition) #dds <- dds[rowSums(counts(dds) > 3) > 2, ] #60605-->26543 write.csv(counts(dds, normalized=FALSE), file="gene_counts.csv") # ------ For virus genome ------ setwd("~/DATA/Data_Nina_RNAseq_2026/results_BB1528/star_salmon") # Define paths to your Salmon output quantification files files <- c("VZV.d10_r1" = "./VZV.d10_r1/quant.sf", "VZV.d10_r2" = "./VZV.d10_r2/quant.sf", "VZV.d10_r3" = "./VZV.d10_r3/quant.sf", "VZV.d15_r1" = "./VZV.d15_r1/quant.sf", "VZV.d15_r2" = "./VZV.d15_r2/quant.sf", "VZV.d15_r3" = "./VZV.d15_r3/quant.sf", "VZV.d20_r1" = "./VZV.d20_r1/quant.sf", "VZV.d20_r2" = "./VZV.d20_r2/quant.sf", "VZV.d20_r3" = "./VZV.d20_r3/quant.sf") # Import the transcript abundance data with tximport txi <- tximport(files, type = "salmon", txIn = TRUE, txOut = TRUE) # Define the replicates and condition of the samples replicate <- factor(c("r1", "r2", "r3", "r1", "r2", "r3", "r1", "r2", "r3")) condition <- factor(c("VZV.d10", "VZV.d10", "VZV.d10", "VZV.d15", "VZV.d15", "VZV.d15", "VZV.d20", "VZV.d20", "VZV.d20")) # Define the colData for DESeq2 colData <- data.frame(condition=condition, replicate=replicate, row.names=names(files)) # -- transcript-level count data -- # Create DESeqDataSet object dds <- DESeqDataSetFromTximport(txi, colData=colData, design=~condition) write.csv(counts(dds), file="transcript_counts.csv") # -- gene-level count data -- # Read in the tx2gene map from salmon_tx2gene.tsv tx2gene <- read.table("salmon_tx2gene.tsv", header=FALSE, stringsAsFactors=FALSE) # Set the column names colnames(tx2gene) <- c("transcript_id", "gene_id", "gene_name") # Remove the gene_name column if not needed tx2gene <- tx2gene[,1:2] # Import and summarize the Salmon data with tximport txi <- tximport(files, type = "salmon", tx2gene = tx2gene, txOut = FALSE) # Continue with the DESeq2 workflow as before... colData <- data.frame(condition=condition, replicate=replicate, row.names=names(files)) dds <- DESeqDataSetFromTximport(txi, colData=colData, design=~condition) #dds <- dds[rowSums(counts(dds) > 3) > 2, ] #60605-->26543 write.csv(counts(dds, normalized=FALSE), file="gene_counts.csv") # MANUALLY set all counts for all virus transcripts in the three control-conditions ("control_r1","control_r2","control_r3") as 0 using Excel # ------ Merge the raw counts of human and microbe ------ #cat ~/DATA/Data_Nina_RNAseq_2026/results_GRCh38/star_salmon/gene_counts.csv ~/DATA/Data_Nina_RNAseq_2026/results_BB1528/star_salmon/gene_counts.csv > merged_gene_counts.csv #DELETE the second line: "","control_r1","control_r2","control_r3","VZV.d10_r1","VZV.d10_r2","VZV.d10_r3","VZV.d15_r1","VZV.d15_r2","VZV.d15_r3","VZV.d20_r1","VZV.d20_r2","VZV.d20_r3" # 1. Remove the hidden ^M (Windows carriage returns) from both files sed -i 's/\r$//' ~/DATA/Data_Nina_RNAseq_2026/results_GRCh38/star_salmon/gene_counts.csv ~/DATA/Data_Nina_RNAseq_2026/results_BB1528/star_salmon/gene_counts.csv # 2. Extract the header from the first file to start the new merged file head -n 1 ~/DATA/Data_Nina_RNAseq_2026/results_GRCh38/star_salmon/gene_counts.csv > merged_gene_counts.csv # 3. Append the data from the first file (skipping its header) tail -n +2 ~/DATA/Data_Nina_RNAseq_2026/results_GRCh38/star_salmon/gene_counts.csv >> merged_gene_counts.csv # 4. Append the data from the second file (skipping its header) tail -n +2 ~/DATA/Data_Nina_RNAseq_2026/results_BB1528/star_salmon/gene_counts.csv >> merged_gene_counts.csv # Should be 60683 lines #~/Tools/csv2xls-0.4/csv_to_xls.py merged_gene_counts.csv -d',' -o raw_gene_counts.xls; # -- for merged analysis due to false normalization factors wenn alone analyzed on virus data -- setwd("~/DATA/Data_Nina_RNAseq_2026/") d.raw <- read.csv("merged_gene_counts.csv", header=TRUE, row.names=1) # Re-define the replicates and condition for the merged counts rownames <- factor(c("control_r1","control_r2","control_r3","VZV.d10_r1","VZV.d10_r2","VZV.d10_r3","VZV.d15_r1","VZV.d15_r2","VZV.d15_r3","VZV.d20_r1","VZV.d20_r2","VZV.d20_r3")) replicate <- factor(c("r1", "r2", "r3", "r1", "r2", "r3", "r1", "r2", "r3", "r1", "r2", "r3")) condition <- factor(c("control", "control", "control", "VZV.d10", "VZV.d10", "VZV.d10", "VZV.d15", "VZV.d15", "VZV.d15", "VZV.d20", "VZV.d20", "VZV.d20")) colData <- data.frame(condition=condition, replicate=replicate, row.names=rownames) #CORRECTED: dds <- DESeqDataSetFromMatrix(countData=d.raw, colData=colData, design=~condition+replicate) #⚠️ A Note on the Design Formula: In the snippet, I used design = ~condition+replicate. In DESeq2, adding replicate (r1, r2, r3) to the design formula tells the model to treat "r1" as a globally systematic batch effect across all conditions. Unless the replicates were processed in distinct batches (e.g., all r1 on Monday, all r2 on Tuesday), it is standard to use design = ~ condition to simply measure the biological variance. Therefore, I have updated the script to use ~ condition. dds <- DESeqDataSetFromMatrix(countData=d.raw, colData=colData, design=~condition) dim(counts(dds)) head(counts(dds), 10) rld <- rlogTransformation(dds) #We don't need to run DESeq(dds) before estimateSizeFactors(dds). In fact, the typical workflow in DESeq2 is the opposite: we usually run estimateSizeFactors(dds) (and other preprocessing functions) before running the main DESeq(dds) function. #The estimateSizeFactors function is used to calculate size factors for normalization, which corrects for differences in library size (i.e., the number of read counts) between samples. This normalization step is crucial to ensure that differences in gene expression aren't merely due to differences in sequencing depth between samples. #The DESeq function, on the other hand, performs the main differential expression analysis, comparing gene expression between different conditions or groups. #So, the typical workflow is: # - Create the DESeqDataSet object. # - Use estimateSizeFactors to normalize for library size. # - (Optionally, estimate dispersion with estimateDispersions if not using the full DESeq function later.) # - Use DESeq for the differential expression analysis. # - However, it's worth noting that if you run the main DESeq function directly after creating the DESeqDataSet object, it will automatically perform the normalization (using estimateSizeFactors) and dispersion estimation steps for you. In that case, there's no need to run estimateSizeFactors separately before DESeq. # draw simple pca and heatmap library(gplots) library("RColorBrewer") #mat <- assay(rld) #mm <- model.matrix(~condition, colData(rld)) #mat <- limma::removeBatchEffect(mat, batch=rld$batch, design=mm) #assay(rld) <- mat # -- pca -- png("pca.png", 1200, 800) plotPCA(rld, intgroup=c("condition")) dev.off() # -- heatmap -- png("heatmap.png", 1200, 800) distsRL <- dist(t(assay(rld))) mat <- as.matrix(distsRL) hc <- hclust(distsRL) hmcol <- colorRampPalette(brewer.pal(9,"GnBu"))(100) heatmap.2(mat, Rowv=as.dendrogram(hc),symm=TRUE, trace="none",col = rev(hmcol), margin=c(13, 13)) dev.off() -
draw 3D PCA plots.
library(gplots) library("RColorBrewer") library(ggplot2) data <- plotPCA(rld, intgroup=c("condition", "replicate"), returnData=TRUE) write.csv(data, file="plotPCA_data.csv") #calculate all PCs including PC3 with the following codes library(genefilter) ntop <- 500 rv <- rowVars(assay(rld)) select <- order(rv, decreasing = TRUE)[seq_len(min(ntop, length(rv)))] mat <- t( assay(rld)[select, ] ) pc <- prcomp(mat) pc$x[,1:3] #df_pc <- data.frame(pc$x[,1:3]) df_pc <- data.frame(pc$x) identical(rownames(data), rownames(df_pc)) #-->TRUE data$PC1 <- NULL data$PC2 <- NULL merged_df <- merge(data, df_pc, by = "row.names") #merged_df <- merged_df[, -1] row.names(merged_df) <- merged_df$Row.names merged_df$Row.names <- NULL # remove the "name" column merged_df$name <- NULL merged_df <- merged_df[, c("PC1","PC2","PC3","PC4","PC5","PC6","PC7","PC8","PC9","PC10","PC11","PC12","group","condition","replicate")] write.csv(merged_df, file="merged_df_12PCs.csv") summary(pc) # PC1 PC2 PC3 #Proportion of Variance 0.5969 0.1732 0.09444 python3 draw_3D.py # Edit on the generated SVG-figure. #/usr/bin/convert PCA_3D.png -crop 2900x1600+250+700 PCA_3D_cropped.png -
(Optional) estimate size factors and dispersion values.
#Size Factors: These are used to normalize the read counts across different samples. The size factor for a sample accounts for differences in sequencing depth (i.e., the total number of reads) and other technical biases between samples. After normalization with size factors, the counts should be comparable across samples. Size factors are usually calculated in a way that they reflect the median or mean ratio of gene expression levels between samples, assuming that most genes are not differentially expressed. #Dispersion: This refers to the variability or spread of gene expression measurements. In RNA-seq data analysis, each gene has its own dispersion value, which reflects how much the counts for that gene vary between different samples, more than what would be expected just due to the Poisson variation inherent in counting. Dispersion is important for accurately modeling the data and for detecting differentially expressed genes. #So in summary, size factors are specific to samples (used to make counts comparable across samples), and dispersion values are specific to genes (reflecting variability in gene expression). sizeFactors(dds) #NULL # Estimate size factors dds <- estimateSizeFactors(dds) # Estimate dispersions dds <- estimateDispersions(dds) #> sizeFactors(dds) #control_r1 control_r2 control_r3 VZV.d10_r1 VZV.d10_r2 VZV.d10_r3 VZV.d15_r1 #1.2072182 1.1009757 0.8610196 1.1965615 0.8575169 1.1714809 0.8344160 #VZV.d15_r2 VZV.d15_r3 VZV.d20_r1 VZV.d20_r2 VZV.d20_r3 #0.5278583 1.1768657 1.1006353 1.1288974 1.2630530 # If alone with virus data, the following BUG occured: #Still NULL --> BUG --> using manual calculation method for sizeFactor calculation! HeLa_TO_r1 HeLa_TO_r2 0.9978755 1.1092227 data.frame(genes = rownames(dds), dispersions = dispersions(dds)) #Given the raw counts, the control_r1 and control_r2 samples seem to have a much lower sequencing depth (total read count) than the other samples. Therefore, when normalization methods are applied, the normalization factors for these control samples will be relatively high, boosting the normalized counts. 1/0.9978755=1.002129023 1/1.1092227= #bamCoverage --bam ../markDuplicates/${sample}Aligned.sortedByCoord.out.bam -o ${sample}_norm.bw --binSize 10 --scaleFactor --effectiveGenomeSize 2864785220 bamCoverage --bam ../markDuplicates/HeLa_TO_r1Aligned.sortedByCoord.out.markDups.bam -o HeLa_TO_r1.bw --binSize 10 --scaleFactor 1.002129023 --effectiveGenomeSize 2864785220 bamCoverage --bam ../markDuplicates/HeLa_TO_r2Aligned.sortedByCoord.out.markDups.bam -o HeLa_TO_r2.bw --binSize 10 --scaleFactor 0.901532217 --effectiveGenomeSize 2864785220 raw_counts <- counts(dds) normalized_counts <- counts(dds, normalized=TRUE) #write.table(raw_counts, file="raw_counts.txt", sep="\t", quote=F, col.names=NA) #write.table(normalized_counts, file="normalized_counts.txt", sep="\t", quote=F, col.names=NA) #convert bam to bigwig using deepTools by feeding inverse of DESeq’s size Factor estimSf <- function (cds){ # Get the count matrix cts <- counts(cds) # Compute the geometric mean geomMean <- function(x) prod(x)^(1/length(x)) # Compute the geometric mean over the line gm.mean <- apply(cts, 1, geomMean) # Zero values are set to NA (avoid subsequentcdsdivision by 0) gm.mean[gm.mean == 0] <- NA # Divide each line by its corresponding geometric mean # sweep(x, MARGIN, STATS, FUN = "-", check.margin = TRUE, ...) # MARGIN: 1 or 2 (line or columns) # STATS: a vector of length nrow(x) or ncol(x), depending on MARGIN # FUN: the function to be applied cts <- sweep(cts, 1, gm.mean, FUN="/") # Compute the median over the columns med <- apply(cts, 2, median, na.rm=TRUE) # Return the scaling factor return(med) } #https://dputhier.github.io/ASG/practicals/rnaseq_diff_Snf2/rnaseq_diff_Snf2.html #http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#data-transformations-and-visualization #https://hbctraining.github.io/DGE_workshop/lessons/02_DGE_count_normalization.html #https://hbctraining.github.io/DGE_workshop/lessons/04_DGE_DESeq2_analysis.html #https://genviz.org/module-04-expression/0004/02/01/DifferentialExpression/ #DESeq2’s median of ratios [1] #EdgeR’s trimmed mean of M values (TMM) [2] #http://www.nathalievialaneix.eu/doc/html/TP1_normalization.html #very good website! test_normcount <- sweep(raw_counts, 2, sizeFactors(dds), "/") sum(test_normcount != normalized_counts) -
Merged Host-Virus Counts -> DESeq2 -> DEGs of 6 Comparisons
Rscript complete_deg_pipeline.R # Key changes made: # 1. Virus gene identification: Added detection of virus genes using pattern matching ("^ORF|cat_RNA|redF_RNA|repE_RNA|sopA_RNA|sopB_RNA|UL45_RNA") # 2. Updated significance cutoff: Changed from |log2FC| >= 1 to |log2FC| >= 2 # 3. Color scheme: # * Green for virus RNAs (all virus genes) # * Dark green for significantly regulated virus genes # * Red for significantly up-regulated host genes # * Blue for significantly down-regulated host genes # * Gray for non-significant host genes # 4. Enhanced legend: The legend now clearly distinguishes between virus RNAs and host genes with their regulation status # 5. Summary statistics: Added detailed output showing counts of virus vs host genes and their regulation status # ================================================================================ # 📊 FINAL SUMMARY OF ALL 6 COMPARISONS # name total up down virus sig_total pct_sig # 03_VZV.d20_vs_control 17594 193 220 77 413 2.3 # 02_VZV.d15_vs_control 17594 139 94 77 233 1.3 # 05_VZV.d20_vs_VZV.d10 17594 35 114 77 149 0.8 # 01_VZV.d10_vs_control 17594 89 20 77 109 0.6 # 06_VZV.d20_vs_VZV.d15 17594 28 38 77 66 0.4 # 04_VZV.d15_vs_VZV.d10 17594 6 13 77 19 0.1
Host and Viral Temporal Transcriptomics of VZV Infection in Human Skin Organoids (Data_Nina_RNAseq_2026)
Leave a reply