gene_x 0 like s 480 view s
Tags: pipeline, RNA-seq
Prepare GTF for non-model virus
If you need to retain all information from the GFF file, you may need to do some post-processing to add the extra attributes back into the GTF file. However, keep in mind that downstream tools which expect GTF format may not correctly handle extra attributes.
# -- Deprecated processing for virus gtf --
#NOT_USED, since it changed a lot!
#gffread X14112.1.gff -T -o X14112.1.gtf
cp X14112.1.gff3 X14112.1.gff3_backup
grep "^##" X14112.1.gff3 > X14112.1_gene.gff3
grep "ID=gene" X14112.1.gff3 >> X14112.1_gene.gff
#!!!!VERY_IMPORTANT!!!!: change type '\tgene\t' to '\texon\t'!
#sed -i -e "s/\tgene\t/\texon\t/g" X14112.1_gene_.gff # since default is --featurecounts_feature_type 'exon'
# -- New processing for virus gtf --
gffread X14112.1_orig.gff -T -o X14112.1_v2.gtf
python3 add_gene_id.py # X14112.1_v2.gtf --> X14112.1_v3.gtf
#------------------------------------
def add_missing_gene_id(input_gtf, output_gtf):
with open(input_gtf, 'r') as in_gtf, open(output_gtf, 'w') as out_gtf:
for line in in_gtf:
if not line.startswith('#'): # Skip header lines
elements = line.strip().split('\t')
attributes = elements[8]
if 'gene_id' not in attributes:
# Extract transcript_id
transcript_id = ''
for attr in attributes.split(';'):
if 'transcript_id' in attr:
transcript_id = attr.strip()
# Prepend transcript_id as gene_id if not empty
if transcript_id != '':
attributes = f'{transcript_id.replace("transcript_id", "gene_id")}; {attributes}'
elements[8] = attributes
line = '\t'.join(elements)
out_gtf.write(line + '\n')
# Use the function
input_gtf = 'X14112.1_v2.gtf' # Path to your input GTF
output_gtf = 'X14112.1_v3.gtf' # Path to the output GTF
add_missing_gene_id(input_gtf, output_gtf)
Human herpesvirus 1, also known as Herpes Simplex Virus type 1 (HSV-1), is a virus with a complex genome encoding around 70-80 genes. The number of genes can vary slightly depending on the specific strain of HSV-1, as well as the methodologies used to identify and annotate the genes.
Please note that the naming convention for viral genes and proteins can sometimes be inconsistent, with multiple names referring to the same gene or protein. IE175, ICP4, and IE gene 3 all refer to the same gene in HSV-1.
# Delete the records if they are intron or manually add gene_name to the records without gene_name.
cp X14112.1_v3.gtf X14112.1_v4.gtf
#Find all recoreds without "gene_name"
grep -v "gene_name" X14112.1_v4.gtf
#-->Delete intron records: grep "intron" X14112.1.gff3_orig
DEL X14112.1 EMBL transcript 4953 6907 . - . transcript_id "id-X14112.1:4953..6907"; gene_id "id-X14112.1:4953..6907"
DEL X14112.1 EMBL exon 4953 6907 . - . gene_id "id-X14112.1:4953..6907"; transcript_id "id-X14112.1:4953..6907";
DEL X14112.1 EMBL transcript 132374 132539 . + . transcript_id "id-X14112.1:132374..132539"; gene_id "id-X14112.1:132374..132539"
DEL X14112.1 EMBL exon 132374 132539 . + . gene_id "id-X14112.1:132374..132539"; transcript_id "id-X14112.1:132374..132539";
DEL X14112.1 EMBL transcript 145649 145860 . - . transcript_id "id-X14112.1:145649..145860"; gene_id "id-X14112.1:145649..145860"
DEL X14112.1 EMBL exon 145649 145860 . - . gene_id "id-X14112.1:145649..145860"; transcript_id "id-X14112.1:145649..145860";
# or update: grep "146805" X14112.1_orig.gff
UPDATE X14112.1 EMBL transcript 146805 151063 . + . transcript_id "rna-X14112.1:146805..151063"; gene_id "rna-X14112.1:146805..151063"
UPDATE X14112.1 EMBL exon 146805 151063 . + . gene_id "rna-X14112.1:146805..151063"; transcript_id "rna-X14112.1:146805..151063";
--> transcript_id "rna-IE175"; gene_id "gene-IE175"; gene_name "IE175"; --> transcript_id "rna-IE175"; gene_id "gene-IE175"; gene_name "IE175";
# or update: grep "133941" X14112.1_orig.gff
UPDATE X14112.1 EMBL transcript 133941 146107 . - . transcript_id "rna-X14112.1:133941..146107"; gene_id "rna-X14112.1:133941..146107"
UPDATE X14112.1 EMBL exon 133941 145648 . - . gene_id "rna-X14112.1:133941..146107"; transcript_id "rna-X14112.1:133941..146107";
UPDATE X14112.1 EMBL exon 145861 146107 . - . gene_id "rna-X14112.1:133941..146107"; transcript_id "rna-X14112.1:133941..146107";
--> transcript_id "rna-IE68"; gene_id "rna-IE68"; gene_name "IE68";
--> gene_id "rna-IE68"; transcript_id "rna-IE68"; gene_name "IE68";
--> gene_id "rna-IE68"; transcript_id "rna-IE68"; gene_name "IE68";
(optional) consider to update all exon and CDS with different names! for example exon-RL2-1, exon-RL2-2, cds-RL2-1. Maybe it is not nessary, since the output contains only transcript-type!
Run nexflow for virus
docker pull nfcore/rnaseq
/usr/local/bin/nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results_virus --fasta "/home/jhuang/DATA/Data_Manja_RNAseq_Organoids_Virus/X14112.1.fasta" --gtf "/home/jhuang/DATA/Data_Manja_RNAseq_Organoids_Virus/X14112.1_v4.gtf" --with_umi --umitools_extract_method "regex" --umitools_bc_pattern "^(?P<umi_1>.{12}).*" --umitools_dedup_stats --skip_rseqc --skip_dupradar --skip_preseq -profile docker -resume --max_cpus 55 --max_memory 120.GB --max_time 2400.h --save_align_intermeds --save_unaligned --save_reference --aligner 'hisat2' --gtf_extra_attributes 'gene_name' --gtf_group_features 'gene_id' --featurecounts_group_type 'gene_name' --featurecounts_feature_type 'exon' --umitools_grouping_method 'unique'
Run nexflow for human using GRCh38 genome
docker pull nfcore/rnaseq
/usr/local/bin/nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results_GRCh38 --genome GRCh38 --with_umi --umitools_extract_method "regex" --umitools_bc_pattern "^(?P<umi_1>.{12}).*" --umitools_dedup_stats --skip_rseqc --skip_dupradar --skip_preseq -profile docker -resume --max_cpus 55 --max_memory 128.GB --max_time 2400.h --save_align_intermeds --save_unaligned --save_reference --aligner 'star_salmon' --pseudo_aligner 'salmon' --gtf_extra_attributes 'gene_name' --gtf_group_features 'gene_id' --featurecounts_group_type 'gene_biotype' --featurecounts_feature_type 'exon' --umitools_grouping_method 'unique'
3.1. BUG_1 for running d8_r1 due to memory
# in modules/nf-core/umitools/dedup/main.nf
process UMITOOLS_DEDUP {
tag "$meta.id"
//REMOVED label "process_medium"
//ADDED
label 'high_memory' // this needs to be defined in your config file
cpus 55 // adjust as per your system's capabilities
ERROR ~ Module compilation error
- file : /mnt/h1/jhuang/DATA/Data_Manja_RNAseq_Organoids/rnaseq/./workflows/../subworkflows/nf-core/bam_dedup_stats_samtools_umitools/../../../modules/nf-core/umitools/dedup/main.nf
- cause: Unexpected character: '#' @ line 3, column 5.
#label "process_medium"
^
3.2. BUG_2 for running d8_r1 due to memory
# in conf/test_full.config
process {
//ADDED
withLabel: 'high_memory' {
memory = '120 GB' // adjust as per your system's capabilities
}
withName: 'UMITOOLS_DEDUP' {
time = '160.h' // Adjust the time limit to your needs
}
}
ERROR ~ Error executing process > 'NFCORE_RNASEQ:RNASEQ:BAM_DEDUP_STATS_SAMTOOLS_UMITOOLS_TRANSCRIPTOME:UMITOOLS_DEDUP (control_r2)'
Caused by:
Process requirement exceeds available memory -- req: 128 GB; avail: 125.8 GB
Command executed:
PYTHONHASHSEED=0 umi_tools \
dedup \
-I control_r2.transcriptome.sorted.bam \
-S control_r2.umi_dedup.transcriptome.sorted.bam \
--output-stats control_r2.umi_dedup.transcriptome.sorted \
--method='unique' --random-seed=100
cat <<-END_VERSIONS > versions.yml
"NFCORE_RNASEQ:RNASEQ:BAM_DEDUP_STATS_SAMTOOLS_UMITOOLS_TRANSCRIPTOME:UMITOOLS_DEDUP":
umitools: $(umi_tools --version 2>&1 | sed 's/^.*UMI-tools version://; s/ *$//')
END_VERSIONS
Command exit status:
# Import the required libraries
library("AnnotationDbi")
library("clusterProfiler")
library("ReactomePA")
library(gplots)
library(tximport)
library(DESeq2)
setwd("~/DATA/Data_Manja_RNAseq_Organoids/results_GRCh38_unique_9samples/star_salmon")
# Define paths to your Salmon output quantification files
files <- c("control_r1" = "./control_r1/quant.sf",
"control_r2" = "./control_r2/quant.sf",
"HSV.d2_r1" = "./HSV.d2_r1/quant.sf",
"HSV.d2_r2" = "./HSV.d2_r2/quant.sf",
"HSV.d4_r1" = "./HSV.d4_r1/quant.sf",
"HSV.d4_r2" = "./HSV.d4_r2/quant.sf",
"HSV.d6_r1" = "./HSV.d6_r1/quant.sf",
"HSV.d6_r2" = "./HSV.d6_r2/quant.sf",
"HSV.d8_r2" = "./HSV.d8_r2/quant.sf")
# Import the transcript abundance data with tximport
txi <- tximport(files, type = "salmon", txIn = TRUE, txOut = TRUE)
# Define the replicates and condition of the samples
replicate <- factor(c("r1", "r2", "r1", "r2", "r1", "r2", "r1", "r2", "r2"))
condition <- factor(c("control", "control", "HSV.d2", "HSV.d2", "HSV.d4", "HSV.d4", "HSV.d6", "HSV.d6", "HSV.d8"))
# Define the colData for DESeq2
colData <- data.frame(condition=condition, replicate=replicate, row.names=names(files))
# Create DESeqDataSet object
dds <- DESeqDataSetFromTximport(txi, colData=colData, design=~condition)
# In the context of your new code which is using tximport and DESeq2, you don't necessarily need this step. The reason is that DESeq2 performs its own filtering of low-count genes during the normalization and differential expression steps.
# Filter data to retain only genes with more than 2 counts > 3 across all samples
# dds <- dds[rowSums(counts(dds) > 3) > 2, ]
# Run DESeq2
dds <- DESeq(dds)
# Perform rlog transformation
rld <- rlogTransformation(dds)
# Output raw count data to a CSV file
write.csv(counts(dds), file="transcript_counts.csv")
# -- gene-level count data --
# Read in the tx2gene map from salmon_tx2gene.tsv
#tx2gene <- read.csv("salmon_tx2gene.tsv", sep="\t", header=FALSE)
tx2gene <- read.table("salmon_tx2gene.tsv", header=FALSE, stringsAsFactors=FALSE)
# Set the column names
colnames(tx2gene) <- c("transcript_id", "gene_id", "gene_name")
# Remove the gene_name column if not needed
tx2gene <- tx2gene[,1:2]
# Import and summarize the Salmon data with tximport
txi <- tximport(files, type = "salmon", tx2gene = tx2gene, txOut = FALSE)
# Continue with the DESeq2 workflow as before...
colData <- data.frame(condition=condition, replicate=replicate, row.names=names(files))
dds <- DESeqDataSetFromTximport(txi, colData=colData, design=~condition)
#dds <- dds[rowSums(counts(dds) > 3) > 2, ] #60605-->26543
dds <- DESeq(dds)
rld <- rlogTransformation(dds)
write.csv(counts(dds, normalized=FALSE), file="gene_counts.csv")
#TODO: why a lot of reads were removed due to the too_short?
STAR --runThreadN 4 --genomeDir /path/to/GenomeDir --readFilesIn /path/to/read1.fastq /path/to/read2.fastq --outFilterMatchNmin 50 --outSAMtype BAM SortedByCoordinate --outFileNamePrefix /path/to/output
dim(counts(dds))
head(counts(dds), 10)
WHY: too much "Unmapped: too short" in STAR using nextflow, I read the STAR manual, the default values are 0, it means it will never discard reads due to too short. ASSUMPTION: the umi_tools dedup counts as "Unmapped: too short". Check the intermediate bam files.
Indeed, you're right that according to the STAR manual, the default settings for read length filtering parameters (--outFilterScoreMinOverLread, --outFilterMatchNminOverLread, and --outFilterMatchNmin) are set to 0, which means reads will not be discarded for being too short.
The "Unmapped: too short" statistic in the STAR output usually refers to reads that didn't map well enough to the reference genome. In other words, either the read didn't match sufficiently to any location in the genome, or the best matches were not significantly better than other matches to be unique.
If you're seeing a high proportion of "Unmapped: too short" reads, it could be due to several reasons:
For troubleshooting, you could:
Check the quality of your reads using a tool like FastQC. * Ensure you're using the correct reference genome. * If you're working with single-end data, consider checking the length distribution of your reads. If many reads are shorter than the minimum length threshold, consider adjusting this threshold if appropriate. * If you're working with paired-end data, make sure both of the reads in a pair pass the quality controls. STAR is sensitive to inconsistencies between paired reads.
Finally, always make sure your read preprocessing steps (like trimming for quality and adapter sequences) are performed correctly as these can greatly affect the downstream analysis.
点赞本文的读者
还没有人对此文章表态
没有评论
RNA-seq 2024 Ute from raw counts
Preparing a GTF file from GenBank for bacterial RNA-seq analysis, using the example of WA
© 2023 XGenes.com Impressum