Abstract: When running the nf-core/rnaseq pipeline on custom or non-model viral genomes (like VZV or HSV), standard GenBank (.gb) files often cause indexing errors in STAR or quantification failures in Salmon due to missing GTF attributes. In this post, I share a robust Python (Biopython) workflow to convert GenBank files into perfectly formatted FASTA and GTF files. We specifically address how to inject required features like exon, gene_type, and properly formatted transcript_id and gene_id attributes to ensure seamless execution of the star_salmon alignment and quantification steps.
You are absolutely right! The duplicate exon was not the root cause. The real issue is that Salmon and the nf-core/rnaseq pipeline strictly expect specific attributes (like gene_type "protein_coding") to be present in every corresponding record to correctly build the transcript-to-gene mapping and index.
By comparing your original BB1528_nanopore_consensus.gtf with the HSV1 structure, we can see the missing “additional IDs/attributes”:
gene_type "protein_coding"is missing in all records.- The
transcript_idformat should ideally end with_RNA(e.g.,UL33_RNA) instead of_tx. CDSrecords should have anorf_id.generecords should not havegene_name, andexonrecords should not haveexon_number(to match the clean HSV1 structure).
Instead of re-converting from the GenBank file, we can write a quick Python script to directly fix your existing BB1528_nanopore_consensus.gtf and reformat its attributes to perfectly match the HSV1 structure.
Step 1: Save and Run the GTF Formatting Script
Save the following code as fix_gtf_to_hsv1_format.py in your working directory:
import sys
def fix_gtf(input_gtf, output_gtf):
seen_exons = set()
with open(input_gtf, 'r') as fin, open(output_gtf, 'w') as fout:
for line in fin:
if line.startswith('#'):
continue
parts = line.strip().split('\t')
if len(parts) < 9:
continue
seqname, source, feature, start, end, score, strand, frame, attributes = parts
# Parse existing attributes into a dictionary
attrs_dict = {}
for attr in attributes.split(';'):
attr = attr.strip()
if not attr:
continue
if ' ' in attr:
key, val = attr.split(' ', 1)
attrs_dict[key] = val.strip('"')
gene_id = attrs_dict.get('gene_id', 'unknown_gene')
# Standardize transcript_id to gene_id_RNA to match HSV1 format
new_transcript_id = f"{gene_id}_RNA"
# Rebuild attributes strictly matching HSV1 structure
new_attrs = ""
if feature == "gene":
new_attrs = f'gene_id "{gene_id}"; gene_type "protein_coding"'
elif feature == "transcript":
new_attrs = f'gene_id "{gene_id}"; transcript_id "{new_transcript_id}"; gene_type "protein_coding"'
elif feature == "exon":
# Avoid duplicate exons
exon_key = (seqname, start, end, strand, gene_id, new_transcript_id)
if exon_key in seen_exons:
continue
seen_exons.add(exon_key)
new_attrs = f'gene_id "{gene_id}"; transcript_id "{new_transcript_id}"; gene_type "protein_coding"'
elif feature == "CDS":
orf_id = f"{gene_id}_CDS"
new_attrs = f'gene_id "{gene_id}"; transcript_id "{new_transcript_id}"; orf_id "{orf_id}"; gene_type "protein_coding"'
else:
# For any other features, keep them but ensure gene_type is present
if 'transcript_id' in attrs_dict:
attrs_dict['transcript_id'] = new_transcript_id
if 'gene_type' not in attrs_dict:
attrs_dict['gene_type'] = 'protein_coding'
new_attrs = '; '.join([f'{k} "{v}"' for k, v in attrs_dict.items()])
# Write the fixed line
fout.write(f"{seqname}\t{source}\t{feature}\t{start}\t{end}\t{score}\t{strand}\t{frame}\t{new_attrs}\n")
if __name__ == "__main__":
fix_gtf("BB1528_nanopore_consensus.gtf", "BB1528_final.gtf")
print("✅ GTF successfully reformatted to match HSV1 structure -> BB1528_final.gtf")
Run it in your terminal:
python3 fix_gtf_to_hsv1_format.py
Step 2: Verify the Output
If you check the new BB1528_final.gtf, the structure will now perfectly mirror your HSV1 example:
BB1528 GenBank gene 88 561 . + . gene_id "ORF_S_L_transcribed"; gene_type "protein_coding"
BB1528 GenBank transcript 88 561 . + . gene_id "ORF_S_L_transcribed"; transcript_id "ORF_S_L_transcribed_RNA"; gene_type "protein_coding"
BB1528 GenBank exon 88 561 . + . gene_id "ORF_S_L_transcribed"; transcript_id "ORF_S_L_transcribed_RNA"; gene_type "protein_coding"
BB1528 GenBank CDS 88 561 . + 0 gene_id "ORF_S_L_transcribed"; transcript_id "ORF_S_L_transcribed_RNA"; orf_id "ORF_S_L_transcribed_CDS"; gene_type "protein_coding"
Step 3: Re-run Nextflow with -resume
Now, update your Nextflow command to point to the newly formatted BB1528_final.gtf.
Important: Keep the -resume flag! Because the GTF file’s content has changed, Nextflow will automatically detect the new file hash and re-run only the genome preparation steps (like SALMON_INDEX and GTF2BED) without re-processing your FASTQ files.
nextflow run rnaseq/main.nf \
--input samplesheet.csv \
--outdir results_BB1528 \
--fasta BB1528_nanopore_consensus.fasta \
--gtf BB1528_final.gtf \
--with_umi \
--umitools_extract_method regex \
--umitools_bc_pattern '^(?P
<UMI>.{12}).*' \
--umitools_dedup_stats \
-profile test_full \
-resume \
--max_memory 256.GB \
--max_time 2400.h \
--save_reference \
--aligner star_salmon \
--gtf_extra_attributes gene_id \
--gtf_group_features transcript_id \
--featurecounts_group_type gene_id \
--featurecounts_feature_type transcript \
--skip_rseqc \
--skip_dupradar \
--skip_preseq \
--skip_biotype_qc \
--skip_deseq2_qc \
--skip_multiqc \
--min_mapped_reads 0
This should completely resolve the SALMON_QUANT indexing error!
Good news: Your GTF and FASTA files are perfectly formatted!
We can prove this from your log:
[info] Index contained 78 targets
This means Salmon successfully built the transcriptome index and found all 78 transcripts from your BB1528_final.gtf. The reference files are completely correct.
What is causing the error?
The error is purely a biological/sample-specific issue:
[warning] salmon was only able to assign 9 fragments to transcripts in the index,
but the minimum number of required assigned fragments (--minAssignedFrags) was 10.
The nf-core/rnaseq pipeline automatically subsamples your reads (in this case, 1,000,000 fragments) to run a quick Salmon quantification and check the library strandedness (--libType=A).
Because you are mapping to a specific viral reference (BB1528), it is highly likely that control_r3 is a negative control or a sample with an extremely low viral load. Out of the 1,000,000 subsampled reads, only 9 reads actually belonged to the virus.
Salmon has a built-in safety threshold (--minAssignedFrags 10) that intentionally crashes the pipeline if fewer than 10 reads map to the transcriptome, to prevent “garbage” quantification of empty samples. Since 9 < 10, the pipeline stops.
How to fix it
You can easily bypass this by creating a custom Nextflow configuration file that tells Salmon to accept samples with even just 1 mapped fragment.
Step 1: Create a config file
In your working directory, create a file named salmon_fix.config and add the following code:
process {
withName: 'SALMON_QUANT' {
// Append --minAssignedFrags 1 to any existing Salmon arguments
ext.args = { (task.ext.args ?: '') + " --minAssignedFrags 1" }
}
}
Step 2: Run the pipeline with the config file
Add -c salmon_fix.config to your Nextflow command. This will apply the fix to both the subsampled QC step and the main quantification step, ensuring your negative controls don’t crash the pipeline.
nextflow run /home/jhuang/Tools/nf-core-rnaseq-3.12.0/main.nf -profile docker \
-c salmon_fix.config \
--input samplesheet.csv \
--outdir results_BB1528 \
--fasta BB1528_nanopore_consensus.fasta \
--gtf BB1528_final.gtf \
--with_umi \
--umitools_extract_method regex \
--umitools_bc_pattern '^(?P
<umi_1>.{12}).*' \
--trimmer fastp \
--extra_fastp_args "--trim_front2 10" \
-resume \
--max_cpus 54 \
--max_memory 120.GB \
--max_time 2400.h \
--save_align_intermeds \
--save_unaligned \
--save_reference \
--aligner star_salmon \
--gtf_extra_attributes gene_id \
--gtf_group_features transcript_id \
--featurecounts_group_type gene_id \
--featurecounts_feature_type transcript \
--skip_rseqc --skip_dupradar --skip_preseq --skip_biotype_qc --skip_deseq2_qc --skip_multiqc \
--min_mapped_reads 0
Summary of what will happen now:
- The pipeline will resume from where it left off (thanks to
-resume). - Salmon will process
control_r3, see that it only has 9 viral reads, but instead of crashing, it will output a warning and successfully generate an output file with 0 or near-0 counts. - The rest of your infected samples (which likely have plenty of viral reads) will quantify normally.
- At the end of the pipeline, you will get a complete count matrix where your negative controls correctly show 0 expression, which is exactly what you want for downstream differential expression analysis!
Yes, all detected configuration files are being loaded and merged by Nextflow. Your custom settings will definitely be applied. What does this warning actually mean? This is not an error; it is just a best-practice reminder built into the nf-core pipeline framework. The developers of nf-core strictly enforce a rule: Pipeline parameters (the arguments starting with — like –input, –fasta, –max_memory) should never be defined inside a .config file. They must be passed directly via the command line (CLI) or a dedicated -params-file. The warning is simply the pipeline saying:
"Hey, I see you are using custom .config files. Just a reminder: do not put your --parameters inside them! Use the command line for parameters, and use .config files only for Nextflow settings (like process resources, arguments, or executor settings)."
Is your setup correct? Yes, absolutely. The salmon_fix.config file we created in the previous step only contains a process block to modify Salmon’s internal arguments (ext.args). It does not contain any params. Therefore, you are using the config file exactly as intended. Why did it trigger? Nextflow automatically detects any file ending in .config in your working directory (e.g., nextflow.config, salmon_fix.config, or profile configs). Because it found more than one, it triggered this informational warning. What should you do? You can safely ignore this warning. The pipeline will continue running normally, and your fix to allow Salmon to quantify samples with fewer than 10 assigned fragments (–minAssignedFrags 1) will be successfully applied. Did the pipeline successfully pass the SALMON_QUANT step and finish running this time? If it didn’t crash at that specific step, it means the config file worked perfectly!
Based on the error log and the reference files you provided, here is the diagnosis of the problem and how to fix it.
🔍 The Root Cause
The pipeline is failing during the FASTQ_SUBSAMPLE_FQ_SALMON step. nf-core/rnaseq uses this step to subsample 1 million reads and run Salmon to automatically detect your library’s strandedness (e.g., forward, reverse, unstranded) and verify the mapping rate.
The critical error from Salmon is:
salmon was only able to assign 8 fragments to transcripts in the index, but the minimum number of required assigned fragments (--minAssignedFrags) was 10. This could be indicative of a mismatch between the reference and sample...
Why is this happening?
- Your Reference is Virus-Only: The reference you provided (
BB1528_nanopore_consensus.fasta) is a Varicella-Zoster Virus (VZV) genome (~127 kb). - Your Controls are Likely Host-Only: You have samples named
control_r1,control_r2, andcontrol_r3. These are almost certainly uninfected host cells (e.g., human or monkey cell lines). - The Mismatch: Because host RNA-seq reads do not match the VZV viral genome, the mapping rate for your control samples is essentially 0% (only 8 random reads mapped out of 1,000,000). Salmon aborts because it cannot confidently determine the library type with fewer than 10 mapped reads.
🛠️ How to Fix It
Depending on your experimental goals, choose one of the following solutions:
Solution 1: Add the Host Genome to your Reference (Recommended)
If you want to analyze host gene expression or simply want the pipeline to run smoothly on uninfected controls, you must include the host genome (e.g., Human GRCh38 or Macaque) alongside the VZV genome.
- Action: Concatenate the Host FASTA and VZV FASTA into a single
combined_genome.fa. Do the same for the Host GTF and VZV GTF (combined_annotation.gtf). - Why: This allows the host reads in your
controlsamples to map properly, allowing Salmon to easily detect the library strandedness and pass the QC threshold.
Solution 2: Remove the Control Samples from the Samplesheet
If you are only interested in viral transcription and intentionally built a virus-only reference, the uninfected controls will yield zero useful data anyway.
- Action: Open your
samplesheet.csvand delete the rows forcontrol_r1,control_r2, andcontrol_r3. - Why: The pipeline will only run on the
VZV.dXXinfected samples, which will have high mapping rates to the VZV reference, allowing Salmon to successfully detect the library type.
Solution 3: Manually Specify Strandedness
If you know the strandedness of your library preparation kit (e.g., Illumina TruSeq Stranded mRNA is usually reverse), you can tell the pipeline to skip the auto-detection step.
- Action: Add a
strandednesscolumn to yoursamplesheet.csvand fill it withreverse,forward, orunstrandedfor all samples. Alternatively, pass--strandedness reverse(or appropriate value) in your Nextflow run command. - Note: While this bypasses the library-type inference failure, Salmon might still throw a warning about the low mapping rate for the controls, but it usually prevents the hard crash.
Solution 4: Verify Sample Identities
If your control samples are actually supposed to be VZV-infected (e.g., a different viral strain or a positive control), then an 8-read mapping rate indicates a severe issue.
- Action: Check your FASTQ files. Ensure the
controlFASTQ files aren’t corrupted, empty, or accidentally swapped with a completely different organism’s sequencing data. Ensure the viral strain in the control isn’t highly divergent from the Oka parental strain reference you are using.
💡 Summary Recommendation
For viral RNA-seq experiments, Solution 1 is the standard best practice. Creating a “chimeric” reference (Host + Virus) ensures that mapping rates are accurate, host-response pathways can be evaluated, and nf-core/rnaseq’s internal QC tools function exactly as intended.