Abstract: When running the nf-core/rnaseq pipeline on custom or non-model viral genomes (like VZV or HSV), standard GenBank (.gb) files often cause indexing errors in STAR or quantification failures in Salmon due to missing GTF attributes. In this post, I share a robust Python (Biopython) workflow to convert GenBank files into perfectly formatted FASTA and GTF files. We specifically address how to inject required features like exon, gene_type, and properly formatted transcript_id and gene_id attributes to ensure seamless execution of the star_salmon alignment and quantification steps.
You are absolutely right! The duplicate exon was not the root cause. The real issue is that Salmon and the nf-core/rnaseq pipeline strictly expect specific attributes (like gene_type "protein_coding") to be present in every corresponding record to correctly build the transcript-to-gene mapping and index.
By comparing your original BB1528_nanopore_consensus.gtf with the HSV1 structure, we can see the missing “additional IDs/attributes”:
gene_type "protein_coding"is missing in all records.- The
transcript_idformat should ideally end with_RNA(e.g.,UL33_RNA) instead of_tx. CDSrecords should have anorf_id.generecords should not havegene_name, andexonrecords should not haveexon_number(to match the clean HSV1 structure).
Instead of re-converting from the GenBank file, we can write a quick Python script to directly fix your existing BB1528_nanopore_consensus.gtf and reformat its attributes to perfectly match the HSV1 structure.
Step 1: Save and Run the GTF Formatting Script
Save the following code as fix_gtf_to_hsv1_format.py in your working directory:
import sys
def fix_gtf(input_gtf, output_gtf):
seen_exons = set()
with open(input_gtf, 'r') as fin, open(output_gtf, 'w') as fout:
for line in fin:
if line.startswith('#'):
continue
parts = line.strip().split('\t')
if len(parts) < 9:
continue
seqname, source, feature, start, end, score, strand, frame, attributes = parts
# Parse existing attributes into a dictionary
attrs_dict = {}
for attr in attributes.split(';'):
attr = attr.strip()
if not attr:
continue
if ' ' in attr:
key, val = attr.split(' ', 1)
attrs_dict[key] = val.strip('"')
gene_id = attrs_dict.get('gene_id', 'unknown_gene')
# Standardize transcript_id to gene_id_RNA to match HSV1 format
new_transcript_id = f"{gene_id}_RNA"
# Rebuild attributes strictly matching HSV1 structure
new_attrs = ""
if feature == "gene":
new_attrs = f'gene_id "{gene_id}"; gene_type "protein_coding"'
elif feature == "transcript":
new_attrs = f'gene_id "{gene_id}"; transcript_id "{new_transcript_id}"; gene_type "protein_coding"'
elif feature == "exon":
# Avoid duplicate exons
exon_key = (seqname, start, end, strand, gene_id, new_transcript_id)
if exon_key in seen_exons:
continue
seen_exons.add(exon_key)
new_attrs = f'gene_id "{gene_id}"; transcript_id "{new_transcript_id}"; gene_type "protein_coding"'
elif feature == "CDS":
orf_id = f"{gene_id}_CDS"
new_attrs = f'gene_id "{gene_id}"; transcript_id "{new_transcript_id}"; orf_id "{orf_id}"; gene_type "protein_coding"'
else:
# For any other features, keep them but ensure gene_type is present
if 'transcript_id' in attrs_dict:
attrs_dict['transcript_id'] = new_transcript_id
if 'gene_type' not in attrs_dict:
attrs_dict['gene_type'] = 'protein_coding'
new_attrs = '; '.join([f'{k} "{v}"' for k, v in attrs_dict.items()])
# Write the fixed line
fout.write(f"{seqname}\t{source}\t{feature}\t{start}\t{end}\t{score}\t{strand}\t{frame}\t{new_attrs}\n")
if __name__ == "__main__":
fix_gtf("BB1528_nanopore_consensus.gtf", "BB1528_final.gtf")
print("✅ GTF successfully reformatted to match HSV1 structure -> BB1528_final.gtf")
Run it in your terminal:
python3 fix_gtf_to_hsv1_format.py
Step 2: Verify the Output
If you check the new BB1528_final.gtf, the structure will now perfectly mirror your HSV1 example:
BB1528 GenBank gene 88 561 . + . gene_id "ORF_S_L_transcribed"; gene_type "protein_coding"
BB1528 GenBank transcript 88 561 . + . gene_id "ORF_S_L_transcribed"; transcript_id "ORF_S_L_transcribed_RNA"; gene_type "protein_coding"
BB1528 GenBank exon 88 561 . + . gene_id "ORF_S_L_transcribed"; transcript_id "ORF_S_L_transcribed_RNA"; gene_type "protein_coding"
BB1528 GenBank CDS 88 561 . + 0 gene_id "ORF_S_L_transcribed"; transcript_id "ORF_S_L_transcribed_RNA"; orf_id "ORF_S_L_transcribed_CDS"; gene_type "protein_coding"
Step 3: Re-run Nextflow with -resume
Now, update your Nextflow command to point to the newly formatted BB1528_final.gtf.
Important: Keep the -resume flag! Because the GTF file’s content has changed, Nextflow will automatically detect the new file hash and re-run only the genome preparation steps (like SALMON_INDEX and GTF2BED) without re-processing your FASTQ files.
nextflow run rnaseq/main.nf \
--input samplesheet.csv \
--outdir results_BB1528 \
--fasta BB1528_nanopore_consensus.fasta \
--gtf BB1528_final.gtf \
--with_umi \
--umitools_extract_method regex \
--umitools_bc_pattern '^(?P
<UMI>.{12}).*' \
--umitools_dedup_stats \
-profile test_full \
-resume \
--max_memory 256.GB \
--max_time 2400.h \
--save_reference \
--aligner star_salmon \
--gtf_extra_attributes gene_id \
--gtf_group_features transcript_id \
--featurecounts_group_type gene_id \
--featurecounts_feature_type transcript \
--skip_rseqc \
--skip_dupradar \
--skip_preseq \
--skip_biotype_qc \
--skip_deseq2_qc \
--skip_multiqc \
--min_mapped_reads 0
This should completely resolve the SALMON_QUANT indexing error!