TODO_TOMORROW: How to Convert GenBank Files to GTF/FASTA for the nf-core/rnaseq Pipeline (Data_Nina_RNAseq_2026)

Abstract: When running the nf-core/rnaseq pipeline on custom or non-model viral genomes (like VZV or HSV), standard GenBank (.gb) files often cause indexing errors in STAR or quantification failures in Salmon due to missing GTF attributes. In this post, I share a robust Python (Biopython) workflow to convert GenBank files into perfectly formatted FASTA and GTF files. We specifically address how to inject required features like exon, gene_type, and properly formatted transcript_id and gene_id attributes to ensure seamless execution of the star_salmon alignment and quantification steps.

You are absolutely right! The duplicate exon was not the root cause. The real issue is that Salmon and the nf-core/rnaseq pipeline strictly expect specific attributes (like gene_type "protein_coding") to be present in every corresponding record to correctly build the transcript-to-gene mapping and index.

By comparing your original BB1528_nanopore_consensus.gtf with the HSV1 structure, we can see the missing “additional IDs/attributes”:

  1. gene_type "protein_coding" is missing in all records.
  2. The transcript_id format should ideally end with _RNA (e.g., UL33_RNA) instead of _tx.
  3. CDS records should have an orf_id.
  4. gene records should not have gene_name, and exon records should not have exon_number (to match the clean HSV1 structure).

Instead of re-converting from the GenBank file, we can write a quick Python script to directly fix your existing BB1528_nanopore_consensus.gtf and reformat its attributes to perfectly match the HSV1 structure.

Step 1: Save and Run the GTF Formatting Script

Save the following code as fix_gtf_to_hsv1_format.py in your working directory:

import sys

def fix_gtf(input_gtf, output_gtf):
    seen_exons = set()

    with open(input_gtf, 'r') as fin, open(output_gtf, 'w') as fout:
        for line in fin:
            if line.startswith('#'):
                continue
            parts = line.strip().split('\t')
            if len(parts) < 9:
                continue

            seqname, source, feature, start, end, score, strand, frame, attributes = parts

            # Parse existing attributes into a dictionary
            attrs_dict = {}
            for attr in attributes.split(';'):
                attr = attr.strip()
                if not attr:
                    continue
                if ' ' in attr:
                    key, val = attr.split(' ', 1)
                    attrs_dict[key] = val.strip('"')

            gene_id = attrs_dict.get('gene_id', 'unknown_gene')
            # Standardize transcript_id to gene_id_RNA to match HSV1 format
            new_transcript_id = f"{gene_id}_RNA"

            # Rebuild attributes strictly matching HSV1 structure
            new_attrs = ""
            if feature == "gene":
                new_attrs = f'gene_id "{gene_id}"; gene_type "protein_coding"'
            elif feature == "transcript":
                new_attrs = f'gene_id "{gene_id}"; transcript_id "{new_transcript_id}"; gene_type "protein_coding"'
            elif feature == "exon":
                # Avoid duplicate exons
                exon_key = (seqname, start, end, strand, gene_id, new_transcript_id)
                if exon_key in seen_exons:
                    continue
                seen_exons.add(exon_key)
                new_attrs = f'gene_id "{gene_id}"; transcript_id "{new_transcript_id}"; gene_type "protein_coding"'
            elif feature == "CDS":
                orf_id = f"{gene_id}_CDS"
                new_attrs = f'gene_id "{gene_id}"; transcript_id "{new_transcript_id}"; orf_id "{orf_id}"; gene_type "protein_coding"'
            else:
                # For any other features, keep them but ensure gene_type is present
                if 'transcript_id' in attrs_dict:
                    attrs_dict['transcript_id'] = new_transcript_id
                if 'gene_type' not in attrs_dict:
                    attrs_dict['gene_type'] = 'protein_coding'
                new_attrs = '; '.join([f'{k} "{v}"' for k, v in attrs_dict.items()])

            # Write the fixed line
            fout.write(f"{seqname}\t{source}\t{feature}\t{start}\t{end}\t{score}\t{strand}\t{frame}\t{new_attrs}\n")

if __name__ == "__main__":
    fix_gtf("BB1528_nanopore_consensus.gtf", "BB1528_final.gtf")
    print("✅ GTF successfully reformatted to match HSV1 structure -> BB1528_final.gtf")

Run it in your terminal:

python3 fix_gtf_to_hsv1_format.py

Step 2: Verify the Output

If you check the new BB1528_final.gtf, the structure will now perfectly mirror your HSV1 example:

BB1528  GenBank gene    88  561 .   +   .   gene_id "ORF_S_L_transcribed"; gene_type "protein_coding"
BB1528  GenBank transcript  88  561 .   +   .   gene_id "ORF_S_L_transcribed"; transcript_id "ORF_S_L_transcribed_RNA"; gene_type "protein_coding"
BB1528  GenBank exon    88  561 .   +   .   gene_id "ORF_S_L_transcribed"; transcript_id "ORF_S_L_transcribed_RNA"; gene_type "protein_coding"
BB1528  GenBank CDS 88  561 .   +   0   gene_id "ORF_S_L_transcribed"; transcript_id "ORF_S_L_transcribed_RNA"; orf_id "ORF_S_L_transcribed_CDS"; gene_type "protein_coding"

Step 3: Re-run Nextflow with -resume

Now, update your Nextflow command to point to the newly formatted BB1528_final.gtf.

Important: Keep the -resume flag! Because the GTF file’s content has changed, Nextflow will automatically detect the new file hash and re-run only the genome preparation steps (like SALMON_INDEX and GTF2BED) without re-processing your FASTQ files.

nextflow run rnaseq/main.nf \
    --input samplesheet.csv \
    --outdir results_BB1528 \
    --fasta BB1528_nanopore_consensus.fasta \
    --gtf BB1528_final.gtf \
    --with_umi \
    --umitools_extract_method regex \
    --umitools_bc_pattern '^(?P
<UMI>.{12}).*' \
    --umitools_dedup_stats \
    -profile test_full \
    -resume \
    --max_memory 256.GB \
    --max_time 2400.h \
    --save_reference \
    --aligner star_salmon \
    --gtf_extra_attributes gene_id \
    --gtf_group_features transcript_id \
    --featurecounts_group_type gene_id \
    --featurecounts_feature_type transcript \
    --skip_rseqc \
    --skip_dupradar \
    --skip_preseq \
    --skip_biotype_qc \
    --skip_deseq2_qc \
    --skip_multiqc \
    --min_mapped_reads 0

This should completely resolve the SALMON_QUANT indexing error!

Leave a Reply

Your email address will not be published. Required fields are marked *