Daily Archives: 2026年6月17日

TODO_TOMORROW: How to Convert GenBank Files to GTF/FASTA for the nf-core/rnaseq Pipeline (Data_Nina_RNAseq_2026)

Abstract: When running the nf-core/rnaseq pipeline on custom or non-model viral genomes (like VZV or HSV), standard GenBank (.gb) files often cause indexing errors in STAR or quantification failures in Salmon due to missing GTF attributes. In this post, I share a robust Python (Biopython) workflow to convert GenBank files into perfectly formatted FASTA and GTF files. We specifically address how to inject required features like exon, gene_type, and properly formatted transcript_id and gene_id attributes to ensure seamless execution of the star_salmon alignment and quantification steps.

You are absolutely right! The duplicate exon was not the root cause. The real issue is that Salmon and the nf-core/rnaseq pipeline strictly expect specific attributes (like gene_type "protein_coding") to be present in every corresponding record to correctly build the transcript-to-gene mapping and index.

By comparing your original BB1528_nanopore_consensus.gtf with the HSV1 structure, we can see the missing “additional IDs/attributes”:

  1. gene_type "protein_coding" is missing in all records.
  2. The transcript_id format should ideally end with _RNA (e.g., UL33_RNA) instead of _tx.
  3. CDS records should have an orf_id.
  4. gene records should not have gene_name, and exon records should not have exon_number (to match the clean HSV1 structure).

Instead of re-converting from the GenBank file, we can write a quick Python script to directly fix your existing BB1528_nanopore_consensus.gtf and reformat its attributes to perfectly match the HSV1 structure.

Step 1: Save and Run the GTF Formatting Script

Save the following code as fix_gtf_to_hsv1_format.py in your working directory:

import sys

def fix_gtf(input_gtf, output_gtf):
    seen_exons = set()

    with open(input_gtf, 'r') as fin, open(output_gtf, 'w') as fout:
        for line in fin:
            if line.startswith('#'):
                continue
            parts = line.strip().split('\t')
            if len(parts) < 9:
                continue

            seqname, source, feature, start, end, score, strand, frame, attributes = parts

            # Parse existing attributes into a dictionary
            attrs_dict = {}
            for attr in attributes.split(';'):
                attr = attr.strip()
                if not attr:
                    continue
                if ' ' in attr:
                    key, val = attr.split(' ', 1)
                    attrs_dict[key] = val.strip('"')

            gene_id = attrs_dict.get('gene_id', 'unknown_gene')
            # Standardize transcript_id to gene_id_RNA to match HSV1 format
            new_transcript_id = f"{gene_id}_RNA"

            # Rebuild attributes strictly matching HSV1 structure
            new_attrs = ""
            if feature == "gene":
                new_attrs = f'gene_id "{gene_id}"; gene_type "protein_coding"'
            elif feature == "transcript":
                new_attrs = f'gene_id "{gene_id}"; transcript_id "{new_transcript_id}"; gene_type "protein_coding"'
            elif feature == "exon":
                # Avoid duplicate exons
                exon_key = (seqname, start, end, strand, gene_id, new_transcript_id)
                if exon_key in seen_exons:
                    continue
                seen_exons.add(exon_key)
                new_attrs = f'gene_id "{gene_id}"; transcript_id "{new_transcript_id}"; gene_type "protein_coding"'
            elif feature == "CDS":
                orf_id = f"{gene_id}_CDS"
                new_attrs = f'gene_id "{gene_id}"; transcript_id "{new_transcript_id}"; orf_id "{orf_id}"; gene_type "protein_coding"'
            else:
                # For any other features, keep them but ensure gene_type is present
                if 'transcript_id' in attrs_dict:
                    attrs_dict['transcript_id'] = new_transcript_id
                if 'gene_type' not in attrs_dict:
                    attrs_dict['gene_type'] = 'protein_coding'
                new_attrs = '; '.join([f'{k} "{v}"' for k, v in attrs_dict.items()])

            # Write the fixed line
            fout.write(f"{seqname}\t{source}\t{feature}\t{start}\t{end}\t{score}\t{strand}\t{frame}\t{new_attrs}\n")

if __name__ == "__main__":
    fix_gtf("BB1528_nanopore_consensus.gtf", "BB1528_final.gtf")
    print("✅ GTF successfully reformatted to match HSV1 structure -> BB1528_final.gtf")

Run it in your terminal:

python3 fix_gtf_to_hsv1_format.py

Step 2: Verify the Output

If you check the new BB1528_final.gtf, the structure will now perfectly mirror your HSV1 example:

BB1528  GenBank gene    88  561 .   +   .   gene_id "ORF_S_L_transcribed"; gene_type "protein_coding"
BB1528  GenBank transcript  88  561 .   +   .   gene_id "ORF_S_L_transcribed"; transcript_id "ORF_S_L_transcribed_RNA"; gene_type "protein_coding"
BB1528  GenBank exon    88  561 .   +   .   gene_id "ORF_S_L_transcribed"; transcript_id "ORF_S_L_transcribed_RNA"; gene_type "protein_coding"
BB1528  GenBank CDS 88  561 .   +   0   gene_id "ORF_S_L_transcribed"; transcript_id "ORF_S_L_transcribed_RNA"; orf_id "ORF_S_L_transcribed_CDS"; gene_type "protein_coding"

Step 3: Re-run Nextflow with -resume

Now, update your Nextflow command to point to the newly formatted BB1528_final.gtf.

Important: Keep the -resume flag! Because the GTF file’s content has changed, Nextflow will automatically detect the new file hash and re-run only the genome preparation steps (like SALMON_INDEX and GTF2BED) without re-processing your FASTQ files.

nextflow run rnaseq/main.nf \
    --input samplesheet.csv \
    --outdir results_BB1528 \
    --fasta BB1528_nanopore_consensus.fasta \
    --gtf BB1528_final.gtf \
    --with_umi \
    --umitools_extract_method regex \
    --umitools_bc_pattern '^(?P
<UMI>.{12}).*' \
    --umitools_dedup_stats \
    -profile test_full \
    -resume \
    --max_memory 256.GB \
    --max_time 2400.h \
    --save_reference \
    --aligner star_salmon \
    --gtf_extra_attributes gene_id \
    --gtf_group_features transcript_id \
    --featurecounts_group_type gene_id \
    --featurecounts_feature_type transcript \
    --skip_rseqc \
    --skip_dupradar \
    --skip_preseq \
    --skip_biotype_qc \
    --skip_deseq2_qc \
    --skip_multiqc \
    --min_mapped_reads 0

This should completely resolve the SALMON_QUANT indexing error!

Comprehensive Reproducible Pipeline for Longitudinal Nasal Microbiome and *S. epidermidis* Epidome Analysis (DATA_B/Data_Luise_Epidome_longitudinal_nose)

readme

phyloseq-rmd

microbiotaprocess-r

1. Study Design & Metadata

This pipeline analyzes longitudinal nasal swab samples from patients undergoing neurosurgery.

  • Cohort A (Aneurysm): 15 patients (45 samples)
  • Cohort H (Hypophysis): 20 patients (60 samples)
  • Total: 108 patient samples + 3 extraction controls.
  • Timepoints:
    • .1 = Admission (Baseline)
    • .2 = Surgery (Intraoperative/Immediate post-op)
    • .3 = Discharge (Recovery)

Targeted Sequencing Approaches:

  1. 16S rRNA Gene Amplicon Sequencing: For overall nasal microbiome profiling (processed via QIIME1 open-reference picking against SILVA 132).
  2. Epidome Method (S. epidermidis): Targeted amplicon sequencing of the g216 and yycH genes for high-resolution Sequence Type (ST) tracking (processed via DADA2 and custom Epidome Python/R scripts).

2. R Environment & Dependencies

To reproduce this analysis, ensure you are using R (v4.1+) and have the following packages installed.

# Core Microbiome & Phyloseq
install.packages(c("phyloseq", "microbiome", "picante", "vegan", "ape"))
# Bioconductor packages
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install(c("DESeq2", "microbiome", "phyloseq"))

# Advanced Processing & Visualization
install.packages(c("MicrobiotaProcess", "microeco", "ggplot2", "ggpubr", "ggrepel", "aplot"))
install.packages(c("ggh4x", "gghalves", "ggalluvial", "RColorBrewer", "heatmaply", "gplots"))

# Statistical Testing & Data Wrangling
install.packages(c("dplyr", "tidyr", "rstatix", "RVAideMemoire", "openxlsx", "knitr", "kableExtra", "tibble", "reshape2", "coin", "BSDA"))

3. Workflow Part I: Overall Microbiome (16S) Analysis

Scripts involved: MicrobiotaProcess.R (Primary engine for diversity/ordination) & Phyloseq.Rmd (Reporting, DESeq2, and specific statistical tests).

3.1 Data Import & Normalization Strategy

The pipeline splits the raw OTU table (table_even9893.biom) into distinct objects based on the downstream statistical requirements:

  • ps_filt: Raw counts, low-depth samples removed (min 1000 reads). Used for DESeq2 and Hellinger transformation.
  • ps_rarefied: Rarefied to an even depth of 9,893 reads (set.seed(9242)). Used for Alpha diversity and UniFrac/Jaccard metrics.
  • ps_abund_rel: Relative abundance, filtered to keep only taxa with mean relative abundance > 0.1%. Used for clean taxonomic composition plots.
  • hellinger (via MicrobiotaProcess): Hellinger-transformed ps_filt. Used for Bray-Curtis distances and PCoA.

3.2 Alpha Diversity (Longitudinal & Cross-Sectional)

  • Metrics: Observed OTUs, Chao1, ACE, Shannon, Simpson, Pielou.
  • Longitudinal Analysis (Within Cohorts A and H):
    • Overall temporal shift: Friedman test (non-parametric repeated measures) accounting for PatientID pairing.
    • Post-hoc pairwise: Paired t-tests and Paired Wilcoxon signed-rank tests with Benjamini-Hochberg (BH) correction.
  • Cross-Sectional Analysis (Cohort A vs. Cohort H):
    • All timepoints pooled: Independent t-test / Wilcoxon rank-sum test.
    • Surgery timepoint ONLY: To prevent dilution of the surgical effect by baseline/discharge samples, a specific filter (SampleType == "surgery") is applied before running the independent t-test/Wilcoxon.

3.3 Beta Diversity & Ordination

  • Distance Metric: Bray-Curtis dissimilarity calculated on Hellinger-transformed non-rarefied counts.
  • Ordination: Principal Coordinates Analysis (PCoA).
  • Statistical Testing:
    • Overall: PERMANOVA (vegan::adonis2) with 9,999 permutations.
    • Post-hoc Pairwise: All 15 possible pairwise group comparisons (e.g., A1 vs A2, A1 vs H1, H2 vs H3) are calculated using adonis2. P-values are adjusted using BH (FDR) and Bonferroni corrections.
  • Outputs: PCoA plots (colored by group, sized by Shannon, alpha by Observed), exported as PNG/PDF/SVG. Pairwise PERMANOVA results exported to Bray_pairwise_PERMANOVA.xlsx.

3.4 Taxonomic Composition

  • Plots: Stacked barplots and Heatmaps at the Class level (Top 20 taxa).
  • Grouping: Plots generated for individual samples, grouped by Cohort (A vs H), and grouped by Timepoint (Admission, Surgery, Discharge).

3.5 Differential Abundance Analysis (DESeq2)

  • Method: Negative Binomial Generalized Linear Model (Wald test) using raw, non-rarefied integer counts (ps_filt).
  • Comparisons: Six independent pairwise comparisons per cohort:
    1. Admission vs. Surgery
    2. Admission vs. Discharge
    3. Surgery vs. Discharge (Repeated for both Cohort A and Cohort H)
  • Threshold: Adjusted P-value (BH) < 0.05.
  • Outputs: Volcano-style plots (log2FoldChange by Family/Class) and Excel tables of significant DEGs for each comparison.

4. Workflow Part II: S. epidermidis Epidome Analysis

Script involved: Phyloseq.Rmd (Sections: ST summary, Binary Prevalence Analysis). Input Data: count_table_seq31_seq37_ST.txt (ASVs classified into STs via g216 and yycH BLAST against the Epidome DB). Normalized to median depth (56,191).

4.1 ST Composition & Alpha Diversity

  • Composition: Stacked barplots of relative ST abundance across the 108 samples.
  • Alpha Diversity: Shannon index and Observed STs compared between Cohorts (t-test) and across Timepoints (t-test).
  • goeBURST: Clonal complex (CC) visualization based on PubMLST allele profiles.

4.2 Continuous Abundance Testing (Specific STs)

  • Method: Wilcoxon signed-rank test (paired, Admission .1 vs. Discharge .3).
  • Targets: Analyzed individually for clinically relevant STs (ST2, ST5, ST8, ST23, ST59, ST60, ST73, ST100, ST130, ST215) and in combined clinical groups (e.g., ST5+ST87+ST130+ST210+ST384).
  • Handling Ties: Evaluated using exact calculations via the coin package and the Sign Test (BSDA) to ensure robustness against zero-inflation.

4.3 🌟 Novel Binary Prevalence Analysis (Cochran’s Q & McNemar)

Because continuous abundance of specific high-risk STs (ST2, ST5, ST23) showed no significant shifts, we converted the data to a binary presence/absence matrix to track strain acquisition and clearance.

  1. Data Transformation: Abundance > 0 becomes 1 (Present); becomes (Absent).
  2. Overall Temporal Shift: Cochran’s Q test (RVAideMemoire::cochran.qtest) to evaluate if the prevalence proportion changes across the 3 timepoints (Admission, Surgery, Discharge) for a given ST.
  3. Pairwise Transitions: Robust McNemar tests for paired binary data (Admission vs Surgery, Surgery vs Discharge, Admission vs Discharge).
    • Robustness: Custom R function handles zero-variance (no discordant pairs) by assigning $p = 1.0$ instead of crashing.
  4. Correction: Benjamini-Hochberg (BH) FDR correction applied to the 3 pairwise p-values.
  5. Visualization: Line charts with scatter points showing the Prevalence (%) trend across the 3 timepoints, annotated with $n/N$ (number of positive patients / total patients).
  6. Export: The binary matrix is exported to ST_binary_matrix.xlsx.

5. Execution Instructions (How to Reproduce)

Step 1: Prepare the Directory

Ensure your working directory contains the following raw files:

  • table_even9893.biom and ../clustering/rep_set.tre (16S data)
  • ../map3_corrected.txt (Metadata)
  • ../count_table_seq31_seq37_ST.txt (Epidome ST counts)
  • adiv_even.txt, adiv_even_A.txt, adiv_even_H.txt (Pre-calculated 16S alpha metrics from QIIME)
  • alpha_diversity_metrics_samples_ST.csv (Pre-calculated ST alpha metrics)

Step 2: Run the MicrobiotaProcess Pipeline (Beta/Alpha/Composition)

Open your terminal or RStudio and run the MicrobiotaProcess.R script. This will generate the figures_MP/ and figures_All_Combined/ directories.

Rscript MicrobiotaProcess.R

Outputs: PCoA plots, rarefaction curves, Bray-Curtis distance matrices, pairwise PERMANOVA Excel files, and Class-level heatmaps.

Step 3: Render the Comprehensive HTML Report

Open RStudio, load Phyloseq.Rmd, and render it. This integrates the outputs from Step 2, runs DESeq2, performs the Epidome statistical tests, and generates the final interactive report.

# In R Console:
rmarkdown::render('Phyloseq.Rmd', output_file = 'Phyloseq.html')

Outputs: Phyloseq.html (the master report), figures/ directory (DESeq2 plots, ST prevalence plots, goeBURST), and various .xlsx files for ST binary matrices and DEG tables.


6. Summary of Key Output Files

File / Directory Description
Phyloseq.html Master Report. Contains all 16S and Epidome statistics, tables, and embedded figures.
figures_MP/ High-res PCoA, rarefaction, alpha diversity, and composition heatmaps from MicrobiotaProcess.
figures_All_Combined/ Pairwise PERMANOVA results (Bray_pairwise_PERMANOVA.xlsx) and combined PCoA plots.
figures/ DESeq2 volcano plots, ST prevalence trend lines, and Alpha diversity boxplots.
DEGs_*.xlsx Excel tables of differentially abundant OTUs for all 6 pairwise comparisons per cohort.
ST_binary_matrix.xlsx The 0/1 presence/absence matrix for ST2, ST5, and ST23 used for the McNemar analysis.

Pipeline maintained by Jiabin Huang. For questions regarding the Epidome Python scripts or upstream QIIME1/DADA2 processing, refer to the respective bash/R scripts in the epidome/scripts/ directory.