Daily Archives: 2026年6月17日

How to Convert GenBank Files to GTF/FASTA for the nf-core/rnaseq Pipeline (Data_Nina_RNAseq_2026)

Step 1: Save and Run the GTF Formatting Script

Save the following code as fix_gtf_to_hsv1_format.py in your working directory:

import sys

def fix_gtf(input_gtf, output_gtf):
    seen_exons = set()

    with open(input_gtf, 'r') as fin, open(output_gtf, 'w') as fout:
        for line in fin:
            if line.startswith('#'):
                continue
            parts = line.strip().split('\t')
            if len(parts) < 9:
                continue

            seqname, source, feature, start, end, score, strand, frame, attributes = parts

            # Parse existing attributes into a dictionary
            attrs_dict = {}
            for attr in attributes.split(';'):
                attr = attr.strip()
                if not attr:
                    continue
                if ' ' in attr:
                    key, val = attr.split(' ', 1)
                    attrs_dict[key] = val.strip('"')

            gene_id = attrs_dict.get('gene_id', 'unknown_gene')
            # Standardize transcript_id to gene_id_RNA to match HSV1 format
            new_transcript_id = f"{gene_id}_RNA"

            # Rebuild attributes strictly matching HSV1 structure
            new_attrs = ""
            if feature == "gene":
                new_attrs = f'gene_id "{gene_id}"; gene_type "protein_coding"'
            elif feature == "transcript":
                new_attrs = f'gene_id "{gene_id}"; transcript_id "{new_transcript_id}"; gene_type "protein_coding"'
            elif feature == "exon":
                # Avoid duplicate exons
                exon_key = (seqname, start, end, strand, gene_id, new_transcript_id)
                if exon_key in seen_exons:
                    continue
                seen_exons.add(exon_key)
                new_attrs = f'gene_id "{gene_id}"; transcript_id "{new_transcript_id}"; gene_type "protein_coding"'
            elif feature == "CDS":
                orf_id = f"{gene_id}_CDS"
                new_attrs = f'gene_id "{gene_id}"; transcript_id "{new_transcript_id}"; orf_id "{orf_id}"; gene_type "protein_coding"'
            else:
                # For any other features, keep them but ensure gene_type is present
                if 'transcript_id' in attrs_dict:
                    attrs_dict['transcript_id'] = new_transcript_id
                if 'gene_type' not in attrs_dict:
                    attrs_dict['gene_type'] = 'protein_coding'
                new_attrs = '; '.join([f'{k} "{v}"' for k, v in attrs_dict.items()])

            # Write the fixed line
            fout.write(f"{seqname}\t{source}\t{feature}\t{start}\t{end}\t{score}\t{strand}\t{frame}\t{new_attrs}\n")

if __name__ == "__main__":
    fix_gtf("BB1528_nanopore_consensus.gtf", "BB1528_final.gtf")
    print("✅ GTF successfully reformatted to match HSV1 structure -> BB1528_final.gtf")

Run it in your terminal:

python3 fix_gtf_to_hsv1_format.py

Step 2: Verify the Output

If you check the new BB1528_final.gtf, the structure will now perfectly mirror your HSV1 example:

BB1528  GenBank gene    88  561 .   +   .   gene_id "ORF_S_L_transcribed"; gene_type "protein_coding"
BB1528  GenBank transcript  88  561 .   +   .   gene_id "ORF_S_L_transcribed"; transcript_id "ORF_S_L_transcribed_RNA"; gene_type "protein_coding"
BB1528  GenBank exon    88  561 .   +   .   gene_id "ORF_S_L_transcribed"; transcript_id "ORF_S_L_transcribed_RNA"; gene_type "protein_coding"
BB1528  GenBank CDS 88  561 .   +   0   gene_id "ORF_S_L_transcribed"; transcript_id "ORF_S_L_transcribed_RNA"; orf_id "ORF_S_L_transcribed_CDS"; gene_type "protein_coding"

Step 3: Re-run Nextflow with `-resume`

Now, update your Nextflow command to point to the newly formatted BB1528_final.gtf.

Important: Keep the -resume flag! Because the GTF file’s content has changed, Nextflow will automatically detect the new file hash and re-run only the genome preparation steps (like SALMON_INDEX and GTF2BED) without re-processing your FASTQ files.

nextflow run rnaseq/main.nf \
    --input samplesheet.csv \
    --outdir results_BB1528 \
    --fasta BB1528_nanopore_consensus.fasta \
    --gtf BB1528_final.gtf \
    --with_umi \
    --umitools_extract_method regex \
    --umitools_bc_pattern '^(?P
<UMI>.{12}).*' \
    --umitools_dedup_stats \
    -profile test_full \
    -resume \
    --max_memory 256.GB \
    --max_time 2400.h \
    --save_reference \
    --aligner star_salmon \
    --gtf_extra_attributes gene_id \
    --gtf_group_features transcript_id \
    --featurecounts_group_type gene_id \
    --featurecounts_feature_type transcript \
    --skip_rseqc \
    --skip_dupradar \
    --skip_preseq \
    --skip_biotype_qc \
    --skip_deseq2_qc \
    --skip_multiqc \
    --min_mapped_reads 0

This should completely resolve the SALMON_QUANT indexing error!

Good news: Your GTF and FASTA files are perfectly formatted!

We can prove this from your log: [info] Index contained 78 targets This means Salmon successfully built the transcriptome index and found all 78 transcripts from your BB1528_final.gtf. The reference files are completely correct.

What is causing the error?

The error is purely a biological/sample-specific issue:

[warning] salmon was only able to assign 9 fragments to transcripts in the index, 
but the minimum number of required assigned fragments (--minAssignedFrags) was 10.

The nf-core/rnaseq pipeline automatically subsamples your reads (in this case, 1,000,000 fragments) to run a quick Salmon quantification and check the library strandedness (--libType=A).

Because you are mapping to a specific viral reference (BB1528), it is highly likely that control_r3 is a negative control or a sample with an extremely low viral load. Out of the 1,000,000 subsampled reads, only 9 reads actually belonged to the virus.

Salmon has a built-in safety threshold (--minAssignedFrags 10) that intentionally crashes the pipeline if fewer than 10 reads map to the transcriptome, to prevent “garbage” quantification of empty samples. Since 9 < 10, the pipeline stops.

How to fix it

You can easily bypass this by creating a custom Nextflow configuration file that tells Salmon to accept samples with even just 1 mapped fragment.

Step 1: Create a config file In your working directory, create a file named salmon_fix.config and add the following code:

process {
    withName: 'SALMON_QUANT' {
        // Append --minAssignedFrags 1 to any existing Salmon arguments
        ext.args = { (task.ext.args ?: '') + " --minAssignedFrags 1" }
    }
}

Step 2: Run the pipeline with the config file Add -c salmon_fix.config to your Nextflow command. This will apply the fix to both the subsampled QC step and the main quantification step, ensuring your negative controls don’t crash the pipeline.

nextflow run /home/jhuang/Tools/nf-core-rnaseq-3.12.0/main.nf -profile docker \
    -c salmon_fix.config \
    --input samplesheet.csv \
    --outdir results_BB1528 \
    --fasta BB1528_nanopore_consensus.fasta \
    --gtf BB1528_final.gtf \
    --with_umi \
    --umitools_extract_method regex \
    --umitools_bc_pattern '^(?P
<umi_1>.{12}).*' \
    --trimmer fastp \
    --extra_fastp_args "--trim_front2 10" \
    -resume \
    --max_cpus 54 \
    --max_memory 120.GB \
    --max_time 2400.h \
    --save_align_intermeds \
    --save_unaligned \
    --save_reference \
    --aligner star_salmon \
    --gtf_extra_attributes gene_id \
    --gtf_group_features transcript_id \
    --featurecounts_group_type gene_id \
    --featurecounts_feature_type transcript \
    --skip_rseqc --skip_dupradar --skip_preseq --skip_biotype_qc --skip_deseq2_qc --skip_multiqc \
    --min_mapped_reads 0

Summary of what will happen now:

The pipeline will resume from where it left off (thanks to -resume).
Salmon will process control_r3, see that it only has 9 viral reads, but instead of crashing, it will output a warning and successfully generate an output file with 0 or near-0 counts.
The rest of your infected samples (which likely have plenty of viral reads) will quantify normally.
At the end of the pipeline, you will get a complete count matrix where your negative controls correctly show 0 expression, which is exactly what you want for downstream differential expression analysis!

Yes, all detected configuration files are being loaded and merged by Nextflow. Your custom settings will definitely be applied. What does this warning actually mean? This is not an error; it is just a best-practice reminder built into the nf-core pipeline framework. The developers of nf-core strictly enforce a rule: Pipeline parameters (the arguments starting with — like –input, –fasta, –max_memory) should never be defined inside a .config file. They must be passed directly via the command line (CLI) or a dedicated -params-file. The warning is simply the pipeline saying:

"Hey, I see you are using custom .config files. Just a reminder: do not put your --parameters inside them! Use the command line for parameters, and use .config files only for Nextflow settings (like process resources, arguments, or executor settings)."

Is your setup correct? Yes, absolutely. The salmon_fix.config file we created in the previous step only contains a process block to modify Salmon’s internal arguments (ext.args). It does not contain any params. Therefore, you are using the config file exactly as intended. Why did it trigger? Nextflow automatically detects any file ending in .config in your working directory (e.g., nextflow.config, salmon_fix.config, or profile configs). Because it found more than one, it triggered this informational warning. What should you do? You can safely ignore this warning. The pipeline will continue running normally, and your fix to allow Salmon to quantify samples with fewer than 10 assigned fragments (–minAssignedFrags 1) will be successfully applied. Did the pipeline successfully pass the SALMON_QUANT step and finish running this time? If it didn’t crash at that specific step, it means the config file worked perfectly!

Based on the error log and the reference files you provided, here is the diagnosis of the problem and how to fix it.

🔍 The Root Cause

The pipeline is failing during the FASTQ_SUBSAMPLE_FQ_SALMON step. nf-core/rnaseq uses this step to subsample 1 million reads and run Salmon to automatically detect your library’s strandedness (e.g., forward, reverse, unstranded) and verify the mapping rate.

The critical error from Salmon is:

salmon was only able to assign 8 fragments to transcripts in the index, but the minimum number of required assigned fragments (--minAssignedFrags) was 10. This could be indicative of a mismatch between the reference and sample...

Why is this happening?

Your Reference is Virus-Only: The reference you provided (BB1528_nanopore_consensus.fasta) is a Varicella-Zoster Virus (VZV) genome (~127 kb).
Your Controls are Likely Host-Only: You have samples named control_r1, control_r2, and control_r3. These are almost certainly uninfected host cells (e.g., human or monkey cell lines).
The Mismatch: Because host RNA-seq reads do not match the VZV viral genome, the mapping rate for your control samples is essentially 0% (only 8 random reads mapped out of 1,000,000). Salmon aborts because it cannot confidently determine the library type with fewer than 10 mapped reads.

🛠️ How to Fix It

Depending on your experimental goals, choose one of the following solutions:

Solution 1: Add the Host Genome to your Reference (Recommended)

If you want to analyze host gene expression or simply want the pipeline to run smoothly on uninfected controls, you must include the host genome (e.g., Human GRCh38 or Macaque) alongside the VZV genome.

Action: Concatenate the Host FASTA and VZV FASTA into a single combined_genome.fa. Do the same for the Host GTF and VZV GTF (combined_annotation.gtf).
Why: This allows the host reads in your control samples to map properly, allowing Salmon to easily detect the library strandedness and pass the QC threshold.

Solution 2: Remove the Control Samples from the Samplesheet

If you are only interested in viral transcription and intentionally built a virus-only reference, the uninfected controls will yield zero useful data anyway.

Action: Open your samplesheet.csv and delete the rows for control_r1, control_r2, and control_r3.
Why: The pipeline will only run on the VZV.dXX infected samples, which will have high mapping rates to the VZV reference, allowing Salmon to successfully detect the library type.

Solution 3: Manually Specify Strandedness

If you know the strandedness of your library preparation kit (e.g., Illumina TruSeq Stranded mRNA is usually reverse), you can tell the pipeline to skip the auto-detection step.

Action: Add a strandedness column to your samplesheet.csv and fill it with reverse, forward, or unstranded for all samples. Alternatively, pass --strandedness reverse (or appropriate value) in your Nextflow run command.
Note: While this bypasses the library-type inference failure, Salmon might still throw a warning about the low mapping rate for the controls, but it usually prevents the hard crash.

Solution 4: Verify Sample Identities

If your control samples are actually supposed to be VZV-infected (e.g., a different viral strain or a positive control), then an 8-read mapping rate indicates a severe issue.

Action: Check your FASTQ files. Ensure the control FASTQ files aren’t corrupted, empty, or accidentally swapped with a completely different organism’s sequencing data. Ensure the viral strain in the control isn’t highly divergent from the Oka parental strain reference you are using.

💡 Summary Recommendation

For viral RNA-seq experiments, Solution 1 is the standard best practice. Creating a “chimeric” reference (Host + Virus) ensures that mapping rates are accurate, host-response pathways can be evaluated, and nf-core/rnaseq’s internal QC tools function exactly as intended.

Comprehensive Reproducible Pipeline for Longitudinal Nasal Microbiome and S. epidermidis Epidome Analysis (DATA_B/Data_Luise_Epidome_longitudinal_nose)

1. Study Design & Metadata

This pipeline analyzes longitudinal nasal swab samples from patients undergoing neurosurgery.

Cohort A (Aneurysm): 15 patients (45 samples)
Cohort H (Hypophysis): 20 patients (60 samples)
Total: 108 patient samples + 3 extraction controls.
Timepoints:
- .1 = Admission (Baseline)
- .2 = Surgery (Intraoperative/Immediate post-op)
- .3 = Discharge (Recovery)

Targeted Sequencing Approaches:

16S rRNA Gene Amplicon Sequencing: For overall nasal microbiome profiling (processed via QIIME1 open-reference picking against SILVA 132).
Epidome Method (S. epidermidis): Targeted amplicon sequencing of the g216 and yycH genes for high-resolution Sequence Type (ST) tracking (processed via DADA2 and custom Epidome Python/R scripts).

2. R Environment & Dependencies

To reproduce this analysis, ensure you are using R (v4.1+) and have the following packages installed.

# Core Microbiome & Phyloseq
install.packages(c("phyloseq", "microbiome", "picante", "vegan", "ape"))
# Bioconductor packages
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install(c("DESeq2", "microbiome", "phyloseq"))

# Advanced Processing & Visualization
install.packages(c("MicrobiotaProcess", "microeco", "ggplot2", "ggpubr", "ggrepel", "aplot"))
install.packages(c("ggh4x", "gghalves", "ggalluvial", "RColorBrewer", "heatmaply", "gplots"))

# Statistical Testing & Data Wrangling
install.packages(c("dplyr", "tidyr", "rstatix", "RVAideMemoire", "openxlsx", "knitr", "kableExtra", "tibble", "reshape2", "coin", "BSDA"))

3. Workflow Part I: Overall Microbiome (16S) Analysis

Scripts involved: MicrobiotaProcess.R (Primary engine for diversity/ordination) & Phyloseq.Rmd (Reporting, DESeq2, and specific statistical tests).

3.1 Data Import & Normalization Strategy

The pipeline splits the raw OTU table (table_even9893.biom) into distinct objects based on the downstream statistical requirements:

ps_filt: Raw counts, low-depth samples removed (min 1000 reads). Used for DESeq2 and Hellinger transformation.
ps_rarefied: Rarefied to an even depth of 9,893 reads (set.seed(9242)). Used for Alpha diversity and UniFrac/Jaccard metrics.
ps_abund_rel: Relative abundance, filtered to keep only taxa with mean relative abundance > 0.1%. Used for clean taxonomic composition plots.
hellinger (via MicrobiotaProcess): Hellinger-transformed ps_filt. Used for Bray-Curtis distances and PCoA.

3.2 Alpha Diversity (Longitudinal & Cross-Sectional)

Metrics: Observed OTUs, Chao1, ACE, Shannon, Simpson, Pielou.
Longitudinal Analysis (Within Cohorts A and H):
- Overall temporal shift: Friedman test (non-parametric repeated measures) accounting for PatientID pairing.
- Post-hoc pairwise: Paired t-tests and Paired Wilcoxon signed-rank tests with Benjamini-Hochberg (BH) correction.
Cross-Sectional Analysis (Cohort A vs. Cohort H):
- All timepoints pooled: Independent t-test / Wilcoxon rank-sum test.
- Surgery timepoint ONLY: To prevent dilution of the surgical effect by baseline/discharge samples, a specific filter (SampleType == "surgery") is applied before running the independent t-test/Wilcoxon.

3.3 Beta Diversity & Ordination

Distance Metric: Bray-Curtis dissimilarity calculated on Hellinger-transformed non-rarefied counts.
Ordination: Principal Coordinates Analysis (PCoA).
Statistical Testing:
- Overall: PERMANOVA (vegan::adonis2) with 9,999 permutations.
- Post-hoc Pairwise: All 15 possible pairwise group comparisons (e.g., A1 vs A2, A1 vs H1, H2 vs H3) are calculated using adonis2. P-values are adjusted using BH (FDR) and Bonferroni corrections.
Outputs: PCoA plots (colored by group, sized by Shannon, alpha by Observed), exported as PNG/PDF/SVG. Pairwise PERMANOVA results exported to Bray_pairwise_PERMANOVA.xlsx.

3.4 Taxonomic Composition

Plots: Stacked barplots and Heatmaps at the Class level (Top 20 taxa).
Grouping: Plots generated for individual samples, grouped by Cohort (A vs H), and grouped by Timepoint (Admission, Surgery, Discharge).

3.5 Differential Abundance Analysis (DESeq2)

Method: Negative Binomial Generalized Linear Model (Wald test) using raw, non-rarefied integer counts (ps_filt).
Comparisons: Six independent pairwise comparisons per cohort:
1. Admission vs. Surgery
2. Admission vs. Discharge
3. Surgery vs. Discharge (Repeated for both Cohort A and Cohort H)
Threshold: Adjusted P-value (BH) < 0.05.
Outputs: Volcano-style plots (log2FoldChange by Family/Class) and Excel tables of significant DEGs for each comparison.

4. Workflow Part II: S. epidermidis Epidome Analysis

Script involved: Phyloseq.Rmd (Sections: ST summary, Binary Prevalence Analysis). Input Data: count_table_seq31_seq37_ST.txt (ASVs classified into STs via g216 and yycH BLAST against the Epidome DB). Normalized to median depth (56,191).

4.1 ST Composition & Alpha Diversity

Composition: Stacked barplots of relative ST abundance across the 108 samples.
Alpha Diversity: Shannon index and Observed STs compared between Cohorts (t-test) and across Timepoints (t-test).
goeBURST: Clonal complex (CC) visualization based on PubMLST allele profiles.

4.2 Continuous Abundance Testing (Specific STs)

Method: Wilcoxon signed-rank test (paired, Admission .1 vs. Discharge .3).
Targets: Analyzed individually for clinically relevant STs (ST2, ST5, ST8, ST23, ST59, ST60, ST73, ST100, ST130, ST215) and in combined clinical groups (e.g., ST5+ST87+ST130+ST210+ST384).
Handling Ties: Evaluated using exact calculations via the coin package and the Sign Test (BSDA) to ensure robustness against zero-inflation.

4.3 🌟 Novel Binary Prevalence Analysis (Cochran’s Q & McNemar)

Because continuous abundance of specific high-risk STs (ST2, ST5, ST23) showed no significant shifts, we converted the data to a binary presence/absence matrix to track strain acquisition and clearance.

Data Transformation: Abundance > 0 becomes 1 (Present); becomes (Absent).
Overall Temporal Shift: Cochran’s Q test (RVAideMemoire::cochran.qtest) to evaluate if the prevalence proportion changes across the 3 timepoints (Admission, Surgery, Discharge) for a given ST.
Pairwise Transitions: Robust McNemar tests for paired binary data (Admission vs Surgery, Surgery vs Discharge, Admission vs Discharge).
- Robustness: Custom R function handles zero-variance (no discordant pairs) by assigning $p = 1.0$ instead of crashing.
Correction: Benjamini-Hochberg (BH) FDR correction applied to the 3 pairwise p-values.
Visualization: Line charts with scatter points showing the Prevalence (%) trend across the 3 timepoints, annotated with $n/N$ (number of positive patients / total patients).
Export: The binary matrix is exported to ST_binary_matrix.xlsx.

5. Execution Instructions (How to Reproduce)

Step 1: Prepare the Directory

Ensure your working directory contains the following raw files:

table_even9893.biom and ../clustering/rep_set.tre (16S data)
../map3_corrected.txt (Metadata)
../count_table_seq31_seq37_ST.txt (Epidome ST counts)
adiv_even.txt, adiv_even_A.txt, adiv_even_H.txt (Pre-calculated 16S alpha metrics from QIIME)
alpha_diversity_metrics_samples_ST.csv (Pre-calculated ST alpha metrics)

Step 2: Run the MicrobiotaProcess Pipeline (Beta/Alpha/Composition)

Open your terminal or RStudio and run the MicrobiotaProcess.R script. This will generate the figures_MP/ and figures_All_Combined/ directories.

Rscript MicrobiotaProcess.R

Outputs: PCoA plots, rarefaction curves, Bray-Curtis distance matrices, pairwise PERMANOVA Excel files, and Class-level heatmaps.

Step 3: Render the Comprehensive HTML Report

Open RStudio, load Phyloseq.Rmd, and render it. This integrates the outputs from Step 2, runs DESeq2, performs the Epidome statistical tests, and generates the final interactive report.

# In R Console:
rmarkdown::render('Phyloseq.Rmd', output_file = 'Phyloseq.html')

Outputs: Phyloseq.html (the master report), figures/ directory (DESeq2 plots, ST prevalence plots, goeBURST), and various .xlsx files for ST binary matrices and DEG tables.

6. Summary of Key Output Files

File / Directory	Description
`Phyloseq.html`	Master Report. Contains all 16S and Epidome statistics, tables, and embedded figures.
`figures_MP/`	High-res PCoA, rarefaction, alpha diversity, and composition heatmaps from `MicrobiotaProcess`.
`figures_All_Combined/`	Pairwise PERMANOVA results (`Bray_pairwise_PERMANOVA.xlsx`) and combined PCoA plots.
`figures/`	DESeq2 volcano plots, ST prevalence trend lines, and Alpha diversity boxplots.
`DEGs_*.xlsx`	Excel tables of differentially abundant OTUs for all 6 pairwise comparisons per cohort.
`ST_binary_matrix.xlsx`	The 0/1 presence/absence matrix for ST2, ST5, and ST23 used for the McNemar analysis.

Pipeline maintained by Jiabin Huang. For questions regarding the Epidome Python scripts or upstream QIIME1/DADA2 processing, refer to the respective bash/R scripts in the epidome/scripts/ directory.

Microbial bioinformatics

Microbial bioinformatics uses computational tools to analyze genomes, track evolution, and study functions in microorganisms, including bacteria and viruses.

Daily Archives: 2026年6月17日

How to Convert GenBank Files to GTF/FASTA for the nf-core/rnaseq Pipeline (Data_Nina_RNAseq_2026)

Step 1: Save and Run the GTF Formatting Script

Step 2: Verify the Output

Step 3: Re-run Nextflow with `-resume`

What is causing the error?

How to fix it

Summary of what will happen now:

🔍 The Root Cause

🛠️ How to Fix It

Solution 1: Add the Host Genome to your Reference (Recommended)

Solution 2: Remove the Control Samples from the Samplesheet

Solution 3: Manually Specify Strandedness

Solution 4: Verify Sample Identities

💡 Summary Recommendation

Comprehensive Reproducible Pipeline for Longitudinal Nasal Microbiome and S. epidermidis Epidome Analysis (DATA_B/Data_Luise_Epidome_longitudinal_nose)

1. Study Design & Metadata

2. R Environment & Dependencies

3. Workflow Part I: Overall Microbiome (16S) Analysis

3.1 Data Import & Normalization Strategy

3.2 Alpha Diversity (Longitudinal & Cross-Sectional)

3.3 Beta Diversity & Ordination

3.4 Taxonomic Composition

3.5 Differential Abundance Analysis (DESeq2)

4. Workflow Part II: S. epidermidis Epidome Analysis

4.1 ST Composition & Alpha Diversity

4.2 Continuous Abundance Testing (Specific STs)

4.3 🌟 Novel Binary Prevalence Analysis (Cochran’s Q & McNemar)

5. Execution Instructions (How to Reproduce)

Step 1: Prepare the Directory

Step 2: Run the MicrobiotaProcess Pipeline (Beta/Alpha/Composition)

Step 3: Render the Comprehensive HTML Report

6. Summary of Key Output Files

Step 1: Save and Run the GTF Formatting Script

Step 2: Verify the Output

Step 3: Re-run Nextflow with -resume

What is causing the error?

How to fix it

Summary of what will happen now:

🔍 The Root Cause

🛠️ How to Fix It

Solution 1: Add the Host Genome to your Reference (Recommended)

Solution 2: Remove the Control Samples from the Samplesheet

Solution 3: Manually Specify Strandedness

Solution 4: Verify Sample Identities

💡 Summary Recommendation

1. Study Design & Metadata

2. R Environment & Dependencies

3. Workflow Part I: Overall Microbiome (16S) Analysis

3.1 Data Import & Normalization Strategy

3.2 Alpha Diversity (Longitudinal & Cross-Sectional)

3.3 Beta Diversity & Ordination

3.4 Taxonomic Composition

3.5 Differential Abundance Analysis (DESeq2)

4. Workflow Part II: S. epidermidis Epidome Analysis

4.1 ST Composition & Alpha Diversity

4.2 Continuous Abundance Testing (Specific STs)

4.3 🌟 Novel Binary Prevalence Analysis (Cochran’s Q & McNemar)

5. Execution Instructions (How to Reproduce)

Step 1: Prepare the Directory

Step 2: Run the MicrobiotaProcess Pipeline (Beta/Alpha/Composition)

Step 3: Render the Comprehensive HTML Report

6. Summary of Key Output Files

Step 3: Re-run Nextflow with `-resume`