Microbial bioinformatics uses computational tools to analyze genomes, track evolution, and study functions in microorganisms, including bacteria and viruses.
For nf-core/methylong and most methylation analysis workflows, modBAM is sufficient and preferred. FASTQ files become redundant once you have a properly generated modBAM.
📊 Information Comparison: modBAM vs. FASTQ
| Data Type | modBAM | FASTQ | Required for methylong? |
|---|---|---|---|
| Base calls (A/C/G/T) | ✅ Yes | ✅ Yes | ✅ (in modBAM) |
| Quality scores | ✅ Yes | ✅ Yes | ⚠️ Optional |
| Read alignment | ✅ Yes (if mapped) | ❌ No | ✅ (modBAM preferred) |
| Methylation tags (MM/ML) | ✅ Yes | ❌ No | ✅ Critical |
| Signal-level info | ⚠️ Via tags (if --emit-moves) | ❌ No | ✅ For advanced analysis |
| File size (5 Mb genome) | ~200-500 MB | ~300 MB | n/a |
➡️ modBAM contains everything FASTQ has, PLUS alignment and methylation data.
🎯 When FASTQ Is Still Useful (Optional Scenarios)
| Use Case | Why Keep FASTQ? | Alternative with modBAM |
|---|---|---|
| Independent QC | Run FastQC, NanoPlot on raw reads | samtools stats, modkit stats, or extract FASTQ from BAM: samtools fastq file.bam |
| Re-basecalling | Change Dorado model/modifications | Keep raw signal files (pod5/fast5) instead; they are the true source |
| Collaboration | Share with labs not using modBAM | Convert modBAM → FASTQ anytime: samtools fastq |
| Backup/Reproducibility | Archival completeness | modBAM + reference genome is sufficient for reproducibility |
| Non-methylation analysis | Variant calling, assembly | Use modBAM directly or extract FASTQ if needed |
💡 Key Insight: If you have modBAM + reference genome, you can regenerate FASTQ anytime. But you cannot regenerate methylation tags from FASTQ alone.
🔄 Practical Workflow: What to Keep & What to Discard
graph LR
A[Raw Signal: pod5/fast5] --> B[Dorado basecalling];
B --> C[modBAM with MM/ML tags];
C --> D[nf-core/methylong];
C --> E[Convert to FASTQ if needed: samtools fastq];
F[Original FASTQ from Novogene] -.->|Redundant| C;
style F stroke-dasharray: 5 5
✅ Recommended File Retention Strategy
| Priority | File Type | Keep? | Reason |
|---|---|---|---|
| 🔴 Essential | *.mod.bam (with MM/ML tags) | ✅ Yes | Direct input for methylong; contains all analysis-ready data |
| 🟡 High Value | pod5/ or fast5/ (raw signal) | ✅ Yes (if storage allows) | Enables re-basecalling with new models/modifications |
| 🟢 Optional | *.fq.gz (basecalled FASTQ) | ❌ Can delete | Redundant; can be regenerated from modBAM |
| 🔵 Helpful | sequencing_summary.txt, reference genome | ✅ Yes | Metadata + alignment reference |
🛠️ How to Extract FASTQ from modBAM (If Ever Needed)
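# Extract FASTQ from a mapped or unmapped modBAM.
# Paired-end layout (rare for ONT), split into R1/R2:
samtools fastq -1 output_R1.fq.gz -2 output_R2.fq.gz -0 /dev/null -s /dev/null -n input.mod.bam
# For single-end (typical for ONT); compress explicitly so the .gz name is accurate:
samtools fastq input.mod.bam | gzip > output.fq.gz
# Carry the modification tags along in the FASTQ header comment (optional):
samtools fastq -T MM,ML input.mod.bam | gzip > output.with_tags.fq.gz
⚠️ Note: Without -T, the extracted FASTQ does not contain methylation information; it lives only in the MM/ML tags within the BAM. Even with -T, many FASTQ-based tools ignore header tags, so treat the modBAM as the authoritative copy.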
📋 Action Plan for Your Project
If Novogene Provides modBAM:
# 1. Verify modification tags exist
samtools view An6.mod.bam | head -3 | grep -oE "MM:Z:[^[:space:]]*"
# 2. (Optional) Delete redundant FASTQ to save space
rm clean_data/An6/*.fq.gz
# 3. Prepare samplesheet for methylong
echo "group,sample,path,ref,method" > samplesheet.csv
echo "control,An6,/path/to/An6.mod.bam,/path/to/ref.fa,ont" >> samplesheet.csv
# 4. Run pipeline — NO FASTQ NEEDED
nextflow run nf-core/methylong -profile docker --input samplesheet.csv --outdir ./results
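The tag check in step 1 can also be scripted so that it reports what fraction of reads carry modification tags, rather than only the first few. The following is a minimal sketch, assuming the input is SAM text (for example the output of samtools view); the function names and the two example records are made up for illustration.

```python
def has_mod_tags(sam_record):
    """Return True if a SAM text record carries both MM and ML tags."""
    fields = sam_record.rstrip("\n").split("\t")
    # Columns 1-11 are mandatory; optional TAG:TYPE:VALUE fields start at column 12.
    tag_names = {field.split(":", 1)[0] for field in fields[11:]}
    return {"MM", "ML"} <= tag_names


def fraction_with_mod_tags(sam_lines):
    """Fraction of alignment records (header lines skipped) that have MM and ML."""
    records = [line for line in sam_lines if line.strip() and not line.startswith("@")]
    if not records:
        return 0.0
    return sum(has_mod_tags(record) for record in records) / len(records)


# Two made-up records: the first has MM/ML tags, the second does not.
example = [
    "@HD\tVN:1.6\tSO:unsorted",
    "read1\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\tMM:Z:C+m,0;\tML:B:C,200",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\tNM:i:0",
]
print(fraction_with_mod_tags(example))
```

For real data, the lines could come from sys.stdin with samtools view An6.mod.bam piped in; a fraction close to zero would suggest the BAM was basecalled without modification models.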
If Novogene Only Provides FASTQ + Raw Signal (pod5/fast5):
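# You MUST basecall to modBAM first.
# Note: Dorado does not offer a combined "5mC_6mA_4mC" model; all-context modifications
# are requested as separate models (here 4mC_5mC plus 6mA). Confirm the names available
# for your Dorado version and chemistry with: dorado download --list
dorado basecaller \
dna_r10.4.1_e8.2_400bps_sup@v5.0.0 \
/path/to/pod5/ \
--emit-moves \
--device cuda:0 \
--modified-bases 4mC_5mC 6mA \
> An6.mod.bam
# Then proceed as above; the original FASTQ can be deleted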
💬 Reply to Novogene (Clarifying File Needs)
If they ask “Do you need the FASTQ files we already sent?”, you can reply:
Thank you for checking. For our methylation analysis workflow:
✅ We DO need: modBAM files with MM/ML tags for 5mC_6mA_4mC (or POD5/FAST5 raw signal files to generate them)
❌ We do NOT need: the basecalled FASTQ files already delivered, as they are redundant once modBAM is available
If storage or transfer is a concern, please prioritize delivering modBAM or raw signal files. We can regenerate FASTQ from modBAM if ever needed.
Best regards,
Jiabin
✅ Summary Checklist
[ ] modBAM received? → Verify MM/ML tags with samtools
[ ] Raw signal (pod5/fast5) available? → Keep for future re-basecalling
[ ] Original FASTQ taking up space? → Safe to delete after modBAM validation
[ ] Running methylong? → Use modBAM directly; no FASTQ required
🎯 Bottom Line: modBAM is the “golden copy” for methylation analysis. FASTQ is optional backup — keep it only if you have ample storage or specific QC needs.
📧 Email Template: Requesting ONT Methylation Files from Novogene
Here’s a clear, professional email that specifies exactly what you need — in order of preference — with technical justification.
✉️ English Version (Recommended for Novogene International Support)
Subject: Request for Methylation-Ready Files – Project X101SC26036392-Z01
Dear Novogene Technical Support,
Project ID: X101SC26036392-Z01
Samples: An6 (FDSW260173014-2r), BG5 (FDSW260173015-2r)
Service: ONT bacterial genome sequencing with methylation detection
Thank you for delivering the basecalled FASTQ files. For downstream methylation analysis using nf-core/methylong, we require signal-level or modification-aware data.
🔹 Preferred File Formats (in order of priority):
✅ PRIORITY 1 (Best option – ready for analysis):
• modBAM files with MM/ML tags for modifications: 5mC, 6mA, 4mC
• Example filename: An6.mod.bam
• Why: Compact, alignment-ready, directly compatible with methylong
✅ PRIORITY 2 (Alternative – raw signal, newer format):
• POD5 raw signal files
• Why: More storage-efficient than FAST5; can be re-basecalled with custom modification models
✅ PRIORITY 3 (Fallback – raw signal, legacy format):
• FAST5 raw signal files (fast5_pass/)
• Why: Contains full signal data; allows re-basecalling, but large file size
❌ NOT sufficient for methylation analysis:
• FASTQ files only (no kinetic/signal information retained)
🔹 Additional helpful files (optional but appreciated):
• sequencing_summary.txt (to confirm flowcell/kit/model)
• Dorado basecalling version and model name used
• Reference genome or assembly files if already generated
⏰ Time Sensitivity:
We understand from your documentation that ONT signal files (FAST5/POD5) are retained for a maximum of 21 days after data release. If these files were generated for our project, we kindly request they be shared before deletion.
🔹 Delivery Preference:
Please upload the requested files to our existing cloud directory:
oss://CP2024121300060/H101SC26036392/RSMD01814/X101SC26036392-Z01/X101SC26036392-Z01-J002/
Or provide a secure download link if alternative transfer is needed.
Could you please confirm:
1. Which of the above file types are available for our project?
2. If modBAM is not available, can you re-basecall the raw signal with:
--modified-bases 4mC_5mC 6mA --emit-moves
and deliver the resulting modBAM?
Thank you for your urgent assistance. We look forward to your reply.
Best regards,
Jiabin Huang
[Your affiliation/contact info]
🇨🇳 Chinese Version (If contacting Chinese support team)
💡 Pro Tip: If Novogene asks “Which one do you want?”, reply: “Please provide modBAM with MM/ML tags for 5mC_6mA_4mC if available. If not, POD5 or FAST5 raw signal files are also acceptable.”
🚀 After Sending: Follow-Up Checklist
# 1. Save email timestamp + expected response window (e.g., 3 business days)
# 2. If no reply in 48h: send polite follow-up
# 3. Once files received:
# • Verify modBAM tags: samtools view file.bam | grep "MM:Z:"
# • Or test pod5 conversion: dorado basecaller --help
# 4. Update your samplesheet.csv and run methylong
🔁 If Novogene Says “We Only Deliver FASTQ”
Reply with this escalation:
Thank you for the clarification. To confirm our understanding:
1. Was the sequencing performed on an Illumina platform (not ONT),
despite the contract stating "ONT platform bacterial complete genome sequencing"?
2. If ONT was used but signal files are no longer available due to the 21-day policy,
could you advise on options for re-sequencing or data recovery?
Our study specifically requires long-read methylation detection (6mA/4mC) for bacterial epigenetics.
If the current FASTQ data cannot support this analysis, we may need to discuss service adjustment or re-sequencing.
We appreciate your guidance on next steps.
⚠️ First: Critical Reminder About Your Data
Before discussing model selection, I must re-emphasize:
❌ Your current files (Illumina short-read FASTQ) CANNOT be used with nf-core/methylong
❌ Dorado basecalling parameters only apply to ONT raw signal files (fast5/pod5)
The --dorado_model and --dorado_modification parameters in nf-core/methylong are only relevant if you have ONT raw signal data.
✅ If You Obtain ONT Data: Model Selection Guide for Bacterial Methylation
Assuming you successfully request ONT signal files from Novogene, here’s how to choose the right parameters for your bacterial samples (Acinetobacter harbinensis and Pedobacter cryoconitis):
# Check sequencing_summary.txt (if available from Novogene)
grep -E "model|kit|flow_cell" sequencing_summary.txt
# Common combinations for bacterial ONT sequencing:
# • R10.4.1 flowcell + SQK-RAD114 kit → dna_r10.4.1_e8.2_400bps_sup@v5.0.0
# • R9.4.1 flowcell + SQK-LSK109 kit → dna_r9.4.1_e8_hac@v3.3 (SQK-LSK114 is a Kit 14 chemistry and pairs with R10.4.1)
📌 If unsure, ask Novogene: “Which Dorado model name corresponds to the chemistry used for project X101SC26036392-Z01?”
🔹 --dorado_modification: Methylation Types to Detect
Bacterial vs. Mammalian Modification Profiles
| Organism Type | Dominant Modifications | Recommended --dorado_modification |
|---|---|---|
| Bacteria (your case) | 6mA, 4mC, some 5mC | 5mC_6mA_4mC ✅ |
| Mammals | 5mC, 5hmC | 5mC_5hmC (default) |
| Yeast/Fungi | Variable | 5mC_6mA or custom |
Available Dorado Modification Models (check GitHub for updates)
# List the models (including modification models) available to the installed Dorado:
dorado download --list
# Modification model names commonly offered for R10.4.1 v5 DNA models (verify against the list above):
5mCG_5hmCG # CpG-context 5mC/5hmC (typical mammalian choice)
5mC_5hmC # All-context 5mC/5hmC
4mC_5mC # All-context 4mC/5mC ← relevant for bacteria
6mA # All-context 6mA ← relevant for bacteria
🔑 Key Insight: Acinetobacter and Pedobacter species commonly use 6mA and 4mC for restriction-modification systems. Using the default 5mC_5hmC would miss these biologically important modifications!
📋 Recommended Parameters for Your Project
# For bacterial genomes (An6, BG5) with ONT data:
--dorado_model: sup # Highest accuracy
--dorado_modification: 5mC_6mA_4mC # Cover bacterial modifications
--dorado_device: cuda:0 # Use GPU (required for reasonable speed)
Example nf-core/methylong Command (once you have modBAM/pod5):
🔍 How to Verify Your Sequencing Chemistry (Ask Novogene)
If you obtain ONT signal files, confirm these details:
| Parameter | Why It Matters | Example Value |
|---|---|---|
| Flowcell version | Determines basecalling model prefix | R10.4.1, R9.4.1 |
| Kit type | Affects pore speed and model suffix | SQK-RAD114, SQK-LSK114 |
| Basecalling speed | Determines model suffix | 400bps, 260bps |
| Dorado version | Model compatibility | v5.0.0, v4.1.0 |
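The chemistry details in the table map onto the parts of a Dorado DNA model name. The sketch below splits a model name into those parts; the regular expression is an assumption based only on the two model names quoted in these notes (dna_r10.4.1_e8.2_400bps_sup@v5.0.0 and dna_r9.4.1_e8_hac@v3.3) and is not an official grammar, so RNA or other model families are deliberately not covered.

```python
import re

# Assumed layout: dna_<pore>_<chemistry>[_<speed>]_<accuracy>@v<version>
MODEL_PATTERN = re.compile(
    r"^(?P<analyte>dna)_(?P<pore>r[\d.]+)_(?P<chemistry>e[\d.]+)"
    r"(?:_(?P<speed>\d+bps))?_(?P<accuracy>fast|hac|sup)@v(?P<version>[\d.]+)$"
)


def parse_model_name(name):
    """Split a Dorado DNA model name into its components (speed may be None)."""
    match = MODEL_PATTERN.match(name)
    if match is None:
        raise ValueError(f"Unrecognised model name: {name}")
    return match.groupdict()


for model in ("dna_r10.4.1_e8.2_400bps_sup@v5.0.0", "dna_r9.4.1_e8_hac@v3.3"):
    print(model, parse_model_name(model))
```

Comparing the parsed pore and speed fields with the flowcell and kit that Novogene reports is a quick way to spot a mismatched model before starting a long basecalling run.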
📧 Email snippet for Novogene:
Could you please confirm the ONT sequencing chemistry details for project X101SC26036392-Z01:
- Flowcell type (e.g., R10.4.1 or R9.4.1)?
- Library prep kit (e.g., SQK-RAD114)?
- Which Dorado model was used for initial basecalling (if any)?
This information is required to select the correct --dorado_model parameter for methylation analysis.
🚨 If You Cannot Obtain ONT Signal Files: Alternative Paths
Since your current data appears to be Illumina short-read, consider these options:
Option A: Standard Bacterial Analysis (with current FASTQ)
Option B: If Libraries Were Bisulfite-Converted (WGBS)
# Only works if DNA was bisulfite-treated before Illumina sequencing
nextflow run nf-core/methylseq \
-profile docker \
--input samplesheet_wgbs.csv \
--fasta /path/to/reference.fa \
--aligner bismark \
--outdir ./results_methylseq
🔍 Check for bisulfite signatures:
# High C→T conversion in non-CpG context suggests WGBS
zcat clean_data/An6/An6_L1_1.clean.rd.fq.gz | \
awk 'NR%4==2' | head -10000 | \
grep -o '[^C]T[^G]' | wc -l
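A more direct signal than the trinucleotide count is overall base composition: in a directional bisulfite library, unmethylated cytosines read as T, so read 1 is strongly depleted in C. The sketch below computes base fractions from read sequences; the 5% cut-off is an illustrative assumption, not a validated threshold, and the helper names are hypothetical.

```python
import gzip
from itertools import islice


def read_fastq_sequences(path, limit=10000):
    """Yield up to `limit` sequence lines from a gzipped FASTQ file."""
    with gzip.open(path, "rt") as handle:
        # The sequence is the second line of every four-line FASTQ record.
        for seq in islice(islice(handle, 1, None, 4), limit):
            yield seq.strip()


def base_fractions(sequences):
    """Return the fraction of A, C, G and T over all given sequences (N ignored)."""
    counts = {base: 0 for base in "ACGT"}
    for seq in sequences:
        for base in seq.upper():
            if base in counts:
                counts[base] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("No A/C/G/T bases found")
    return {base: count / total for base, count in counts.items()}


def looks_bisulfite_converted(sequences, max_c_fraction=0.05):
    """Heuristic: flag read sets whose cytosine fraction is below the assumed cut-off."""
    return base_fractions(sequences)["C"] < max_c_fraction


# Toy sequences only; real use would pass read_fastq_sequences("reads_1.fq.gz").
print(looks_bisulfite_converted(["ATTGTTAG", "TTTAGGAT"]))
print(looks_bisulfite_converted(["ACGTACGT"]))
```

For an untreated bacterial genome the C fraction should sit near half the genomic GC content, so a value far below that in read 1 is what would point to bisulfite treatment.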
📊 Quick Decision Tree
graph TD
A[Do you have ONT raw signal files?] -->|No| B[Contact Novogene urgently];
A -->|Yes| C{Confirm chemistry};
C --> D[Select dorado_model: sup/hac/fast];
C --> E[Select dorado_modification: 5mC_6mA_4mC for bacteria];
B --> F{Novogene response};
F -->|ONT data available| C;
F -->|Only Illumina FASTQ| G[Use Bactopia or nf-core/methylseq if WGBS];
✅ Summary Checklist
| Task | Status |
|---|---|
| 🔴 Confirm whether ONT signal files (fast5/pod5) exist in cloud storage | ⏳ |
| 🔴 If yes: verify sequencing chemistry with Novogene | ⏳ |
| 🟡 If ONT confirmed: use --dorado_model sup --dorado_modification 5mC_6mA_4mC | 📋 |
| 🟢 If only Illumina: proceed with Bactopia (a standalone pipeline, not part of nf-core) or nf-core/methylseq | 🔄 |
💡 Pro Tip: For bacterial restriction-modification system studies, 6mA and 4mC are often more biologically relevant than 5mC. Always prioritize modification models that include these when analyzing prokaryotes.
Let me know:
Did you find any fast5/, pod5/, or .bam files in the cloud directory?
What does Novogene say about the sequencing platform used?
Here is a comprehensive, publication-ready analysis addressing the requests. It includes the exact gene content, clarification on W3’s mapping, a cautious interpretation of Y3’s p2 loss, and a structured report.
📜 Part 1: Complete Gene Inventory of p1ATCC19606 & p2ATCC19606
| Plasmid | Length | Key Genes & Functional Modules | Coordinates (bp) | Biological Role |
|---|---|---|---|---|
| p1ATCC19606 (CP045108.1) | 7,655 | iteron region | 1–143 | Replication origin control |
| | | repAci | 144–1094 | Plasmid replication initiation protein |
| | | higA-2 / higB-2 | 3099–3684 (comp) | Toxin-antitoxin system (post-segregational killing) |
Critical Interpretation: W3 does not contain all p2 genes. The aligned p2 segment starts at position 3304, meaning W3 lacks the 5′-end of p2 (1–3303 bp), which includes:
The p2 iteron region (1–185)
repAci9 replication gene (186–1121)
First two pdiff recombination sites
Partial higB2 toxin (starts at 3043)
Biological implication: W3 is a truncated p1+p2 derivative. It retains p1’s repAci for replication and carries a ~5.6 kb p2 cargo segment (containing higA1, sel1, SMI1, osmC, merR, mobA, and recombination sites), but lacks p2’s autonomous replication module. This explains:
Why its length is 16,750 bp (~445 bp shorter than full p1+p2)
Why its Mash distance to p2 is higher (0.0276) than to p1 (0.0156)
Why it’s functionally a p1-replicon with p2 accessory cargo, not a true balanced fusion.
⚠️ Part 3: Y3 Missing p2ATCC19606 – Does It Make Sense?
Yes, it is biologically and technically plausible, but requires careful handling in discussion.
✅ Why It Makes Sense:
Plasmid Instability in A. baumannii: ATCC 19606 is a historical clinical isolate (1948) notorious for spontaneous plasmid curing during subculturing, especially when antibiotic selection is absent.
Accessory Nature: p2 carries stress-response and mobility genes (osmC, SMI1, merR, mobA), not core housekeeping genes. Loss confers no lethal penalty in rich media.
Your Data Is Robust: Empty mash screen, only a 44 bp chromosomal BLAST hit, and clean assembly of p1 at 33× depth all rule out assembly artifact.
🚨 Critical Caution for Your Co-Author:
“While our genomic data strongly indicate complete absence of p2ATCC19606 in Y3, we must avoid overinterpreting phenotypic consequences without wet-lab validation. Plasmid loss during routine passaging is a well-documented confounder in Acinetobacter research. Any claimed stress-sensitivity, cell-wall alteration, or conjugation deficiency in Y3 must be explicitly framed as hypothesized and ideally validated via: (1) targeted PCR for p2-specific markers (mobA, osmC, repAci9), (2) growth assays under osmotic/metal stress, and (3) comparison with a p2-cured derivative of a p2-positive isolate to control for background mutations.”
📊 Part 4: Comprehensive Plasmid Presence/Absence Report
(Ready for manuscript supplement or internal memo)
Conserved Lineages: The ~7.6 kb (p1) and ~9.5 kb (p2) plasmids are present as separate replicons in Y2, Y4, and W4. Mash distances <0.001 indicate near-identical sequences, consistent with a shared origin.
Fusion Events: Y1 and W1 carry identical 17.2 kb plasmids with Fusion Score 1.052, indicating homologous recombination between p1 and p2 backbones. The junction likely occurs near shared pdiff/XerC/D sites.
Cargo Expansion: W2’s 24.8 kb plasmid shares the p1+p2 core but carries an additional ~7.6 kb region. MinHash sketched distance = 0 to Y1/W1 suggests the extra DNA is repetitive or low-complexity (e.g., IS elements, tandem duplications).
Truncated Acquisition: W3 retains full p1 but only a 5.6 kb p2 segment (lacking repAci9 and iterons). It is a p1-replicon driven plasmid with p2-derived stress/mobility cargo.
Complete p2 Loss in Y3: Validated by multi-algorithm screening. Likely reflects spontaneous curing during isolation/passaging. No chromosomal integration detected.
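To make the Mash distances quoted above easier to interpret, the sketch below converts between the Jaccard index of two k-mer sketches and the Mash distance, using the published relation D = -(1/k) · ln(2j / (1 + j)). The k-mer size of 21 is Mash's default and is an assumption here, since the notes do not record the settings that were used.

```python
import math


def mash_distance(jaccard, k=21):
    """Mash distance for a given sketch Jaccard index (k = k-mer size)."""
    if jaccard <= 0:
        return 1.0  # no shared hashes: Mash reports the maximum distance
    return max(0.0, -math.log(2 * jaccard / (1 + jaccard)) / k)


def jaccard_from_distance(distance, k=21):
    """Invert the relation above to recover the implied Jaccard index."""
    x = math.exp(-k * distance)
    return x / (2 - x)


# Distances quoted in the notes for W3 versus p1 and p2.
for label, distance in (("W3 vs p1", 0.0156), ("W3 vs p2", 0.0276)):
    print(label, round(jaccard_from_distance(distance), 3))

# A distance of exactly 0 corresponds to all sketched hashes being shared.
print(mash_distance(1.0))
```

Read this way, a distance of 0 between W2 and Y1/W1 says only that the sketches share all sampled k-mers, which fits an extra region made of duplicated sequence but is not in itself proof of it.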
📝 Recommended Phrasing for Manuscript/Discussion
“Comparative plasmid analysis revealed a dynamic accessory genome architecture across eight A. baumannii isolates. Two conserved plasmid lineages homologous to ATCC 19606 p1 (CP045108.1) and p2 (CP045109.1) were identified in Y2, Y4, and W4. Homologous recombination between these lineages generated a 17.2 kb fusion plasmid in Y1 and W1, while W2 harbors an expanded 24.8 kb derivative. W3 carries a truncated p2 segment integrated into a p1 backbone, lacking the p2 replication module. Notably, isolate Y3 completely lacks p2, with no evidence of chromosomal integration or assembly artifact. Given the well-documented instability of plasmids in A. baumannii during subculturing, Y3 likely represents a spontaneously cured derivative. While loss of p2-encoded stress regulators (osmC, SMI1, merR) and conjugation machinery (mobA) may alter phenotypic resilience, such effects require targeted validation to distinguish plasmid-mediated traits from background genomic variation.”
🔬 Next Steps for Validation & Publication
PCR Confirmation: Design primers for p2-specific regions (repAci9, mobA, osmC) to confirm Y3 absence and W3 truncation.
Junction Sequencing: Use long-read polishing or PCR amplicon sequencing to map the exact p1-p2 recombination breakpoint in Y1/W1/W3.
The previous reply covered only SNP/InDel results. With long-read sequencing and complete genome assemblies, we can now perform precise structural variant (SV) calling, which resolves the discrepancy you observed.
Key SV findings (Assemblytics, CP059040 reference)
• 4,443-bp deletion (all 8 isolates): Loss of the AdeIJK efflux pump genes
– Location: CP059040:737224–741667 (reverse complement strand ←)
– Impact: Complete loss of adeJ and adeI, and truncation of adeK (AdeIJK multidrug efflux pump)
– More detailed structural differences between the reference genome and the samples are provided in 4443bp_deletion.txt.
• 1,101-bp ISAba11 insertion (Y3, W1, W3 only): Disruption of the galE gene
– Location: CP059040:3853883–3853888 (reverse complement strand ←, within the galE coding sequence, ~96 bp from the 5′ end)
– Technical clarification: The 6-bp coordinate range (3853883–3853888) represents the insertion site interval, not the replaced sequence. Upon transposition, ISAba11 generates a 5-bp target site duplication (TSD); the original 5-bp motif is preserved on the left flank, and an identical copy is created on the right flank during gap repair. Thus, the 1,101-bp element is inserted (not substituted), resulting in a net gain of 1,106 bp (1,101 bp ISAba11 + 5 bp TSD).
– Identity: 100% identical to ISAba11 (https://www.ncbi.nlm.nih.gov/nucleotide/JF309050.1; GenBank: JF309050.1)
– Mechanism: galE encodes UDP-glucose 4-epimerase, which is essential for LPS biosynthesis. ISAba11 insertion disrupts galE → LPS modification/loss → reduced membrane negative charge → colistin resistance.
– The mechanism is described in the manuscript "Insertion sequence ISAba11 is involved in colistin resistance and loss of lipopolysaccharide in Acinetobacter baumannii" (attached in the email).
– More detailed structural differences between the reference genome and the samples are provided in 1101bp_ISAba11_insertion.txt.
• 121-bp tandem contraction (all 8 isolates): Reduction in tRNA-Gln copy number (4→3)
– Location: CP059040:3124916–3125037
– Technical clarification: One complete tRNA-Gln gene (75 bp), along with flanking intergenic spacers, is lost due to copy-number reduction from 4→3 repeats.
– Impact: tRNA-Gln copy number is reduced from 4 to 3; likely neutral, but useful as a stable lineage marker.
– More detailed structural differences between the reference genome and the samples are provided in 121bp_tandem_contraction.txt.
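3124916..3124942 [spacer] 27 bp
3124943..3125017 [tRNA-Gln #3] 75 bp ← this gene copy is lost in the contraction
3125018..3125037 [spacer] 20 bp
───────────────────────────────
Total span 122 bp (coordinate range)
However, the "variant size" computed by Assemblytics is:
Lost sequence = complete tRNA-Gln gene + part of the flanking spacers
= 75 bp (tRNA) + ~61 bp (flanking spacers + repeat-unit boundaries) + partial sequence of the adjacent repeat unit
≈ 198 bp (total sequence difference)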
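The sizes and coordinate ranges quoted for the three structural variants can be cross-checked with simple arithmetic. The sketch below uses only numbers from these notes (including the galE CDS end at 3853984 from the GenBank record); the observation that the reported sizes equal end minus start, while the inclusive base count is one larger, is an inference from the arithmetic and not a statement about how Assemblytics defines its coordinates.

```python
def span_difference(start, end):
    """Size as end minus start (matches the sizes quoted in the findings)."""
    return end - start


def span_inclusive(start, end):
    """Number of bases when both coordinates are counted."""
    return end - start + 1


variants = {
    "AdeIJK deletion": (737224, 741667),
    "ISAba11 insertion site": (3853883, 3853888),
    "tRNA-Gln contraction": (3124916, 3125037),
}
for name, (start, end) in variants.items():
    print(name, span_difference(start, end), span_inclusive(start, end))

# galE lies on the complement strand, so its 5' end is the higher coordinate.
GALE_CDS_END = 3853984
print("Insertion offset from galE 5' end:", GALE_CDS_END - 3853888)

# Net sequence gain: the IS element plus one duplicated copy of the target site.
print("Net gain:", 1101 + 5)

# The annotated pieces of the contraction interval add up to the inclusive span.
print("Spacer + tRNA + spacer:", 27 + 75 + 20)
```

This reconciles the "121-bp" label with the 122-bp coordinate range and the 6-bp insertion interval with the 5-bp target site duplication; it does not explain the separate ~198 bp figure, which still needs to be checked against the Assemblytics output.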
B (Variant) Direct junction after 4,443-bp deletion; truncated adeK fused to PAP2
C (Mechanism) Unequal homologous recombination between microhomologous sequences (5′-GCTTA-3′) flanking the deletion region, excising a circular intermediate
Key annotations: Scale bar (1 kb), gene labels, recombination arrows, “AdeIJK efflux pump” functional annotation
Use in manuscript: Results section for conserved SVs; Supplementary Fig. S1 for mechanism details
Figure 2: ISAba11 Transposon Insertion Disrupting galE Conferring Colistin Resistance
Illustrates: Mobile element insertion linking genotype to colistin resistance phenotype
Panels:
A (Reference) Intact galE (UDP-glucose 4-epimerase) essential for LPS biosynthesis
B (Variant) galE interrupted by 1,101-bp ISAba11; features shown: inverted repeats (IR-L/IR-R), tnpA transposase, 5-bp target site duplication (TSD: 5′-TTAAA-3′)
C (Mechanism) Stepwise transposition model: TnpA-mediated excision, staggered target cut, insertion with gap repair
Key annotations: Gene coordinates, TSD highlight, colistin molecule (purple), LPS structure simplified
Use in manuscript: Central figure for resistance mechanism; ideal for main text Figure 3 or 4
Figure 3: Replication Slippage-Mediated Tandem Contraction in tRNA-Gln Array
Illustrates: Copy-number variation in a non-coding repetitive locus
Panels:
Top (Reference) Four tandem tRNA-Gln genes (75 bp each), total span ~438 bp
Middle (Variant) Three copies remaining after 198-bp contraction; one repeat unit lost
Bottom (Mechanism) Replication fork schematic: nascent strand slippage at repeat boundary → misalignment → skipping of one repeat unit during synthesis
Key annotations: “Microhomology-mediated slippage” callout, scale bar (100 bp), neutral evolution note
Use in manuscript: Supplementary figure for lineage markers; Methods section for SV calling validation
Export-ready for PDF/EPS; legible at single-column (8.5 cm) or double-column (17 cm) width
Compliance
No isolate names, no proprietary data; generic “Reference” vs “Variant” labeling
📋 Suggested Figure Legends (Copy-Paste Ready)
Figure 1. Homologous recombination mediates a conserved 4.4-kb deletion disrupting the AdeIJK multidrug efflux system. (A) Genomic context in reference strain (CP059040). (B) Variant structure after deletion, showing fusion of truncated adeK to downstream PAP2. (C) Proposed mechanism: unequal crossover between microhomologous 5-bp sequences (GCTTA) excises the intervening 4,443-bp fragment as a circular intermediate. Gene arrows indicate transcriptional orientation; scale bar, 1 kb.
Figure 2. ISAba11 insertion into galE provides a molecular basis for colistin resistance. (A) Intact galE encodes UDP-glucose 4-epimerase, required for lipopolysaccharide (LPS) core biosynthesis. (B) In resistant isolates, a 1,101-bp ISAba11 element inserts 96 bp downstream of the galE start codon, disrupting the open reading frame. ISAba11 features: inverted repeats (IRs), transposase gene (tnpA), and 5-bp target site duplication (TSD). (C) Stepwise transposition model. (D) Phenotypic consequence: truncated LPS reduces membrane negative charge, decreasing binding of cationic colistin. Scale bar, 200 bp.
Figure 3. Replication slippage drives tandem contraction in a tRNA-Gln gene array. (Top) Reference configuration: four identical tRNA-Gln copies in head-to-tail orientation. (Middle) Variant configuration after 198-bp contraction, reducing copy number to three. (Bottom) Molecular mechanism: DNA polymerase slippage at repeat boundaries causes misalignment and skipping of one repeat unit during synthesis. This neutral variant serves as a stable lineage marker. Scale bar, 100 bp.
ln -s ~/Tools/bacto/db/ .;
ln -s ~/Tools/bacto/envs/ .;
ln -s ~/Tools/bacto/local/ .;
cp ~/Tools/bacto/Snakefile .;
cp ~/Tools/bacto/bacto-0.1.json .;
cp ~/Tools/bacto/cluster.json .;
#download CP059040.gb from GenBank
#mv ~/Downloads/sequence\(2\).gb db/CP059040.gb
mamba activate /home/jhuang/miniconda3/envs/bengal3_ac3
(bengal3_ac3) jhuang@WS-2290C:~/DATA/Data_Tam_DNAseq_2023_A6WT_A10CraA_A12AYE_A1917978$ which snakemake
/home/jhuang/miniconda3/envs/bengal3_ac3/bin/snakemake
(bengal3_ac3) jhuang@WS-2290C:~/DATA/Data_Tam_DNAseq_2023_A6WT_A10CraA_A12AYE_A1917978$ snakemake -v
4.0.0 --> CORRECT!
#NOTE_1: in bacto-0.1.json, set only the steps assembly, typing_mlst and, if needed, pangenome and variants_calling to true, and set cpu=20 for every step that is used.
#Use the following settings in bacto-0.1.json:
"fastqc": false,
"taxonomic_classifier": false,
"assembly": true,
"typing_ariba": false,
"typing_mlst": true,
"pangenome": true,
"variants_calling": true,
"phylogeny_fasttree": false,
"phylogeny_raxml": false,
"recombination": false, (due to gubbins-error set false)
"prokka": {
"genus": "Acinetobacter",
"kingdom": "Bacteria",
"species": "baumannii",
"cpu": 10,
"evalue": "1e-06",
"other": ""
},
"mykrobe": {
"species": "abaum"
},
"reference": "db/CP059040.gb"
#NOTE_2: the Titisee disk must be mounted, because the pipeline reads /media/jhuang/Titisee/GAMOLA2/TIGRfam_db/TIGRFAMs_15.0_HMM.LIB
snakemake --printshellcmds
Summarize all SNPs and Indels from the snippy result directory.
cp ~/Scripts/summarize_snippy_res_ordered.py .
# IMPORTANT: adapt the isolates array in the script to the current sample names
isolates = ["W1", "W2", "W3", "W4", "Y1", "Y2", "Y3", "Y4"]
mamba activate plot-numpy1
python3 ./summarize_snippy_res_ordered.py snippy
#--> Summary CSV file created successfully at: snippy/summary_snps_indels.csv
cd snippy
#REMOVE this line? Its purpose is unclear: grep -v "None,,,,,,None,None" summary_snps_indels.csv > summary_snps_indels_.csv
Calling variants with SPANDx (the results are almost the same as those from viral-ngs!)
mamba deactivate
mamba activate /home/jhuang/miniconda3/envs/spandx
# PREPARE the inputs for the options ref and database
mkdir ~/miniconda3/envs/spandx/share/snpeff-5.1-2/data/PP810610
cp PP810610.gb ~/miniconda3/envs/spandx/share/snpeff-5.1-2/data/PP810610/genes.gbk
vim ~/miniconda3/envs/spandx/share/snpeff-5.1-2/snpEff.config
/home/jhuang/miniconda3/envs/spandx/bin/snpEff build PP810610 #-d
~/Scripts/genbank2fasta.py PP810610.gb
mv PP810610.gb_converted.fna PP810610.fasta #rename "NC_001348.1 xxxxx" to "NC_001348" in the fasta-file
ln -s /home/jhuang/Tools/spandx/ spandx
(spandx) nextflow run spandx/main.nf --fastq "trimmed/*_P_{1,2}.fastq" --ref CP059040.fasta --annotation --database CP059040 -resume
# RERUN SNP_matrix.sh because the variant annotation failed with ERROR_CHROMOSOME_NOT_FOUND, which left every impact as MODIFIER --> IT WORKS!
cd Outputs/Master_vcf
conda activate /home/jhuang/miniconda3/envs/spandx
(spandx) cp -r ../../snippy/Y1/reference . # Any sample directory works; all directories contain the same reference.
(spandx) cp ../../spandx/bin/SNP_matrix.sh ./
#Note that ${variant_genome_path}=CP059040 in the following command, but it was not used after the following command modification.
#Adapt "snpEff eff -no-downstream -no-intergenic -ud 100 -formatEff -v ${variant_genome_path} out.vcf > out.annotated.vcf" to
"/home/jhuang/miniconda3/envs/bengal3_ac3/bin/snpEff eff -no-downstream -no-intergenic -ud 100 -formatEff -c reference/snpeff.config -dataDir . ref out.vcf > out.annotated.vcf" in SNP_matrix.sh
(spandx) bash SNP_matrix.sh CP059040 .
Calling inter-host variants by merging the results from snippy+spandx
Manually checking each of the 6 records by comparing them to the results from SPANDx; three are confirmed!
#CHROM,POS,REF,ALT,TYPE,Y1,Y2,Y3,Y4,W1,W2,W3,W4,Effect,Impact,Functional_Class,Codon_change,Protein_and_nucleotide_change,Amino_Acid_Length,Gene_name,Biotype
# -- Results from snippy --
#move: CP059040,1527276,TTGAACC,T,del,TTGAACC,TTGAACC,TTGAACC,T,TTGAACC,TTGAACC,T,T,conservative_inframe_deletion,MODERATE,,gaacct/,p.Glu443_Pro444del/c.1327_1332delGAACCT,704,H0N29_07175,protein_coding
#confirmed: CP059040,1843289,G,T,snp,G,T,G,G,G,G,G,G,missense_variant,MODERATE,MISSENSE,gCg/gAg,p.Ala37Glu/c.110C>A,357,H0N29_08665,protein_coding
#confirmed: CP059040,2019186,G,A,snp,A,G,G,G,G,G,G,G,upstream_gene_variant,MODIFIER,,59,c.-59C>T,144,H0N29_09480,protein_coding
#delete_this? CP059040,3124917,T,"TAC,TACTTCATTACATACCAACCGCCAAGGGTGC",snp,C,T,C,C,T,T,T,C,upstream_gene_variant,MODIFIER,,25,c.-25_-24insAC,0,H0N29_00075,protein_coding
#move: CP059040,3310021,C,CT,ins,CT,CT,CT,CT,C,CT,CT,CT,intragenic_variant,MODIFIER,,,n.3310021_3310022insT,,H0N29_00075,
#confirmed: CP059040,3853714,G,A,snp,G,G,G,G,G,A,G,A,stop_gained,HIGH,NONSENSE,Cag/Tag,p.Gln91*/c.271C>T,338,H0N29_18290,protein_coding
#--> Only three SNPs are confirmed, i.e. there is almost no variation at the genomic level!
# -- Results from the SPANDx --
#CP059040 1527276 TTGAACC T INDEL TTGAACC/T T T T T T T T conservative_inframe_deletion MODERATE gaacct/ p.Glu443_Pro444del/c.1327_1332delGAACCT 704 H0N29_07175 protein_coding
#CP059040 1843289 G T SNP G T G G G G G G missense_variant MODERATE MISSENSE gCg/gAg p.Ala37Glu/c.110C>A 357 H0N29_08665 protein_coding
#CP059040 2019186 G A SNP A G G G G G G G upstream_gene_variant MODIFIER 59 c.-59C>T 144 H0N29_09480 protein_coding
#Cmp to CP059040 3124917 T TAC,TACTTCATTACATACCAACCGCCAAGGGTGC INDEL . TACTTCATTACATACCAACCGCCAAGGGTGC TACTTCATTACATACCAACCGCCAAGGGTGC TAC . . . . upstream_gene_variant MODIFIER 25 c.-25_-24insAC 0 H0N29_00075 protein_coding
#Cmp to CP059040 3124920 C CATTACATACCAACCGCCAAGGGTGCTTCATG INDEL . . . CATTACATACCAACCGCCAAGGGTGCTTCATG . . C . upstream_gene_variant MODIFIER 22 c.-22_-21insATTACATACCAACCGCCAAGGGTGCTTCATG 0 H0N29_00075 protein_coding
#TODO: Move to invariant-file: CP059040 3310021 C CT INDEL CT CT CT CT CT CT CT CT intragenic_variant MODIFIER n.3310021_3310022insT H0N29_00075
#CP059040 3853714 G A SNP G G G G G A G A stop_gained HIGH NONSENSE Cag/Tag p.Gln91*/c.271C>T 338 H0N29_18290 protein_coding
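Sorting records into "variable among isolates" and "invariant" (such as the 3310021 insertion flagged above for the invariant file) can be automated. The sketch below is an illustration that assumes the comma-separated layout of the snippy summary header shown earlier; the two example rows are taken from the results above, with the 3310021 row using the SPANDx genotypes, and the function name is hypothetical.

```python
import csv
import io

ISOLATES = ["Y1", "Y2", "Y3", "Y4", "W1", "W2", "W3", "W4"]

HEADER = (
    "CHROM,POS,REF,ALT,TYPE,Y1,Y2,Y3,Y4,W1,W2,W3,W4,Effect,Impact,"
    "Functional_Class,Codon_change,Protein_and_nucleotide_change,"
    "Amino_Acid_Length,Gene_name,Biotype"
)
ROWS = [
    "CP059040,1843289,G,T,snp,G,T,G,G,G,G,G,G,missense_variant,MODERATE,"
    "MISSENSE,gCg/gAg,p.Ala37Glu/c.110C>A,357,H0N29_08665,protein_coding",
    "CP059040,3310021,C,CT,ins,CT,CT,CT,CT,CT,CT,CT,CT,intragenic_variant,"
    "MODIFIER,,,n.3310021_3310022insT,,H0N29_00075,",
]


def split_by_variability(csv_text):
    """Separate records whose genotype differs among isolates from invariant ones."""
    variable, invariant = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        genotypes = {row[name] for name in ISOLATES}
        # One distinct genotype means all isolates agree (they differ only from the reference).
        (invariant if len(genotypes) == 1 else variable).append(row)
    return variable, invariant


variable, invariant = split_by_variability("\n".join([HEADER] + ROWS))
print("variable:", [row["POS"] for row in variable])
print("invariant:", [row["POS"] for row in invariant])
```

Records in the invariant group separate the whole isolate set from the reference but carry no information about differences between the isolates themselves.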
#--> The indel report is more complicated: the following commands are needed to find the original change and the related codon change.
## Check gene strand in your GFF/GenBank
#grep "H0N29_07175" reference.gff
# Extract ~30 bp around the variant from the reference
samtools faidx CP059040.fasta CP059040:1527260-1527290
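For genes on the complement strand, the coding position reported by the annotation can be cross-checked against the GenBank CDS coordinates. The sketch below does this for the three confirmed SNPs, using the coordinates of galE, adeS and H0N29_09480 from the records in these notes; it assumes single-interval CDS features without introns, and the function names are illustrative.

```python
def coding_position(pos, cds_start, cds_end, strand):
    """Return the c. position: 1-based inside the CDS, negative upstream of the start codon."""
    # On the minus strand the start codon sits at the higher genomic coordinate.
    offset = cds_end - pos if strand == "-" else pos - cds_start
    return offset + 1 if offset >= 0 else offset


def codon_number(c_position):
    """Codon (amino-acid) index for a positive coding position."""
    return (c_position - 1) // 3 + 1


snps = [
    # (label, genomic position, CDS start, CDS end, strand)
    ("galE stop_gained", 3853714, 3852968, 3853984, "-"),
    ("adeS missense", 1843289, 1842325, 1843398, "-"),
    ("H0N29_09480 upstream", 2019186, 2018693, 2019127, "-"),
]
for label, pos, start, end, strand in snps:
    c_pos = coding_position(pos, start, end, strand)
    codon = codon_number(c_pos) if c_pos > 0 else None
    print(label, "c.", c_pos, "codon", codon)
```

The results agree with the annotated changes (c.271 at codon 91 in galE, c.110 at codon 37 in adeS, and c.-59 upstream of H0N29_09480), which supports the strand and coordinate handling in the snpEff annotation.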
Annotation of the three confirmed SNPs
gene complement(3852968..3853984)
/gene="galE"
/locus_tag="H0N29_18290"
CDS complement(3852968..3853984)
/gene="galE"
/locus_tag="H0N29_18290"
/EC_number="5.1.3.2"
/inference="COORDINATES: similar to AA
sequence:RefSeq:WP_017725586.1"
/note="Derived by automated computational analysis using
gene prediction method: Protein Homology."
/codon_start=1
/transl_table=11
/product="UDP-glucose 4-epimerase GalE"
/protein_id="QNT84923.1"
/translation="MAKILVTGGAGYIGSHTCVELLEAGHEVIVFDNLSNSSKESLNR
VQEITQKGLTFVEGDIRNSGELDRVFQEHAIDAVIHFAGLKAVGESQEKPLIYFDNNI
AGSIQLVKSMEKAGVYTLVFSSSATVYDEANTSPLNEEMPTGMPSNNYGYTKLIVEQL
LQKLSVADSKWSIALLRYFNPVGAHKSGRIGEDPQGIPNNLMPYVTQVAVGRREKLSI
YGNDYDTIDGTGVRDYIHVVDLANAHLCALNNRLQAQGCRAWNIGTGNGSSVLQVKNT
FEQVNGVPVAFEFAPRRAGDVATSFADNARAVAELGWKPQYGLEDMLKDSWNWQKQNP
NGYN"
gene complement(1842325..1843398)
/gene="adeS"
/locus_tag="H0N29_08665"
CDS complement(1842325..1843398)
/gene="adeS"
/locus_tag="H0N29_08665"
/inference="COORDINATES: similar to AA
sequence:RefSeq:WP_000837467.1"
/note="Derived by automated computational analysis using
gene prediction method: Protein Homology."
/codon_start=1
/transl_table=11
/product="two-component sensor histidine kinase AdeS"
/protein_id="QNT86623.1"
/translation="MKSKLGISKQLFIALTIVNLSVTLFSVVLGYVIYNYAIEKGWIS
LSSFQQEDWTSFHFVDWIWLATVIFCGCIISLVIGMRLAKRFIVPINFLAEAAKKISH
GDLSARAYDNRIHSAEMSELLYNFNDMAQKLEVSVKNAQVWNAAIAHELRAPITILQG
RLQGIIDGVFKPDEVLFKSLLNQVEGLSHLVEDLRTLSLVENQQLRLNYELFDLKAVV
EKVLKAFEDRLDQAKLVPELDLTSTPVYCDRRRIEQVLIALIDNSIRYSNAGKLKISS
EVVADNWILKIEDEGPGIATEFQDDLFKPFFRLEESRNKEFGGTGLGLAVVHAIVVAL
KGTIQYSNQGSKSVFTIKISMNN"
gene complement(2018693..2019127)
/locus_tag="H0N29_09480"
CDS complement(2018693..2019127)
/locus_tag="H0N29_09480"
/inference="COORDINATES: similar to AA
sequence:RefSeq:YP_004995263.1"
/note="Derived by automated computational analysis using
gene prediction method: Protein Homology."
/codon_start=1
/transl_table=11
/product="HIT domain-containing protein"
/protein_id="QNT83319.1"
/translation="MFSLHPQLAQDTFFVGDFPLSTCRLMNDMQFPWLILVPRVPGIT
ELYELSQADQEQFLRESSWLSSQLSRVFRADKMNVAALGNMVPQLHFHHVVRYQNDVA
WPKPVWGTPAVPYTSDVLAHMRQTLMLALRGQGDMPFDWRMD"
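All three genes above are annotated on the complement strand, so the c. positions reported by snpEff can be cross-checked against the GenBank coordinates by hand. A minimal sketch of that check follows; the function names are hypothetical, and it assumes a single-exon bacterial CDS with codon_start=1, as in the records above.

```python
# Sketch: map a forward-strand genomic position to a coding (c.) position for
# a gene annotated as complement(start..end). Assumes a single-exon bacterial
# CDS with codon_start=1; the function names are hypothetical.

def minus_strand_cds_position(genomic_pos, gene_start, gene_end):
    """Return the c. position: positive inside the CDS, negative upstream."""
    if genomic_pos > gene_end:
        # Upstream of a minus-strand gene means a larger genomic coordinate.
        return gene_end - genomic_pos
    if genomic_pos < gene_start:
        raise ValueError("position lies downstream of the CDS")
    return gene_end - genomic_pos + 1


def codon_number(cds_pos):
    """Return the 1-based codon index for a positive coding position."""
    if cds_pos < 1:
        raise ValueError("codon numbers are only defined inside the CDS")
    return (cds_pos - 1) // 3 + 1


if __name__ == "__main__":
    gal_e = minus_strand_cds_position(3853714, 3852968, 3853984)
    ade_s = minus_strand_cds_position(1843289, 1842325, 1843398)
    hit = minus_strand_cds_position(2019186, 2018693, 2019127)
    print(gal_e, codon_number(gal_e))  # galE, H0N29_18290
    print(ade_s, codon_number(ade_s))  # adeS, H0N29_08665
    print(hit)                         # H0N29_09480 (upstream variant)
```

The values agree with the annotations in the SNP table: c.271 / codon 91 for galE (p.Gln91*), c.110 / codon 37 for adeS (p.Ala37Glu), and c.-59 for H0N29_09480. Because the genes lie on the complement strand, the forward-strand G>A and G>T changes appear as C>T and C>A in the c. notation.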
ln -s ~/Tools/bacto/db/ .;
ln -s ~/Tools/bacto/envs/ .;
ln -s ~/Tools/bacto/local/ .;
cp ~/Tools/bacto/Snakefile .;
cp ~/Tools/bacto/bacto-0.1.json .;
cp ~/Tools/bacto/cluster.json .;
#download CP059040.gb from GenBank
#mv ~/Downloads/sequence\(2\).gb db/CP059040.gb
mamba activate /home/jhuang/miniconda3/envs/bengal3_ac3
(bengal3_ac3) jhuang@WS-2290C:~/DATA/Data_Tam_DNAseq_2023_A6WT_A10CraA_A12AYE_A1917978$ which snakemake
/home/jhuang/miniconda3/envs/bengal3_ac3/bin/snakemake
(bengal3_ac3) jhuang@WS-2290C:~/DATA/Data_Tam_DNAseq_2023_A6WT_A10CraA_A12AYE_A1917978$ snakemake -v
4.0.0 --> CORRECT!
#NOTE_1: modify bacto-0.1.json so that only the steps assembly, typing_mlst, and possibly pangenome and variants_calling are true; set cpu=20 in all used steps.
#Set the following in bacto-0.1.json:
"fastqc": false,
"taxonomic_classifier": false,
"assembly": true,
"typing_ariba": false,
"typing_mlst": true,
"pangenome": true,
"variants_calling": true,
"phylogeny_fasttree": false,
"phylogeny_raxml": false,
"recombination": false, (set to false because of a gubbins error)
"prokka": {
"genus": "Acinetobacter",
"kingdom": "Bacteria",
"species": "baumannii",
"cpu": 10,
"evalue": "1e-06",
"other": ""
},
"mykrobe": {
"species": "abaum"
},
"reference": "db/CP059040.gb"
#NOTE_2: the disk Titisee must be mounted, since the pipeline needs /media/jhuang/Titisee/GAMOLA2/TIGRfam_db/TIGRFAMs_15.0_HMM.LIB
snakemake --printshellcmds
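The manual edits to bacto-0.1.json described in NOTE_1 could also be scripted. The following is only a sketch: it assumes that the step switches are top-level booleans and that per-tool settings are nested objects with a "cpu" key, as in the fragment shown above. The function name configure_bacto is hypothetical and not part of bacto.

```python
import copy


def configure_bacto(config, enabled_steps, cpu=20):
    """Return a copy of the config with only `enabled_steps` switched on.

    Assumption: every top-level boolean is a step switch and every nested
    dict that has a "cpu" key belongs to a tool whose thread count may be set.
    """
    updated = copy.deepcopy(config)
    for key, value in updated.items():
        if isinstance(value, bool):
            updated[key] = key in enabled_steps
        elif isinstance(value, dict) and "cpu" in value:
            value["cpu"] = cpu
    return updated


if __name__ == "__main__":
    # Small stand-in for the real file; load it with json.load() in practice.
    example = {
        "fastqc": True,
        "assembly": False,
        "typing_mlst": False,
        "prokka": {"genus": "Acinetobacter", "cpu": 10},
        "reference": "db/CP059040.gb",
    }
    steps = {"assembly", "typing_mlst", "pangenome", "variants_calling"}
    print(configure_bacto(example, steps))
```

String values such as "reference" are left untouched, so the path to db/CP059040.gb still has to be set by hand.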
Summarize all SNPs and Indels from the snippy result directory.
cp ~/Scripts/summarize_snippy_res_ordered.py .
# IMPORTANT: adapt the isolates array in the script, deleting the isolates "flu_wt_cipro" and "flu_dAB_cipro"
isolates = ["flu_wt_cef", "flu_wt_dori", "flu_wt_nitro", "flu_wt_pip", "flu_wt_polyB", "flu_wt_tet", "flu_dAB_cef", "flu_dAB_dori", "flu_dAB_nitro", "flu_dAB_pip", "flu_dAB_polyB", "flu_dAB_tet", "flu_dIJ_cef", "flu_dIJ_cipro", "flu_dIJ_dori", "flu_dIJ_nitro", "flu_dIJ_pip", "flu_dIJ_polyB", "mito_dIJ_trime", "mito_wt_cipro", "mito_wt_nitro", "mito_wt_polyB", "mito_wt_trime", "mito_dAB_cipro", "mito_dAB_dori", "mito_dAB_nitro", "mito_dAB_tet", "mito_dAB_trime", "mito_dIJ_cipro", "mito_dIJ_dori", "mito_dIJ_nitro", "mito_dIJ_polyB", "mito_dIJ_tet"]
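Instead of editing the array by hand, the contaminated isolates could be filtered out programmatically. This is a small sketch with hypothetical names (CONTAMINATED, usable_isolates); it is not taken from summarize_snippy_res_ordered.py.

```python
# Isolates excluded from the summary because their reads were contaminated.
CONTAMINATED = {"flu_wt_cipro", "flu_dAB_cipro"}


def usable_isolates(all_isolates, excluded=CONTAMINATED):
    """Keep the original order and drop every excluded isolate name."""
    return [name for name in all_isolates if name not in excluded]


if __name__ == "__main__":
    candidates = ["flu_wt_cef", "flu_wt_cipro", "flu_dAB_cef", "flu_dAB_cipro"]
    print(usable_isolates(candidates))
```

Keeping the exclusion list in one place makes it easier to reuse the same filter for the SPANDx run, where the same two samples are removed.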
mamba activate plot-numpy1
python3 ./summarize_snippy_res_ordered.py snippy
#--> Summary CSV file created successfully at: snippy/summary_snps_indels.csv
cd snippy
#REMOVE this line? I don't see its purpose: grep -v "None,,,,,,None,None" summary_snps_indels.csv > summary_snps_indels_.csv
Calling variants with SPANDx (almost the same results as those from viral-ngs!)
mamba deactivate
mamba activate /home/jhuang/miniconda3/envs/spandx
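# PREPARE the inputs for the options --ref and --database
# (the commands below use the accession PP810610; the run further down uses --ref CP059040.fasta and --database CP059040, so substitute CP059040 there)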
mkdir ~/miniconda3/envs/spandx/share/snpeff-5.1-2/data/PP810610
cp PP810610.gb ~/miniconda3/envs/spandx/share/snpeff-5.1-2/data/PP810610/genes.gbk
vim ~/miniconda3/envs/spandx/share/snpeff-5.1-2/snpEff.config
/home/jhuang/miniconda3/envs/spandx/bin/snpEff build PP810610 #-d
~/Scripts/genbank2fasta.py PP810610.gb
mv PP810610.gb_converted.fna PP810610.fasta # then shorten the FASTA header to the bare accession, e.g. "NC_001348.1 xxxxx" to "NC_001348"
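The header renaming mentioned above can be done in an editor, or with a few lines of Python. The sketch below keeps only the accession without its version suffix; the function name is hypothetical. A header that does not match the chromosome name in the snpEff database is a plausible cause of annotation errors such as ERROR_CHROMOSOME_NOT_FOUND, although that link is an assumption here.

```python
def shorten_fasta_headers(lines):
    """Yield FASTA lines with each header reduced to the bare accession.

    ">NC_001348.1 some description" becomes ">NC_001348"; sequence lines are
    passed through with the trailing newline removed.
    """
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                # Drop the description and the ".version" suffix.
                line = ">" + tokens[0].split(".")[0]
        yield line


if __name__ == "__main__":
    example = [">NC_001348.1 xxxxx\n", "ACGTACGT\n"]
    print("\n".join(shorten_fasta_headers(example)))
```

For a real file, read the lines from the converted FASTA and write the result to a new file rather than overwriting the input.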
ln -s /home/jhuang/Tools/spandx/ spandx
# Delete the contaminated samples flu_wt_cipro*.fastq and flu_dAB_cipro*.fastq before the run
(spandx) nextflow run spandx/main.nf --fastq "trimmed/*_P_{1,2}.fastq" --ref CP059040.fasta --annotation --database CP059040 -resume
# RERUN SNP_matrix.sh because of the error ERROR_CHROMOSOME_NOT_FOUND in the variant annotation, which caused all impacts to be reported as MODIFIER --> IT WORKS!
cd Outputs/Master_vcf
conda activate /home/jhuang/miniconda3/envs/spandx
(spandx) cp -r ../../snippy/Y1/reference . # Any sample directory works here; all directories contain the same reference.
(spandx) cp ../../spandx/bin/SNP_matrix.sh ./
#Note: ${variant_genome_path} is CP059040 in the original command below, but it is no longer used after the modification.
#In SNP_matrix.sh, change "snpEff eff -no-downstream -no-intergenic -ud 100 -formatEff -v ${variant_genome_path} out.vcf > out.annotated.vcf" to
"/home/jhuang/miniconda3/envs/bengal3_ac3/bin/snpEff eff -no-downstream -no-intergenic -ud 100 -formatEff -c reference/snpeff.config -dataDir . ref out.vcf > out.annotated.vcf"
(spandx) bash SNP_matrix.sh CP059040 .
(TODO_TOMORROW) Manually check each of the 6 records by comparing them to the results from SPANDx; three are confirmed!
#CHROM,POS,REF,ALT,TYPE,Y1,Y2,Y3,Y4,W1,W2,W3,W4,Effect,Impact,Functional_Class,Codon_change,Protein_and_nucleotide_change,Amino_Acid_Length,Gene_name,Biotype
# -- Results from snippy --
#move: CP059040,1527276,TTGAACC,T,del,TTGAACC,TTGAACC,TTGAACC,T,TTGAACC,TTGAACC,T,T,conservative_inframe_deletion,MODERATE,,gaacct/,p.Glu443_Pro444del/c.1327_1332delGAACCT,704,H0N29_07175,protein_coding
#confirmed: CP059040,1843289,G,T,snp,G,T,G,G,G,G,G,G,missense_variant,MODERATE,MISSENSE,gCg/gAg,p.Ala37Glu/c.110C>A,357,H0N29_08665,protein_coding
#confirmed: CP059040,2019186,G,A,snp,A,G,G,G,G,G,G,G,upstream_gene_variant,MODIFIER,,59,c.-59C>T,144,H0N29_09480,protein_coding
#delete_this? CP059040,3124917,T,"TAC,TACTTCATTACATACCAACCGCCAAGGGTGC",snp,C,T,C,C,T,T,T,C,upstream_gene_variant,MODIFIER,,25,c.-25_-24insAC,0,H0N29_00075,protein_coding
#move: CP059040,3310021,C,CT,ins,CT,CT,CT,CT,C,CT,CT,CT,intragenic_variant,MODIFIER,,,n.3310021_3310022insT,,H0N29_00075,
#confirmed: CP059040,3853714,G,A,snp,G,G,G,G,G,A,G,A,stop_gained,HIGH,NONSENSE,Cag/Tag,p.Gln91*/c.271C>T,338,H0N29_18290,protein_coding
#--> Only three SNPs are confirmed, i.e. there is almost no variation at the genomic level!
# -- Results from the SPANDx --
#CP059040 1527276 TTGAACC T INDEL TTGAACC/T T T T T T T T conservative_inframe_deletion MODERATE gaacct/ p.Glu443_Pro444del/c.1327_1332delGAACCT 704 H0N29_07175 protein_coding
#CP059040 1843289 G T SNP G T G G G G G G missense_variant MODERATE MISSENSE gCg/gAg p.Ala37Glu/c.110C>A 357 H0N29_08665 protein_coding
#CP059040 2019186 G A SNP A G G G G G G G upstream_gene_variant MODIFIER 59 c.-59C>T 144 H0N29_09480 protein_coding
#Cmp to CP059040 3124917 T TAC,TACTTCATTACATACCAACCGCCAAGGGTGC INDEL . TACTTCATTACATACCAACCGCCAAGGGTGC TACTTCATTACATACCAACCGCCAAGGGTGC TAC . . . . upstream_gene_variant MODIFIER 25 c.-25_-24insAC 0 H0N29_00075 protein_coding
#Cmp to CP059040 3124920 C CATTACATACCAACCGCCAAGGGTGCTTCATG INDEL . . . CATTACATACCAACCGCCAAGGGTGCTTCATG . . C . upstream_gene_variant MODIFIER 22 c.-22_-21insATTACATACCAACCGCCAAGGGTGCTTCATG 0 H0N29_00075 protein_coding
#TODO: Move to invariant-file: CP059040 3310021 C CT INDEL CT CT CT CT CT CT CT CT intragenic_variant MODIFIER n.3310021_3310022insT H0N29_00075
#CP059040 3853714 G A SNP G G G G G A G A stop_gained HIGH NONSENSE Cag/Tag p.Gln91*/c.271C>T 338 H0N29_18290 protein_coding
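The manual comparison of the snippy and SPANDx genotype columns can be expressed as a small rule. This is one possible reading of the labels used above, not the exact procedure: a record counts as confirmed when both callers report the same genotypes and the site actually varies between samples. The function names are hypothetical.

```python
def is_variable(genotypes):
    """True if the called genotypes (ignoring '.') differ between samples."""
    called = {g for g in genotypes if g not in (".", "")}
    return len(called) > 1


def classify(snippy_gt, spandx_gt):
    """Compare the per-sample genotypes of one record from both callers."""
    if not is_variable(snippy_gt) or not is_variable(spandx_gt):
        return "invariant in at least one caller"
    if list(snippy_gt) == list(spandx_gt):
        return "confirmed"
    return "discordant"


if __name__ == "__main__":
    # Genotypes in the column order Y1,Y2,Y3,Y4,W1,W2,W3,W4.
    snp_1843289 = "G T G G G G G G".split()
    print(classify(snp_1843289, snp_1843289))
    del_snippy = "TTGAACC TTGAACC TTGAACC T TTGAACC TTGAACC T T".split()
    del_spandx = ["T"] * 8
    print(classify(del_snippy, del_spandx))
```

Under this rule, position 1843289 is confirmed, while the deletion at 1527276 and the insertion at 3310021 fall into the invariant group because SPANDx reports the same allele in all eight samples.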
#--> The indel report is more complicated: use the following commands to find the initial change and the related codon change.
## Check gene strand in your GFF/GenBank
#grep "H0N29_07175" reference.gff
# Extract a ~30 bp window around the variant (position 1527276) from the reference
samtools faidx CP059040.fasta CP059040:1527260-1527290
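Before looking at the sequence window, the REF and ALT alleles alone already show which bases an indel adds or removes and whether the reading frame is preserved. A minimal sketch follows; it assumes VCF-style alleles in which the shorter allele is a prefix of the longer one, and the function name is hypothetical.

```python
def describe_indel(ref, alt):
    """Return (kind, bases, in_frame) for a VCF-style indel.

    kind is "del" or "ins", bases are the removed or added nucleotides, and
    in_frame is True when their count is a multiple of three.
    """
    if len(ref) > len(alt) and ref.startswith(alt):
        kind, bases = "del", ref[len(alt):]
    elif len(alt) > len(ref) and alt.startswith(ref):
        kind, bases = "ins", alt[len(ref):]
    else:
        raise ValueError("alleles do not describe a simple indel")
    return kind, bases, len(bases) % 3 == 0


if __name__ == "__main__":
    print(describe_indel("TTGAACC", "T"))  # record at position 1527276
    print(describe_indel("C", "CT"))       # record at position 3310021
```

For the record at 1527276 the six removed bases keep the frame, which fits the conservative_inframe_deletion annotation (p.Glu443_Pro444del). The strand of H0N29_07175 still has to be checked in the GFF/GenBank file before reading codons from the extracted window.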
Run nextflow bacass
# Download the kmerfinder database: https://www.genomicepidemiology.org/services/ --> https://cge.food.dtu.dk/services/KmerFinder/ --> https://cge.food.dtu.dk/services/KmerFinder/etc/kmerfinder_db.tar.gz
# Download 20190108_kmerfinder_stable_dirs.tar.gz from https://zenodo.org/records/13447056
#--kmerfinderdb /mnt/nvme1n1p1/REFs/kmerfinder_db.tar.gz
#--kmerfinderdb /mnt/nvme1n1p1/REFs/20190108_kmerfinder_stable_dirs.tar.gz
nextflow run nf-core/bacass -r 2.5.0 -profile docker \
--input samplesheet.tsv \
--outdir bacass_out \
--assembly_type short \
--kraken2db /mnt/nvme1n1p1/REFs/k2_standard_08_GB_20251015.tar.gz \
--kmerfinderdb /mnt/nvme1n1p1/REFs/kmerfinder/bacteria/ \
-resume
#Possibly the character '△' in the file and directory names is a problem.
#Solution: 19606△ABfluEcef-1 → 19606delABfluEcef-1
#SAVE bacass_out/Kmerfinder/kmerfinder_summary.csv to bacass_out/Kmerfinder/An6?/An6?_kmerfinder_results.xlsx
samplesheet.tsv
sample,fastq_1,fastq_2
flu_dAB_cef,../X101SC25116512-Z01-J003/01.RawData/19606△ABfluEcef-1/19606△ABfluEcef-1_1.fq.gz,../X101SC25116512-Z01-J003/01.RawData/19606△ABfluEcef-1/19606△ABfluEcef-1_2.fq.gz
flu_dAB_cipro,../X101SC25116512-Z01-J003/01.RawData/19606△ABfluEcipro-2/19606△ABfluEcipro-2_1.fq.gz,../X101SC25116512-Z01-J003/01.RawData/19606△ABfluEcipro-2/19606△ABfluEcipro-2_2.fq.gz
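The renaming suggested above (19606△ABfluEcef-1 to 19606delABfluEcef-1) can be applied to every name in one pass. The sketch below replaces any non-ASCII character with "del"; the function name is hypothetical, and treating every non-ASCII character like '△' is an assumption that happens to fit these sample names.

```python
def sanitize_name(name, replacement="del"):
    """Replace every non-ASCII character (such as the triangle) in a name."""
    return "".join(ch if ch.isascii() else replacement for ch in name)


if __name__ == "__main__":
    for original in ["19606△ABfluEcef-1", "19606△ABfluEcipro-2"]:
        print(original, "->", sanitize_name(original))
```

The same function can be used to rename the raw-data directories and files (for example with os.rename) so that the paths in the samplesheet contain only ASCII characters.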
#busco example results:
Input_file             Dataset         Complete  Single  Duplicated  Fragmented  Missing  n_markers  Scaffold N50  Contigs N50  Percent gaps  Number of scaffolds
wt_cef.scaffolds.fa    bacteria_odb10  98.4      98.4    0.0         1.6         0.0      124        285852        285852       0.000%        45
wt_cipro.scaffolds.fa  bacteria_odb10  90.3      89.5    0.8         8.1         1.6      124        7434          7434         0.000%        1699
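Rows like these can be screened automatically for assemblies that look problematic. The sketch below is only an illustration: the thresholds (95% completeness, 500 scaffolds) are arbitrary assumptions rather than BUSCO recommendations, and the function name is hypothetical. It expects data rows with the 12 whitespace-separated fields shown above.

```python
def flag_assemblies(rows, min_complete=95.0, max_scaffolds=500):
    """Return the input files whose BUSCO row fails either threshold.

    Each row holds 12 whitespace-separated fields: field 3 is the percentage
    of complete BUSCOs and field 12 is the number of scaffolds.
    """
    flagged = []
    for row in rows:
        fields = row.split()
        if len(fields) != 12:
            raise ValueError("unexpected number of fields: %r" % row)
        name = fields[0]
        complete = float(fields[2])
        n_scaffolds = int(fields[11])
        if complete < min_complete or n_scaffolds > max_scaffolds:
            flagged.append(name)
    return flagged


if __name__ == "__main__":
    rows = [
        "wt_cef.scaffolds.fa bacteria_odb10 98.4 98.4 0.0 1.6 0.0 124 285852 285852 0.000% 45",
        "wt_cipro.scaffolds.fa bacteria_odb10 90.3 89.5 0.8 8.1 1.6 124 7434 7434 0.000% 1699",
    ]
    print(flag_assemblies(rows))
```

With these thresholds only wt_cipro.scaffolds.fa is flagged, because of both its lower completeness and its 1699 scaffolds.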