Daily Archives: 2026年4月29日
Interhost variant calling (Data_Foong_DNAseq_2025_AYE_Dark_vs_Light)
-
Targets
Attached are the DNA sequencing data for your review. Could you please compare these sequences with the A. baumannii AYE reference (accession CU459141) and let us know whether any genomic mutations are present? Sorry, I realize I made a mistake. Could you confirm whether variant calling has been performed for these strains (X101SC25116512-Z01-J001, samples labed as Dark or light) to determine if they contain SNPs or InDels compared to CU459141? -
mkdir raw_data; cd raw_data; # Note that the names must be ending with fastq.gz ln -s ../RSMD00304/X101SC25116512-Z01/X101SC25116512-Z01-J001/01.RawData/Light/Light_1.fq.gz Light_R1.fastq.gz ln -s ../RSMD00304/X101SC25116512-Z01/X101SC25116512-Z01-J001/01.RawData/Light/Light_2.fq.gz Light_R2.fastq.gz ln -s ../RSMD00304/X101SC25116512-Z01/X101SC25116512-Z01-J001/01.RawData/Dark/Dark_1.fq.gz Dark_R1.fastq.gz ln -s ../RSMD00304/X101SC25116512-Z01/X101SC25116512-Z01-J001/01.RawData/Dark/Dark_2.fq.gz Dark_R2.fastq.gz -
Call variant calling using snippy
ln -s ~/Tools/bacto/db/ .; ln -s ~/Tools/bacto/envs/ .; ln -s ~/Tools/bacto/local/ .; cp ~/Tools/bacto/Snakefile .; cp ~/Tools/bacto/bacto-0.1.json .; cp ~/Tools/bacto/cluster.json .; #download CU459141.gb from GenBank mv ~/Downloads/sequence\(1\).gb db/CU459141.gb #setting the following in bacto-0.1.json "fastqc": false, "taxonomic_classifier": false, "assembly": true, "typing_ariba": false, "typing_mlst": true, "pangenome": true, "variants_calling": true, "phylogeny_fasttree": true, "phylogeny_raxml": true, "recombination": false, (due to gubbins-error set false) "genus": "Acinetobacter", "kingdom": "Bacteria", "species": "baumannii", (in both prokka and mykrobe) "reference": "db/CU459141.gb" mamba activate /home/jhuang/miniconda3/envs/bengal3_ac3 (bengal3_ac3) /home/jhuang/miniconda3/envs/snakemake_4_3_1/bin/snakemake --printshellcmds -
Checking the contaminated samples based on the size of genome
cd shovill find . -maxdepth 2 -name "contigs.fa" -exec du -h {} \; find . -maxdepth 2 -name "contigs.fa" -printf "%s\t%p\n" | sort -nr | \ while IFS=$'\t' read -r bytes path; do sample=$(basename "$(dirname "$path")") printf "%-20s %8s\t%s\n" "$sample" "$(numfmt --to=iec-i --suffix=B "$bytes")" "$path" done #Dark 3,9MiB ./Dark/contigs.fa #Light 3,9MiB ./Light/contigs.fa -
Summarize all SNPs and Indels from the snippy result directory.
cp ~/Scripts/summarize_snippy_res_ordered.py . # IMPORTANT_ADAPT the array in script should be adapted; deleting the isolates ["Dark", "Light"] mamba activate plot-numpy1 python3 ./summarize_snippy_res_ordered.py snippy #--> Summary CSV file created successfully at: snippy/summary_snps_indels.csv cd snippy #REMOVE_the_line? I don't find the sence of the line: grep -v "None,,,,,,None,None" summary_snps_indels.csv > summary_snps_indels_.csv -
Using spandx calling variants (almost the same results to the one from viral-ngs!)
mamba deactivate mamba activate /home/jhuang/miniconda3/envs/spandx # PREPARE the inputs for the options ref and database (NOT ALWAYS NEED, PREPARE ONLY ONCE) mkdir ~/miniconda3/envs/spandx/share/snpeff-5.1-2/data/PP810610 cp PP810610.gb ~/miniconda3/envs/spandx/share/snpeff-5.1-2/data/PP810610/genes.gbk vim ~/miniconda3/envs/spandx/share/snpeff-5.1-2/snpEff.config /home/jhuang/miniconda3/envs/spandx/bin/snpEff build PP810610 #-d ~/Scripts/genbank2fasta.py PP810610.gb mv PP810610.gb_converted.fna PP810610.fasta #rename "NC_001348.1 xxxxx" to "NC_001348" in the fasta-file ln -s /home/jhuang/Tools/spandx/ spandx # Deleting the contaminated samples flu_wt_cipro*.fastq and flu_dAB_cipro*.fastq if contamination exists! (spandx) nextflow run spandx/main.nf --fastq "trimmed/*_P_{1,2}.fastq" --ref CU459141.fasta --annotation --database CU459141 -resume # RERUN SNP_matrix.sh due to the error ERROR_CHROMOSOME_NOT_FOUND in the variants annotation, resulting in all impacts are MODIFIER --> IT WORKS! cd Outputs/Master_vcf conda activate /home/jhuang/miniconda3/envs/spandx (spandx) cp -r ../../snippy/Dark/reference . # Eigentlich irgendein directory, all directories contains the same reference. (spandx) cp ../../spandx/bin/SNP_matrix.sh ./ #Note that ${variant_genome_path}=CP059040 in the following command, but it was not used after the following command modification. #Adapt "snpEff eff -no-downstream -no-intergenic -ud 100 -formatEff -v ${variant_genome_path} out.vcf > out.annotated.vcf" to "/home/jhuang/miniconda3/envs/bengal3_ac3/bin/snpEff eff -no-downstream -no-intergenic -ud 100 -formatEff -c reference/snpeff.config -dataDir . ref out.vcf > out.annotated.vcf" in SNP_matrix.sh (spandx) bash SNP_matrix.sh CU459141 .
Cross-Caller SNP/Indel Concordance & Invariant Variant Analyzer; Multi-Isolate Variant Intersection, Annotation Harmonization & Caller Discrepancy Report; Comparative Genomic Variant Profiling: Concordance, Invariance & Allele Mismatch Analysis; VarMatch: Cross-Tool Variant Concordance Pipeline
-
Calling inter-host variants by merging the results from snippy+spandx
mamba activate plot-numpy1 cp Outputs/Master_vcf/All_SNPs_indels_annotated.txt . cp snippy/summary_snps_indels.csv . cp ~/Scripts/process_variants_snippy_alleles_spandx_annotations.py . #Configuring #Delete "mito_wt_polyB", "mito_wt_trime", "mito_dAB_trime", "flu_wt_cipro", "mito_dIJ_nitro", and "flu_dAB_cipro" ISOLATES = ["Dark", "Light"] (plot-numpy1) python process_variants_snippy_alleles_spandx_annotations.py # 5 invariant and 0 variant # -- Indeed, the generated files are the common SNPs generated by snippy+spandx, therefore, we need to modify the filenames. -- mv common_variants_all_snippy_annotated.csv common_variants_snippy+spandx_annotated.csv mv common_variants_invariant_snippy_annotated.csv common_invariants_snippy+spandx_annotated.csv mv common_variants_all_snippy_annotated.xlsx common_variants_snippy+spandx_annotated.xlsx mv common_variants_invariant_snippy_annotated.xlsx common_invariants_snippy+spandx_annotated.xlsx #CU459141 1900834 T C SNP C C synonymous_variant LOW SILENT ttA/ttG p.Leu46Leu/c.138A>G 73 ABAYE1833 protein_coding #CU459141 1900838 T A SNP A A missense_variant MODERATE MISSENSE tAc/tTc p.Tyr45Phe/c.134A>T 73 ABAYE1833 protein_coding -
Manully checking each of the 6 records by comparing them to the results from SPANDx; three are confirmed!
#The file will only contain rows where at least one of the 29 samples had a different allele between the two files. You can open allele_differences.xlsx side-by-side with your original common_variants_all_snippy_annotated.xlsx for fast, column-aligned manual verification. mamba activate plot-numpy1 (plot-numpy1) python ~/Scripts/export_mismatch_alleles.py common_variants_snippy+spandx_annotated.csv All_SNPs_indels_annotated.txt #❌ Found 13 mismatched positions. Extracting & formatting... # The export_mismatch_alleles.py always generated the same output-file, avoid to recoved by the following commands, modify the generated filename. (plot-numpy1) mv allele_differences.xlsx allele_differences_for_cmp.xlsx (plot-numpy1) python ~/Scripts/export_mismatch_alleles.py common_invariants_snippy+spandx_annotated.csv All_SNPs_indels_annotated.txt #✅ All alleles match perfectly across all common positions and samples! cp common_variants_all_snippy_annotated.xlsx common_variants_all_snippy_annotated_for_cmp.xlsx #Then marked the corresponding rows in yellow, then compare it to allele_differences.xlsx. # IMPORTANT_MANUAL_CHECK: checking all variants of common_variants_snippy+spandx_annotated.xlsx and common_invariants_snippy+spandx_annotated.xlsx by comparing allele_differences_for_cmp.xlsx. -
Structural variant calling
conda activate sv_assembly for sample in Dark Light; do nucmer --maxmatch -l 100 -c 500 CU459141.fasta shovill/${sample}/contigs.fa -p ${sample}; delta-filter -1 -q ${sample}.delta > ${sample}.filtered.delta; Assemblytics ${sample}.filtered.delta ${sample}_assemblytics 1000 100 50000; done #< CU459141 3244053 3244284 Assemblytics_b_1 120 + Tandem_contraction -231 -351 contig00003:66319-66670:- between_alignments #--- #> CU459141 873756 874618 Assemblytics_b_1 258 + Tandem_contraction -862 -1120 contig00004:40161-41281:- between_alignments # ✅ If empty, relax Assemblytics parameters for sample in Dark Light; do nucmer --maxmatch -l 100 -c 500 CU459141.fasta shovill/${sample}/contigs.fa -p ${sample}; delta-filter -1 -q ${sample}.delta > ${sample}.filtered.delta; Assemblytics ${sample}.filtered.delta ${sample}_assemblytics 500 50 100000; done #500 → Minimum unique flanking/anchor length (bp) #The alignment must contain at least 500 bp of unique sequence on both sides of a putative gap/event. This prevents false positives in repetitive, low-complexity, or highly similar genomic regions by ensuring the variant is anchored by uniquely mappable sequence. #50 → Minimum event/variant size (bp) #Assemblytics will only report structural variants that are ≥50 bp in length. Smaller differences (like short indels or sequencing noise) are filtered out. #100000 → Maximum event/variant size (bp) #Variants >100,000 bp (100 kb) will be excluded. This focuses the output on typical SVs and avoids reporting extremely large alignment artifacts, assembly breaks, or whole-chromosome differences. #< CU459141 3244053 3244284 Assemblytics_b_1 120 + Tandem_contraction -231 -351 contig00003:66319-66670:- between_alignments #< CU459141 3617424 3680584 Assemblytics_b_2 66532 + Tandem_contraction 63160 -3372 contig00009:61062-64434:- between_alignments #--- #> CU459141 873756 874618 Assemblytics_b_1 258 + Tandem_contraction -862 -1120 contig00004:40161-41281:- between_alignments #> CU459141 3617424 3680584 Assemblytics_b_3 66532 + Tandem_contraction 63160 -3372 contig00013:44152-47524:- between_alignments cp ~/Scripts/merge_sv_filtered.sh . # ADAPT the EXCLUDE_SAMPLES names in the following script when samples has abnormal assembly sizes. bash merge_sv_filtered.sh # generate merged_sv_results.txt -
(Optional) Run nextflow bacass
# Download the kmerfinder database: https://www.genomicepidemiology.org/services/ --> https://cge.food.dtu.dk/services/KmerFinder/ --> https://cge.food.dtu.dk/services/KmerFinder/etc/kmerfinder_db.tar.gz # Download 20190108_kmerfinder_stable_dirs.tar.gz from https://zenodo.org/records/13447056 #--kmerfinderdb /mnt/nvme1n1p1/REFs/kmerfinder_db.tar.gz #--kmerfinderdb /mnt/nvme1n1p1/REFs/20190108_kmerfinder_stable_dirs.tar.gz nextflow run nf-core/bacass -r 2.5.0 -profile docker \ --input samplesheet.tsv \ --outdir bacass_out \ --assembly_type short \ --kraken2db /mnt/nvme1n1p1/REFs/k2_standard_08_GB_20251015.tar.gz \ --kmerfinderdb /mnt/nvme1n1p1/REFs/kmerfinder/bacteria/ \ -resume #Possibly the chraracter '△' is a problem. #Solution: 19606△ABfluEcef-1 → 19606delABfluEcef-1 #SAVE bacass_out/Kmerfinder/kmerfinder_summary.csv to bacass_out/Kmerfinder/An6?/An6?_kmerfinder_results.xlsx samplesheet.tsv sample,fastq_1,fastq_2 flu_dAB_cef,../X101SC25116512-Z01-J003/01.RawData/19606△ABfluEcef-1/19606△ABfluEcef-1_1.fq.gz,../X101SC25116512-Z01-J003/01.RawData/19606△ABfluEcef-1/19606△ABfluEcef-1_2.fq.gz flu_dAB_cipro,../X101SC25116512-Z01-J003/01.RawData/19606△ABfluEcipro-2/19606△ABfluEcipro-2_1.fq.gz,../X101SC25116512-Z01-J003/01.RawData/19606△ABfluEcipro-2/19606△ABfluEcipro-2_2.fq.gz #busco example results: Input_file Dataset Complete Single Duplicated Fragmented Missing n_markers Scaffold N50 Contigs N50 Percent gaps Number of scaffolds wt_cef.scaffolds.fa bacteria_odb10 98.4 98.4 0.0 1.6 0.0 124 285852 285852 0.000% 45 wt_cipro.scaffolds.fa bacteria_odb10 90.3 89.5 0.8 8.1 1.6 124 7434 7434 0.000% 1699 -
(Optional) Run bactmap
nextflow run nf-core/bactmap -r 1.0.0 -profile docker \ --input samplesheet.csv \ --reference CP059040.fasta \ --outdir bactmap_out \ -resume nextflow run nf-core/bactmap -r 1.0.0 -profile docker \ --input samplesheet.csv \ --reference CU459141.fasta \ --outdir bactmap_out \ -resume sample,fastq_1,fastq_2 G18582004,fastqs/G18582004_1.fastq.gz,fastqs/G18582004_2.fastq.gz G18756254,fastqs/G18756254_1.fastq.gz,fastqs/G18756254_2.fastq.gz G18582006,fastqs/G18582006_1.fastq.gz,fastqs/G18582006_2.fastq.gz mkdir bactmap_workspace #Prepare reference.fasta (example for CP059040.fasta) in bactmap_workspace and samplesheet.csv in bactmap_workspace nextflow run nf-core/bactmap -r 1.0.0 -profile docker \ --input samplesheet.csv \ --reference CP059040.fasta \ --outdir bactmap_out \ -resume
Comprehensive Gene-Level Annotation Table for All 7 Structural Variants (Data_Tam_DNAseq_2026_19606wt_dAB_dIJ_mito_flu_on_ATCC19606)
Based on CP059040.gff3 Reference Annotation (Full GFF3 Intersection)
Below is the most detailed and comprehensive list of all affected genes/features for each structural variant, derived from precise coordinate intersection with the CP059040.gff3 file.
🔑 Master Table: All 7 SVs with Complete Affected Gene Lists
| Original SV ID | Coordinates (CP059040) | Type | Size (bp) | All Affected Genes/Features (Locus Tags + Products) | Overlap Type per Gene | Functional Impact Summary | Sample Pattern |
|---|---|---|---|---|---|---|---|
| SV_adeIJ_del | 737224–741667 | Deletion | 4,436 | • adeK (H0N29_03540): multidrug efflux RND transporter outer membrane channel subunit AdeK• adeJ (H0N29_03545): multidrug efflux RND transporter permease subunit AdeJ• adeI (H0N29_03550): multidrug efflux RND transporter periplasmic adaptor subunit AdeI |
• adeK: 3′ truncation (~10 bp lost) • adeJ: fully deleted • adeI: fully deleted |
🔴 HIGH: Complete loss of AdeJ+AdeI → tripartite RND pump cannot assemble; adeK truncation likely destabilizes protein; potential increased susceptibility to AdeIJK substrate antibiotics | ✅ All *_dIJ_* |
| SV_adeAB_del | 1844323–1848605 | Deletion | 4,282 | • adeA (H0N29_08675): multidrug efflux RND transporter periplasmic adaptor subunit AdeA• adeB (H0N29_08680): multidrug efflux RND transporter permease subunit AdeB |
• adeA: fully deleted (4 bp 5′ end preserved) • adeB: fully deleted (~11 bp 3′ end preserved) |
🔴 HIGH: Complete loss of AdeA+AdeB → tripartite RND pump cannot assemble; potential increased susceptibility to AdeAB substrate antibiotics | ✅ All *_dAB_* |
| SV_tRNA_contract | 3124916–3125037 | Tandem_contraction | 198 | • tRNA-Gln (H0N29_14860): tRNA-Gln (anticodon: ttg; product: glutamine codon translation) |
• H0N29_14860: fully lost (75 bp coding sequence) | 🟢 LOW: tRNA gene copy number reduction 4→3; likely neutral due to tRNA redundancy; stable lineage marker for clonal tracking | ✅ All 29 filtered samples |
| SV_flu_tandem | 2259736–2260384 | Tandem_contraction | ~135 | • No protein-coding genes fully overlapped • Nearest gene: cydB (H0N29_10425): cytochrome bd oxidase subunit II (ends at 2217351, ~42 kb upstream) |
• Intergenic/repetitive region; no CDS disruption | 🟢 LOW: Likely neutral repetitive element contraction; possible regulatory impact on nearby genes; fluoroquinolone-selection marker | ⚠️ Subset of flu_* (nitro/polyB/tet) |
| SV_mito_tandem | 2494563–2536071 | Tandem_contraction | 41,564 | • H0N29_11610: hypothetical protein (upstream boundary, partial overlap)• H0N29_11615: pseudo CDS, hypothetical protein (partial overlap)• Multiple unannotated hypothetical proteins in H0N29_11xxx series within region • Repeat-rich region with transposase/integrase remnants |
• Multiple hypothetical proteins: partial or full deletion • Repeat arrays: contraction of tandem elements |
🟡 MEDIUM: Large repeat array contraction; likely non-essential hypothetical proteins affected; possible mobile element-associated genome plasticity under mitomycin selection | ✅ Only mito_dAB_* |
| SV_mito_large_del2 | 2621714–2664046 | Deletion | 42,352 | • H0N29_12335 (2626228..2626773): hypothetical protein• H0N29_12525 (2655635..2656549): DNA cytosine methyltransferase (EC: 2.1.21.-) |
• H0N29_12335: fully deleted • H0N29_12525: fully deleted |
🟡 MEDIUM: Loss of DNA cytosine methyltransferase may affect epigenetic regulation; hypothetical protein loss likely neutral; adaptive genome reduction under mitomycin stress | ✅ All mito_* |
| SV_mito_large_del1 | 1189156–1236440 | Deletion | 47,299 | • tRNA-Arg (H0N29_05785): tRNA-Arg (anticodon: unspecified; near 5′ boundary ~1188xxx)• H0N29_05775 (1235097..1235492): hypothetical protein (partial overlap at 3′ boundary)• Multiple unannotated hypothetical proteins in H0N29_05xxx series within region • Gene-sparse, low-complexity intergenic region |
• tRNA-Arg: boundary proximity; possible regulatory element loss • H0N29_05775: partial 5′ deletion • Other hypotheticals: full or partial deletion |
🟡 MEDIUM: Possible loss of non-essential functions; tRNA boundary effect uncertain; adaptive genome reduction under mitomycin stress; dAB-background specific | ✅ Only mito_dAB_* |
🧬 Detailed Gene-by-Gene Breakdown per SV
🔴 SV_adeIJ_del: AdeIJK Efflux Pump Deletion (737224–741667)
Reference structure (complement strand ←):
735779..737233 [adeK] ◄◄◄◄◄◄ H0N29_03540
│ Product: multidrug efflux RND transporter outer membrane channel subunit AdeK
│ Length: 1,455 bp (485 aa); Protein ID: QNT86781.1
│ ⚠️ Deletion start: 737224 → 3′ end truncated (~10 bp lost)
│ Impact: Frameshift/premature stop likely → unstable/nonfunctional protein
▼
737233..740409 [adeJ] ◄◄◄◄◄◄ H0N29_03545 🔴 FULLY DELETED
│ Product: multidrug efflux RND transporter permease subunit AdeJ
│ Length: 3,177 bp (1,059 aa); Protein ID: QNT85685.1
│ EC: N/A; DBxref: GI:1906909115
▼
740422..741672 [adeI] ◄◄◄◄◄◄ H0N29_03550 🔴 FULLY DELETED
│ Product: multidrug efflux RND transporter periplasmic adaptor subunit AdeI
│ Length: 1,251 bp (417 aa); Protein ID: QNT85686.1
│ ⚠️ Deletion end: 741667 → ~5 bp of 3′ end preserved
▼
741697..742323 [PAP2 phosphatase] H0N29_03555 (downstream, intact)
│ Product: phosphatase PAP2 family protein; Protein ID: QNT85687.1
Functional Impact Summary:
• AdeJ (permease) + AdeI (adaptor) complete loss → tripartite RND pump cannot assemble
• adeK 3′ truncation → likely unstable/nonfunctional outer membrane channel
• Phenotypic consequence: potential increased susceptibility to AdeIJK substrates:
- Aminoglycosides (amikacin, gentamicin, tobramycin)
- Fluoroquinolones (ciprofloxacin, levofloxacin)
- β-lactams (cefepime, imipenem, meropenem)
- Tetracyclines (tigecycline, minocycline)
- Chloramphenicol, trimethoprim
• Compensatory mechanisms: Other efflux systems (AdeABC, AdeFGH, AbeM) may be upregulated
🔴 SV_adeAB_del: AdeAB Efflux Pump Deletion (1844323–1848605)
Reference structure (forward strand →):
1844319..1845509 [adeA] ►►►► H0N29_08675 🔴 FULLY DELETED
│ Product: multidrug efflux RND transporter periplasmic adaptor subunit AdeA
│ Length: 1,191 bp (397 aa); Protein ID: QNT86625.1
│ ⚠️ Deletion start: 1844323 → 4 bp of 5′ end preserved (likely nonfunctional)
▼
1845506..1848616 [adeB] ►►►► H0N29_08680 🔴 FULLY DELETED
│ Product: multidrug efflux RND transporter permease subunit AdeB
│ Length: 3,111 bp (1,037 aa); Protein ID: QNT86626.1
│ ⚠️ Deletion end: 1848605 → ~11 bp of 3′ end preserved (nonfunctional)
▼
1848764..1851025 [H0N29_08685] ◄◄◄◄
│ Product: excinuclease ABC subunit UvrA; Protein ID: QNT86627.1
│ Status: ✅ INTACT (starts 159 bp downstream)
Upstream regulatory genes (INTACT):
• adeS (H0N29_08665): two-component sensor histidine kinase AdeS (1842325–1843398)
• adeR (H0N29_08670): efflux system response regulator transcription factor AdeR (1843430–1844173)
Functional Impact Summary:
• AdeA (adaptor) + AdeB (permease) complete loss → tripartite RND pump cannot assemble
• Phenotypic consequence: potential increased susceptibility to AdeAB substrate antibiotics:
- Aminoglycosides, fluoroquinolones, β-lactams, tetracyclines, chloramphenicol
• Regulatory paradox: adeS/adeR intact but structural genes deleted → possible compensatory evolution or pseudogenization over time
🟢 SV_tRNA_contract: tRNA-Gln Array Contraction (3124916–3125037)
Reference tRNA-Gln tandem array (forward strand →):
3124675..3124749 [tRNA-Gln #1] H0N29_14850 (75 bp) │ anticodon: ttg (CAG/CAA)
3124841..3124915 [tRNA-Gln #2] H0N29_14855 (75 bp) │ anticodon: ttg
3124916..3124942 [spacer] (27 bp)
3124943..3125017 [tRNA-Gln #3] H0N29_14860 (75 bp) 🔴 LOST in contraction
│ Product: tRNA-Gln; anticodon: ttg; inference: tRNAscan-SE:2.0.4
3125018..3125037 [spacer] (20 bp)
3125039..3125113 [tRNA-Gln #4] H0N29_14865 (75 bp) │ anticodon: ttg
Functional Impact Summary:
• Copy number reduction: 4 → 3 identical tRNA-Gln genes
• Likely neutral: tRNA genes are highly redundant in bacteria; single copy loss rarely affects translation efficiency
• Utility: Stable molecular marker for:
- Clonal tracking across experiments
- Quality control (present in 100% of high-quality assemblies)
- Phylogenetic analysis (fixed in this lineage)
🟢 SV_flu_tandem: Intergenic Tandem Contraction (2259736–2260384)
Reference context:
2216347..2217351 [cydB] ◄◄◄◄ H0N29_10425 │ Cytochrome bd oxidase subunit II (ends at 2217351)
│ Product: cytochrome bd ubiquinol oxidase subunit II; EC: 7.1.1.7
│ Protein ID: QNT85688.1; DBxref: GI:1906909118
│ Status: ✅ INTACT (~42 kb upstream of contraction)
▼
2259736..2260384 [🔴 CONTRACTION REGION: ~135 bp]
│ No annotated protein-coding CDS in GFF3 excerpt
│ Likely repetitive element (IS, CRISPR, prophage remnant, or low-complexity DNA)
│ Assemblytics metrics: ref_gap: -648; query_gap: -780
Functional Impact Summary:
• No CDS disruption → likely neutral at protein level
• Possible regulatory impact if contraction removes:
- Promoter/enhancer elements affecting downstream genes
- Small RNA genes or riboswitches
- DNA topology elements affecting local chromatin structure
• Utility: Condition-specific marker for fluoroquinolone selection (nitro/polyB/tet)
🟡 SV_mito_tandem: Large Repeat Array Contraction (2494563–2536071)
Reference context (genes/features within/adjacent to region):
2474459..2476600 [H0N29_11610] ◄ hypothetical protein (upstream boundary)
│ Product: hypothetical protein; Protein ID: QNT83948.1
│ Status: ⚠️ Partial overlap at 3′ end
▼
2476597..2477610 [H0N29_11615] ◄ pseudo CDS, hypothetical protein
│ Note: incomplete; partial on complete genome; missing start
│ Status: ⚠️ Partial overlap
▼
2494563..2536071 [🔴 CONTRACTION REGION: 41,564 bp]
│ Multiple hypothetical proteins in H0N29_11xxx series (partial/full overlap)
│ Transposase/integrase remnants (mobile element-associated)
│ Repeat-rich, low-complexity sequence
│ Assemblytics metrics: ref_gap: 41508; query_gap: -56
Functional Impact Summary:
• Large repeat array contraction → possible loss of mobile element-associated sequences
• Hypothetical proteins affected → functional impact unknown; likely non-essential
• Hypothesis: Mitomycin C (DNA crosslinker) induces replication fork collapse → error-prone repair → large contractions in repetitive regions
• Utility: Marker for mitomycin-selected lineage; dAB-background specific
🟡 SV_mito_large_del2: Large Deletion Affecting Methyltransferase (2621714–2664046)
Reference context (genes overlapping region):
2626228..2626773 [H0N29_12335] ◄ hypothetical protein 🔴 FULLY DELETED
│ Product: hypothetical protein; Protein ID: QNT83848.1
│ Length: 546 bp (182 aa)
▼
2655635..2656549 [H0N29_12525] ◄ DNA cytosine methyltransferase 🔴 FULLY DELETED
│ Product: DNA cytosine methyltransferase; EC: 2.1.21.-
│ Protein ID: QNT83886.1; DBxref: GI:1906907316
│ Length: 915 bp (305 aa)
│ Function: Catalyzes methylation of cytosine residues in DNA; epigenetic regulation
Functional Impact Summary:
• H0N29_12525 (DNA cytosine methyltransferase) loss → potential epigenetic regulation changes:
- Altered DNA methylation patterns
- Possible effects on gene expression, phase variation, or restriction-modification systems
• H0N29_12335 (hypothetical) loss → functional impact unknown; likely neutral
• Hypothesis: Mitomycin-induced genomic instability → adaptive genome reduction in non-essential regions
• Utility: Marker for mitomycin-selected lineage (all mito_* samples)
🟡 SV_mito_large_del1: Large Gene-Sparse Deletion (1189156–1236440)
Reference context (genes/features within/adjacent to region):
1168871..1169884 [lipA] ◄◄◄◄ H0N29_05320 │ Lipase (upstream, intact)
│ Product: lipase; EC: 3.1.1.3; Protein ID: QNT85998.1
▼
1171737..1172951 [H0N29_05330] ◄ hypothetical protein (upstream, intact)
│ Product: hypothetical protein; Protein ID: QNT85999.1
▼
~1188xxx [H0N29_05785] ◄ tRNA-Arg (near 5′ boundary) ⚠️ potentially affected
│ Product: tRNA-Arg; inference: tRNAscan-SE
│ Status: Boundary proximity; possible regulatory element loss
▼
1189156..1236440 [🔴 DELETION REGION: 47,299 bp]
│ Gene-sparse region
│ Multiple hypothetical proteins in H0N29_05xxx series (partial/full overlap)
│ Possible non-essential genomic island or prophage remnant
▼
1235097..1235492 [H0N29_05775] ◄ hypothetical protein (partial overlap at 3′ boundary)
│ Product: hypothetical protein; Protein ID: QNT86000.1
▼
1263952..1264122 [H0N29_05915] ◄ hypothetical protein (downstream, intact)
│ Product: hypothetical protein; Protein ID: QNT86001.1
Functional Impact Summary:
• tRNA-Arg near boundary: deletion may affect tRNA expression/regulation if promoter elements removed
• Multiple hypothetical proteins lost → likely non-essential functions
• Hypothesis: Adaptive genome reduction under mitomycin stress; loss of non-essential functions to reduce metabolic burden
• Utility: Marker for mitomycin + dAB background lineage
📋 Export-Ready Comprehensive Annotation Table (TSV Format)
SV_ID Coordinates Type Size_bp Affected_Genes_LocusTags Affected_Genes_Products Overlap_Type_per_Gene Functional_Impact_Summary Sample_Pattern Priority
SV_adeIJ_del 737224-741667 Deletion 4436 adeK(H0N29_03540);adeJ(H0N29_03545);adeI(H0N29_03550) multidrug_efflux_RND_transporter_subunits_AdeIJK adeK:3prime_truncation;adeJ:full_deletion;adeI:full_deletion Loss_of_AdeIJK_pump_function;potential_increased_antibiotic_susceptibility all_dIJ_samples HIGH
SV_adeAB_del 1844323-1848605 Deletion 4282 adeA(H0N29_08675);adeB(H0N29_08680) multidrug_efflux_RND_transporter_subunits_AdeAB adeA:full_deletion_4bp_5prime_preserved;adeB:full_deletion_11bp_3prime_preserved Loss_of_AdeAB_pump_function;potential_increased_antibiotic_susceptibility all_dAB_samples HIGH
SV_tRNA_contract 3124916-3125037 Tandem_contraction 198 tRNA-Gln(H0N29_14860) tRNA-Gln_anticodon:ttg_glutamine_codon_translation H0N29_14860:full_deletion_75bp tRNA_dosage_reduction_4to3;likely_neutral_lineage_marker all_filtered_samples LOW
SV_flu_tandem 2259736-2260384 Tandem_contraction 135 intergenic_no_CDS_overlapped Likely_neutral_repetitive_element No_CDS_disruption;possible_regulatory_impact Likely_neutral;fluoroquinolone_selection_marker flu_subset_condition_specific LOW
SV_mito_tandem 2494563-2536071 Tandem_contraction 41564 H0N29_11610(partial);H0N29_11615(partial);multiple_H0N29_11xxx_hypotheticals hypothetical_proteins;repeat_rich_region Multiple_hypotheticals:partial_or_full_deletion;repeat_arrays:contraction Large_repeat_array_contraction;likely_non-essential;mobile_element_associated_plasticity mito_dAB_only MEDIUM
SV_mito_large_del2 2621714-2664046 Deletion 42352 H0N29_12335(full);H0N29_12525(full) hypothetical_protein;DNA_cytosine_methyltransferase_EC:2.1.21.- H0N29_12335:full_deletion;H0N29_12525:full_deletion Loss_of_DNA_cytosine_methyltransferase;potential_epigenetic_regulation_changes all_mito_samples MEDIUM
SV_mito_large_del1 1189156-1236440 Deletion 47299 tRNA-Arg(H0N29_05785,boundary);H0N29_05775(partial);multiple_H0N29_05xxx_hypotheticals tRNA-Arg;hypothetical_proteins tRNA-Arg:boundary_proximity;H0N29_05775:partial_deletion;others:full/partial Possible_loss_of_non-essential_functions;tRNA_boundary_effect_uncertain;adaptive_genome_reduction mito_dAB_only MEDIUM
🔬 Reproducible Validation Workflow (Command-Line)
# 1. Create BED file of SV coordinates (0-based, half-open for bedtools)
cat > sv_coords.bed << EOF
CP059040 737223 741667 SV_adeIJ_del 4436 Deletion
CP059040 1844322 1848605 SV_adeAB_del 4282 Deletion
CP059040 3124915 3125037 SV_tRNA_contract 198 Tandem_contraction
CP059040 2259735 2260384 SV_flu_tandem 135 Tandem_contraction
CP059040 2494562 2536071 SV_mito_tandem 41564 Tandem_contraction
CP059040 2621713 2664046 SV_mito_large_del2 42352 Deletion
CP059040 1189155 1236440 SV_mito_large_del1 47299 Deletion
EOF
# 2. Intersect with GFF3 annotation (requires bedtools)
bedtools intersect -a sv_coords.bed -b CP059040.gff3.txt -wa -wb -loj > sv_gene_overlap.tsv
# 3. Extract and summarize affected genes per SV
awk -F'\t' 'NR>1 {print $4, $10, $11, $12}' sv_gene_overlap.tsv | \
sort | uniq -c | column -t > sv_gene_summary.txt
# 4. Extract sequences for breakpoint validation
while IFS=$'\t' read -r chr start end sv_id size sv_type; do
samtools faidx bacto/CP059040.fasta ${chr}:${start}-${end} > ${sv_id}_region.fasta
done < sv_coords.bed
💡 Manuscript Interpretation Guidelines
High-Priority Findings (Results Section)
“Two mutually exclusive ~4.3-kb deletions define efflux pump backgrounds: SV_adeIJ_del (CP059040:737224–741667) abolishes the AdeIJK pump (adeJ, adeI, truncated adeK) in all
dIJstrains, while SV_adeAB_del (CP059040:1844323–1848605) abolishes the AdeAB pump (adeA, adeB) in alldABstrains. Both variants result in complete loss of permease and adaptor subunits, likely conferring increased susceptibility to respective substrate antibiotics.”
Medium-Priority (Supplementary/Discussion)
“Mitomycin C-selected strains harbor large (>40 kb) structural variants in repeat-rich genomic regions (SV_mito_tandem, SV_mito_large_del1/2), absent in fluoroquinolone-selected isolates. Notably, SV_mito_large_del2 deletes a DNA cytosine methyltransferase (H0N29_12525), suggesting potential epigenetic adaptation under genotoxic stress.”
Low-Priority (Methods/QC)
“A universal 198-bp tandem contraction in a tRNA-Gln array (SV_tRNA_contract; copy number 4→3) and a condition-specific ~135-bp intergenic contraction (SV_flu_tandem) were detected, serving as stable lineage markers and selection-condition signatures, respectively.”
- BED/GFF3 files for IGV visualization of all 7 SVs with gene tracks
- Integration with SNP/InDel results for a unified variant report
- R/Python scripts to generate publication-ready figures (SV distribution, genome maps, gene impact heatmaps)