-
Run nextflow bacass
conda deactivate # Downlod k2_standard_08_GB_20251015.tar.gz from https://benlangmead.github.io/aws-indexes/k2#kraken2--bracken # Download 20190108_kmerfinder_stable_dirs.tar.gz from https://zenodo.org/records/13447056; 'tar xzf 20190108_kmerfinder_stable_dirs.tar.gz' #The database does not work! # Download the kmerfinder database: https://www.genomicepidemiology.org/services/ --> https://cge.food.dtu.dk/services/KmerFinder/ --> https://cge.food.dtu.dk/services/KmerFinder/etc/kmerfinder_db.tar.gz #The database works! # DEBUG: --kmerfinderdb /mnt/nvme1n1p1/REFs/kmerfinder/bacteria/ not working! nextflow run nf-core/bacass -r 2.6.0 -profile docker --help # -- Hybrid assembly -- nextflow run nf-core/bacass -r 2.6.0 -profile docker \ --input samplesheet_bacass.tsv \ --outdir bacass_out \ --assembly_type hybrid \ --assembler unicycler,dragonflye \ --kraken2db /mnt/nvme1n1p1/REFs/k2_standard_08_GB_20251015.tar.gz \ --skip_kmerfinder \ -resume \ -work-dir bacass_out/work # -- Short assembly -- #Maybe BUG is from '--skip_kmerfinder for -r 2.6.0, using db in 2.5.0' nextflow run nf-core/bacass -r 2.5.0 -profile docker \ --input samplesheet.tsv \ --outdir bacass_out \ --assembly_type short \ --kraken2db /mnt/nvme1n1p1/REFs/k2_standard_08_GB_20251015.tar.gz \ --kmerfinderdb /mnt/nvme1n1p1/REFs/kmerfinder/bacteria/ \ -resume \ -work-dir bacass_out/work -
Verify if the genome is pure
# 1. Go up one level to the main 'bacass_out' directory cd .. # 2. Create directories for CheckM inputs and outputs mkdir -p checkm_input checkm_output # 3. Copy all .fna files into the 'checkm_input' folder # (CheckM cannot search subdirectories, so they must be in one folder) find ./Prokka -name "*.fna" -exec cp {} checkm_input/ \; # 4. Run CheckM on all 4 assemblies checkm lineage_wf -x fna checkm_input checkm_output -
Species Identification: 快速筛查用 Mash → 精确分类用 GTDB-Tk → 种级验证用 FastANI,三者结合可最大限度提高物种鉴定的准确性和可解释性。
# 1. 创建环境(推荐 mamba) mamba create -n gtdbtk -c conda-forge -c bioconda gtdbtk mamba activate gtdbtk # 2. 下载数据库(仅需首次,约 60GB) gtdbtk download --data_dir ./gtdb_data --release 220 wget https://data.gtdb.aau.ecogenomic.org/releases/release232/232.0/auxillary_files/gtdbtk_package/full_package/gtdbtk_r232_data.tar.g mamba env config vars set GTDBTK_DATA_PATH="/mnt/nvme4n1p1/gtdb_data/release232" # 先退出当前环境,再重新激活 mamba deactivate mamba activate gtdbtk # 验证环境变量是否加载成功 echo $GTDBTK_DATA_PATH # 应输出:/mnt/nvme4n1p1/gtdb_data/release232 # 3. 运行分类(你提供的命令 + 实用参数) gtdbtk classify_wf \ --genome_dir ./checkm_input \ --out_dir gtdb_out \ --cpus 64 \ --extension .fna \ --prefix mygenome # 4. 查看结果 cat gtdb_out/classify/mygenome.bac120.summary.tsv # 细菌结果 -
Antimicrobial resistance gene profiling and Resistome and Virulence Profiling with Abricate and RGI (Reisistance Gene Identifier)
conda activate /home/jhuang/miniconda3/envs/bengal3_ac3 abricate --list conda deactivate ENV_NAME=/home/jhuang/miniconda3/envs/bengal3_ac3 \ ASM=bacass_out/checkm_input/2914_.fna \ SAMPLE=2914 \ OUTDIR=resistome_virulence_2914 \ MINID=80 MINCOV=60 \ THREADS=32 \ ~/Scripts/run_abricate_resistome_virulome_one_per_gene.sh #ABRicate thresholds: MINID=80 MINCOV=60 Database Hit_lines File MEGARes 24 resistome_virulence_2605/raw/2605.megares.tab CARD 21 resistome_virulence_2605/raw/2605.card.tab ResFinder 4 resistome_virulence_2605/raw/2605.resfinder.tab VFDB 0 resistome_virulence_2605/raw/2605.vfdb.tab # Database Hit_lines File # MEGARes 42 resistome_virulence_2631/raw/2631.megares.tab # CARD 37 resistome_virulence_2631/raw/2631.card.tab # ResFinder 16 resistome_virulence_2631/raw/2631.resfinder.tab # VFDB 0 resistome_virulence_2631/raw/2631.vfdb.tab Database Hit_lines File MEGARes 35 resistome_virulence_2914/raw/2914.megares.tab CARD 31 resistome_virulence_2914/raw/2914.card.tab ResFinder 11 resistome_virulence_2914/raw/2914.resfinder.tab VFDB 0 resistome_virulence_2914/raw/2914.vfdb.tab # #ABRicate thresholds: MINID=70 MINCOV=50 # Database Hit_lines File # MEGARes 24 resistome_virulence_2605/raw/2605.megares.tab # CARD 21 resistome_virulence_2605/raw/2605.card.tab # ResFinder 4 resistome_virulence_2605/raw/2605.resfinder.tab # VFDB 3 resistome_virulence_2605/raw/2605.vfdb.tab conda activate /home/jhuang/miniconda3/envs/bengal3_ac3 #NEED_TO_ADAPT: OUTDIR = Path("resistome_virulence_An7") #NEED_TO_ADAPT: SAMPLE = "An7" #DEPRECATED_DUE_TO_NEED_MANULL_SETTING: python ~/Scripts/merge_amr_sources_by_gene.py python ~/Scripts/export_resistome_virulence_to_excel_py36.py \ --workdir resistome_virulence_2914 \ --sample 2914 \ --out Resistome_Virulence_2914.xlsx # Delete the column 'COVERAGE_MAP' in all 'Raw_*' sheets -
Report
Please find below a summary of genomic analyses for samples 2605, 2617, 2631 and 2914. ### 1. Assembly and checkM ------------------------------------------------------------------------------------------------------------------------------------------------------------------ Bin Id Completeness Contamination Strain heterogeneity ------------------------------------------------------------------------------------------------------------------------------------------------------------------ 2631_ 100.00 100.00 78.57 2617_ 100.00 100.00 78.57 2605_ 100.00 0.00 0.00 2914_ 99.98 0.63 0.00 ---------------------------------------------------------------------------------------------------------------------------------------------------------------- From the results of checkM, we see the samples 2631_ and 2617_ both are genomes between 7.0-7.1 M. and the contamination is 100.00, which means the DNA sample contained two closely related strains of the same species from a non-clonal culture. If the true genome size is a standard ~3.7 Mb and the assembler couldn't merge the two highly similar strains, it would build both side-by-side. This results in a ~7.0 Mb assembly where every gene is duplicated. The sample 2605_.fna is 3.7 M and 2914_.fna is about 3.9M. they are pure isolates. ### 1. Species Identification **Sample 2605_:** *Acinetobacter baumannii* ✅ Confirmed | Parameter | Value | Interpretation | |---|---|---| | Closest Reference | GCF_009759685.1 | Reference genome of *A. baumannii* | | ANI | 98.02% | ✅ Well above 95% species threshold | | AF (Alignment Fraction) | 0.874 | ✅ 87.4% of genome aligns; ANI estimate is robust | | Final Taxonomy | `d__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria;o__Pseudomonadales;f__Moraxellaceae;g__Acinetobacter;s__Acinetobacter baumannii` | Consistent with genomic expectations | 🟢 **Conclusion:** 2605_ is confidently assigned to *Acinetobacter baumannii*. *** **Sample 2617_:** *Acinetobacter baumannii* ✅ Confirmed | Parameter | Value | Interpretation | |---|---|---| | Closest Reference | GCF_009759685.1 | Reference genome of *A. baumannii* | | ANI | 98.00% | ✅ Well above 95% species threshold | | AF (Alignment Fraction) | 0.859 | ✅ 85.9% of genome aligns; ANI estimate is robust | | Final Taxonomy | `d__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria;o__Pseudomonadales;f__Moraxellaceae;g__Acinetobacter;s__Acinetobacter baumannii` | Consistent with genomic expectations | 🟢 **Conclusion:** 2617_ is confidently assigned to *Acinetobacter baumannii*. *** **Sample 2631_:** *Acinetobacter baumannii* ✅ Confirmed | Parameter | Value | Interpretation | |---|---|---| | Closest Reference | GCF_009759685.1 | Reference genome of *A. baumannii* | | ANI | 98.07% | ✅ Well above 95% species threshold | | AF (Alignment Fraction) | 0.860 | ✅ 86.0% of genome aligns; ANI estimate is robust | | Final Taxonomy | `d__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria;o__Pseudomonadales;f__Moraxellaceae;g__Acinetobacter;s__Acinetobacter baumannii` | Consistent with genomic expectations | 🟢 **Conclusion:** 2631_ is confidently assigned to *Acinetobacter baumannii*. *** **Sample 2914_:** *Acinetobacter baumannii* ✅ Confirmed | Parameter | Value | Interpretation | |---|---|---| | Closest Reference | GCF_009759685.1 | Reference genome of *A. baumannii* | | ANI | 98.11% | ✅ Well above 95% species threshold | | AF (Alignment Fraction) | 0.873 | ✅ 87.3% of genome aligns; ANI estimate is robust | | Final Taxonomy | `d__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria;o__Pseudomonadales;f__Moraxellaceae;g__Acinetobacter;s__Acinetobacter baumannii` | Consistent with genomic expectations | 🟢 **Conclusion:** 2914_ is confidently assigned to *Acinetobacter baumannii*. ### 3. Since 2631_ and 2617_ are not a pure isolates, they are the mixed of two strains. I exclude the two samples from AMR and VFDB analysis. AMR Genes and Virulence Factors (VFDB) Summary, see the Resistome_Virulence_2605.xlsx and Resistome_Virulence_2914.xlsx.
Since CheckM requires all the genome files (bins) to be in the same directory, you first need to gather your .fna files into a single folder before running the command.
Here are the exact bash commands to prepare your files and run CheckM for all four samples (2605, 2617, 2631, and 2914).
Step 1: Prepare the Input and Run CheckM
Run the following commands in your terminal:
# 1. Go up one level to the main 'bacass_out' directory
cd ..
# 2. Create directories for CheckM inputs and outputs
mkdir -p checkm_input checkm_output
# 3. Copy all .fna files into the 'checkm_input' folder
# (CheckM cannot search subdirectories, so they must be in one folder)
find ./Prokka -name "*.fna" -exec cp {} checkm_input/ \;
# 4. Run CheckM on all 4 assemblies
checkm lineage_wf -x fna checkm_input checkm_output
(Note: If you get an error saying the CheckM database is not found, you need to set the path to your CheckM data folder first using: checkm data setRoot /path/to/your/checkm_data)
Step 2: How to Interpret the Results
CheckM will print a summary table directly to your terminal, and it will also save detailed statistics in checkm_output/bin_stats_ext.tsv.
Since you are working with Acinetobacter sp. (which typically has a genome size of ~3.5 to 4.5 Mb), a 7.0 Mb assembly is almost certainly a mix of two closely related strains that the assembler failed to collapse.
Look specifically at these three columns in the CheckM output for your 2631_ sample:
- Completeness: This should be high (ideally > 95%). Because the assembler built two genomes, CheckM will likely still find all the marker genes, so completeness might look deceptively high.
- Contamination: This is the most critical metric. For a pure isolate, this should be < 5%. If your 2631 assembly is a mixed strain, CheckM will detect the duplicated marker genes and flag this as high contamination (often > 20-50%).
- Strain heterogeneity: This column specifically calculates if there are multiple strains of the same species present. A high value here confirms that your assembly contains a mixture of closely related strains.
What to do next?
- If 2631 has high contamination/strain heterogeneity: You have confirmed it is a mixed assembly. You can either use a tool like
purge_dupsto separate the two strains, or map your raw reads back to the assembly and filter out the contigs with half the average coverage. - If 2631 has low contamination: The 7.0 Mb size might be real (some Acinetobacter strains have large plasmids or are naturally larger), but you should still verify this using a BlobPlot (GC vs. Coverage) to be absolutely sure there is no foreign DNA.
-
Check the other 3 samples: Compare the CheckM results of 2605, 2617, and 2914 to see if they are pure, high-quality genomes (high completeness, low contamination).
------------------------------------------------------------------------------------------------------------------------------------------------------------------ Bin Id Marker lineage # genomes # markers # marker sets 0 1 2 3 4 5+ Completeness Contamination Strain heterogeneity ------------------------------------------------------------------------------------------------------------------------------------------------------------------ 2631_ root (UID1) 5656 56 24 0 0 56 0 0 0 100.00 100.00 78.57 2617_ root (UID1) 5656 56 24 0 0 56 0 0 0 100.00 100.00 78.57 2605_ f__Moraxellaceae (UID4680) 86 689 365 0 689 0 0 0 0 100.00 0.00 0.00 2914_ g__Acinetobacter (UID4685) 42 1193 254 1 1188 4 0 0 0 99.98 0.63 0.00 ----------------------------------------------------------------------------------------------------------------------------------------------------------------
Ah, I understand now! Thank you for clarifying. The CheckM results you showed earlier (with 100% contamination and 56 markers in the ‘2’ column) were indeed from this Unicycler assembly.
Looking at your Unicycler log, we can see exactly why it didn’t fix the problem: Total Length: 7,199,947 bp (~7.2 Mb) Main Chromosome: Component 1 consists of 416 fragmented contigs totaling 7,165,229 bp.
Why Unicycler Failed to Collapse the Genome
The log states: “Since you provided only short reads, Unicycler will essentially function as a SPAdes-optimiser.”
Because you only have short reads (Illumina), Unicycler builds a de Bruijn graph. When a bacterial sample contains a mixture of two closely related strains (a very common scenario in non-clonal isolates), the assembler sees the slight genetic differences between the two strains as “bubbles” in the graph.
Without long reads to span across these bubbles and prove which path belongs to which strain, the software conservatively outputs both paths. This results in an assembly that contains both strains side-by-side, doubling your genome size to ~7.2 Mb and causing the 100% duplication flagged by CheckM.
How to Fix This (Since Re-assembling Won’t Work)
Running another short-read assembler will just result in the exact same 7.2 Mb duplicated assembly. You now need to computationally “purge” the duplicate contigs.
Option 1: The Quick Fix using BBMap (dedupe.sh)
If you have the BBMap suite installed, it has a fantastic tool specifically for removing duplicate sequences. It will compare all your contigs to each other and remove the redundant ones.
dedupe.sh in=assembly.fasta out=assembly_purged.fasta minidentity=95
minidentity=95tells it to remove contigs that are 95% identical or more to another contig. You can adjust this (e.g., 90 or 98) depending on how closely related the two strains are.- After running this, run CheckM again on
assembly_purged.fasta. You should see the genome size drop to ~3.5 Mb and the contamination drop to 0%.
Option 2: The Thorough Fix using purge_dups
purge_dups is the gold standard for separating primary contigs from duplicated “haplotigs” in an assembly. It works by aligning your assembly to itself and using read depth to separate the true genome from the duplicates.
- Calculate coverage cutoffs:
calcuts assembly.fasta > cutoffs 2>calcults.log - Split the assembly:
split_fa assembly.fasta > assembly.split.fa - Align to itself:
minimap2 -xasm5 assembly.split.fa assembly.split.fa > alignment.paf - Get coverage:
pbcstat alignment.pafandcalcuts PB.base.cov - Purge:
purge_dups -2 -T 1024 -c CUTOFFS assembly.split.fa alignment.paf > dups.bed - Extract pure genome:
get_seqs dups.bed assembly.split.fa(This will outputassembly.split.fa.purged.fa, which is your pure ~3.5 Mb genome).
Option 3: Do you have Long Reads?
If you happen to have Oxford Nanopore (ONT) or PacBio reads for this sample, you can solve this biologically rather than computationally. Running Unicycler with long reads (unicycler -1 short_R1.fastq -2 short_R2.fastq -l long_reads.fastq) will allow the assembler to easily span across the “bubbles” and collapse the two strains into a single, perfect, circular ~3.5 Mb chromosome.
Bonus Observation: You likely have 3 plasmids!
Looking at the end of your log, Unicycler successfully closed 3 circular contigs:
- 8,731 bp (7.89x coverage)
- 5,837 bp (5.27x coverage)
- 2,120 bp (7.74x coverage)
These are almost certainly plasmids. The massive ~7.2 Mb “Component 1” is your duplicated chromosome. When you use purge_dups or dedupe.sh to fix the chromosomal duplication, these plasmids should remain intact in your final assembly!
是的,这种强制“去冗余”(Purging duplicates / Haplotype purging)的操作确实有可能引入组装错误或偏差。
在真核生物(如动植物)的组装中,分离单倍型是常规操作;但在细菌(单倍体) 基因组中,强行使用去冗余工具来“修复”7.2Mb的组装结果,是一把双刃剑。
以下是这种操作可能引入的具体错误和风险,以及背后的科学原因:
1. 误删真实的生物学重复序列(Over-purging / 假阴性缺失)
细菌基因组中天然存在许多真实的重复序列,例如:
- rRNA 操纵子(通常有 5-10 个高度相似的拷贝)。
- IS 元件(插入序列)和转座子(在 Acinetobacter 等细菌中非常活跃,经常有多个拷贝)。
- 基因家族或旁系同源基因(Paralogs)。
风险:如果你使用 bbmap (dedupe.sh) 并设置了一个较高的相似度阈值(例如 minidentity=95),去冗余工具无法区分“组装错误导致的重复”和“基因组天然存在的重复”。它可能会把你基因组中真实存在的、具有重要功能(如耐药性、毒力)的 IS 元件或 rRNA 拷贝当作“冗余的单倍型”直接删除,导致你的最终基因组缺失关键基因。
2. 产生“嵌合体”基因组(Chimeric Assembly)
你的 CheckM 结果显示完美的 1:1 重复(所有 marker 基因都在 ‘2’ 列)。这通常意味着两种可能:
- 情况 A(组装软件的 Bug):样本是纯的,但 SPAdes/Unicycler 因为某些复杂的局部重复或测序偏好性,把同一段序列组装了两遍。
- 情况 B(样本不纯/混合菌株):你的培养物中混入了两个亲缘关系极近的菌株(Mixed strains)。短读长无法跨越它们之间的 SNP/Indel 差异(即 de Bruijn 图中的 bubbles),所以软件把两套基因组都保留了下来。
风险:如果是情况 B,去冗余工具在决定“保留哪一套、丢弃哪一套”时,可能会在两个菌株的序列之间来回切换。最终你得到的并不是一个真实的单一基因组,而是一个自然界中不存在的“弗兰肯斯坦(Frankenstein)”嵌合体。
- 后果:这种嵌合体会严重影响后续的 SNP calling(产生大量假阳性突变)、进化树构建(Phylogeny)以及耐药基因(AMR)的准确定位。
3. 破坏基因组的连续性(Structural Breaks)
风险:在剥离“副拷贝”的过程中,去冗余算法可能会在原本连续的 contig 上造成人为的断裂。这会导致你的组装结果碎片化(Contig 数量增加,N50 降低),原本可以闭合的环状染色体可能会断裂成多个线性片段。
如何安全地处理 2631 和 2617?(最佳实践)
为了避免引入上述错误,不要盲目直接运行去冗余工具。建议按照以下步骤进行排查和处理:
第一步:回贴原始 Reads,查看覆盖度(Coverage)分布
将你的原始短读长(clean reads)比对回这 7.2Mb 的组装结果上(使用 bwa 或 bowtie2),然后统计每个 Contig 的平均覆盖度。
- 如果所有 Contig 的覆盖度都非常均匀(例如都在 100x 左右):这说明是情况 A(组装软件的 Bug)。基因组是纯的,只是被错误地组装了两遍。此时去冗余是相对安全的,但仍需谨慎。
- 如果 Contig 的覆盖度呈现明显的两个峰(例如一半是 100x,另一半是 50x):这证实了是情况 B(混合菌株)。
- 如果覆盖度极其混乱:说明可能存在更复杂的污染。
第二步:根据覆盖度结果选择策略
- 如果是混合菌株(情况 B):
- 最完美的生物学解决方案:重新划线分离单克隆(Single colony isolation),重新提取 DNA 测序。这是唯一能获得 100% 纯基因组的方法。
- 计算解决方案:不要使用
dedupe.sh。你应该使用基于覆盖度的分离工具(如 metaSPAdes 的 binning 功能,或者 BlobTools),利用 100x 和 50x 的覆盖度差异,将两个菌株的 Contig 彻底分开,分别得到两个独立的基因组,然后再分别跑 CheckM。
- 如果是组装 Bug(情况 A):
- 你可以尝试使用
purge_dups,它比dedupe.sh更智能,会结合自比对(self-alignment)和覆盖度来区分“主序列(Primary)”和“副序列(Haplotig)”。 - 关键操作:在运行
purge_dups后,必须将提取出的“Primary”序列再次运行 CheckM 和 BUSCO。 - 验收标准:如果 CheckM 显示 Completeness > 95%,Contamination < 5%,并且基因组大小降到了 ~3.5 Mb,说明去冗余成功且没有误删核心基因。如果 Completeness 大幅下降,说明你误删了真实的基因,需要调高
purge_dups的保守参数重新运行。
- 你可以尝试使用
第三步:尝试重新组装(替代方案)
有时候,Unicycler 的默认参数对某些特定的 Acinetobacter 菌株不够友好。你可以尝试直接使用 SPAdes 并开启特定的模式:
spades.py --isolate --careful -1 2631__1.fastp.fastq.gz -2 2631__2.fastp.fastq.gz -o spades_2631_careful
--careful 参数会让 SPAdes 在组装后尝试纠正错配和 Indel,有时它内部的 repeat resolution 机制能更好地 collapse 那些被 Unicycler 错误保留的重复区域。
总结
强行去冗余确实有引入嵌合体或误删真实重复序列的风险。对于细菌基因组,“计算修补”永远是下策。最严谨的做法是通过 Read Mapping 确认覆盖度,如果是混合菌株则进行 Binning 分离或重新实验;如果是组装冗余,则使用 purge_dups 并严格依赖二次 CheckM 结果来验证基因组的完整性。