ragtag.py usage

minimap2 -cx asm20 --paf-no-hit CP059040.fasta adeABadeIJ_contigs.min500.fasta > asm20_all.paf
awk '$6=="*"{print $1}' asm20_all.paf
#-->contig00020
#-->contig00021
#-->contig00027
seqkit grep -v -r \
  -p "^contig00020([[:space:]]|$)" \
  -p "^contig00021([[:space:]]|$)" \
  -p "^contig00027([[:space:]]|$)" \
  adeABadeIJ_contigs.min500.fasta > adeABadeIJ_contigs.min500.no20_21_27.fasta
ragtag.py scaffold CP059040.fasta adeABadeIJ_contigs.min500.no20_21_27.fasta -o ragtag_adeABadeIJ  -C 

minimap2 -cx asm20 --paf-no-hit CP059040.fasta adeIJK_contigs.min500.fasta > asm20_all.paf
awk '$6=="*"{print $1}' asm20_all.paf
#-->contig00016
seqkit grep -v -r -p "^contig00016(\s|$)" adeIJK_contigs.min500.fasta > adeIJK_contigs.min500.no16.fasta
ragtag.py scaffold CP059040.fasta adeIJK_contigs.min500.no16.fasta -o ragtag_adeIJK  -C 

minimap2 -cx asm20 --paf-no-hit CP059040.fasta A6WT_contigs.min500.fasta > asm20_all.paf
awk '$6=="*"{print $1}' asm20_all.paf
#-->contig00016
seqkit grep -v -r -p "^contig00016(\s|$)" A6WT_contigs.min500.fasta > A6WT_contigs.min500.no16.fasta
ragtag.py scaffold CP059040.fasta A6WT_contigs.min500.no16.fasta -o ragtag_A6WT  -C 

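What this shows:

ragtag.scaffold.confidence.txt does not list every contig that minimap2 can align; it lists only the contigs that RagTag actually placed into a scaffold/chromosome.

In other words, RagTag does not place a contig simply because it aligns. The alignments pass through several filtering steps:

  1. Align with minimap2
  2. Filter the alignments by criteria such as:

    • unique alignment length (-f)
    • mapping quality (-q)
  3. Compute three confidence scores:

    • grouping_confidence
    • location_confidence
    • orientation_confidence
  4. Finally, drop contigs whose position is completely contained within another contig

So your case can be read like this:

  • 21 contigs have alignments in the minimap2 PAF output
  • only 15 contigs pass RagTag's filtering and are placed into the scaffold
  • so ragtag.scaffold.confidence.txt shows only 15

The 15 contigs you pasted all pass the thresholds. For reference, the defaults are:

  • -i = 0.2
  • -a = 0.0
  • -s = 0.0

Your contig00002 has location_confidence = 0.1338, which is low but still greater than 0.0, so it is kept under the defaults.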
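
To see which placed contigs would drop out under stricter thresholds, the confidence table can be scanned with awk. This is a minimal sketch: low_confidence is a hypothetical helper, it assumes a tab-separated file with the contig name followed by the grouping, location and orientation scores, and it treats "below the threshold" as strictly less than, which is an assumption rather than RagTag's documented behaviour.

```shell
# Print contigs whose confidence scores fall below the given thresholds.
# Assumed columns (tab-separated): name, grouping, location, orientation.
# Usage: low_confidence FILE [MIN_GROUPING] [MIN_LOCATION] [MIN_ORIENTATION]
low_confidence() {
  awk -F'\t' -v i="${2:-0.2}" -v a="${3:-0.0}" -v s="${4:-0.0}" '
    $2 !~ /^[0-9.]+$/ { next }     # skip a header line, if there is one
    ($2 + 0) < (i + 0) || ($3 + 0) < (a + 0) || ($4 + 0) < (s + 0) {
      print $1 "\t" $2 "\t" $3 "\t" $4
    }
  ' "$1"
}
```

For example, low_confidence ragtag_output/ragtag.scaffold.confidence.txt 0.2 0.2 0.0 would list contig00002 (location_confidence 0.1338) as a contig that a -a 0.2 run would lose.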


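Why do 21 contigs align but only 15 remain?

There are usually two kinds of reasons:

1. The alignment exists but does not pass RagTag's downstream filters

For example:

  • the unique alignment length is too short
  • the MAPQ is too low
  • the confidence is too low

2. The contig aligns, but its position on the reference genome is covered or contained by another contig

In that case RagTag does not treat the contig as an independent scaffolding unit, so it is not placed separately in the final chromosome.

That is why:

"aligns to the reference" ≠ "will be placed by RagTag in the final chromosome scaffold"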
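
For the first kind of reason, a quick look at the PAF can hint at which contigs have only short or low-MAPQ alignments. The sketch below uses a hypothetical helper, paf_summary, that reports the longest alignment block and the best MAPQ per contig. The longest block is only a rough proxy for RagTag's unique alignment length, which is computed differently.

```shell
# Summarise a PAF file per query contig: longest alignment block and best MAPQ.
# PAF columns used: 1 = query name, 6 = target name, 11 = block length, 12 = MAPQ.
paf_summary() {
  awk -F'\t' '
    $6 == "*" { next }                                    # skip unmapped records
    !($1 in len) || ($11 + 0) > len[$1] { len[$1] = $11 + 0 }
    !($1 in mq)  || ($12 + 0) > mq[$1]  { mq[$1]  = $12 + 0 }
    END { for (q in len) print q "\t" len[q] "\t" mq[q] }
  ' "$1" | sort
}
```

Running paf_summary asm20.paf and comparing the second column with -f (default 1000) and the third with -q (default 10) shows which contigs are at risk of being filtered.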


How do you find the 6 dropped contigs?

First compare the 21 aligned contigs with the 15 placed contigs:

cut -f1 asm20.paf | sort -u > mapped21.txt
cut -f1 ragtag_output/ragtag.scaffold.confidence.txt | sort -u > placed15.txt
comm -23 mapped21.txt placed15.txt > missing6.txt
cat missing6.txt

This yields the 6 contigs that have alignments but did not enter the final scaffold.


How do you work out why these 6 were left out?

Rerun RagTag in debug mode:

ragtag.py scaffold CP046654.fasta A6WT_contigs.min500.fasta \
  -o ragtag_debug \
  -w \
  --debug \
  --mm2-params "-x asm20 -t 8"

Then look up the 6 contigs in the debug file:

grep -F -f missing6.txt ragtag_debug/ragtag.scaffold.debug.query.info.txt

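How to interpret it:

  • If a contig does not appear in debug.query.info.txt at all

    • it was removed during the earlier filtering stage
    • common causes: -f too high, -q too high
  • If it appears in debug.query.info.txt but not in confidence.txt

    • it reached the scoring stage but was excluded afterwards
    • common causes:

      • confidence too low
      • or it was judged to be a contained contig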
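
The decision rule above can be applied to the whole list in one pass. This is a sketch only: classify_missing is a hypothetical helper, and it assumes nothing about the layout of the two files beyond the contig name appearing in them as a whole word.

```shell
# Classify contigs by which RagTag output files mention them.
# Usage: classify_missing MISSING_LIST DEBUG_QUERY_INFO CONFIDENCE_FILE
classify_missing() {
  while IFS= read -r ctg; do
    [ -n "$ctg" ] || continue                      # ignore blank lines
    if ! grep -qwF -- "$ctg" "$2"; then
      printf '%s\tremoved_before_scoring\n' "$ctg"
    elif ! grep -qwF -- "$ctg" "$3"; then
      printf '%s\tscored_but_not_placed\n' "$ctg"
    else
      printf '%s\tplaced\n' "$ctg"
    fi
  done < "$1"
}
```

A typical call would be classify_missing missing6.txt ragtag_debug/ragtag.scaffold.debug.query.info.txt ragtag_debug/ragtag.scaffold.confidence.txt.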

Which parameters can you adjust to get more contigs placed?

Start with a more permissive parameter set:

ragtag.py scaffold CP046654.fasta A6WT_contigs.min500.fasta \
  -o ragtag_relaxed \
  -w \
  --debug \
  -f 200 \
  -q 0 \
  -i 0.0 \
  -a 0.0 \
  -s 0.0 \
  -d 200000 \
  --mm2-params "-x asm20 -t 8"

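This relaxes:

  • -f 200: lowers the minimum unique alignment length
  • -q 0: accepts lower MAPQ
  • -i 0.0: no limit on grouping confidence
  • -a 0.0: no limit on location confidence
  • -s 0.0: no limit on orientation confidence
  • -x asm20: tolerates higher sequence divergence
  • -d 200000: allows alignments farther apart to be merged

A caveat

Even with relaxed parameters, you may not get all 21 contigs into the final chromosome.

If those 6 contigs are:

  • repeat regions
  • insertion sequences (IS)
  • rRNA regions
  • highly overlapping with other contigs on the reference genome
  • judged to be contained

then RagTag leaves them out precisely to avoid misjoins.

In other words:

The issue is not that the parameters are too strict; RagTag considers these contigs unsuitable for independent placement in the final chromosome scaffold.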
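
To get a feel for which contigs might count as contained, the reference intervals in the PAF can be compared directly. The sketch below is a simplified illustration of the idea, not RagTag's own algorithm: the hypothetical helper contained_contigs keeps only each contig's longest alignment block and flags a contig when that block lies entirely inside another contig's block on the same reference sequence.

```shell
# Flag contigs whose longest reference alignment lies inside another contig's.
# PAF columns used: 1 query, 6 target, 8 target start, 9 target end, 11 block length.
contained_contigs() {
  tab="$(printf '\t')"
  awk -F'\t' '$6 != "*" { print $1 "\t" $6 "\t" $8 "\t" $9 "\t" $11 }' "$1" |
    sort -t "$tab" -k1,1 -k5,5nr |
    awk -F'\t' '!seen[$1]++' |                       # longest block per contig
    sort -t "$tab" -k2,2 -k3,3n -k4,4nr |            # by reference, start, then end (desc)
    awk -F'\t' '
      $2 != ref { ref = $2; max_end = -1 }           # new reference sequence
      ($4 + 0) <= max_end { print $1 "\tinside\t" owner; next }
      { max_end = $4 + 0; owner = $1 }
    '
}
```

Contigs reported by contained_contigs asm20.paf are reasonable candidates for the "contained" explanation, but only the RagTag debug output can confirm it.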


If you just want every contig to appear in the final output

Add -C:

ragtag.py scaffold CP046654.fasta A6WT_contigs.min500.fasta -C

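The effect is:

  • contigs that can be placed in a chromosome are still placed there
  • contigs that cannot be placed are concatenated into an extra chr0 sequence

Note:

-C does not force them into the main chromosome; it collects the unplaced contigs into one extra sequence.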
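
One way to confirm what -C produced is to list the sequences in the scaffold FASTA with their lengths. This is a small sketch with a hypothetical helper, fasta_lengths; the output path ragtag_output/ragtag.scaffold.fasta in the usage note is assumed from RagTag's default output directory.

```shell
# Print each FASTA record's name (first word of the header) and sequence length.
fasta_lengths() {
  awk '
    /^>/ { if (name != "") print name "\t" n; name = substr($1, 2); n = 0; next }
    { n += length($0) }
    END { if (name != "") print name "\t" n }
  ' "$1"
}
```

With fasta_lengths ragtag_output/ragtag.scaffold.fasta you would expect one long chromosome record plus one extra record holding the concatenated unplaced contigs.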


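Summary

Your results show that:

  • 21 contigs have minimap2 alignments
  • only 15 contigs are actually placed into the final scaffold by RagTag
  • the other 6 contigs were most likely:

    • removed during filtering
    • or aligned, but contained in or overlapping other contigs, and so not placed independently

So:

The 15 contigs in confidence.txt are not the only ones that align; they are the only ones RagTag accepted and used for scaffolding.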

If you like, I can also condense this into a more concise step-by-step guide in Chinese.



Not quite.

What your output shows is:

  • minimap2 processed 22 query sequences in each run, as shown by the stderr line mapped 22 sequences. That line should not be read as “22 contigs had accepted alignments in the PAF output.” (lh3.github.io)
  • Your command cut -f1 ${p}.paf | sort -u | wc -l counted 21 unique query names in the PAF file for asm5, asm10, and asm20. Since PAF normally contains alignment records, and minimap2 only includes unmapped queries in PAF if you add --paf-no-hit, the most likely interpretation is that 21 contigs had at least one reported alignment and 1 contig had no reported hit. (lh3.github.io)

So the practical answer is:

No, these results do not mean all 22 contigs mapped. They suggest that 21/22 contigs produced at least one alignment in the PAF, and 1 contig did not, for all three presets. (lh3.github.io)

Also, “has at least one minimap2 alignment” is still not the same as “can be scaffolded by RagTag.” RagTag applies extra filters after alignment, including minimum unique alignment length (-f), minimum MAPQ (-q), and confidence thresholds such as grouping confidence (-i). A contig can therefore appear in the PAF but still remain unplaced by RagTag. (lh3.github.io)

To identify the missing contig exactly, run:

grep '^>' A6WT_contigs.min500.fasta | sed 's/^>//' | cut -d' ' -f1 | sort > all_contigs.txt
cut -f1 asm20.paf | sort -u > mapped_contigs.txt
comm -23 all_contigs.txt mapped_contigs.txt

Or rerun minimap2 with explicit no-hit output:

#!!!!!!!!!!! IMPORTANT: OUTPUT unmapped PAF records !!!!!!!!!!
# In the A6WT example there are 22 contigs longer than 500 nt: 21 map to the reference, and only 15 are actually placed into the final scaffold by RagTag. The one unmapped contig is probably a plasmid; here it is contig00016.
minimap2 -cx asm20 --paf-no-hit CP046654.fasta A6WT_contigs.min500.fasta > asm20_all.paf
awk '$6=="*"{print $1}' asm20_all.paf
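# Then remove the contig00016 record and merge the remaining contigs into one chromosome sequence for submission
seqkit grep -v -r -p "^contig00016(\s|$)" A6WT_contigs.min500.fasta > A6WT_contigs.min500.no16.fasta
ragtag.py scaffold CP046654.fasta A6WT_contigs.min500.no16.fasta -C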

In that second command, minimap2 marks unmapped PAF records with * in the reference-name field. (lh3.github.io)
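
To check the 21/22 split in one step, mapped and unmapped query names can be counted from the same --paf-no-hit file. A minimal sketch, using a hypothetical helper named paf_hit_counts:

```shell
# Count distinct query names with and without a hit in a --paf-no-hit PAF.
# A record whose target name (column 6) is "*" is an unmapped query.
paf_hit_counts() {
  awk -F'\t' '
    $6 == "*" { un[$1] = 1; next }
    { hit[$1] = 1 }
    END {
      for (q in hit) h++
      for (q in un) u++
      printf "mapped=%d unmapped=%d\n", h, u
    }
  ' "$1"
}
```

For the A6WT example, paf_hit_counts asm20_all.paf should report mapped=21 unmapped=1 if the interpretation above is right.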

So your current result is actually encouraging: all three presets recover the same 21 contigs, and the remaining disagreement with RagTag is probably due more to RagTag filtering/scaffolding criteria than to minimap2 failing broadly. (lh3.github.io)



Yes — the first thing to relax is the minimap2 preset inside RagTag.

RagTag scaffold uses whole-genome alignments and defaults to --mm2-params '-x asm5'. That default is relatively strict for very closely related assemblies; minimap2 documents asm5 for average divergence not much higher than about 0.1%, while asm10 is for around 1% divergence and asm20 for several percent. RagTag then further filters alignments by unique alignment length (-f, default 1000), MAPQ (-q, default 10), and grouping confidence (-i, default 0.2). (GitHub)
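
A rough way to decide between the presets is to estimate the divergence actually observed in a trial alignment. The sketch below assumes the PAF was produced with -c and that your minimap2 version writes the de:f: tag (gap-compressed per-base divergence); paf_mean_divergence is a hypothetical helper that averages that tag weighted by alignment block length.

```shell
# Alignment-length-weighted mean of the de:f: tag across all PAF records.
# Prints "NA" when no record carries the tag.
paf_mean_divergence() {
  awk -F'\t' '
    $6 == "*" { next }                              # skip unmapped records
    {
      for (k = 13; k <= NF; k++)
        if ($k ~ /^de:f:/) { sum += substr($k, 6) * $11; len += $11 }
    }
    END { if (len > 0) printf "%.4f\n", sum / len; else print "NA" }
  ' "$1"
}
```

A value near 0.001 would suggest asm5 is adequate, around 0.01 points to asm10, and several percent points to asm20, following the preset descriptions above.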

A good first relaxed run is:

ragtag.py scaffold ref.fa query.fa \
  -o ragtag_asm10 \
  -w \
  -f 500 \
  -q 1 \
  -i 0.1 \
  -d 200000 \
  --mm2-params "-x asm10 -t 8"

A more permissive run is:

ragtag.py scaffold ref.fa query.fa \
  -o ragtag_asm20 \
  -w \
  -f 200 \
  -q 0 \
  -i 0.05 \
  -d 200000 \
  --mm2-params "-x asm20 -t 8"

Why these help:

  • -x asm10 or -x asm20 makes the assembly-to-reference aligner more tolerant of divergence. (lh3.github.io)
  • Lowering -q accepts lower-MAPQ alignments that RagTag would otherwise discard. (GitHub)
  • Lowering -f allows alignments with less unique anchor length to be considered. (GitHub)
  • Increasing -d lets RagTag merge syntenic alignment blocks that are farther apart on the reference. RagTag’s paper says nearby alignments within -d are merged, with the default at 100 kb. (GitHub)
  • Lowering -i makes RagTag keep contigs with more ambiguous chromosome assignment; RagTag excludes sequences below the confidence thresholds set by -i, -a, and -s. (GitHub)

The main reason BLASTn and RagTag can disagree is that this is not just “does it hit somewhere?”. BLASTn can show local similarity, but RagTag needs filtered whole-genome alignments that are sufficiently unique and confident for grouping, ordering, and orienting contigs. If a contig maps to repeats, IS elements, rRNA operons, plasmid fragments, or multiple places equally well, BLASTn may still show hits while RagTag leaves it unplaced. That is consistent with RagTag’s unique-anchor filtering and confidence scoring. (GitHub)

Before scaffolding, I would test the minimap2 presets directly:

for p in asm5 asm10 asm20; do
  minimap2 -cx $p ref.fa query.fa > ${p}.paf
  printf "%s\t" "$p"
  cut -f1 ${p}.paf | sort -u | wc -l
done

That is useful because RagTag uses minimap2 by default, so this tells you whether the loss happens at the alignment stage or later during RagTag filtering. (GitHub)

One important caution: making RagTag too permissive can produce false joins. If your contigs are plasmid-derived, mobile-element-rich, or genuinely absent from the reference, relaxing parameters may force incorrect scaffolding rather than solve the problem. RagTag’s own paper notes that repetitive or ambiguous alignments can mislead scaffolding, and contigs with no acceptable filtered alignments are output as unplaced. ([PMC][3])

If you want, paste your exact ragtag.py scaffold command plus the number of contigs in asm5, asm10, and asm20, and I’ll tune a safer final command for your dataset.

[3]: https://pmc.ncbi.nlm.nih.gov/articles/PMC9753292/ "Automated assembly scaffolding using RagTag elevates a new tomato system for high-throughput genome editing - PMC"
