Authorship and Conflict of Interest (COI) Form, and License to Publish Agreement — Notes

TODO: fill these out next Monday.

Log in at https://cts.sciencemag.org

Thank you for completing the Authorship and Conflict of Interest statement form.

Notes:

  • Every question must be answered.
  • You can save at any time and resume later.
  • The editorial office cannot see the form until it is submitted.

How to submit:

Click the “Save and Submit” button at the end of the form.

This form must be completed before the paper can be accepted.

We recommend completing it as soon as possible, so that any issues can be resolved while the paper is still in process.

Please make sure the following information has been, or will be, provided to the corresponding author and appears in all versions of the paper:

  • Author contributions
  • Funding sources
  • Competing interests statement
  • Any restrictions on the availability of data and materials (if applicable)

Author contributions (CRediT taxonomy)

Use the drop-down menus to select your contribution roles for the paper.

If the system has already pre-identified some roles, you can:

  • edit them
  • delete them
  • save the ones that apply

1 Authorship

Authorship criteria

Authors in Science journals must meet the following criteria, which draw on the ICMJE (International Committee of Medical Journal Editors) definition of authorship.

Each author must have done at least one of the following:

  • made a substantial contribution to the conception or design of the work
  • participated in the acquisition, analysis, or interpretation of data
  • developed new software used in the work
  • drafted the paper or revised it substantially

And each author must:

  • approve the submitted version (and any revised versions involving their contribution)
  • be accountable for their own contributions
  • ensure that questions about the accuracy or integrity of any part of the paper are investigated, resolved, and documented

Question:

I meet the authorship criteria above.

Options:

  • Yes
  • No

Corresponding author responsibilities

The corresponding author must:

  • ensure that all authors have read and approved the paper before submission
  • communicate with the editorial office and receive reviewer comments
  • ensure that data, materials, and code meet transparency and reproducibility standards
  • ensure that raw data are preserved and can be re-analyzed
  • ensure that the data presented in the paper are true and accurate
  • minimize obstacles to data sharing
  • ensure that all authors follow best research practices
  • be responsible for approving the proofs
  • ensure that all authors complete the COI form and license agreement

Choose one:

  • I am not the corresponding author
  • I am the corresponding author (but the paper involves no data or materials)
  • I am the corresponding author and agree to take on the responsibilities above
  • I am the corresponding author but cannot take on the responsibilities above (explanation required)

2 Contributions

Indicate your contributions to the paper.

Conceptualization

Formulation of the research idea and overall research goals

Yes / No

Methodology

Design of methods or models

Yes / No

Software

Programming, software development, algorithm implementation

Yes / No

Validation

Verification and reproducibility of results

Yes / No

Formal analysis

Statistical, mathematical, or computational analysis

Yes / No

Investigation

Conducting experiments or collecting data

Yes / No

Resources

Providing reagents, materials, samples, instruments, etc.

Yes / No

Data curation

Managing, organizing, and maintaining data

Yes / No

Writing – original draft

Writing the first draft

Yes / No

Writing – review & editing

Reviewing, commenting on, and revising the paper

Yes / No

Visualization

Figures, charts, or data presentation

Yes / No

Supervision

Oversight and mentorship of the research

Yes / No

Project administration

Coordination and management of the research project

Yes / No

Funding acquisition

Obtaining research funding

Yes / No


3 Conflict of Interest

All authors must disclose anything that could affect the objectivity of the research:

  • institutional affiliations
  • funding sources
  • financial interests

This information is published together with the paper.


Current affiliations

Choose one:

  • All affiliations are listed on the title page of the paper
  • There are additional affiliations, noted in the acknowledgments
  • There are additional affiliations, which will be provided to the corresponding author

Funding sources

Choose one:

  • I provided no funding for this work
  • All funding sources are listed in the acknowledgments
  • There are unlisted funding sources, which will be provided to the corresponding author

Financial interests

For example:

  • stocks
  • equity

Choose one:

  • No financial interests to disclose
  • Disclosed in the acknowledgments
  • There are undisclosed interests, which will be provided to the corresponding author

Board / consulting relationships

Within the past 5 years, have you:

  • served as a director or advisor
  • received consulting fees
  • received speaking fees
  • received fees for expert testimony

Choose one:

  • None
  • Disclosed in the acknowledgments
  • There are undisclosed relationships

Patents

Are you an inventor on a patent that is:

  • filed
  • pending
  • granted

Choose one:

  • None
  • Disclosed in the acknowledgments
  • There are undisclosed patents

4 Data Policies

Authors must confirm that all data needed to support the paper's conclusions are:

  • included in the paper or supplementary materials, or
  • deposited in a public database

Choose one:

  • Yes, I confirm
  • Not applicable (commentary/review article)
  • I cannot confirm
  • A senior author will answer on my behalf

Senior authors must:

review all raw data generated by their team.


5 Materials Sharing

Authors must confirm that, after publication, they will reasonably share:

  • materials
  • data
  • code

Choose one:

  • Not applicable
  • I cannot confirm

Material Transfer Agreements (MTA)

Senior authors must state whether the materials are subject to:

  • an MTA
  • patent restrictions

Choose one:

  • No restrictions
  • Restrictions exist (the editorial office has been informed)
  • Restrictions exist but the agreement has not been submitted

6 Summary Declaration

I confirm that:

  • I have read all questions
  • I have answered truthfully
  • all relationships requiring disclosure are described in the paper or acknowledgments

Any relationships not listed have been provided to the corresponding author.


Form submission date:

06-Mar-2026


Electronic signature:

Typing your name below counts as an electronic signature.

Signature: _____






License to Publish — Overview

Before a manuscript is formally accepted for publication, each author must submit a signed License to Publish to AAAS.

The license grants AAAS the rights it needs to publish and distribute the article, while also reserving certain rights to the authors.

Please note the following before submitting the signed license.


Our philosophy

The American Association for the Advancement of Science (AAAS) is the nonprofit publisher of the Science family of journals. AAAS recognizes:

  • the importance of openness in research
  • the increasingly global nature of scientific research

Broad dissemination of peer-reviewed research, and information sharing within the scientific community, are key to science serving society.

As part of its mission, AAAS works to foster communication worldwide:

  • among scientists
  • among engineers
  • between the scientific community and the public

To this end, some AAAS journals offer authors the option of publishing under a Creative Commons license.

For papers published under a Creative Commons license, AAAS makes the final published version freely available online, with no access barriers or embargo.

For peer-reviewed research articles not published under a Creative Commons license, the Author Accepted Manuscript may be deposited in a public repository under the conditions set out in the license.

AAAS also:

  • makes articles of urgent public-health significance immediately free to access
  • participates in programs that provide free research content to the world's poorest countries

AAAS author publishing license policy

The AAAS license allows authors of research articles supported by cOAlition S funders to share the accepted version publicly under a CC BY license (or CC BY-ND where the funder allows) after AAAS publishes the article.

Select the license type that applies to your situation:

  • Standard
  • U.S. Government Employee

    • U.S. government contractors and recipients of U.S. government grants should select Standard above
  • Crown Copyright License

Science Journals License to Publish

I. Condition of acceptance

The following license grant agreement ("this License") must be signed and returned to AAAS before a manuscript can be accepted for publication in Science or its sister journals (Science Advances, Science Immunology, Science Robotics, Science Signaling, Science Translational Medicine).

By signing this License you represent and warrant that:

  • you are entitled to sign this License
  • you hold all rights necessary to grant it

For example:

If your institution places restrictions on publishing agreements, or claims certain distribution or open-access rights in faculty works that conflict with the terms of this License, you must first obtain a waiver from your institution releasing those restrictions before signing.

After AAAS publishes the paper, your institution may still exercise the rights reserved to authors in Part III of this License.

If the copyright in the work belongs to your employer, this form must be signed by:

  • your employer, or
  • an authorized representative

If AAAS decides not to publish your manuscript, this License terminates automatically upon AAAS's final written notice of non-publication. AAAS retains sole discretion over whether, and in what form, to publish any manuscript.


II. Publication rights

In consideration of AAAS publishing the manuscript currently titled:

Complex Human Hair-Bearing Skin Organoids reveal Cell Type Specific Susceptibility and Innate Immune Responses to Herpes Simplex Virus 1 (aef5563)

(the "Work"), authored by Jiabin Huang et al. ("you"),

you hereby grant AAAS the irrevocable right, for any purpose (including commercial purposes), worldwide, in any language, and in any form or medium now known or later developed, to:

  • publish
  • reproduce
  • distribute
  • transmit
  • display
  • store
  • translate
  • create derivative works from
  • and otherwise use

the Work, and to permit or sublicense others to do any or all of the above.

For supplementary materials, data, audio, and/or video files submitted with the manuscript, you grant AAAS a non-exclusive right to make similar uses of those materials.


Definitions

The Final Published Version (Version of Record, VoR) is the final proofread, edited version published by AAAS.

The Accepted Version (Author Accepted Manuscript, AAM) is the version accepted for publication by AAAS, incorporating peer-review changes but not yet edited and typeset by AAAS.


Rights allocation

You agree that:

the license AAAS receives to the Final Published Version is exclusive.

You nonetheless retain the right to:

make the Accepted Version and earlier versions publicly available in accordance with your employer's or research funder's open-access policy.

You retain copyright, subject to the rights granted to AAAS above. All rights not expressly granted in this License remain with the authors.

This License does not transfer:

  • patent rights
  • trademark rights
  • other intellectual property rights not expressly mentioned

You also authorize AAAS (without obligating it) to act on your behalf, at its own expense, against third parties alleged to have infringed the copyright in the Work.


III. Rights retained by the authors

Some (not all) AAAS journals allow authors, after acceptance, to opt to publish the Final Published Version under a Creative Commons license. Where offered, this option may require an article processing charge (APC).

A. Except as provided in III.B and III.C, after AAAS publication the authors and co-authors retain the following non-exclusive rights, with no further permission needed from AAAS:

  1. Include the Final Published Version in a print collection of the author's own works.
  2. Use the Final Published Version in a thesis or dissertation written by the author.
  3. Present the content of the Final Published Version orally.
  4. Reproduce the Final Published Version for the author's own teaching; if the author is employed by an academic institution, that institution may also make copies for teaching.
  5. Distribute photocopies or PDFs of the Final Published Version to colleagues for non-commercial purposes only (informing recipients that they may not further distribute or copy it).
  6. Reuse figures and tables the authors created in their own future works.
  7. Use, or license others to use, the supplementary materials associated with the Work for any purpose and in any format.
  8. If the article is not a peer-reviewed research article, post the Accepted Version to a personal or employer website immediately upon publication, with a link to the Final Published Version and the notice: "This is the author's version of the work. It is posted here for personal use, not for redistribution. The definitive version was published in Science Advances on [date]; doi: 10.1126/sciadv.aef5563." The Accepted Version must not be formatted to resemble the publisher's typeset version.
  9. If the article is a peer-reviewed research article, after AAAS publication post the Accepted Version to a personal website or any public repository (again, not formatted to resemble the published version), together with a link to the Final Published Version and the applicable license notice:

    • a. If the manuscript was submitted to a Science subscription journal on or after January 1, 2021, and the research was funded by cOAlition S, the Accepted Version may carry a CC BY license (or CC BY-ND where the funder allows).
    • b. If (a) does not apply, but the manuscript was submitted on or after January 1, 2023, and is subject to an employer or funder open-access policy requiring the Accepted Version to be released under CC BY, CC BY may be used.
    • c. If neither (a) nor (b) applies, the distributed Accepted Version must carry the notice: "This is the author's version of the work. It is posted here for personal use, not for redistribution. The definitive version was published in Science Advances on [date]; doi: 10.1126/sciadv.aef5563."

B. If, after acceptance, AAAS offers the CC BY (Creative Commons Attribution 4.0 International) option and you select it and pay the applicable APC:

then, after publication, you and your co-authors may use the article as permitted by the CC BY license, subject to all of its terms.


C. If, after acceptance, AAAS offers the CC BY-NC (Creative Commons Attribution-NonCommercial 4.0 International) option and you select it and pay the applicable APC:

then, after publication, you and your co-authors may use the article as permitted by the CC BY-NC license and, where applicable, retain the rights listed in III.A above.


IV. User access and use rights (applies only to works published under a Creative Commons license)

If AAAS publishes the Final Published Version under a Creative Commons license:

  • AAAS makes that version freely available online, with no barriers or embargo
  • upon publication, AAAS also deposits the Final Published Version in PMC/UKPMC for immediate public access

AAAS publishes the work under the Creative Commons license the author selects after acceptance. Use of the Final Published Version by others is subject to the terms of the selected license.

A. CC BY

Permits users, in any medium or format, to:

  • copy
  • redistribute
  • adapt
  • remix
  • use commercially

provided that they:

  • credit the authors and the original work
  • indicate whether changes were made
  • do not imply that the authors endorse the use
  • include a link to the CC BY 4.0 license

B. CC BY-NC

Permits users, in any medium or format, to:

  • copy
  • redistribute
  • adapt
  • remix

for non-commercial purposes only, not for commercial advantage or monetary compensation.

It likewise requires:

  • attribution
  • indication of changes
  • no implied endorsement by the authors
  • a link to the CC BY-NC 4.0 license

V. Additional author representations and warranties

You represent and warrant that:

  • the Work is your original creation
  • all facts stated in it are true and accurate
  • the Work contains nothing defamatory
  • the Work has not been published elsewhere
  • the Work does not infringe any third party's copyright, property rights, or personal rights
  • no artificial intelligence, machine learning, or similar algorithmic tools were used to generate images, figures, graphics, or multimedia in the Work, unless fully disclosed to the Science editors at submission and approved by AAAS
  • if artificial intelligence, machine learning, or similar tools were used in the research itself, or in writing or presenting the Work, this is disclosed in the cover letter and manuscript at submission, per current Science editorial policy

You agree to indemnify, defend, and hold AAAS harmless against any claims arising from these warranties.


VI. Third-party content

Important: if the Work contains any material owned or controlled by a third party, you must:

  • obtain permission to use it
  • submit written evidence of that permission to the editors
  • clearly credit the source and copyright holder in the text of the Work

By signing this License you warrant that you have full authority to enter into this agreement.

This License remains in effect for the duration of copyright in the Work and may not be revoked except by express written agreement of both parties.

This License is governed by and construed under the laws of the District of Columbia, USA, without regard to its conflict-of-law principles, and any disputes are resolved under those laws.


If applicable, complete the following

Was this Work produced under a U.S. government contract or grant?

US Government Contract or Grant

AAAS acknowledges: if a grant or contract provides that the U.S. government holds a non-exclusive right to use the Work for non-commercial government purposes, the U.S. government may exercise that right.



Germany Visitor Visa for Parents — Complete Document Package (Final)

Bind/group the documents in this order for your parents' visa appointment. The ordering follows the German mission's official checklist you uploaded, combined with your current situation: invitation letter + Verpflichtungserklärung + retired parents.



Part I — Documents each parent prepares individually

1. Passport

Each applicant prepares:

  • passport (original)
  • 1 copy of the passport data page

The passport must be valid for at least 3 months beyond the planned departure from the Schengen area and have at least 2 blank visa pages.
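The validity rule above can be checked mechanically. A minimal sketch in Python, assuming example dates — the dates below are placeholders, not real travel plans:

```python
from datetime import date

def passport_ok(expiry: date, leave_schengen: date, blank_pages: int) -> bool:
    """Passport must be valid at least 3 months beyond the planned departure
    from the Schengen area and have at least 2 blank visa pages."""
    # Approximate "3 months" by moving the departure date 3 calendar months forward
    month = leave_schengen.month + 3
    year = leave_schengen.year + (month - 1) // 12
    month = (month - 1) % 12 + 1
    deadline = date(year, month, min(leave_schengen.day, 28))  # clamp day to stay valid
    return expiry >= deadline and blank_pages >= 2

# Placeholder dates
print(passport_ok(date(2027, 1, 15), date(2026, 8, 1), blank_pages=4))  # valid
print(passport_ok(date(2026, 9, 30), date(2026, 8, 1), blank_pages=4))  # expires too soon
```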


2. 申根签证申请表

每位申请人准备:

  • VIDEX 在线填写后打印
  • 申请表亲笔签名
  • 附加声明页亲笔签名
  • 关于《居留法》第54条第2款第8项的声明书签字

签名要和护照签名一致。


3. 证件照

每位申请人准备:

  • 1张白底生物识别照片
  • 6个月内拍摄

4. 医疗保险

每位申请人准备:

  • 保险单打印件 1份

要求:

  • 覆盖整个申根区
  • 覆盖整个停留期间
  • 保额不低于 3 万欧元
  • 明确保门诊、住院、送返等内容。

You mentioned earlier that you will buy the insurance yourself — that works.


5. Household register (hukou)

May be shared or submitted separately:

  • copies of all non-blank pages of the household register

No translation needed for Chinese applicants.


6. Documents for retirees

Each applicant should prepare:

  • copy of the retirement certificate
  • pension certificate or pension account statements (better if available)
  • 1 statement of ties to China (not needed here)

The official checklist notes that "documents differ for retirees": retirees do not follow the employment-certificate route, so no employer letter or business license is needed.


Part II — Documents you provide for your parents to submit

7. Invitation letter

Prepare:

  • original invitation letter
  • 1 copy

It must state the inviter's address and contact details, purpose of travel, dates, place of stay, who bears the costs, and the relationship, and carry a handwritten signature and date.

Yours is already done.


8. Verpflichtungserklärung

Prepare:

  • original
  • 1 copy

This is one of the officially recognized proofs that travel and living costs are covered. With it, your parents usually no longer need to provide their own bank statements (item 7 of the official checklist) as proof of travel funds.


9. Inviter's identity documents

Per the German mission, a non-German inviter must submit:

  • copy of the passport data page
  • copy of the valid German residence permit

You have already prepared this as A4 pages, ready to go straight into the package.


Part III — Proof of relationship

10. Proof of kinship

Prepare:

  • original notarized kinship certificate
  • 1 copy
  • German or English translation
  • Apostille

For "regular visits to relatives living in Germany," the mission requires: a kinship certificate translated and notarized by a notary office and bearing an Apostille (original + copy).


Part IV — Optional but strongly recommended

11. Previous Schengen visa records

If your parents have been to the Schengen area before, prepare:

  • old passports
  • or copies of old visa pages

This item is "if applicable," not mandatory.


12. Supporting evidence of ties to China

The official checklist has no separate "ties to China" item, but it matters a lot in practice. Each parent should additionally include:

  • copy of the retirement certificate
  • pension certificate / bank statements (optional)
  • copy of the household register
  • copy of the property deed (submit if available)
  • 1 statement of ties to China

These documents are very helpful for retirees.


Documents your parents usually do NOT need this time

Because you have provided a Verpflichtungserklärung, the following are usually not required:

  • applicants' bank statements
  • deposit certificates
  • salary statements
  • employer letter
  • company business license

Item 11 of the official checklist states: travel costs can be covered by the applicant (see item 7), or by the inviter via a declaration of commitment under Sections 66–68 of the Residence Act. You are taking the latter route.


Recommended binding order

Make one full set of the main documents per parent, in this order:

A. Applicant's own documents

  1. Passport (original)
  2. Copy of passport data page
  3. Application form
  4. Signed declaration pages
  5. Photo
  6. Medical insurance
  7. Household register copy
  8. Retirement certificate copy
  9. Previous Schengen visa documents (if any)
  10. Statement of ties to China (not needed here)

B. Invitation and sponsorship documents

  1. Invitation letter, original/copy
  2. Verpflichtungserklärung, original/copy
  3. Copy of your passport
  4. Copy of your German residence permit

C. Proof of relationship

  1. Notarized kinship certificate (original)
  2. Translation
  3. Apostille
  4. Copies of the above

A final practical tip for the appointment

The safest setup:

  • one set for your father
  • one set for your mother
  • each set contains its own copies of the shared documents
  • originals in a separate transparent folder

This makes intake at VFS fastest and least error-prone.

Consider turning this into a printable checklist with tick boxes.

Information Sheet: Schengen Visa for Visiting Relatives and Friends (June 2024 edition)

This document maps the submission materials onto a complete A4 folder structure (very useful):

For example:

01 Passport
02 Application form
03 Photo
04 Insurance
05 Hukou
06 Retirement certificate + pension income proof
07 Invitation letter
08 Verpflichtungserklärung
09 Relationship proof
10 Passport/ID copy of inviter
11 Old visa copies (if applicable)

  1. Travel passport (original + 1 copy of the data page): hand-signed passport, valid for at least 3 months beyond the requested stay, with at least two blank visa pages and no damage.

  2. Application form (original): signature must match the passport signature.

  3. 1 photo: a recent (within 6 months) biometric photo on a white background. Requirements: 4.5 cm × 3.5 cm, high resolution, white background; frontal view including chin and neck, face and eyes unobstructed, face occupying 70–80% of the photo.

  4. Travel medical insurance (original)

  5. Household register (copies): copies of all non-blank pages; no translation required.

  6. Retirement certificate: submit the retirement certificate (original + copy) plus proof of fixed pension income.

  7. Invitation letter (original)

  8. Verpflichtungserklärung

  9. Proof of the relationship between applicant and inviter. For regular visits to relatives living in Germany: a kinship certificate translated and notarized by a notary office and bearing an Apostille. Chinese public documents legalized by the Chinese MFA / local foreign affairs offices before November 30, 2023 remain valid for this visa application without an additional Apostille. (Original + copy)

  10. Inviter's identity documents (copies): copy of the inviter's German ID card or passport data page. Non-German inviters must additionally submit a copy of their valid German residence permit.

  11. If applicable: evidence of previous travel to the Schengen area, e.g., old passports or copies of previous visas and entry/exit stamps.

How to search for motifs in Acinetobacter?

To search for those short peptide motifs (like “DIKDY” or “DLSDY”) across all available Acinetobacter genomes, you can use the online BLAST tool on the NCBI website. Since these are short sequences (5-7 amino acids), you’ll need to adjust the search settings to find meaningful matches.

Here is a step-by-step guide on how to perform this search effectively:

1. Access NCBI BLAST and Choose the Correct Program

First, go to the NCBI BLAST home page and select the appropriate search tool. For protein motifs, you should use blastp.

2. Enter Your Query Sequence

In the “Enter Query Sequence” box, you can directly type or paste your short peptide sequence, such as DIKDY.

  • Important: For each search, use only one of the short peptide motifs you mentioned (e.g., first search for “DIKDY”, then later for “DNYQFDSK”, etc.).

3. Select the Database and Restrict by Organism

This is the most important step to ensure you search only Acinetobacter genomes.

  • In the “Choose Search Set” section, select the Non-redundant protein sequences (nr) database. This is the comprehensive database of all protein sequences at NCBI.
  • In the “Organism” box, type “Acinetobacter” (or “Acinetobacter baumannii”).
  • Select the desired option from the dropdown menu that appears (e.g., Acinetobacter (taxid:469)). This will restrict your search to only sequences from this genus.

4. Optimize the Algorithm for Short Sequences

Standard BLAST settings are optimized for long sequences and may miss short motifs. You must change the algorithm to one designed for short inputs.

  • In the “Program Selection” section, expand the algorithm options.
  • Select the blastp-short task. This program is specifically optimized for query sequences shorter than 30 residues.

5. Adjust Advanced Parameters (Optional but Recommended)

You may also want to adjust the Expect threshold (E-value) to get more relevant results. A higher E-value is more lenient and can find more distant matches, which is useful for short, conserved motifs.

  • Click on “Algorithm parameters” to expand the advanced settings.
  • Increase the Expect threshold to 20000 or 200000. The default value of 10 is too stringent for a 5-amino acid query.
  • Consider turning off the low-complexity region filter, as short motifs might be filtered out by default.

6. Run the Search and Interpret the Results

Click the “BLAST” button at the bottom of the page.

  • The results page will show you all protein sequences in Acinetobacter that contain your exact peptide motif.
  • Pay attention to the “Query Cover” column. For a perfect 5-amino acid match, the query cover will be 100%. If you allowed for mismatches, it might be lower.
  • Look at the “Scientific Name” column to see which Acinetobacter species and strains contain the motif.

Summary of Settings for Your Search

Parameter          | Recommended Setting                    | Reason
Program            | blastp                                 | Protein query against a protein database
Database           | Non-redundant protein sequences (nr)   | The most comprehensive public database
Organism           | Acinetobacter (taxid:469)              | Limits results to the genus of interest
Task               | blastp-short                           | Optimized for short, peptide-like queries
Expect threshold   | 20,000                                 | Increases sensitivity for short, exact matches

By following these steps for each of the eight peptide motifs (four from AdeJ and four from AdeB), you will be able to see how conserved they are across different Acinetobacter strains and identify which specific genomes contain these regions of interest.

I hope this step-by-step guide helps you perform your analysis successfully. If the results are too broad or too narrow, feel free to adjust the E-value or experiment with allowing a small number of mismatches in the algorithm parameters.
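As a complementary local check, exact occurrences of such short motifs can also be found with plain string matching once you have downloaded protein sequences (e.g., a FASTA exported from an NCBI search). A minimal Python sketch — the FASTA records below are illustrative stand-ins, not real Acinetobacter proteins; BLAST remains the tool of choice for scored/approximate matches:

```python
# Scan protein FASTA records for exact short-motif hits.

def read_fasta(text):
    """Parse FASTA text into {header: sequence}."""
    records, header, chunks = {}, None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                records[header] = "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        records[header] = "".join(chunks)
    return records

def find_motifs(records, motifs):
    """Return {motif: [(header, 0-based position), ...]} for exact matches."""
    hits = {m: [] for m in motifs}
    for header, seq in records.items():
        for m in motifs:
            start = seq.find(m)
            while start != -1:
                hits[m].append((header, start))
                start = seq.find(m, start + 1)
    return hits

# Illustrative records (placeholders, not real data)
fasta = """>prot1 hypothetical efflux pump
MKTDIKDYLLGAANDLSDY
>prot2 hypothetical porin
MALWMRLLPLL
"""
records = read_fasta(fasta)
print(find_motifs(records, ["DIKDY", "DLSDY"]))
```

This is handy for tallying per-strain hit counts across many downloaded genomes without submitting hundreds of web searches.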

Data_Marius_16S_2026

Hamburg:

  • Group 1 (pre-Stroke RD): #1-3, #6-8, #10-12
  • Group 2 (pre-Stroke HFD): #4-5, #9, #13-17
  • Group 3 (post-stroke RD): #18-20, #23-25, #27-29
  • Group 4 (post-stroke HFD): #21-22, #26, #30-34
  • Group 5 (Baseline RD): #40-43, #49-53
  • Group 6 (Baseline HFD): #35-39, #44-48

Odense:

  • Group 7 (Baseline Female): #54-69
  • Group 8 (Baseline Male): #96-97, #99-102, #104-109
  • Group 9 (pre-Stroke RD female): #76-78, #80-81
  • Group 10 (pre-Stroke RD male): #114-121
  • Group 11 (pre-Stroke HFD female): #70-75, #79
  • Group 12 (pre-Stroke HFD male): #110-113
  • Group 13 (post-stroke RD female): #88-95
  • Group 14 (post-stroke RD male): #129-136
  • Group 15 (post-stroke HFD female): #82-87
  • Group 16 (post-stroke HFD male): #122-128
  1. Could you please create two independent HTML files, one with the analysis of the Hamburg samples and one with the analysis of the Odense samples?

  2. We are interested in the following comparisons:

a. Hamburg:

  • i. Group 1 vs 2 (PCA and DEGs)
  • ii. Group 3 vs 4 (PCA and DEGs)
  • iii. Group 5 vs 6 (PCA and DEGs)

b. Odense:

  • i. Group 7 vs 8 (PCA and DEGs)
  • ii. Group 9 vs 10 (PCA and DEGs)
  • iii. Group 11 vs 12 (PCA and DEGs)
  • iv. Group 13 vs 14 (PCA and DEGs)
  • v. Group 15 vs 16 (PCA and DEGs)

---
author: ""
date: '`r format(Sys.time(), "%d %m %Y")`'
header-includes:
  - \usepackage{color, fancyvrb}
output:
  rmdformats::readthedown:
    highlight: kate
    number_sections: yes
  pdf_document:
    toc: yes
    toc_depth: 2
    number_sections: yes
---


#install.packages(c("picante", "rmdformats"))
#mamba install -c conda-forge freetype libpng harfbuzz fribidi
#mamba install -c conda-forge r-systemfonts r-svglite r-kableExtra freetype fontconfig harfbuzz fribidi libpng
library(knitr)
library(rmdformats)
library(readxl)
library(dplyr)
library(kableExtra)
library(openxlsx)
library(DESeq2)
library(writexl)

options(max.print="75")
knitr::opts_chunk$set(fig.width=8,
                      fig.height=6,
                      eval=TRUE,
                      cache=TRUE,
                      echo=TRUE,
                      prompt=FALSE,
                      tidy=FALSE,
                      comment=NA,
                      message=FALSE,
                      warning=FALSE)
opts_knit$set(width=85)
# Phyloseq R library
#* Phyloseq web site : https://joey711.github.io/phyloseq/index.html
#* See in particular tutorials for
#    - importing data: https://joey711.github.io/phyloseq/import-data.html
#    - heat maps: https://joey711.github.io/phyloseq/plot_heatmap-examples.html
#STEP_1: RUN rmarkdown::render('Phyloseq_Hamburg.Rmd',output_file='Phyloseq_Hamburg.html')
#STEP_2: RUN MicrobiotaProcess_Group_1_2_3_4_5_6.R
#STEP_3: adjust Global PERMANOVA text, Pairwise PERMANOVA table, and PCoA figure.
#STEP_4: RUN_AGAIN rmarkdown::render('Phyloseq_Hamburg.Rmd',output_file='Phyloseq_Hamburg.html')
#options(max.print = 1e6)

Data

Import raw data and assign sample key:

#extend qiime2_metadata_for_qza_to_phyloseq.tsv with Diet and Flora
#setwd("~/DATA/Data_Laura_16S_2/core_diversity_e4753")
#map_corrected <- read.csv("qiime2_metadata_for_qza_to_phyloseq.tsv", sep="\t", row.names=1)
#knitr::kable(map_corrected) %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

Prerequisites to be installed

install.packages("dplyr")     # To manipulate dataframes
install.packages("readxl")    # To read Excel files into R
install.packages("ggplot2")   # for high quality graphics
install.packages("heatmaply")
# biocLite()/biocLite.R is deprecated; use BiocManager instead
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("phyloseq")
#mamba install -c conda-forge r-ggplot2 r-vegan r-data.table
#BiocManager::install("microbiome")
#install.packages("ggpubr")
#install.packages("heatmaply")
library("readxl") # necessary to import the data from Excel file
library("ggplot2") # graphics
library("picante")
library("microbiome") # data analysis and visualisation
library("phyloseq") # also the basis of data object. Data analysis and visualisation
library("ggpubr") # publication quality figures, based on ggplot2
library("dplyr") # data handling, filter and reformat data frames
library("RColorBrewer") # nice color options
library("heatmaply")
library(vegan)
library(gplots)
#install.packages("openxlsx")
library(openxlsx)
library(rstatix)
library(knitr)
library(kableExtra)

Read the data and create phyloseq objects

Three tables are needed

  • OTU
  • Taxonomy
  • Samples

Create analysis-specific phyloseq objects

We maintain one filtered “base” phyloseq object and then derive analysis-specific objects from it. This avoids accidental overwriting and ensures each analysis uses the appropriate data scale (counts vs. relative abundance vs. rarefied counts).

  • ps_raw → Raw imported phyloseq object (integer counts; import stage only)
  • ps_base → ps_raw with taxonomy + sample metadata properly aligned (the clean master object before any filtering)
  • ps_pruned → Optional sample subset of ps_base (e.g., drop unwanted samples by ID/pattern); still integer counts
  • ps_filt → The shared filtered backbone: low-depth samples removed + taxa with zero total counts dropped; remains integer counts

    library(tidyr)

    # For QIIME1
    #ps.ng.tax <- import_biom("./exported_table/feature-table.biom", "./exported-tree/tree.nwk")

    # For QIIME2
    #install.packages("remotes")
    #remotes::install_github("jbisanz/qiime2R")
    #"core_metrics_results/rarefied_table.qza", rarefying performed in the code, therefore import the raw table.
    library(qiime2R)
    ps_raw <- qza_to_phyloseq(
      features =  "dada2_tests/test_49_f240_r240/table.qza",
      tree = "rooted-tree.qza",
      metadata = "qiime2_metadata_for_qza_to_phyloseq.tsv"
    )

    # Refresh/ensure sample_data (optional but keeps things explicit)
    sample_df <- read.csv("./qiime2_metadata_for_qza_to_phyloseq.tsv", sep="\t", row.names=1, check.names=FALSE)
    SAM <- sample_data(sample_df, errorIfNULL = TRUE)

    # Add taxonomy table (exported from QIIME2)
    taxonomy <- read.delim("./exported-taxonomy/taxonomy.tsv", sep="\t", header=TRUE)

    # Separate taxonomy string into separate ranks
    taxonomy_df <- taxonomy %>% separate(Taxon, into = c("Domain","Phylum","Class","Order","Family","Genus","Species"), sep = ";", fill = "right", extra = "drop")
    # Use Feature.ID as rownames
    rownames(taxonomy_df) <- taxonomy_df$Feature.ID
    taxonomy_df <- taxonomy_df[, -c(1, ncol(taxonomy_df))]  # Drop Feature.ID and Confidence
    # Create tax_table
    tax_table_final <- phyloseq::tax_table(as.matrix(taxonomy_df))

    # Base object: raw integer counts + metadata + taxonomy
    ps_base <- merge_phyloseq(ps_raw, SAM, tax_table_final)
    print(ps_base)

    #colnames(phyloseq::tax_table(ps_base)) <- c("Domain","Phylum","Class","Order","Family","Genus","Species")
    saveRDS(ps_base, "./ps_base.rds")

Visualize data

# Inspect the base object (raw integer counts)
sample_names(ps_base)
rank_names(ps_base)
sample_variables(ps_base)

# Optional: prune to a naming pattern (avoids hard-coding long sample lists)
#samples_keep <- sample_names(ps_base)
#samples_keep <- samples_keep[grepl("^sample-[A-H]", samples_keep)]
#Deleted samples: sample-NTC-9  sample-PC-1 sample-PC-2 sample-PC-3 sample-PC-4  sample-PC-5
#"sample-PC-1","sample-PC-2","sample-PC-3","sample-PC-4","sample-PC-5"
samples_keep <- c("sample-1","sample-2","sample-3","sample-6","sample-7","sample-8","sample-10","sample-11","sample-12", "sample-4","sample-5","sample-9","sample-13","sample-14","sample-15","sample-16","sample-17",  "sample-18","sample-19","sample-20","sample-23","sample-24","sample-25","sample-28","sample-29", "sample-21","sample-22","sample-26","sample-30","sample-31","sample-32","sample-33","sample-34", "sample-40","sample-41","sample-42","sample-43","sample-49","sample-50","sample-51","sample-52","sample-53", "sample-35","sample-36","sample-37","sample-38","sample-39","sample-44","sample-45","sample-46","sample-47","sample-48")

#    Group 1 (pre-Stroke RD): #1-3, #6-8, #10-12
#    Group 2 (pre-Stroke HFD): #4-5, #9, #13-17
#    Group 3 (post-stroke RD): #18-20, #23-25, #27-29
#    Group 4 (post-stroke HFD): #21-22, #26, #30-34
#    Group 5 (Baseline RD): #40-43, #49-53
#    Group 6 (Baseline HFD): #35-39, #44-48

ps_pruned <- prune_samples(samples_keep, ps_base)

# Drop taxa absent after pruning
ps_pruned <- prune_taxa(taxa_sums(ps_pruned) > 0, ps_pruned)

# Quick sanity checks
nsamples(ps_base); ntaxa(ps_base)
nsamples(ps_pruned); ntaxa(ps_pruned)
#  #Preprocessing statistics for each sample
#   denoising_stats <- read.csv("dada2_tests/test_59_f235_r245/data_stats.tsv", sep="\t")
#   # Display the table
#   kable(denoising_stats, caption = "Preprocessing statistics for each sample") %>%
#     kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

In the QC filtering step, we removed 14 samples that either fell below the minimum sequencing depth (library size < 12,201 reads) or were pre-specified for exclusion via sample_ids (as defined in the previous step). The filtered object (ps_filt) therefore contains only samples meeting the depth cutoff, and taxa were re-pruned to retain only those with nonzero total abundance across the retained samples.

# ------------------------------------------------------------
#   Filter low-depth samples (recommended for all analyses)
# ------------------------------------------------------------
#IMPORTANT_ADAPTION: min_depth needs to be adapted to the rarefaction threshold.
min_depth <- 5610  # <-- adjust to your data / study design, keeps all!
ps_filt <- prune_samples(sample_sums(ps_pruned) >= min_depth, ps_pruned)
ps_filt <- prune_taxa(taxa_sums(ps_filt) > 0, ps_filt)
ps_filt

# Keep a depth summary for reporting / QC
depth_summary <- summary(sample_sums(ps_filt))
depth_summary

Differential abundance (DESeq2)

  • ps_deseq → non-rarefied integer counts derived from ps_filt, with an optional count-based taxon prefilter (default: taxa with total counts ≥ 10 across all samples)

From ps_filt (e.g. 5669 taxa and 239 samples), we branch into analysis-ready objects in two directions:

  • Direction 1 for diversity analyses

    • Alpha diversity: ps_rarefied ✅ (common)
    • Beta diversity:
      • Unweighted UniFrac / Jaccard: ps_rarefied ✅ (often recommended)
      • Bray–Curtis / ordination on abundances: ps_rel or Hellinger ✅ (rarefaction optional)
      • Aitchison (CLR): CLR-transformed (non-rarefied) ✅ (no rarefaction)

Rarefaction


# RAREFACTION
set.seed(9242)  # makes the random subsampling (rarefaction) reproducible
ps_rarefied <- rarefy_even_depth(ps_filt, sample.size = min_depth)

# # NORMALIZE number of reads in each sample using median sequencing depth.
# total = median(sample_sums(ps.ng.tax))
# #> total
# #[1] 42369
# standf = function(x, t=total) round(t * (x / sum(x)))
# ps.ng.tax = transform_sample_counts(ps.ng.tax, standf)
# ps_rel <- microbiome::transform(ps.ng.tax, "compositional")
#
# saveRDS(ps.ng.tax, "./ps.ng.tax.rds")
  • Direction 2 for taxonomic composition plots

    • Taxonomic composition → ps_rel: relative abundance (compositional), computed after sample filtering (e.g. 5669 taxa and 239 samples)
    • Optional cleaner composition plots → ps_abund / ps_abund_rel: taxa filtered for plotting (e.g., keep taxa with mean relative abundance > 0.1%; e.g. 95 taxa and 239 samples); ps_abund = counts, ps_abund_rel = relative abundance (use for visualization, not DESeq2)

For the heatmaps, we focus on the most abundant OTUs by first converting counts to relative abundances within each sample. We then retain only OTUs whose mean relative abundance across all samples exceeds 0.1% (0.001). This leaves 199 OTUs, which makes the heatmap much easier to read.

# 1) Convert to relative abundances
ps_rel <- transform_sample_counts(ps_filt, function(x) x / sum(x))

# 2) Get the logical vector of which OTUs to keep (based on relative abundance)
keep_vector <- phyloseq::filter_taxa(
  ps_rel,
  function(x) mean(x) > 0.001,
  prune = FALSE
)

# 3) Use the TRUE/FALSE vector to subset absolute abundance data
ps_abund <- prune_taxa(names(keep_vector)[keep_vector], ps_filt)

# 4) Normalize the final subset to relative abundances per sample
ps_abund_rel <- transform_sample_counts(
  ps_abund,
  function(x) x / sum(x)
)

cat("ps_pruned:", nsamples(ps_pruned), "\n")
cat("ps_filt:", nsamples(ps_filt), "\n")
cat("ps_abund:", nsamples(ps_abund), "\n")
cat("ps_abund_rel:", nsamples(ps_abund_rel), "\n")

Heatmaps

datamat_ = as.data.frame(otu_table(ps_abund))
#"sample-PC-1","sample-PC-2","sample-PC-3","sample-PC-4","sample-PC-5"
datamat <- datamat_[c("sample-1","sample-2","sample-3","sample-6","sample-7","sample-8","sample-10","sample-11","sample-12", "sample-4","sample-5","sample-9","sample-13","sample-14","sample-15","sample-16","sample-17",  "sample-18","sample-19","sample-20","sample-23","sample-24","sample-25","sample-28","sample-29", "sample-21","sample-22","sample-26","sample-30","sample-31","sample-32","sample-33","sample-34", "sample-40","sample-41","sample-42","sample-43","sample-49","sample-50","sample-51","sample-52","sample-53", "sample-35","sample-36","sample-37","sample-38","sample-39","sample-44","sample-45","sample-46","sample-47","sample-48")]
# Remove rows with zero variance
datamat <- datamat[apply(datamat, 1, var) > 0, ]
# Remove cols with zero variance
#datamat <- datamat[, apply(datamat, 2, var) > 0]

# (optional) replace with your actual column names
#colnames(datamat) <- c(
#    "A24040201",...
#  )

# ---------- 0) Sample names from datamat ----------
samplenames <- sub("", "", colnames(datamat))  #^sample-

# ---------- 1) Read metadata ----------
meta_path <- "qiime2_metadata.tsv"
meta <- read.delim(
  meta_path,
  sep = "\t",
  header = TRUE,
  stringsAsFactors = FALSE,
  check.names = FALSE,
  comment.char = ""
)

# ---------- 2) Identify SampleID + Group columns ----------
sample_id_col <- c("#SampleID","SampleID","sample-id","sampleid")
sample_id_col <- sample_id_col[sample_id_col %in% colnames(meta)][1]
if (is.na(sample_id_col)) sample_id_col <- colnames(meta)[1]

group_col <- c("Group","group","GROUP")
group_col <- group_col[group_col %in% colnames(meta)][1]
if (is.na(group_col)) stop("No 'Group' column found in metadata.")

# ---------- 3) Build lookup: sample -> group ----------
meta_ids <- sub("", "", meta[[sample_id_col]])  #^sample-
meta_groups <- trimws(as.character(meta[[group_col]]))
group_map <- setNames(meta_groups, meta_ids)

# Map datamat columns to group labels
groups <- unname(group_map[samplenames])
groups[is.na(groups)] <- "unknown"

# ---------- 4) Color mapping for YOUR labels ----------
#  "positive control" = "#ff7f00",
## (Adjust colors if you prefer different ones)
#color_map <- c(
#  "1" = "#a6cee3",
#  "2" = "#1f78b4",
#  "3" = "#b2df8a",
#  "4" = "#33a02c",
#  "5" = "#fb9a99",
#  "negative control" = "#6a3d9a",
#  "unknown" = "GREY"
#)
# Color map 1..16 + PC
color_map <- c(
  "1"="#a6cee3","2"="#1f78b4","3"="#b2df8a","4"="#33a02c","5"="#fb9a99","6"="#e31a1c"
)
#"7"="#fdbf6f","8"="#ff7f00","9"="#cab2d6","10"="#6a3d9a",
#  "11"="#ffff99","12"="#b15928","13"="#8dd3c7","14"="#bebada",
#  "15"="#80b1d3","16"="#fccde5"
#,"PC"="#000000"

# Assign colors safely
sampleCols <- unname(color_map[groups])
sampleCols[is.na(sampleCols)] <- "GREY"
names(sampleCols) <- samplenames

# ---------- 5) Checks ----------
cat("Unique groups found in datamat:\n")
print(sort(unique(groups)))
cat("\nCounts per group:\n")
print(table(groups, useNA = "ifany"))
cat("\nFirst 10 sample colors:\n")
print(head(sampleCols, 10))

# Optional: list any samples that didn't match metadata
unmatched <- samplenames[groups == "unknown"]
if (length(unmatched) > 0) {
  cat("\nWARNING: Unmatched samples (showing up to 20):\n")
  print(head(unmatched, 20))
}
## --- 1) order columns by group (and then by sample name) ---
#"positive control",
group_order <- c("1","2","3","4","5", "6")
#,"7","8","9","10", "11","12","13","14","15", "16", "PC"
groups_fac  <- factor(groups, levels = group_order, ordered = TRUE)

col_order <- order(groups_fac, samplenames)

datamat_ord     <- datamat[, col_order, drop = FALSE]
groups_ord      <- groups[col_order]
samplenames_ord <- samplenames[col_order]
sampleCols_ord  <- sampleCols[col_order]

stopifnot(identical(colnames(datamat_ord), samplenames_ord))

## group separators
grp_counts <- table(factor(groups_ord, levels = group_order))
grp_breaks <- cumsum(as.integer(grp_counts[grp_counts > 0]))

## --- 2) cluster ROWS using the *ordered* matrix (columns don't matter, but be consistent) ---
hr   <- hclust(as.dist(1 - cor(t(datamat_ord), method = "pearson")), method = "complete")
mycl <- cutree(hr, h = max(hr$height) / 1.08)

mycol_palette <- c("YELLOW","DARKBLUE","DARKORANGE","DARKMAGENTA","DARKCYAN","DARKRED",
                   "MAROON","DARKGREEN","LIGHTBLUE","PINK","MAGENTA","LIGHTCYAN",
                   "LIGHTGREEN","BLUE","ORANGE","CYAN","RED","GREEN")
mycol <- mycol_palette[as.vector(mycl)]

## --- 3) plot using datamat_ord and sampleCols_ord; keep column order fixed ---
library(RColorBrewer)
heatmap_colors <- colorRampPalette(brewer.pal(9, "Blues"))(100)

png("figures/heatmap.png", width = 1800, height = 2400)
heatmap.2(
  as.matrix(datamat_ord),
  Rowv = as.dendrogram(hr),
  Colv = NA,                        # IMPORTANT: do NOT cluster columns
  dendrogram = "row",
  scale = "row",
  trace = "none",
  col = heatmap_colors,
  cexRow = 1.2,
  cexCol = 2.2,                     # <<< bigger x (column) labels
  RowSideColors = mycol,
  ColSideColors = sampleCols_ord,   # IMPORTANT: use ordered colors
  srtCol = 85,
  labRow = row.names(datamat_ord),
  labCol = samplenames_ord,         # optional but explicit
  key = TRUE,
  margins = c(10, 18),              # <<< more space for rotated x labels
  lhei = c(0.7, 15),
  lwid = c(1, 8),
  colsep = grp_breaks,              # optional separators
  sepcolor = "black",
  sepwidth = c(0.02, 0.02)
)
dev.off()

knitr::include_graphics("./figures/heatmap.png")

\pagebreak

# Clean taxonomy prefixes (e.g., "d__", "p__") safely without hard-coded row indices
tt_mat <- as(phyloseq::tax_table(ps_abund_rel), "matrix")
if (is.null(rownames(tt_mat))) rownames(tt_mat) <- phyloseq::taxa_names(ps_abund_rel)

strip_prefix <- function(x) {
  x <- as.character(x)
  sub("^[^_]*__", "", x)
}

for (j in seq_len(ncol(tt_mat))) {
  tt_mat[, j] <- strip_prefix(tt_mat[, j])
}
tt_mat[tt_mat == ""] <- NA_character_

# Ensure taxa names still match
stopifnot(identical(rownames(tt_mat), phyloseq::taxa_names(ps_abund_rel)))

phyloseq::tax_table(ps_abund_rel) <- phyloseq::tax_table(tt_mat)
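As a quick standalone check (not part of the pipeline itself), `strip_prefix()` removes everything up to and including the first double underscore, so prefixes such as `d__` or `p__` are dropped while unprefixed labels pass through unchanged:

```r
strip_prefix <- function(x) {
  x <- as.character(x)
  sub("^[^_]*__", "", x)  # drop a leading "<rank>__" prefix, if present
}

strip_prefix(c("d__Bacteria", "p__Firmicutes", "Unassigned"))
# → "Bacteria" "Firmicutes" "Unassigned"
```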

Taxonomic summary

Bar plots at the phylum level

  sample_order <- c(
    "sample-1","sample-2","sample-3","sample-6","sample-7","sample-8","sample-10","sample-11","sample-12",
    "sample-4","sample-5","sample-9","sample-13","sample-14","sample-15","sample-16","sample-17",
    "sample-18","sample-19","sample-20","sample-23","sample-24","sample-25","sample-28","sample-29",
    "sample-21","sample-22","sample-26","sample-30","sample-31","sample-32","sample-33","sample-34",
    "sample-40","sample-41","sample-42","sample-43","sample-49","sample-50","sample-51","sample-52","sample-53",
    "sample-35","sample-36","sample-37","sample-38","sample-39","sample-44","sample-45","sample-46","sample-47","sample-48"
  )

  # create a sample ID column in sample_data (or overwrite an existing one)
  sample_data(ps_abund_rel)$SampleID <- sample_names(ps_abund_rel)

  # set the order as a factor with the desired levels
  sample_data(ps_abund_rel)$SampleID <- factor(
    sample_data(ps_abund_rel)$SampleID,
    levels = sample_order
  )

  #aes(color="Phylum", fill="Phylum") --> aes()
  #ggplot(data=data, aes(x=Sample, y=Abundance, fill=Phylum))
  my_colors <- c(
    "darkblue", "darkgoldenrod1", "darkseagreen", "darkorchid", "darkolivegreen1",
    "lightskyblue", "darkgreen", "deeppink", "khaki2", "firebrick",
    "brown1", "darkorange1", "cyan1", "royalblue4", "darksalmon",
    "darkblue", "royalblue4", "dodgerblue3", "steelblue1", "lightskyblue",
    "darkseagreen", "darkgoldenrod1", "darkseagreen", "darkorchid", "darkolivegreen1",
    "brown1", "darkorange1", "cyan1", "darkgrey"
  )

  plot_bar(ps_abund_rel, x = "SampleID", fill = "Phylum") +
    geom_bar(stat = "identity", position = "stack") +
    scale_fill_manual(values = my_colors) +
    theme(
      axis.text.x = element_text(
        angle = 85,           # 85° rotation
        hjust = 1,            # right-justified so labels are neat
        vjust = 1,
        size  = 16,
        colour = "black"
      ),
      axis.text.y = element_text(size = 12, colour = "black"),
      legend.position = "bottom"
    ) +
    guides(fill = guide_legend(nrow = 2))

Aggregate samples by group and normalize read counts within each group to correct for differences in sequencing depth.

  ## merge + normalize
  # ps_abund_rel_group  <- merge_samples(ps_abund_rel, "Group")
  # ps_abund_rel_group_ <- transform_sample_counts(
  #   ps_abund_rel_group,
  #   function(x) x / sum(x)
  # )

  # make a safe grouping variable (character, non-numeric-looking)
  sample_data(ps_abund_rel)$Group_chr <- paste0("G", as.character(sample_data(ps_abund_rel)$Group))
  ps_abund_rel_group  <- merge_samples(ps_abund_rel, "Group_chr")
  ps_abund_rel_group_ <- transform_sample_counts(ps_abund_rel_group, function(x) x / sum(x))
  # (optional) if you want x-axis labels 1..6 again:
  sample_names(ps_abund_rel_group_) <- sub("^G", "", sample_names(ps_abund_rel_group_))

  # desired order on x-axis
  #, "positive control"
  group_order <- c("1", "2", "3", "4", "5", "6")
  #, "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "PC"

  plot_bar(ps_abund_rel_group_, fill = "Phylum") +
    geom_bar(stat = "identity", position = "stack") +
    scale_x_discrete(limits = group_order) +          # <-- set order here
    scale_fill_manual(values = my_colors) +
    labs(x = "Group") +   # <- change x-axis label from "Sample" to "Group"
    theme(
      axis.text   = element_text(angle = 0, size = 16, colour = "black"),
      axis.text.x = element_text(angle = -15, hjust = 0, vjust = 1)
    )

  cat("ps_pruned:", nsamples(ps_pruned), "\n")
  cat("ps_filt:", nsamples(ps_filt), "\n")
  cat("ps_abund:", nsamples(ps_abund), "\n")
  cat("ps_abund_rel:", nsamples(ps_abund_rel), "\n")
  cat("ps_abund_rel_group:", nsamples(ps_abund_rel_group), "\n")

Bar plots at the class level

  my_colors <- c(
    "darkblue", "darkgoldenrod1", "darkseagreen", "darkorchid", "darkolivegreen1",
    "lightskyblue", "darkgreen", "deeppink", "khaki2", "firebrick",
    "brown1", "darkorange1", "cyan1", "royalblue4", "darksalmon",
    "darkblue", "royalblue4", "dodgerblue3", "steelblue1", "lightskyblue",
    "darkseagreen", "darkgoldenrod1", "darkseagreen", "darkorchid", "darkolivegreen1",
    "brown1", "darkorange1", "cyan1", "darkgrey"
  )
  plot_bar(ps_abund_rel, x = "SampleID", fill = "Class") +
    geom_bar(stat = "identity", position = "stack") +
    scale_fill_manual(values = my_colors) +
    theme(
      axis.text.x = element_text(
        angle = 85,
        hjust = 1,
        vjust = 1,
        size  = 16,
        colour = "black"
      ),
      axis.text.y = element_text(size = 12, colour = "black"),
      legend.position = "bottom"
    ) +
    guides(fill = guide_legend(nrow = 3))

Aggregate samples by group and normalize read counts within each group to correct for differences in sequencing depth.

  plot_bar(ps_abund_rel_group_, fill = "Class") +
    geom_bar(stat = "identity", position = "stack") +
    scale_x_discrete(limits = group_order) +
    scale_fill_manual(values = my_colors) +
    labs(x = "Group") +
    theme(
      axis.text   = element_text(angle = 0, size = 16, colour = "black"),
      axis.text.x = element_text(angle = -15, hjust = 0, vjust = 1)
    )

\pagebreak

Bar plots at the order level

  my_colors <- c(
    "darkblue", "darkgoldenrod1", "darkseagreen", "darkorchid",
    "darkolivegreen1", "lightskyblue", "darkgreen", "deeppink",
    "khaki2", "firebrick", "brown1", "darkorange1", "cyan1",
    "royalblue4", "darksalmon", "darkblue","royalblue4",
    "dodgerblue3", "steelblue1", "lightskyblue", "darkseagreen",
    "darkgoldenrod1", "darkseagreen", "darkorchid", "darkolivegreen1",
    "brown1", "darkorange1", "cyan1", "darkgrey"
  )

  plot_bar(ps_abund_rel, x = "SampleID", fill = "Order") +
    geom_bar(stat = "identity", position = "stack") +
    scale_fill_manual(values = my_colors) +
    theme(
      axis.text.x = element_text(
        angle = 85,
        hjust = 1,
        vjust = 1,
        size  = 16,
        colour = "black"
      ),
      axis.text.y = element_text(size = 12, colour = "black"),
      legend.position = "bottom"
    ) +
    guides(fill = guide_legend(nrow = 4))

Aggregate samples by group and normalize read counts within each group to correct for differences in sequencing depth.

  plot_bar(ps_abund_rel_group_, fill = "Order") +
    geom_bar(stat = "identity", position = "stack") +
    scale_x_discrete(limits = group_order) +
    scale_fill_manual(values = my_colors) +
    labs(x = "Group") +
    theme(
      axis.text   = element_text(angle = 0, size = 16, colour = "black"),
      axis.text.x = element_text(angle = -15, hjust = 0, vjust = 1)
    )

\pagebreak

Bar plots at the family level

  my_colors <- c(
    "#FF0000", "#000000", "#0000FF", "#C0C0C0", "#FFFFFF", "#FFFF00", "#00FFFF", "#FFA500", "#00FF00",
    "#808080", "#FF00FF", "#800080", "#FDD017", "#0000A0", "#3BB9FF", "#008000", "#800000", "#ADD8E6",
    "#F778A1", "#800517", "#736F6E", "#F52887", "#C11B17", "#5CB3FF", "#A52A2A", "#FF8040", "#2B60DE",
    "#736AFF", "#1589FF", "#98AFC7", "#8D38C9", "#307D7E", "#F6358A", "#151B54", "#6D7B8D", "#FDEEF4",
    "#FF0080", "#F88017", "#2554C7", "#FFF8C6", "#D4A017", "#306EFF", "#151B8D", "#9E7BFF", "#EAC117",
    "#E0FFFF", "#15317E", "#6C2DC7", "#FBB917", "#FCDFFF", "#15317E", "#254117", "#FAAFBE", "#357EC7"
  )
  plot_bar(ps_abund_rel, x = "SampleID", fill = "Family") +
    geom_bar(stat = "identity", position = "stack") +
    scale_fill_manual(values = my_colors) +
    theme(
      axis.text.x = element_text(
        angle = 85,
        hjust = 1,
        vjust = 1,
        size  = 16,
        colour = "black"
      ),
      axis.text.y = element_text(size = 12, colour = "black"),
      legend.position = "bottom"
    ) +
    guides(fill = guide_legend(nrow = 8))

Aggregate samples by group and normalize read counts within each group to correct for differences in sequencing depth.

  plot_bar(ps_abund_rel_group_, fill = "Family") +
    geom_bar(stat = "identity", position = "stack") +
    scale_x_discrete(limits = group_order) +
    scale_fill_manual(values = my_colors) +
    labs(x = "Group") +
    theme(
      axis.text   = element_text(angle = 0, size = 16, colour = "black"),
      axis.text.x = element_text(angle = -15, hjust = 0, vjust = 1)
    )

\pagebreak

# NOT USED: export relative abundances at the Phylum, Class, Order, and Family levels across all samples to Excel files
library(phyloseq)
library(writexl)
library(dplyr)

# Function to check for NA or empty values in a taxonomic rank
check_taxa_names <- function(tax_table, rank) {
  tax_values <- tax_table[[rank]]
  na_count <- sum(is.na(tax_values) | tax_values == "")
  cat("Number of NA or empty values in", rank, ":", na_count, "\n")
  if (na_count > 0) {
    cat("Taxa with NA or empty", rank, ":\n")
    print(tax_values[is.na(tax_values) | tax_values == ""])
  }
}

# Function to create and save relative abundance table for a given taxonomic rank with normalization
save_taxa_abundance <- function(ps, rank, output_file) {
  # Check for NA or empty values in the taxonomy table
  tax_table_df <- as.data.frame(tax_table(ps))
  check_taxa_names(tax_table_df, rank)

  # Aggregate OTUs by taxonomic rank, removing taxa with NA at the specified rank
  ps_glom <- tax_glom(ps, taxrank = rank, NArm = TRUE)

  # Extract OTU table (relative abundances)
  otu_table <- as.data.frame(otu_table(ps_glom))

  # Normalize each column to sum to 1
  otu_table_normalized <- apply(otu_table, 2, function(x) x / sum(x))

  # Convert matrix to data frame
  otu_table_normalized <- as.data.frame(otu_table_normalized)

  # Verify column sums are approximately 1.0
  col_sums <- colSums(otu_table_normalized)
  if (any(abs(col_sums - 1) > 1e-6)) {
    warning("Column sums in ", rank, " table do not equal 1.0: ", paste(col_sums, collapse = ", "))
  } else {
    cat("Column sums for ", rank, " table are all approximately 1.0\n")
  }

  # Extract taxonomy table and get the specified rank for taxa names
  tax_table_glom <- as.data.frame(tax_table(ps_glom))
  taxa_names <- tax_table_glom[[rank]]

  # Replace NA or empty strings with "Unclassified"
  taxa_names <- ifelse(is.na(taxa_names) | taxa_names == "", paste0("Unclassified_", rank), taxa_names)

  # Ensure unique row names
  taxa_names <- make.unique(taxa_names)

  # Set row names to taxa names (for internal reference)
  rownames(otu_table_normalized) <- taxa_names

  # Add taxa names as a column
  otu_table_normalized[[rank]] <- taxa_names

  # Reorder to move rank column to the first position
  otu_table_normalized <- otu_table_normalized[, c(rank, setdiff(names(otu_table_normalized), rank))]

  # Rename sample columns by removing "sample-" prefix
  names(otu_table_normalized)[-1] <- sub("sample-", "", names(otu_table_normalized)[-1])

  # Write the data frame to Excel, including the rank column
  write_xlsx(otu_table_normalized, path = output_file)
  cat("Saved", output_file, "\n")
}

# # Verify column sums of ps_abund_rel
# col_sums <- colSums(otu_table(ps_abund_rel))
# cat("Column sums of ps_abund_rel:\n")
# summary(col_sums)

## Generate Excel files for Phylum, Class, Order, and Family levels with normalization and renamed sample names
#save_taxa_abundance(ps_abund_rel, "Phylum", "relative_abundance_phylum_old.xlsx")
#save_taxa_abundance(ps_abund_rel, "Class", "relative_abundance_class_old.xlsx")
#save_taxa_abundance(ps_abund_rel, "Order", "relative_abundance_order_old.xlsx")
#save_taxa_abundance(ps_abund_rel, "Family", "relative_abundance_family_old.xlsx")
library(phyloseq)
library(writexl)
library(dplyr)

# Function to check for NA or empty values in a taxonomic rank
check_taxa_names <- function(tax_table, rank) {
  tax_values <- tax_table[[rank]]
  na_count <- sum(is.na(tax_values) | tax_values == "")
  cat("Number of NA or empty values in", rank, ":", na_count, "\n")
  if (na_count > 0) {
    cat("Taxa with NA or empty", rank, ":\n")
    print(tax_values[is.na(tax_values) | tax_values == ""])
  }
}

# Function to create and save relative abundance table for a given taxonomic rank with normalization
save_taxa_abundance <- function(ps, rank, output_file) {
  # Clean the taxonomy table by removing D_[level]__ prefixes
  tax_table_df <- as.data.frame(tax_table(ps))
  tax_table_df[[rank]] <- ifelse(is.na(tax_table_df[[rank]]) | tax_table_df[[rank]] == "",
                                 paste0("Unclassified_", rank),
                                 sub("^D_[0-9]+__(.+)", "\\1", tax_table_df[[rank]]))
  tax_table(ps) <- as.matrix(tax_table_df)  # Update taxonomy table with cleaned names

  # Check for NA or empty values in the taxonomy table
  check_taxa_names(tax_table_df, rank)

  # Aggregate OTUs by taxonomic rank, removing taxa with NA at the specified rank
  ps_glom <- tax_glom(ps, taxrank = rank, NArm = TRUE)

  # Extract OTU table (relative abundances)
  otu_table <- as.data.frame(otu_table(ps_glom))

  # Normalize each column to sum to 1
  otu_table_normalized <- apply(otu_table, 2, function(x) x / sum(x))

  # Convert matrix to data frame
  otu_table_normalized <- as.data.frame(otu_table_normalized)

  # Verify column sums are approximately 1.0
  col_sums <- colSums(otu_table_normalized)
  if (any(abs(col_sums - 1) > 1e-6)) {
    warning("Column sums in ", rank, " table do not equal 1.0: ", paste(col_sums, collapse = ", "))
  } else {
    cat("Column sums for ", rank, " table are all approximately 1.0\n")
  }

  # Extract taxonomy table and get the specified rank for taxa names
  tax_table_glom <- as.data.frame(tax_table(ps_glom))
  taxa_names <- tax_table_glom[[rank]]

  # Ensure unique row names
  taxa_names <- make.unique(taxa_names)

  # Set row names to taxa names (for internal reference)
  rownames(otu_table_normalized) <- taxa_names

  # Add taxa names as a column
  otu_table_normalized[[rank]] <- taxa_names

  # Reorder to move rank column to the first position
  otu_table_normalized <- otu_table_normalized[, c(rank, setdiff(names(otu_table_normalized), rank))]

  # Rename sample columns by removing "sample-" prefix
  names(otu_table_normalized)[-1] <- sub("sample-", "", names(otu_table_normalized)[-1])

  # Write the data frame to Excel, including the rank column
  write_xlsx(otu_table_normalized, path = output_file)
  cat("Saved", output_file, "\n")
}

# # Verify column sums of ps_abund_rel
# col_sums <- colSums(otu_table(ps_abund_rel))
# cat("Column sums of ps_abund_rel:\n")
# summary(col_sums)

## Generate Excel files for Phylum, Class, Order, and Family levels with normalization and renamed sample names
#save_taxa_abundance(ps_abund_rel, "Phylum", "relative_abundance_phylum.xlsx")
#save_taxa_abundance(ps_abund_rel, "Class", "relative_abundance_class.xlsx")
#save_taxa_abundance(ps_abund_rel, "Order", "relative_abundance_order.xlsx")
#save_taxa_abundance(ps_abund_rel, "Family", "relative_abundance_family.xlsx")

#Sum up the last two columns with the same row.names into a new column, export the file as CSV, then delete the two rows before the last, then merge the files with csv2xls into an Excel file and adapt the sheet names.
#~/Tools/csv2xls-0.4/csv_to_xls.py relative_abundance_phylum.csv relative_abundance_order.csv relative_abundance_family.csv -d$'\t' -o relative_abundance_phylum_order_family.xls;

\pagebreak

Alpha diversity

Plot the Chao1 richness estimator, observed OTUs, Shannon index, and phylogenetic diversity, grouping samples from the same group together.

# using rarefied data
#gunzip table_even4753.biom.gz
#alpha_diversity.py -i table_even42369.biom --metrics chao1,observed_otus,shannon,PD_whole_tree -o adiv_even.txt -t ../clustering/rep_set.tre
#gunzip table_even4753.biom.gz
#alpha_diversity.py -i table_even4753.biom --metrics chao1,observed_otus,shannon,PD_whole_tree -o adiv_even.txt -t ../clustering_stool/rep_set.tre
#gunzip table_even4753.biom.gz
#alpha_diversity.py -i table_even4753.biom --metrics chao1,observed_otus,shannon,PD_whole_tree -o adiv_even.txt -t ../clustering_swab/rep_set.tre
hmp.meta <- microbiome::meta(ps_rarefied)  # meta() comes from the microbiome package
hmp.meta$sam_name <- rownames(hmp.meta)

# ---- enforce Group order (edit if you have different labels) ----
hmp.meta$Group <- factor(as.character(hmp.meta$Group), levels = group_order)

# for QIIME2: read the alpha-diversity metrics
shannon <- read.table("exported_alpha/shannon/alpha-diversity.tsv", header=TRUE, sep="\t")  #cp -r ../Data_Karoline_16S_2025/exported_alpha/ .
faith_pd <- read.table("exported_alpha/faith_pd/alpha-diversity.tsv", header=TRUE, sep="\t")
observed <- read.table("exported_alpha/observed_features/alpha-diversity.tsv", header=TRUE, sep="\t")
#chao1 <- read.table("exported_alpha/chao1/alpha-diversity.tsv", header=TRUE, sep="\t")    #TODO: Check the correctness of chao1-calculation.

# Rename columns for clarity
colnames(shannon) <- c("sam_name", "shannon")
colnames(faith_pd) <- c("sam_name", "PD_whole_tree")
colnames(observed) <- c("sam_name", "observed_otus")
#colnames(chao1) <- c("sam_name", "chao1")

# Merge everything into one data frame
div.df <- Reduce(function(x, y) merge(x, y, by="sam_name"),
                  list(shannon, faith_pd, observed))

# Add the metadata
div.df <- merge(div.df, hmp.meta, by="sam_name")

# Reformat
div.df2 <- div.df[, c("sam_name", "Group", "shannon", "observed_otus", "PD_whole_tree")]
colnames(div.df2) <- c("Sample name", "Group", "Shannon", "OTU", "Phylogenetic Diversity")
write.csv(div.df2, file="alpha_diversities.txt")
knitr::kable(div.df2) %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

## ---- Summary stats (report in text/table) -----------------------------------
sum_stats <- div.df2 %>%
  group_by(Group) %>%
  summarise(n      = n(),
            mean   = mean(Shannon, na.rm = TRUE),
            sd     = sd(Shannon, na.rm = TRUE),
            median = median(Shannon, na.rm = TRUE),
            IQR    = IQR(Shannon, na.rm = TRUE),
            .groups = "drop")
# knitr::kable(sum_stats, digits = 3) %>%
# kable_styling(bootstrap_options = c("striped","hover","condensed","responsive"))
write.csv(sum_stats, "Shannon_Groups_summary.csv", row.names = FALSE)

Overall alpha-diversity: the Kruskal–Wallis test checks whether a metric (e.g., Shannon) differs among groups overall.
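For reference, `compare_means(..., method = "kruskal.test")` wraps the base-R Kruskal–Wallis test; on a small toy data frame (hypothetical values, not this study's data) the equivalent base-R call is:

```r
# toy data: three groups of eight samples with simulated Shannon values
set.seed(42)
toy <- data.frame(
  Group   = rep(c("1", "2", "3"), each = 8),
  Shannon = c(rnorm(8, mean = 3.0), rnorm(8, mean = 3.4), rnorm(8, mean = 2.8))
)

# overall nonparametric test across all groups
kruskal.test(Shannon ~ Group, data = toy)
```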

## ---- Overall test -------------------------------------------------
#https://uc-r.github.io/t_test
#We can perform the test with t.test and transform our data and we can also perform the nonparametric test with the wilcox.test function.
#stat.test.Shannon <- compare_means(
# Shannon ~ Group, data = div.df2,
# method = "t.test"
#)
#knitr::kable(stat.test.Shannon) %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

# Nonparametric overall test (robust default)
overall_kw <- compare_means(Shannon ~ Group, data = div.df2, method = "kruskal.test")
# knitr::kable(overall_kw, digits = 3) %>%
# kable_styling(bootstrap_options = c("striped","hover","condensed","responsive"))
write.csv(overall_kw, "Shannon_Groups_overall_Kruskal.csv", row.names = FALSE)
knitr::kable(overall_kw) %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
## Optional parametric overall (use if assumptions OK)
overall_anova <- compare_means(Shannon ~ Group, data = div.df2, method = "anova")
write.csv(overall_anova, "Shannon_Groups_overall_ANOVA.csv", row.names = FALSE)

## Assumption checks (optional; helps decide ANOVA vs KW)
shapiro_res <- div.df2 %>% group_by(Group) %>% shapiro_test(Shannon)
levene_res  <- div.df2 %>% levene_test(Shannon ~ Group)
write.csv(shapiro_res, "Shannon_Groups_shapiro.csv", row.names = FALSE)
write.csv(levene_res,  "Shannon_Groups_levene.csv",  row.names = FALSE)

## ---- Pairwise tests with BH correction --------------------------------------
# Wilcoxon (nonparametric), robust to empty/singleton groups
div_shannon <- div.df2 %>%
  dplyr::filter(!is.na(Shannon), !is.na(Group)) %>%
  dplyr::group_by(Group) %>%
  dplyr::filter(dplyr::n() >= 2) %>%
  dplyr::ungroup()
div_shannon$Group <- droplevels(factor(div_shannon$Group))

pw_wilcox <- tryCatch(
  {
    if (dplyr::n_distinct(div_shannon$Group) < 2) {
      tibble::tibble()
    } else {
      rstatix::pairwise_wilcox_test(div_shannon, Shannon ~ Group,
                                   p.adjust.method = "BH", exact = FALSE)
    }
  },
  error = function(e) tibble::tibble()
)

write.csv(pw_wilcox, "Shannon_Groups_pairwise_wilcox_BH.csv", row.names = FALSE)

# t-tests (parametric, optional)
pw_t <- pairwise_t_test(div.df2, Shannon ~ Group, p.adjust.method = "BH")
write.csv(pw_t, "Shannon_Groups_pairwise_t_BH.csv", row.names = FALSE)
div_df_melt <- reshape2::melt(div.df2)
#head(div_df_melt)

#https://plot.ly/r/box-plots/#horizontal-boxplot
#http://www.sthda.com/english/wiki/print.php?id=177
#https://rpkgs.datanovia.com/ggpubr/reference/as_ggplot.html
#http://www.sthda.com/english/articles/24-ggpubr-publication-ready-plots/82-ggplot2-easy-way-to-change-graphical-parameters/
#https://plot.ly/r/box-plots/#horizontal-boxplot
#library("gridExtra")
#par(mfrow=c(4,1))
p <- ggboxplot(div_df_melt, x = "Group", y = "value",
              facet.by = "variable",
              scales = "free",
              width = 0.5,
              fill = "gray", legend= "right")
#ggpar(p, xlab = FALSE, ylab = FALSE)
lev <- levels(factor(div_df_melt$Group)) # get the variables
#FITTING4: delete H47(1) in lev
#lev <- lev[-c(3)]
# make a pairwise list that we want to compare.
#my_stat_compare_means
#https://stackoverflow.com/questions/47839988/indicating-significance-with-ggplot2-in-a-boxplot-with-multiple-groups
L.pairs <- combn(seq_along(lev), 2, simplify = FALSE, FUN = function(i) lev[i]) #%>% filter(p.signif != "ns")
my_stat_compare_means  <- function (mapping = NULL, data = NULL, method = NULL, paired = FALSE,
    method.args = list(), ref.group = NULL, comparisons = NULL,
    hide.ns = FALSE, label.sep = ", ", label = NULL, label.x.npc = "left",
    label.y.npc = "top", label.x = NULL, label.y = NULL, tip.length = 0.03,
    symnum.args = list(), geom = "text", position = "identity",
    na.rm = FALSE, show.legend = NA, inherit.aes = TRUE, ...)
{
    if (!is.null(comparisons)) {
        method.info <- ggpubr:::.method_info(method)
        method <- method.info$method
        method.args <- ggpubr:::.add_item(method.args, paired = paired)
        if (method == "wilcox.test")
            method.args$exact <- FALSE
        pms <- list(...)
        size <- ifelse(is.null(pms$size), 0.3, pms$size)
        color <- ifelse(is.null(pms$color), "black", pms$color)
        map_signif_level <- FALSE
        if (is.null(label))
            label <- "p.format"
        if (ggpubr:::.is_p.signif_in_mapping(mapping) | (label %in% "p.signif")) {
            if (ggpubr:::.is_empty(symnum.args)) {
                map_signif_level <- c(`****` = 1e-04, `***` = 0.001,
                  `**` = 0.01, `*` = 0.05, ns = 1)
            } else {
               map_signif_level <- symnum.args
            }
            if (hide.ns)
                names(map_signif_level)[5] <- " "
        }
        step_increase <- ifelse(is.null(label.y), 0.12, 0)
        ggsignif::geom_signif(comparisons = comparisons, y_position = label.y,
            test = method, test.args = method.args, step_increase = step_increase,
            size = size, color = color, map_signif_level = map_signif_level,
            tip_length = tip.length, data = data)
    } else {
        mapping <- ggpubr:::.update_mapping(mapping, label)
        layer(stat = StatCompareMeans, data = data, mapping = mapping,
            geom = geom, position = position, show.legend = show.legend,
            inherit.aes = inherit.aes, params = list(label.x.npc = label.x.npc,
                label.y.npc = label.y.npc, label.x = label.x,
                label.y = label.y, label.sep = label.sep, method = method,
                method.args = method.args, paired = paired, ref.group = ref.group,
                symnum.args = symnum.args, hide.ns = hide.ns,
                na.rm = na.rm, ...))
    }
}

# Rotate the x-axis labels to 45 degrees and adjust their position
#comparisons = list(c("1", "2"), c("1", "3"), c("1", "4"), c("1", "5"), c("1", "6"), c("2", "3"), c("2", "4"), c("2", "5"), c("2", "6"), c("3", "4"), c("3", "5"), c("3", "6"), c("4", "5"), c("4", "6"), c("5", "6")),
p <- p + theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust=1, size=8))
p2 <- p +
  stat_compare_means(
    method = "wilcox.test",   # or "t.test"
    comparisons = list(c("1", "2"), c("3", "4"), c("5", "6")),
    label = "p.signif",
    p.adjust.method = "BH",
    method.args = list(exact = FALSE),
    symnum.args = list(cutpoints = c(0, 0.0001, 0.001, 0.01, 0.05, 1),
                       symbols = c("****", "***", "**", "*", "ns"))
  )
##comparisons = L.pairs,
##stat_pvalue_manual
##https://stackoverflow.com/questions/20500706/saving-multiple-ggplots-from-ls-into-one-and-separate-files-in-r
#print(p2)
#ggsave("./figures/alpha_diversity_Group.png", device="png", height = 10, width = 15)
#ggsave("./figures/alpha_diversity_Group.svg", device="svg", height = 10, width = 15)

Pairwise tests: Within each metric (Shannon/OTU/PD), Wilcoxon rank-sum tests compare groups pairwise, with BH correction for multiple testing.

library(dplyr)
library(tidyr)
library(ggpubr)
library(rstatix)
library(writexl)

div.df2 <- div.df2 %>% mutate(Group = factor(Group))

# 1) Long format for all 3 metrics
div_long <- div.df2 %>%
  select(Group, Shannon, OTU, `Phylogenetic Diversity`) %>%
  pivot_longer(
    cols = c(Shannon, OTU, `Phylogenetic Diversity`),
    names_to = "variable",
    values_to = "value"
  ) %>%
  filter(!is.na(Group), !is.na(value))

# 2) Drop groups with <2 samples within each metric
div_long2 <- div_long %>%
  group_by(variable, Group) %>%
  filter(n() >= 2) %>%
  ungroup() %>%
  mutate(Group = droplevels(Group))

# 3) ONE pairwise table for all metrics (BH within each metric)
pw_all <- div_long2 %>%
  group_by(variable) %>%
  pairwise_wilcox_test(value ~ Group, p.adjust.method = "BH", exact = FALSE) %>%
  ungroup()

write.csv(pw_all, "Pairwise_Wilcox_BH_all_metrics.csv", row.names = FALSE)
write_xlsx(pw_all, "Pairwise_Wilcox_BH_all_metrics.xlsx")
knitr::kable(pw_all) %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

# ---- desired order ----
comp_list <- list(
  c("1","2"),
  c("3","4"),
  c("5","6")
)

comp_order <- tibble(
  group1 = sapply(comp_list, `[`, 1),
  group2 = sapply(comp_list, `[`, 2),
  comp_id = seq_along(comp_list)
)

# 4) base y per facet
y_base <- div_long2 %>%
  group_by(variable) %>%
  summarise(
    y_max = max(value, na.rm = TRUE),
    y_min = min(value, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  mutate(y_span = y_max - y_min)

# 5) match order regardless of direction (A-B same as B-A)
pw_all_ordered <- pw_all %>%
  mutate(g1 = as.character(group1), g2 = as.character(group2)) %>%
  left_join(comp_order, by = c("g1"="group1","g2"="group2")) %>%
  mutate(
    comp_id = ifelse(
      is.na(comp_id),
      comp_order$comp_id[match(
        paste(g2, g1),
        paste(comp_order$group1, comp_order$group2)
      )],
      comp_id
    )
  ) %>%
  select(-g1, -g2) %>%
  filter(!is.na(comp_id))

# 6) assign y.position in your specified order within each facet
pw_all_pos <- pw_all_ordered %>%
  left_join(y_base, by = "variable") %>%
  group_by(variable) %>%
  arrange(comp_id, .by_group = TRUE) %>%
  mutate(
    step = ifelse(y_span > 0, 0.12 * y_span, 0.12),
    y.position = y_max + row_number() * step
  ) %>%
  ungroup()

# 7) plot + annotate
p <- ggboxplot(
  div_long2, x = "Group", y = "value",
  facet.by = "variable", scales = "free",
  width = 0.5, fill = "gray"
) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1, size = 8)) +
  stat_pvalue_manual(
    pw_all_pos,
    label = "p.adj.signif",
    y.position = "y.position",
    tip.length = 0.01,
    hide.ns = FALSE
  )

print(p)
## MANUALLY selected alpha diversities under host-env after 'cp alpha_diversities.txt selected_alpha_diversities.txt'
#knitr::include_graphics("./figures/alpha_diversity_Group.png")
#selected_alpha_diversities<-read.csv("selected_alpha_diversities.txt",sep="\t")
#knitr::kable(selected_alpha_diversities) %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
#!!# Beta diversity (Bray-Curtis distance)
#!!## Group1 vs Group2

#fig.cap="Beta diversity",

#for QIIME1: file:///home/jhuang/DATA/Data_Marius_16S/core_diversity_e42369/bdiv_even42369_Group/weighted_unifrac_boxplots/Group_Stats.txt

# -- for QIIME2: MANUALLY filter permanova-pairwise.csv and save as permanova-pairwise_.csv
# #grep "Permutations" exported_beta_group/permanova-pairwise.csv > permanova-pairwise_.csv
# #grep "Group1,Group2" exported_beta_group/permanova-pairwise.csv >> permanova-pairwise_.csv
# #grep "Group3,Group4" exported_beta_group/permanova-pairwise.csv >> permanova-pairwise_.csv
# beta_diversity_group_stats<-read.csv("permanova-pairwise_.csv",sep=",")
# #beta_diversity_group_stats <- beta_diversity_group_stats[beta_diversity_group_stats$Group.1 == "Group1" & beta_diversity_group_stats$Group.2 == "Group2", ]
# #beta_diversity_group_stats <- beta_diversity_group_stats[beta_diversity_group_stats$Group.1 == "Group3" & beta_diversity_group_stats$Group.2 == "Group4", ]
# knitr::kable(beta_diversity_group_stats) %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

#NOTE: Run this Phyloseq.Rmd, then run the code of MicrobiotaProcess.R to manually generate Comparison_of_Bray_Distances_Group1_vs_Group2.png and Comparison_of_Bray_Distances_Group3_vs_Group4.png, then run this Phyloseq.Rmd!

#knitr::include_graphics("./figures/Comparison_of_Bray_Distances_Group1_vs_Group2.png")

Principal coordinates analysis (PCoA) of beta-diversity based on Bray–Curtis dissimilarity

Beta-diversity (between-sample differences in bacterial community composition) was quantified using Bray–Curtis dissimilarity computed on Hellinger-transformed abundance data. Beta-diversity patterns were visualized using principal coordinates analysis (PCoA) based on the Bray–Curtis dissimilarity matrix.
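The steps above can be sketched with toy data (a minimal, self-contained example assuming the vegan and ape packages; not the exact pipeline code):

```r
library(vegan)  # decostand(), vegdist(), adonis2()
library(ape)    # pcoa()

set.seed(1)
# toy abundance matrix: 6 samples (rows) x 8 taxa (columns), two groups of three
counts <- matrix(rpois(48, lambda = 20), nrow = 6,
                 dimnames = list(paste0("s", 1:6), paste0("t", 1:8)))
md <- data.frame(Group = rep(c("A", "B"), each = 3))

# Hellinger transform: square root of per-sample relative abundances
hell <- decostand(counts, method = "hellinger")

# Bray-Curtis dissimilarity on the transformed data
bray <- vegdist(hell, method = "bray")

# principal coordinates analysis; first two axes are used for plotting
pc <- pcoa(bray)
head(pc$vectors[, 1:2])

# global PERMANOVA on the dissimilarity matrix
adonis2(bray ~ Group, data = md, permutations = 9999)
```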

Global PERMANOVA on the Bray–Curtis dissimilarity matrix indicated a significant effect of Group on overall community composition (adonis2: p = 1×10⁻⁴; 9,999 permutations).

# --- Global beta-diversity (PERMANOVA) ---
cat("```text\n")
cat(
"[PERMANOVA result]\n",
"The object contained internal attribute: PCoA ADONIS\n",
"Permutation test for adonis under reduced model\n",
"Permutation: free\n",
"Number of permutations: 9999\n\n",
"vegan::adonis2(formula = .formula, data = sampleda, permutations = permutations, method = distmethod)\n",
"          Df SumOfSqs      R2      F Pr(>F)\n",
"Model     5   6.1752 0.51654 9.8296  1e-04 ***\n",
"Residual 46   5.7797 0.48346\n",
"Total    51  11.9549 1.00000  \n",
"---\n",
"Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n",
sep = ""
)
cat("```\n")

Pairwise PERMANOVA tests were performed on Bray–Curtis dissimilarity matrices to compare beta-diversity between all pairs of sample groups (metadata column Group). For each pairwise comparison, the Bray–Curtis matrix was subset to samples from the two groups only, and significance was assessed using vegan::adonis2 with 9,999 permutations. Resulting p-values were adjusted for multiple testing using both Benjamini–Hochberg (BH/FDR) and Bonferroni corrections.
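A minimal sketch of that pairwise procedure (hypothetical helper `pairwise_adonis`, assuming the vegan package; not the exact script that produced the CSV loaded in this document):

```r
library(vegan)  # adonis2()

# For each pair of groups, subset the distance matrix to those samples,
# run adonis2, and collect R2 and p; then adjust p across all pairs.
pairwise_adonis <- function(dist_mat, groups, permutations = 9999) {
  groups <- as.character(groups)
  pairs <- combn(unique(groups), 2, simplify = FALSE)
  res <- lapply(pairs, function(pr) {
    keep  <- groups %in% pr
    d_sub <- as.dist(as.matrix(dist_mat)[keep, keep])
    g_sub <- data.frame(Group = groups[keep])
    fit <- adonis2(d_sub ~ Group, data = g_sub, permutations = permutations)
    data.frame(Group1 = pr[1], Group2 = pr[2],
               R2 = fit$R2[1], p = fit$`Pr(>F)`[1])
  })
  out <- do.call(rbind, res)
  out$p.BH         <- p.adjust(out$p, method = "BH")
  out$p.bonferroni <- p.adjust(out$p, method = "bonferroni")
  out
}
```

Usage would look like `pairwise_adonis(bray, sample_data$Group)`, where `bray` is a Bray–Curtis `dist` object and `sample_data$Group` the grouping column.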

# Read the pairwise PERMANOVA table generated by the MicrobiotaProcess script below
Bray_pairwise_PERMANOVA <- read.csv("figures_MP_Group_1_2_3_4_5_6/Bray_pairwise_PERMANOVA.csv", sep = ",")
knitr::kable(Bray_pairwise_PERMANOVA, caption = "Pairwise PERMANOVA results (distance-based community differences among Group levels).") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

# --- Ordination figures ---
knitr::include_graphics("./figures_MP_Group_1_2_3_4_5_6/PCoA.png")
knitr::include_graphics("./figures_MP_Group_1_2_3_4_5_6/PCoA3.png")

Differential abundance analysis

Differential abundance analysis aims to identify differences in the abundance of each taxon between two groups of samples, assigning a significance value to each taxon.

# ------------------------------------------------------------
#  DESeq2: non-rarefied integer counts + optional taxon prefilter
# ------------------------------------------------------------
ps_deseq <- ps_filt

Group1<-c("sample-1","sample-2","sample-3","sample-6","sample-7","sample-8","sample-10","sample-11","sample-12")
Group2<-c("sample-4","sample-5","sample-9","sample-13","sample-14","sample-15","sample-16","sample-17")
Group3<-c("sample-18","sample-19","sample-20","sample-23","sample-24","sample-25","sample-28","sample-29")
Group4<-c("sample-21","sample-22","sample-26","sample-30","sample-31","sample-32","sample-33","sample-34")
Group5<-c("sample-40","sample-41","sample-42","sample-43","sample-49","sample-50","sample-51","sample-52","sample-53")
Group6<-c("sample-35","sample-36","sample-37","sample-38","sample-39","sample-44","sample-45","sample-46","sample-47","sample-48")

Group 1 vs Group 2

ps_deseq_sel <- ps_deseq  # plain assignment suffices: R copies on modify
otu_table(ps_deseq_sel) <- otu_table(ps_deseq)[,c(Group1,Group2)]
diagdds = phyloseq_to_deseq2(ps_deseq_sel, ~Group)
diagdds$Group <- relevel(diagdds$Group, "2")
diagdds <- DESeq(
  diagdds,
  test    = "Wald",
  fitType = "parametric",
  sfType  = "poscounts"  # important: size-factor estimator robust to taxa with many zero counts
)
resultsNames(diagdds)

res = results(diagdds, cooksCutoff = FALSE)
alpha = 0.05
sigtab = res[which(res$padj < alpha), ]
sigtab = cbind(as(sigtab, "data.frame"), as(phyloseq::tax_table(ps_deseq_sel)[rownames(sigtab), ], "matrix"))
# file base name
fname <- "DEGs_Group1_vs_Group2"
write.xlsx(sigtab, file = paste0(fname, ".xlsx"), rowNames = TRUE)

kable(sigtab) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

library("ggplot2")
theme_set(theme_bw())
scale_fill_discrete <- function(palname = "Set1", ...) {
    scale_fill_brewer(palette = palname, ...)
}
x = tapply(sigtab$log2FoldChange, sigtab$Order, function(x) max(x))
x = sort(x)
sigtab$Order = factor(as.character(sigtab$Order), levels=names(x))
x = tapply(sigtab$log2FoldChange, sigtab$Family, function(x) max(x))
x = sort(x)
sigtab$Family = factor(as.character(sigtab$Family), levels=names(x))

# ggplot(sigtab, aes(x=log2FoldChange, y=Family, color=Order)) + geom_point(aes(size=padj)) + scale_size_continuous(name="padj",range=c(8,4)) +
#   theme(axis.text.x = element_text(angle = -25, hjust = 0, vjust=0.5))
# build the plot
p <- ggplot(sigtab, aes(x = log2FoldChange, y = Family, color = Order)) +
  geom_point(aes(size = padj)) +
  scale_size_continuous(name = "padj", range = c(8, 4)) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = -25, hjust = 0, vjust = 0.5))
# SVG (svglite gives crisp text)
if (!requireNamespace("svglite", quietly = TRUE)) install.packages("svglite")
ggplot2::ggsave(paste0(fname, ".svg"), plot = p, device = svglite::svglite,
                width = 8, height = 6, units = "in", dpi = 300)
# PNG
ggplot2::ggsave(paste0(fname, ".png"), plot = p, device = "png",
                width = 8, height = 6, units = "in", dpi = 300)
knitr::include_graphics(paste0(fname, ".png"))

Group 3 vs Group 4

ps_deseq_sel <- ps_deseq  # plain assignment suffices: R copies on modify
otu_table(ps_deseq_sel) <- otu_table(ps_deseq)[,c(Group3,Group4)]
diagdds = phyloseq_to_deseq2(ps_deseq_sel, ~Group)
diagdds$Group <- relevel(diagdds$Group, "4")
diagdds <- DESeq(
  diagdds,
  test    = "Wald",
  fitType = "parametric",
  sfType  = "poscounts"  # important: size-factor estimator robust to taxa with many zero counts
)
resultsNames(diagdds)

res = results(diagdds, cooksCutoff = FALSE)
alpha = 0.05
sigtab = res[which(res$padj < alpha), ]
sigtab = cbind(as(sigtab, "data.frame"), as(phyloseq::tax_table(ps_deseq_sel)[rownames(sigtab), ], "matrix"))
# file base name
fname <- "DEGs_Group3_vs_Group4"
write.xlsx(sigtab, file = paste0(fname, ".xlsx"), rowNames = TRUE)

kable(sigtab) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

library("ggplot2")
theme_set(theme_bw())
scale_fill_discrete <- function(palname = "Set1", ...) {
    scale_fill_brewer(palette = palname, ...)
}
x = tapply(sigtab$log2FoldChange, sigtab$Order, function(x) max(x))
x = sort(x)
sigtab$Order = factor(as.character(sigtab$Order), levels=names(x))
x = tapply(sigtab$log2FoldChange, sigtab$Family, function(x) max(x))
x = sort(x)
sigtab$Family = factor(as.character(sigtab$Family), levels=names(x))

# ggplot(sigtab, aes(x=log2FoldChange, y=Family, color=Order)) + geom_point(aes(size=padj)) + scale_size_continuous(name="padj",range=c(8,4)) +
#   theme(axis.text.x = element_text(angle = -25, hjust = 0, vjust=0.5))
# build the plot
p <- ggplot(sigtab, aes(x = log2FoldChange, y = Family, color = Order)) +
  geom_point(aes(size = padj)) +
  scale_size_continuous(name = "padj", range = c(8, 4)) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = -25, hjust = 0, vjust = 0.5))
# SVG (svglite gives crisp text)
if (!requireNamespace("svglite", quietly = TRUE)) install.packages("svglite")
ggplot2::ggsave(paste0(fname, ".svg"), plot = p, device = svglite::svglite,
                width = 8, height = 6, units = "in", dpi = 300)
# PNG
ggplot2::ggsave(paste0(fname, ".png"), plot = p, device = "png",
                width = 8, height = 6, units = "in", dpi = 300)
knitr::include_graphics(paste0(fname, ".png"))

Group 5 vs Group 6

ps_deseq_sel <- ps_deseq  # plain assignment suffices: R copies on modify
otu_table(ps_deseq_sel) <- otu_table(ps_deseq)[,c(Group5,Group6)]
diagdds = phyloseq_to_deseq2(ps_deseq_sel, ~Group)
diagdds$Group <- relevel(diagdds$Group, "6")
diagdds <- DESeq(
  diagdds,
  test    = "Wald",
  fitType = "parametric",
  sfType  = "poscounts"  # important: size-factor estimator robust to taxa with many zero counts
)
resultsNames(diagdds)

res = results(diagdds, cooksCutoff = FALSE)
alpha = 0.05
sigtab = res[which(res$padj < alpha), ]
sigtab = cbind(as(sigtab, "data.frame"), as(phyloseq::tax_table(ps_deseq_sel)[rownames(sigtab), ], "matrix"))
# file base name
fname <- "DEGs_Group5_vs_Group6"
write.xlsx(sigtab, file = paste0(fname, ".xlsx"), rowNames = TRUE)

kable(sigtab) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

library("ggplot2")
theme_set(theme_bw())
scale_fill_discrete <- function(palname = "Set1", ...) {
    scale_fill_brewer(palette = palname, ...)
}
x = tapply(sigtab$log2FoldChange, sigtab$Order, function(x) max(x))
x = sort(x)
sigtab$Order = factor(as.character(sigtab$Order), levels=names(x))
x = tapply(sigtab$log2FoldChange, sigtab$Family, function(x) max(x))
x = sort(x)
sigtab$Family = factor(as.character(sigtab$Family), levels=names(x))

# ggplot(sigtab, aes(x=log2FoldChange, y=Family, color=Order)) + geom_point(aes(size=padj)) + scale_size_continuous(name="padj",range=c(8,4)) +
#   theme(axis.text.x = element_text(angle = -25, hjust = 0, vjust=0.5))
# build the plot
p <- ggplot(sigtab, aes(x = log2FoldChange, y = Family, color = Order)) +
  geom_point(aes(size = padj)) +
  scale_size_continuous(name = "padj", range = c(8, 4)) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = -25, hjust = 0, vjust = 0.5))
# SVG (svglite gives crisp text)
if (!requireNamespace("svglite", quietly = TRUE)) install.packages("svglite")
ggplot2::ggsave(paste0(fname, ".svg"), plot = p, device = svglite::svglite,
                width = 8, height = 6, units = "in", dpi = 300)
# PNG
ggplot2::ggsave(paste0(fname, ".png"), plot = p, device = "png",
                width = 8, height = 6, units = "in", dpi = 300)
knitr::include_graphics(paste0(fname, ".png"))

## =========================================================
## MicrobiotaProcess_UPDATED.R
## Uses your objects:
##   - ps_filt        : for alpha + beta diversity (NOT taxa-filtered)
##   - ps_abund_rel   : for cleaner composition plots (taxa filtered for plotting)
##
## Output:
##   - Rarefaction curves + alpha diversity plots
##   - Bray (Hellinger) distance + PCoA + PERMANOVA
##   - Composition plots (Class) + heatmaps (from ps_abund_rel)
## =========================================================

#DEBUG_Update: remotes::install_github("erocoar/gghalves")
#DEBUG: update.packages(ask = FALSE)
suppressPackageStartupMessages({
  library(phyloseq)
  library(MicrobiotaProcess)
  library(microeco)
  library(ggplot2)
  library(aplot)
  library(ggrepel)
  library(vegan)
  library(ggalluvial)
  library(ggh4x)
  library(gghalves)
})

dir.create("figures_MP_Group_1_2_3_4_5_6", showWarnings = FALSE)

## -----------------------------
## 0) Define groups (sample IDs)  ---- EXACTLY as provided
Group1<-c("sample-1","sample-2","sample-3","sample-6","sample-7","sample-8","sample-10","sample-11","sample-12")
Group2<-c("sample-4","sample-5","sample-9","sample-13","sample-14","sample-15","sample-16","sample-17")
Group3<-c("sample-18","sample-19","sample-20","sample-23","sample-24","sample-25","sample-28","sample-29")
Group4<-c("sample-21","sample-22","sample-26","sample-30","sample-31","sample-32","sample-33","sample-34")
Group5<-c("sample-40","sample-41","sample-42","sample-43","sample-49","sample-50","sample-51","sample-52","sample-53")
Group6<-c("sample-35","sample-36","sample-37","sample-38","sample-39","sample-44","sample-45","sample-46","sample-47","sample-48")

## Ensure unique IDs within each vector
Group1 <- unique(Group1)
Group2 <- unique(Group2)
Group3 <- unique(Group3)
Group4 <- unique(Group4)
Group5 <- unique(Group5)
Group6 <- unique(Group6)
#PC <- unique(PC)

## The group order you want everywhere (plots + stats)
#,"PC"
group_levels <- c("Group1","Group2","Group3","Group4","Group5","Group6")

## -----------------------------
## 0.1) Sample subset (union of all groups)
#, PC
keep_samples <- unique(c(Group1, Group2, Group3, Group4, Group5, Group6))

## -----------------------------
## 0.2) Helper: assign Group as an ordered factor (membership-based)
add_group <- function(mpse_obj) {
  sn <- rownames(colData(mpse_obj))
  grp <- rep(NA_character_, length(sn))

  grp[sn %in% Group1] <- "Group1"
  grp[sn %in% Group2] <- "Group2"
  grp[sn %in% Group3] <- "Group3"
  grp[sn %in% Group4] <- "Group4"
  grp[sn %in% Group5] <- "Group5"
  grp[sn %in% Group6] <- "Group6"
  #grp[sn %in% PC]     <- "PC"

  colData(mpse_obj)$Group <- factor(grp, levels = group_levels)

  # warn if any samples weren't assigned (helps catch typos / missing IDs)
  if (any(is.na(colData(mpse_obj)$Group))) {
    warning(
      "Unassigned samples (not found in any group list): ",
      paste(sn[is.na(colData(mpse_obj)$Group)], collapse = ", ")
    )
  }
  mpse_obj
}

## -----------------------------
## 0.3) Colors (edit to taste)
group_colors <- c(
  "Group1" = "#1f77b4",  # DarkBlue
  "Group2" = "#add8e6",  # LightBlue
  "Group3" = "#006400",  # DarkGreen
  "Group4" = "#90ee90",  # LightGreen
  "Group5" = "#6a0dad",  # DarkPurple
  "Group6" = "#dda0dd"   # LightPurple
)
#  "PC"     = "#7f7f7f"   # Gray

## =========================================================
## 1) Diversity analysis object (alpha + beta)
##    IMPORTANT: start from ps_filt (all taxa retained)
## =========================================================
ps_div <- prune_samples(keep_samples, ps_filt)
ps_div <- prune_taxa(taxa_sums(ps_div) > 0, ps_div)

mpse_div <- as.MPSE(ps_div)
mpse_div <- add_group(mpse_div)

cat("\n[mpse_div] Group counts:\n")
print(table(colData(mpse_div)$Group, useNA = "ifany"))

## =========================================================
## 2) Alpha diversity (rarefaction-based)
## =========================================================
set.seed(9242)
mpse_div %<>% mp_rrarefy()  # creates RareAbundance
mpse_div %<>% mp_cal_rarecurve(.abundance = RareAbundance, chunks = 400)

# Rarefaction curves: sample + grouped
p_rare_1 <- mpse_div %>% mp_plot_rarecurve(.rare = RareAbundanceRarecurve, .alpha = Observe)

p_rare_2 <- mpse_div %>%
  mp_plot_rarecurve(.rare = RareAbundanceRarecurve, .alpha = Observe, .group = Group) +
  scale_color_manual(values = group_colors, guide = "none") +
  scale_fill_manual(values = group_colors, guide = "none")

p_rare_3 <- mpse_div %>%
  mp_plot_rarecurve(.rare = RareAbundanceRarecurve, .alpha = "Observe",
                    .group = Group, plot.group = TRUE) +
  scale_color_manual(values = group_colors, guide = "none") +
  scale_fill_manual(values = group_colors, guide = "none")

png("figures_MP_Group_1_2_3_4_5_6/rarefaction_of_samples_or_groups.png", width = 1080, height = 600)
print(p_rare_1 + p_rare_2 + p_rare_3)
dev.off()

# Alpha indices from RareAbundance
mpse_div %<>% mp_cal_alpha(.abundance = RareAbundance)

f_alpha_1 <- mpse_div %>%
  mp_plot_alpha(.group = Group, .alpha = c(Observe, Chao1, ACE, Shannon, Simpson, Pielou)) +
  scale_color_manual(values = group_colors, guide = "none") +
  scale_fill_manual(values = group_colors, guide = "none")

f_alpha_2 <- mpse_div %>%
  mp_plot_alpha(.alpha = c(Observe, Chao1, ACE, Shannon, Simpson, Pielou))

png("figures_MP_Group_1_2_3_4_5_6/alpha_diversity_comparison.png", width = 1400, height = 600)
print(f_alpha_1 / f_alpha_2)
dev.off()
#BUG: Error in `gghalves::geom_half_point()`:
#! Problem while converting geom to grob.
#ℹ Error occurred in the 2nd layer.
#Caused by error in `fun()`:
#! argument "layout" is missing, with no default
#Run `rlang::last_trace()` to see where the error occurred.
#install.packages("remotes")
#remotes::install_github("erocoar/gghalves")

## =========================================================
## 3) Beta diversity (Bray–Curtis on Hellinger)
##    IMPORTANT: use non-rarefied Abundance (not taxa-filtered)
## =========================================================
mpse_div %<>% mp_decostand(.abundance = Abundance)             # creates 'hellinger'
mpse_div %<>% mp_cal_dist(.abundance = hellinger, distmethod = "bray")

# Distance between samples
p_dist_1 <- mpse_div %>% mp_plot_dist(.distmethod = bray)
png("figures_MP_Group_1_2_3_4_5_6/distance_between_samples.png", width = 1000, height = 1000)
print(p_dist_1)
dev.off()

# Distance with group info
p_dist_2 <- mpse_div %>% mp_plot_dist(.distmethod = bray, .group = Group) +
  scale_fill_manual(values = group_colors) +
  scale_color_gradient()

png("figures_MP_Group_1_2_3_4_5_6/distance_between_samples_with_group_info.png", width = 1000, height = 1000)
print(p_dist_2)
dev.off()

# Compare distances between groups (Bray)
p_dist_3 <- mpse_div %>%
  mp_plot_dist(.distmethod = bray, .group = Group, group.test = TRUE, textsize = 6) +
  theme(
    axis.title.x = element_text(size = 14),
    axis.title.y = element_text(size = 14),
    axis.text.x  = element_text(size = 14),
    axis.text.y  = element_text(size = 14)
  )

png("figures_MP_Group_1_2_3_4_5_6/Comparison_of_Bray_Distances.png", width = 1000, height = 1000)
print(p_dist_3)
dev.off()

## PCoA + PERMANOVA (adonis2)
mpse_div %<>% mp_cal_pcoa(.abundance = hellinger, distmethod = "bray")

mpse_div %<>% mp_adonis(
  .abundance   = hellinger,
  .formula     = ~ Group,
  distmethod   = "bray",
  permutations = 9999,
  action       = "add"
)

cat("\n[PERMANOVA result]\n")
print(mpse_div %>% mp_extract_internal_attr(name = "adonis"))
#[PERMANOVA result]
#The object contained internal attribute: PCoA ADONIS
#Permutation test for adonis under reduced model
#Permutation: free
#Number of permutations: 9999
#
#vegan::adonis2(formula = .formula, data = sampleda, permutations = permutations, method = distmethod)
#         Df SumOfSqs      R2      F Pr(>F)
#Model     5   6.1752 0.51654 9.8296  1e-04 ***
#Residual 46   5.7797 0.48346
#Total    51  11.9549 1.00000
#---
#Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

## PCoA plot
p_pcoa_1 <- mpse_div %>%
  mp_plot_ord(
    .ord   = pcoa,
    .group = Group,
    .color = Group,
    .size  = 4,
    .alpha = 1,
    ellipse = TRUE,
    show.legend = TRUE
  ) +
  scale_fill_manual(values = group_colors) +
  scale_color_manual(values = group_colors) +
  theme(
    axis.text   = element_text(size = 16),
    axis.title  = element_text(size = 18),
    legend.text = element_text(size = 14),
    legend.title= element_text(size = 16)
  )

png("figures_MP_Group_1_2_3_4_5_6/PCoA.png", width = 1200, height = 1000)
print(p_pcoa_1)
dev.off()

pdf("figures_MP_Group_1_2_3_4_5_6/PCoA.pdf", width = 10, height = 8)
print(p_pcoa_1)
dev.off()

## Optional: label points
library(ggrepel)

df <- p_pcoa_1$data
df$ShortLabel <- gsub("sample-", "", rownames(colData(mpse_div)))  # or df$Sample if present

p_pcoa_2 <- p_pcoa_1 +
  geom_text_repel(
    data = df,
    aes(label = ShortLabel),
    size = 4, max.overlaps = 100
  )

png("figures_MP_Group_1_2_3_4_5_6/PCoA_labeled.png", width = 1200, height = 1000)
print(p_pcoa_2)
dev.off()

## =========================================================
## 4) Composition plots object (clean taxa set for plotting)
##    IMPORTANT: start from ps_abund_rel (plotting-filtered taxa)
## =========================================================
ps_plot <- prune_samples(keep_samples, ps_abund_rel)
ps_plot <- prune_taxa(taxa_sums(ps_plot) > 0, ps_plot)

mpse_plot <- as.MPSE(ps_plot)
mpse_plot <- add_group(mpse_plot)

## Summaries for plotting (Class)
mpse_plot %<>%
  mp_cal_abundance(.abundance = Abundance) %>%               # per sample
  mp_cal_abundance(.abundance = Abundance, .group = Group)   # per group

## Class abundance barplots (top 20)
p_class_rel <- mpse_plot %>%
  mp_plot_abundance(
    .abundance = Abundance,
    taxa.class = Class,
    topn = 20,
    relative = TRUE
  )

p_class_abs <- mpse_plot %>%
  mp_plot_abundance(
    .abundance = Abundance,
    taxa.class = Class,
    topn = 20,
    relative = FALSE
  )

png("figures_MP_Group_1_2_3_4_5_6/relative_abundance_and_abundance_samples.png", width = 1200, height = 600)
print(p_class_rel / p_class_abs)
dev.off()

## Heatmaps (grouped)
h_rel <- mpse_plot %>%
  mp_plot_abundance(
    .abundance = Abundance,
    .group = Group,
    taxa.class = Class,
    relative = TRUE,
    topn = 20,
    geom = "heatmap",
    features.dist = "euclidean",
    features.hclust = "average",
    sample.dist = "bray",
    sample.hclust = "average"
  )

h_abs <- mpse_plot %>%
  mp_plot_abundance(
    .abundance = Abundance,
    .group = Group,
    taxa.class = Class,
    relative = FALSE,
    topn = 20,
    geom = "heatmap",
    features.dist = "euclidean",
    features.hclust = "average",
    sample.dist = "bray",
    sample.hclust = "average"
  )

png("figures_MP_Group_1_2_3_4_5_6/relative_abundance_and_abundance_heatmap.png", width = 1200, height = 600)
print(aplot::plot_list(gglist = list(h_rel, h_abs), tag_levels = "A"))
dev.off()

## Group-level barplots
p_group_rel <- mpse_plot %>%
  mp_plot_abundance(
    .abundance = Abundance,
    .group = Group,
    taxa.class = Class,
    topn = 20,
    plot.group = TRUE,
    relative = TRUE
  ) +
  scale_fill_manual(values = group_colors)

p_group_abs <- mpse_plot %>%
  mp_plot_abundance(
    .abundance = Abundance,
    .group = Group,
    taxa.class = Class,
    topn = 20,
    plot.group = TRUE,
    relative = FALSE
  ) +
  scale_fill_manual(values = group_colors)

png("figures_MP_Group_1_2_3_4_5_6/relative_abundance_and_abundance_groups.png", width = 1000, height = 1000)
print(p_group_rel / p_group_abs)
dev.off()

cat("\nDONE. Outputs written to ./figures_MP_Group_1_2_3_4_5_6/\n")

## =========================================================
## CONTINUE: Export Bray distances + pairwise PERMANOVA
## (Use mpse_div from the updated script above)
## =========================================================

suppressPackageStartupMessages({
  library(dplyr)
  library(tidyr)
  library(openxlsx)
  library(vegan)
})

## -----------------------------
## Helper: get assay matrix with rows = samples, cols = features
get_sample_by_feature <- function(mpse_obj, assay_name) {
  mat <- assay(mpse_obj, assay_name)

  # sample IDs in MPSE
  samp_ids <- rownames(colData(mpse_obj))

  # If samples are columns, transpose
  if (!is.null(colnames(mat)) && all(samp_ids %in% colnames(mat))) {
    mat <- t(mat)
  }

  # Now enforce row order to match colData
  mat <- mat[samp_ids, , drop = FALSE]
  mat
}

## -----------------------------
## 1) Recompute Bray–Curtis distance (robust extraction)
hell_mat_sf <- get_sample_by_feature(mpse_div, "hellinger")  # rows=samples, cols=features
bray_dist   <- vegdist(hell_mat_sf, method = "bray")

## sanity checks
stopifnot(!any(is.na(as.matrix(bray_dist))))
stopifnot(!any(as.matrix(bray_dist) < 0, na.rm = TRUE))

## -----------------------------
## 2) Export all pairwise Bray distances to Excel
bray_mat <- as.matrix(bray_dist)
samples  <- rownames(bray_mat)

bray_df <- as.data.frame(as.table(bray_mat)) %>%
  rename(Sample1 = Var1, Sample2 = Var2, BrayDistance = Freq) %>%
  mutate(Sample1 = as.character(Sample1), Sample2 = as.character(Sample2)) %>%  # '<' is undefined for unordered factors
  filter(Sample1 < Sample2) %>%  # keep each unordered pair once
  arrange(Sample1, Sample2)

write.xlsx(bray_df, file = "figures_MP_Group_1_2_3_4_5_6/Bray_Curtis_Distances.xlsx")

## -----------------------------
## 3) Pairwise PERMANOVA (post-hoc) using vegan::adonis2
meta2 <- as.data.frame(colData(mpse_div))
meta2 <- meta2[rownames(hell_mat_sf), , drop = FALSE]
meta2$Group <- droplevels(meta2$Group)

groups <- levels(meta2$Group)

res_list <- list()
k <- 1

for (i in 1:(length(groups) - 1)) {
  for (j in (i + 1):length(groups)) {

    g1 <- groups[i]
    g2 <- groups[j]

    idx <- meta2$Group %in% c(g1, g2)
    sub_meta <- meta2[idx, , drop = FALSE]

    sub_dist <- as.dist(as.matrix(bray_dist)[idx, idx])

    ad <- adonis2(sub_dist ~ Group, data = sub_meta, permutations = 9999)

    res_list[[k]] <- data.frame(
      group1 = g1,
      group2 = g2,
      F      = ad$F[1],
      R2     = ad$R2[1],
      p      = ad$`Pr(>F)`[1]
    )
    k <- k + 1
  }
}

pair_res <- do.call(rbind, res_list)
pair_res$p_adj_BH         <- p.adjust(pair_res$p, method = "BH")
#pair_res$p_adj_Bonferroni <- p.adjust(pair_res$p, method = "bonferroni")
pair_res$p.adj.format <- format.pval(pair_res$p_adj_BH, digits = 3, eps = 1e-300)
pair_res$p.adj.signif <- cut(
  pair_res$p_adj_BH,
  breaks = c(-Inf, 1e-4, 1e-3, 1e-2, 0.05, Inf),
  labels = c("****", "***", "**", "*", "ns"),
  right = TRUE
)

write.csv(pair_res, "figures_MP_Group_1_2_3_4_5_6/Bray_pairwise_PERMANOVA.csv", row.names = FALSE)

cat("\nPairwise PERMANOVA written to: figures_MP_Group_1_2_3_4_5_6/Bray_pairwise_PERMANOVA.csv\n")
cat("Bray distance table written to: figures_MP_Group_1_2_3_4_5_6/Bray_Curtis_Distances.xlsx\n")

## =========================================================
## OPTIONAL: PCoA plot where point size = Shannon and alpha = Observe
## (requires mpse_div already has Shannon/Observe from mp_cal_alpha)
## =========================================================

p_pcoa_sizealpha <- mpse_div %>%
  mp_plot_ord(
    .ord   = pcoa,
    .group = Group,
    .color = Group,
    .size  = Shannon,
    .alpha = Observe,
    ellipse = TRUE,
    show.legend = TRUE
  ) +
  scale_fill_manual(values = group_colors) +
  scale_color_manual(values = group_colors) +
  scale_size_continuous(range = c(2, 6)) +
  theme(
    axis.text   = element_text(size = 16),
    axis.title  = element_text(size = 18),
    legend.text = element_text(size = 14),
    legend.title= element_text(size = 16)
  )

png("figures_MP_Group_1_2_3_4_5_6/PCoA_sizealpha.png", width = 1200, height = 1000)
print(p_pcoa_sizealpha)
dev.off()

pdf("figures_MP_Group_1_2_3_4_5_6/PCoA_sizealpha.pdf", width = 10, height = 8)
print(p_pcoa_sizealpha)
dev.off()

## =========================================================
## Ensure all three ordination outputs exist:
##   - PCoA  : basic (color/group)
##   - PCoA2 : size = Shannon, alpha = Observe
##   - PCoA3 : same as PCoA2 + sample labels
##
## Assumes you already have:
##   - mpse_div with: pcoa, Group, Shannon, Observe
##   - group_colors defined
## =========================================================

p1 <- mpse_div %>%
  mp_plot_ord(
    .ord   = pcoa,
    .group = Group,
    .color = Group,
    .size  = 4,
    .alpha = 1,
    ellipse = TRUE,
    show.legend = FALSE
  ) +
  scale_fill_manual(
    values = group_colors,
    guide  = guide_legend(
      keywidth = 1.6,
      keyheight = 1.6,
      label.theme = element_text(size = 16)
    )
  ) +
  scale_color_manual(
    values = group_colors,
    guide  = guide_legend(
      keywidth = 1.6,
      keyheight = 1.6,
      label.theme = element_text(size = 16)
    )
  ) +
  theme(
    axis.text   = element_text(size = 20),
    axis.title  = element_text(size = 22),
    legend.text = element_text(size = 20),
    legend.title= element_text(size = 22),
    plot.title  = element_text(size = 24, face = "bold"),
    plot.subtitle = element_text(size = 20)
  )

png("figures_MP_Group_1_2_3_4_5_6/PCoA.png", width = 1200, height = 1000)
print(p1)  # explicit print() so the plot renders when the script is source()d
dev.off()
pdf("figures_MP_Group_1_2_3_4_5_6/PCoA.pdf")
print(p1)
dev.off()

p2 <- mpse_div %>%
  mp_plot_ord(
    .ord   = pcoa,
    .group = Group,
    .color = Group,
    .size  = Shannon,
    .alpha = Observe,
    ellipse = TRUE,
    show.legend = FALSE
  ) +
  scale_fill_manual(
    values = group_colors,
    guide  = guide_legend(
      keywidth = 0.6,
      keyheight = 0.6,
      label.theme = element_text(size = 16)
    )
  ) +
  scale_color_manual(
    values = group_colors,
    guide  = guide_legend(
      keywidth = 0.6,
      keyheight = 0.6,
      label.theme = element_text(size = 16)
    )
  ) +
  scale_size_continuous(
    range = c(2, 6),
    guide = guide_legend(
      keywidth = 0.6,
      keyheight = 0.6,
      label.theme = element_text(size = 16)
    )
  ) +
  theme(
    axis.text   = element_text(size = 20),
    axis.title  = element_text(size = 22),
    legend.text = element_text(size = 20),
    legend.title= element_text(size = 22),
    plot.title  = element_text(size = 24, face = "bold"),
    plot.subtitle = element_text(size = 20)
  )

png("figures_MP_Group_1_2_3_4_5_6/PCoA2.png", width = 1200, height = 1000)
print(p2)  # explicit print() so the plot renders when the script is source()d
dev.off()
pdf("figures_MP_Group_1_2_3_4_5_6/PCoA2.pdf")
print(p2)
dev.off()

library(ggrepel)
colData(mpse_div)$ShortLabel <- gsub("sample-", "", rownames(colData(mpse_div)))

p3 <- mpse_div %>%
  mp_plot_ord(
    .ord   = pcoa,
    .group = Group,
    .color = Group,
    .size  = Shannon,
    .alpha = Observe,
    ellipse = TRUE,
    show.legend = FALSE
  ) +
  geom_text_repel(aes(label = ShortLabel), size = 5, max.overlaps = 100) +
  scale_fill_manual(
    values = group_colors,
    guide  = guide_legend(
      keywidth = 0.6,
      keyheight = 0.6,
      label.theme = element_text(size = 16)
    )
  ) +
  scale_color_manual(
    values = group_colors,
    guide  = guide_legend(
      keywidth = 0.6,
      keyheight = 0.6,
      label.theme = element_text(size = 16)
    )
  ) +
  scale_size_continuous(
    range = c(2, 6),
    guide = guide_legend(
      keywidth = 0.6,
      keyheight = 0.6,
      label.theme = element_text(size = 16)
    )
  ) +
  theme(
    axis.text   = element_text(size = 20),
    axis.title  = element_text(size = 22),
    legend.text = element_text(size = 20),
    legend.title= element_text(size = 22),
    plot.title  = element_text(size = 24, face = "bold"),
    plot.subtitle = element_text(size = 20)
  )

png("figures_MP_Group_1_2_3_4_5_6/PCoA3.png", width = 1200, height = 1000)
print(p3)  # explicit print() so the plot renders when the script is source()d
dev.off()
svg("figures_MP_Group_1_2_3_4_5_6/PCoA3.svg", width = 12, height = 10)
print(p3)
dev.off()
pdf("figures_MP_Group_1_2_3_4_5_6/PCoA3.pdf")
print(p3)
dev.off()

Guide RNA Design for a Toxin–Antitoxin Locus in CP052959 Using the GuideFinder Workflow (Data_JuliaFuchs_RNAseq_2025)

This post documents how I generated CRISPRi gRNA candidates for a toxin–antitoxin locus in Staphylococcus epidermidis HD46 using a GuideFinder-inspired pipeline (Spoto et al., 2020; mSphere). The workflow is designed to be reproducible and works on complete microbial genomes.

Target locus

Two adjacent genes form a co-transcribed toxin–antitoxin (TA) unit:

  • Toxin: HJI06_09100 (1889026–1889421; strand -)
  • Antitoxin: HJI06_09105 (1889421–1889651; strand -)

Because RNA-seq coverage across the TA boundary is continuous, the locus behaves like an operon; therefore, targeting the antitoxin-side entry region (higher coordinates on the minus strand) is a good strategy to knock down the TA unit as a whole.
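As a sketch of that strategy: on the minus strand, transcription enters the operon at the highest genomic coordinate, so the "entry region" is the top w bases of the locus span. The window size w = 200 bp here is an illustrative choice, not a pipeline parameter:

```python
def entry_region(start, end, strand, w=200):
    """Return (lo, hi) of the w-bp window at the transcriptional start.

    For a minus-strand feature the start of transcription is at the HIGHEST
    coordinate, so the window is the last w bases of the span; for plus
    strand it is the first w bases.
    """
    if strand == "-":
        return (max(start, end - w + 1), end)
    return (start, min(end, start + w - 1))

# TA unit span: toxin 1889026-1889421 plus antitoxin 1889421-1889651, strand "-"
lo, hi = entry_region(1889026, 1889651, "-", w=200)
print(lo, hi)  # -> 1889452 1889651
```

This is the coordinate arithmetic behind "targeting the antitoxin-side entry region": the resulting window lies entirely on the antitoxin side of the TA boundary.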


Overview of the pipeline

The workflow consists of four stages:

  1. Extract CDS sequences from the GenBank file → CP052959_CDS.fasta
  2. BLAST CDS to genome to generate a coordinate table → CP052959_gene_coordinates.tsv
  3. Scan promoter + gene body for NGG PAMs and generate candidate 20-mers with basic filters → CP052959_guides_preoff.tsv + BLAST hit table
  4. Off-target filtering using several scenarios (off_n1/off_n3/off_n5/off_n10/off_refined) → scan_results/.../guides_all.tsv + guides_top5.tsv

A key debugging lesson: if a target gene has no candidates, it is often because the gene scan window is too short (especially for short genes) or the promoter window is too small. Setting scan_gene_frac = 1.0 (scan the full gene) is often the fix.


Requirements

Software

  • Python 3 (Biopython)
  • R (with Biostrings, readr, dplyr, stringr, tidyr)
  • BLAST+ (makeblastdb, blastn)
  • (Optional) samtools for quick sequence validation

Input files

Download from NCBI:

  • Genome FASTA: CP052959.1.fna (or equivalent)
  • GenBank annotation: CP052959.1.gbk (or equivalent)

For convenience, I standardize filenames locally as:

  • CP052959.fna (genome)
  • CP052959.gbk (annotation)

Step-by-step commands

0) Prepare working directory

mkdir -p HD46_GuideFinder
cd HD46_GuideFinder

Place the following files in this directory:

  • CP052959.gbk
  • CP052959.fna
  • the scripts from the bottom of this post

1) Extract CDS sequences from GenBank (Python)

This creates a CDS FASTA where each header encodes metadata. Important note from troubleshooting: headers can contain spaces, and BLAST treats everything after the first space as a description rather than part of the sequence ID, so for the downstream steps it's safest to replace spaces with underscores.
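That header cleanup can be sketched in a few lines of plain Python (the function name and the example header are illustrative, not part of the pipeline scripts):

```python
def sanitize_fasta_headers(lines):
    """Replace spaces with underscores in FASTA header lines only;
    sequence lines pass through unchanged."""
    out = []
    for line in lines:
        if line.startswith(">"):
            line = line.rstrip("\n").replace(" ", "_") + "\n"
        out.append(line)
    return out

fasta = [">HJI06_09105 antitoxin CP052959\n", "ATGAAACGT\n"]
print(sanitize_fasta_headers(fasta)[0].strip())  # -> >HJI06_09105_antitoxin_CP052959
```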

✅ Run:

python 1_extract_cds_from_gbk.py \
  --gbk CP052959.gbk \
  --out CP052959_CDS.fasta

Expected output:

  • CP052959_CDS.fasta

Example log:

  • Wrote 2573 CDS sequences to CP052959_CDS.fasta

2) Pre-processing: build BLAST DB + map CDS to genome (R)

This step wraps:

  • makeblastdb
  • blastn (CDS vs genome)
  • parsing BLAST output into a coordinate table

✅ Run:

# NOTE: first replace all spaces (' ') with underscores ('_') in the headers of CP052959_CDS.fasta.
Rscript 2_PreProcessing_CP052959.R

Expected output:

  • GenomeDB_CP052959.* (BLAST database files)
  • CDS_vs_genome_CP052959.blast
  • CP052959_gene_coordinates.tsv

Sanity check (recommended):

grep -E "HJI06_09100|HJI06_09105" CP052959_gene_coordinates.tsv

You should see the correct coordinates and minus strand (consistent with complement(...) in GenBank).


3) Filter TA coordinates + create TA_targets.txt (R)

The coordinate table can contain multiple BLAST hits per locus_tag (including short, spurious alignments). So I keep only the longest hit per locus_tag and write:

  • TA_gene_coordinates.tsv
  • TA_targets.txt

✅ Run:

Rscript 3_Filter_TA_coords.R

Expected outputs:

  • TA_gene_coordinates.tsv
  • TA_targets.txt

4) Generate candidate guides (promoter + gene body) and BLAST hit table (R)

This script scans:

  • promoter window + gene body window
  • identifies NGG PAM (and CCN for reverse-strand equivalent)
  • extracts adjacent 20-nt protospacers
  • filters by GC and homopolymer
  • BLASTs guides to genome and writes a hit table

✅ Run:

Rscript 4_GuideFinder_CP052959_base.R

Expected outputs:

  • CP052959_guides_preoff.tsv
  • BLASTguides.fasta
  • MatchGuides.blast
  • CP052959_blast_hits.tsv

Troubleshooting note (from my run)

Initially, HJI06_09105 produced no candidates. The fix was to adjust scanning parameters:

  • scan_gene_frac <- 1.0 (scan full gene; crucial for short antitoxin genes)
  • increase promoter length (e.g., 600 bp)
  • optionally relax GC/homopolymer thresholds slightly

5) Off-target filtering (R)

This script reads:

  • CP052959_guides_preoff.tsv
  • CP052959_blast_hits.tsv

Then produces multiple filtering scenarios:

  • off_n1 / off_n3 / off_n5 / off_n10: keep guides where the number of strict genomic hits ≤ N
  • off_refined: remove any guide with a second perfect 20/20 match elsewhere; allow only “weak” secondary hits

✅ Run:

Rscript 5_ScanOfftarget_CP052959.R

Expected output structure:

scan_results/
  off_n1/
  off_n3/
  off_n5/
  off_n10/
  off_refined/

Each contains:

  • guides_all.tsv
  • guides_top5.tsv (ranked by dist_to_tss and GC)

Key concepts (explained from the debugging/communication notes)

What is PAM?

PAM = protospacer adjacent motif. For SpCas9/dCas9, PAM is typically NGG. Cas9/dCas9 binding normally requires a correct PAM adjacent to the DNA target site.
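As a minimal illustration (plain Python, not part of the pipeline), NGG PAM positions on a forward-strand string can be found with a regex lookahead, which makes the search overlap-safe:

```python
import re

seq = "ATGCGGTACCGGA"
# 0-based start positions of every NGG PAM on the forward strand;
# the zero-width lookahead allows overlapping matches to be reported.
pam_positions = [m.start() for m in re.finditer(r"(?=[ACGT]GG)", seq)]
# pam_positions == [3, 9]: CGG at index 3 and CGG at index 9
```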

What is guide_seq vs protospacer?

In this pipeline:

  • guide_seq is the 20-nt protospacer sequence extracted from the genome adjacent to PAM.
  • The gRNA spacer you clone is usually this same 20-nt string (PAM is not included in gRNA).

What does guide_strand mean?

  • guide_strand = +: the 20-nt guide_seq is exactly the sequence on the genome forward strand (CP052959.fna) at guide_start..guide_end.
  • guide_strand = -: the guide_seq is the reverse complement of the forward strand sequence at that coordinate interval.

It is not a problem if a paired design uses one “+” and one “-” guide. They simply refer to target sites on different strands.
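The strand relationship can be checked in a couple of lines of plain Python. The forward-strand sequence shown here is simply the reverse complement of the minus-strand Guide B spacer from this post, derived on the spot rather than independently read from the genome:

```python
comp = str.maketrans("ACGT", "TGCA")

def revcomp(s: str) -> str:
    """Reverse complement of a DNA string."""
    return s.translate(comp)[::-1]

minus_guide = "CATTCTGACTTTATGGAAAC"   # a guide reported with guide_strand = "-"
fwd_site = revcomp(minus_guide)        # what you would read on the forward strand
# fwd_site == "GTTTCCATAAAGTCAGAATG", and revcomp(fwd_site) recovers the guide
```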

Why do some target sites appear multiple times?

Two reasons:

  1. The same coordinate-defined target may appear as a forward/reverse complement pair (because both PAM contexts were scanned).
  2. In tight loci like TA operons, promoter/gene scan windows can overlap between toxin and antitoxin, so the same genomic site may be reported under both locus_tags. For practical ordering, treat each coordinate-defined site as one physical target.
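For ordering purposes, collapsing such duplicates by coordinates is straightforward; this is an illustrative sketch keyed on the coordinate fields of the guides table described above:

```python
# Collapse duplicate reports of the same physical site: key on
# (guide_start, guide_end, guide_strand) and ignore which locus_tag
# or region reported it.
guides = [
    {"locus_tag": "HJI06_09100", "guide_start": 1889575, "guide_end": 1889594, "guide_strand": "+"},
    {"locus_tag": "HJI06_09105", "guide_start": 1889575, "guide_end": 1889594, "guide_strand": "+"},
]
unique_sites = {(g["guide_start"], g["guide_end"], g["guide_strand"]) for g in guides}
# len(unique_sites) == 1: one physical target, reported under two locus_tags
```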

Final candidate selection (from my off_refined output)

After off_refined filtering, I had 10 candidates covering both genes.

Recommended single guide (entry-proximal for TA-unit knockdown):

  • CP052959|HJI06_09105|gene|1889575|+
  • guide_seq: AGTGCTGCGATCACTTCTGT
  • PAM: CGG
  • This is entry-proximal (antitoxin-side) and passed the strict off_refined rule.

Recommended “enhanced” 2-guide set:

  • Guide A (entry-proximal): CP052959|HJI06_09105|gene|1889575|+
  • Guide B (upstream roadblock): CP052959|HJI06_09105|promoter|1889908|- (guide_seq: CATTCTGACTTTATGGAAAC, PAM AGG)
  • These are ~333 bp apart and both pass off_refined.

Reproducibility note: verifying a candidate sequence in the genome

I used samtools faidx to confirm that the guide_seq matches the genome coordinates:

samtools faidx CP052959.fna
samtools faidx CP052959.fna CP052959.1:1889575-1889594

Expected sequence: AGTGCTGCGATCACTTCTGT
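The `samtools faidx` region syntax uses 1-based inclusive coordinates, which are easy to confuse with Python's 0-based half-open slicing; a tiny helper (toy sequence, illustration only) makes the mapping explicit:

```python
def fetch_1based(seq: str, start: int, end: int) -> str:
    """Return seq[start..end] using 1-based inclusive coordinates,
    the same convention as `samtools faidx chr:start-end`."""
    return seq[start - 1:end]

toy = "AAGTGCTGCG"
# faidx-style region toy:2-6 corresponds to fetch_1based(toy, 2, 6) -> "AGTGC"
```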


Raw code (scripts used)

Below are the exact scripts used so you can reproduce this pipeline on another computer.


1) 1_extract_cds_from_gbk.py

from Bio import SeqIO
from Bio.SeqRecord import SeqRecord

gbk_file = "CP052959.gbk"
out_fasta = "CP052959_CDS.fasta"

records = []

for rec in SeqIO.parse(gbk_file, "genbank"):
    for feature in rec.features:
        if feature.type != "CDS":
            continue

        seq = feature.extract(rec.seq)

        locus_tag = feature.qualifiers.get("locus_tag", ["unknown"])[0]
        gene      = feature.qualifiers.get("gene", [""])[0]
        product   = feature.qualifiers.get("product", [""])[0]

        header_parts = [locus_tag]
        if gene:
            header_parts.append(gene)
        if product:
            header_parts.append(product)

        header = "|".join(header_parts)

        rec_out = SeqRecord(
            seq,
            id=header,
            description=""
        )
        records.append(rec_out)

SeqIO.write(records, out_fasta, "fasta")

print(f"Wrote {len(records)} CDS sequences to {out_fasta}")

2) 2_PreProcessing_CP052959.R

# PreProcessing_CP052959.R
# Build BLAST DB from CP052959.fasta and align CP052959_CDS.fasta to get gene coordinates

# Packages
if (!requireNamespace("Biostrings", quietly = TRUE)) {
  install.packages("BiocManager")
  BiocManager::install("Biostrings")
}
if (!requireNamespace("readr", quietly = TRUE)) install.packages("readr")
if (!requireNamespace("dplyr", quietly = TRUE)) install.packages("dplyr")
if (!requireNamespace("stringr", quietly = TRUE)) install.packages("stringr")

library(Biostrings)
library(readr)
library(dplyr)
library(stringr)

genome_fasta <- "CP052959.fna"
cds_fasta    <- "CP052959_CDS.fasta"
db_name      <- "GenomeDB_CP052959"
blast_out    <- "CDS_vs_genome_CP052959.blast"
coord_table  <- "CP052959_gene_coordinates.tsv"

message("Using genome FASTA: ", genome_fasta)
message("Using CDS FASTA:    ", cds_fasta)

# 1. Make BLAST database
cmd_db <- sprintf("makeblastdb -in %s -dbtype nucl -out %s",
                  shQuote(genome_fasta), shQuote(db_name))
message("Running: ", cmd_db)
system(cmd_db)

# 2. BLAST CDS vs genome to get coordinates
# outfmt: qseqid sseqid sstart send sstrand length pident bitscore
cmd_blast <- paste(
  "blastn",
  "-task blastn",
  "-query", shQuote(cds_fasta),
  "-db", shQuote(db_name),
  "-outfmt '6 qseqid sseqid sstart send sstrand length pident bitscore'",
  "-max_target_seqs 1",
  "-perc_identity 95",
  "-out", shQuote(blast_out)
)
message("Running: ", cmd_blast)
system(cmd_blast)

# 3. Load BLAST output
cols <- c("qseqid", "sseqid", "sstart", "send", "sstrand",
          "length", "pident", "bitscore")

blast <- read_tsv(blast_out,
                  col_names = cols,
                  show_col_types = FALSE,
                  comment = "#")

if (nrow(blast) == 0) {
  stop("No BLAST hits found. Check that CP052959_CDS.fasta matches CP052959.fasta.")
}

# 4. Normalise coordinates (start < end) and define strand
blast2 <- blast %>%
  mutate(
    start  = pmin(sstart, send),
    end    = pmax(sstart, send),
    strand = if_else(sstart <= send, "+", "-")
  ) %>%
  select(qseqid, sseqid, start, end, strand, length, pident, bitscore)

# 5. Parse gene ID / name from qseqid (header from CP052959_CDS.fasta)
# Format: locus_tag|gene|product (if available)
blast2 <- blast2 %>%
  mutate(
    locus_tag = str_split_fixed(qseqid, "\\|", 3)[, 1],
    gene      = str_split_fixed(qseqid, "\\|", 3)[, 2]
  )

# 6. Save coordinate table
coord <- blast2 %>%
  transmute(
    gene_id    = if_else(gene == "", locus_tag, gene),
    locus_tag  = locus_tag,
    contig_id  = sseqid,
    start      = start,
    end        = end,
    strand     = strand,
    length_nt  = length,
    pident     = pident,
    bitscore   = bitscore
  )

write_tsv(coord, coord_table)
message("Wrote gene coordinate table to: ", coord_table)

3) 3_Filter_TA_coords.R

#!/usr/bin/env Rscript

suppressPackageStartupMessages({
  library(readr)
  library(dplyr)
})

# ---------- Input / output ----------
coord_in  <- "CP052959_gene_coordinates.tsv"
coord_out <- "TA_gene_coordinates.tsv"
targets_out <- "TA_targets.txt"

# TA targets
targets <- c("HJI06_09100", "HJI06_09105")

# ---------- Read the coordinate table ----------
coords <- read_tsv(coord_in, col_names = FALSE, show_col_types = FALSE)

# Expected 9 columns: product, locus_tag, seqid, start, end, strand, aln_len, pident, bitscore
if (ncol(coords) < 9) {
  stop("Input coord table has < 9 columns: ", coord_in)
}

colnames(coords)[1:9] <- c("product","locus_tag","seqid","start","end","strand","aln_len","pident","bitscore")

# Coerce types (character columns would break slice_max)
coords <- coords %>%
  mutate(
    locus_tag = as.character(locus_tag),
    aln_len = as.integer(aln_len),
    pident = as.numeric(pident),
    bitscore = as.numeric(bitscore)
  )

# ---------- Filter: keep TA targets only + longest hit per gene ----------
coords_ta <- coords %>%
  filter(locus_tag %in% targets) %>%
  group_by(locus_tag) %>%
  slice_max(order_by = aln_len, n = 1, with_ties = FALSE) %>%
  ungroup()

# ---------- Sanity checks ----------
missing <- setdiff(targets, unique(coords_ta$locus_tag))
if (length(missing) > 0) {
  stop("Missing targets in filtered results: ", paste(missing, collapse = ", "),
       "\nCheck that locus_tags exist in ", coord_in)
}

if (nrow(coords_ta) != length(targets)) {
  warning("Filtered rows (", nrow(coords_ta), ") != number of targets (", length(targets), ").",
          "\nThis can happen if duplicate rows/ties were removed; please inspect output.")
}

# ---------- Write outputs ----------
write_tsv(coords_ta, coord_out)

# targets.txt: one locus_tag per line (written in the given order)
writeLines(targets, targets_out)

message("Wrote: ", coord_out)
message("Wrote: ", targets_out)
message("\nFiltered TA coordinates:")
print(coords_ta)

4) 4_GuideFinder_CP052959_base.R

#!/usr/bin/env Rscript

# GuideFinder_CP052959_base.R
# Purpose:
#   1) Generate all candidate guides for the TA targets only (NGG PAM, 20 bp)
#   2) Compute dist_to_tss (negative in the promoter, positive in the gene body, along the direction of transcription)
#   3) Write guides_preoff.tsv + a guides FASTA
#   4) Run BLAST for genome-wide hit counting and write blast_hits.tsv
#
# Inputs:
#   CP052959.fna                  (whole-genome sequence, single-record FASTA; can be a symlink to CP052959.1.fna)
#   CP052959_gene_coordinates.tsv (produced by PreProcessing_CP052959.R)
#   TA_targets.txt                (two lines: HJI06_09100 / HJI06_09105)
#
# Outputs:
#   CP052959_guides_preoff.tsv
#   BLASTguides.fasta
#   MatchGuides.blast
#   CP052959_blast_hits.tsv

suppressPackageStartupMessages({
  if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
  if (!requireNamespace("Biostrings", quietly = TRUE)) BiocManager::install("Biostrings")
  if (!requireNamespace("readr", quietly = TRUE)) install.packages("readr")
  if (!requireNamespace("dplyr", quietly = TRUE)) install.packages("dplyr")
  if (!requireNamespace("stringr", quietly = TRUE)) install.packages("stringr")
  if (!requireNamespace("tidyr", quietly = TRUE)) install.packages("tidyr")

  library(Biostrings)
  library(readr)
  library(dplyr)
  library(stringr)
  library(tidyr)
})

# ---------- Parameters ----------
genome_fasta        <- "CP052959.fna"
coord_table         <- "CP052959_gene_coordinates.tsv"   # produced by PreProcessing_CP052959.R
targets_file        <- "TA_targets.txt"                  # run only the two TA genes
blast_db_name       <- "GenomeDB_CP052959"               # BLAST DB prefix built in pre-processing
guides_fasta        <- "BLASTguides.fasta"
blast_matches_file  <- "MatchGuides.blast"
guides_preoff_tsv   <- "CP052959_guides_preoff.tsv"
blast_hits_tsv      <- "CP052959_blast_hits.tsv"

guide_len      <- 20L
pam            <- "NGG"              # SpCas9
#promoter_len   <- 200L              # upstream scan length from the TSS (bp); adjust to 100/300 as needed
#scan_gene_frac <- 0.30              # scan the first 30% of the gene body (better suited to CRISPRi)
#gc_min         <- 0.30              # S. epidermidis is GC-poor; 0.30-0.35 recommended
#gc_max         <- 0.80              # rarely triggered, but kept
#homopolymer_n  <- 4L                # filter runs of >= 4 consecutive A or T (use 5 for a looser filter)
promoter_len   <- 600L     # was 200L; enlarged to give more PAM opportunities in the upstream/entry region
scan_gene_frac <- 1.00     # was 0.30; scan the full gene body (essential for short genes)
gc_min         <- 0.25     # was 0.30; slightly relaxed (S. epidermidis GC is low; avoid filtering everything out)
gc_max         <- 0.80
homopolymer_n  <- 5L       # was 4L; slightly relaxed (avoid killing the only remaining candidates)

# BLAST parameters: used for hit counting (not the final off_refined rule)
blast_task     <- "blastn-short"
blast_wordsize <- 7L
blast_evalue   <- 1000

# ---------- Helper functions ----------
stopif_file_missing <- function(path) {
  if (!file.exists(path)) stop("File not found: ", path)
}

calc_gc <- function(seq) {
  s <- as.character(seq)
  if (nchar(s) == 0) return(NA_real_)
  g <- str_count(s, "G")
  c <- str_count(s, "C")
  (g + c) / nchar(s)
}

has_homopolymer_AT <- function(seq, n = 4L) {
  s <- as.character(seq)
  patA <- paste0("A{", n, ",}")
  patT <- paste0("T{", n, ",}")
  str_detect(s, patA) | str_detect(s, patT)
}

revcomp_str <- function(s) {
  as.character(reverseComplement(DNAString(s)))
}

# Scan the forward strand for NGG: returns a data.frame (protospacer + PAM)
scan_forward_NGG <- function(dna, offset_genome_1based = 1L, region_label = "region") {
  # dna: DNAString (region sequence, 5'->3' on the genome forward strand)
  s <- as.character(dna)
  L <- nchar(s)
  out <- list()

  # i: protospacer start (0-based in R string positions)
  # protospacer = s[i+1 .. i+20]
  # PAM         = s[i+21 .. i+23] must match NGG
  max_i <- L - (guide_len + 3L)
  if (max_i < 0) return(tibble())

  idx <- 0L
  for (i in 0:max_i) {
    pam_seq <- substr(s, i + guide_len + 1L, i + guide_len + 3L)
    if (nchar(pam_seq) != 3) next
    if (!str_detect(pam_seq, "^[ACGT]GG$")) next

    prot <- substr(s, i + 1L, i + guide_len)
    if (nchar(prot) != guide_len) next

    guide_start <- offset_genome_1based + i
    guide_end   <- guide_start + guide_len - 1L
    pam_start   <- guide_end + 1L
    pam_end     <- pam_start + 2L

    idx <- idx + 1L
    out[[idx]] <- tibble(
      guide_seq   = prot,        # conventionally reported as the protospacer (PAM excluded)
      pam_seq     = pam_seq,
      guide_strand = "+",
      guide_start = guide_start,
      guide_end   = guide_end,
      pam_start   = pam_start,
      pam_end     = pam_end,
      region      = region_label
    )
  }
  bind_rows(out)
}

# Scan the reverse strand: equivalent to scanning CCN on the forward strand (PAM on the reverse strand = NGG)
scan_reverse_CCN <- function(dna, offset_genome_1based = 1L, region_label = "region") {
  s <- as.character(dna)
  L <- nchar(s)
  out <- list()

  # Find CCN on the forward strand: pam_fwd = CCN (positions j..j+2)
  # On the reverse strand the PAM = NGG; the matching protospacer on the forward strand is the 20 nt immediately after pam_fwd (j+3..j+22)
  # The reported guide_seq is the reverse complement of that 20 nt (5'->3')
  max_j <- L - (3L + guide_len)
  if (max_j < 0) return(tibble())

  idx <- 0L
  for (j in 0:max_j) {
    pam_fwd <- substr(s, j + 1L, j + 3L)
    if (nchar(pam_fwd) != 3) next
    if (!str_detect(pam_fwd, "^CC[ACGT]$")) next

    prot_fwd <- substr(s, j + 4L, j + 3L + guide_len)  # downstream 20nt
    if (nchar(prot_fwd) != guide_len) next

    guide_seq <- revcomp_str(prot_fwd)
    # Genome coordinates (1-based): prot_fwd spans [j+4 .. j+23] (length 20)
    guide_start <- offset_genome_1based + j + 3L
    guide_end   <- guide_start + guide_len - 1L
    pam_start   <- offset_genome_1based + j
    pam_end     <- pam_start + 2L

    idx <- idx + 1L
    out[[idx]] <- tibble(
      guide_seq   = guide_seq,
      pam_seq     = revcomp_str(pam_fwd),   # after reverse complement this is NGG
      guide_strand = "-",
      guide_start = guide_start,
      guide_end   = guide_end,
      pam_start   = pam_start,
      pam_end     = pam_end,
      region      = region_label
    )
  }
  bind_rows(out)
}

# dist_to_tss: distance (bp) from guide to TSS, measured along the direction of transcription
# Convention:
#   + strand: TSS ~ gene_start (start); promoter is start-promoter_len .. start-1 (negative dist)
#   - strand: TSS ~ gene_end (end); promoter is end+1 .. end+promoter_len (negative dist)
calc_dist_to_tss <- function(gene_start, gene_end, gene_strand, guide_start, guide_end) {
  if (gene_strand == "+") {
    # Use the guide's 5' end (its leading edge in the direction of transcription) as an approximation
    guide_5p <- guide_start
    as.integer(guide_5p - gene_start)
  } else {
    # On the - strand transcription runs from high to low coordinates; the 5' end is at the larger coordinate
    guide_5p <- guide_end
    as.integer(gene_end - guide_5p)
  }
}

# ---------- Input checks ----------
stopif_file_missing(genome_fasta)
stopif_file_missing(coord_table)
stopif_file_missing(targets_file)

message("Using genome FASTA: ", genome_fasta)
message("Using coord table:  ", coord_table)
message("Using targets:      ", targets_file)

targets <- readLines(targets_file)
targets <- targets[targets != ""]
if (length(targets) == 0) stop("targets_file is empty: ", targets_file)
message("Targets: ", paste(targets, collapse = ", "))

# ---------- Read the genome ----------
genome_set <- readDNAStringSet(genome_fasta)
if (length(genome_set) < 1) stop("No sequences in genome FASTA: ", genome_fasta)
if (length(genome_set) > 1) {
  message("Warning: genome FASTA has multiple sequences. Using the first one: ", names(genome_set)[1])
}
genome <- genome_set[[1]]
genome_len <- length(genome)
genome_id <- names(genome_set)[1]
message("Genome length: ", genome_len, " bp; genome_id: ", genome_id)

# ---------- Read the coordinate table and filter (key: targets only + longest hit per gene) ----------
# The table has 9 columns (inferred from the grep output):
# V1=product, V2=locus_tag, V3=seqid, V4=start, V5=end, V6=strand, V7=aln_len, V8=pident, V9=bitscore
coords_raw <- read_tsv(coord_table, col_names = FALSE, show_col_types = FALSE)
if (ncol(coords_raw) < 9) stop("coord_table seems not to have 9 columns: ", coord_table)

coords <- coords_raw %>%
  transmute(
    product   = as.character(X1),
    locus_tag = as.character(X2),
    seqid     = as.character(X3),
    start     = as.integer(X4),
    end       = as.integer(X5),
    strand    = as.character(X6),
    aln_len   = as.integer(X7),
    pident    = as.numeric(X8),
    bitscore  = as.numeric(X9)
  ) %>%
  filter(locus_tag %in% targets) %>%
  group_by(locus_tag) %>%
  slice_max(order_by = aln_len, n = 1, with_ties = FALSE) %>%
  ungroup()

# Sanity check: all targets must be present
missing_targets <- setdiff(targets, coords$locus_tag)
if (length(missing_targets) > 0) {
  stop("Some targets not found in coord_table after filtering: ", paste(missing_targets, collapse = ", "))
}

message("Filtered coord rows: ", nrow(coords))
print(coords)

# ---------- Generate candidate guides gene by gene ----------
all_guides <- list()
gid <- 0L

for (i in seq_len(nrow(coords))) {
  row <- coords[i, ]

  gene_start <- min(row$start, row$end)
  gene_end   <- max(row$start, row$end)
  gene_strand <- row$strand
  locus <- row$locus_tag
  prod  <- row$product

  gene_len <- gene_end - gene_start + 1L
  scan_gene_len <- max(1L, floor(gene_len * scan_gene_frac))

  # Promoter interval ("upstream" defined relative to the strand)
  if (gene_strand == "+") {
    tss <- gene_start
    prom_start <- max(1L, gene_start - promoter_len)
    prom_end   <- gene_start - 1L
    gene_scan_start <- gene_start
    gene_scan_end   <- min(gene_end, gene_start + scan_gene_len - 1L)
  } else {
    tss <- gene_end
    prom_start <- gene_end + 1L
    prom_end   <- min(genome_len, gene_end + promoter_len)
    gene_scan_end   <- gene_end
    gene_scan_start <- max(gene_start, gene_end - scan_gene_len + 1L)
  }

  message("\n--- ", locus, " (", gene_strand, ") ---")
  message("Gene: ", gene_start, "..", gene_end, " (len=", gene_len, "), scan gene part=", scan_gene_len,
          ", TSS~", tss)
  message("Promoter region: ", prom_start, "..", prom_end)
  message("Gene-scan region:", gene_scan_start, "..", gene_scan_end)

  # Extract promoter / gene-scan sequences (sliced by forward-strand genome coordinates)
  guides_gene <- tibble()
  guides_prom <- tibble()

  if (prom_start <= prom_end) {
    dna_prom <- subseq(genome, start = prom_start, end = prom_end)
    guides_prom <- bind_rows(
      scan_forward_NGG(dna_prom, offset_genome_1based = prom_start, region_label = "promoter"),
      scan_reverse_CCN(dna_prom, offset_genome_1based = prom_start, region_label = "promoter")
    )
  }

  if (gene_scan_start <= gene_scan_end) {
    dna_gene <- subseq(genome, start = gene_scan_start, end = gene_scan_end)
    guides_gene <- bind_rows(
      scan_forward_NGG(dna_gene, offset_genome_1based = gene_scan_start, region_label = "gene"),
      scan_reverse_CCN(dna_gene, offset_genome_1based = gene_scan_start, region_label = "gene")
    )
  }

  guides <- bind_rows(guides_prom, guides_gene)
  if (nrow(guides) == 0) {
    message("No NGG candidates found in promoter+gene-scan region for ", locus)
    next
  }

  # Filter: GC + homopolymer
  guides <- guides %>%
    mutate(
      locus_tag = locus,
      product   = prod,
      gene_start = gene_start,
      gene_end   = gene_end,
      gene_strand = gene_strand,
      tss_pos   = tss,
      gc        = vapply(guide_seq, function(x) calc_gc(DNAString(x)), numeric(1)),
      has_hpoly = vapply(guide_seq, function(x) has_homopolymer_AT(x, homopolymer_n), logical(1)),
      dist_to_tss = mapply(
        FUN = calc_dist_to_tss,
        gene_start = gene_start,
        gene_end   = gene_end,
        gene_strand = gene_strand,
        guide_start = guide_start,
        guide_end   = guide_end
      )
    ) %>%
    filter(!is.na(gc)) %>%
    filter(gc >= gc_min, gc <= gc_max) %>%
    filter(!has_hpoly)

  if (nrow(guides) == 0) {
    message("All candidates filtered out by GC/homopolymer for ", locus)
    next
  }

  # Give each guide a unique ID
  guides <- guides %>%
    arrange(region, abs(dist_to_tss), desc(gc)) %>%
    mutate(
      guide_id = {
        # stable and readable: CP052959|locus|region|start|strand
        paste0("CP052959|", locus_tag, "|", region, "|", guide_start, "|", guide_strand)
      }
    )

  all_guides[[length(all_guides) + 1L]] <- guides
  message("Kept guides after filters: ", nrow(guides))
}

guides_df <- bind_rows(all_guides)

if (nrow(guides_df) == 0) {
  stop("No guides generated after filtering. Consider lowering gc_min or relaxing homopolymer_n/promoter_len.")
}

# Write the preoff table
write_tsv(guides_df %>%
            select(
              guide_id, locus_tag, product, region,
              guide_seq, pam_seq, guide_strand,
              guide_start, guide_end, pam_start, pam_end,
              gene_start, gene_end, gene_strand, tss_pos, dist_to_tss,
              gc, has_hpoly
            ),
          guides_preoff_tsv)

message("\nWrote guides_preoff: ", guides_preoff_tsv)

# ---------- Write BLASTguides.fasta ----------
# Note: the sequences written out are guide_seq (20 nt)
guides_set <- DNAStringSet(guides_df$guide_seq)
names(guides_set) <- guides_df$guide_id
writeXStringSet(guides_set, guides_fasta, format = "fasta")
message("Wrote guides FASTA for BLAST: ", guides_fasta)

# ---------- Run BLAST ----------
# Check whether the BLAST DB exists; if not, try to build it
db_nsq <- paste0(blast_db_name, ".nsq")
db_nin <- paste0(blast_db_name, ".nin")
db_nhr <- paste0(blast_db_name, ".nhr")
if (!file.exists(db_nsq) && !file.exists(db_nin) && !file.exists(db_nhr)) {
  message("BLAST DB not found (", blast_db_name, "). Trying to build it from ", genome_fasta)
  cmd_mk <- sprintf("makeblastdb -in '%s' -dbtype nucl -out '%s'", genome_fasta, blast_db_name)
  message("Running: ", cmd_mk)
  system(cmd_mk, intern = FALSE, ignore.stdout = TRUE, ignore.stderr = FALSE)
}

cmd_blast <- sprintf(
  "blastn -task %s -word_size %d -evalue %g -query '%s' -db '%s' -outfmt '6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore' -max_target_seqs 20 -out '%s'",
  blast_task, blast_wordsize, blast_evalue, guides_fasta, blast_db_name, blast_matches_file
)
message("Running: ", cmd_blast)
system(cmd_blast, intern = FALSE, ignore.stdout = TRUE, ignore.stderr = FALSE)

stopif_file_missing(blast_matches_file)

# ---------- Parse BLAST output ----------
blast_raw <- read_tsv(
  blast_matches_file,
  col_names = c("qseqid","sseqid","pident","length","mismatch","gapopen","qstart","qend","sstart","send","evalue","bitscore"),
  show_col_types = FALSE
)

if (nrow(blast_raw) == 0) {
  stop("No BLAST hits found. Check that guides_fasta and genome DB match.")
}

# Normalise sstart/send to min/max (easier downstream checks)
blast_hits <- blast_raw %>%
  mutate(
    smin = pmin(sstart, send),
    smax = pmax(sstart, send),
    hit_strand = ifelse(sstart <= send, "+", "-")
  )

# Hit counting (rough: all hits; the refined off-target call is done later by the ScanOfftarget script)
hit_counts <- blast_hits %>%
  count(qseqid, name = "n_hits") %>%
  rename(guide_id = qseqid)

# Join guide annotations
blast_merged <- blast_hits %>%
  rename(guide_id = qseqid) %>%
  left_join(guides_df %>% select(guide_id, locus_tag, region, guide_seq, guide_strand, guide_start, guide_end), by = "guide_id") %>%
  left_join(hit_counts, by = "guide_id") %>%
  arrange(locus_tag, region, guide_id, desc(pident), desc(length), desc(bitscore))

write_tsv(blast_merged, blast_hits_tsv)
message("Wrote blast hits table: ", blast_hits_tsv)

message("\nDONE.")
message("Next step: run ScanOfftarget_CP052959.R to generate off_n3/off_n5/off_refined outputs.")

5) 5_ScanOfftarget_CP052959.R (updated with strict hit counting)

#!/usr/bin/env Rscript

# ScanOfftarget_CP052959.R
# Adapted to the current pipeline outputs:
#   - guides_preoff: CP052959_guides_preoff.tsv (locus_tag, dist_to_tss, gc, guide_id...)
#   - blast_hits:    CP052959_blast_hits.tsv (guide_id, pident, length, mismatch, gapopen... + n_hits)
#
# Outputs:
#   scan_results/off_n1|off_n3|off_n5|off_n10|off_refined/{guides_all.tsv,guides_top5.tsv}

suppressPackageStartupMessages({
  if (!requireNamespace("readr", quietly = TRUE)) install.packages("readr")
  if (!requireNamespace("dplyr", quietly = TRUE)) install.packages("dplyr")
  library(readr)
  library(dplyr)
})

guides_preoff_tsv <- "CP052959_guides_preoff.tsv"
blast_hits_tsv    <- "CP052959_blast_hits.tsv"
guide_len         <- 20L
base_outdir       <- "scan_results"

message("Loading guides: ", guides_preoff_tsv)
guides <- read_tsv(guides_preoff_tsv, show_col_types = FALSE)

message("Loading BLAST hits: ", blast_hits_tsv)
blast <- read_tsv(blast_hits_tsv, show_col_types = FALSE)

if (!dir.exists(base_outdir)) dir.create(base_outdir)

# ---------- Column-name compatibility: gene_id / locus_tag; GC / gc ----------
# the guides table must have guide_id
if (!"guide_id" %in% names(guides)) stop("guides_preoff is missing the guide_id column: ", guides_preoff_tsv)

# gene_id: prefer gene_id; fall back to locus_tag
if (!"gene_id" %in% names(guides)) {
  if (!"locus_tag" %in% names(guides)) stop("guides_preoff has neither gene_id nor locus_tag")
  guides <- guides %>% mutate(gene_id = locus_tag)
}

# GC: prefer GC; fall back to gc
if (!"GC" %in% names(guides)) {
  if (!"gc" %in% names(guides)) stop("guides_preoff has neither GC nor gc")
  guides <- guides %>% mutate(GC = gc)
}

# dist_to_tss must exist
if (!"dist_to_tss" %in% names(guides)) stop("guides_preoff is missing the dist_to_tss column")

# ---------- Check BLAST hit columns ----------
# We need: guide_id, pident, length, mismatch, gapopen
needed_blast_cols <- c("guide_id", "pident", "length", "mismatch", "gapopen")
missing_cols <- setdiff(needed_blast_cols, names(blast))
if (length(missing_cols) > 0) {
  stop("blast_hits is missing required columns: ", paste(missing_cols, collapse = ", "),
       "\nPlease confirm blast_hits_tsv was generated by GuideFinder_CP052959_base.R.")
}

# ---------- Filtering rules ----------
# Generic rule: filter by an upper bound on n_hits
# If this is still too strict (still 0 guides), relax a little, e.g. mismatch 2 -> 3: filter_by_nhits(10, require_full_length = TRUE, max_mismatch = 3, require_no_gaps = TRUE)
filter_by_nhits <- function(max_hits,
                            require_full_length = TRUE,
                            max_mismatch = 2,
                            require_no_gaps = TRUE,
                            min_pident = 0) {

  # 1) First select, from the BLAST hits, those most likely to represent real off-target risk
  b <- blast

  # Force numeric types (character columns would turn every comparison into NA)
  b <- b %>%
    mutate(
      pident  = as.numeric(pident),
      length  = as.integer(length),
      mismatch = as.integer(mismatch),
      gapopen = as.integer(gapopen)
    )

  if (require_full_length) {
    b <- b %>% filter(length == guide_len)
  }
  if (require_no_gaps) {
    b <- b %>% filter(gapopen == 0)
  }
  if (!is.null(max_mismatch)) {
    b <- b %>% filter(mismatch <= max_mismatch)
  }
  if (!is.null(min_pident) && min_pident > 0) {
    # Note: pident is normally a percentage on the 0-100 scale
    b <- b %>% filter(pident >= min_pident)
  }

  # 2) Recompute each guide's n_hits from the strict hits only (more sensible)
  valid_ids <- b %>%
    count(guide_id, name = "n_hits_strict") %>%
    filter(n_hits_strict <= max_hits) %>%
    pull(guide_id)

  guides %>% filter(guide_id %in% valid_ids)
}

# Refined rule off_refined:
#  - exactly 1 perfect hit: length==guide_len & pident==100 & mismatch==0 & gapopen==0
#  - any other hits (if present): length < 15 or mismatch >= 3
filter_refined <- function() {
  blast_grouped <- blast %>%
    group_by(guide_id) %>%
    summarise(
      n_hits = n(),
      n_perfect = sum(length == guide_len & pident == 100 & mismatch == 0 & gapopen == 0),
      other_ok = all(
        ifelse(
          length == guide_len & pident == 100 & mismatch == 0 & gapopen == 0,
          TRUE,
          (length < 15 | mismatch >= 3)
        )
      ),
      .groups = "drop"
    )

  valid_ids <- blast_grouped %>%
    filter(n_perfect == 1, other_ok) %>%
    pull(guide_id)

  guides %>% filter(guide_id %in% valid_ids)
}

# Write results into a subdirectory
run_scenario <- function(name, guides_selected) {
  outdir <- file.path(base_outdir, name)
  if (!dir.exists(outdir)) dir.create(outdir, recursive = TRUE)

  all_path <- file.path(outdir, "guides_all.tsv")
  top_path <- file.path(outdir, "guides_top5.tsv")

  write_tsv(guides_selected, all_path)

  guides_top <- guides_selected %>%
    group_by(gene_id) %>%
    arrange(dist_to_tss, desc(GC)) %>%
    slice_head(n = 5) %>%
    ungroup()

  write_tsv(guides_top, top_path)

  message(sprintf("[%s] total guides: %d, genes covered: %d",
                  name, nrow(guides_selected), length(unique(guides_selected$gene_id))))
}

# ---------- Run all scenarios ----------
message("Running scenario: n_hits <= 1")
run_scenario("off_n1", filter_by_nhits(1))

message("Running scenario: n_hits <= 3")
run_scenario("off_n3", filter_by_nhits(3))

message("Running scenario: n_hits <= 5")
run_scenario("off_n5", filter_by_nhits(5))

message("Running scenario: n_hits <= 10")
run_scenario("off_n10", filter_by_nhits(10))

message("Running scenario: refined rule")
run_scenario("off_refined", filter_refined())

message("All scenarios finished. Check 'scan_results/' directory.")

Final notes

  • If you move to another machine, the only strict requirements are BLAST+, the R packages, and the same input FASTA/GBK files.
  • If a target gene yields no candidates again, check:

    • scan_gene_frac (set to 1.0 for short genes)
    • promoter length (increase if needed)
    • GC/homopolymer thresholds (slightly relax)

If the gRNA backbone format is known (e.g., BsaI/BsmBI Golden Gate), we can add an appendix section that auto-generates the exact top/bottom oligos for ordering from the selected guides.
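As a sketch of what that appendix could look like: the snippet below turns a 20-nt spacer into top/bottom annealing oligos. The CACC/AAAC overhangs are placeholders typical of BbsI/BsmBI sgRNA vectors, not a statement about our backbone; swap in the overhangs your vector actually requires.

```python
comp = str.maketrans("ACGT", "TGCA")

def revcomp(s: str) -> str:
    """Reverse complement of a DNA string."""
    return s.translate(comp)[::-1]

def golden_gate_oligos(spacer: str, top_ov: str = "CACC", bot_ov: str = "AAAC"):
    """Top/bottom oligos for one spacer. The overhangs are vector-specific
    placeholders and must be checked against the real backbone."""
    return top_ov + spacer, bot_ov + revcomp(spacer)

# Recommended single guide from this post
top, bottom = golden_gate_oligos("AGTGCTGCGATCACTTCTGT")
```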

Guide RNA Design with GuideFinder for the Toxin–Antitoxin Unit in Staphylococcus epidermidis HD46 (CP052959.1)

Part A | Step-by-step migration of the "125" pipeline to CP052959.1

0️⃣ Suggested directory layout (least hassle)

Create a new working directory, for example:

HD46_GuideFinder/
  CP052959.gbk
  CP052959.fna
  extract_cds_from_gbk.py
  PreProcessing_CP052959.R
  GuideFinder_CP052959_base.R
  ScanOfftarget_CP052959.R
  (optional) *.Rmd

You can copy your original PreProcessing_125.R / GuideFinder_125_base.R / ScanOfftarget_125.R, rename them, and change every "125" filename prefix inside to "HD46" or "CP052959.1".


1️⃣ Extract CDS from GenBank → generate CP052959_CDS.fasta (Python + Biopython)

Input: CP052959.gbk. Output: CP052959_CDS.fasta (one sequence per CDS; headers use the locus_tag).

Run (example):

python extract_cds_from_gbk.py \
  --gbk CP052959.gbk \
  --out CP052959_CDS.fasta

Check the output; you should see headers like:

  • >HJI06_09100 ...
  • >HJI06_09105 ...

The purpose of this step: later, each CDS is BLASTed back to the whole genome to build a unified coordinate table (especially useful for draft genomes / multi-contig assemblies; a complete genome runs the same way, keeping the workflow uniform).


2️⃣ Pre-processing: BLAST the CDS back to the genome → generate a gene coordinate table

2.1 Build the BLAST database (GenomeDB)

Build the database from the genome FASTA (input CP052959.1.fna):

makeblastdb -in CP052959.1.fna -dbtype nucl -out GenomeDB_HD46

2.2 BLAST: CDS vs genome

(Your notes used -task blastn, kept here; if your CDS are very short you can also use blastn-short.)

blastn -task blastn \
  -query HD46_CDS.fasta \
  -db GenomeDB_HD46 \
  -outfmt "6 qseqid sseqid sstart send sstrand length pident bitscore" \
  -max_target_seqs 1 \
  -perc_identity 95 \
  -out CDS_vs_genome_HD46.blast

2.3 Generate the coordinate table

Run your adapted pre-processing script (the counterpart of PreProcessing_125.R in your notes):

Rscript PreProcessing_HD46.R
# or
Rscript -e "rmarkdown::render('PreProcessing_HD46.Rmd')"

Output: HD46_gene_coordinates.tsv

This table typically contains, for each gene (locus_tag):

  • contig/seqid (here it should be CP052959.1)
  • start/end
  • strand
  • (if the script computes them) promoter region, approximate TSS position, gene length, etc.
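The coordinate-table step boils down to parsing the tabular BLAST output produced above (the column order matches the -outfmt string in 2.2). The actual pre-processing script is in R; this Python sketch is only an illustration, assuming one best hit per CDS thanks to -max_target_seqs 1:

```python
import csv
import io

# Column order matches: "6 qseqid sseqid sstart send sstrand length pident bitscore"
FIELDS = ["qseqid", "sseqid", "sstart", "send",
          "sstrand", "length", "pident", "bitscore"]

def blast_to_coords(tsv_text: str) -> list[dict]:
    """Turn tabular BLAST output into gene-coordinate records.

    Normalizes start <= end and stores the strand separately, since
    minus-strand hits report subject coordinates with sstart > send.
    """
    rows = []
    for rec in csv.DictReader(io.StringIO(tsv_text),
                              fieldnames=FIELDS, delimiter="\t"):
        s, e = int(rec["sstart"]), int(rec["send"])
        rows.append({
            "locus_tag": rec["qseqid"],
            "seqid": rec["sseqid"],
            "start": min(s, e),
            "end": max(s, e),
            "strand": "-" if rec["sstrand"] == "minus" else "+",
        })
    return rows

# One illustrative minus-strand row (values made up for the demo):
demo = "HJI06_09100\tCP052959.1\t1889421\t1889026\tminus\t396\t100.000\t732\n"
```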

3️⃣ Guide design: run GuideFinder (or your simplified GuideFinder-like script)

Your notes split this into:

  • GuideFinder_125_base.R: PAM scanning, GC/homopolymer filtering, computing dist_to_tss, and writing out the guides FASTA for BLAST
  • then ScanOfftarget_125.R: tiering guides by their BLAST hits into off-target levels (off_n3/off_n5/off_refined)

I suggest keeping the same split (clearer, and easier to debug).

3.1 (Optional but highly recommended) Restrict to the two TA genes: build a targets list

Create TA_targets.txt:

HJI06_09100
HJI06_09105

Then add a filter in the GuideFinder script that keeps only these two genes (if your 125 run was genome-wide, this speeds things up considerably).

3.2 Run the base design script

Rscript GuideFinder_HD46_base.R

You expect it to:

  1. Scan for NGG PAMs and generate 20 bp (or 20/21/22 bp) candidates
  2. Apply GC / homopolymer / bad-seed / restriction-site filters
  3. Compute dist_to_tss correctly (your notes say this is fixed now)
  4. Write BLASTguides.fasta and run BLAST (or at least prepare the BLAST input)
  5. Produce a "pre-off" guides table (e.g., HD46_guides_preoff.tsv)
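Steps 1–2 above (PAM scanning plus the GC/homopolymer filters) can be sketched as follows. The thresholds and helper names are illustrative defaults, not the actual settings in GuideFinder_*_base.R, and the bad-seed / restriction-site filters and dist_to_tss computation are omitted:

```python
import re

def revcomp(seq: str) -> str:
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]

def gc_frac(seq: str) -> float:
    return (seq.count("G") + seq.count("C")) / len(seq)

def candidate_guides(genome: str, guide_len: int = 20,
                     gc_min: float = 0.35, max_homopolymer: int = 4):
    """Yield (position, strand, spacer) for every NGG PAM whose spacer
    passes the GC and homopolymer filters.

    position is the 0-based plus-strand index of the spacer's leftmost
    base. Sketch only: one pass per strand, overlapping PAMs included
    via a lookahead.
    """
    run = max_homopolymer + 1
    homopolymer = re.compile("A{%d}|C{%d}|G{%d}|T{%d}" % (run, run, run, run))
    for strand, seq in (("+", genome.upper()), ("-", revcomp(genome))):
        for m in re.finditer(r"(?=[ACGT]GG)", seq):
            pam_start = m.start()
            if pam_start < guide_len:
                continue  # spacer would run off the sequence start
            spacer = seq[pam_start - guide_len:pam_start]
            if gc_frac(spacer) < gc_min or homopolymer.search(spacer):
                continue
            pos = pam_start - guide_len if strand == "+" else len(seq) - pam_start
            yield pos, strand, spacer
```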

3.3 Off-target scan and tiered output

Rscript ScanOfftarget_HD46.R

It places the results in a directory structure like this (following the convention in your notes):

scan_results/
  off_n3/
    guides_all.tsv
    guides_top5.tsv
  off_n5/
    guides_all.tsv
    guides_top5.tsv
  off_refined/
    guides_all.tsv
    guides_top5.tsv

Part B | How to read the off_refined "selection logic"

This passage in your notes is crucial; here is a plainer summary (keeping your original intent):

1) What does the current n_hits statistic in the script mean?

  • Currently n_hits is the number of BLAST hits for the guide, with no distinction between promoter / CDS / intergenic regions
  • It does not check whether a hit site has an adjacent NGG PAM
  • It is therefore conservative (matches that in fact lack a PAM are still counted as potential risk)

2) What is the core of the off_refined strategy?

Your notes put it well:

  • off_refined already removes, at the BLAST level, the truly dangerous guides with a second perfect 20/20 hit
  • The remaining guides typically have 1 on-target hit plus a set of degenerate off-target hits (many mismatches, incomplete seed, etc.), closer to the ideal of acting only at the intended site

3) Your proposal to count only PAM-adjacent hits as off-targets is correct

From the biology of dCas9/Cas9:

  • Without a suitable PAM (here NGG), a match in most cases poses no real off-target risk
  • You could add a later layer: for each BLAST hit, fetch the flanking genome sequence, check for an NGG, and count only PAM-adjacent hits toward n_hits

But for your stated use case (actual CRISPRi design), off_refined is already a very practical default tier.
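The proposed PAM-aware refinement layer could look like this. It is a sketch, not part of the current scripts: it takes BLAST hit coordinates, looks immediately 3' of the protospacer match (the SpCas9 PAM side), and counts only NGG-adjacent hits toward n_hits.

```python
def revcomp(seq: str) -> str:
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]

def pam_adjacent(genome: str, hit_start: int, hit_end: int, strand: str) -> bool:
    """Check whether a BLAST hit has an NGG PAM in the correct context.

    hit_start/hit_end are 0-based half-open plus-strand coordinates of
    the protospacer match; the PAM sits immediately 3' of it.
    """
    if strand == "+":
        pam = genome[hit_end:hit_end + 3]
    elif hit_start >= 3:
        pam = revcomp(genome[hit_start - 3:hit_start])
    else:
        return False  # hit at the very edge: no room for a PAM
    return len(pam) == 3 and pam[1:] == "GG"

def pam_aware_n_hits(genome: str, hits: list[tuple[int, int, str]]) -> int:
    """Count only the hits that also carry an NGG PAM."""
    return sum(pam_adjacent(genome, s, e, st) for s, e, st in hits)
```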


Part C | Draft email to Xxxxxx (ready to copy, paste, and send)

You asked for "an email to Xxxxxx saying it works in principle and that a probe-test was run; best match Enterobacter intestinihominis…; using off_refined; explaining the results; with algorithm details; attaching guides_all.tsv / guides_top5.tsv". The English email is below (written in a work-email register); replace the bracketed placeholders with your actual numbers/filenames.

Hi Xxxxxx,

In principle the guide design pipeline is working end-to-end now. I ran a small probe-test using the previously assembled complete genome of Patricia's isolate "125" as input.

Quick taxonomy sanity check (PATRIC/closest match): Best match: Enterobacter intestinihominis (taxid = 3133180; species; lineage: Bacteria → Pseudomonadati/Pseudomonadota → Gammaproteobacteria → Enterobacterales → Enterobacteriaceae).

What the pipeline produced (strategy: off_refined):

  • It completes the sequence-level filters (GC + homopolymers) and computes dist_to_tss correctly.
  • It then runs a BLAST-based off-target scan and generates the final guide tables.
  • Attached are:

    • guides_all.tsv (all guides passing filters)
    • guides_top5.tsv (up to 5 best guides per gene)

Summary of the results (off_refined):

  • Total genes evaluated: [N_genes]
  • Genes with ≥1 guide: [N_with_guides] ([pct]% coverage)
  • Genes without guides: [N_without_guides]
  • Total guides retained in guides_all.tsv: [N_guides_all]
  • Median guides per gene (genes with guides): [median_guides_per_gene] (These numbers are from the attached tables.)

Algorithm details (off_refined, in plain terms):

  1. Candidate generation: scan for NGG PAM sites and create guide candidates (20 bp; optionally 21/22 bp depending on settings).
  2. Primary filters: remove candidates failing GC threshold and basic sequence-quality rules (e.g., long homopolymer runs; optional “bad-seed” motifs / restriction sites).
  3. TSS proximity scoring: compute dist_to_tss and prioritize guides closer to the inferred transcription start region for CRISPRi.
  4. Off-target screening via BLAST: align guide (or seed region, depending on mode) against the whole genome.
  5. off_refined rule: discard any guide that shows a second perfect full-length match elsewhere in the genome (i.e., another 20/20 hit besides the intended on-target). The remaining guides typically have 1 on-target hit plus only weak/partial off-target hits.

Note: in the current implementation, n_hits is counted genomewide from BLAST hits (no promoter/CDS distinction and no explicit PAM check at hit sites), which is conservative. If needed, we can further refine by counting only BLAST hits that also have an appropriate NGG PAM in the correct context.

Best, Yyyy


If you next want to apply this pipeline to the HD46 (CP052959.1) TA unit you actually care about, I suggest running through Part A above and then pasting the rows for HJI06_09100 and HJI06_09105 from scan_results/off_refined/guides_top5.tsv; I can then help with a quick second-round selection of which single guide, or which pair of paired guides, is best for knocking down this TA unit as a whole.

GuideFinder-Based gRNA Design for CRISPRi Knockdown of a Toxin–Antitoxin Operon in S. epidermidis HD46 (CP052959.1)

The locus_tags of the two genes are:

  • Toxin: HJI06_09100, coordinates complement(1889026..1889421)
  • Antitoxin: HJI06_09105, coordinates complement(1889421..1889651)

One very important correction: both genes are annotated as complement, i.e., on the minus strand. Accordingly, "operon upstream / promoter at the front" maps onto coordinates like this:

  • For a minus-strand gene, the upstream (5' end) lies on the side with the larger coordinate
  • In coordinate terms, your TA unit is transcribed in the direction 1889651 (larger) → 1889026 (smaller)

Since you want to knock down toxin+antitoxin as one unit, the most effective strategy is usually to target near the 5' end of the antitoxin (the 1889651 side) and its upstream promoter region; that most readily suppresses the entire co-transcript.
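The strand-dependent "upstream" rule can be made concrete with a small helper. The 150 bp window length is an illustrative choice, not a GuideFinder setting:

```python
def upstream_window(start: int, end: int, strand: str,
                    length: int = 150) -> tuple[int, int]:
    """Return (start, end) of the promoter-side window upstream of a gene.

    Coordinates are 1-based inclusive, GenBank style. For a minus-strand
    gene the upstream window sits at the LARGER coordinates.
    """
    if strand == "+":
        return start - length, start - 1
    return end + 1, end + length

# Upstream of the antitoxin 5' end (minus strand → larger coordinates):
win = upstream_window(1889421, 1889651, "-")
```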


1) "Paired guides" explained clearly

Paired guides = picking two guides at different positions for the same target (the same gene/transcription unit) and expressing them simultaneously, so that dCas9 blocks transcription at two points, strengthening or stabilizing the repression.

  • It does not mean "targeting two different genes separately"
  • For your case the more suitable usage is:

    • 1 guide parked near the promoter/TSS
    • 1 guide in the front part of the operon (here, the antitoxin 5' end or the immediately adjacent region)
  • GuideFinder outputs usable guide pairs according to the minimum spacing you set (commonly 100 bp)
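The pairing rule itself is simple; a sketch of it (illustrative, not GuideFinder's actual implementation) is:

```python
from itertools import combinations

def pick_pairs(guides: list[tuple[str, int, float]], min_spacing: int = 100):
    """Enumerate guide pairs satisfying a minimum spacing.

    guides: (name, position, score) tuples; higher score = better.
    Returns pairs sorted by combined score, best first.
    """
    pairs = [
        (a, b) for a, b in combinations(guides, 2)
        if abs(a[1] - b[1]) >= min_spacing
    ]
    return sorted(pairs, key=lambda p: p[0][2] + p[1][2], reverse=True)

# Demo with made-up candidates near the antitoxin 5' end:
demo = [("g1", 1889700, 0.9), ("g2", 1889640, 0.8), ("g3", 1889560, 0.7)]
```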

2) Designing guides for these two locus_tags in HD46 (CP052959.1) with GuideFinder: practical steps

The steps below follow the most common workflow for a complete genome (you do not need to find NGG sites by hand; GuideFinder does that).

Step A: Prepare files

Download these two files from NCBI into the same directory (recommended):

  1. Genome sequence FASTA: CP052959.1.fna
  2. Genome annotation GenBank: CP052959.1.gbff (or .gbk)

The FASTA provides the sequence; the GenBank file provides the gene/CDS coordinates and locus_tags. GuideFinder relies on this information to locate the target genes and extract the promoter regions.

Step B: Install dependencies

You need:

  • R / RStudio
  • BLAST+ (command-line blastn): GuideFinder uses it for off-target screening
  • The GuideFinder Rmarkdown/R scripts (GitHub: ohlab/Guide-Finder)

Step C: Prepare the target gene list

Create a text file (e.g., targets.txt) with exactly two lines:

HJI06_09100
HJI06_09105

Step D: Set the key GuideFinder parameters (recommended combination for S. epidermidis)

Because S. epidermidis is relatively GC-poor, use a "strict first, then iteratively relax" strategy (a GuideFinder feature):

Round 1 (main parameters)

  • PAM: NGG (default SpCas9)
  • GC minimum: 35% (or slightly lower)
  • Prefer the TSS/promoter region: bias the selection toward guides closer to the TSS
  • Off-target screening: enabled (strongly recommended; do not turn it off)

Iterative rounds (automatically relaxed only for genes that got no guides in round 1)

  • GC minimum: lowered to 30%
  • Maximum distance to TSS: relaxed
  • If necessary, relax some further sequence filters (homopolymers etc.), but avoid loosening the off-target criteria too much
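The strict-then-relax schedule can be sketched as below. The parameter names and values are illustrative, not GuideFinder's actual option names; the point is that only unresolved genes are re-run with relaxed thresholds:

```python
# Hypothetical two-round parameter schedule (values from the text above):
ROUNDS = [
    {"pam": "NGG", "gc_min": 0.35, "tss_max_dist": 100, "offtarget": True},
    {"pam": "NGG", "gc_min": 0.30, "tss_max_dist": 200, "offtarget": True},
]

def design_with_relaxation(genes, design_fn):
    """Run design_fn(gene, params) per round; later rounds are applied
    only to genes that produced no guides in earlier rounds."""
    results, remaining = {}, list(genes)
    for params in ROUNDS:
        still_empty = []
        for gene in remaining:
            guides = design_fn(gene, params)
            if guides:
                results[gene] = guides
            else:
                still_empty.append(gene)
        remaining = still_empty
    return results, remaining

# Demo with a stand-in design function (gene "B" needs the relaxed GC):
def _demo_design(gene, params):
    if gene == "A":
        return ["guideA"]
    if gene == "B" and params["gc_min"] <= 0.30:
        return ["guideB"]
    return []

found, unresolved = design_with_relaxation(["A", "B", "C"], _demo_design)
```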

Step E: Run and inspect the output

GuideFinder typically gives you two kinds of output tables (names may vary slightly by version):

  1. Top hits (single-guide recommendations)
  2. Paired guides (paired-guide recommendations)

Your current goal is to suppress the TA unit as a whole, so from the output prioritize:

  • Guides toward the 1889651 side (the 5' end / upstream on the minus strand)

    • Focus on the highest-ranked guides in the HJI06_09105 (antitoxin) output, closest to its 5' end / putative promoter region
    • HJI06_09100 (toxin) will also yield guides, but for a whole-unit knockdown the most "entry-proximal" positions are usually stronger

3) For this TA unit, how to pick guides most likely to suppress both genes at once

Because the two genes are adjacent and co-transcribed, the most recommended combinations are:

Option 1: Single guide (simplest, most often successful)

  • Pick the 1 guide closest to the operon promoter/TSS
  • Experimentally, use it first to verify that both toxin and antitoxin mRNA go down (measure both by RT-qPCR)

Option 2: Paired guides (more robust, enhanced version)

  • Pick two guides that both sit in the front part of the operon (generally centered on the antitoxin 5' end / upstream region)
  • The two guides must satisfy your minimum-spacing setting (e.g., 100 bp)
  • Express both guides from the same vector (or in tandem)

4) A key reminder: on the minus strand, "upstream" is the larger-coordinate end

Your original email said "(+)", but the GenBank annotation clearly says complement. So when you look at coordinates in the GuideFinder output, remember:

  • The closer to 1889651 (the larger coordinate), the closer to the transcription start / upstream entry point
  • Such guides are usually the most likely to suppress the entire TA transcription unit

Paste the top few candidates for these two locus_tags from the GuideFinder output tables (including coordinates, strand, sequence, and off-target status), and a quick second-round selection can be made: which single guide is best for a one-shot knockdown, and which pair is best for the paired-guides enhanced version.