Manuscript extension (Data_JuliaBerger_RNASeq_SARS-CoV-2)

Figure 2 各子图生成工具与方法详解(中文)

根据您上传的 README.txtREADME_R.txtN-variant_MS.pdf,以下是 Figure 2 各子图的生成工具与流程:


📊 Figure 2A:蛋白质组热图(Proteomic Heatmap)

步骤 工具/软件 功能说明
原始数据处理 FragPipe (v23.0) 蛋白质鉴定与定量,使用 MSFragger 搜索引擎,控制 FDR < 1%
差异分析 FragPipe 内置统计模块 单因素方差分析(one-way ANOVA),p < 0.05 筛选显著蛋白
数据标准化 Python (Pandas v2.1.4) 计算 Z-score,整合 3 个生物学重复的平均值
可视化 Python: Matplotlib (v3.10.7) + Seaborn (v0.13.2) 绘制热图,基于欧氏距离进行层次聚类

🫧 Figure 2B:整合蛋白质组与转录组分裂气泡图(Split-Bubble Plot)

步骤 工具/软件 功能说明
数据整合 自定义 Python 脚本 合并蛋白质组(左侧)与转录组(右侧)的 Z-score
功能术语筛选 GO/Reactome 富集分析结果 筛选 padj < 0.01,Jaccard 相似性 ≤ 0.25 去冗余
可视化 Python: Matplotlib/Seaborn 绘制分裂气泡图:颜色=活性(红高蓝低),大小=−log₁₀(校正 P 值)

🔍 注:该图为定制化可视化,代码未在 README 中完整展示,但基于项目整体技术栈推断为 Python 实现。


🔥 Figure 2C:转录组热图(Transcriptomic Heatmap)

步骤 工具/软件 功能说明
原始定量 Salmon (via nf-core/rnaseq) 转录本水平定量,输出 quant.sf
基因水平汇总 tximport (R 包) 将转录本计数汇总至基因水平
差异表达分析 DESeq2 (R 包) 负二项分布模型,Wald 检验 + BH 校正
数据转换 DESeq2::rlogTransformation() 正则化对数转换,稳定方差
可视化 R: gplots::heatmap.2() 绘制热图,基于 Spearman 相关系数聚类

📌 关键 R 代码片段(来自 README_R.txt):

library(gplots)
library(RColorBrewer)
distsRL <- dist(t(assay(rld)))  # 计算样本间距离
hc <- hclust(distsRL)            # 层次聚类
hmcol <- colorRampPalette(brewer.pal(9,"GnBu"))(100)
heatmap.2(mat, Rowv=as.dendrogram(hc), symm=TRUE, trace="none", col=rev(hmcol))

📈 Figure 2E/F:病毒 RNA 转录与基因组载量分析

步骤 工具/软件 功能说明
病毒读段定量 SAMtools + 自定义脚本 从 BAM 文件提取比对到病毒基因组的读段
标准化 自定义 R/Python 脚本 以 PBS 对照为基准计算相对丰度
统计检验 R: stats::t.test() 或 DESeq2 Bonferroni 校正,p < 0.05 判定显著性
可视化 R: ggplot2 或 Python: Matplotlib 绘制柱状图/箱线图展示比值与载量

🔑 整体工作流程总结

原始数据
   │
   ├─ 转录组 ─► Trimmomatic → STAR+Salmon (Nextflow) → tximport → DESeq2 → R (gplots/ggplot2)
   │
   └─ 蛋白质组 ─► LC-MS/MS → FragPipe (MSFragger) → Python (Pandas/Seaborn/Matplotlib)
                          │
                          └─► 整合分析 → 自定义 Python 脚本 → Figure 2B

💡 补充说明

  1. 重复性保障:所有分析均基于 3 个生物学重复(n=3),统计检验采用多重校正(Benjamini-Hochberg 或 Bonferroni)。
  2. 参数可追溯:关键参数(如 Trimmomatic 的 SLIDINGWINDOW:4:15、DESeq2 的 padj ≤ 0.05)均在 README 中明确记录。
  3. 可视化工具选择
    • 热图/聚类 → R (gplots) 或 Python (Seaborn)
    • 气泡图/定制化图表 → Python (Matplotlib)
    • 统计图/出版级绘图 → R (ggplot2) 或 Python (Matplotlib)

如需我帮您整理某一张图的完整复现代码,或适配特定期刊的绘图风格,请随时告知!🧬📊

德国私立大学作为”过渡方案”的深度分析

🎓 德国私立大学作为”过渡方案”的深度分析

针对您朋友(国内运动康复本科背景)的情况,我将私立大学作为过渡方案的利弊、具体选项及路径设计梳理如下:


⚖️ 私立大学:核心利弊对比

维度 ✅ 优势 ⚠️ 风险/挑战
入学门槛 背景匹配度要求灵活,运动康复本科通常可申 部分项目仍需补修商科/心理学先修课
授课语言 英语授课项目丰富,无需德语基础即可入学 生活中仍需基础德语,建议同步学习
开学时间 滚动录取+多学期入学(冬/夏+部分春季),时间灵活 需提前3-6个月准备材料,避免错过截止
课程实践 校企合作紧密,Capstone项目常对接真实企业需求[[知识库]] 实践质量依赖学校资源,需提前调研合作机构
学历认证 国家认可私立大学学位可认证(需查教育部名单) 关键:必须确认目标院校在教育部涉外监管网名单内
费用成本 无隐性收费,服务响应快,签证支持完善 学费€8,000-15,000/年 + 生活费€11,208/年(保证金)
就业移民 毕业享18个月找工作签证,管理类岗位需求稳定 康复技术岗仍偏好德语+本地资质,需提前规划

🏫 推荐私立大学及专业(英语授课+健康相关)

🔹 首选:University of Europe for Applied Sciences (UE)

您已上传其专业资料,以下基于知识库深度分析:

专业 学位 地点 学制 匹配逻辑 注意事项
Prevention & Therapy Management M.Sc. Iserlohn 4学期 ✅ 商学院下设,课程含健康管理+创业+市场分析,适合”技术转管理” 不涉及康复技术深化,纯临床路线慎选
Clinical Psychology (Rehab & Geronto) M.Sc. Hamburg/Iserlohn 4学期 ✅ 名称含Rehabilitation,课程聚焦康复心理学+老年干预 需确认是否接受运动康复本科背景(建议邮件咨询)
International Public Health Management M.Sc. Iserlohn 4学期 ✅ 公共卫生+国际视角,适合政策/组织发展方向 偏宏观,一线康复技术关联度低

📌 关键优势(基于知识库)

  • Capstone Project模块(15学分)可对接康复机构/健康科技公司实战项目[[知识库]]
  • 通用考试条例(ASPO)明确支持英语授课+灵活考核方式(论文/演示/项目等)[[知识库]]
  • 商学院背景,课程含Startup Management、Health Economics,创业友好

🔹 其他高性价比私立选项

学校 专业 年学费 特色 认证状态
SRH Hochschule Heidelberg M.Sc. in Health Management ~€12,000 模块化教学+在线混合选项,适合在职过渡 ✅ 可认证
IU International University M.Sc. Public Health / Health Management ~€9,000 100%在线可选,时间极灵活,适合先拿学位再赴德 ✅ 可认证(需选线下/混合模式)
GISMA Business School MSc International Health Management ~€13,500 与英国University of Plymouth双学位,国际认可度高 ✅ 可认证
Berlin School of Business and Innovation (BSBI) MA Health & Social Care Management ~€10,500 柏林区位+行业嘉宾讲座,网络资源强 ✅ 可认证

🔍 认证查询方式
访问 教育部教育涉外监管信息网 → “德国” → 查看”私立高等院校”名单,确认目标院校在列。


🗺️ “过渡方案”三条务实路径设计

🔄 路径A:私立硕士 → 德国就业 → 补充临床资质(推荐⭐)

📅 时间线:
第1-2年:私立大学英语硕士(如UE的Prevention & Therapy Management)
   • 同步学习德语至B1(歌德学院/在线课程)
   • 通过Capstone项目积累德国健康机构实习经验
第3年:毕业申请18个月找工作签证
   • 目标岗位:健康机构项目协调员、康复中心运营助理、健康科技公司客户成功
   • 在职期间:考取德国康复治疗师补充资质(如"Physiotherapeut"进修课程,通常需德语C1+临床实习)
✅ 适合:目标是"管理岗"或"创业",不执着于一线治疗师身份的朋友

🔄 路径B:私立硕士 → 回国发展 → 国际背景加持

📌 优势:
• 英语硕士+德国学习经历,在国内高端康复机构/外资健康企业有差异化竞争力
• 若课程含数字化健康(如UE的ICT in Healthcare),可切入健康科技赛道
• 学费总投入约€25,000-35,000,低于英美同类项目
⚠️ 注意:
• 提前调研目标企业招聘偏好(部分公立医院仍偏好德语+本地学历)
• 建议在读期间争取国内康复机构远程实习,保持技术敏感度
✅ 适合:计划回国进入高端私立康复中心、健康咨询公司、跨国药企的朋友

🔄 路径C:私立硕士 → 申请公立博士 → 学术转型

🎯 逻辑:
私立硕士(管理/公共卫生方向)→ 积累研究经验 → 申请公立大学健康服务研究/卫生政策博士
📚 关键准备:
• 硕士期间主动参与教授研究项目,争取发表论文
• 强化定量研究方法(UE课程含Advanced Research Methodologies+SPSS/MAXQDA[[知识库]])
• 提前联系公立大学导师,说明"临床背景+管理硕士"的交叉优势
✅ 适合:对健康政策、服务管理研究有兴趣,愿意走学术路线的朋友

📝 行动清单:降低决策风险

  1. 认证核查(第一步!)
    → 访问 教育部涉外监管网,确认目标私立大学在认可名单

  2. 背景匹配预沟通
    → 用英文邮件联系UE招生办(模板可帮您起草),核心问:
    “My bachelor’s degree is in Sports Rehabilitation from China. Could you confirm if it meets the subject-related entry requirements for the M.Sc. Prevention & Therapy Management?”

  3. 成本测算

    💰 2年总投入估算(以UE为例):
    • 学费:€12,000-15,000/年 × 2 = €24,000-30,000  
    • 生活费保证金:€11,208/年 × 2 = €22,416(可月取使用)  
    • 保险+注册费+材料:~€2,000  
    • 合计:≈ €48,000-55,000(约¥38-44万)
  4. 备选对冲
    → 同步申请1-2所公立大学德语授课项目(如German Sport University Cologne),作为”保底+长期备选”


🌟 一句话总结
私立大学作为过渡方案可行且务实,尤其适合希望快速入学+英语环境+管理转型的朋友。关键在于:① 选对认证院校;② 明确职业目标与专业匹配度;③ 在读期间主动弥补德语/实践短板。

如果您需要,我可以:
🔹 帮您起草英文咨询邮件给UE招生办
🔹 提供德国私立大学认证名单快速查询链接
🔹 分析朋友简历与目标专业的匹配度优化建议

请告诉我下一步您希望聚焦哪个环节?😊



🇩🇪 德国公立大学应用科学类硕士项目(英语授课)

根据搜索结果,为您整理以下信息:


⚠️ 重要前提:英语授课 + 公立应用科学大学 = 选择有限

现实情况 说明
🎓 公立大学优势 免学费(仅收注册费€150-400/学期)[[2]]
🗣️ 语言限制 运动康复/物理治疗类硕士绝大多数为德语授课[[21]][[27]]
🔍 英语项目分布 健康管理、公共卫生方向英语选项相对较多[[31]]
🏫 应用科学大学特色 实践导向强,但英语硕士项目集中在商科/工程类[[12]]

✅ 公立应用科学大学:英语授课硕士推荐(健康相关)

🏥 健康管理方向(部分英语/双语)

学校 专业 学位 语言 学费 备注
Osnabrück UAS Health Management MBA 英/德双语 ~€390/学期 公立免学费,需确认英语比例[[39]]
Neu-Ulm UAS Digital Healthcare Management M.A. 英语 免学费 数字健康+管理,新兴方向[[31]]
Aalen University Gesundheitsmanagement M.A. 部分英语 免学费 需确认具体授课语言比例
Hochschule Ludwigshafen Health Care Management M.Sc. 部分英语 免学费 研究导向,管理+政策[[32]]

💡 提示:以上项目需逐一确认”英语授课比例”,部分为”德语为主+英语模块”。


🏃 运动康复/运动科学方向(英语选项极少)

学校 专业 学位 语言 学费 匹配度
German Sport University Cologne Sport and Exercise Science M.Sc. 🇩🇪 德语 ~€2,600/学期 ⭐⭐⭐⭐ 专业强,但语言门槛
University of Potsdam (UAS) Sports Therapy and Movement Therapy M.A. 🇩🇪 德语 免学费 ⭐⭐⭐⭐ 对口,但需德语[[24]]
DHfPG Sports and Movement Therapy M.Sc. 🇩🇪 德语 私立收费 ⭐⭐⭐ 私立,非公立[[22]]

关键提醒:目前德国几乎没有公立应用科学大学开设全英语授课的运动康复/物理治疗硕士[[21]][[27]]。


🔎 如何精准搜索适合的项目?

推荐工具(免费+官方)

  1. DAAD国际课程数据库daad.de/en/international-programmes

    • 筛选:Master + English only + “Health”/”Sport”/”Rehabilitation”[[19]]
  2. MyGermanUniversitymygermanuniversity.com

    • 可筛选”Public University” + “English” + 专业方向[[6]]
  3. Higher Education Compasshochschulkompass.de

    • 德国高校官方数据库,信息最权威

搜索关键词建议

"Public Health" AND "English" AND "Master" AND "Germany"
"Sports Science" AND "English-taught" AND "Fachhochschule"
"Health Management" AND "no tuition" AND "Master"

💡 给运动康复背景朋友的务实建议

方案A:接受德语,选择更对口专业

✅ 优势:专业匹配度高 + 公立免学费 + 就业认可度高
📚 行动:
   • 报名德福/歌德学院,目标B2-C1
   • 申请:German Sport University Cologne, University of Potsdam等
   • 时间:德语学习+申请周期约1.5-2年

方案B:坚持英语,接受私立或管理方向

✅ 优势:快速入学 + 英语环境 + 管理技能加成
📚 行动:
   • 考虑之前分析的UE大学:Prevention & Therapy Management(英语+管理)
   • 或选择私立大学如:SRH, IU(学费€5,000-10,000/年)
   • 注意:私立学历回国认证需提前确认

方案C:曲线救国——先读英语硕士,再补临床资质

✅ 路径:
   英语健康管理硕士(公立)→ 德国工作签证 → 在职补康复治疗师资质
🎯 适合:目标是"健康机构管理"而非"一线治疗师"的朋友

📋 申请前必查清单

  1. 语言要求:确认是”全英语”还是”英德混合”
  2. 学历认证:中国运动康复本科是否被认可为”相关专业”
  3. 先修课要求:部分项目要求管理学/统计学基础
  4. 实习/工作经验:部分公立项目要求2年相关经验[[38]]
  5. 认证问题:私立大学学位回国是否能做学历认证(教育部留学服务中心查询)

🌟 一句话总结
如果朋友坚持英语+公立+运动康复技术方向,目前德国选择非常有限;建议要么接受德语学习以获取更对口专业,要么转向健康管理/公共卫生方向(英语选项更多),要么考虑私立大学作为过渡方案。

需要我帮您用DAAD数据库具体筛选几个符合条件的英语硕士项目链接吗?或者帮您起草一封德语/英语咨询邮件,向目标学校确认语言要求和背景匹配度?



🎓 UE大学健康类硕士专业对比分析

根据UE大学官网信息[[13]],以下是与运动康复背景可能相关的硕士专业汇总:


📋 专业对比总表

专业名称 学位 地点 学制 核心方向 匹配度
Prevention & Therapy Management M.Sc. Iserlohn 2-4学期 健康管理+商业运营 ⭐⭐⭐⭐(管理转型)
Clinical Psychology (Rehab & Geronto) M.Sc. Hamburg/Iserlohn 4学期 康复心理学+老年心理 ⭐⭐⭐⭐(技术深化)
International Public Health Management M.Sc. Iserlohn 2-4学期 公共卫生政策+国际视角 ⭐⭐⭐(政策/研究)
Psychology (Coaching & Counseling) M.Sc. Berlin 4学期 心理咨询+行为干预 ⭐⭐⭐(软技能补充)

🔍 重点推荐:两个最对口方向

✅ 选项1:Prevention & Therapy Management M.Sc.(已分析)

🎯 适合:想从"治疗师"转向"管理者/创业者"的朋友
✅ 优势:
   • 商科+健康复合背景,就业面广
   • 课程含创业管理、健康市场分析,适合开康复工作室
   • 英语授课,国际认可度高
⚠️ 注意:不涉及康复技术深化,纯技术路线慎选

✅ 选项2:Clinical Psychology (Rehabilitation & Gerontopsychology) M.Sc.

🎯 适合:想深化"康复技术+心理干预"能力的朋友
✅ 优势:
   • 名称含"Rehabilitation",与运动康复专业衔接度高
   • 聚焦康复心理学、老年康复,符合老龄化社会需求
   • 心理学+康复的交叉背景,在临床/社区康复领域有差异化优势
⚠️ 注意:
   • 属于心理学范畴,需确认是否接受运动康复本科背景申请
   • 课程可能侧重心理评估与干预,而非运动治疗技术
   • 建议提前联系学校确认先修课要求

🎯 决策建议:按职业目标选择

如果朋友想...
│
├─ 🏢 开康复机构/做健康管理创业 → Prevention & Therapy Management ✅
├─ 🏥 做临床康复治疗师(技术路线)→ Clinical Psychology (Rehab) ⚠️需确认背景匹配
├─ 🌍 进国际组织/做公共卫生政策 → International Public Health Management ✅
├─ 💬 做康复咨询/患者心理支持 → Psychology (Coaching & Counseling) ✅
└─ 🔬 读博做康复科研 → 建议同步关注公立大学的Sport Science/Rehabilitation Science

📝 下一步行动清单

  1. 联系学校招生办(关键!)

    • 邮件咨询:运动康复本科学历是否符合 Clinical Psychology (Rehab) 的入学要求?
    • 询问是否需要补修心理学基础课程?
  2. 对比课程大纲

    • 下载两个专业的Module Handbook,重点看:
      • 是否有运动科学/康复技术相关模块?
      • 实践环节是否包含临床实习?
  3. 调研就业去向

    • 通过LinkedIn搜索该校毕业生,看康复背景学生的实际就业路径
  4. 备选方案

    • 如果以上专业匹配度都不理想,可同步关注德国公立大学的应用科学类项目,如:
      • Sports Therapy and Movement Therapy (University of Applied Sciences for Sports and Management, Potsdam) [[9]]
      • Sports and Movement Therapy (German University of Prevention and Health Management) [[3]]

💡 一句话总结
如果朋友想保留康复技术底色+补充心理/管理能力,建议优先确认 Clinical Psychology (Rehab) 的申请可行性;如果明确想跨界转型管理/创业,之前分析的 Prevention & Therapy Management 依然是稳妥选择。

需要我帮你起草一封英文咨询邮件,向学校确认专业匹配度吗?



🎓 Prevention & Therapy Management (M.Sc.) 专业分析

📋 专业基本定位

维度 内容
所属院系 商学院(Department of Business)⚠️
学位类型 理学硕士(M.Sc.)
学制学分 120 ECTS(通常2年)
授课语言 英语
核心方向 健康管理 + 商业运营,非临床技术方向

🔍 课程结构概览

📚 核心模块(15个必修+1个选修):
├── 健康科学基础(预防/治疗理论)
├── 管理类:战略管理、人力资源管理、项目管理、质量管理
├── 商业类:健康市场分析、健康经济金融、创业营销
├── 前沿类:医疗信息化、行业趋势、领导力、CSR伦理
├── 实践类:顶点项目(15学分)+ 硕士论文(25学分)
└── 方法类:高级研究方法

💡 关键特点:这是一个”用商科思维管理健康服务机构“的项目,而非培养临床治疗师。


✅ 对运动康复背景申请者的匹配度分析

🟢 优势/适配点

  1. 专业背景衔接:运动康复属于健康科学范畴,能更好地理解”预防与治疗”的学科逻辑
  2. 实践经验加成:若有临床/康复机构实习经历,在”服务管理””质量管理”等模块中更有发言权
  3. 职业转型友好:如果目标是从技术岗转向管理岗/创业/咨询,这个项目能提供系统的商业知识框架
  4. 差异化竞争力:「康复技术背景 + 健康管理硕士」组合在健康产业政策、机构运营领域有独特优势

🔴 需要注意的挑战

  1. 专业方向转变:课程几乎不涉及运动康复的专业技术深化(如评估方法、干预技术等)
  2. 商科基础要求:虽无硬性先修要求,但健康经济、财务管理、战略分析等内容对纯工科背景可能有学习曲线
  3. 职业目标错配风险:若想继续做一线治疗师或深耕康复科研,此专业帮助有限

🎯 决策建议:先问清楚朋友的职业目标

如果朋友想… 是否推荐 理由
🏢 进入健康机构做管理/运营/项目协调 ✅ 推荐 课程直接对口,管理+行业知识双加持
💼 创业开康复工作室/健康咨询公司 ✅ 推荐 创业管理+营销+健康市场模块高度相关
🌍 在国际健康组织/政策部门发展 ✅ 推荐 战略思维+伦理+跨文化管理培养全局视野
🔬 继续做康复治疗师/深化技术能力 ❌ 不推荐 无临床技术课程,建议考虑运动科学/康复科学硕士
🎓 读博走学术科研路线 ⚠️ 谨慎 除非研究方向是健康服务管理/卫生政策,否则方法论训练可能不够聚焦

📝 行动建议

  1. 与朋友深入沟通:明确其3-5年职业规划,是”技术专家”还是”管理者/创业者”路径?
  2. 联系学校确认
    • 运动康复本科学历是否被认可为相关专业?
    • 是否建议补充商科先修知识(如基础经济学、管理学)?
  3. 调研就业去向:询问该项目往届毕业生(尤其有健康背景的学生)的就业情况
  4. 备选方案:若偏向技术深化,可同步关注德国高校的 Sport Science, Rehabilitation Science, Physical Therapy 等硕士项目

🌟 一句话总结
如果朋友想用管理思维赋能健康行业,这个专业是很好的”跨界跳板”;如果想继续精进康复技术,则建议寻找更对口的专业方向。

需要我帮你进一步搜索德国其他运动康复相关硕士项目做对比吗?

Used samples (Manuscript_Marius_Karoline_2026)

分析脚本与论文图表的映射

🔹 Group3 vs Group4 分析

脚本MicrobiotaProcess_PCA_Group3-4.R
对应图表:论文 图 4A–C

图号 内容 分析方法
4A 实验设计示意图
4B PCoA 图(Bray-Curtis 距离) mp_cal_pcoa() + mp_plot_ord()
4C 差异丰度 OTU 气泡图 DESeq2 + 气泡图可视化

该分析比较的是老年雄性 vs 老年雌性小鼠的稳态肠道微生物组成


🔹 Group9_10_11 vs pre-FMT 分析

脚本MicrobiotaProcess_Group9_10_11_PreFMT.R + Phyloseq_Group9_10_11_pre-FMT.Rmd
对应图表:论文 图 5B、5C、5D、5E

图号 内容 分析方法
5B FMT 后受体小鼠的 PCoA 图 mp_cal_pcoa() + mp_plot_ord()
5C 科水平相对丰度堆叠图 mp_plot_abundance(taxa.class = Class)
5D 差异丰度 OTU 气泡图(FMT 组间比较) DESeq2/LEfSe + 气泡图
5E 脑内 IL-17A⁺ γδ T 细胞流式定量 流式细胞术 + 统计检验

该分析验证粪便微生物移植(FMT)对年轻受体小鼠卒中后免疫反应的影响


✅ 关键分析要点确认

  1. 距离矩阵:均使用 Bray-Curtis + Hellinger 转换
  2. 统计检验:PERMANOVA (mp_adonis) + 事后两两比较
  3. 可视化mp_plot_ord 用于序贯图,mp_plot_abundance 用于组成图
  4. 输出格式:PNG/PDF/SVG 多格式保存,便于论文投稿

  • 🔍 检查某段代码的逻辑或参数设置?
  • 📊 解释某个统计结果(如 PERMANOVA 的 R²、p 值)?
  • 🎨 优化图表的美学参数(颜色、字体、图例)?
  • 📝 撰写方法部分或图注的英文描述?


🐭 实验小鼠组别详解(中文版)

以下是本研究中使用的全部 14 个实验组别的详细说明,按功能分类整理:


🔹 第一类:中风模型组(用于图 4 和补充图 3)

组号 样本前缀 完整样本 性别/年龄 状态 用途
1 sample-A* A1–A11 ♂ 老年 中风后 3 天 图 4D–F(血液/脑组织 SCFA 检测)
2 sample-B* B1–B16 ♀ 老年 中风后 3 天 图 4D–F(血液/脑组织 SCFA 检测)

📌 说明:这两组用于比较中风后老年雄性和雌性小鼠的微生物代谢物(短链脂肪酸)水平差异。


🔹 第二类:基线供体组(用于图 4、补充图 4、图 5C)

组号 样本前缀 完整样本 性别/年龄 状态 用途
3 sample-C* C1–C10 ♀ 老年 基线,FMT 供体 图 4A–C(16S 测序)、补充图 4、图 5C(Boxplot 3)
4 sample-E* E1–E10 ♂ 老年 基线,FMT 供体 图 4A–C(16S 测序)、补充图 4、图 5C(Boxplot 2)
5 sample-F* F1–F5 ♂ 年轻 基线,FMT 供体 对照供体,未在主图中展示

📌 关键说明

  • 组 3 和组 4 是粪菌移植(FMT)的供体小鼠,用于提供老年雌/雄肠道菌群
  • 图 4 和补充图 4 中实际使用的样本为:♀供体 C1–C6(n=6),♂供体 E1–E8(n=8),其余样本因年龄偏小或测序深度不足被排除

🔹 第三类:FMT 预处理组(用于图 5B 紫色点)

组号 样本前缀 完整样本 性别/年龄 状态 用途
6 sample-G* G1–G6 ♂ 老年 FMT 前,抗生素处理前,批次 I 图 5B(紫色,pre-FMT 基线)
7 sample-H* H1–H6 ♀ 老年 FMT 前,抗生素处理前,批次 I 图 5B(紫色,pre-FMT 基线)
8 sample-I* I1–I6 ♂ 年轻 FMT 前,抗生素处理前,批次 II 图 5B(紫色,pre-FMT 基线)

📌 说明:这三组合并为”pre-FMT”基线组(n=18),代表年轻雄性受体小鼠在接受粪菌移植之前的肠道菌群状态。


🔹 第四类:FMT 受体组(用于图 5)

组号 样本前缀 完整样本 性别/年龄 接受供体 状态 用途
9 sample-J* J1–J4, J10, J11 ♂ 年轻 老年♂供体 FMT 后,中风前 图 5B🔵、5C(Boxplot 4)、5D、5E
10 sample-K* K1–K6 ♂ 年轻 老年♀供体 FMT 后,中风前 图 5B🔴、5C(Boxplot 5)、5D、5E
11 sample-L* L2–L6 ♂ 年轻 年轻♂供体 FMT 后,中风前 图 5B🟢、5E(对照)

📌 关键说明

  • 所有受体均为年轻雄性小鼠,仅供体来源不同
  • “aged♂ FMT” = 接受老年雄性供体粪便的年轻受体(不是受体本身是老年!)
  • 图 5C 的 5 个箱线图 = pre-FMT 基线 + 2 个供体组 + 2 个受体组(不含年轻♂供体受体组

🔹 第五类:FMT + 中风后组(未在主图展示)

组号 样本前缀 完整样本 性别/年龄 接受供体 状态 用途
12 sample-M* M1–M8 ♂ 老年 老年♂供体 FMT 后,中风后 补充分析
13 sample-N* N1–N10 ♀ 老年 老年♀供体 FMT 后,中风后 补充分析
14 sample-O* O1–O8 ♂ 年轻 年轻♂供体 FMT 后,中风后 补充分析

📌 说明:这三组用于探索性分析,未出现在主论文图表中。


🧭 快速记忆口诀

✅ "FMT 标签 = 供体特征,不是受体特征"
   • aged♂ FMT = 供体是老年雄性
   • 受体永远是年轻雄性(本实验设计)

✅ 图 4 = 老年供体(组 3/4)+ 老年中风小鼠(组 1/2)
✅ 图 5 = FMT 实验:受体(组 6–11)+ 供体(组 3/4)
✅ 补充图 4 = 仅老年供体(组 3/4,筛选后 C1–C6, E1–E8)

⚠️ 样本排除说明

组别 排除样本 排除原因
组 3(♀供体) C7, C8, C9, C10 C8–C9 年龄偏小;C10 为离群值/测序深度低
组 4(♂供体) E9, E10 测序深度低/离群值
组 9(受体) J5, J6, J7, J8, J9 测序深度不足或质量控制排除
组 10(受体) K7–K15 测序深度不足或质量控制排除
组 11(受体) L1, L7–L15 测序深度不足或质量控制排除

📌 最终用于分析的样本数以各图图例标注为准(如:图 5 中 aged♂ FMT n=6, aged♀ FMT n=6)


TODO: 导出完整的样本–组别映射 CSV 文件,or 提供某张图的精确样本列表🎯



关于 “aged♂ FMT” 的明确解释

aged♂ FMT = 接受了老年雄性供体粪便的年轻雄性受体小鼠


🔹 实验设计核心逻辑

角色 年龄/性别 说明
受体(接受粪便) 🐭 年轻雄性(4周龄起始) 所有 FMT 组的受体都是相同的年轻雄性小鼠
供体(提供粪便) 🐭 老年雄性 / 老年雌性 / 年轻雄性 供体的年龄/性别是实验变量

🔹 样本分组详解

🟣 Purple (pre-FMT, n=18): G1–G6, H1–H6, I1–I6
   → FMT前的基线年轻雄性小鼠(未接受移植)

🔵 Blue (aged♂ FMT, n=6): J1, J2, J3, J4, J10, J11
   → 年轻雄性受体 + 接受【老年雄性】供体粪便

🔴 Red (aged♀ FMT, n=6): K1–K6
   → 年轻雄性受体 + 接受【老年雌性】供体粪便

🟢 Green (young♂ FMT, n=5): L2–L6
   → 年轻雄性受体 + 接受【年轻雄性】供体粪便(对照组)

🔹 文献依据

来自 260311_LTPaper.pdf Figure 5 图例:

“Principal coordinates analysis (PCoA) of young male mice before (purple) (n=18), and after FMT of aged male (n=6) (blue) or female (n=6) (red) or young male (n=5) (green) stool donors.”

→ 明确说明分析对象是 young male mice,括号内描述的是 stool donors(粪便供体)的特征。

来自 Supplemental Methods “Microbiota eradication and FMT”:

“4 weeks old male mice were treated for 2 weeks with an antibiotic cocktail… recipient mice were gavaged with donor stool four times over two weeks.”

→ 受体小鼠起始年龄为 4周龄(年轻)

Figure 5 小标题:

“FMT of aged male microbiota increases IL-17A-producing γδ T cells in the post-ischemic brain of young recipient mice

→ 再次确认受体是 young recipient mice


🔹 为什么这样设计?

这个实验的核心科学问题是:

“供体微生物的年龄/性别特征,能否通过移植’传递’给受体,并影响受体的免疫反应?”

通过保持受体一致(年轻雄性),仅改变供体来源,可以:

  1. 排除受体自身年龄/性别的混杂效应
  2. 直接评估供体微生物对受体免疫表型(如 IL-17A⁺ γδ T 细胞)的因果影响
  3. 验证”微生物介导的年龄/性别差异”假说

✅ 快速记忆口诀

“FMT 标签 = 供体特征,不是受体特征”

  • aged♂ FMT = 供体是老年雄性
  • 受体永远是年轻雄性(本实验中)


🔹 Figure 5B: PCoA of FMT Experiment

“Principal coordinates analysis (PCoA) of young male mice before (purple) (n=18), and after FMT of aged male (n=6) (blue) or female (n=6) (red) or young male (n=5) (green) stool donors.”

  • 🟣 Purple (pre-FMT, n=18): Groups 6+7+8 → G1–G6, H1–H6, I1–I6
  • 🔵 Blue (aged♂ FMT, n=6): Group9 → J1, J2, J3, J4, J10, J11
  • 🔴 Red (aged♀ FMT, n=6): Group10 → K1–K6
  • 🟢 Green (young♂ FMT, n=5): Group11 → L2–L6 (L1, L7–L15 excluded for low depth/QC)

🔹 Figure 5C=Figure 5B+C1-7+E1-10 (Need to be confirmed?): Family-Level Relative Abundance Boxplots (5 panels)

Based on your co-author’s note: “Figure 5C uses the Figure 5B recipient samples PLUS the aged donor samples (Groups 3 & 4).”

  • Boxplot 1 (pre-FMT baseline, n=18): Groups 6+7+8 → G1–G6, H1–H6, I1–I6
  • Boxplot 2 (aged♂ stool donors, n=8): Group4 → E1–E10
  • Boxplot 3 (aged♀ stool donors, n=6): Group3 → C1–C7
  • Boxplot 4 (aged♂ FMT recipients, n=6): Group9 → J1, J2, J3, J4, J10, J11
  • Boxplot 5 (aged♀ FMT recipients, n=6): Group10 → K1–K6
  • !!No Group11 (L2-L6)!!

⚠️ Key difference: Group11 (young♂ FMT recipients, L2–L6) is shown in Figure 5B but is NOT included in Figure 5C, since Figure 5C focuses on comparing the effect of aged donor microbiota.

🔹 Figure 5D: Bubble Plot of Differentially Abundant Taxa (DESeq2)

“Bubble plot showing differentially abundant Operational Taxonomic Units (OTUs) between young male recipients of aged female vs. aged male FMT. x-axis = log₂ fold change, y-axis = bacterial family, bubble size = adjusted p-value, color = bacterial order.”

  • 🔵 Aged♂ FMT recipients (Group9, n=6): J1, J2, J3, J4, J10, J11 → Reference group (log₂FC < 0 = enriched in this group)
  • 🔴 Aged♀ FMT recipients (Group10, n=6): K1–K6 → Comparison group (log₂FC > 0 = enriched in this group)
Key families highlighted in the plot: Direction Family (Order) Enriched in Biological note
🔴 Positive log₂FC Lachnospiraceae (Clostridiales) Aged♀ FMT SCFA producer
🔴 Positive log₂FC Ruminococcaceae (Clostridiales) Aged♀ FMT SCFA producer
🔴 Positive log₂FC Muribaculaceae (Bacteroidales) Aged♀ FMT SCFA producer
🔴 Positive log₂FC Desulfovibrionaceae (Desulfovibrionales) Aged♀ FMT Sulfate-reducing
🔵 Negative log₂FC Erysipelotrichaceae (Erysipelotrichales) Aged♂ FMT Pro-inflammatory association
🔵 Negative log₂FC Rikenellaceae (Bacteroidales) Aged♂ FMT Context-dependent
🔵 Negative log₂FC Clostridiales vadinBB60 group Aged♂ FMT Function unclear

⚠️ Note: This analysis uses DESeq2 on non-rarefied integer counts from ps_filt, with taxa prefiltered (total counts ≥10). Only taxa with Benjamini–Hochberg adjusted p < 0.05 are shown. The same ASVs/OTUs appear in Figure 4C and Supplementary Figure 4B, but Figure 5D specifically compares FMT recipient outcomes (Groups 9 vs. 10), not baseline donor differences.


🔹 Supplementary_Figure4=Figure4B-C: Aged Donors (Homeostatic)

“(A) Bray-Curtis distances between aged male-male, female-female and female-male stool samples under homeostatic conditions (nmale=8 and nfemale=6). (B) Cladogram showing differentially abundant OTUs…”

  • 👨 Aged male donors (n=8): Group4 → E1–E8 (E9, E10 excluded for low sequencing depth/outliers)
  • 👩 Aged female donors (n=6): Group3 → C1–C6 (C7–C10 excluded; C8–C9 younger mice, C10 outlier)

🔹 Figure 4B-C: Sex Differences in Aged Mice (16S rRNA-seq panels B–C)

“We profiled the gut bacterial composition of aged male and female mice by 16S rRNA-seq…”

  • Baseline aged female donors: Group3 → C1–C6
  • Baseline aged male donors: Group4 → E1–E8

(Note: Figure 4D–F show SCFA concentrations measured by targeted UHPLC-MS/MS, not 16S data.)


✅ PICRUSt2 NOT used in Figure 4D–F

Your observation is CORRECT: PICRUSt2 results are NOT used in Figure 4D–F.

Question Answer Evidence
Are PICRUSt2 results used in Figure 4? No Figure 4D–F legend explicitly states: “measured by targeted mass spectrometry”
Are PICRUSt2 results used anywhere in the manuscript? No evidence README_PICRUSt2.txt files contain exploratory pipeline notes, but no PICRUSt2 figures, tables, or text appear in 260311_LTPaper.pdf or 260310_Supplements.pdf
Is the SCFA data in Figure 4D–F experimentally measured? Yes Supplemental Methods (pages 12–13) describe UHPLC-MS/MS quantification with internal standards, derivatization, and MRM parameters

Key distinction:

  • PICRUSt2Predicts functional potential (gene/pathway abundances) from 16S sequences; outputs are relative, unitless values.
  • Figure 4D–FMeasures actual SCFA concentrations (acetate, butyrate, etc.) in µmol/l via targeted mass spectrometry; outputs are absolute, quantitative values.

Here is the merged quick reference table combining Figure 5B, 5C, and 5D with related figures, formatted for easy copy-paste:


🔹 Quick Reference: All Figure 5 Panels vs. Related Figures

Figure Comparison Sample IDs (exact) n Purpose
Figure 4B-C Aged♀ vs. aged♂ donors (homeostatic) C1–C6 vs. E1–E8 6 vs. 8 Baseline sex differences in microbiota (DESeq2 bubble plot)
Suppl Fig 4B Same as Fig 4C C1–C6 vs. E1–E8 6 vs. 8 Phylogenetic context of differential taxa (cladogram)
Figure 5B Pre-FMT vs. post-FMT recipients (4-group PCoA) G1–G6, H1–H6, I1–I6 (pre-FMT); J1, J2, J3, J4, J10, J11 (aged♂ FMT); K1–K6 (aged♀ FMT); L2–L6 (young♂ FMT) 18, 6, 6, 5 PCoA: microbiome shift after FMT (Bray–Curtis)
Figure 5C Donors vs. recipients (5 boxplots, family-level) G1–G6, H1–H6, I1–I6 (pre-FMT); E1–E8 (aged♂ donors); C1–C6 (aged♀ donors); J1, J2, J3, J4, J10, J11 (aged♂ FMT); K1–K6 (aged♀ FMT) 18, 8, 6, 6, 6 Taxonomic composition: donors vs. recipients (relative abundance)
Figure 5D Aged♀ vs. aged♂ FMT recipients (DESeq2) K1–K6 vs. J1, J2, J3, J4, J10, J11 6 vs. 6 Effect of donor microbiota on recipient immune response (differential abundance)
Figure 5E Same recipients as Fig 5D (+ young♂ control) K1–K6 vs. J1, J2, J3, J4, J10, J11 (+ L2–L6) 6 vs. 6 (+5) IL-17A+ γδ T cells in brain post-FMT (flow cytometry)

🔹 Sample-ID Master List for Figure 5

Group # Description Sample Prefix Full IDs Used In
3 Aged female, baseline FMT donor sample-C* C1–C10 (C1–C6 used in Fig 4B-C, Suppl Fig 4, Fig 5C) Fig 4C, Suppl Fig 4, Fig 5C
4 Aged male, baseline FMT donor sample-E* E1–E10 (E1–E8 used in Fig 4B-C, Suppl Fig 4, Fig 5C) Fig 4B-C, Suppl Fig 4, Fig 5C
6 Aged male, pre-antibiotics FMT batch I sample-G* G1–G6 Fig 5B (purple), Fig 5C (Boxplot 1)
7 Aged female, pre-antibiotics FMT batch I sample-H* H1–H6 Fig 5B (purple), Fig 5C (Boxplot 1)
8 Young male, pre-antibiotics FMT batch II sample-I* I1–I6 Fig 5B (purple), Fig 5C (Boxplot 1)
9 Young male, post-FMT aged male stool sample-J* J1–J4, J10, J11 (J5–J9 excluded) Fig 5B (blue), Fig 5C (Boxplot 4), Fig 5D, Fig 5E
10 Young male, post-FMT aged female stool sample-K* K1–K6 Fig 5B (red), Fig 5C (Boxplot 5), Fig 5D, Fig 5E
11 Young male, post-FMT young male stool sample-L* L2–L6 (L1, L7–L15 excluded) Fig 5B (green), Fig 5E (not in Fig 5C/D)

🔹 Key Notes for Interpretation

  1. Figure 5B vs. 5C: Figure 5B shows beta-diversity (PCoA) of all FMT groups; Figure 5C shows taxonomic composition (boxplots) of donors + recipients. Group11 (young♂ FMT) is in 5B but not in 5C.
  2. Figure 5D: Uses DESeq2 on non-rarefied counts from ps_filt (taxa prefiltered: total counts ≥10). Only taxa with BH-adjusted p < 0.05 are shown.
  3. Figure 5E: Includes the same recipients as Fig 5D plus the young♂ FMT control group (Group11, L2–L6) for comparison of IL-17A+ γδ T cells.
  4. Sample exclusions: C7–C10, E9–E10, J5–J9, K7–K15, L1, L7–L15 were excluded for low depth, outliers, or QC reasons (see README files).

Let me know if you’d like me to:

  1. Export the exact DESeq2 results table for Figure 5D as CSV/Excel,
  2. Provide the R code snippet that generates the bubble plot for Figure 5D, or
  3. Draft the full email reply to your colleague with these merged tables integrated. 🎯


Draft Reply to M.


Subject: Re: Manuscript Review (Lines 276-348) & NCBI SRA Citation

Thank you for sending the manuscript and for the opportunity to review the specified sections. I have carefully reviewed lines 276–348 covering the microbiota composition analysis and FMT experiments.

✅ Text Review: Minor Corrections Suggested

I noticed a few minor typographical inconsistencies in the taxonomic nomenclature that may warrant correction before submission:

Line Current Text Suggested Correction
295 Muribaculae (order Bacteroidalis) Muribaculaceae (order Bacteroidales)
298 Ruminococcae Ruminococcaceae
334 Muribaculae Muribaculaceae

These appear to be minor spelling variations; please confirm if these align with your intended taxonomic references.

The scientific content, logic flow, and figure references (Fig. 4A–D, Fig. 5A–E) are clear and well-integrated with our analysis scripts (MicrobiotaProcess_Group3-4.R and MicrobiotaProcess_Group9_10_11_PreFMT.R).

🗂️ NCBI SRA Data Submission

Regarding the NCBI SRA citation:

  1. Data readiness: The 16S rRNA sequencing data (Group 3/4 and Group 9/10/11/pre-FMT) are processed and ready for upload.
  2. Next steps:
    • I can prepare the metadata table (sample IDs, Group, Sex_age, pre_post_stroke) in the format required by NCBI BioProject.
    • Once uploaded, we will receive a BioProject/BioSample accession number (e.g., PRJNAxxxxxx) to cite in the manuscript.
  3. Suggested placement for citation:
    Dataset: 16S rRNA-seq data are deposited in the NCBI Sequence Read Archive (SRA) 
    under BioProject accession number [TO BE ADDED].

If you confirm, I can proceed with preparing the submission files this week so we meet your timeline.

📅 Timeline

I am flexible and ready to assist with final revisions or SRA submission as needed. Please let me know the exact submission date once confirmed, and I will prioritize accordingly.

Thank you again, and I wish you a pleasant weekend as well!


Note for me: Before sending,:

  1. Double-check the taxonomic spellings against your reference database (SILVA/GTDB)
  2. Confirm whether Marius prefers to handle the SRA upload himself or delegate it
  3. Attach the prepared metadata template if you want to expedite the process

Would you like me to help draft the NCBI BioProject metadata table or refine any part of this reply?

Generating manhattan plots for [MKL-1|WaGa] wt cells vs [MKL-1|WaGa] wt EV (Data_Ute_exceRpt_workspace)

  • exceRpt_DiagnosticPlots

  • pca2
  • pca
  • manhattan_plot_MKL1_vs_EV
  • manhattan_plot_WaGa_vs_EV
  1. Input data

     #Data_Ute_smallRNA_3/               Data_Ute_smallRNA_4/               Data_Ute_smallRNA_7/
     #Data_Ute_smallRNA_3_nextflow/      Data_Ute_smallRNA_6_size_selected/ Data_Ute_smallRNA_7_out/
    
     See samples in Sequencing_overview.xlsx
    
     Batch,Date,Status
     1,August 2021,⚪ Experimental (nf655, nf657)
     2,February 2022,❌ Not found in my processed datasets (nf701, nf705)
     3,August 2022,⚪ Experimental (nf774, nf780)
     4,November 2022,⚪ Experimental (nf796, nf797)
     5,June 2023,⚪ Experimental (nf876, nf887 (MKL-1 but scr_DMSO!))
     6,August 2023,⚪ Experimental (nf916, nf917, nf918 (MKL-1 but scr_DMSO!), nf919 (MKL-1 but scr_DMSO!))
    
     #? Maybe we can include the parental cells, EVs (see 3 MKL-1_wt, 1 MKL_1_derived_EV_miRNA in the point 'BATCH_1, 3, 4') and scr_DMSO_EVs for MKL-1 (3 samples in BATCH_5 and BATCH_6) for exceRpt running!
    
     #BATCH_1: (plot-numpy1) jhuang@WS-2290C:/media/jhuang/Elements/Data_Ute/Data_Ute_smallRNA_3$ find . -name "*fastq.gz" (6 samples)
     ./210817_NB501882_0294_AHW5Y2BGXJ/nf656/MKL_1_derived_EV_RNA_S13_R1_001.fastq.gz
     ./210817_NB501882_0294_AHW5Y2BGXJ/nf658/WaGa_derived_EV_RNA_S14_R1_001.fastq.gz
     #ERROR_wrong_demulti ./210817_NB501882_0294_AHW5Y2BGXJ_neu/nf655/MKL_1_derived_EV_miRNA_S1_R1_001.fastq.gz
     #ERROR_wrong_demulti ./210817_NB501882_0294_AHW5Y2BGXJ_neu/nf657/WaGa_derived_EV_miRNA_S2_R1_001.fastq.gz
    
     #BATCH_3:
     ./220617_NB501882_0371_AH7572BGXM/nf774/0403_WaGa_wt_S20_R1_001.fastq.gz #*
     ./220617_NB501882_0371_AH7572BGXM/nf780/0505_MKL1_wt_S26_R1_001.fastq.gz
    
     # newDemulti of BATCH_1, 3, 4:
     *  ./230623_newDemulti_smallRNAs/210817_NB501882_0294_AHW5Y2BGXJ_smallRNA_Ute_newDemulti/2021_nf_ute_smallRNA/nf655/MKL_1_derived_EV_miRNA_S1_R1_001.fastq.gz *
     *  ./230623_newDemulti_smallRNAs/210817_NB501882_0294_AHW5Y2BGXJ_smallRNA_Ute_newDemulti/2021_nf_ute_smallRNA/nf657/WaGa_derived_EV_miRNA_S2_R1_001.fastq.gz
     *  ./230623_newDemulti_smallRNAs/220617_NB501882_0371_AH7572BGXM_smallRNA_Ute_newDemulti/2022_nf_ute_smallRNA/nf774/0403_WaGa_wt_S1_R1_001.fastq.gz
     *  ./230623_newDemulti_smallRNAs/220617_NB501882_0371_AH7572BGXM_smallRNA_Ute_newDemulti/2022_nf_ute_smallRNA/nf780/0505_MKL1_wt_S2_R1_001.fastq.gz *
     *  ./230623_newDemulti_smallRNAs/221216_NB501882_0404_AHLVNMBGXM_smallRNA_Ute_newDemulti/2022_nf_ute_smallRNA/nf796/MKL-1_wt_1_S1_R1_001.fastq.gz *
     *  ./230623_newDemulti_smallRNAs/221216_NB501882_0404_AHLVNMBGXM_smallRNA_Ute_newDemulti/2022_nf_ute_smallRNA/nf797/MKL-1_wt_2_S2_R1_001.fastq.gz *
    
     #BATCH_5: (plot-numpy1) jhuang@WS-2290C:/media/jhuang/Elements/Data_Ute/Data_Ute_smallRNA_4$ find . -name "*.fastq.gz" ()
     #ERROR_wrong_adapter ./230602_NB501882_0428_AHKG53BGXT/demulti_wrong_adapter_trimming/nf876/1002_WaGa_sT_Dox_S1_R1_001.fastq.gz
     #ERROR_wrong_adapter ./230602_NB501882_0428_AHKG53BGXT/demulti_wrong_adapter_trimming/nf887/2312_MKL_1_scr_DMSO_S2_R1_001.fastq.gz
     ./230602_NB501882_0428_AHKG53BGXT/demulti_new/nf876/1002_WaGa_sT_Dox_S1_R1_001.fastq.gz
     * NOT_USED: MKL-1 but scr_DMSO: ./230602_NB501882_0428_AHKG53BGXT/demulti_new/nf887/2312_MKL_1_scr_DMSO_S2_R1_001.fastq.gz
    
     #BATCH_6: (plot-numpy1) jhuang@WS-2290C:/media/jhuang/Elements/Data_Ute/Data_Ute_smallRNA_6_size_selected$ find . -name "*.fastq.gz"
     ./230811_NB501882_0432_AHMW2CBGXT/nf916/1002_WaGa_wt_S1_R1_001.fastq.gz
     ./230811_NB501882_0432_AHMW2CBGXT/nf917/1002_WaGa_sT_Dox_S2_R1_001.fastq.gz
     * NOT_USED: MKL-1 but scr_DMSO: ./230811_NB501882_0432_AHMW2CBGXT/nf918/2312_MKL_1_scr_DMSO_S3_R1_001.fastq.gz
     * NOT_USED: MKL-1 but scr_DMSO: ./230811_NB501882_0432_AHMW2CBGXT/nf919/2403_MKL_1_scr_DMSO_S4_R1_001.fastq.gz
    
     # BATCH_100
     # Data_Ute_smallRNA_7/$ find . -name "*fastq.gz" (10 samples)
     ./231016_NB501882_0435_AHG7HMBGXV/nf930/01_0505_WaGa_wt_EV_RNA_S1_R1_001.fastq.gz
     ./231016_NB501882_0435_AHG7HMBGXV/nf931/02_0505_WaGa_sT_DMSO_EV_RNA_S2_R1_001.fastq.gz
     ./231016_NB501882_0435_AHG7HMBGXV/nf932/03_0505_WaGa_sT_Dox_EV_RNA_S3_R1_001.fastq.gz
     ./231016_NB501882_0435_AHG7HMBGXV/nf933/04_0505_WaGa_scr_DMSO_EV_RNA_S4_R1_001.fastq.gz
     ./231016_NB501882_0435_AHG7HMBGXV/nf934/05_0505_WaGa_scr_Dox_EV_RNA_S5_R1_001.fastq.gz
     ./231016_NB501882_0435_AHG7HMBGXV/nf935/06_1905_WaGa_wt_EV_RNA_S6_R1_001.fastq.gz
     ./231016_NB501882_0435_AHG7HMBGXV/nf936/07_1905_WaGa_sT_DMSO_EV_RNA_S7_R1_001.fastq.gz
     ./231016_NB501882_0435_AHG7HMBGXV/nf937/08_1905_WaGa_sT_Dox_EV_RNA_S8_R1_001.fastq.gz
     ./231016_NB501882_0435_AHG7HMBGXV/nf938/09_1905_WaGa_scr_DMSO_EV_RNA_S9_R1_001.fastq.gz
     ./231016_NB501882_0435_AHG7HMBGXV/nf939/10_1905_WaGa_scr_Dox_EV_RNA_S10_R1_001.fastq.gz
    
     #I can’t find the nf940/11_control_MKL1_S11_R1_001.fastq.gz sample so I am not sure what kind of sample or condition it is.
     #NOT_USED_SINCE_IT_DOES_NOT_BELONG_TO_THE_PROJECT: ./231016_NB501882_0435_AHG7HMBGXV/nf940/11_control_MKL1_S11_R1_001.fastq.gz
     #NOT_USED_SINCE_IT_DOES_NOT_BELONG_TO_THE_PROJECT: ./231016_NB501882_0435_AHG7HMBGXV/nf941/12_control_WaGa_S12_R1_001.fastq.gz
    
     # Data_Ute_smallRNA_7$ find . -name "*.fastq.gz" (6 samples)
     ./250411_VH00358_135_AAGKGLHM5/nf961/WaGaWTcells_1_S1_R1_001.fastq.gz
     ./250411_VH00358_135_AAGKGLHM5/nf962/WaGaWTcells_2_S2_R1_001.fastq.gz
     ./250411_VH00358_135_AAGKGLHM5/nf971/2001_WaGa_sT_DMSO_S3_R1_001.fastq.gz
     ./250411_VH00358_135_AAGKGLHM5/nf972/2001_WaGa_sT_Dox_S4_R1_001.fastq.gz
     ./250411_VH00358_135_AAGKGLHM5/nf973/2001_WaGa_scr_DMSO_S5_R1_001.fastq.gz
     ./250411_VH00358_135_AAGKGLHM5/nf974/2001_WaGa_scr_Dox_S6_R1_001.fastq.gz
    
     ./250411_VH00358_135_AAGKGLHM5/nf961/WaGaWTcells_1_S1_R2_001.fastq.gz
     ./250411_VH00358_135_AAGKGLHM5/nf962/WaGaWTcells_2_S2_R2_001.fastq.gz
     ./250411_VH00358_135_AAGKGLHM5/nf971/2001_WaGa_sT_DMSO_S3_R2_001.fastq.gz
     ./250411_VH00358_135_AAGKGLHM5/nf972/2001_WaGa_sT_Dox_S4_R2_001.fastq.gz
     ./250411_VH00358_135_AAGKGLHM5/nf973/2001_WaGa_scr_DMSO_S5_R2_001.fastq.gz
     ./250411_VH00358_135_AAGKGLHM5/nf974/2001_WaGa_scr_Dox_S6_R2_001.fastq.gz
    
     #941,07 € 1051,79*0.85=894.0215
    
     #FINAL DATASETS for running: newDemulti_of_BATCH_1_3_4 + BATCH_100 - Samples[nf941, nf940]!, Using directly with the orignal sample names nf*.
    
             MKL-1 wt cells (nf780*, nf796*, nf797*)
             MKL-1 wt_EV_RNA (nf655*)
    
             WaGa wt cells (nf774* (Considering to be deleted, due to possibly be an outlier, but in the current version, it is still included in the analysis), nf961, nf962)
             WaGa wt_EV_RNA (nf657* (The sample was excluded, since it is obviously a outlier, not clustered with the other 2 samples), nf930, nf935)
             WaGa_sT_DMSO_EV_RNA (nf931, nf936, nf971)
             WaGa_sT_Dox_EV_RNA (nf932, nf937, nf972)
             WaGa_scr_DMSO_EV_RNA (nf933, nf938, nf973)
             WaGa_scr_Dox_EV_RNA (nf934, nf939, nf974)
  2. Adapter trimming

     #some common adapter sequences from different kits for reference:
     #    - TruSeq Small RNA (Illumina): TGGAATTCTCGGGTGCCAAGG
     #    - Small RNA Kits V1 (Illumina): TCGTATGCCGTCTTCTGCTTGT
     #    - Small RNA Kits V1.5 (Illumina): ATCTCGTATGCCGTCTTCTGCTTG
     #    - NEXTflex Small RNA Sequencing Kit v3 for Illumina Platforms (Bioo Scientific): TGGAATTCTCGGGTGCCAAGG
     #    - LEXOGEN Small RNA-Seq Library Prep Kit (Illumina): TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC *
     mkdir Data_Ute_exceRpt_workspace/trimmed; cd Data_Ute_exceRpt_workspace/trimmed
    
     echo "------------------------------------ cutadapting nf774 -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf774.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_4/230623_newDemulti_smallRNAs/220617_NB501882_0371_AH7572BGXM_smallRNA_Ute_newDemulti/2022_nf_ute_smallRNA/nf774/0403_WaGa_wt_S1_R1_001.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting nf657 -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf657.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_4/230623_newDemulti_smallRNAs/210817_NB501882_0294_AHW5Y2BGXJ_smallRNA_Ute_newDemulti/2021_nf_ute_smallRNA/nf657/WaGa_derived_EV_miRNA_S2_R1_001.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting nf655 -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf655.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_4/230623_newDemulti_smallRNAs/210817_NB501882_0294_AHW5Y2BGXJ_smallRNA_Ute_newDemulti/2021_nf_ute_smallRNA/nf655/MKL_1_derived_EV_miRNA_S1_R1_001.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting nf780 -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf780.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_4/230623_newDemulti_smallRNAs/220617_NB501882_0371_AH7572BGXM_smallRNA_Ute_newDemulti/2022_nf_ute_smallRNA/nf780/0505_MKL1_wt_S2_R1_001.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting nf796 -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf796.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_4/230623_newDemulti_smallRNAs/221216_NB501882_0404_AHLVNMBGXM_smallRNA_Ute_newDemulti/2022_nf_ute_smallRNA/nf796/MKL-1_wt_1_S1_R1_001.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting nf797 -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf797.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_4/230623_newDemulti_smallRNAs/221216_NB501882_0404_AHLVNMBGXM_smallRNA_Ute_newDemulti/2022_nf_ute_smallRNA/nf797/MKL-1_wt_2_S2_R1_001.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting nf930 -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf930.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_7/231016_NB501882_0435_AHG7HMBGXV/nf930/01_0505_WaGa_wt_EV_RNA_S1_R1_001.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting nf931 -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf931.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_7/231016_NB501882_0435_AHG7HMBGXV/nf931/02_0505_WaGa_sT_DMSO_EV_RNA_S2_R1_001.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting nf932 -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf932.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_7/231016_NB501882_0435_AHG7HMBGXV/nf932/03_0505_WaGa_sT_Dox_EV_RNA_S3_R1_001.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting nf933 -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf933.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_7/231016_NB501882_0435_AHG7HMBGXV/nf933/04_0505_WaGa_scr_DMSO_EV_RNA_S4_R1_001.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting nf934 -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf934.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_7/231016_NB501882_0435_AHG7HMBGXV/nf934/05_0505_WaGa_scr_Dox_EV_RNA_S5_R1_001.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting nf935 -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf935.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_7/231016_NB501882_0435_AHG7HMBGXV/nf935/06_1905_WaGa_wt_EV_RNA_S6_R1_001.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting nf936 -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf936.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_7/231016_NB501882_0435_AHG7HMBGXV/nf936/07_1905_WaGa_sT_DMSO_EV_RNA_S7_R1_001.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting nf937 -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf937.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_7/231016_NB501882_0435_AHG7HMBGXV/nf937/08_1905_WaGa_sT_Dox_EV_RNA_S8_R1_001.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting nf938 -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf938.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_7/231016_NB501882_0435_AHG7HMBGXV/nf938/09_1905_WaGa_scr_DMSO_EV_RNA_S9_R1_001.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting nf939 -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf939.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_7/231016_NB501882_0435_AHG7HMBGXV/nf939/10_1905_WaGa_scr_Dox_EV_RNA_S10_R1_001.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting nf940 -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf940.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_7/231016_NB501882_0435_AHG7HMBGXV/nf940/11_control_MKL1_S11_R1_001.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting nf941 -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf941.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_7/231016_NB501882_0435_AHG7HMBGXV/nf941/12_control_WaGa_S12_R1_001.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting nf961 -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf961.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_7/250411_VH00358_135_AAGKGLHM5/nf961/WaGaWTcells_1_S1_R1_001.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting nf962 -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf962.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_7/250411_VH00358_135_AAGKGLHM5/nf962/WaGaWTcells_2_S2_R1_001.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting nf971 -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf971.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_7/250411_VH00358_135_AAGKGLHM5/nf971/2001_WaGa_sT_DMSO_S3_R1_001.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting nf972 -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf972.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_7/250411_VH00358_135_AAGKGLHM5/nf972/2001_WaGa_sT_Dox_S4_R1_001.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting nf973 -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf973.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_7/250411_VH00358_135_AAGKGLHM5/nf973/2001_WaGa_scr_DMSO_S5_R1_001.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting nf974 -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf974.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_7/250411_VH00358_135_AAGKGLHM5/nf974/2001_WaGa_scr_Dox_S6_R1_001.fastq.gz >> LOG
  3. Install exceRpt (https://github.gersteinlab.org/exceRpt/)

     docker pull rkitchen/excerpt
     mkdir MyexceRptDatabase
     cd /mnt/nvme0n1p1/MyexceRptDatabase
     wget http://org.gersteinlab.excerpt.s3-website-us-east-1.amazonaws.com/exceRptDB_v4_hg38_lowmem.tgz
     tar -xvf exceRptDB_v4_hg38_lowmem.tgz
     #http://org.gersteinlab.excerpt.s3-website-us-east-1.amazonaws.com/exceRptDB_v4_hg19_lowmem.tgz
     #http://org.gersteinlab.excerpt.s3-website-us-east-1.amazonaws.com/exceRptDB_v4_hg38_lowmem.tgz
     #http://org.gersteinlab.excerpt.s3-website-us-east-1.amazonaws.com/exceRptDB_v4_mm10_lowmem.tgz
     wget http://org.gersteinlab.excerpt.s3-website-us-east-1.amazonaws.com/exceRptDB_v4_EXOmiRNArRNA.tgz
     tar -xvf exceRptDB_v4_EXOmiRNArRNA.tgz
     wget http://org.gersteinlab.excerpt.s3-website-us-east-1.amazonaws.com/exceRptDB_v4_EXOGenomes.tgz
     tar -xvf exceRptDB_v4_EXOGenomes.tgz
  4. Run exceRpt

     #[---- REAL_RUNNING_COMPLETE_DB ---->]
     #NOTE that if not renamed in the input files, then have to RENAME all files recursively by removing "_cutadapted.fastq" in all names in _CORE_RESULTS_v4.6.3.tgz (first unzip, removing, then zip, mv to ../results_g).
     cd trimmed
     for file in *.fastq.gz; do
         echo "mv \"$file\" \"${file/.fastq/}\""
     done
    
     mkdir results
     for sample in nf780 nf796 nf797  nf655    nf774 nf961 nf962  nf657 nf930 nf935  nf931 nf936 nf971  nf932 nf937 nf972  nf933 nf938 nf973  nf934 nf939 nf974; do
         docker run -v ~/DATA/Data_Ute_exceRpt_workspace/trimmed:/exceRptInput \
                    -v ~/DATA/Data_Ute_exceRpt_workspace/results:/exceRptOutput \
                   -v /mnt/nvme0n1p1/MyexceRptDatabase:/exceRpt_DB \
                   -t rkitchen/excerpt \
                   INPUT_FILE_PATH=/exceRptInput/${sample}.gz MAIN_ORGANISM_GENOME_ID=hg38 N_THREADS=50 JAVA_RAM='200G' MAP_EXOGENOUS=on
     done
    
     #DEBUG the excerpt env
     docker inspect rkitchen/excerpt:latest
     # Without /bin/bash → May run and exit immediately
     #docker run -it rkitchen/excerpt
     # With /bin/bash → Stays open for interaction
     docker run -it --entrypoint /bin/bash rkitchen/excerpt
  5. Processing exceRpt output from multiple samples

     cd ~/DATA/Data_Ute_exceRpt_workspace/exceRpt-master
     mamba activate r_env
     mamba install -c conda-forge -c bioconda \
         bioconductor-marray \
         bioconductor-rgraphviz \
         r-plyr r-gplots r-reshape2 r-ggplot2 r-scales r-openxlsx r-rcurl r-xml \
         -y
     mamba install -c conda-forge -c bioconda \
         r-plyr r-gplots r-reshape2 r-ggplot2 r-scales r-openxlsx \
         bioconductor-marray bioconductor-rgraphviz \
         -y
    
     #EXCLUDE some isolates since they have total different pattern --> outliner, e.g. nf657 from WaGa wt EV:
     sudo mv nf657* ../results_EXCLUDED/
     mkdir summaries summariesWithSampleGroupInfo
    
     (r_env) jhuang@WS-2290C:~/DATA/Data_Ute_exceRpt_workspace/exceRpt-master$ R
     #WARNING: need to reload the R-script after each change of the script.
     source("mergePipelineRuns_functions.R")
     processSamplesInDir("../results/", "../summaries")
    
     #mkdir summariesWithSampleGroupInfo; cp summaries/*.RData summariesWithSampleGroupInfo; rm summariesWithSampleGroupInfo/exceRpt_sampleGroupDefinitions.txt;
     source("mergePipelineRuns_functions_addSampleGroupInfo.R")
     processSamplesInDir("../results/", "../summariesWithSampleGroupInfo")
     #!!!!! IMPORTANT: REPORT summariesWithSampleGroupInfo/exceRpt_DiagnosticPlots.pdf !!!!!
     #POSSIBLE_TODO (MAYBE BETTER NOT DO THIS, SINCE I HAVE TO GENERATE PCA- and MANHATTAN PLOTS!!): Wir need to delete nf774 from WaGa_wt_cells, since it is very far to another two samples!
    
     #~/Tools/csv2xls-0.4/csv_to_xls.py exceRpt_miRNA_ReadsPerMillion.txt exceRpt_tRNA_ReadsPerMillion.txt exceRpt_piRNA_ReadsPerMillion.txt -d$'\t' -o exceRpt_results_detailed.xls
  6. Downstream analyis using R for miRNAs (4 MKL-1 + 17 WaGa samples)

     #Input file
     #exceRpt_miRNA_ReadCounts.txt
     #exceRpt_piRNA_ReadCounts.txt
    
     ## MKL-1 wild-type controls
     #MKL-1_wt_cells (nf780, nf796, nf797)
     #MKL-1_wt_EV (nf655)
     #
     ## WaGa experimental groups (scr = scramble control; sT = target knockdown)
     #WaGa_scr_DMSO_EV (nf933, nf938, nf973)
     #WaGa_scr_Dox_EV (nf934, nf939, nf974)
     #WaGa_sT_DMSO_EV (nf931, nf936, nf971)
     #WaGa_sT_Dox_EV (nf932, nf937, nf972)
     #
     ## WaGa wild-type controls
     #WaGa_wt_cells (nf774, nf961, nf962)
     #WaGa_wt_EV (nf657 (deleted), nf930, nf935)
    
     cd ~/DATA/Data_Ute_exceRpt_workspace/summaries
     mamba activate r_env
     R
     #> .libPaths()
     #[1] "/home/jhuang/mambaforge/envs/r_env/lib/R/library"
    
     #BiocManager::install("AnnotationDbi")
     #BiocManager::install("clusterProfiler")
     #BiocManager::install(c("ReactomePA","org.Hs.eg.db"))
     #BiocManager::install("limma")
     #BiocManager::install("sva")
     #install.packages("writexl")
     #install.packages("openxlsx")
     library("AnnotationDbi")
     library("clusterProfiler")
     library("ReactomePA")
     library("org.Hs.eg.db")
     library(DESeq2)
     library(gplots)
     library(limma)
     library(sva)
     #library(writexl)  #d.raw_with_rownames <- cbind(RowNames = rownames(d.raw), d.raw); write_xlsx(d.raw, path = "d_raw.xlsx");
     library(openxlsx)
    
     setwd("../summaries/")
     d.raw<- read.delim2("exceRpt_miRNA_ReadCounts.txt",sep="\t", header=TRUE, row.names=1)
    
     # Desired column order
     desired_order <- c(
         "nf780", "nf796", "nf797",
         "nf655",
         "nf933", "nf938", "nf973",
         "nf934", "nf939", "nf974",
         "nf931", "nf936", "nf971",
         "nf932", "nf937", "nf972",
         "nf774", "nf961", "nf962",
         "nf930", "nf935"  #delete "nf657",
     )
    
     # Reorder columns
     d.raw <- d.raw[, desired_order]
     setdiff(desired_order, colnames(d.raw))  # Shows missing or misnamed columns
     #sapply(d.raw, is.numeric)
     d.raw[] <- lapply(d.raw, as.numeric)
     #d.raw[] <- lapply(d.raw, function(x) as.numeric(as.character(x)))
     d.raw <- round(d.raw)
     write.csv(d.raw, file ="d_raw.csv")
     write.xlsx(d.raw, file = "d_raw.xlsx", rowNames = TRUE)
    
     # ------ Code sent to Ute ------
     #d.raw <- read.delim2("d_raw.csv",sep=",", header=TRUE, row.names=1)
     Cell_or_EV = as.factor(c("Cell","Cell","Cell",  "EV",  "EV","EV","EV",  "EV","EV","EV",  "EV","EV","EV",  "EV","EV","EV",  "Cell","Cell","Cell",  "EV","EV"))  #delete "EV",
    
     replicates = as.factor(c("MKL-1_wt_cells","MKL-1_wt_cells","MKL-1_wt_cells",  "MKL-1_wt_EV",  "WaGa_scr_DMSO_EV","WaGa_scr_DMSO_EV","WaGa_scr_DMSO_EV",     "WaGa_scr_Dox_EV","WaGa_scr_Dox_EV","WaGa_scr_Dox_EV",  "WaGa_sT_DMSO_EV","WaGa_sT_DMSO_EV","WaGa_sT_DMSO_EV",  "WaGa_sT_Dox_EV","WaGa_sT_Dox_EV","WaGa_sT_Dox_EV",  "WaGa_wt_cells", "WaGa_wt_cells","WaGa_wt_cells",  "WaGa_wt_EV", "WaGa_wt_EV"))  #delete "WaGa_wt_EV",
     ids = as.factor(c("nf780", "nf796", "nf797",
         "nf655",
         "nf933", "nf938", "nf973",
         "nf934", "nf939", "nf974",
         "nf931", "nf936", "nf971",
         "nf932", "nf937", "nf972",
         "nf774", "nf961", "nf962",
         "nf930", "nf935"))  #delete "nf657",
     cData = data.frame(row.names=colnames(d.raw), replicates=replicates, ids=ids, Cell_or_EV=Cell_or_EV)
     dds<-DESeqDataSetFromMatrix(countData=d.raw, colData=cData, design=~replicates)
    
     # Filter low-count miRNAs
     dds <- dds[ rowSums(counts(dds)) > 10, ]  #1322-->903
     rld <- rlogTransformation(dds)
    
     # -- before pca --
     png("pca.png", 1200, 800)
     plotPCA(rld, intgroup=c("replicates"))
     #plotPCA(rld, intgroup = c("replicates", "batch"))
     #plotPCA(rld, intgroup = c("replicates", "ids"))
     #plotPCA(rld, "batch")
     dev.off()
     png("pca2.png", 1200, 800)
     #plotPCA(rld, intgroup=c("replicates"))
     #plotPCA(rld, intgroup = c("replicates", "batch"))
     plotPCA(rld, intgroup = c("replicates", "ids"))
     #plotPCA(rld, "batch")
     dev.off()
    
     # Batch Effect Removal Methods (Non-batch effect removal applied!)
    
     #### STEP2: DEGs ####
     #- Heatmap untreated/wt vs parental; 1x for WaGa cell line
     #- Volcano plot untreated/wt vs parental; 1x for WaGa cell line
     #- Manhattan plot miRNAs; 1x for WaGa cell line
     #- Distribution of different small RNA species untreated/wt and parental; 1x for WaGa cell line
     #- Motif analysis: identify RNA-binding proteins that may regulate small RNA loading; 1x for WaGa cell line
    
     #convert bam to bigwig using deepTools by feeding inverse of DESeq’s size Factor
     sizeFactors(dds)
     #NULL
     dds <- estimateSizeFactors(dds)
     sizeFactors(dds)
     normalized_counts <- counts(dds, normalized=TRUE)
     write.table(normalized_counts, file="normalized_counts.txt", sep="\t", quote=F, col.names=NA)
     write.xlsx(normalized_counts, file = "normalized_counts.xlsx", rowNames = TRUE)
    
     #---- untreated, scr_control, DMSO_control, scr_DMSO_control, sT_knockdown to parental_cells ----
     dds<-DESeqDataSetFromMatrix(countData=d.raw, colData=cData, design=~replicates)
    
     dds$replicates <- relevel(dds$replicates, "untreated")
     dds = DESeq(dds, betaPrior=FALSE)
     resultsNames(dds)
     clist <- c("DMSO_control_vs_untreated", "scr_control_vs_untreated", "scr_DMSO_control_vs_untreated", "sT_knockdown_vs_untreated")
    
     dds$replicates <- relevel(dds$replicates, "DMSO_control")
     dds = DESeq(dds, betaPrior=FALSE)
     resultsNames(dds)
     clist <- c("sT_knockdown_vs_DMSO_control")
    
     dds$replicates <- relevel(dds$replicates, "scr_control")
     dds = DESeq(dds, betaPrior=FALSE)
     resultsNames(dds)
     clist <- c("sT_knockdown_vs_scr_control")
    
     dds$replicates <- relevel(dds$replicates, "scr_DMSO_control")
     dds = DESeq(dds, betaPrior=FALSE)
     resultsNames(dds)
     clist <- c("sT_knockdown_vs_scr_DMSO_control")
    
     dds$replicates <- relevel(dds$replicates, "WaGa_wt_cells")
     dds = DESeq(dds, betaPrior=FALSE)  #default betaPrior is FALSE
     resultsNames(dds)
     clist <- c("WaGa_wt_EV_vs_WaGa_wt_cells")
    
     dds$replicates <- relevel(dds$replicates, "MKL-1_wt_cells")
     dds = DESeq(dds, betaPrior=FALSE)  #default betaPrior is FALSE
     resultsNames(dds)
     clist <- c("MKL.1_wt_EV_vs_MKL.1_wt_cells")
    
     #NOTE that the results sent to Ute is |padj|<=0.1.
     for (i in clist) {
         contrast = paste("replicates", i, sep="_")
         res = results(dds, name=contrast)
         res <- res[!is.na(res$log2FoldChange),]
         #https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#why-are-some-p-values-set-to-na
         res$padj <- ifelse(is.na(res$padj), 1, res$padj)
         res_df <- as.data.frame(res)
         write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(i, "all.txt", sep="-"))
         up <- subset(res_df, padj<=0.05 & log2FoldChange>=2)
         down <- subset(res_df, padj<=0.05 & log2FoldChange<=-2)
         write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(i, "up.txt", sep="-"))
         write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(i, "down.txt", sep="-"))
     }
    
     ~/Tools/csv2xls-0.4/csv_to_xls.py \
     untreated_vs_parental_cells-all.txt \
     untreated_vs_parental_cells-up.txt \
     untreated_vs_parental_cells-down.txt \
     -d$',' -o untreated_vs_parental_cells.xls;
    
     ~/Tools/csv2xls-0.4/csv_to_xls.py \
     DMSO_control_vs_untreated-all.txt \
     DMSO_control_vs_untreated-up.txt \
     DMSO_control_vs_untreated-down.txt \
     -d$',' -o DMSO_control_vs_untreated.xls;
    
     ~/Tools/csv2xls-0.4/csv_to_xls.py \
     scr_control_vs_untreated-all.txt \
     scr_control_vs_untreated-up.txt \
     scr_control_vs_untreated-down.txt \
     -d$',' -o scr_control_vs_untreated.xls;
    
     ~/Tools/csv2xls-0.4/csv_to_xls.py \
     scr_DMSO_control_vs_untreated-all.txt \
     scr_DMSO_control_vs_untreated-up.txt \
     scr_DMSO_control_vs_untreated-down.txt \
     -d$',' -o scr_DMSO_control_vs_untreated.xls;
    
     ~/Tools/csv2xls-0.4/csv_to_xls.py \
     sT_knockdown_vs_untreated-all.txt \
     sT_knockdown_vs_untreated-up.txt \
     sT_knockdown_vs_untreated-down.txt \
     -d$',' -o sT_knockdown_vs_untreated.xls;
    
     ~/Tools/csv2xls-0.4/csv_to_xls.py \
     sT_knockdown_vs_DMSO_control-all.txt \
     sT_knockdown_vs_DMSO_control-up.txt \
     sT_knockdown_vs_DMSO_control-down.txt \
     -d$',' -o sT_knockdown_vs_DMSO_control.xls;
    
     ~/Tools/csv2xls-0.4/csv_to_xls.py \
     sT_knockdown_vs_scr_control-all.txt \
     sT_knockdown_vs_scr_control-up.txt \
     sT_knockdown_vs_scr_control-down.txt \
     -d$',' -o sT_knockdown_vs_scr_control.xls;
    
     ~/Tools/csv2xls-0.4/csv_to_xls.py \
     sT_knockdown_vs_scr_DMSO_control-all.txt \
     sT_knockdown_vs_scr_DMSO_control-up.txt \
     sT_knockdown_vs_scr_DMSO_control-down.txt \
     -d$',' -o sT_knockdown_vs_scr_DMSO_control.xls;
    
     # ------------------- volcano_plot -------------------
     library(ggplot2)
     library(ggrepel)
    
     geness_res <- read.csv(file = paste("untreated_vs_parental_cells", "all.txt", sep="-"), row.names=1)
    
     external_gene_name <- rownames(geness_res)
     geness_res <- cbind(geness_res, external_gene_name)
     #top_g are from ids
     top_g <- c("hsa-let-7b-5p","hsa-let-7g-5p","hsa-let-7i-5p","hsa-miR-103a-3p","hsa-miR-107","hsa-miR-1224-5p","hsa-miR-122-5p","hsa-miR-1226-5p","hsa-miR-1246","hsa-miR-127-3p","hsa-miR-1290","hsa-miR-130a-3p","hsa-miR-139-3p","hsa-miR-141-3p","hsa-miR-143-3p","hsa-miR-148b-3p","hsa-miR-155-5p","hsa-miR-15a-5p","hsa-miR-17-5p","hsa-miR-184","hsa-miR-18a-3p","hsa-miR-18a-5p","hsa-miR-190a-5p","hsa-miR-191-5p","hsa-miR-193b-5p","hsa-miR-197-5p","hsa-miR-200a-3p","hsa-miR-200b-5p","hsa-miR-206","hsa-miR-20a-5p","hsa-miR-210-3p","hsa-miR-2110","hsa-miR-21-5p","hsa-miR-218-5p","hsa-miR-219a-1-3p","hsa-miR-221-3p","hsa-miR-23b-3p","hsa-miR-27a-3p","hsa-miR-27b-3p","hsa-miR-27b-5p","hsa-miR-28-3p","hsa-miR-30a-5p","hsa-miR-30c-5p","hsa-miR-30e-5p","hsa-miR-3127-5p","hsa-miR-3131","hsa-miR-3180|hsa-miR-3180-3p","hsa-miR-320a","hsa-miR-320b","hsa-miR-320c","hsa-miR-320d","hsa-miR-330-3p","hsa-miR-335-3p","hsa-miR-33b-5p","hsa-miR-340-5p","hsa-miR-342-5p","hsa-miR-3605-5p","hsa-miR-361-3p","hsa-miR-365a-5p","hsa-miR-374b-5p","hsa-miR-378i","hsa-miR-379-5p","hsa-miR-3940-5p","hsa-miR-409-3p","hsa-miR-411-5p","hsa-miR-423-3p","hsa-miR-423-5p","hsa-miR-4286","hsa-miR-429","hsa-miR-432-5p","hsa-miR-4326","hsa-miR-451a","hsa-miR-4520-3p","hsa-miR-454-3p","hsa-miR-4646-5p","hsa-miR-4667-5p","hsa-miR-4748","hsa-miR-483-5p","hsa-miR-486-5p","hsa-miR-5010-5p","hsa-miR-504-3p","hsa-miR-5187-5p","hsa-miR-590-3p","hsa-miR-6128","hsa-miR-625-5p","hsa-miR-6726-5p","hsa-miR-6730-5p","hsa-miR-676-3p","hsa-miR-6767-5p","hsa-miR-6777-5p","hsa-miR-6780a-5p","hsa-miR-6794-5p","hsa-miR-6817-3p","hsa-miR-708-5p","hsa-miR-7-5p","hsa-miR-766-5p","hsa-miR-7854-3p","hsa-miR-873-3p","hsa-miR-885-3p","hsa-miR-92b-5p","hsa-miR-93-5p","hsa-miR-937-3p","hsa-miR-9-5p","hsa-miR-98-5p")
     subset(geness_res, external_gene_name %in% top_g & pvalue < 0.05 & (abs(geness_res$log2FoldChange) >= 2.0))
     geness_res$Color <- "NS or log2FC < 2.0"
     geness_res$Color[geness_res$pvalue < 0.05] <- "P < 0.05"
     geness_res$Color[geness_res$padj < 0.05] <- "P-adj < 0.05"
     geness_res$Color[abs(geness_res$log2FoldChange) < 2.0] <- "NS or log2FC < 2.0"
    
     write.csv(geness_res, "untreated_vs_parental_cells_with_Category.csv")
     geness_res$invert_P <- (-log10(geness_res$pvalue)) * sign(geness_res$log2FoldChange)
    
     geness_res <- geness_res[, -1*ncol(geness_res)]
     png("volcano_plot_untreated_vs_parental_cells.png",width=1200, height=1400)
     #svg("untreated_vs_parental_cells.svg",width=12, height=14)
     ggplot(geness_res,       aes(x = log2FoldChange, y = -log10(pvalue),           color = Color, label = external_gene_name)) +       geom_vline(xintercept = c(2.0, -2.0), lty = "dashed") +       geom_hline(yintercept = -log10(0.05), lty = "dashed") +       geom_point() +       labs(x = "log2(FC)", y = "Significance, -log10(P)", color = "Significance") +       scale_color_manual(values = c("P < 0.05"="orange","P-adj < 0.05"="red","NS or log2FC < 2.0"="darkgray"),guide = guide_legend(override.aes = list(size = 4))) + scale_y_continuous(expand = expansion(mult = c(0,0.05))) +       geom_text_repel(data = subset(geness_res, external_gene_name %in% top_g & pvalue < 0.05 & (abs(geness_res$log2FoldChange) >= 2.0)), size = 4, point.padding = 0.15, color = "black", min.segment.length = .1, box.padding = .2, lwd = 2) +       theme_bw(base_size = 16) +       theme(legend.position = "bottom")
     dev.off()
    
     # ------------------ differentially_expressed_miRNAs_heatmap -----------------
     # prepare all_genes
     rld <- rlogTransformation(dds)
     mat <- assay(rld)
     mm <- model.matrix(~replicates, colData(rld))
     mat <- limma::removeBatchEffect(mat, batch=rld$batch, design=mm)
     assay(rld) <- mat
     RNASeq.NoCellLine <- assay(rld)
    
     # reorder the columns
     #colnames(RNASeq.NoCellLine) = c("0505 WaGa sT DMSO","1905 WaGa sT DMSO","0505 WaGa sT Dox","1905 WaGa sT Dox","0505 WaGa scr DMSO","1905 WaGa scr DMSO","0505 WaGa scr Dox","1905 WaGa scr Dox","0505 WaGa wt","1905 WaGa wt","control MKL1","control WaGa")
     #col.order <-c("control MKL1",  "control WaGa","0505 WaGa wt","1905 WaGa wt","0505 WaGa sT DMSO","1905 WaGa sT DMSO","0505 WaGa sT Dox","1905 WaGa sT Dox","0505 WaGa scr DMSO","1905 WaGa scr DMSO","0505 WaGa scr Dox","1905 WaGa scr Dox")
     #RNASeq.NoCellLine <- RNASeq.NoCellLine[,col.order]
    
     #Option4: manully defining
     #for i in untreated_vs_parental_cells    sT_knockdown_vs_untreated DMSO_control_vs_untreated scr_control_vs_untreated scr_DMSO_control_vs_untreated    sT_knockdown_vs_DMSO_control sT_knockdown_vs_scr_control sT_knockdown_vs_scr_DMSO_control; do
     #  echo "cut -d',' -f1-1 ${i}-up.txt > ${i}-up.id";
     #  echo "cut -d',' -f1-1 ${i}-down.txt > ${i}-down.id";
     #done
     #cat *.id | sort -u > ids
     ##add Gene_Id in the first line, delete the ""
     GOI <- read.csv("ids")$Gene_Id
     datamat = RNASeq.NoCellLine[GOI, ]
    
     # clustering the genes and draw heatmap
     #datamat <- datamat[,-1]  #delete the sample "control MKL1"
     #datamat <- datamat[, 1:5]
    
     #parental_cells_1 parental_cells_2 parental_cells_3    untreated_1 untreated_2    scr_control_1 scr_control_2 scr_control_3     DMSO_control_1 DMSO_control_2 DMSO_control_3    scr_DMSO_control_1 scr_DMSO_control_2 scr_DMSO_control_3    sT_knockdown_1 sT_knockdown_2 sT_knockdown_3 -->
     #parental cells 1 parental cells 2 parental cells 3    untreated 1 untreated 2    scr control 1 scr control 2 scr control 3    DMSO control 1 DMSO control 2 DMSO control 3    scr DMSO control 1 scr DMSO control 2 scr DMSO control 3    sT knockdown 1 sT knockdown 2 sT knockdown 3
     colnames(datamat)[1] <- "parental cells 1"
     colnames(datamat)[2] <- "parental cells 2"
     colnames(datamat)[3] <- "parental cells 3"
     colnames(datamat)[4] <- "untreated 1"
     colnames(datamat)[5] <- "untreated 2"
     colnames(datamat)[6] <- "scr control 1"
     colnames(datamat)[7] <- "scr control 2"
     colnames(datamat)[8] <- "scr control 3"
     colnames(datamat)[9] <- "DMSO control 1"
     colnames(datamat)[10] <- "DMSO control 2"
     colnames(datamat)[11] <- "DMSO control 3"
     colnames(datamat)[12] <- "scr DMSO control 1"
     colnames(datamat)[13] <- "scr DMSO control 2"
     colnames(datamat)[14] <- "scr DMSO control 3"
     colnames(datamat)[15] <- "sT knockdown 1"
     colnames(datamat)[16] <- "sT knockdown 2"
     colnames(datamat)[17] <- "sT knockdown 3"
    
     write.csv(datamat, file ="gene_expression_keeping_replicates.txt")
     write.xlsx(datamat, file = "gene_expression_keeping_replicates.xlsx", rowNames = TRUE)
     #"ward.D"’, ‘"ward.D2"’,‘"single"’, ‘"complete"’, ‘"average"’ (= UPGMA), ‘"mcquitty"’(= WPGMA), ‘"median"’ (= WPGMC) or ‘"centroid"’ (= UPGMC)
     hr <- hclust(as.dist(1-cor(t(datamat), method="pearson")), method="complete")
     hc <- hclust(as.dist(1-cor(datamat, method="spearman")), method="complete")
     mycl = cutree(hr, h=max(hr$height)/1.1)
     mycol = c("YELLOW", "BLUE", "ORANGE", "CYAN", "GREEN", "MAGENTA", "GREY", "LIGHTCYAN", "RED",     "PINK", "DARKORANGE", "MAROON",  "LIGHTGREEN", "DARKBLUE",  "DARKRED",   "LIGHTBLUE", "DARKCYAN",  "DARKGREEN", "DARKMAGENTA");
     mycol = mycol[as.vector(mycl)]
    
     rownames(datamat) <- sub("\\|.*", "", rownames(datamat))
    
     png("DEGs_heatmap_keeping_replicates.png", width=1000, height=1400)
     #svg("DEGs_heatmap_keeping_replicates.svg", width=6, height=8)
     heatmap.2(as.matrix(datamat),
         Rowv=as.dendrogram(hr),
         Colv=NA,
         dendrogram='row',
         labRow=row.names(datamat),
         scale='row',
         trace='none',
         col=bluered(75),
         RowSideColors=mycol,
         srtCol=30,
         lhei=c(1,8),
         cexRow=1.4,   # Increase row label font size
         cexCol=1.7,    # Increase column label font size
         margin=c(8, 12)
         )
     dev.off()
    
     # ----------------------------------------
     # ----------- manhattan_plot -------------
    
     Rscript manhattan_plot_Carmen_custom_labels.R  #exceRpt_miRNA_ReadCounts.txt

exceRpt流程热图分析 (Data_Ute_exceRpt_workspace)

exceRpt_DiagnosticPlots

exceRpt_DiagnosticPlots

exceRpt流程热图分析(中文翻译)

行标签详细说明

根据您提供的mapping_heatmap3.pdf热图,以下是各类别的详细解释:

1. 外源性序列(Exogenous Sequences)

  • exogenous_genomes: 比对到外源基因组(病毒/细菌)的reads
  • exogenous_rRNA: 比对到外源rRNA的reads
  • exogenous_miRNA: 比对到外源miRNA的reads

2. 内源性基因组特征

  • genome: 比对到参考基因组的reads(这是主要类别
  • endogenous_gapped: 需要间隔比对的内源性reads(剪接reads)
  • repetitiveElements: 比对到重复序列的reads
  • not_mapped_to_genome_or_libs: 未比对到基因组或小RNA文库的reads

3. 编码/非编码RNA(GENCODE)

  • gencode_sense: 比对到GENCODE注释基因的正链reads(mRNA、lncRNA等)
  • gencode_antisense: 反义链reads

4. 小RNA类别

  • miRNA_sense: 成熟miRNA正链(重要类别
  • miRNA_antisense: 成熟miRNA反义链
  • miRNAprecursor_sense: miRNA前体正链
  • miRNAprecursor_antisense: miRNA前体反义链
  • tRNA_sense/antisense: 转运RNA
  • piRNA_sense/antisense: PIWI互作RNA
  • circularRNA_sense/antisense: 环状RNA

不同组别的主要内容变化趋势

📊 关键发现:

1. 基因组比对效率(genome行)

  • MKL-1 wt: 82.9%, 61.4%, 97.4% – 变异很大
  • MKL-1 wt EV: 97.9% – 非常高
  • WaGa wt: 99.0%, 99.0%, 91.9% – 最高且稳定
  • WaGa wt EV: 91.0%, 90.6%, 89.2% – 略低于wt

趋势: WaGa野生型样本显示最高的基因组比对效率(91-99%)

2. miRNA含量(miRNA_sense行)- 🔴 最重要发现

  • MKL-1 wt: 41.0%, 21.1%, 3.7% – 剧烈变化
  • MKL-1 wt EV: 1.9% – 极低
  • WaGa scr DMSO EV: 1.6%, 1.0%, 7.1% – 低miRNA
  • WaGa wt: 88.7%, 86.6%, 81.1% – 异常高的miRNA!
  • WaGa wt EV: 79.3%, 77.9%, 70.9% – 仍然非常高

关键发现: WaGa野生型细胞的miRNA含量是MKL-1的20-40倍!这表明细胞系之间存在根本性的生物学差异。

3. 未比对reads(not_mapped_to_genome_or_libs)

  • MKL-1 wt: 14.8%, 21.7%, 0.5% – 变异大
  • WaGa wt: 2.9%, 3.0%, 1.3% – 最低
  • WaGa wt EV: 1.4%, 1.2%, 1.7% – 非常低

趋势: WaGa样本总体比对效率更好(<3%未比对)

4. gencode(mRNA/lncRNA)

  • MKL-1 wt: 0.0-1.0% – 极低
  • WaGa wt: 0.4%, 0.5%, 1.6% – 较高
  • WaGa wt EV: 1.4%, 1.1%, 0.9% – EV中mRNA升高

5. 外源性内容

  • 所有组: exogenous_miRNA, exogenous_rRNA, exogenous_genomes均为0.0%

✅ 好消息: 所有样本均未检测到病毒或细菌污染


🎯 生物学意义总结

1. 细胞系特异性miRNA谱

  • WaGa细胞: miRNA丰富(70-89%的reads)
  • MKL-1细胞: miRNA贫乏(2-41%的reads)
  • 意义: WaGa可能是更好的miRNA研究模型

2. EV vs 细胞RNA

  • MKL-1 wt EV: miRNA降低(1.9%)vs 细胞(3.7-41%)
  • WaGa wt EV: 维持高miRNA(71-79%)vs 细胞(81-89%)
  • 解释: WaGa细胞优先将miRNA包装到EV中

3. 质量控制

  • 最佳质量: WaGa wt和WaGa wt EV(低未比对,高基因组比对)
  • 变异质量: MKL-1 wt(样本间变异大)

4. 建议

  1. 调查MKL-1变异: 为什么nf796只有21% miRNA而nf780有41%?
  2. WaGa作为miRNA模型: WaGa细胞更适合miRNA/EV-miRNA研究
  3. EV包装机制: WaGa细胞显示高效的miRNA包装到EV中 – 值得进一步研究

TODO: 进行统计分析或创建可视化图表来突出这些趋势!



Detailed Explanation of exceRpt Pipeline Row Labels

Based on your heatmap from the exceRpt pipeline, let me explain each category and specifically address your question about the miRNA percentages.

Pipeline Flow Overview

The exceRpt pipeline processes reads through sequential filtering and alignment stages. Each percentage is normalized to reads_used_for_alignment (the denominator after quality control).


Row Label Categories Explained

1. Initial Processing Stages

  • input: Total raw reads entering the pipeline (100%)
  • successfully_clipped: Reads with adapters successfully removed
  • failed_quality_filter: Reads failing quality thresholds (Phred score, length)
  • failed_homopolymer_filter: Reads with excessive homopolymer stretches
  • calibrator: Reads mapping to spike-in calibrator controls
  • UniVec_contaminants: Reads identified as vector/contaminant sequences
  • rRNA: Reads mapping to ribosomal RNA (typically removed)
  • reads_used_for_alignment: Clean reads proceeding to alignment (this is the new 100% for downstream percentages)

2. Endogenous Alignment (Human/Host Genome)

These reads map to the reference genome of the host organism:

  • genome: Reads mapping to the reference genome
  • miRNA_sense/antisense: Reads mapping to endogenous miRNAs in sense/antisense orientation
  • miRNAprecursor_sense/antisense: Reads mapping to miRNA precursor hairpins
  • tRNA_sense/antisense: Transfer RNA mappings
  • piRNA_sense/antisense: PIWI-interacting RNA mappings
  • gencode_sense/antisense: Reads mapping to GENCODE gene annotations (mRNA, lncRNA, etc.)
  • circularRNA_sense/antisense: Circular RNA (circRNA) mappings
  • not_mapped_to_genome_or_libs: Reads that didn’t map to genome or small RNA libraries
  • repetitiveElements: Reads mapping to repetitive regions (LINE, SINE, etc.)
  • endogenous_gapped: Reads requiring gapped alignment (spliced reads)

3. Exogenous Alignment (Foreign/Viral Sequences)

This is the key to your question!

After endogenous alignment, unmapped reads are tested against exogenous (foreign) databases:

  • input_to_exogenous_miRNA: Reads sent to exogenous miRNA alignment
  • exogenous_miRNA: Reads mapping to viral/foreign miRNAs
  • input_to_exogenous_rRNA: Reads sent to exogenous rRNA alignment
  • exogenous_rRNA: Reads mapping to foreign rRNA (bacterial, viral)
  • input_to_exogenous_genomes: Reads sent to exogenous genome alignment
  • exogenous_genomes: Reads mapping to viral/bacterial genomes

🔍 Answering Your Specific Question

For sample nf796:

miRNA_sense:              1.3%
miRNA_antisense:          0.0%
input_to_exogenous_miRNA: 4.4%
exogenous_miRNA:          0.0%

Why this pattern?

The pipeline works sequentially:

  1. Endogenous miRNA alignment (first pass):

    • 1.3% of reads mapped to human endogenous miRNAs (sense strand)
    • 0.0% mapped to antisense miRNAs
    • These are known human miRNAs from miRBase
  2. Exogenous miRNA alignment (second pass):

    • The 4.4% represents reads that:
      • Did NOT map to endogenous human miRNAs
      • Were sent to search viral/foreign miRNA databases
    • The 0.0% means none of these 4.4% actually matched viral miRNAs
    • These are likely:
      • Degraded RNA fragments
      • Novel miRNAs not in databases
      • Sequencing artifacts
      • miRNAs from organisms not in the exogenous database

Visual Flow:

reads_used_for_alignment (100%)
         ↓
   [Endogenous alignment]
         ↓
   miRNA_sense (1.3%) ← Human miRNAs
   miRNA_antisense (0.0%)
         ↓
   Unmapped reads
         ↓
   [Exogenous alignment]
         ↓
   input_to_exogenous_miRNA (4.4%) ← Sent to viral DB
         ↓
   exogenous_miRNA (0.0%) ← Actually matched viral miRNAs
         ↓
   Remaining 4.4% → discarded or used for other exogenous searches

Key Distinction

Category Database Example
miRNA_sense Human miRBase hsa-miR-21-5p
exogenous_miRNA Viral miRNA DB EBV-miR-BART1, KSHV-miR-K12

Why is input_to_exogenous_miRNA > miRNA_sense?

This is normal and indicates:

  1. Most reads did NOT map to endogenous miRNAs
  2. The pipeline casts a wide net, sending many unmapped reads to exogenous searches
  3. The 4.4% represents “candidate” reads that might be foreign miRNAs
  4. The 0.0% final match means no viral miRNAs were detected in nf796

If you had a virus-infected sample, you’d see:

  • input_to_exogenous_miRNA: e.g., 5%
  • exogenous_miRNA: e.g., 3.2% (actual viral miRNA matches)

Other Notable Patterns in Your Heatmap

Looking at your samples:

  • nf796, nf797, nf655: Very low genome mapping (~21-45%), high “not_mapped” (~55-79%)
  • nf930-nf974: High genome mapping (~66-89%), suggesting better quality or different sample type
  • MKL-1 vs WaGa: Different cell lines showing distinct mapping profiles

This suggests nf796/nf797/nf655 might be:

  • Lower quality RNA
  • Different library preparation
  • Higher contamination
  • Or genuinely different biology (e.g., more extracellular vesicles with fragmented RNA)

Does this clarify the pipeline stages and the miRNA percentages? Let me know if you need more details about any specific category!

Interhost variant calling (Data_Tam_DNAseq_2026_19606wt_dAB_dIJ_mito_flu_on_ATCC19606)

  1. Input data:

     mkdir bacto; cd bacto;
     mkdir raw_data; cd raw_data;
    
     # ── Fluoxetine Dataset ──
     ln -s ../../X101SC25116512-Z01-J003/01.RawData/19606△ABfluEcef-1/19606△ABfluEcef-1_1.fq.gz flu_dAB_cef_R1.fastq.gz
     ln -s ../../X101SC25116512-Z01-J003/01.RawData/19606△ABfluEcef-1/19606△ABfluEcef-1_2.fq.gz flu_dAB_cef_R2.fastq.gz
     ln -s ../../X101SC25116512-Z01-J003/01.RawData/19606△ABfluEcipro-2/19606△ABfluEcipro-2_1.fq.gz flu_dAB_cipro_R1.fastq.gz
     ln -s ../../X101SC25116512-Z01-J003/01.RawData/19606△ABfluEcipro-2/19606△ABfluEcipro-2_2.fq.gz flu_dAB_cipro_R2.fastq.gz
     ln -s ../../X101SC25116512-Z01-J003/01.RawData/19606△ABfluEdori-2/19606△ABfluEdori-2_1.fq.gz flu_dAB_dori_R1.fastq.gz
     ln -s ../../X101SC25116512-Z01-J003/01.RawData/19606△ABfluEdori-2/19606△ABfluEdori-2_2.fq.gz flu_dAB_dori_R2.fastq.gz
     ln -s ../../X101SC25116512-Z01-J003/01.RawData/19606△ABfluEnitro-3/19606△ABfluEnitro-3_1.fq.gz flu_dAB_nitro_R1.fastq.gz
     ln -s ../../X101SC25116512-Z01-J003/01.RawData/19606△ABfluEnitro-3/19606△ABfluEnitro-3_2.fq.gz flu_dAB_nitro_R2.fastq.gz
     ln -s ../../X101SC25116512-Z01-J003/01.RawData/19606△ABfluEpip-1/19606△ABfluEpip-1_1.fq.gz flu_dAB_pip_R1.fastq.gz
     ln -s ../../X101SC25116512-Z01-J003/01.RawData/19606△ABfluEpip-1/19606△ABfluEpip-1_2.fq.gz flu_dAB_pip_R2.fastq.gz
     ln -s ../../X101SC25116512-Z01-J003/01.RawData/19606△ABfluEpolyB-3/19606△ABfluEpolyB-3_1.fq.gz flu_dAB_polyB_R1.fastq.gz
     ln -s ../../X101SC25116512-Z01-J003/01.RawData/19606△ABfluEpolyB-3/19606△ABfluEpolyB-3_2.fq.gz flu_dAB_polyB_R2.fastq.gz
     ln -s ../../X101SC25116512-Z01-J003/01.RawData/19606△ABfluEtet-1/19606△ABfluEtet-1_1.fq.gz flu_dAB_tet_R1.fastq.gz
     ln -s ../../X101SC25116512-Z01-J003/01.RawData/19606△ABfluEtet-1/19606△ABfluEtet-1_2.fq.gz flu_dAB_tet_R2.fastq.gz
     ln -s ../../X101SC25116512-Z01-J003/01.RawData/19606△IJfluEcef-4/19606△IJfluEcef-4_1.fq.gz flu_dIJ_cef_R1.fastq.gz
     ln -s ../../X101SC25116512-Z01-J003/01.RawData/19606△IJfluEcef-4/19606△IJfluEcef-4_2.fq.gz flu_dIJ_cef_R2.fastq.gz
     ln -s ../../X101SC25116512-Z01-J003/01.RawData/19606△IJfluEcipro-3/19606△IJfluEcipro-3_1.fq.gz flu_dIJ_cipro_R1.fastq.gz
     ln -s ../../X101SC25116512-Z01-J003/01.RawData/19606△IJfluEcipro-3/19606△IJfluEcipro-3_2.fq.gz flu_dIJ_cipro_R2.fastq.gz
     ln -s ../../X101SC25116512-Z01-J003/01.RawData/19606△IJfluEdori-1/19606△IJfluEdori-1_1.fq.gz flu_dIJ_dori_R1.fastq.gz
     ln -s ../../X101SC25116512-Z01-J003/01.RawData/19606△IJfluEdori-1/19606△IJfluEdori-1_2.fq.gz flu_dIJ_dori_R2.fastq.gz
     ln -s ../../X101SC25116512-Z01-J003/01.RawData/19606△IJfluEnitro-3/19606△IJfluEnitro-3_1.fq.gz flu_dIJ_nitro_R1.fastq.gz
     ln -s ../../X101SC25116512-Z01-J003/01.RawData/19606△IJfluEnitro-3/19606△IJfluEnitro-3_2.fq.gz flu_dIJ_nitro_R2.fastq.gz
     ln -s ../../X101SC25116512-Z01-J003/01.RawData/19606△IJfluEpip-4/19606△IJfluEpip-4_1.fq.gz flu_dIJ_pip_R1.fastq.gz
     ln -s ../../X101SC25116512-Z01-J003/01.RawData/19606△IJfluEpip-4/19606△IJfluEpip-4_2.fq.gz flu_dIJ_pip_R2.fastq.gz
     ln -s ../../X101SC25116512-Z01-J003/01.RawData/19606△IJfluEpolyB-4/19606△IJfluEpolyB-4_1.fq.gz flu_dIJ_polyB_R1.fastq.gz
     ln -s ../../X101SC25116512-Z01-J003/01.RawData/19606△IJfluEpolyB-4/19606△IJfluEpolyB-4_2.fq.gz flu_dIJ_polyB_R2.fastq.gz
     ln -s ../../X101SC25116512-Z01-J003/01.RawData/19606wtfluEcef-1/19606wtfluEcef-1_1.fq.gz flu_wt_cef_R1.fastq.gz
     ln -s ../../X101SC25116512-Z01-J003/01.RawData/19606wtfluEcef-1/19606wtfluEcef-1_2.fq.gz flu_wt_cef_R2.fastq.gz
     ln -s ../../X101SC25116512-Z01-J003/01.RawData/19606wtfluEcipro-2/19606wtfluEcipro-2_1.fq.gz flu_wt_cipro_R1.fastq.gz
     ln -s ../../X101SC25116512-Z01-J003/01.RawData/19606wtfluEcipro-2/19606wtfluEcipro-2_2.fq.gz flu_wt_cipro_R2.fastq.gz
     ln -s ../../X101SC25116512-Z01-J003/01.RawData/19606wtfluEdori-1/19606wtfluEdori-1_1.fq.gz flu_wt_dori_R1.fastq.gz
     ln -s ../../X101SC25116512-Z01-J003/01.RawData/19606wtfluEdori-1/19606wtfluEdori-1_2.fq.gz flu_wt_dori_R2.fastq.gz
     ln -s ../../X101SC25116512-Z01-J003/01.RawData/19606wtfluEnitro-1/19606wtfluEnitro-1_1.fq.gz flu_wt_nitro_R1.fastq.gz
     ln -s ../../X101SC25116512-Z01-J003/01.RawData/19606wtfluEnitro-1/19606wtfluEnitro-1_2.fq.gz flu_wt_nitro_R2.fastq.gz
     ln -s ../../X101SC25116512-Z01-J003/01.RawData/19606wtfluEpip-4/19606wtfluEpip-4_1.fq.gz flu_wt_pip_R1.fastq.gz
     ln -s ../../X101SC25116512-Z01-J003/01.RawData/19606wtfluEpip-4/19606wtfluEpip-4_2.fq.gz flu_wt_pip_R2.fastq.gz
     ln -s ../../X101SC25116512-Z01-J003/01.RawData/19606wtfluEpolyB-4/19606wtfluEpolyB-4_1.fq.gz flu_wt_polyB_R1.fastq.gz
     ln -s ../../X101SC25116512-Z01-J003/01.RawData/19606wtfluEpolyB-4/19606wtfluEpolyB-4_2.fq.gz flu_wt_polyB_R2.fastq.gz
     ln -s ../../X101SC25116512-Z01-J003/01.RawData/19606wtfluEtet-2/19606wtfluEtet-2_1.fq.gz flu_wt_tet_R1.fastq.gz
     ln -s ../../X101SC25116512-Z01-J003/01.RawData/19606wtfluEtet-2/19606wtfluEtet-2_2.fq.gz flu_wt_tet_R2.fastq.gz
    
     # ── Mitomycin C Dataset ──
     ln -s ../../X101SC26025981-Z02-J002/01.RawData/MitoC_E_AB_cipro_1/MitoC_E_AB_cipro_1_1.fq.gz mito_dAB_cipro_R1.fastq.gz
     ln -s ../../X101SC26025981-Z02-J002/01.RawData/MitoC_E_AB_cipro_1/MitoC_E_AB_cipro_1_2.fq.gz mito_dAB_cipro_R2.fastq.gz
     ln -s ../../X101SC26025981-Z02-J002/01.RawData/MitoC_E_AB_dori_1/MitoC_E_AB_dori_1_1.fq.gz mito_dAB_dori_R1.fastq.gz
     ln -s ../../X101SC26025981-Z02-J002/01.RawData/MitoC_E_AB_dori_1/MitoC_E_AB_dori_1_2.fq.gz mito_dAB_dori_R2.fastq.gz
     ln -s ../../X101SC26025981-Z02-J002/01.RawData/MitoC_E_AB_nitro_2/MitoC_E_AB_nitro_2_1.fq.gz mito_dAB_nitro_R1.fastq.gz
     ln -s ../../X101SC26025981-Z02-J002/01.RawData/MitoC_E_AB_nitro_2/MitoC_E_AB_nitro_2_2.fq.gz mito_dAB_nitro_R2.fastq.gz
     ln -s ../../X101SC26025981-Z02-J002/01.RawData/MitoC_E_AB_tet_2/MitoC_E_AB_tet_2_1.fq.gz mito_dAB_tet_R1.fastq.gz
     ln -s ../../X101SC26025981-Z02-J002/01.RawData/MitoC_E_AB_tet_2/MitoC_E_AB_tet_2_2.fq.gz mito_dAB_tet_R2.fastq.gz
     ln -s ../../X101SC26025981-Z02-J002/01.RawData/MitoC_E_AB_trime_4/MitoC_E_AB_trime_4_1.fq.gz mito_dAB_trime_R1.fastq.gz
     ln -s ../../X101SC26025981-Z02-J002/01.RawData/MitoC_E_AB_trime_4/MitoC_E_AB_trime_4_2.fq.gz mito_dAB_trime_R2.fastq.gz
     ln -s ../../X101SC26025981-Z02-J002/01.RawData/MitoC_E_IJ_cipro_1/MitoC_E_IJ_cipro_1_1.fq.gz mito_dIJ_cipro_R1.fastq.gz
     ln -s ../../X101SC26025981-Z02-J002/01.RawData/MitoC_E_IJ_cipro_1/MitoC_E_IJ_cipro_1_2.fq.gz mito_dIJ_cipro_R2.fastq.gz
     ln -s ../../X101SC26025981-Z02-J002/01.RawData/MitoC_E_IJ_dori_4/MitoC_E_IJ_dori_4_1.fq.gz mito_dIJ_dori_R1.fastq.gz
     ln -s ../../X101SC26025981-Z02-J002/01.RawData/MitoC_E_IJ_dori_4/MitoC_E_IJ_dori_4_2.fq.gz mito_dIJ_dori_R2.fastq.gz
     ln -s ../../X101SC26025981-Z02-J002/01.RawData/MitoC_E_IJ_nitro_2/MitoC_E_IJ_nitro_2_1.fq.gz mito_dIJ_nitro_R1.fastq.gz
     ln -s ../../X101SC26025981-Z02-J002/01.RawData/MitoC_E_IJ_nitro_2/MitoC_E_IJ_nitro_2_2.fq.gz mito_dIJ_nitro_R2.fastq.gz
     ln -s ../../X101SC26025981-Z02-J002/01.RawData/MitoC_E_IJ_polyB_3/MitoC_E_IJ_polyB_3_1.fq.gz mito_dIJ_polyB_R1.fastq.gz
     ln -s ../../X101SC26025981-Z02-J002/01.RawData/MitoC_E_IJ_polyB_3/MitoC_E_IJ_polyB_3_2.fq.gz mito_dIJ_polyB_R2.fastq.gz
     ln -s ../../X101SC26025981-Z02-J002/01.RawData/MitoC_E_IJ_tet_3/MitoC_E_IJ_tet_3_1.fq.gz mito_dIJ_tet_R1.fastq.gz
     ln -s ../../X101SC26025981-Z02-J002/01.RawData/MitoC_E_IJ_tet_3/MitoC_E_IJ_tet_3_2.fq.gz mito_dIJ_tet_R2.fastq.gz
     ln -s ../../X101SC26025981-Z02-J002/01.RawData/MitoC_E_IJ_trime_1/MitoC_E_IJ_trime_1_1.fq.gz mito_dIJ_trime_R1.fastq.gz
     ln -s ../../X101SC26025981-Z02-J002/01.RawData/MitoC_E_IJ_trime_1/MitoC_E_IJ_trime_1_2.fq.gz mito_dIJ_trime_R2.fastq.gz
     ln -s ../../X101SC26025981-Z02-J002/01.RawData/MitoC_E_wt_cipro_1/MitoC_E_wt_cipro_1_1.fq.gz mito_wt_cipro_R1.fastq.gz
     ln -s ../../X101SC26025981-Z02-J002/01.RawData/MitoC_E_wt_cipro_1/MitoC_E_wt_cipro_1_2.fq.gz mito_wt_cipro_R2.fastq.gz
     ln -s ../../X101SC26025981-Z02-J002/01.RawData/MitoC_E_wt_nitro_1/MitoC_E_wt_nitro_1_1.fq.gz mito_wt_nitro_R1.fastq.gz
     ln -s ../../X101SC26025981-Z02-J002/01.RawData/MitoC_E_wt_nitro_1/MitoC_E_wt_nitro_1_2.fq.gz mito_wt_nitro_R2.fastq.gz
     ln -s ../../X101SC26025981-Z02-J002/01.RawData/MitoC_E_wt_polyB_1/MitoC_E_wt_polyB_1_1.fq.gz mito_wt_polyB_R1.fastq.gz
     ln -s ../../X101SC26025981-Z02-J002/01.RawData/MitoC_E_wt_polyB_1/MitoC_E_wt_polyB_1_2.fq.gz mito_wt_polyB_R2.fastq.gz
     ln -s ../../X101SC26025981-Z02-J002/01.RawData/MitoC_E_wt_trime_2/MitoC_E_wt_trime_2_1.fq.gz mito_wt_trime_R1.fastq.gz
     ln -s ../../X101SC26025981-Z02-J002/01.RawData/MitoC_E_wt_trime_2/MitoC_E_wt_trime_2_2.fq.gz mito_wt_trime_R2.fastq.gz
  2. Call variant calling using snippy

     ln -s ~/Tools/bacto/db/ .;
     ln -s ~/Tools/bacto/envs/ .;
     ln -s ~/Tools/bacto/local/ .;
     cp ~/Tools/bacto/Snakefile .;
     cp ~/Tools/bacto/bacto-0.1.json .;
     cp ~/Tools/bacto/cluster.json .;
    
     #download CP059040.gb from GenBank
     #mv ~/Downloads/sequence\(2\).gb db/CP059040.gb
    
     mamba activate /home/jhuang/miniconda3/envs/bengal3_ac3
     (bengal3_ac3) jhuang@WS-2290C:~/DATA/Data_Tam_DNAseq_2023_A6WT_A10CraA_A12AYE_A1917978$ which snakemake
     /home/jhuang/miniconda3/envs/bengal3_ac3/bin/snakemake
     (bengal3_ac3) jhuang@WS-2290C:~/DATA/Data_Tam_DNAseq_2023_A6WT_A10CraA_A12AYE_A1917978$ snakemake -v
     4.0.0 --> CORRECT!
    
     #NOTE_1: modify bacto-0.1.json keeping only steps assembly, typing_mlst, possibly pangenome and variants_calling true; setting cpu=20 in all used steps.
         #setting the following in bacto-0.1.json
         "fastqc": false,
         "taxonomic_classifier": false,
         "assembly": true,
         "typing_ariba": false,
         "typing_mlst": true,
         "pangenome": true,
         "variants_calling": true,
         "phylogeny_fasttree": false,
         "phylogeny_raxml": false,
         "recombination": false, (due to gubbins-error set false)
    
         "prokka": {
           "genus": "Acinetobacter",
           "kingdom": "Bacteria",
           "species": "baumannii",
           "cpu": 10,
           "evalue": "1e-06",
           "other": ""
         },
    
         "mykrobe": {
           "species": "abaum"
         },
    
         "reference": "db/CP059040.gb"
    
     #NOTE_2: needs disk Titisee since the pipeline needs /media/jhuang/Titisee/GAMOLA2/TIGRfam_db/TIGRFAMs_15.0_HMM.LIB
     snakemake --printshellcmds
  3. Checking the contaminated samples based on the size of genome

     find . -maxdepth 2 -name "contigs.fa" -exec du -h {} \;
     find . -maxdepth 2 -name "contigs.fa" -printf "%s\t%p\n" | sort -nr | \
             while IFS=$'\t' read -r bytes path; do
             sample=$(basename "$(dirname "$path")")
             printf "%-20s %8s\t%s\n" "$sample" "$(numfmt --to=iec-i --suffix=B "$bytes")" "$path"
     done
    
     mito_wt_polyB          8,6MiB   ./mito_wt_polyB/contigs.fa
     mito_wt_trime          8,2MiB   ./mito_wt_trime/contigs.fa
     mito_dAB_trime         8,2MiB   ./mito_dAB_trime/contigs.fa
     flu_wt_cipro           6,7MiB   ./flu_wt_cipro/contigs.fa
     mito_dIJ_nitro         6,6MiB   ./mito_dIJ_nitro/contigs.fa
     flu_dAB_cipro          4,7MiB   ./flu_dAB_cipro/contigs.fa
    
     flu_wt_nitro           3,9MiB   ./flu_wt_nitro/contigs.fa
     flu_wt_polyB           3,9MiB   ./flu_wt_polyB/contigs.fa
     flu_wt_cef             3,9MiB   ./flu_wt_cef/contigs.fa
     flu_wt_dori            3,8MiB   ./flu_wt_dori/contigs.fa
     flu_dIJ_pip            3,8MiB   ./flu_dIJ_pip/contigs.fa
     flu_dIJ_dori           3,8MiB   ./flu_dIJ_dori/contigs.fa
     flu_wt_pip             3,8MiB   ./flu_wt_pip/contigs.fa
     flu_dAB_dori           3,8MiB   ./flu_dAB_dori/contigs.fa
     flu_dIJ_cipro          3,8MiB   ./flu_dIJ_cipro/contigs.fa
     flu_dAB_nitro          3,8MiB   ./flu_dAB_nitro/contigs.fa
     flu_dAB_pip            3,8MiB   ./flu_dAB_pip/contigs.fa
     flu_dAB_polyB          3,8MiB   ./flu_dAB_polyB/contigs.fa
     flu_dIJ_nitro          3,8MiB   ./flu_dIJ_nitro/contigs.fa
     flu_dIJ_cef            3,8MiB   ./flu_dIJ_cef/contigs.fa
     flu_dAB_tet            3,8MiB   ./flu_dAB_tet/contigs.fa
     flu_dIJ_polyB          3,8MiB   ./flu_dIJ_polyB/contigs.fa
     flu_dAB_cef            3,8MiB   ./flu_dAB_cef/contigs.fa
     mito_dIJ_polyB         3,8MiB   ./mito_dIJ_polyB/contigs.fa
     mito_dIJ_cipro         3,8MiB   ./mito_dIJ_cipro/contigs.fa
     mito_dIJ_dori          3,8MiB   ./mito_dIJ_dori/contigs.fa
     flu_wt_tet             3,8MiB   ./flu_wt_tet/contigs.fa
     mito_wt_nitro          3,8MiB   ./mito_wt_nitro/contigs.fa
     mito_wt_cipro          3,8MiB   ./mito_wt_cipro/contigs.fa
     mito_dIJ_tet           3,8MiB   ./mito_dIJ_tet/contigs.fa
     mito_dIJ_trime         3,8MiB   ./mito_dIJ_trime/contigs.fa
     mito_dAB_dori          3,8MiB   ./mito_dAB_dori/contigs.fa
     mito_dAB_cipro         3,7MiB   ./mito_dAB_cipro/contigs.fa
     mito_dAB_tet           3,7MiB   ./mito_dAB_tet/contigs.fa
     mito_dAB_nitro         3,7MiB   ./mito_dAB_nitro/contigs.fa
  4. Summarize all SNPs and Indels from the snippy result directory.

     cp ~/Scripts/summarize_snippy_res_ordered.py .
     # IMPORTANT_ADAPT the array in script should be adapted; deleting the isolates "mito_wt_polyB","mito_wt_trime", "mito_dAB_trime", "flu_wt_cipro", "mito_dIJ_nitro", and "flu_dAB_cipro"
     isolates = ["flu_wt_cef", "flu_wt_dori", "flu_wt_nitro", "flu_wt_pip", "flu_wt_polyB", "flu_wt_tet",    "flu_dAB_cef", "flu_dAB_dori", "flu_dAB_nitro", "flu_dAB_pip", "flu_dAB_polyB", "flu_dAB_tet",    "flu_dIJ_cef", "flu_dIJ_cipro", "flu_dIJ_dori", "flu_dIJ_nitro", "flu_dIJ_pip", "flu_dIJ_polyB",         "mito_dIJ_trime",    "mito_wt_cipro", "mito_wt_nitro",   "mito_dAB_cipro", "mito_dAB_dori", "mito_dAB_nitro", "mito_dAB_tet",     "mito_dIJ_cipro", "mito_dIJ_dori",  "mito_dIJ_polyB", "mito_dIJ_tet"]
    
     mamba activate plot-numpy1
     python3 ./summarize_snippy_res_ordered.py snippy
     #--> Summary CSV file created successfully at: snippy/summary_snps_indels.csv
     cd snippy
     #REMOVE_the_line? I don't find the sence of the line:    grep -v "None,,,,,,None,None" summary_snps_indels.csv > summary_snps_indels_.csv
  5. Using spandx calling variants (almost the same results to the one from viral-ngs!)

     mamba deactivate
     mamba activate /home/jhuang/miniconda3/envs/spandx
    
     # PREPARE the inputs for the options ref and database (NOT ALWAYS NEED, PREPARE ONlY ONCE)
     mkdir ~/miniconda3/envs/spandx/share/snpeff-5.1-2/data/PP810610
     cp PP810610.gb  ~/miniconda3/envs/spandx/share/snpeff-5.1-2/data/PP810610/genes.gbk
     vim ~/miniconda3/envs/spandx/share/snpeff-5.1-2/snpEff.config
     /home/jhuang/miniconda3/envs/spandx/bin/snpEff build PP810610    #-d
     ~/Scripts/genbank2fasta.py PP810610.gb
     mv PP810610.gb_converted.fna PP810610.fasta    #rename "NC_001348.1 xxxxx" to "NC_001348" in the fasta-file
    
     ln -s /home/jhuang/Tools/spandx/ spandx
     # Deleting the contaminated samples flu_wt_cipro*.fastq and flu_dAB_cipro*.fastq
     (spandx) nextflow run spandx/main.nf --fastq "trimmed/*_P_{1,2}.fastq" --ref CP059040.fasta --annotation --database CP059040 -resume
    
     # RERUN SNP_matrix.sh due to the error ERROR_CHROMOSOME_NOT_FOUND in the variants annotation, resulting in all impacts are MODIFIER --> IT WORKS!
     cd Outputs/Master_vcf
     conda activate /home/jhuang/miniconda3/envs/spandx
     (spandx) cp -r ../../snippy/flu_wt_cef/reference . # Eigentlich irgendein directory, all directories contains the same reference.
     (spandx) cp ../../spandx/bin/SNP_matrix.sh ./
     #Note that ${variant_genome_path}=CP059040 in the following command, but it was not used after the following command modification.
     #Adapt "snpEff eff -no-downstream -no-intergenic -ud 100 -formatEff -v ${variant_genome_path} out.vcf > out.annotated.vcf" to
     "/home/jhuang/miniconda3/envs/bengal3_ac3/bin/snpEff eff -no-downstream -no-intergenic -ud 100 -formatEff -c reference/snpeff.config -dataDir . ref out.vcf > out.annotated.vcf" in SNP_matrix.sh
     (spandx) bash SNP_matrix.sh CP059040 .

Cross-Caller SNP/Indel Concordance & Invariant Variant Analyzer; Multi-Isolate Variant Intersection, Annotation Harmonization & Caller Discrepancy Report; Comparative Genomic Variant Profiling: Concordance, Invariance & Allele Mismatch Analysis; VarMatch: Cross-Tool Variant Concordance Pipeline

  1. Calling inter-host variants by merging the results from snippy+spandx

     mamba activate plot-numpy1
     cd bacto
     cp Outputs/Master_vcf/All_SNPs_indels_annotated.txt .
     cp snippy/summary_snps_indels.csv .
    
     cp ~/Scripts/process_variants_snippy_alleles_spandx_annotations.py .
    
     #Configuring
     #Delete "mito_wt_polyB", "mito_wt_trime", "mito_dAB_trime", "flu_wt_cipro", "mito_dIJ_nitro", and "flu_dAB_cipro"
             ISOLATES = [
             "flu_wt_cef", "flu_wt_dori", "flu_wt_nitro", "flu_wt_pip", "flu_wt_polyB", "flu_wt_tet",
             "flu_dAB_cef", "flu_dAB_dori", "flu_dAB_nitro", "flu_dAB_pip", "flu_dAB_polyB", "flu_dAB_tet",
             "flu_dIJ_cef", "flu_dIJ_cipro", "flu_dIJ_dori", "flu_dIJ_nitro", "flu_dIJ_pip", "flu_dIJ_polyB",
             "mito_dIJ_trime",
             "mito_wt_cipro", "mito_wt_nitro",
             "mito_dAB_cipro", "mito_dAB_dori", "mito_dAB_nitro", "mito_dAB_tet",
             "mito_dIJ_cipro", "mito_dIJ_dori", "mito_dIJ_polyB", "mito_dIJ_tet"
             ]
    
     (plot-numpy1) python process_variants_snippy_alleles_spandx_annotations.py
    
     # -- Indeed, the generated files are the common SNPs generated by snippy+spandx, therefore, we need to modify the filenames. --
     mv common_variants_all_snippy_annotated.csv common_variants_snippy+spandx_annotated.csv
     mv common_variants_invariant_snippy_annotated.csv common_invariants_snippy+spandx_annotated.csv
     mv common_variants_all_snippy_annotated.xlsx common_variants_snippy+spandx_annotated.xlsx
     mv common_variants_invariant_snippy_annotated.xlsx common_invariants_snippy+spandx_annotated.xlsx
  2. Manully checking each of the 6 records by comparing them to the results from SPANDx; three are confirmed!

     #The file will only contain rows where at least one of the 29 samples had a different allele between the two files. You can open allele_differences.xlsx side-by-side with your original common_variants_all_snippy_annotated.xlsx for fast, column-aligned manual verification.
     mamba activate plot-numpy1
     (plot-numpy1) python ~/Scripts/export_mismatch_alleles.py common_variants_snippy+spandx_annotated.csv All_SNPs_indels_annotated.txt
     #❌ Found 13 mismatched positions. Extracting & formatting...
    
     # The export_mismatch_alleles.py always generated the same output-file, avoid to recoved by the following commands, modify the generated filename.
     (plot-numpy1) mv allele_differences.xlsx allele_differences_for_cmp.xlsx
    
     (plot-numpy1) python ~/Scripts/export_mismatch_alleles.py common_invariants_snippy+spandx_annotated.csv All_SNPs_indels_annotated.txt
     #✅ All alleles match perfectly across all common positions and samples!
    
     cp common_variants_all_snippy_annotated.xlsx common_variants_all_snippy_annotated_for_cmp.xlsx
     #Then marked the corresponding rows in yellow, then compare it to allele_differences.xlsx.
    
     # IMPORTANT_MANUAL_CHECK: checking all variants of common_variants_snippy+spandx_annotated.xlsx and common_invariants_snippy+spandx_annotated.xlsx by comparing allele_differences_for_cmp.xlsx.
  3. (Optional) Run nextflow bacass

     # Download the kmerfinder database: https://www.genomicepidemiology.org/services/ --> https://cge.food.dtu.dk/services/KmerFinder/ --> https://cge.food.dtu.dk/services/KmerFinder/etc/kmerfinder_db.tar.gz
     # Download 20190108_kmerfinder_stable_dirs.tar.gz from https://zenodo.org/records/13447056
     #--kmerfinderdb /mnt/nvme1n1p1/REFs/kmerfinder_db.tar.gz
     #--kmerfinderdb /mnt/nvme1n1p1/REFs/20190108_kmerfinder_stable_dirs.tar.gz
     nextflow run nf-core/bacass -r 2.5.0 -profile docker \
     --input samplesheet.tsv \
     --outdir bacass_out \
     --assembly_type short \
     --kraken2db /mnt/nvme1n1p1/REFs/k2_standard_08_GB_20251015.tar.gz \
     --kmerfinderdb /mnt/nvme1n1p1/REFs/kmerfinder/bacteria/ \
     -resume
     #Possibly the chraracter '△' is a problem.
     #Solution: 19606△ABfluEcef-1 → 19606delABfluEcef-1
    
     #SAVE bacass_out/Kmerfinder/kmerfinder_summary.csv to bacass_out/Kmerfinder/An6?/An6?_kmerfinder_results.xlsx
    
     samplesheet.tsv
     sample,fastq_1,fastq_2
     flu_dAB_cef,../X101SC25116512-Z01-J003/01.RawData/19606△ABfluEcef-1/19606△ABfluEcef-1_1.fq.gz,../X101SC25116512-Z01-J003/01.RawData/19606△ABfluEcef-1/19606△ABfluEcef-1_2.fq.gz
     flu_dAB_cipro,../X101SC25116512-Z01-J003/01.RawData/19606△ABfluEcipro-2/19606△ABfluEcipro-2_1.fq.gz,../X101SC25116512-Z01-J003/01.RawData/19606△ABfluEcipro-2/19606△ABfluEcipro-2_2.fq.gz
    
     #busco example results:
     Input_file      Dataset Complete        Single  Duplicated      Fragmented      Missing n_markers       Scaffold N50    Contigs N50     Percent gaps    Number of scaffolds
     wt_cef.scaffolds.fa     bacteria_odb10  98.4    98.4    0.0     1.6     0.0     124     285852  285852  0.000%  45
     wt_cipro.scaffolds.fa   bacteria_odb10  90.3    89.5    0.8     8.1     1.6     124     7434    7434    0.000%  1699
  4. (Optional) Run bactmap

     nextflow run nf-core/bactmap -r 1.0.0 -profile docker \
     --input samplesheet.csv \
     --reference CP059040.fasta \
     --outdir bactmap_out \
     -resume
    
     nextflow run nf-core/bactmap -r 1.0.0 -profile docker \
     --input samplesheet.csv \
     --reference CU459141.fasta \
     --outdir bactmap_out \
     -resume
    
     sample,fastq_1,fastq_2
     G18582004,fastqs/G18582004_1.fastq.gz,fastqs/G18582004_2.fastq.gz
     G18756254,fastqs/G18756254_1.fastq.gz,fastqs/G18756254_2.fastq.gz
     G18582006,fastqs/G18582006_1.fastq.gz,fastqs/G18582006_2.fastq.gz
    
     mkdir bactmap_workspace
     #Prepare reference.fasta (example for CP059040.fasta) in bactmap_workspace and samplesheet.csv in bactmap_workspace
     nextflow run nf-core/bactmap -r 1.0.0 -profile docker \
     --input samplesheet.csv \
     --reference CP059040.fasta \
     --outdir bactmap_out \
     -resume
  5. Structural variant calling

     conda activate sv_assembly
    
     nucmer --maxmatch -l 100 -c 500 bacto/CP059040.fasta bacto/shovill/flu_wt_cef/contigs.fa -p flu_wt_cef
     delta-filter -1 -q flu_wt_cef.delta > flu_wt_cef.filtered.delta
     Assemblytics flu_wt_cef.filtered.delta flu_wt_cef_assemblytics 1000 100 50000
    
     #reference  ref_start  ref_stop  ID                size  strand  type                ref_gap_size  query_gap_size  query_coordinates          method
     CP059040    3124916    3125037   Assemblytics_b_3  198   +       Tandem_contraction  121           -77             contig00010:34383-34460:-  between_alignments
    
     for sample in flu_wt_cef  flu_wt_dori  flu_wt_nitro  flu_wt_pip  flu_wt_polyB  flu_wt_tet  flu_dAB_cef  flu_dAB_dori  flu_dAB_nitro  flu_dAB_pip  flu_dAB_polyB  flu_dAB_tet  flu_dIJ_cef  flu_dIJ_cipro  flu_dIJ_dori  flu_dIJ_nitro  flu_dIJ_pip  flu_dIJ_polyB  mito_dIJ_trime  mito_wt_cipro  mito_wt_nitro  mito_dAB_cipro  mito_dAB_dori  mito_dAB_nitro  mito_dAB_tet  mito_dIJ_cipro  mito_dIJ_dori  mito_dIJ_polyB  mito_dIJ_tet; do
             nucmer --maxmatch -l 100 -c 500 bacto/CP059040.fasta bacto/shovill/${sample}/contigs.fa -p ${sample};
             delta-filter -1 -q ${sample}.delta > ${sample}.filtered.delta;
             Assemblytics ${sample}.filtered.delta ${sample}_assemblytics 1000 100 50000;
     done
    
     #NOTE that We excluded 6 samples with abnormal assembly sizes (>4.5 Mb or <3.7 Mb), which likely reflect contamination or fragmentation. The final merged SV table (merged_sv_results_filtered.tsv) contains 29 high-quality isolates and is ready for downstream analysis.
     #mito_wt_polyB          8,6MiB   ./mito_wt_polyB/contigs.fa
     #mito_wt_trime          8,2MiB   ./mito_wt_trime/contigs.fa
     #mito_dAB_trime         8,2MiB   ./mito_dAB_trime/contigs.fa
     #flu_wt_cipro           6,7MiB   ./flu_wt_cipro/contigs.fa
     #mito_dIJ_nitro         6,6MiB   ./mito_dIJ_nitro/contigs.fa
     #flu_dAB_cipro          4,7MiB   ./flu_dAB_cipro/contigs.fa
    
     mito_wt_polyB
     # Adapt the EXCLUDE_SAMPLES names in the following script and run it.
     merge_sv_filtered.sh  # merged_sv_results.txt
     #✅ Merge complete: merged_sv_results_filtered.txt
     #• Included: 29 samples
     #• Excluded: 4 samples (Indeed, 6 samples should not include in the calculation, but 2 samples was not run in the last step.)
  6. Using PHASTEST

     https://phastest.ca/
    
     Downloads ZZ_f98bf12de9.PHASTEST.zip
     python ~/Scripts/phaster_clean_to_excel.py summary.txt detail.txt PHASTEST_SV_mito_large_del1_CP059040_1189156-1236440.xlsx
    
     Manually edit the generate Excel-files
     * Delete lines 2-8
     * MOST_COMMON_PHAGE_NAME --> MOST_COMMON_PHAGE_NAME (# hit genes count)
     * Replace all common phage names from the website.
    
     sed -E 's/ {2,}/\t/g' detail.txt > detail_tabs.txt
     In detail_tabs.txt, delete the first line and the line '---------', one empty line, edit some bacterial proteins by replacing one '\t' to ' ', copy it to sheet Detail. Make the header bold font.
  7. Report

Thank you for your excellent question. You are absolutely right—the limited gene entries in our initial SV table reflect conservative RefSeq GFF3 annotation, which under-represents prophage regions (many genes labeled “hypothetical” or omitted). Your observation prompted re-analysis with PHASTEST, and the results strongly support your hypothesis.

🔹 PHASTEST validation: Three prophage regions confirmed We ran PHASTEST on the three large SV regions previously annotated as deletions/contractions. Based on PHASTEST’s scoring criteria (intact >90, questionable 70–90, incomplete <70), we confirm:

SV ID Coordinates PHASTER Score Most Common Phage Match Key Features
SV_mito_tandem 2494563–2536071 (41.5 kb) Questionable (90) Bordetella phage BPP-1 (NC_005357) integrase, terminase, tail, capsid
SV_mito_large_del1 1189156–1236440 (47.3 kb) Intact (>90) Acinetobacter phage vB_AbaS_TRS1 (NC_031098) tyrosine integrase, structural modules
SV_mito_large_del2 2621714–2664046 (42.4 kb) Intact (>90) Acinetobacter phage Bphi_B1251 (NC_019541) terL, DNA methyltransferase, tail proteins

Key observations:

* All three regions contain hallmark phage genes (integrases, terminases, capsid/tail proteins)
* The most common phage matches correspond to published reference genomes; complete lists of all matched phages are provided in the "Summary" sheet of each attached Excel file
* Reference manuscripts:
   - NC_005357: Genomic and genetic analysis of Bordetella bacteriophages encoding reverse transcriptase-mediated tropism-switching cassettes
   - NC_031098: Genome Sequence of vB_AbaS_TRS1, a Viable Prophage Isolated from Acinetobacter baumannii Strain A118
   - NC_019541: Complete Genome Sequence of the Podoviral Bacteriophage YMC/09/02/B1251 ABA BP, Which Causes the Lysis of an OXA-23-Producing Carbapenem-Resistant Acinetobacter baumannii Isolate from a Septic Patient
* More detailed annotations, including Excel tables and figures, are provided in the attachments

In my 2021 PLOS Pathogens paper (attached), we identified three prophage regions significantly associated with invasive S. epidermidis, where ~94% of infection-associated genes mapped to the mobilome. Mapping Wan Yu’s differential expression data to our PHASTEST-annotated prophage coordinates would be a logical next step to test for condition-specific prophage gene expression.