Gasthof Knappenwirt in Mariahof

bezirkskarte-winter

https://upload.wikimedia.org/wikipedia/commons/e/e2/Reliefkarte_Steiermark.png

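“Sau” is pronounced differently depending on the language. German (meaning “sow”): /zaʊ̯/, with the vowel sounding like the English “ow”. English: the American transcription is /saʊ/, which rhymes with “cow”.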

karte-oesterreich
Austria_Physical
Kärnten_Physical
Municipalities_Bezirk_Murau.svg

https://www.outdoorcenter-skischool.at/en/childrens-ski-course/

Course times: Monday to Friday, 09:45–11:45 a.m. and 01:15–03:15 p.m. For guests who are staying only a few days and cannot attend the whole week, we offer a half-week course or attendance on single days.

Please arrive at our ski school with your children at least 10 minutes before the course starts.

Prices for skiing courses, winter season 2025/2026:

  • Week-long course (6 days): 250€
  • Week-long course (5 days): 235€
  • Half-week course (3 days): 205€
  • Day course (4 hours in a group): 100€
  • Lunchtime supervision

Group lessons take place with a minimum of 5 participants; otherwise we offer workshop prices, or prices are set by agreement.

The above mentioned prices are listed in EURO and per person. The prices do not include skiing or snowboard equipment and ski pass.

https://www.leistbare-auszeit.at/winterurlaub-in-oesterreich/

The Kreischberg Murau valley station offers around 1,000 free parking spaces. The paved lots lie right next to the valley station of the 10-person gondola (10er-Gondelbahn), so you can get from the road to the slopes or the ticket office without a long walk.

Getting there and parking at Kreischberg

The Kreischberg ski area lies in the Murtal valley in Styria. It is easy to reach from the west via the A10 Tauernautobahn and from the east via the S36 Murtal Schnellstraße. From there, well-built roads without significant climbs lead directly to the ski area. The 1,000 parking spaces right at the valley station of the 10-person gondola are paved and free.

Hoferdorf 113, 8812 Mariahof Österreich

to

Kreischberg Talstation, Kreischbergstraße 1, 8861 St. Lorenzen am Kreischberg, Österreich

32 min (30.8 km) via B96 and Murauer Str./B97

Day ski passes (Tageskarten), winter season 2025/26. Main season: 25.12.2025 – 15.03.2026.

| Ticket | Main season: Adults | Main season: Children | Main season: Adolescents | Early/late season: Adults | Early/late season: Children | Early/late season: Adolescents |
|---|---|---|---|---|---|---|
| Day ticket from 08:30 | 68,00 | 34,00 | 54,50 | 61,00 | 30,50 | 49,00 |
| Day ticket from 11:00 | 64,00 | 32,00 | 51,00 | 57,50 | 29,00 | 46,00 |
| Day ticket from 12:00 | 58,00 | 29,00 | 46,50 | 52,00 | 26,00 | 42,00 |
| Morning ticket till 13:00 | 60,00 | 30,00 | 48,00 | 54,00 | 27,00 | 43,00 |
| Hourly ticket, 2 hours | 47,50 | 24,00 | 38,00 | 43,00 | 21,50 | 34,00 |
| Hourly ticket, 3 hours | 53,00 | 26,50 | 42,50 | 47,50 | 24,00 | 38,50 |
| Hourly ticket, 4 hours | 60,00 | 30,00 | 48,00 | 54,00 | 27,00 | 43,00 |
| Single run Kreischberg 10er, 2 sections | 18,00 | 9,00 | 14,50 | 18,00 | 9,00 | 14,50 |

Dog: 5,00

Children up to 6 years (born in 2020 or after) ski for FREE.
*) Children: born 2010 – 2019
**) Young people: born 2000 – 2009
For details see Informations & Cut-off dates. All prices in Euro and incl. VAT.

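Core question: can I still be insured today?

Yes, you can.

According to the terms you provided, booking a trip today and buying this insurance today complies with the rules.

The reason is this clause in the terms:

  • Abschlussfrist Last-Minute: Bis max. 3 Tage nach Buchung (last-minute purchase deadline: at most 3 days after booking)

This means that you are covered as long as you buy the insurance within 3 days of booking the trip. Booking today and buying today is well within that window.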
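The 3-day rule can be checked mechanically. The following is a minimal sketch, not the insurer's own logic: it assumes GNU `date` (for `date -d`), dates in `YYYY-MM-DD` form, and that "within 3 days" means a difference of 0 to 3 calendar days inclusive. The function names `days_between` and `within_last_minute_window` are hypothetical.

```shell
#!/usr/bin/env bash
# Whole days from the first date to the second (assumes GNU date and YYYY-MM-DD input).
days_between() {
    local start end
    start=$(date -u -d "$1" +%s) || return 2
    end=$(date -u -d "$2" +%s) || return 2
    echo $(( (end - start) / 86400 ))
}

# Succeeds if the purchase date is 0 to 3 days after the booking date.
# Assumption: the 3-day limit is inclusive; check the policy wording to be sure.
within_last_minute_window() {
    local days
    days=$(days_between "$1" "$2") || return 2
    [ "$days" -ge 0 ] && [ "$days" -le 3 ]
}

# Usage:
#   within_last_minute_window 2026-01-10 2026-01-12 && echo "inside the window"
```

Working on UTC midnights keeps the difference an exact multiple of 86400 seconds, so daylight-saving changes do not shift the day count.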


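Other important terms explained

So that you can buy with more confidence, here are the other key terms of this policy (trip cancellation insurance, Reise-Rücktrittsversicherung):

1. Purchase and contract periods (Fristen)

  • Regular purchase deadline (Abschlussfrist): the insurance must be bought no later than 15 days before the trip starts. (If you did not book the trip today but some time ago, you must meet this 15-day condition.)
  • Minimum contract term (Mindestvertragslaufzeit): 1 year. This is an annual policy, so several trips within the year are covered, provided each one meets the definition of a trip.
  • Notice period (Kündigungsfrist): 1 day. If you do not want to renew, giving notice 1 day in advance is enough to cancel the following year's contract.

2. How is a trip defined? (Reisedefinition)

  • What counts as a "trip": it must include at least one overnight stay, or booked transport (such as a flight or train ticket).
  • Business trips included (Inkl. Geschäftsreise): the insurance also applies when you travel for work.
  • Maximum duration of a single trip (Maximale Reisedauer): unlimited
  • Travel region (Reiseregion): worldwide (Weltweit)

3. How are family and children defined? (Familiendefinition)

  • The definition of a family is very broad: 2 adults plus the children travelling with them. Neither kinship nor a shared household is required, so a friend's children, nephews, or nieces count as part of the family policy as long as they travel with you.
  • Maximum number of insured children: 5
  • Maximum age of children: 20 (anyone older usually needs a separate adult policy).

4. Customer service and confirmation (Kundenservice)

  • Instant confirmation (Sofortige Bestätigung): once you buy today, the system issues the policy immediately and it takes effect right away.
  • Customer rating (Kundenbewertung): 4.3 out of 5 stars.
  • Customer hotline: callback service Monday to Friday, 9:00 – 18:00.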

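💡 Purchase advice and things to note

  1. Keep the evidence: save the booking confirmation (Buchungsbestätigung) for the trip you booked today, which shows today's date. If you ever file a claim, the insurer will check that the booking date and the insurance purchase date are no more than 3 days apart.
  2. Confirm that the policy is active: because the terms state "Sofortige Bestätigung" (instant confirmation), check your email after buying and make sure you have received the formal policy document (Versicherungsschein).
  3. What cancellation insurance covers: a Rücktrittsversicherung mainly covers the case where, before departure, you cannot travel for reasons beyond your control, such as unexpected illness, a serious accident, job loss, or a house fire. The insurer then reimburses the cancellation fees (Stornokosten) you owe to the travel agency or airline.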


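Based on the extracted information, here is a detailed comparison of the two accommodations:

Overall comparison of the two accommodations

Item Bauernhof Lehen (QV4YEH) Gasthof Knappenwirt (TNSRP9)
Basic information
Location Großhöch, Austria Mariahof, Styria, Austria
Rating 9.6 (Außergewöhnlich) – 36 reviews 9.0 (Hervorragend) – 307 reviews
Accommodation type Traditional farmhouse (Bauernhaus) Inn/guesthouse (Gasthof)
Host experience Hosting since 2019 Hosting since 2024
Capacity and area
Maximum guests 15 18
Living area 180 m² 186 m²
Bedrooms 4 6
Bathrooms 3 3
Price (7 nights)
Total price 3.263,00 € 3.257,37 € ✓ (5,63 € cheaper)
Per person per night approx. 30,98 € approx. 25,77 € ✓ (cheaper)
Deposit 250 € (cash) 1 € (cash) ✓
Additional charges 367,50 € (taxes) 262,50 € (electricity)
Amenities
Sauna ✓ included ✓ included
Hot tub/whirlpool ✓ Whirlpool
Garden ✓ (shared)
Terrace
Dishwasher
Washing machine
WLAN ✓ free ✓ free
Parking ✓ (may be charged) ✓ free private parking
Barbecue facilities
Fireplace ✓ Kamin
Mountain view
Non-smoking rooms
Food and dining
Kitchen ✓ full kitchen △ kitchen in some rooms
Restaurant ✓ restaurant on site
Breakfast ✓ breakfast offered
Skiing
Distance to ski area St. Johann/Alpendorf 13 km
Skizirkus Gastein 18 km
Kreischberg 30 minutes by car
Ski bus stop 400 m
Ski storage room
Location and transport
Distance to centre 4 km (St. Veit) 1.4 km (Bäcker)
Bus stop 450 m
Train station 5 km (St. Veit)
Shopping 4 km 3.3 km (Bank/Apotheke)
Restaurant 2.7 km
Special services
Facilities for children ✓ high chair and cot free of charge ✓ children's playground
Pets ✗ not allowed ✓ allowed (on request, may be charged)
Other Farm animals, fresh milk 24-hour front desk, restaurant, bar
Cancellation policy
Free cancellation ✓ until 2026.10.20 (60 days in advance) △ partial refund (14%) until 2026.12.19
Overall ratings
Cleanliness 9.8 9.2
Amenities 8.8 8.8
Location 8.6
Value for money 9.0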

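Summary and recommendation

Choose Bauernhof Lehen if you care more about:

  • ✅ A higher rating (9.6 vs 9.0)
  • ✅ A traditional farm experience, with fresh milk
  • ✅ A more flexible cancellation policy (free cancellation up to 60 days in advance)
  • ✅ Proximity to a large ski area (200 km of slopes)
  • ✅ Privacy (an entire detached farmhouse)

Choose Gasthof Knappenwirt if you care more about:

  • ✅ Room for more people (18 vs 15)
  • ✅ More bedrooms (6 vs 4)
  • ✅ More complete amenities (Whirlpool, washing machine, restaurant)
  • ✅ A lower price per person
  • ✅ Breakfast and dining service
  • ✅ Being able to bring pets

Overall recommendation: for a ski holiday with friends and family, Bauernhof Lehen is the better fit, because it has the higher rating, lies closer to well-known ski areas, and has the more flexible cancellation policy. If your group has more than 15 people or needs more bedrooms, Gasthof Knappenwirt is the better choice.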



| Category | Details |
|---|---|
| Location | Near Mariahof in Styria (Austria) |
| Main Ski Resort | Kreischberg Ski Resort (St. Georgen am Kreischberg) |
| Distance | Approximately 24 km (20-25 minutes drive) from Mariahof |
| Ski Kilometers | Around 42 km (17 km easy, 16 km intermediate, 9 km difficult) |
| Special Features | Modern 10-person gondola, large snow park for freestylers, special children's areas |
| Second Resort | Grebenzen Ski Resort (St. Lambrecht) |
| Distance | Only a few kilometers south of Mariahof |
| Ski Kilometers | Around 12 km of slopes and 13 km of ski routes |
| Elevation | Slopes range from 1,010 to 1,870 meters altitude |
| Special Features | Particularly popular with families and ski tourers; often features toboggan runs |
| Third Resort | Lachtal Ski Resort. Lachtal lies in the Austrian state of Styria and belongs to the Styrian municipality of Oberwölz in the Murau district. The area used to be an independent municipality called Schönberg-Lachtal, which was merged with Oberwölz in 2015. |
| Distance | Easily accessible by car (approximately 40-50 minutes drive) |
| Ski Kilometers | Approximately 40 km of slopes |
| Elevation | Up to 2,222 meters altitude |
| Special Features | Known for its wide, open slopes; snow-sure and family-friendly |
| Snow Reliability | Ski resorts in Styria offer reliable snow conditions. Slopes are covered with artificial snow and kept well groomed until spring. |
| Planning Tools | Use the J2Ski Resort Guide for detailed weather and slope reports for Mariahof |
| Accommodation | Use the Booking.com Ski Resort Guide for Mariahof to find suitable accommodations |

TODO: register in early September; the deadline is 24 September 2026! PRIMA-Initiative der Universität Hamburg

PriMa-Elternbrief_2026 Talentsuche Mathematik

Uni-Zirkel_PriMa_und_PriSMa_BzMU22_1325

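Talent search and support are a cooperation project between Universität Hamburg, the MINT (mathematics, computer science, natural sciences, and technology) unit of the Behörde für Schule, Familie und Berufsbildung, the Beratungsstelle besondere Begabungen of the Landesinstitut für Qualifizierungen und Qualitätsentwicklung in Schulen, and the William-Stern-Gesellschaft.

Hamburg, June 2026

Dear parents,

As part of the PriMa programme, Universität Hamburg has for more than 25 years run a support and research project for third-graders who are especially interested in and gifted at mathematics: the so-called Uni-Zirkel (university circles). Together with the Mathe-Zirkel (maths circles) at schools, the Uni-Zirkel offer a combination of top-level and broad support that is unique in Germany. At the university we support about 60 children per year group in small groups until the end of grade 4. More information is available at www.prima-mathematik.uni-hamburg.de.

To select the 60 children for the Uni-Zirkel, we run a talent search in which each child may take part only once. It is aimed at third-graders who are at least 8 years old in November, and at fourth-graders who started school early or skipped a grade. Every child who does not get a place at the university receives a place in a regional Mathe-Zirkel.

Registration for the talent search is done through an online form, which you can find under "Aktuelles" at www.prima-mathematik.uni-hamburg.de. The form opens on 1 July 2026 and stays available until 24 September 2026.

After you register, we will send you a link to a preparation task. Your child should solve it independently at home. It does not matter if your child cannot solve everything; by trying, the children should find out whether they enjoy this kind of problem.

For the registration to be binding, you must send us your child's work on the preparation task. The submission deadline is 24 September 2026. Please send the solution by email if possible: mathe-treff.ew@uni-hamburg.de, or by post: PriMa-Projekt z.Hd. von Frau Kraußer, Von-Melle-Park 8, 20146 Hamburg. If you have difficulties registering, please send us an email.

We will then discuss the task with the children in November at a session at the university, the "Mathe-Treff für Mathe-Fans". It takes place on a Friday afternoon or a Saturday. You will receive the details at the latest in the week after the autumn holidays.

A mathematics test follows in January 2027. After that, the BbB (Beratungsstelle besondere Begabungen; Landesinstitut Hamburg) carries out an intelligence test for about 250 children.

Introductory slides with more information from Prof. Dr. Nolte are available on our homepage. On 16 September 2026 she offers a question-and-answer session for parents about the support project and the talent search (18:30-19:30, https://bbb1.physnet.uni-hamburg.de/b/mar-6zx-9ek).

By 24 September 2026: register through the online form under "Aktuelles" at www.prima-mathematik.uni-hamburg.de and send the solution to the preparation task to mathe-treff.ew@uni-hamburg.de.

For questions, Dr. Kirsten Pamperien (teacher, project lead and project coordinator) is available on Wednesdays from 9:00 to 11:00 at +4940 239525524 or at kirsten.pamperien@uni-hamburg.de.

Kind regards,
Prof. Dr. Marianne Nolte (Universität Hamburg)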



Hamburg, Juni 2026 Liebe Eltern, im Rahmen der Maßnahme PriMa wird seit über 25 Jahren an der Universität Hamburg ein Förder- und Forschungsvorhaben durchgeführt, das sich an mathematisch besonders interessierte und begabte Kinder der dritten Klassen wendet, die sogenannten Uni-Zirkel. Die Uni-Zirkel in Verbindung mit Mathe-Zirkeln an den Schulen bieten eine bundesweit einmalige Verbindung von Spitzen- und Breitenförderung. An der Universität fördern wir pro Jahrgang ca. 60 Kinder in Kleingruppen bis zum Ende der 4. Klasse. Weitere Informationen finden Sie unter www.prima-mathematik.uni-hamburg.de . Um 60 Kinder für die Uni-Zirkel auszuwählen, führen wir eine Talentsuche durch, an der jedes Kind nur einmal teilnehmen darf. Diese richtet sich an Kinder der dritten Klasse, die im November mindestens 8 Jahre alt sind, sowie an Kinder der vierten Klasse, die frühzeitig eingeschult wurden oder eine Klasse übersprungen haben. Alle Kinder, die keinen Platz an der Universität finden, erhalten einen Platz in einem regionalen Mathe-Zirkel. Die Anmeldung zur Talentsuche erfolgt über ein Onlineformular, welches Sie unter Aktuelles auf www.prima-mathematik.uni-hamburg.de finden. Dieses ist ab dem 01.07.2026 freigeschaltet und bis zum 24.09.2026 zugänglich. Wir schicken Ihnen nach erfolgter Anmeldung einen Link für eine Vorbereitungsaufgabe zu. Diese soll Ihr Kind selbständig zuhause lösen. Es ist nicht schlimm, wenn Ihr Kind nicht alles lösen kann. Die Kinder sollen durch ihre Lösungsversuche merken, ob sie Spaß an dieser Art Aufgaben haben. Für die verbindliche Anmeldung ist es zwingend erforderlich, dass Sie uns die Bearbeitung der Vorbereitungsaufgabe Ihres Kindes zuschicken. Einsendeschluss für die Bearbeitung ist der 24.09.2026. Bitte schicken Sie die Lösung der Vorbereitungsaufgabe möglichst per E-Mail: mathe-treff.ew@uni-hamburg.de oder per Post: PriMa-Projekt z.Hd. von Frau Kraußer, Von-Melle-Park 8, 20146 Hamburg. 
Sollten Sie Schwierigkeiten bei der Anmeldung haben, so schreiben Sie uns bitte eine E-Mail. Wir besprechen die Aufgabe mit den Kindern dann im November in einer Sitzung an der Universität, dem Mathe-Treff für Mathe-Fans. Dieser wird an einem Freitag-Nachmittag oder an einem Samstag angeboten. Genauere Informationen hierzu erhalten Sie spätestens in der Woche nach den Herbstferien. Im Januar 2027 schließt sich ein Mathematiktest an. Später führt die BbB (Beratungsstelle besondere Begabungen; Landesinstitut Hamburg) für etwa 250 Kinder einen Intelligenztest durch. Einführende Folien mit weiteren Informationen von Frau Prof. Dr. Nolte finden Sie auf unserer Homepage. Am 16.09.2026 bietet sie eine Fragestunde für die Eltern zum Förderprojekt und zur Talentsuche an (18:30-19:30 Uhr, https://bbb1.physnet.uni-hamburg.de/b/mar-6zx-9ek). Bis zum 24.09.2026: Anmeldung über Online-Formular unter: Aktuelles auf www.prima-mathematik.uni-hamburg.de und Einsendung der Lösung der Vorbereitungsaufgabe: mathe-treff.ew@uni-hamburg.de Für Nachfragen steht Ihnen Frau Dr. Kirsten Pamperien (Lehrerin, Projektleiterin und Projektkoordinatorin) mittwochs von 9.00 Uhr bis 11.00 Uhr unter der Nummer +4940 239525524 oder unter kirsten.pamperien@uni-hamburg.de zur Verfügung Mit freundlichen Grüßen Prof. Dr. Marianne Nolte (Universität Hamburg) Talentsuche und Förderung sind ein Kooperationsprojekt zwischen der Universität Hamburg, dem MINT-Referat der Behörde für Schule, Familie und Berufsbildung, der Beratungsstelle besondere Begabungen des Landesinstituts für Qualifizierungen und Qualitätsentwicklung in Schulen und der William-Stern-Gesellschaft.



https://www.amazon.de/Mathematik-ist-PriMa-F%C3%B6rderung-mathematischen/dp/3959873395

Analyzing WaGa and MKL-1 Cell Line miRNA (Data_Ute_smallRNA_via_exceRpt_workspace)

manhattan_plot_Carmen_custom_labels_WaGa.R

manhattan_plot_Carmen_custom_labels_MKL-1.R

For example, the MKL-1 cell line miRNA analysis produced the following results:

* Raw count data (d_raw_MKL-1.xlsx): Contains the raw, unnormalized read counts for all miRNAs.
* Mapping heatmap (mapping_heatmap3_MKL-1.pdf)
* Volcano plot (MKL.1_wt_EV_vs_MKL.1_wt_cells.png and .svg)
* PCA plot (pca_MKL-1.png)
* Manhattan plot and data (manhattan_plot_MKL1_vs_EV.png, .svg, and manhattan_plot_MKL1_data.xlsx)
  1. Input data

     WaGa wt cells (nf774* (possible outlier; removal is under consideration, but it is still included in the current analysis), nf961, nf962)
     WaGa wt_EV_RNA (nf657* (EXCLUDED: a clear outlier that does not cluster with the other 2 samples), nf930, nf935)
     WaGa_sT_DMSO_EV_RNA (nf931, nf936, nf971)
     WaGa_sT_Dox_EV_RNA (nf932, nf937, nf972)
     WaGa_scr_DMSO_EV_RNA (nf933, nf938, nf973)
     WaGa_scr_Dox_EV_RNA (nf934, nf939, nf974)
     # --> In total, 17 samples
    
     MKL-1 wt cells (nf780*, nf796*, nf797*)
     MKL-1 wt_EV_RNA (nf655* (The sample was EXCLUDED), 2404, 2608)
     MKL-1_sT_DMSO_EV_RNA (2608, 2701, 2802)
     MKL-1_sT_Dox_EV_RNA (2608, 2701, 2802)
     MKL-1_scr_DMSO_EV_RNA (2608, 2701, 2802)
     MKL-1_scr_Dox_EV_RNA (2608, 2701, 2802)
     # --> In total, 18 samples
    
     #Note that the real paths are as follows:
     #./20260506_AV243904_0073_A/2404_MKL1_wt_EVs/2404_MKL1_wt_EVs_R1.fastq.gz, ./20260506_AV243904_0073_A/2608_MKL1_wt_EVs/2608_MKL1_wt_EVs_R1.fastq.gz
     #./20260506_AV243904_0073_A/2608_MKL1_sT_DMSO/2608_MKL1_sT_DMSO_R1.fastq.gz, ./20260506_AV243904_0073_A/2701_MKL1_sT_DMSO/2701_MKL1_sT_DMSO_R1.fastq.gz, ./20260506_AV243904_0073_A/2802_MKL1_sT_DMSO/2802_MKL1_sT_DMSO_R1.fastq.gz
     #./20260506_AV243904_0073_A/2608_MKL1_sT_Dox/2608_MKL1_sT_Dox_R1.fastq.gz, ./20260506_AV243904_0073_A/2701_MKL1_sT_Dox/2701_MKL1_sT_Dox_R1.fastq.gz, ./20260506_AV243904_0073_A/2802_MKL1_sT_Dox/2802_MKL1_sT_Dox_R1.fastq.gz
     #./20260506_AV243904_0073_A/2608_MKL1_scr_DMSO/2608_MKL1_scr_DMSO_R1.fastq.gz, ./20260506_AV243904_0073_A/2701_MKL1_scr_DMSO/2701_MKL1_scr_DMSO_R1.fastq.gz, ./20260506_AV243904_0073_A/2802_MKL1_scr_DMSO/2802_MKL1_scr_DMSO_R1.fastq.gz
     #./20260506_AV243904_0073_A/2608_MKL1_scr_Dox/2608_MKL1_scr_Dox_R1.fastq.gz, ./20260506_AV243904_0073_A/2701_MKL1_scr_Dox/2701_MKL1_scr_Dox_R1.fastq.gz, ./20260506_AV243904_0073_A/2802_MKL1_scr_Dox/2802_MKL1_scr_Dox_R1.fastq.gz
  2. Adapter trimming

     #some common adapter sequences from different kits for reference:
     #    - TruSeq Small RNA (Illumina): TGGAATTCTCGGGTGCCAAGG
     #    - Small RNA Kits V1 (Illumina): TCGTATGCCGTCTTCTGCTTGT
     #    - Small RNA Kits V1.5 (Illumina): ATCTCGTATGCCGTCTTCTGCTTG
     #    - NEXTflex Small RNA Sequencing Kit v3 for Illumina Platforms (Bioo Scientific): TGGAATTCTCGGGTGCCAAGG
     #    - LEXOGEN Small RNA-Seq Library Prep Kit (Illumina): TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC *
     mkdir Data_Ute_smallRNA_via_exceRpt_workspace/trimmed; cd Data_Ute_smallRNA_via_exceRpt_workspace/trimmed
    
     echo "------------------------------------ cutadapting nf774 -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf774.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_4/230623_newDemulti_smallRNAs/220617_NB501882_0371_AH7572BGXM_smallRNA_Ute_newDemulti/2022_nf_ute_smallRNA/nf774/0403_WaGa_wt_S1_R1_001.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting nf657 -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf657.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_4/230623_newDemulti_smallRNAs/210817_NB501882_0294_AHW5Y2BGXJ_smallRNA_Ute_newDemulti/2021_nf_ute_smallRNA/nf657/WaGa_derived_EV_miRNA_S2_R1_001.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting nf655 -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf655.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_4/230623_newDemulti_smallRNAs/210817_NB501882_0294_AHW5Y2BGXJ_smallRNA_Ute_newDemulti/2021_nf_ute_smallRNA/nf655/MKL_1_derived_EV_miRNA_S1_R1_001.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting nf780 -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf780.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_4/230623_newDemulti_smallRNAs/220617_NB501882_0371_AH7572BGXM_smallRNA_Ute_newDemulti/2022_nf_ute_smallRNA/nf780/0505_MKL1_wt_S2_R1_001.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting nf796 -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf796.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_4/230623_newDemulti_smallRNAs/221216_NB501882_0404_AHLVNMBGXM_smallRNA_Ute_newDemulti/2022_nf_ute_smallRNA/nf796/MKL-1_wt_1_S1_R1_001.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting nf797 -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf797.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_4/230623_newDemulti_smallRNAs/221216_NB501882_0404_AHLVNMBGXM_smallRNA_Ute_newDemulti/2022_nf_ute_smallRNA/nf797/MKL-1_wt_2_S2_R1_001.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting nf930 -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf930.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_7/231016_NB501882_0435_AHG7HMBGXV/nf930/01_0505_WaGa_wt_EV_RNA_S1_R1_001.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting nf931 -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf931.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_7/231016_NB501882_0435_AHG7HMBGXV/nf931/02_0505_WaGa_sT_DMSO_EV_RNA_S2_R1_001.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting nf932 -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf932.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_7/231016_NB501882_0435_AHG7HMBGXV/nf932/03_0505_WaGa_sT_Dox_EV_RNA_S3_R1_001.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting nf933 -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf933.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_7/231016_NB501882_0435_AHG7HMBGXV/nf933/04_0505_WaGa_scr_DMSO_EV_RNA_S4_R1_001.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting nf934 -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf934.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_7/231016_NB501882_0435_AHG7HMBGXV/nf934/05_0505_WaGa_scr_Dox_EV_RNA_S5_R1_001.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting nf935 -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf935.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_7/231016_NB501882_0435_AHG7HMBGXV/nf935/06_1905_WaGa_wt_EV_RNA_S6_R1_001.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting nf936 -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf936.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_7/231016_NB501882_0435_AHG7HMBGXV/nf936/07_1905_WaGa_sT_DMSO_EV_RNA_S7_R1_001.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting nf937 -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf937.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_7/231016_NB501882_0435_AHG7HMBGXV/nf937/08_1905_WaGa_sT_Dox_EV_RNA_S8_R1_001.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting nf938 -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf938.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_7/231016_NB501882_0435_AHG7HMBGXV/nf938/09_1905_WaGa_scr_DMSO_EV_RNA_S9_R1_001.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting nf939 -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf939.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_7/231016_NB501882_0435_AHG7HMBGXV/nf939/10_1905_WaGa_scr_Dox_EV_RNA_S10_R1_001.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting nf940 -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf940.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_7/231016_NB501882_0435_AHG7HMBGXV/nf940/11_control_MKL1_S11_R1_001.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting nf941 -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf941.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_7/231016_NB501882_0435_AHG7HMBGXV/nf941/12_control_WaGa_S12_R1_001.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting nf961 -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf961.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_7/250411_VH00358_135_AAGKGLHM5/nf961/WaGaWTcells_1_S1_R1_001.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting nf962 -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf962.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_7/250411_VH00358_135_AAGKGLHM5/nf962/WaGaWTcells_2_S2_R1_001.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting nf971 -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf971.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_7/250411_VH00358_135_AAGKGLHM5/nf971/2001_WaGa_sT_DMSO_S3_R1_001.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting nf972 -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf972.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_7/250411_VH00358_135_AAGKGLHM5/nf972/2001_WaGa_sT_Dox_S4_R1_001.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting nf973 -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf973.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_7/250411_VH00358_135_AAGKGLHM5/nf973/2001_WaGa_scr_DMSO_S5_R1_001.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting nf974 -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf974.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_7/250411_VH00358_135_AAGKGLHM5/nf974/2001_WaGa_scr_Dox_S6_R1_001.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting 2404_MKL1_wt_EVs -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o 2404_MKL1_wt_EVs.fastq.gz ~/DATA/Data_Ute_smallRNA/20260506_AV243904_0073_A/2404_MKL1_wt_EVs/2404_MKL1_wt_EVs_R1.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting 2608_MKL1_wt_EVs -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o 2608_MKL1_wt_EVs.fastq.gz ~/DATA/Data_Ute_smallRNA/20260506_AV243904_0073_A/2608_MKL1_wt_EVs/2608_MKL1_wt_EVs_R1.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting 2608_MKL1_sT_DMSO -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o 2608_MKL1_sT_DMSO.fastq.gz ~/DATA/Data_Ute_smallRNA/20260506_AV243904_0073_A/2608_MKL1_sT_DMSO/2608_MKL1_sT_DMSO_R1.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting 2701_MKL1_sT_DMSO -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o 2701_MKL1_sT_DMSO.fastq.gz ~/DATA/Data_Ute_smallRNA/20260506_AV243904_0073_A/2701_MKL1_sT_DMSO/2701_MKL1_sT_DMSO_R1.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting 2802_MKL1_sT_DMSO -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o 2802_MKL1_sT_DMSO.fastq.gz ~/DATA/Data_Ute_smallRNA/20260506_AV243904_0073_A/2802_MKL1_sT_DMSO/2802_MKL1_sT_DMSO_R1.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting 2608_MKL1_sT_Dox -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o 2608_MKL1_sT_Dox.fastq.gz ~/DATA/Data_Ute_smallRNA/20260506_AV243904_0073_A/2608_MKL1_sT_Dox/2608_MKL1_sT_Dox_R1.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting 2701_MKL1_sT_Dox -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o 2701_MKL1_sT_Dox.fastq.gz ~/DATA/Data_Ute_smallRNA/20260506_AV243904_0073_A/2701_MKL1_sT_Dox/2701_MKL1_sT_Dox_R1.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting 2802_MKL1_sT_Dox -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o 2802_MKL1_sT_Dox.fastq.gz ~/DATA/Data_Ute_smallRNA/20260506_AV243904_0073_A/2802_MKL1_sT_Dox/2802_MKL1_sT_Dox_R1.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting 2608_MKL1_scr_DMSO -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o 2608_MKL1_scr_DMSO.fastq.gz ~/DATA/Data_Ute_smallRNA/20260506_AV243904_0073_A/2608_MKL1_scr_DMSO/2608_MKL1_scr_DMSO_R1.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting 2701_MKL1_scr_DMSO -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o 2701_MKL1_scr_DMSO.fastq.gz ~/DATA/Data_Ute_smallRNA/20260506_AV243904_0073_A/2701_MKL1_scr_DMSO/2701_MKL1_scr_DMSO_R1.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting 2802_MKL1_scr_DMSO -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o 2802_MKL1_scr_DMSO.fastq.gz ~/DATA/Data_Ute_smallRNA/20260506_AV243904_0073_A/2802_MKL1_scr_DMSO/2802_MKL1_scr_DMSO_R1.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting 2608_MKL1_scr_Dox -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o 2608_MKL1_scr_Dox.fastq.gz ~/DATA/Data_Ute_smallRNA/20260506_AV243904_0073_A/2608_MKL1_scr_Dox/2608_MKL1_scr_Dox_R1.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting 2701_MKL1_scr_Dox -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o 2701_MKL1_scr_Dox.fastq.gz ~/DATA/Data_Ute_smallRNA/20260506_AV243904_0073_A/2701_MKL1_scr_Dox/2701_MKL1_scr_Dox_R1.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting 2802_MKL1_scr_Dox -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o 2802_MKL1_scr_Dox.fastq.gz ~/DATA/Data_Ute_smallRNA/20260506_AV243904_0073_A/2802_MKL1_scr_Dox/2802_MKL1_scr_Dox_R1.fastq.gz >> LOG
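The trimming commands above differ only in the sample name and the input path, so they could be generated from a small manifest instead of being written out by hand. The following is a minimal sketch under stated assumptions: the manifest file (tab-separated `sample<TAB>input path`, with absolute paths rather than `~`) and the function names `trim_cmd` and `trim_cmds_from_manifest` are hypothetical; the cutadapt options are copied from the commands above. It only prints the commands so they can be reviewed before running.

```shell
#!/usr/bin/env bash
# Adapter sequence used in the commands above.
ADAPTER="TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC"

# Print the cutadapt command for one sample (same options as above).
# %q shell-quotes the sample name and path, so the printed line is safe to paste.
trim_cmd() {
    local sample="$1" input="$2"
    printf 'cutadapt -a %s -q 20 --minimum-length 5 --trim-n -o %q.fastq.gz %q\n' \
        "$ADAPTER" "$sample" "$input"
}

# Read "sample<TAB>input path" pairs from a manifest (hypothetical file)
# and print one cutadapt command per non-empty line.
trim_cmds_from_manifest() {
    local manifest="$1" sample input
    while IFS=$'\t' read -r sample input || [ -n "$sample" ]; do
        if [ -n "$sample" ]; then
            trim_cmd "$sample" "$input"
        fi
    done < "$manifest"
}

# Usage: review the generated commands first, then run them and append to LOG.
#   trim_cmds_from_manifest samples.tsv > trim_commands.sh
#   bash trim_commands.sh >> LOG
```

Generating the commands from one template keeps the adapter and quality options identical across all samples, which is easy to get wrong when editing dozens of near-identical lines.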
  3. Install exceRpt (https://github.gersteinlab.org/exceRpt/)

     docker pull rkitchen/excerpt
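     # Create the database directory at the location that is mounted into the container later
     mkdir -p /mnt/nvme0n1p1/MyexceRptDatabase
     cd /mnt/nvme0n1p1/MyexceRptDatabase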
     wget http://org.gersteinlab.excerpt.s3-website-us-east-1.amazonaws.com/exceRptDB_v4_hg38_lowmem.tgz
     tar -xvf exceRptDB_v4_hg38_lowmem.tgz
     #http://org.gersteinlab.excerpt.s3-website-us-east-1.amazonaws.com/exceRptDB_v4_hg19_lowmem.tgz
     #http://org.gersteinlab.excerpt.s3-website-us-east-1.amazonaws.com/exceRptDB_v4_hg38_lowmem.tgz
     #http://org.gersteinlab.excerpt.s3-website-us-east-1.amazonaws.com/exceRptDB_v4_mm10_lowmem.tgz
     wget http://org.gersteinlab.excerpt.s3-website-us-east-1.amazonaws.com/exceRptDB_v4_EXOmiRNArRNA.tgz
     tar -xvf exceRptDB_v4_EXOmiRNArRNA.tgz
     wget http://org.gersteinlab.excerpt.s3-website-us-east-1.amazonaws.com/exceRptDB_v4_EXOGenomes.tgz
     tar -xvf exceRptDB_v4_EXOGenomes.tgz
    
     # List extracted hg38 directory structure
     find hg38 -type f | sed 's|^hg38/||' | sort > extracted_hg38.txt
     comm -3 extracted_hg38.txt <(tar -tf exceRptDB_v4_hg38_lowmem.tgz | grep '^hg38/' | sed 's|^hg38/||' | sort)  # --> DIR hg38
     tar -tf exceRptDB_v4_EXOmiRNArRNA.tgz  # --> DIR ribosomeDatabase, NCBI_taxonomy_taxdump, miRBase
     tar -tf exceRptDB_v4_EXOGenomes.tgz  # --> Genomes_BacteriaFungiMammalPlantProtistVirus
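The three download-and-extract pairs above follow one pattern, sketched below as a helper. This is an illustration rather than part of the exceRpt documentation: the function names `db_url` and `fetch_db` are hypothetical, and the base URL and archive names are the ones used in the commands above.

```shell
#!/usr/bin/env bash
# Base URL taken from the wget commands above.
DB_BASE_URL="http://org.gersteinlab.excerpt.s3-website-us-east-1.amazonaws.com"

# Print the download URL for one database archive (name without the .tgz suffix).
db_url() {
    printf '%s/%s.tgz\n' "$DB_BASE_URL" "$1"
}

# Download one archive into the current directory and unpack it.
# The && means a failed or interrupted download is never extracted;
# wget -c resumes a partial download on the next attempt.
fetch_db() {
    local name="$1"
    wget -c "$(db_url "$name")" && tar -xf "${name}.tgz"
}

# Usage (run inside the database directory):
#   for name in exceRptDB_v4_hg38_lowmem exceRptDB_v4_EXOmiRNArRNA exceRptDB_v4_EXOGenomes; do
#       fetch_db "$name" || break
#   done
```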
  4. Run exceRpt

     #[---- REAL_RUNNING_COMPLETE_DB ---->]
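     #NOTE: if the input files were not renamed beforehand, all files inside _CORE_RESULTS_v4.6.3.tgz must be renamed recursively by removing "_cutadapted.fastq" from their names (unpack, rename, repack, then mv to ../results_g).
     cd trimmed
     # Dry run: print the mv commands that strip ".fastq" from each name (nf774.fastq.gz -> nf774.gz).
     # Nothing is renamed until the printed commands are executed.
     for file in *.fastq.gz; do
         echo "mv \"$file\" \"${file/.fastq/}\""
     done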
    
     mkdir results
     for sample in nf780 nf796 nf797  nf655    nf774 nf961 nf962  nf657 nf930 nf935  nf931 nf936 nf971  nf932 nf937 nf972  nf933 nf938 nf973  nf934 nf939 nf974; do
         docker run -v ~/DATA/Data_Ute_smallRNA_via_exceRpt_workspace/trimmed:/exceRptInput \
                    -v ~/DATA/Data_Ute_smallRNA_via_exceRpt_workspace/results:/exceRptOutput \
                   -v /mnt/nvme0n1p1/MyexceRptDatabase:/exceRpt_DB \
                   -t rkitchen/excerpt \
                   INPUT_FILE_PATH=/exceRptInput/${sample}.gz MAIN_ORGANISM_GENOME_ID=hg38 N_THREADS=50 JAVA_RAM='200G' MAP_EXOGENOUS=on
     done
    
     for sample in 2404_MKL1_wt_EVs 2608_MKL1_wt_EVs    2608_MKL1_sT_DMSO 2701_MKL1_sT_DMSO 2802_MKL1_sT_DMSO    2608_MKL1_sT_Dox 2701_MKL1_sT_Dox 2802_MKL1_sT_Dox    2608_MKL1_scr_DMSO 2701_MKL1_scr_DMSO 2802_MKL1_scr_DMSO    2608_MKL1_scr_Dox 2701_MKL1_scr_Dox 2802_MKL1_scr_Dox; do
         docker run -v ~/DATA/Data_Ute_smallRNA_via_exceRpt_workspace/trimmed:/exceRptInput \
                    -v ~/DATA/Data_Ute_smallRNA_via_exceRpt_workspace/results:/exceRptOutput \
                   -v /mnt/nvme3n1p1/MyexceRptDatabase:/exceRpt_DB \
                   -t rkitchen/excerpt \
                   INPUT_FILE_PATH=/exceRptInput/${sample}.gz MAIN_ORGANISM_GENOME_ID=hg38 N_THREADS=50 JAVA_RAM='200G' MAP_EXOGENOUS=on
     done
    
     #DEBUG the excerpt env
     docker inspect rkitchen/excerpt:latest
     # Without /bin/bash → May run and exit immediately
     #docker run -it rkitchen/excerpt
     # With /bin/bash → Stays open for interaction
     docker run -it --entrypoint /bin/bash rkitchen/excerpt
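
The renaming dry run at the start of this step can also be sketched in Python. This is a minimal sketch, not part of the original workflow: the helper names `stripped_name` and `planned_renames` are hypothetical, and the default suffix `.fastq` mirrors the shell loop above (pass `"_cutadapted.fastq"` for the case described in the NOTE).

```python
from pathlib import Path


def stripped_name(name, suffix=".fastq"):
    """Return `name` with the first occurrence of `suffix` removed."""
    return name.replace(suffix, "", 1)


def planned_renames(directory, suffix=".fastq"):
    """Collect (old_path, new_path) pairs for all files below `directory`.

    Nothing is renamed here; the caller decides whether to apply the plan.
    """
    plan = []
    for path in sorted(Path(directory).rglob("*")):
        if path.is_file() and suffix in path.name:
            plan.append((path, path.with_name(stripped_name(path.name, suffix))))
    return plan


if __name__ == "__main__":
    # Dry run: print the planned moves for the 'trimmed' directory (assumed name).
    for old, new in planned_renames("trimmed"):
        print(f"mv '{old}' '{new}'")
        # old.rename(new)  # uncomment only after checking the dry-run output
```

Printing the plan first keeps the behaviour of the shell loop, which only echoes the `mv` commands.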
  5. Processing exceRpt output from multiple samples

     cd ~/DATA/Data_Ute_smallRNA_via_exceRpt_workspace/exceRpt-master
     mamba activate r_env
     mamba install -c conda-forge -c bioconda \
         bioconductor-marray \
         bioconductor-rgraphviz \
         r-plyr r-gplots r-reshape2 r-ggplot2 r-scales r-openxlsx r-rcurl r-xml \
         -y
     mamba install -c conda-forge -c bioconda \
         r-plyr r-gplots r-reshape2 r-ggplot2 r-scales r-openxlsx \
         bioconductor-marray bioconductor-rgraphviz \
         -y
    
     #mkdir summaries heatmap_all_WaGa+4_MKL-1
     mkdir results_WaGa_EXCLUDED results_MKL-1 summaries_WaGa summaries_MKL-1 heatmap_WaGa heatmap_MKL-1
     #! EXCLUDE isolates that show a completely different pattern or are of bad quality (outliers). So far only one sample has been excluded, namely nf657 from WaGa wt EV:
     sudo mv results/nf657* results_WaGa_EXCLUDED/
     sudo mv results/nf780* results_MKL-1/
     sudo mv results/nf796* results_MKL-1/
     sudo mv results/nf797* results_MKL-1/
     sudo mv results/nf655* results_MKL-1/
     for sample in 2404_MKL1_wt_EVs 2608_MKL1_wt_EVs    2608_MKL1_sT_DMSO 2701_MKL1_sT_DMSO 2802_MKL1_sT_DMSO    2608_MKL1_sT_Dox 2701_MKL1_sT_Dox 2802_MKL1_sT_Dox    2608_MKL1_scr_DMSO 2701_MKL1_scr_DMSO 2802_MKL1_scr_DMSO    2608_MKL1_scr_Dox 2701_MKL1_scr_Dox 2802_MKL1_scr_Dox; do
         echo "sudo mv results/${sample}* results_MKL-1/"
     done
     #Following our initial QC, I noticed that one of the MKL-1 wt-EV samples (nf655) is a clear outlier: it clusters far apart from the other two wt-EV replicates in the PCoA plots. I recommend removing nf655 from the downstream MKL-1 analysis, as we did in the earlier WaGa analysis, where we removed the outlier nf657. Please see the attached figures for reference.
     mkdir -p results_MKL-1_EXCLUDED
     mv results_MKL-1/nf655* results_MKL-1_EXCLUDED/
    
     (r_env) jhuang@WS-2290C:~/DATA/Data_Ute_smallRNA_via_exceRpt_workspace/exceRpt-master$ R
     #WARNING: need to reload the R-script after each change of the script.
     source("mergePipelineRuns_functions.R")
     processSamplesInDir("../results_WaGa/", "../summaries_WaGa")
     processSamplesInDir("../results_MKL-1/", "../summaries_MKL-1")
    
     #mkdir heatmap_WaGa; cp summaries_WaGa/*.RData heatmap_WaGa; rm heatmap_WaGa/exceRpt_sampleGroupDefinitions.txt;
     source("mergePipelineRuns_functions_addSampleGroupInfo_WaGa.R")
     processSamplesInDir("../results_WaGa/", "../heatmap_WaGa")
    
     #mkdir heatmap_MKL-1; cp summaries_MKL-1/*.RData heatmap_MKL-1; rm heatmap_MKL-1/exceRpt_sampleGroupDefinitions.txt;
     source("mergePipelineRuns_functions_addSampleGroupInfo_MKL-1.R")
     processSamplesInDir("../results_MKL-1/", "../heatmap_MKL-1")
    
     #!!!!! IMPORTANT: REPORT heatmap_MKL-1/exceRpt_DiagnosticPlots.pdf and heatmap_MKL-1/mapping_heatmap3.pdf (They are almost the same, mapping_heatmap3.pdf is better due to bigger font size) !!!!
     #CONSIDERING_TO_DEL_nf774 since it lies very far from the other two samples (MAYBE BETTER NOT TO DO THIS, SINCE I WOULD HAVE TO REGENERATE THE PCA AND MANHATTAN PLOTS!!): for now, sample nf774 is kept in the WaGa results.
    
     #~/Tools/csv2xls-0.4/csv_to_xls.py exceRpt_miRNA_ReadsPerMillion.txt exceRpt_tRNA_ReadsPerMillion.txt exceRpt_piRNA_ReadsPerMillion.txt -d$'\t' -o exceRpt_results_detailed.xls
    
     # Report summaries_WaGa/exceRpt_mapping_heatmaps_WaGa.xlsx or summaries_MKL-1/exceRpt_mapping_heatmaps_MKL-1.xlsx;
     #        summaries_WaGa/exceRpt_results_detailed_WaGa.xls or summaries_MKL-1/exceRpt_results_detailed_MKL-1.xls;
     #        heatmap_WaGa/mapping_heatmap3_WaGa.pdf or heatmap_MKL-1/mapping_heatmap3_MKL-1.pdf
  6. Downstream analysis using R for miRNAs (17 WaGa samples)

     #Input file
     #exceRpt_miRNA_ReadCounts.txt
     #exceRpt_piRNA_ReadCounts.txt
    
     ## WaGa experimental groups (scr = scramble control; sT = target knockdown)
     #WaGa_scr_DMSO_EV (nf933, nf938, nf973)
     #WaGa_scr_Dox_EV (nf934, nf939, nf974)
     #WaGa_sT_DMSO_EV (nf931, nf936, nf971)
     #WaGa_sT_Dox_EV (nf932, nf937, nf972)
     #
     ## WaGa wild-type controls
     #WaGa_wt_cells (nf774, nf961, nf962)
     #WaGa_wt_EV (nf930, nf935)
    
     cd ~/DATA/Data_Ute_smallRNA_via_exceRpt_workspace/summaries_WaGa
     mamba activate r_env
     R
    
     #BiocManager::install("AnnotationDbi")
     #BiocManager::install("clusterProfiler")
     #BiocManager::install(c("ReactomePA","org.Hs.eg.db"))
     #BiocManager::install("limma")
     #BiocManager::install("sva")
     #install.packages("writexl")
     #install.packages("openxlsx")
     library("AnnotationDbi")
     library("clusterProfiler")
     library("ReactomePA")
     library("org.Hs.eg.db")
     library(DESeq2)
     library(gplots)
     library(limma)
     library(sva)
     #library(writexl)  #d.raw_with_rownames <- cbind(RowNames = rownames(d.raw), d.raw); write_xlsx(d.raw, path = "d_raw.xlsx");
     library(openxlsx)
    
     d.raw<- read.delim2("exceRpt_miRNA_ReadCounts.txt",sep="\t", header=TRUE, row.names=1)
    
     # Desired column order
     desired_order <- c(
         "nf933", "nf938", "nf973",
         "nf934", "nf939", "nf974",
         "nf931", "nf936", "nf971",
         "nf932", "nf937", "nf972",
         "nf774", "nf961", "nf962",
         "nf930", "nf935"
     )
    
     # Reorder columns
     d.raw <- d.raw[, desired_order]
     setdiff(desired_order, colnames(d.raw))  # Shows missing or misnamed columns
     #sapply(d.raw, is.numeric)
     d.raw[] <- lapply(d.raw, as.numeric)
     #d.raw[] <- lapply(d.raw, function(x) as.numeric(as.character(x)))
     d.raw <- round(d.raw)
     write.csv(d.raw, file ="d_raw.csv")
     write.xlsx(d.raw, file = "d_raw.xlsx", rowNames = TRUE)
    
     # ------ Code sent to Ute ------
     #d.raw <- read.delim2("d_raw.csv",sep=",", header=TRUE, row.names=1)
     Cell_or_EV = as.factor(c("EV","EV","EV",  "EV","EV","EV",  "EV","EV","EV",  "EV","EV","EV",  "Cell","Cell","Cell",  "EV","EV"))
     replicates = as.factor(c("WaGa_scr_DMSO_EV","WaGa_scr_DMSO_EV","WaGa_scr_DMSO_EV",     "WaGa_scr_Dox_EV","WaGa_scr_Dox_EV","WaGa_scr_Dox_EV",  "WaGa_sT_DMSO_EV","WaGa_sT_DMSO_EV","WaGa_sT_DMSO_EV",  "WaGa_sT_Dox_EV","WaGa_sT_Dox_EV","WaGa_sT_Dox_EV",  "WaGa_wt_cells", "WaGa_wt_cells","WaGa_wt_cells",  "WaGa_wt_EV", "WaGa_wt_EV"))
     ids = as.factor(c(
         "nf933", "nf938", "nf973",
         "nf934", "nf939", "nf974",
         "nf931", "nf936", "nf971",
         "nf932", "nf937", "nf972",
         "nf774", "nf961", "nf962",
         "nf930", "nf935"))
     cData = data.frame(row.names=colnames(d.raw), replicates=replicates, ids=ids, Cell_or_EV=Cell_or_EV)
     dds<-DESeqDataSetFromMatrix(countData=d.raw, colData=cData, design=~replicates)
    
     # Filter low-count miRNAs
     dds <- dds[ rowSums(counts(dds)) > 10, ]
     rld <- rlogTransformation(dds)
    
     # -- before pca --
     png("pca.png", 1200, 800)
     plotPCA(rld, intgroup=c("replicates"))
     #plotPCA(rld, intgroup = c("replicates", "batch"))
     #plotPCA(rld, intgroup = c("replicates", "ids"))
     #plotPCA(rld, "batch")
     dev.off()
     png("pca2.png", 1200, 800)
     #plotPCA(rld, intgroup=c("replicates"))
     #plotPCA(rld, intgroup = c("replicates", "batch"))
     plotPCA(rld, intgroup = c("replicates", "ids"))
     #plotPCA(rld, "batch")
     dev.off()
    
     # Batch effect removal (no batch-effect removal was applied!)
    
     #### STEP2: DEGs ####
     #- Heatmap untreated/wt vs parental; 1x for WaGa cell line
     #- Volcano plot untreated/wt vs parental; 1x for WaGa cell line
     #- Manhattan plot miRNAs; 1x for WaGa cell line
     #- Distribution of different small RNA species untreated/wt and parental; 1x for WaGa cell line
     #- Motif analysis: identify RNA-binding proteins that may regulate small RNA loading; 1x for WaGa cell line
    
     #convert bam to bigwig using deepTools by feeding inverse of DESeq’s size Factor
     sizeFactors(dds)
     #NULL
     dds <- estimateSizeFactors(dds)
     sizeFactors(dds)
     normalized_counts <- counts(dds, normalized=TRUE)
     write.table(normalized_counts, file="normalized_counts.txt", sep="\t", quote=F, col.names=NA)
     write.xlsx(normalized_counts, file = "normalized_counts.xlsx", rowNames = TRUE)
    
     dds<-DESeqDataSetFromMatrix(countData=d.raw, colData=cData, design=~replicates)
    
     dds$replicates <- relevel(dds$replicates, "WaGa_wt_cells")
     dds = DESeq(dds, betaPrior=FALSE)  #default betaPrior is FALSE
     resultsNames(dds)
     clist <- c("WaGa_wt_EV_vs_WaGa_wt_cells")
    
     #NOTE: the results sent to Ute used padj<=0.1 (the loop below uses padj<=0.05).
     for (i in clist) {
         contrast = paste("replicates", i, sep="_")
         res = results(dds, name=contrast)
         res <- res[!is.na(res$log2FoldChange),]
         #https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#why-are-some-p-values-set-to-na
         res$padj <- ifelse(is.na(res$padj), 1, res$padj)
         res_df <- as.data.frame(res)
         write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(i, "all.txt", sep="-"))
         up <- subset(res_df, padj<=0.05 & log2FoldChange>=2)
         down <- subset(res_df, padj<=0.05 & log2FoldChange<=-2)
         write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(i, "up.txt", sep="-"))
         write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(i, "down.txt", sep="-"))
     }
    
     ~/Tools/csv2xls-0.4/csv_to_xls.py \
     WaGa_wt_EV_vs_WaGa_wt_cells-all.txt \
     WaGa_wt_EV_vs_WaGa_wt_cells-up.txt \
     WaGa_wt_EV_vs_WaGa_wt_cells-down.txt \
     -d$',' -o WaGa_wt_EV_vs_WaGa_wt_cells.xls;
    
     # ------------------- volcano_plot -------------------
     library(ggplot2)
     library(ggrepel)
    
     geness_res <- read.csv(file = paste("WaGa_wt_EV_vs_WaGa_wt_cells", "all.txt", sep="-"), row.names=1)
    
     external_gene_name <- rownames(geness_res)
     geness_res <- cbind(geness_res, external_gene_name)
     #top_g are from ids
     top_g <- c("hsa-let-7b-5p","hsa-let-7g-5p","hsa-let-7i-5p","hsa-miR-103a-3p","hsa-miR-107","hsa-miR-1224-5p","hsa-miR-122-5p","hsa-miR-1226-5p","hsa-miR-1246","hsa-miR-127-3p","hsa-miR-1290","hsa-miR-130a-3p","hsa-miR-139-3p","hsa-miR-141-3p","hsa-miR-143-3p","hsa-miR-148b-3p","hsa-miR-155-5p","hsa-miR-15a-5p","hsa-miR-17-5p","hsa-miR-184","hsa-miR-18a-3p","hsa-miR-18a-5p","hsa-miR-190a-5p","hsa-miR-191-5p","hsa-miR-193b-5p","hsa-miR-197-5p","hsa-miR-200a-3p","hsa-miR-200b-5p","hsa-miR-206","hsa-miR-20a-5p","hsa-miR-210-3p","hsa-miR-2110","hsa-miR-21-5p","hsa-miR-218-5p","hsa-miR-219a-1-3p","hsa-miR-221-3p","hsa-miR-23b-3p","hsa-miR-27a-3p","hsa-miR-27b-3p","hsa-miR-27b-5p","hsa-miR-28-3p","hsa-miR-30a-5p","hsa-miR-30c-5p","hsa-miR-30e-5p","hsa-miR-3127-5p","hsa-miR-3131","hsa-miR-3180|hsa-miR-3180-3p","hsa-miR-320a","hsa-miR-320b","hsa-miR-320c","hsa-miR-320d","hsa-miR-330-3p","hsa-miR-335-3p","hsa-miR-33b-5p","hsa-miR-340-5p","hsa-miR-342-5p","hsa-miR-3605-5p","hsa-miR-361-3p","hsa-miR-365a-5p","hsa-miR-374b-5p","hsa-miR-378i","hsa-miR-379-5p","hsa-miR-3940-5p","hsa-miR-409-3p","hsa-miR-411-5p","hsa-miR-423-3p","hsa-miR-423-5p","hsa-miR-4286","hsa-miR-429","hsa-miR-432-5p","hsa-miR-4326","hsa-miR-451a","hsa-miR-4520-3p","hsa-miR-454-3p","hsa-miR-4646-5p","hsa-miR-4667-5p","hsa-miR-4748","hsa-miR-483-5p","hsa-miR-486-5p","hsa-miR-5010-5p","hsa-miR-504-3p","hsa-miR-5187-5p","hsa-miR-590-3p","hsa-miR-6128","hsa-miR-625-5p","hsa-miR-6726-5p","hsa-miR-6730-5p","hsa-miR-676-3p","hsa-miR-6767-5p","hsa-miR-6777-5p","hsa-miR-6780a-5p","hsa-miR-6794-5p","hsa-miR-6817-3p","hsa-miR-708-5p","hsa-miR-7-5p","hsa-miR-766-5p","hsa-miR-7854-3p","hsa-miR-873-3p","hsa-miR-885-3p","hsa-miR-92b-5p","hsa-miR-93-5p","hsa-miR-937-3p","hsa-miR-9-5p","hsa-miR-98-5p")
     subset(geness_res, external_gene_name %in% top_g & pvalue < 0.05 & (abs(geness_res$log2FoldChange) >= 2.0))
     geness_res$Color <- "NS or log2FC < 2.0"
     geness_res$Color[geness_res$pvalue < 0.05] <- "P < 0.05"
     geness_res$Color[geness_res$padj < 0.05] <- "P-adj < 0.05"
     geness_res$Color[abs(geness_res$log2FoldChange) < 2.0] <- "NS or log2FC < 2.0"
    
     write.csv(geness_res, "WaGa_wt_EV_vs_WaGa_wt_cells_with_Category.csv")
     geness_res$invert_P <- (-log10(geness_res$pvalue)) * sign(geness_res$log2FoldChange)
    
     geness_res <- geness_res[, -1*ncol(geness_res)]
     png("WaGa_wt_EV_vs_WaGa_wt_cells.png",width=1200, height=1400)
     #svg("WaGa_wt_EV_vs_WaGa_wt_cells.svg",width=12, height=14)
     ggplot(geness_res,       aes(x = log2FoldChange, y = -log10(pvalue),           color = Color, label = external_gene_name)) +       geom_vline(xintercept = c(2.0, -2.0), lty = "dashed") +       geom_hline(yintercept = -log10(0.05), lty = "dashed") +       geom_point() +       labs(x = "log2(FC)", y = "Significance, -log10(P)", color = "Significance") +       scale_color_manual(values = c("P < 0.05"="orange","P-adj < 0.05"="red","NS or log2FC < 2.0"="darkgray"),guide = guide_legend(override.aes = list(size = 4))) + scale_y_continuous(expand = expansion(mult = c(0,0.05))) +       geom_text_repel(data = subset(geness_res, external_gene_name %in% top_g & pvalue < 0.05 & (abs(geness_res$log2FoldChange) >= 2.0)), size = 4, point.padding = 0.15, color = "black", min.segment.length = .1, box.padding = .2, lwd = 2) +       theme_bw(base_size = 16) +       theme(legend.position = "bottom")
     dev.off()
    
     # ----------------------------------------
     # ----------- manhattan_plot -------------
    
     Rscript manhattan_plot_Carmen_custom_labels.R  #exceRpt_miRNA_ReadCounts.txt
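
The `estimateSizeFactors()` call above relies on DESeq2's median-of-ratios idea. The following is a minimal Python sketch of that idea under simplifying assumptions: the toy count matrix is invented for illustration, the function names are hypothetical, and only features with nonzero counts in every sample are used. It is not DESeq2's implementation.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts: list of rows (features), each row a list of per-sample counts.
    """
    n_samples = len(counts[0])
    # Keep features that are nonzero in every sample, so that the geometric
    # mean across samples is positive and its logarithm is finite.
    usable = [row for row in counts if all(c > 0 for c in row)]
    if not usable:
        raise ValueError("every feature has at least one zero count")
    log_geo_means = [sum(math.log(c) for c in row) / n_samples for row in usable]
    factors = []
    for j in range(n_samples):
        log_ratios = [math.log(row[j]) - lgm for row, lgm in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors


def normalize(counts, factors):
    """Divide each count by the size factor of its sample."""
    return [[c / f for c, f in zip(row, factors)] for row in counts]


if __name__ == "__main__":
    toy = [[10, 20], [100, 200], [5, 10]]  # sample 2 sequenced twice as deep
    sf = size_factors(toy)
    print(sf)
    print(normalize(toy, sf))
```

In the toy matrix the second sample has exactly twice the counts of the first, so after normalization both columns carry the same values.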
  7. Downstream analysis using R for miRNAs (17 MKL-1 samples)

     #Input file
     #exceRpt_miRNA_ReadCounts.txt
     #exceRpt_piRNA_ReadCounts.txt
    
     #MKL-1_sT_DMSO_EV ("X2608_MKL1_sT_DMSO","X2701_MKL1_sT_DMSO","X2802_MKL1_sT_DMSO")
     #MKL-1_sT_Dox_EV ("X2608_MKL1_sT_Dox","X2701_MKL1_sT_Dox","X2802_MKL1_sT_Dox")
     #MKL-1_scr_DMSO_EV ("X2608_MKL1_scr_DMSO","X2701_MKL1_scr_DMSO","X2802_MKL1_scr_DMSO")
     #MKL-1_scr_Dox_EV ("X2608_MKL1_scr_Dox","X2701_MKL1_scr_Dox","X2802_MKL1_scr_Dox")
     #MKL-1_wt_cells ("nf780","nf796","nf797")
     #MKL-1_wt_EV ("X2404_MKL1_wt_EVs","X2608_MKL1_wt_EVs")
    
     cd ~/DATA/Data_Ute_smallRNA_via_exceRpt_workspace/summaries_MKL-1
     mamba activate r_env
     R
    
     #BiocManager::install("AnnotationDbi")
     #BiocManager::install("clusterProfiler")
     #BiocManager::install(c("ReactomePA","org.Hs.eg.db"))
     #BiocManager::install("limma")
     #BiocManager::install("sva")
     #install.packages("writexl")
     #install.packages("openxlsx")
     library("AnnotationDbi")
     library("clusterProfiler")
     library("ReactomePA")
     library("org.Hs.eg.db")
     library(DESeq2)
     library(gplots)
     library(limma)
     library(sva)
     #library(writexl)  #d.raw_with_rownames <- cbind(RowNames = rownames(d.raw), d.raw); write_xlsx(d.raw, path = "d_raw.xlsx");
     library(openxlsx)
    
     d.raw<- read.delim2("exceRpt_miRNA_ReadCounts.txt",sep="\t", header=TRUE, row.names=1)
    
     # Desired column order
     desired_order <- c(
         "X2608_MKL1_sT_DMSO","X2701_MKL1_sT_DMSO","X2802_MKL1_sT_DMSO", "X2608_MKL1_sT_Dox","X2701_MKL1_sT_Dox","X2802_MKL1_sT_Dox", "X2608_MKL1_scr_DMSO","X2701_MKL1_scr_DMSO","X2802_MKL1_scr_DMSO", "X2608_MKL1_scr_Dox","X2701_MKL1_scr_Dox","X2802_MKL1_scr_Dox",
         "nf780","nf796","nf797", "X2404_MKL1_wt_EVs","X2608_MKL1_wt_EVs"
     )
    
     # Reorder columns
     d.raw <- d.raw[, desired_order]
     setdiff(desired_order, colnames(d.raw))  # Shows missing or misnamed columns
     #sapply(d.raw, is.numeric)
     d.raw[] <- lapply(d.raw, as.numeric)
     #d.raw[] <- lapply(d.raw, function(x) as.numeric(as.character(x)))
     d.raw <- round(d.raw)
     write.csv(d.raw, file ="d_raw.csv")
     write.xlsx(d.raw, file = "d_raw.xlsx", rowNames = TRUE)
    
     #d.raw <- read.delim2("d_raw.csv",sep=",", header=TRUE, row.names=1)
     Cell_or_EV = as.factor(c("EV","EV","EV",  "EV","EV","EV",  "EV","EV","EV",  "EV","EV","EV",  "Cell","Cell","Cell",  "EV","EV"))
     replicates = as.factor(c("MKL-1_sT_DMSO_EV","MKL-1_sT_DMSO_EV","MKL-1_sT_DMSO_EV",     "MKL-1_sT_Dox_EV","MKL-1_sT_Dox_EV","MKL-1_sT_Dox_EV",  "MKL-1_scr_DMSO_EV","MKL-1_scr_DMSO_EV","MKL-1_scr_DMSO_EV",  "MKL-1_scr_Dox_EV","MKL-1_scr_Dox_EV","MKL-1_scr_Dox_EV",    "MKL-1_wt_cells", "MKL-1_wt_cells","MKL-1_wt_cells",  "MKL-1_wt_EV","MKL-1_wt_EV"))
     ids = as.factor(c("X2608_MKL1_sT_DMSO","X2701_MKL1_sT_DMSO","X2802_MKL1_sT_DMSO", "X2608_MKL1_sT_Dox","X2701_MKL1_sT_Dox","X2802_MKL1_sT_Dox", "X2608_MKL1_scr_DMSO","X2701_MKL1_scr_DMSO","X2802_MKL1_scr_DMSO", "X2608_MKL1_scr_Dox","X2701_MKL1_scr_Dox","X2802_MKL1_scr_Dox",
         "nf780","nf796","nf797", "X2404_MKL1_wt_EVs","X2608_MKL1_wt_EVs"))
     cData = data.frame(row.names=colnames(d.raw), replicates=replicates, ids=ids, Cell_or_EV=Cell_or_EV)
     dds<-DESeqDataSetFromMatrix(countData=d.raw, colData=cData, design=~replicates)
    
     # Filter low-count miRNAs
     dds <- dds[ rowSums(counts(dds)) > 10, ]
     rld <- rlogTransformation(dds)
    
     # -- before pca --
     png("pca.png", 1200, 800)
     plotPCA(rld, intgroup=c("replicates"))
     #plotPCA(rld, intgroup = c("replicates", "batch"))
     #plotPCA(rld, intgroup = c("replicates", "ids"))
     #plotPCA(rld, "batch")
     dev.off()
     png("pca2.png", 1200, 800)
     #plotPCA(rld, intgroup=c("replicates"))
     #plotPCA(rld, intgroup = c("replicates", "batch"))
     plotPCA(rld, intgroup = c("replicates", "ids"))
     #plotPCA(rld, "batch")
     dev.off()
    
     # Batch effect removal (no batch-effect removal was applied!)
    
     #### STEP2: DEGs ####
     #- Heatmap untreated/wt vs parental; 1x for MKL-1 cell line
     #- Volcano plot untreated/wt vs parental; 1x for MKL-1 cell line
     #- Manhattan plot miRNAs; 1x for MKL-1 cell line
     #- Distribution of different small RNA species untreated/wt and parental; 1x for MKL-1 cell line
     #- Motif analysis: identify RNA-binding proteins that may regulate small RNA loading; 1x for MKL-1 cell line
    
     #convert bam to bigwig using deepTools by feeding inverse of DESeq’s size Factor
     sizeFactors(dds)
     #NULL
     dds <- estimateSizeFactors(dds)
     sizeFactors(dds)
     normalized_counts <- counts(dds, normalized=TRUE)
     write.table(normalized_counts, file="normalized_counts.txt", sep="\t", quote=F, col.names=NA)
     write.xlsx(normalized_counts, file = "normalized_counts.xlsx", rowNames = TRUE)
    
     dds<-DESeqDataSetFromMatrix(countData=d.raw, colData=cData, design=~replicates)
    
     dds$replicates <- relevel(dds$replicates, "MKL-1_wt_cells")
     dds = DESeq(dds, betaPrior=FALSE)  #default betaPrior is FALSE
     resultsNames(dds)
     clist <- c("MKL.1_wt_EV_vs_MKL.1_wt_cells")
    
     #NOTE: the results sent to Ute used padj<=0.1 (the loop below uses padj<=0.05).
     for (i in clist) {
         contrast = paste("replicates", i, sep="_")
         res = results(dds, name=contrast)
         res <- res[!is.na(res$log2FoldChange),]
         #https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#why-are-some-p-values-set-to-na
         res$padj <- ifelse(is.na(res$padj), 1, res$padj)
         res_df <- as.data.frame(res)
         write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(i, "all.txt", sep="-"))
         up <- subset(res_df, padj<=0.05 & log2FoldChange>=2)
         down <- subset(res_df, padj<=0.05 & log2FoldChange<=-2)
         write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(i, "up.txt", sep="-"))
         write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(i, "down.txt", sep="-"))
     }
    
     ~/Tools/csv2xls-0.4/csv_to_xls.py \
     MKL.1_wt_EV_vs_MKL.1_wt_cells-all.txt \
     MKL.1_wt_EV_vs_MKL.1_wt_cells-up.txt \
     MKL.1_wt_EV_vs_MKL.1_wt_cells-down.txt \
     -d$',' -o MKL.1_wt_EV_vs_MKL.1_wt_cells.xls;
    
     # ------------------- volcano_plot -------------------
     library(ggplot2)
     library(ggrepel)
    
     geness_res <- read.csv(file = paste("MKL.1_wt_EV_vs_MKL.1_wt_cells", "all.txt", sep="-"), row.names=1)
    
     external_gene_name <- rownames(geness_res)
     geness_res <- cbind(geness_res, external_gene_name)
     #top_g are from ids
    
     top_g <- c("hsa-miR-203a-3p","hsa-miR-6850-5p","hsa-miR-4511","hsa-miR-5187-5p","hsa-miR-133b","hsa-miR-1246","hsa-miR-625-3p","hsa-miR-6741-3p","hsa-miR-192-5p","hsa-miR-10b-5p","hsa-miR-885-5p","hsa-miR-30e-3p","hsa-miR-101-3p","hsa-miR-1307-5p","hsa-miR-95-3p","hsa-miR-889-3p","hsa-miR-206","hsa-miR-301a-3p","hsa-miR-1-3p","hsa-let-7c-5p","hsa-miR-196a-5p","hsa-let-7f-5p","hsa-let-7e-5p","hsa-miR-30c-5p","hsa-miR-30a-3p","hsa-miR-146b-5p","hsa-miR-25-3p","hsa-miR-182-5p","hsa-miR-98-5p","hsa-let-7a-5p","hsa-miR-149-5p","hsa-miR-148a-3p","hsa-miR-873-3p","hsa-miR-19b-3p","hsa-miR-320c","hsa-miR-375","hsa-miR-30a-5p","hsa-miR-877-5p","hsa-miR-34a-5p","hsa-miR-324-5p","hsa-miR-652-3p","hsa-miR-342-5p","hsa-miR-7706","hsa-miR-361-3p","hsa-miR-361-5p","hsa-miR-1180-3p","hsa-miR-217","hsa-miR-1307-3p","hsa-miR-1908-5p","hsa-miR-15b-5p","hsa-miR-92b-5p","hsa-miR-484","hsa-miR-197-3p","hsa-miR-200c-3p","hsa-miR-671-5p","hsa-miR-339-5p","hsa-miR-1301-3p","hsa-miR-769-5p","hsa-miR-328-3p","hsa-miR-93-5p","hsa-miR-103a-3p")
     subset(geness_res, external_gene_name %in% top_g & pvalue < 0.05 & (abs(geness_res$log2FoldChange) >= 2.0))
     geness_res$Color <- "NS or log2FC < 2.0"
     geness_res$Color[geness_res$pvalue < 0.05] <- "P < 0.05"
     geness_res$Color[geness_res$padj < 0.05] <- "P-adj < 0.05"
     geness_res$Color[abs(geness_res$log2FoldChange) < 2.0] <- "NS or log2FC < 2.0"
    
     write.csv(geness_res, "MKL.1_wt_EV_vs_MKL.1_wt_cells_with_Category.csv")
     geness_res$invert_P <- (-log10(geness_res$pvalue)) * sign(geness_res$log2FoldChange)
    
     geness_res <- geness_res[, -1*ncol(geness_res)]
     png("MKL.1_wt_EV_vs_MKL.1_wt_cells.png",width=1200, height=1400)
     #svg("MKL.1_wt_EV_vs_MKL.1_wt_cells.svg",width=12, height=14)
     ggplot(geness_res,       aes(x = log2FoldChange, y = -log10(pvalue),           color = Color, label = external_gene_name)) +       geom_vline(xintercept = c(2.0, -2.0), lty = "dashed") +       geom_hline(yintercept = -log10(0.05), lty = "dashed") +       geom_point() +       labs(x = "log2(FC)", y = "Significance, -log10(P)", color = "Significance") +       scale_color_manual(values = c("P < 0.05"="orange","P-adj < 0.05"="red","NS or log2FC < 2.0"="darkgray"),guide = guide_legend(override.aes = list(size = 4))) + scale_y_continuous(expand = expansion(mult = c(0,0.05))) +       geom_text_repel(data = subset(geness_res, external_gene_name %in% top_g & pvalue < 0.05 & (abs(geness_res$log2FoldChange) >= 2.0)), size = 4, point.padding = 0.15, color = "black", min.segment.length = .1, box.padding = .2, lwd = 2) +       theme_bw(base_size = 16) +       theme(legend.position = "bottom")
     dev.off()
    
     # ----------------------------------------
     # ----------- manhattan_plot -------------
    
     Rscript manhattan_plot_Carmen_custom_labels.R  #exceRpt_miRNA_ReadCounts.txt
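
The colour categories of the two volcano plots are assigned by four successive overwrites in R. As a reading aid, the same decision order can be written as one function. This is a sketch under the assumption that `pvalue`, `padj` and `log2FoldChange` contain no missing values; the function name `volcano_category` is hypothetical, and the labels are kept fixed to match the plot legend.

```python
def volcano_category(pvalue, padj, log2fc, alpha=0.05, lfc_cutoff=2.0):
    """Return the legend label a miRNA would get in the volcano plot.

    Mirrors the R assignments: the fold-change filter is applied last in R
    and therefore wins over both significance labels.
    """
    if abs(log2fc) < lfc_cutoff:
        return "NS or log2FC < 2.0"
    if padj < alpha:
        return "P-adj < 0.05"
    if pvalue < alpha:
        return "P < 0.05"
    return "NS or log2FC < 2.0"


if __name__ == "__main__":
    # Invented example values, for illustration only.
    examples = [(0.01, 0.20, 3.0), (0.001, 0.01, -2.5), (0.001, 0.01, 1.0)]
    for p, q, lfc in examples:
        print(p, q, lfc, "->", volcano_category(p, q, lfc))
```

Writing the rule this way makes it explicit that a miRNA with a small adjusted p-value is still drawn in gray when its absolute log2 fold change is below 2.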

Comparing https://chat.z.ai/, https://chat.qwen.ai/, and https://chat.deepseek.com/ to explain Maaslin2

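Maaslin2 explained

What is Maaslin2?

Maaslin2 (Microbiome Multivariable Association with Linear Models) is an R package for multivariable association analysis of microbiome data.

Main uses

  • Finding associations between microbial features (e.g. bacterial abundances) and metadata (e.g. age, diet, disease status)
  • Suitable for microbiome data such as 16S rRNA sequencing and metagenomics

Core features

1. Statistical methods

  • Uses linear models and generalized linear models
  • Supports fixed effects and random effects
  • Can handle repeated-measures data

2. Data normalization

  • Automatically accounts for the compositional nature of microbiome data
  • Offers several normalization methods, including TSS, CSS and CLR

3. Multivariable analysis

  • Can adjust for several covariates at the same time
  • Controls for confounding factors

Basic workflow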

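# Load the package (install it once beforehand, e.g. with BiocManager::install("Maaslin2"))
library(Maaslin2)

# Run the analysis
results <- Maaslin2(
  input_data = microbiome_data,    # microbial abundance table
  input_metadata = metadata,       # metadata
  output = "results",              # output directory
  fixed_effects = c("age", "diet"), # fixed effects
  random_effects = c("subject_id"), # random effects
  normalization = "CLR",           # normalization method
  transformation = "LOG"           # data transformation (CLR values can be negative, so "NONE" may be the safer choice here)
)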

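Output

  • Table of significant associations: includes p-values, q-values and effect sizes
  • Visualizations: heatmaps, box plots, etc.
  • Summary statistics

Advantages

  1. User-friendly: simple interface with sensible defaults
  2. Flexible: supports several normalization and transformation methods
  3. Statistically rigorous: multiple-testing correction (FDR)
  4. Reproducible: results are easy to reproduce

Application scenarios

  • Studies of associations between disease and the gut microbiota
  • Effects of dietary interventions on the microbiome
  • Changes in the microbiota before and after drug treatment
  • Relationships between environmental factors and the microbiome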
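
To make two of the normalization methods named above concrete, here is a minimal Python sketch of TSS (total sum scaling) and CLR (centered log-ratio) for one sample. The pseudocount of 1 added before taking logarithms is an assumption made here to handle zero counts, and the function names are hypothetical. It illustrates the transformations themselves, not Maaslin2's internal code.

```python
import math


def tss(counts):
    """Total sum scaling: convert counts to relative abundances."""
    total = sum(counts)
    if total == 0:
        raise ValueError("sample has no counts")
    return [c / total for c in counts]


def clr(counts, pseudocount=1.0):
    """Centered log-ratio: log counts minus their mean log value.

    The pseudocount (an assumption of this sketch) avoids log(0).
    """
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [x - mean_log for x in logs]


if __name__ == "__main__":
    sample = [1, 1, 2]  # invented counts for three taxa
    print(tss(sample))  # relative abundances summing to 1
    print(clr(sample))  # values centered on 0, some of them negative
```

CLR values are centered on zero, so some of them are negative, which is worth keeping in mind when combining CLR with a further log transformation.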


Here’s Python code to read a file and output to console:

# Method 1: Read entire file at once
with open('filename.txt', 'r', encoding='utf-8') as file:
    content = file.read()
    print(content)

# Method 2: Read line by line
with open('filename.txt', 'r', encoding='utf-8') as file:
    for line in file:
        print(line.strip())  # strip() removes leading/trailing whitespace, including the newline

# Method 3: Read all lines into a list
with open('filename.txt', 'r', encoding='utf-8') as file:
    lines = file.readlines()
    for line in lines:
        print(line.strip())

# With error handling
try:
    with open('filename.txt', 'r', encoding='utf-8') as file:
        content = file.read()
        print(content)
except FileNotFoundError:
    print("File not found!")
except Exception as e:
    print(f"An error occurred: {e}")

Key points:

  • Use a with statement so the file is closed automatically
  • Specify encoding='utf-8' to handle special characters
  • strip() removes leading and trailing whitespace, including the trailing newline; use rstrip('\n') to remove only the newline
  • Include error handling for robust code

Replace 'filename.txt' with your actual file path.

Processing DNAseq for Data_Tam_DNAseq_2026_2605_2617_2631_2914_Acinetobacter_sp

  1. Run nextflow bacass

     conda deactivate
    
     # Download k2_standard_08_GB_20251015.tar.gz from https://benlangmead.github.io/aws-indexes/k2#kraken2--bracken
     # Download 20190108_kmerfinder_stable_dirs.tar.gz from https://zenodo.org/records/13447056; 'tar xzf 20190108_kmerfinder_stable_dirs.tar.gz'  #The database does not work!
     # Download the kmerfinder database: https://www.genomicepidemiology.org/services/ --> https://cge.food.dtu.dk/services/KmerFinder/ --> https://cge.food.dtu.dk/services/KmerFinder/etc/kmerfinder_db.tar.gz  #The database works!
    
     # DEBUG: --kmerfinderdb /mnt/nvme1n1p1/REFs/kmerfinder/bacteria/ not working!
    
     nextflow run nf-core/bacass -r 2.6.0 -profile docker --help
    
     # -- Hybrid assembly --
     nextflow run nf-core/bacass -r 2.6.0 -profile docker \
       --input samplesheet_bacass.tsv \
       --outdir bacass_out \
       --assembly_type hybrid \
       --assembler unicycler,dragonflye \
       --kraken2db /mnt/nvme1n1p1/REFs/k2_standard_08_GB_20251015.tar.gz \
       --skip_kmerfinder \
       -resume \
       -work-dir bacass_out/work
    
     # -- Short assembly --
     # The failure may be a bug related to '--skip_kmerfinder' in -r 2.6.0; the KmerFinder database is therefore used with -r 2.5.0
     nextflow run nf-core/bacass -r 2.5.0 -profile docker \
       --input samplesheet.tsv \
       --outdir bacass_out \
       --assembly_type short \
       --kraken2db /mnt/nvme1n1p1/REFs/k2_standard_08_GB_20251015.tar.gz \
       --kmerfinderdb /mnt/nvme1n1p1/REFs/kmerfinder/bacteria/ \
       -resume \
       -work-dir bacass_out/work
  2. Verify if the genome is pure

     # 1. Go up one level to the main 'bacass_out' directory
     cd ..
    
     # 2. Create directories for CheckM inputs and outputs
     mkdir -p checkm_input checkm_output
    
     # 3. Copy all .fna files into the 'checkm_input' folder
     # (CheckM cannot search subdirectories, so they must be in one folder)
     find ./Prokka -name "*.fna" -exec cp {} checkm_input/ \;
    
     # 4. Run CheckM on all 4 assemblies
     checkm lineage_wf -x fna checkm_input checkm_output
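
Once CheckM has finished, each assembly can be screened with a simple rule. The sketch below is illustrative only: the cut-offs (contamination above 5 %, completeness below 90 %) are assumed defaults in the spirit of common genome-quality guidelines and should be adapted, the function name `assess_bin` is hypothetical, and the example numbers are the CheckM values quoted in the report in step 5.

```python
def assess_bin(completeness, contamination,
               max_contamination=5.0, min_completeness=90.0):
    """Classify one CheckM bin with assumed, adjustable thresholds."""
    if contamination > max_contamination:
        return "possibly mixed"
    if completeness < min_completeness:
        return "incomplete"
    return "looks pure"


if __name__ == "__main__":
    # (completeness, contamination) as listed in the report in step 5.
    checkm_values = {
        "2631_": (100.00, 100.00),
        "2617_": (100.00, 100.00),
        "2605_": (100.00, 0.00),
        "2914_": (99.98, 0.63),
    }
    for bin_id, (comp, cont) in checkm_values.items():
        print(bin_id, assess_bin(comp, cont))
```

A flag from this rule is only a prompt to look closer, for example at assembly size and strain heterogeneity, before deciding whether a culture was mixed.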
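  3. Species identification: use Mash for quick screening → GTDB-Tk for precise classification → FastANI for species-level verification. Combining the three maximizes the accuracy and interpretability of the species identification.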

     # 1. Create the environment (mamba recommended)
     mamba create -n gtdbtk -c conda-forge -c bioconda gtdbtk
     mamba activate gtdbtk
    
     # 2. Download the database (first time only, about 60 GB)
     gtdbtk download --data_dir ./gtdb_data --release 220
    
     wget https://data.gtdb.aau.ecogenomic.org/releases/release232/232.0/auxillary_files/gtdbtk_package/full_package/gtdbtk_r232_data.tar.g
     mamba env config vars set GTDBTK_DATA_PATH="/mnt/nvme4n1p1/gtdb_data/release232"
     # Deactivate the current environment first, then reactivate it
     mamba deactivate
     mamba activate gtdbtk
    
     # Check that the environment variable was loaded
     echo $GTDBTK_DATA_PATH
     # Expected output: /mnt/nvme4n1p1/gtdb_data/release232
    
     # 3. Run the classification (the supplied command plus some practical options)
     gtdbtk classify_wf \
       --genome_dir ./checkm_input \
       --out_dir gtdb_out \
       --cpus 64 \
       --extension .fna \
       --prefix mygenome
    
     # 4. Inspect the results
     cat gtdb_out/classify/mygenome.bac120.summary.tsv   # bacterial results
  4. Antimicrobial resistance gene, resistome and virulence profiling with ABRicate and RGI (Resistance Gene Identifier)

     conda activate /home/jhuang/miniconda3/envs/bengal3_ac3
     abricate --list
    
     conda deactivate
    
     ENV_NAME=/home/jhuang/miniconda3/envs/bengal3_ac3 \
     ASM=bacass_out/checkm_input/2914_.fna \
     SAMPLE=2914 \
     OUTDIR=resistome_virulence_2914 \
     MINID=80 MINCOV=60 \
     THREADS=32 \
     ~/Scripts/run_abricate_resistome_virulome_one_per_gene.sh
    
     #ABRicate thresholds: MINID=80 MINCOV=60
     Database        Hit_lines       File
     MEGARes 24      resistome_virulence_2605/raw/2605.megares.tab
     CARD    21      resistome_virulence_2605/raw/2605.card.tab
     ResFinder       4       resistome_virulence_2605/raw/2605.resfinder.tab
     VFDB    0       resistome_virulence_2605/raw/2605.vfdb.tab
    
     # Database        Hit_lines       File
     # MEGARes 42      resistome_virulence_2631/raw/2631.megares.tab
     # CARD    37      resistome_virulence_2631/raw/2631.card.tab
     # ResFinder       16      resistome_virulence_2631/raw/2631.resfinder.tab
     # VFDB    0       resistome_virulence_2631/raw/2631.vfdb.tab
    
     Database        Hit_lines       File
     MEGARes 35      resistome_virulence_2914/raw/2914.megares.tab
     CARD    31      resistome_virulence_2914/raw/2914.card.tab
     ResFinder       11      resistome_virulence_2914/raw/2914.resfinder.tab
     VFDB    0       resistome_virulence_2914/raw/2914.vfdb.tab
    
     # #ABRicate thresholds: MINID=70 MINCOV=50
     # Database        Hit_lines       File
     # MEGARes 24      resistome_virulence_2605/raw/2605.megares.tab
     # CARD    21      resistome_virulence_2605/raw/2605.card.tab
     # ResFinder       4       resistome_virulence_2605/raw/2605.resfinder.tab
     # VFDB    3       resistome_virulence_2605/raw/2605.vfdb.tab
    
     conda activate /home/jhuang/miniconda3/envs/bengal3_ac3
     #NEED_TO_ADAPT: OUTDIR = Path("resistome_virulence_An7")
     #NEED_TO_ADAPT: SAMPLE = "An7"
     #DEPRECATED_DUE_TO_NEED_MANUAL_SETTING: python ~/Scripts/merge_amr_sources_by_gene.py
    
     python ~/Scripts/export_resistome_virulence_to_excel_py36.py \
       --workdir resistome_virulence_2914 \
       --sample 2914 \
       --out Resistome_Virulence_2914.xlsx
     # Delete the column 'COVERAGE_MAP' in all 'Raw_*' sheets
  5. Report

     Please find below a summary of genomic analyses for samples 2605, 2617, 2631 and 2914.
    
     ### 1. Assembly and checkM
    
             ------------------------------------------------------------------
             Bin Id    Completeness    Contamination    Strain heterogeneity
             ------------------------------------------------------------------
             2631_     100.00          100.00           78.57
             2617_     100.00          100.00           78.57
             2605_     100.00            0.00            0.00
             2914_      99.98            0.63            0.00
             ------------------------------------------------------------------
    
             The CheckM results show that the assemblies of 2631_ and 2617_ are both 7.0-7.1 Mb with a contamination of 100.00, meaning every marker gene is present in two copies. This pattern suggests that each DNA sample contained two closely related strains of the same species from a non-clonal culture. If the true genome size is about 3.7 Mb and the assembler could not merge the two highly similar strains, it would assemble both side by side, giving a ~7.0 Mb assembly in which every gene is duplicated.
             The assemblies 2605_.fna (3.7 Mb) and 2914_.fna (about 3.9 Mb) appear to be pure isolates.
    
             ### 2. Species Identification
    
             **Sample 2605_:** *Acinetobacter baumannii* ✅ Confirmed
    
             | Parameter | Value | Interpretation |
             |---|---|---|
             | Closest Reference | GCF_009759685.1 | Reference genome of *A. baumannii* |
             | ANI | 98.02% | ✅ Well above 95% species threshold |
             | AF (Alignment Fraction) | 0.874 | ✅ 87.4% of genome aligns; ANI estimate is robust |
             | Final Taxonomy | `d__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria;o__Pseudomonadales;f__Moraxellaceae;g__Acinetobacter;s__Acinetobacter baumannii` | Consistent with genomic expectations |
    
             🟢 **Conclusion:** 2605_ is confidently assigned to *Acinetobacter baumannii*.
    
             ***
    
             **Sample 2617_:** *Acinetobacter baumannii* ✅ Confirmed
    
             | Parameter | Value | Interpretation |
             |---|---|---|
             | Closest Reference | GCF_009759685.1 | Reference genome of *A. baumannii* |
             | ANI | 98.00% | ✅ Well above 95% species threshold |
             | AF (Alignment Fraction) | 0.859 | ✅ 85.9% of genome aligns; ANI estimate is robust |
             | Final Taxonomy | `d__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria;o__Pseudomonadales;f__Moraxellaceae;g__Acinetobacter;s__Acinetobacter baumannii` | Consistent with genomic expectations |
    
             🟢 **Conclusion:** 2617_ is confidently assigned to *Acinetobacter baumannii*.
    
             ***
    
             **Sample 2631_:** *Acinetobacter baumannii* ✅ Confirmed
    
             | Parameter | Value | Interpretation |
             |---|---|---|
             | Closest Reference | GCF_009759685.1 | Reference genome of *A. baumannii* |
             | ANI | 98.07% | ✅ Well above 95% species threshold |
             | AF (Alignment Fraction) | 0.860 | ✅ 86.0% of genome aligns; ANI estimate is robust |
             | Final Taxonomy | `d__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria;o__Pseudomonadales;f__Moraxellaceae;g__Acinetobacter;s__Acinetobacter baumannii` | Consistent with genomic expectations |
    
             🟢 **Conclusion:** 2631_ is confidently assigned to *Acinetobacter baumannii*.
    
             ***
    
             **Sample 2914_:** *Acinetobacter baumannii* ✅ Confirmed
    
             | Parameter | Value | Interpretation |
             |---|---|---|
             | Closest Reference | GCF_009759685.1 | Reference genome of *A. baumannii* |
             | ANI | 98.11% | ✅ Well above 95% species threshold |
             | AF (Alignment Fraction) | 0.873 | ✅ 87.3% of genome aligns; ANI estimate is robust |
             | Final Taxonomy | `d__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria;o__Pseudomonadales;f__Moraxellaceae;g__Acinetobacter;s__Acinetobacter baumannii` | Consistent with genomic expectations |
    
             🟢 **Conclusion:** 2914_ is confidently assigned to *Acinetobacter baumannii*.
    
     ### 3. AMR Genes and Virulence Factors (VFDB) Summary

     Since 2631_ and 2617_ are not pure isolates but mixtures of two strains, I excluded both samples from the AMR and VFDB analysis. For the remaining samples, see Resistome_Virulence_2605.xlsx and Resistome_Virulence_2914.xlsx.


Since CheckM requires all the genome files (bins) to be in the same directory, you first need to gather your .fna files into a single folder before running the command.

Here are the exact bash commands to prepare your files and run CheckM for all four samples (2605, 2617, 2631, and 2914).

Step 1: Prepare the Input and Run CheckM

Run the following commands in your terminal:

# 1. Go up one level to the main 'bacass_out' directory
cd ..

# 2. Create directories for CheckM inputs and outputs
mkdir -p checkm_input checkm_output

# 3. Copy all .fna files into the 'checkm_input' folder
# (CheckM cannot search subdirectories, so they must be in one folder)
find ./Prokka -name "*.fna" -exec cp {} checkm_input/ \;

# 4. Run CheckM on all 4 assemblies
checkm lineage_wf -x fna checkm_input checkm_output

(Note: If you get an error saying the CheckM database is not found, you need to set the path to your CheckM data folder first using: checkm data setRoot /path/to/your/checkm_data)

Step 2: How to Interpret the Results

CheckM will print a summary table directly to your terminal, and it will also save detailed statistics in checkm_output/bin_stats_ext.tsv.

Since you are working with Acinetobacter sp. (which typically has a genome size of ~3.5 to 4.5 Mb), a 7.0 Mb assembly is almost certainly a mix of two closely related strains that the assembler failed to collapse.

Look specifically at these three columns in the CheckM output for your 2631_ sample:

  1. Completeness: This should be high (ideally > 95%). Because the assembler built two genomes, CheckM will likely still find all the marker genes, so completeness might look deceptively high.
  2. Contamination: This is the most critical metric. For a pure isolate, this should be < 5%. If your 2631 assembly is a mixed strain, CheckM will detect the duplicated marker genes and flag this as high contamination (often > 20-50%).
  3. Strain heterogeneity: This column specifically calculates if there are multiple strains of the same species present. A high value here confirms that your assembly contains a mixture of closely related strains.
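
The three columns above can be turned into a rough triage rule. The following is a minimal Python sketch, assuming the rule-of-thumb thresholds from the text (completeness > 95%, contamination < 5%); the function name `classify_bin` and the 50% strain-heterogeneity cutoff are hypothetical choices for illustration, not CheckM defaults. The input values are copied from the CheckM summary in these notes.

```python
# Hypothetical triage helper; thresholds are rules of thumb, not CheckM defaults.
def classify_bin(completeness, contamination, strain_het):
    """Return a coarse label for one CheckM row (all inputs in percent)."""
    if completeness < 95.0:
        return "incomplete"
    if contamination < 5.0:
        return "likely pure isolate"
    # High contamination plus high strain heterogeneity suggests that the
    # duplicated markers come from closely related strains (assumed cutoff: 50).
    if strain_het >= 50.0:
        return "likely mixed strains of one species"
    return "likely contaminated with other taxa"

# (completeness, contamination, strain heterogeneity) from the CheckM table.
checkm_rows = {
    "2631_": (100.00, 100.00, 78.57),
    "2617_": (100.00, 100.00, 78.57),
    "2605_": (100.00, 0.00, 0.00),
    "2914_": (99.98, 0.63, 0.00),
}

labels = {name: classify_bin(*values) for name, values in checkm_rows.items()}
for name, label in labels.items():
    print(name, label)
```

Such a rule only summarises the table; a mixed-strain call should still be checked against read coverage before acting on it.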

What to do next?

  • If 2631 has high contamination/strain heterogeneity: You have confirmed it is a mixed assembly. You can either use a tool like purge_dups to separate the two strains, or map your raw reads back to the assembly and filter out the contigs with half the average coverage.
  • If 2631 has low contamination: The 7.0 Mb size might be real (some Acinetobacter strains have large plasmids or are naturally larger), but you should still verify this using a BlobPlot (GC vs. Coverage) to be absolutely sure there is no foreign DNA.
  • Check the other 3 samples: Compare the CheckM results of 2605, 2617, and 2914 to see if they are pure, high-quality genomes (high completeness, low contamination).

            ------------------------------------------------------------------------------------------------------------------------------------------------------------------
            Bin Id         Marker lineage         # genomes   # markers   # marker sets   0    1     2    3   4   5+   Completeness   Contamination   Strain heterogeneity
            ------------------------------------------------------------------------------------------------------------------------------------------------------------------
            2631_           root (UID1)              5656         56            24        0    0     56   0   0   0       100.00          100.00             78.57        
            2617_           root (UID1)              5656         56            24        0    0     56   0   0   0       100.00          100.00             78.57        
            2605_    f__Moraxellaceae (UID4680)       86         689           365        0   689    0    0   0   0       100.00           0.00               0.00        
            2914_    g__Acinetobacter (UID4685)       42         1193          254        1   1188   4    0   0   0       99.98            0.63               0.00        
            ----------------------------------------------------------------------------------------------------------------------------------------------------------------

Ah, I understand now! Thank you for clarifying. The CheckM results you showed earlier (with 100% contamination and 56 markers in the ‘2’ column) were indeed from this Unicycler assembly.

Looking at your Unicycler log, we can see exactly why it didn’t fix the problem: Total Length: 7,199,947 bp (~7.2 Mb) Main Chromosome: Component 1 consists of 416 fragmented contigs totaling 7,165,229 bp.

Why Unicycler Failed to Collapse the Genome

The log states: “Since you provided only short reads, Unicycler will essentially function as a SPAdes-optimiser.”

Because you only have short reads (Illumina), Unicycler builds a de Bruijn graph. When a bacterial sample contains a mixture of two closely related strains (a very common scenario in non-clonal isolates), the assembler sees the slight genetic differences between the two strains as “bubbles” in the graph.

Without long reads to span across these bubbles and prove which path belongs to which strain, the software conservatively outputs both paths. This results in an assembly that contains both strains side-by-side, doubling your genome size to ~7.2 Mb and causing the 100% duplication flagged by CheckM.
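
To make the "bubble" idea concrete, here is a small Python sketch that builds a toy de Bruijn graph from two made-up sequences differing by a single SNP. The sequences, the k-mer size, and the function name `de_bruijn_edges` are hypothetical; real assemblers use far larger k, handle reverse complements, and work from reads rather than whole sequences.

```python
from collections import defaultdict

def de_bruijn_edges(sequences, k):
    """Map each (k-1)-mer node to the set of (k-1)-mer nodes that follow it."""
    graph = defaultdict(set)
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

# Two toy "strains" that differ by one SNP (G -> C at 0-based index 7).
strain_a = "GATTACAGGCTTCAG"
strain_b = "GATTACACGCTTCAG"

graph = de_bruijn_edges([strain_a, strain_b], k=5)

# A bubble opens at a node with two successors ...
branch_nodes = sorted(node for node, succ in graph.items() if len(succ) > 1)

# ... and closes at a node with two predecessors.
indegree = defaultdict(int)
for node, successors in graph.items():
    for nxt in successors:
        indegree[nxt] += 1
merge_nodes = sorted(node for node, deg in indegree.items() if deg > 1)

print("bubble opens at:", branch_nodes)
print("bubble closes at:", merge_nodes)
```

Between the opening and closing nodes the graph holds two parallel paths, one per strain; with short reads alone there is no evidence linking one bubble's path to the next bubble's path.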

How to Fix This (Since Re-assembling Won’t Work)

Running another short-read assembler will most likely produce a similar ~7.2 Mb duplicated assembly, because the reads themselves still contain both strains. One option is to computationally “purge” the duplicate contigs.

Option 1: The Quick Fix using BBMap (dedupe.sh)

If you have the BBMap suite installed, it has a fantastic tool specifically for removing duplicate sequences. It will compare all your contigs to each other and remove the redundant ones.

dedupe.sh in=assembly.fasta out=assembly_purged.fasta minidentity=95
  • minidentity=95 tells it to remove contigs that are 95% identical or more to another contig. You can adjust this (e.g., 90 or 98) depending on how closely related the two strains are.
  • After running this, run CheckM again on assembly_purged.fasta. If the deduplication worked, the genome size should drop to roughly 3.5 Mb and the contamination should fall to a few percent or less.

Option 2: The Thorough Fix using purge_dups

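purge_dups is a widely used tool for separating primary contigs from duplicated “haplotigs” in an assembly. It works by aligning your assembly to itself and using read depth to separate the true genome from the duplicates. The steps below follow the pipeline described in the purge_dups README; file names are placeholders.

  1. Map the reads to the assembly and compute depth: minimap2 -xmap-pb assembly.fasta reads.fastq.gz | gzip -c - > reads.paf.gz, then pbcstat reads.paf.gz (this writes PB.base.cov and PB.stat). This is the long-read route; for short reads only, the README describes mapping with bwa mem and running ngscstat on the BAM files instead.
  2. Calculate coverage cutoffs: calcuts PB.stat > cutoffs 2>calcults.log
  3. Split the assembly: split_fa assembly.fasta > assembly.split.fa
  4. Align to itself: minimap2 -xasm5 -DP assembly.split.fa assembly.split.fa | gzip -c - > assembly.split.self.paf.gz
  5. Purge: purge_dups -2 -T cutoffs -c PB.base.cov assembly.split.self.paf.gz > dups.bed 2> purge_dups.log
  6. Extract the purged genome: get_seqs -e dups.bed assembly.fasta (this writes purged.fa, which should be close to ~3.5 Mb if purging worked, and hap.fa with the removed duplicates).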

Option 3: Do you have Long Reads?

If you happen to have Oxford Nanopore (ONT) or PacBio reads for this sample, a hybrid assembly may resolve this more cleanly than post hoc purging. Running Unicycler with long reads (unicycler -1 short_R1.fastq -2 short_R2.fastq -l long_reads.fastq) lets the assembler span the “bubbles”, which can help it resolve the chromosome into a single circular ~3.5 Mb contig. If the culture really contains two strains, however, long reads will not make the sample clonal.


Bonus Observation: You likely have 3 plasmids!

Looking at the end of your log, Unicycler successfully closed 3 circular contigs:

  • 8,731 bp (7.89x coverage)
  • 5,837 bp (5.27x coverage)
  • 2,120 bp (7.74x coverage)

These are almost certainly plasmids. The massive ~7.2 Mb “Component 1” is your duplicated chromosome. When you use purge_dups or dedupe.sh to fix the chromosomal duplication, these plasmids should remain intact in your final assembly!



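Yes, forced deduplication (purging duplicates / haplotype purging) can indeed introduce assembly errors or biases.

In eukaryotic assemblies (such as plants and animals), separating haplotypes is routine. In bacterial (haploid) genomes, however, forcing a deduplication tool to “fix” a 7.2 Mb assembly is a double-edged sword.

Below are the specific errors and risks this operation can introduce, along with the scientific reasons behind them:

1. Deleting genuine biological repeats (over-purging / false-negative loss)

Bacterial genomes naturally contain many genuine repeats, for example:

  • rRNA operons (often 5-10 highly similar copies).
  • IS elements (insertion sequences) and transposons (very active in bacteria such as Acinetobacter, often present in multiple copies).
  • Gene families or paralogs.

Risk: If you use BBMap (dedupe.sh) with a high identity threshold (for example minidentity=95), the tool cannot distinguish “duplicates caused by assembly errors” from “repeats that naturally exist in the genome”. It may treat genuine, functionally important IS elements or rRNA copies (for example those linked to resistance or virulence) as “redundant haplotypes” and delete them, leaving your final genome without key genes.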

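2. Producing a chimeric genome (chimeric assembly)

Your CheckM result shows a perfect 1:1 duplication (all marker genes fall in the ‘2’ column). This usually points to one of two scenarios:

  • Scenario A (assembler artifact): The sample is pure, but SPAdes/Unicycler assembled the same sequence twice because of complex local repeats or sequencing bias.
  • Scenario B (impure sample / mixed strains): Your culture contains two very closely related strains. Short reads cannot span the SNP/indel differences between them (the bubbles in the de Bruijn graph), so the assembler kept both genome copies.

Risk: In scenario B, when the deduplication tool decides which copy to keep and which to discard, it may switch back and forth between the sequences of the two strains. What you end up with is not a real single genome but a “Frankenstein” chimera that does not exist in nature.

  • Consequence: Such a chimera can seriously affect downstream SNP calling (many false-positive variants), phylogeny reconstruction, and the accurate placement of antimicrobial resistance (AMR) genes.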

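3. Breaking genome contiguity (structural breaks)

Risk: While stripping out the “secondary copy”, the deduplication algorithm may introduce artificial breaks in contigs that were originally continuous. This fragments the assembly (more contigs, lower N50), and a circular chromosome that could otherwise have been closed may be split into several linear pieces.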


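How to handle 2631 and 2617 safely (best practice)

To avoid introducing the errors above, do not run a deduplication tool blindly. The following steps are recommended for diagnosis and handling:

Step 1: Map the raw reads back and inspect the coverage distribution

Align your raw short reads (clean reads) back to the 7.2 Mb assembly (using bwa or bowtie2), then compute the mean coverage of each contig.

  • If all contigs have very uniform coverage (for example all around 100x): this points to scenario A (assembler artifact), although two strains present at equal abundance would look the same. The genome is probably pure and was merely assembled twice. Deduplication is relatively safe in this case, but still requires care.
  • If contig coverage shows two clear peaks (for example half at 100x and half at 50x): this supports scenario B (mixed strains).
  • If coverage is highly erratic: more complex contamination may be present.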
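
As a rough illustration of the "one peak versus two peaks" check, the Python sketch below looks for the largest relative jump between sorted per-contig depths. It is a toy heuristic under stated assumptions, not a validated method: the function name `split_by_largest_gap`, the 1.5-fold cutoff, and all depth values are hypothetical. In practice the depths would come from a read-mapping tool, and contig length should be taken into account.

```python
def split_by_largest_gap(depths, min_ratio=1.5):
    """Split contigs into (low, high) depth groups at the largest relative jump.

    depths: dict of contig name -> mean depth.
    Returns None when no jump between neighbouring depths reaches min_ratio,
    i.e. when coverage looks unimodal under this toy criterion.
    """
    # Ignore zero-depth contigs so the ratio below is always defined.
    ordered = sorted(((n, d) for n, d in depths.items() if d > 0),
                     key=lambda item: item[1])
    best_ratio, best_idx = 1.0, None
    for i in range(len(ordered) - 1):
        ratio = ordered[i + 1][1] / ordered[i][1]
        if ratio > best_ratio:
            best_ratio, best_idx = ratio, i
    if best_idx is None or best_ratio < min_ratio:
        return None
    low = [name for name, _ in ordered[:best_idx + 1]]
    high = [name for name, _ in ordered[best_idx + 1:]]
    return low, high

# Hypothetical per-contig mean depths.
mixed = {"c1": 102.0, "c2": 98.0, "c3": 51.0, "c4": 49.0, "c5": 100.0}
uniform = {"c1": 101.0, "c2": 99.0, "c3": 100.0}

print(split_by_largest_gap(mixed))
print(split_by_largest_gap(uniform))
```

Small, high-copy contigs such as plasmids can also sit far from the chromosomal depth, so a split found this way is a prompt for closer inspection rather than proof of mixed strains.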

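Step 2: Choose a strategy based on the coverage result

  • If it is mixed strains (scenario B)
    • The cleanest biological solution: re-streak to isolate a single colony, then extract DNA and sequence again. This is the most reliable way to obtain a genuinely pure genome.
    • Computational solution: do not use dedupe.sh. Instead, separate the contigs by coverage (for example with a coverage-aware binning tool, or by inspecting them with BlobTools), using the difference between 100x and 50x coverage to split the contigs of the two strains into two separate genomes, and then run CheckM on each.
  • If it is an assembler artifact (scenario A)
    • You can try purge_dups, which is smarter than dedupe.sh because it combines self-alignment and coverage to distinguish “primary” sequences from “haplotigs”.
    • Key step: after running purge_dups, you must run CheckM and BUSCO again on the extracted “primary” sequences.
    • Acceptance criteria: if CheckM shows completeness > 95% and contamination < 5%, and the genome size has dropped to ~3.5 Mb, deduplication succeeded without deleting core genes. If completeness drops sharply, genuine genes were removed and you need to rerun purge_dups with more conservative parameters.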

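Step 3: Try re-assembling (alternative)

Sometimes Unicycler's default parameters do not suit particular Acinetobacter strains. You can try running SPAdes directly with a specific mode enabled:

spades.py --careful -1 2631__1.fastp.fastq.gz -2 2631__2.fastp.fastq.gz -o spades_2631_careful

The --careful option makes SPAdes try to correct mismatches and short indels after assembly, and its internal repeat resolution can sometimes collapse the duplicated regions that Unicycler retained. SPAdes does not accept --careful together with --isolate, so only one of the two modes can be used per run.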

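Summary

Forced deduplication does carry the risk of creating chimeras or deleting genuine repeats. For bacterial genomes, “computational patching” should be a last resort. The most rigorous approach is to confirm coverage by read mapping. If the sample contains mixed strains, separate them by binning or repeat the experiment; if it is assembly redundancy, use purge_dups and rely strictly on a second CheckM run to verify genome completeness.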

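How HUMAnN Computes Pathway Abundance (Data_Tam_Metagenomics_2026_Soil)

Overall workflow

HUMAnN (HMP Unified Metabolic Analysis Network) is the core tool in the bioBakery workflow for functional profiling of metagenomes[[12]]. Pathway abundance is computed in several successive steps:

Calculation steps:

  1. Gene family abundance → 2. Reaction abundance → 3. Pathway abundance

Detailed calculation principles

Step 1: Gene family abundance

Starting from the raw sequencing reads:

  • Reads are aligned to reference databases: a nucleotide search with Bowtie2 against pangenomes, followed by a translated search with DIAMOND against a protein database such as UniRef
  • Alignments are weighted by alignment quality, coverage, and sequence length
  • RPK (reads per kilobase) values are generated

Formula:

gene abundance = Σ(alignment weights) / gene length (kb)

Each read has a total weight of 1.0, which is distributed across its gene matches according to alignment quality[[9]].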


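Step 2: Reaction abundance

Each biochemical reaction is catalyzed by one or more genes:

reaction abundance = Σ(abundances of all genes catalyzing the reaction)

Step 3: Pathway abundance

This is the most critical step. A pathway contains multiple reactions, and the reactions relate to one another in different ways:

Core principle: pathway abundance is determined by the “weakest link”[[1]]

Calculation method:

  • Reactions in series (all must be present): harmonic mean
  • Reactions in parallel (alternative routes): maximum (max)
  • Optional reactions: counted only when their abundance exceeds the harmonic mean of the required reactions[[1]]

As a simplified rule of thumb: final pathway abundance ≈ abundance of the least abundant key reaction. The harmonic mean is never below this minimum and is pulled strongly toward it.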


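Worked example

Example scenario: the glycolysis pathway

Suppose the glycolysis pathway contains 5 key reactions (R1-R5):

glucose → R1 → G6P → R2 → F6P → R3 → ... → pyruvate

Gene-reaction relationships:

  • R1: catalyzed by genes GK1 and GK2 (redundant)
  • R2: catalyzed by gene PGI
  • R3: catalyzed by gene PFK
  • R4: catalyzed by gene ALDO
  • R5: catalyzed by gene GAPDH

Gene abundances obtained after sequencing (in RPK):

GK1:  8.0
GK2:  4.0
PGI:  10.0
PFK:  6.0
ALDO: 7.0
GAPDH: 5.0

Calculation steps:

① Compute reaction abundances:

R1 = GK1 + GK2 = 8.0 + 4.0 = 12.0  (redundant genes are summed)
R2 = PGI = 10.0
R3 = PFK = 6.0
R4 = ALDO = 7.0
R5 = GAPDH = 5.0

② Compute pathway abundance: Because glycolysis is a series of reactions (every step must be completed), apply the simplified “weakest link” rule:

pathway abundance = min(R1, R2, R3, R4, R5)
                  = min(12.0, 10.0, 6.0, 7.0, 5.0)
                  = 5.0 RPK

Interpretation: The glycolysis pathway abundance in this sample is 5.0 RPK, meaning the “weakest link” (R5/GAPDH) has an abundance of 5.0. Under this simplification, the community carries roughly 5.0 “complete copies” of the pathway[[1]].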
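
The worked example can be reproduced in a few lines of Python. This is a sketch of the arithmetic only, using the made-up gene abundances from the example; it compares the simplified minimum rule with the harmonic mean mentioned earlier for reactions in series, and it does not reproduce HUMAnN's actual implementation (gap filling, optional reactions, and pathway structure are ignored).

```python
from statistics import harmonic_mean

# Gene abundances (RPK) and gene-to-reaction mapping from the example above.
gene_rpk = {"GK1": 8.0, "GK2": 4.0, "PGI": 10.0,
            "PFK": 6.0, "ALDO": 7.0, "GAPDH": 5.0}
reaction_genes = {
    "R1": ["GK1", "GK2"],   # redundant genes: abundances are summed
    "R2": ["PGI"],
    "R3": ["PFK"],
    "R4": ["ALDO"],
    "R5": ["GAPDH"],
}

# Step 2: reaction abundance = sum over the genes catalyzing the reaction.
reaction_rpk = {rxn: sum(gene_rpk[g] for g in genes)
                for rxn, genes in reaction_genes.items()}

# Step 3a: simplified "weakest link" rule.
weakest_link = min(reaction_rpk.values())

# Step 3b: harmonic mean over the same reactions in series.
series_hmean = harmonic_mean(list(reaction_rpk.values()))

print(reaction_rpk)
print("min rule:", weakest_link)
print("harmonic mean:", round(series_hmean, 2))
```

With these numbers the minimum rule gives 5.0 RPK while the harmonic mean gives about 7.22 RPK, so the two rules agree on which reaction limits the pathway but not on the exact value.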


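Normalization

Why normalize?

Raw RPK values depend on sequencing depth and cannot be compared directly across samples[[1]].

Example:

  • Sample A: total reads = 10 million, pathway abundance = 5.0 RPK
  • Sample B: total reads = 20 million, pathway abundance = 5.0 RPK

Although both are 5.0 RPK, the relative abundance in sample A is higher!

Normalization method:

CPM (counts per million) or relative abundance:

CPM = (raw RPK / total RPK) × 1,000,000
relative abundance = raw RPK / total RPK

A file such as pathabundance_relab.tsv already contains normalized relative abundances[[11]]; it is typically produced from pathabundance.tsv with humann_renorm_table --units relab.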
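
The normalization formula can be checked with a short Python sketch. The function name `to_cpm` and the per-sample totals (100 and 200 RPK) are hypothetical and chosen only to mirror the two-sample example above; for real data the same rescaling is done by HUMAnN's own renormalization script.

```python
def to_cpm(rpk_by_feature):
    """Rescale RPK values so that they sum to one million within a sample."""
    total = sum(rpk_by_feature.values())
    if total == 0:
        raise ValueError("cannot normalize a sample with zero total abundance")
    return {name: value / total * 1_000_000
            for name, value in rpk_by_feature.items()}

# Hypothetical samples: same raw pathway abundance, different totals.
sample_a = {"GLYCOLYSIS": 5.0, "ALL_OTHER_PATHWAYS": 95.0}
sample_b = {"GLYCOLYSIS": 5.0, "ALL_OTHER_PATHWAYS": 195.0}

cpm_a = to_cpm(sample_a)
cpm_b = to_cpm(sample_b)

print(round(cpm_a["GLYCOLYSIS"]), round(cpm_b["GLYCOLYSIS"]))
```

The same 5.0 RPK corresponds to twice the share in the shallower sample, which is why only the normalized values are comparable across samples.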


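Output files

HUMAnN produces two key pathway files (alongside the gene family table):

| File | Meaning | Range |
|---|---|---|
| pathcoverage.tsv | Pathway coverage (presence/absence confidence) | 0-1 |
| pathabundance.tsv | Pathway abundance (relative copy number) | 0 and above |

Coverage vs abundance:

  • Coverage = 0.8: the pathway is present with about 80% confidence
  • Abundance = 5.0: the community carries about 5 complete copies of the pathway (before normalization)[[9]]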

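Stratification

HUMAnN can also compute each species' contribution to a pathway:

total pathway abundance ≈ contribution of species A + contribution of species B + unclassified contribution

Because the pathway calculation is non-linear, the stratified values need not sum exactly to the community total.

For example:

Total glycolysis pathway abundance: 5.0 RPK
├─ Escherichia coli:  3.0 RPK (60%)
├─ Bacteroides fragilis: 1.5 RPK (30%)
└─ Unclassified: 0.5 RPK (10%)

This helps you understand which microbes drive a particular function[[21]].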


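Key takeaways

  • Pathway abundance ≈ abundance of the weakest key reaction (the barrel principle)
  • Units: RPK (before normalization) or CPM / relative abundance (after normalization)
  • Cross-sample comparison: always use normalized values (pathabundance_relab.tsv)[[15]]
  • Interpretation: an abundance of 5.0 means the weakest link in the pathway has an abundance of 5.0 RPK
  • Accuracy: accounts for biological complexity such as gene redundancy, reaction relationships, and optional steps


References

  1. bioBakery Forum – Pathway abundance calculation [[1]]
  2. HUMAnN SOP – HMP Data Coordination Center [[9]]
  3. HUMAnN3 Documentation – Huttenhower Lab [[12]]
  4. bioBakery 3 publication (Nature Methods, 2021) [[13]]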


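HUMAnN Pathway Abundance: Calculation Method

Basic principle

HUMAnN (HMP Unified Metabolic Analysis Network) is part of the bioBakery tool suite and computes the abundance of functional pathways from metagenomic data.

Calculation steps

1. Gene family quantification

  • Sequencing reads are first aligned to the UniRef gene family database
  • The abundance of each gene family is computed (reads per kilobase, RPK)

2. Pathway mapping

  • Gene families are mapped to MetaCyc pathways
  • A pathway usually contains several reaction steps, and each step may be catalyzed by several gene families

3. Pathway abundance calculation

Key formula (a simplified view):

pathway abundance = min(abundances of all reaction steps in the pathway)

where:

  • abundance of each reaction step = sum of the abundances of all gene families in that step
  • final pathway abundance = minimum of all reaction step abundances (bottleneck principle)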

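Worked example

Assumed scenario

Suppose a simple metabolic pathway, “Glycolysis”, consists of 3 reaction steps:

Reaction step 1: glucose → glucose-6-phosphate

  • Catalyzed by gene families UniRef90_A and UniRef90_B
  • UniRef90_A abundance = 100 RPK
  • UniRef90_B abundance = 50 RPK
  • Step 1 abundance = 100 + 50 = 150 RPK

Reaction step 2: glucose-6-phosphate → fructose-6-phosphate

  • Catalyzed by gene family UniRef90_C
  • UniRef90_C abundance = 80 RPK
  • Step 2 abundance = 80 RPK

Reaction step 3: fructose-6-phosphate → fructose-1,6-bisphosphate

  • Catalyzed by gene families UniRef90_D and UniRef90_E
  • UniRef90_D abundance = 200 RPK
  • UniRef90_E abundance = 120 RPK
  • Step 3 abundance = 200 + 120 = 320 RPK

Pathway abundance calculation

Glycolysis pathway abundance = min(step 1, step 2, step 3)
                             = min(150, 80, 320)
                             = 80 RPK

Why the minimum?

  • It follows the “barrel principle” (bottleneck effect)
  • The overall capacity of the pathway is limited by its least abundant reaction step
  • Step 2 has only 80 RPK, so even though the other steps are far more abundant, the pathway as a whole cannot exceed 80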

Normalization

Abundances can additionally be normalized:

1. RPK (reads per kilobase)

RPK = (number of reads aligned to the gene) / (gene length in kb)

2. RPKM/CPM (optional)

RPKM = RPK / (total reads in millions)

3. Relative abundance

pathway relative abundance = (pathway RPKM) / (sum of RPKM over all pathways) × 1,000,000

Unit: CPM (counts per million)


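Example from your data

From your output:

otu_table()   OTU Table:         [ 1097 taxa and 2 samples ]

These 1097 “taxa” are actually 1097 MetaCyc pathways. Each pathway's abundance value has been processed as follows:

  1. Raw abundance: computed with the minimum rule described above
  2. Normalization: converted to relative abundance (sum = 1 or 100%)
  3. Output file: the values in pathabundance_relab.tsv are relative abundances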

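Key points

| Aspect | Description |
|---|---|
| Calculation method | Minimum abundance across all reaction steps in the pathway |
| Unit | Usually relative abundance (0-1 or 0-100%) |
| Biological meaning | Reflects the potential metabolic capacity of the pathway |
| Strength | Accounts for pathway completeness rather than simply summing |
| Limitation | Cannot distinguish active from inactive pathways (needs transcriptomic validation) |

Notes

⚠️ Important reminders

  • Pathway abundance reflects gene potential, not actual metabolic activity
  • A pathway being present ≠ the pathway being used
  • Transcriptomic (RNA-seq) or metabolomic data are needed to determine actual activity
  • With n = 1 sample per group, only descriptive comparisons are possible; statistical inference is not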

STEP 5 — Descriptive visualisations (appropriate for n = 1 per group) for Data_Tam_Metagenomics_2026_Soil

Updated Step 5: PNG Figures + Complete Excel Exports

Prerequisites Update

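Add openxlsx (for Excel export) and tibble (for rownames_to_column()) to your package list:

# Install if needed (phyloseq is distributed via Bioconductor, not CRAN)
install.packages(c("ggplot2", "vegan", "dplyr", "tidyr", "tibble",
                   "pheatmap", "openxlsx", "BiocManager"))
BiocManager::install("phyloseq")

And load them:

library(openxlsx)
library(tibble)   # provides rownames_to_column(), used in 5d-5f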

Complete Step 5 — Replace your existing Step 5 with this

# =============================================================================
# STEP 5 — Visualisations (PNG) + Complete Excel exports
# =============================================================================

dir.create("figures", showWarnings = FALSE)
dir.create("tables",  showWarnings = FALSE)

# Helper: safe log2 fold change (adds pseudocount to avoid log(0))
safe_log2fc <- function(x, y, pseudo = 1e-6) {
  log2((x + pseudo) / (y + pseudo))
}

# =============================================================================
# 5a. Top-N species stacked bar plot (Loc1 vs Loc4)
# =============================================================================

top_n_sp <- 20
top_species <- names(sort(rowMeans(otu_table(species_ps)),
                          decreasing = TRUE))[1:top_n_sp]

ps_top <- prune_taxa(top_species, species_ps)

df_species <- psmelt(ps_top) %>%
  mutate(Species = factor(Species, levels = rev(top_species)))

p_species <- ggplot(df_species, aes(x = Location, y = Abundance, fill = Species)) +
  geom_bar(stat = "identity", position = "fill", width = 0.6) +
  coord_flip() +
  scale_fill_viridis_d(option = "D") +
  labs(title = paste("Top", top_n_sp, "Species by Location"),
       x = "Location", y = "Relative Abundance", fill = "Species") +
  theme_minimal(base_size = 12) +
  theme(legend.position = "bottom",
        legend.key.size  = unit(0.4, "cm"),
        legend.text      = element_text(size = 7))

ggsave("figures/species_top20_barplot.png", p_species,
       width = 10, height = 8, dpi = 300, bg = "white")
cat("✅ Saved: figures/species_top20_barplot.png\n")

# =============================================================================
# 5b. Species heatmap (all species, row-scaled)
# =============================================================================

otu_mat <- as.matrix(otu_table(species_ps))

# Filter species with very low abundance (max < 0.01% across both samples)
keep <- apply(otu_mat, 1, max) > 0.0001
otu_filt <- otu_mat[keep, , drop = FALSE]

# Annotation column
ann_col <- data.frame(Location = metadata[colnames(otu_filt), "Location"],
                      row.names = colnames(otu_filt))

# Write heatmap to PNG
png("figures/species_heatmap.png",
    width = 10, height = max(8, 0.25 * nrow(otu_filt) + 2),
    units = "in", res = 300)

pheatmap(otu_filt,
         scale         = "row",
         clustering_distance_rows = "euclidean",
         clustering_method        = "complete",
         annotation_col = ann_col,
         main           = "Species Abundance Heatmap (row-scaled)",
         fontsize_row   = 6,
         fontsize_col   = 10,
         show_rownames  = nrow(otu_filt) <= 80)

dev.off()
cat("✅ Saved: figures/species_heatmap.png\n")

# =============================================================================
# 5c. Top pathways stacked bar plot
# =============================================================================

top_n_pw <- 20
top_pw_names <- names(sort(rowMeans(otu_table(pathway_ps)),
                           decreasing = TRUE))[1:top_n_pw]

ps_pw_top <- prune_taxa(top_pw_names, pathway_ps)

df_pw <- psmelt(ps_pw_top) %>%
  mutate(Pathway = factor(Pathway, levels = rev(top_pw_names)))

p_pw <- ggplot(df_pw, aes(x = Location, y = Abundance, fill = Pathway)) +
  geom_bar(stat = "identity", position = "fill", width = 0.6) +
  coord_flip() +
  scale_fill_viridis_d(option = "C") +
  labs(title = paste("Top", top_n_pw, "HUMAnN Pathways by Location"),
       x = "Location", y = "Relative Abundance", fill = "Pathway") +
  theme_minimal(base_size = 12) +
  theme(legend.position = "bottom",
        legend.key.size  = unit(0.4, "cm"),
        legend.text      = element_text(size = 7))

ggsave("figures/pathways_top20_barplot.png", p_pw,
       width = 10, height = 8, dpi = 300, bg = "white")
cat("✅ Saved: figures/pathways_top20_barplot.png\n")

# =============================================================================
# 5d. Pathway dot plot (Loc1 vs Loc4)
# =============================================================================

df_pw_wide <- as.data.frame(otu_table(pathway_ps)) %>%
  rownames_to_column("Pathway") %>%
  filter(Pathway %in% top_pw_names) %>%
  pivot_longer(-Pathway, names_to = "Sample", values_to = "Abundance") %>%
  left_join(data.frame(Sample = rownames(metadata_clean),
                       Location = metadata_clean$Location,
                       stringsAsFactors = FALSE),
            by = "Sample")

p_dot <- ggplot(df_pw_wide, aes(x = Location, y = Pathway,
                                 size = Abundance, color = Abundance)) +
  geom_point() +
  scale_color_viridis_c() +
  labs(title = "Pathway Abundance: Loc1 vs Loc4",
       x = "Location", y = "Pathway") +
  theme_minimal(base_size = 11) +
  theme(axis.text.y = element_text(size = 7))

ggsave("figures/pathway_dotplot.png", p_dot,
       width = 8, height = 10, dpi = 300, bg = "white")
cat("✅ Saved: figures/pathway_dotplot.png\n")

# =============================================================================
# 5e. COMPLETE species list → Excel
# =============================================================================

# Build full species table (ALL species, no cutoff)
sp_full <- as.data.frame(otu_table(species_ps)) %>%
  rownames_to_column("Species")

# Ensure columns exist (defensive)
if (!all(c("Soil_Loc1", "Soil_Loc4") %in% colnames(sp_full))) {
  stop("⚠️  Expected columns 'Soil_Loc1' and 'Soil_Loc4' not found in species OTU table.")
}

sp_full <- sp_full %>%
  mutate(
    Abundance_Loc1         = Soil_Loc1,
    Abundance_Loc4         = Soil_Loc4,
    Diff_Loc4_minus_Loc1   = Soil_Loc4 - Soil_Loc1,
    Log2FC_Loc4_vs_Loc1    = safe_log2fc(Soil_Loc4, Soil_Loc1),
    Present_in_Loc1        = Soil_Loc1 > 0,
    Present_in_Loc4        = Soil_Loc4 > 0,
    Total_Abundance        = Soil_Loc1 + Soil_Loc4,
    Mean_Abundance         = (Soil_Loc1 + Soil_Loc4) / 2
  ) %>%
  select(Species,
         Abundance_Loc1, Abundance_Loc4,
         Diff_Loc4_minus_Loc1, Log2FC_Loc4_vs_Loc1,
         Present_in_Loc1, Present_in_Loc4,
         Total_Abundance, Mean_Abundance) %>%
  arrange(desc(abs(Diff_Loc4_minus_Loc1)))

# Write multi-sheet Excel workbook for species
sp_wb <- createWorkbook()
addWorksheet(sp_wb, "All_Species")
addWorksheet(sp_wb, "Top50_by_Diff")
addWorksheet(sp_wb, "Top50_by_Abundance")
addWorksheet(sp_wb, "Loc1_only")
addWorksheet(sp_wb, "Loc4_only")
addWorksheet(sp_wb, "Shared")

writeData(sp_wb, "All_Species",       sp_full)
writeData(sp_wb, "Top50_by_Diff",     head(sp_full, 50))
writeData(sp_wb, "Top50_by_Abundance",
          sp_full %>% arrange(desc(Mean_Abundance)) %>% head(50))
writeData(sp_wb, "Loc1_only",
          sp_full %>% filter(Present_in_Loc1 & !Present_in_Loc4))
writeData(sp_wb, "Loc4_only",
          sp_full %>% filter(!Present_in_Loc1 & Present_in_Loc4))
writeData(sp_wb, "Shared",
          sp_full %>% filter(Present_in_Loc1 & Present_in_Loc4))

# Formatting: fgFill sets the solid cell fill; addStyle() needs explicit cols
header_style <- createStyle(textDecoration = "bold", fgFill = "#D3D3D3")
for (sh in c("All_Species","Top50_by_Diff","Top50_by_Abundance",
             "Loc1_only","Loc4_only","Shared")) {
  addStyle(sp_wb, sh, style = header_style, rows = 1,
           cols = 1:ncol(sp_full), gridExpand = TRUE)
  setColWidths(sp_wb, sh, cols = 1:ncol(sp_full), widths = "auto")
  freezePane(sp_wb, sh, firstRow = TRUE)
}

saveWorkbook(sp_wb, "tables/species_Loc1_vs_Loc4.xlsx", overwrite = TRUE)
cat(sprintf("✅ Saved: tables/species_Loc1_vs_Loc4.xlsx  (%d species total)\n",
            nrow(sp_full)))

# =============================================================================
# 5f. COMPLETE pathway list → Excel
# =============================================================================

pw_full <- as.data.frame(otu_table(pathway_ps)) %>%
  rownames_to_column("Pathway")

if (!all(c("Soil_Loc1", "Soil_Loc4") %in% colnames(pw_full))) {
  stop("⚠️  Expected columns 'Soil_Loc1' and 'Soil_Loc4' not found in pathway OTU table.")
}

pw_full <- pw_full %>%
  mutate(
    Abundance_Loc1         = Soil_Loc1,
    Abundance_Loc4         = Soil_Loc4,
    Diff_Loc4_minus_Loc1   = Soil_Loc4 - Soil_Loc1,
    Log2FC_Loc4_vs_Loc1    = safe_log2fc(Soil_Loc4, Soil_Loc1),
    Present_in_Loc1        = Soil_Loc1 > 0,
    Present_in_Loc4        = Soil_Loc4 > 0,
    Total_Abundance        = Soil_Loc1 + Soil_Loc4,
    Mean_Abundance         = (Soil_Loc1 + Soil_Loc4) / 2
  ) %>%
  select(Pathway,
         Abundance_Loc1, Abundance_Loc4,
         Diff_Loc4_minus_Loc1, Log2FC_Loc4_vs_Loc1,
         Present_in_Loc1, Present_in_Loc4,
         Total_Abundance, Mean_Abundance) %>%
  arrange(desc(abs(Diff_Loc4_minus_Loc1)))

# Write multi-sheet Excel workbook for pathways
pw_wb <- createWorkbook()
addWorksheet(pw_wb, "All_Pathways")
addWorksheet(pw_wb, "Top50_by_Diff")
addWorksheet(pw_wb, "Top50_by_Abundance")
addWorksheet(pw_wb, "Loc1_only")
addWorksheet(pw_wb, "Loc4_only")
addWorksheet(pw_wb, "Shared")

writeData(pw_wb, "All_Pathways",       pw_full)
writeData(pw_wb, "Top50_by_Diff",      head(pw_full, 50))
writeData(pw_wb, "Top50_by_Abundance",
          pw_full %>% arrange(desc(Mean_Abundance)) %>% head(50))
writeData(pw_wb, "Loc1_only",
          pw_full %>% filter(Present_in_Loc1 & !Present_in_Loc4))
writeData(pw_wb, "Loc4_only",
          pw_full %>% filter(!Present_in_Loc1 & Present_in_Loc4))
writeData(pw_wb, "Shared",
          pw_full %>% filter(Present_in_Loc1 & Present_in_Loc4))

for (sh in c("All_Pathways","Top50_by_Diff","Top50_by_Abundance",
             "Loc1_only","Loc4_only","Shared")) {
  # addStyle() has no default for cols, so pass the header columns explicitly
  addStyle(pw_wb, sh, style = header_style, rows = 1,
           cols = 1:ncol(pw_full), gridExpand = TRUE)
  setColWidths(pw_wb, sh, cols = 1:ncol(pw_full), widths = "auto")
  freezePane(pw_wb, sh, firstRow = TRUE)
}

saveWorkbook(pw_wb, "tables/pathways_Loc1_vs_Loc4.xlsx", overwrite = TRUE)
cat(sprintf("✅ Saved: tables/pathways_Loc1_vs_Loc4.xlsx  (%d pathways total)\n",
            nrow(pw_full)))

# =============================================================================
# Summary
# =============================================================================

cat("\n========================================\n")
cat("STEP 5 COMPLETE\n")
cat("========================================\n")
cat("Figures (PNG, 300 dpi):\n")
cat("  • figures/species_top20_barplot.png\n")
cat("  • figures/species_heatmap.png\n")
cat("  • figures/pathways_top20_barplot.png\n")
cat("  • figures/pathway_dotplot.png\n")
cat("\nExcel tables (complete lists, no cutoff):\n")
cat(sprintf("  • tables/species_Loc1_vs_Loc4.xlsx   (%d species)\n", nrow(sp_full)))
cat(sprintf("  • tables/pathways_Loc1_vs_Loc4.xlsx  (%d pathways)\n", nrow(pw_full)))
cat("========================================\n")

What You Get

📊 Figures (PNG, 300 dpi, publication-ready)

| File | Content | Size |
|---|---|---|
| species_top20_barplot.png | Top 20 species stacked bar | 10 × 8 in |
| species_heatmap.png | All species above 0.01% threshold, row-scaled | 10 × auto (scales with # species) |
| pathways_top20_barplot.png | Top 20 pathways stacked bar | 10 × 8 in |
| pathway_dotplot.png | Top 20 pathways dot plot | 8 × 10 in |

📑 Excel Files (complete lists, no cutoff)

Each workbook contains 6 sheets:

| Sheet | Content |
|---|---|
| All_Species / All_Pathways | Every detected feature, sorted by absolute difference |
| Top50_by_Diff | Top 50 features by Loc4 − Loc1 |
| Top50_by_Abundance | Top 50 features by mean abundance |
| Loc1_only | Features detected only in Loc1 |
| Loc4_only | Features detected only in Loc4 |
| Shared | Features detected in both locations |

📋 Columns in each Excel file

| Column | Meaning |
|---|---|
| Species / Pathway | Feature name |
| Abundance_Loc1 | Raw relative abundance in Loc1 |
| Abundance_Loc4 | Raw relative abundance in Loc4 |
| Diff_Loc4_minus_Loc1 | Absolute difference (Loc4 − Loc1) |
| Log2FC_Loc4_vs_Loc1 | Log₂ fold change (with pseudocount 1e-6 to handle zeros) |
| Present_in_Loc1 | TRUE/FALSE |
| Present_in_Loc4 | TRUE/FALSE |
| Total_Abundance | Sum across both samples |
| Mean_Abundance | Mean across both samples |

Notes

  1. Log₂ fold change: Uses a small pseudocount (1e-6) to avoid log(0). For features absent in one location, this gives a large but finite fold change — interpret these as “presence/absence” rather than true fold change.

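  2. Heatmap filtering: Species with maximum abundance < 0.01% across both samples are excluded to keep the heatmap readable. Adjust the threshold in keep <- apply(otu_mat, 1, max) > 0.0001 if needed. With only two samples, row scaling (scale = "row") maps every row to about ±0.71 regardless of effect size, and rows with identical values in both samples become NaN, so an unscaled or log-transformed heatmap may be more informative.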

  3. Excel formatting: Headers are bold with grey background, columns auto-sized, and the first row is frozen for easy scrolling.

  4. File locations: All outputs go into figures/ and tables/ subfolders inside your current working directory (reports/).
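
To illustrate note 1, the pseudocount behaviour can be checked outside R. The Python sketch below mirrors the `safe_log2fc` helper from Step 5 with the same default pseudocount of 1e-6; the abundance values are hypothetical and chosen only to show the three typical cases.

```python
import math

def safe_log2fc(x, y, pseudo=1e-6):
    """Log2 fold change of x over y, with a pseudocount to avoid log(0)."""
    return math.log2((x + pseudo) / (y + pseudo))

# Hypothetical relative abundances (proportions between 0 and 1).
doubled = safe_log2fc(0.02, 0.01)       # present in both, roughly 2-fold
loc4_only = safe_log2fc(0.01, 0.0)      # absent in Loc1
absent_in_both = safe_log2fc(0.0, 0.0)  # absent everywhere

print(round(doubled, 3), round(loc4_only, 2), absent_in_both)
```

The feature absent from one location gets a fold change above 13, a value set almost entirely by the pseudocount, which is why such rows are better read as presence/absence calls.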