
RNA-seq + Proteomics Integration Workflow (Δsbp; MH & TSB)

cd results/star_salmon/degenes

The article uses the following Python scripts (please see Data_Michelle_RNAseq_2025_scripts.zip).

* map_orf_1to1_progress_safe.py
* map_orf_1to1_progress.py (deprecated)
* check_missing_uniprot_fastas.py
* rescue_missing_uniprot.py
* compare_transcript_protein.py
* make_rna_only_present_excels.py (needs debugging)
* make_rna_only_not_in_protDE.py (needs debugging)
* map_symbols.py (deprecated)

0) Goal

  • Match IDs: convert UniProt protein IDs to gene symbols to allow direct comparison with RNA-seq.
  • Merge datasets by gene_symbol and retain: protein_log2FC, protein_padj, rna_log2FC, rna_padj.
  • Classify each gene/protein into one of the following categories (see the sketch after this list):
    • both_significant_same_direction
    • both_significant_opposite_direction
    • protein_only_significant
    • rna_only_significant
  • Subset analysis: examine proteins significant in proteomics but not in RNA-seq (and vice versa) to identify post-transcriptional regulation, stability, secretion/export, or translational effects.
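A minimal sketch of the merge-and-classify step described above, assuming the merged table uses the column names protein_log2FC, protein_padj, rna_log2FC, and rna_padj and the default padj cut-off of 0.05 (the helper name classify_row is illustrative and not part of the shipped scripts):

import pandas as pd

ALPHA = 0.05  # matches the --alpha default used below with compare_transcript_protein.py

def classify_row(row, alpha=ALPHA):
    """Assign one of the five categories used in this workflow."""
    prot_sig = pd.notna(row["protein_padj"]) and row["protein_padj"] < alpha
    rna_sig = pd.notna(row["rna_padj"]) and row["rna_padj"] < alpha
    if prot_sig and rna_sig:
        same = (row["protein_log2FC"] > 0) == (row["rna_log2FC"] > 0)
        return "both_significant_same_direction" if same else "both_significant_opposite_direction"
    if prot_sig:
        return "protein_only_significant"
    if rna_sig:
        return "rna_only_significant"
    return "not_significant_in_either"

# merged = proteomics_df.merge(rnaseq_df, on="gene_symbol", how="inner")
# merged["category"] = merged.apply(classify_row, axis=1)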

Comparisons (process first: 1–6)

MH medium

Index Comparison (normalized) RNA-seq Proteomics
1 Δsbp_2h_vs_WT_2h (alt. Δsbp_1h_vs_WT_1h)
2 Δsbp_4h_vs_WT_4h
3 Δsbp_18h_vs_WT_18h

TSB medium

Index Comparison (normalized) RNA-seq Proteomics
4 Δsbp_2h_vs_WT_2h (alt. Δsbp_1h_vs_WT_1h)
5 Δsbp_4h_vs_WT_4h
6 Δsbp_18h_vs_WT_18h
7 Δsbp_1h_vs_Δsbp_4h
8 Δsbp_4h_vs_Δsbp_18h
9 WT_1h_vs_WT_4h
10 WT_4h_vs_WT_18h

1) Environment

mamba activate rnaseq_prot_comparison
mamba install -c conda-forge biopython pandas openpyxl requests requests-cache tqdm -y

2) Populate cache once (per‑ID fetch via aligner)

# MH (example outcome: mapped 589)
python map_orf_1to1_progress_safe.py \
  --rna_fa ../../genome/genome.transcripts.fa \
  --prot_xlsx Aug25_AG_Rohde_Order_609_Michelle-MH-Medium-abgekratzer-Biofilm.xlsx \
  --prot_kind mh \
  --min_aa 10 \
  --requests_cache \
  --out map_mh.csv

# Expected console example:
# ORFs translated: 2319; proteins: 616; mapped: 589
# Output: map_mh.csv

3) Check which UniProt IDs are missing (cache and stream)

python check_missing_uniprot_fastas.py \
  --mh_xlsx "Aug25_AG_Rohde_Order_609_Michelle-MH-Medium-abgekratzer-Biofilm.xlsx" \
  --tsb_xlsx "250618_Report_AG_Rohde_Michelle_Savickis_BS_ND-schalen-zeitreihe-TSB-Medium.xlsx" \
  --cache cache/uniprot_fastas \
  --out check_report.txt

# Example:
# Total unique accessions: 1781
# In cache (non-empty FASTA): 1296
# Missing in cache: 485
# Present via UniProt stream now: 0
# Missing via UniProt stream now: 1781

4) Rescue missing accessions (UniSave → UniParc → UniProtKB remap)

  • After each run, check rescue_summary.txt and rerun the checker before redoing alignments.
# 4.1) Rescue cache-missing
python rescue_missing_uniprot.py --miss_file missing_cache.txt --cache cache/uniprot_fastas

# 4.2) Re-check
python check_missing_uniprot_fastas.py \
  --mh_xlsx "Aug25_AG_Rohde_Order_609_Michelle-MH-Medium-abgekratzer-Biofilm.xlsx" \
  --tsb_xlsx "250618_Report_AG_Rohde_Michelle_Savickis_BS_ND-schalen-zeitreihe-TSB-Medium.xlsx" \
  --cache cache/uniprot_fastas \
  --out check_report_after_rescue.txt

# Example (OK to proceed):
# Total unique accessions: 1781
# In cache (non-empty FASTA): 1781
# Missing in cache: 0
# Present via UniProt stream now: 0
# Missing via UniProt stream now: 1781

# 4.3) Optional: rescue stream-missing
python rescue_missing_uniprot.py --miss_file missing_stream.txt --cache cache/uniprot_fastas

Notes:

  • If “Present via UniProt stream now” is 0, it usually indicates a streaming query/parse/limit issue; the cached per‑accession FASTAs are valid and sufficient for alignment (a quick cache sanity check is sketched below).
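A minimal sketch for sanity-checking the per-accession cache independently of the checker script, assuming the cache keeps one FASTA file per accession under cache/uniprot_fastas (the file pattern is an assumption; adjust to the actual layout):

import glob
import os

cache_dir = "cache/uniprot_fastas"                 # same directory passed via --cache above
fastas = glob.glob(os.path.join(cache_dir, "*"))   # one file per UniProt accession (assumed layout)
empty = [f for f in fastas if os.path.getsize(f) == 0]
print(f"cached files: {len(fastas)}; empty files needing rescue: {len(empty)}")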

5) Rerun the aligner with rescued FASTAs

# MH final
python map_orf_1to1_progress_safe.py \
  --rna_fa ../../genome/genome.transcripts.fa \
  --prot_xlsx Aug25_AG_Rohde_Order_609_Michelle-MH-Medium-abgekratzer-Biofilm.xlsx \
  --prot_kind mh \
  --min_aa 10 \
  --requests_cache \
  --out map_mh_final.csv

# TSB (cache already populated; re-run if desired)
python map_orf_1to1_progress_safe.py \
  --rna_fa ../../genome/genome.transcripts.fa \
  --prot_xlsx 250618_Report_AG_Rohde_Michelle_Savickis_BS_ND-schalen-zeitreihe-TSB-Medium.xlsx \
  --prot_kind tsb \
  --min_aa 10 \
  --requests_cache \
  --out map_tsb.csv

6) Final comparison (merge, classify, export)

# Build merged comparisons (1–6), classify categories
python compare_transcript_protein.py \
  --map_mh map_mh_final.csv \
  --map_tsb map_tsb.csv \
  --out_dir results_compare \
  --alpha 0.05

# Convert CSVs to a single Excel workbook
~/Tools/csv2xls-0.4/csv_to_xls.py \
  summary_counts.csv MH_2h_merged.csv MH_4h_merged.csv MH_18h_merged.csv \
  TSB_2h_merged.csv TSB_4h_merged.csv TSB_18h_merged.csv \
  -d',' -o RNAseq-Proteomics_deltasbp_MH-TSB_2h4h18h_summary_20251002.xlsx

# Optional: export RNA-only-present genes per comparison (empty proteomics fields)
python make_rna_only_present_excels.py --out_dir rna_only_present_excels

7) Why counts differ (e.g., 1173(1319) vs 1107; TSB 663 vs 650)

  • The aligner uses “unique, usable” FASTA sequences; duplicates, deprecated/secondary or invalid accessions, and empty/invalid sequences are filtered, so usable proteins can be fewer than attempted fetches.
  • TSB 663 vs 650: about 13 entries typically drop because of duplicate/merged IDs, obsolete accessions, decoy/contaminant-like strings, or missing gene symbols that prevent a 1‑to‑1 mapping to the RNA-seq data (a small deduplication sketch follows below).

Note: the alignment stage works on the set of “usable, unique” protein sequences, so its count is smaller than the number of entries originally fetched; for TSB, going from 663 rows to 650 usable records after normalization and deduplication is expected.
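A hedged illustration of the kind of normalization that shrinks the proteomics rows (the file name, column names, and the CON__/REV__ contaminant/decoy patterns are placeholders, not the actual table layout):

import pandas as pd

df = pd.read_excel("proteomics_DE_table.xlsx")  # hypothetical input: one row per reported protein
usable = (
    df.dropna(subset=["accession", "gene_symbol"])                                 # no ID or no merge key
      .loc[lambda d: ~d["accession"].str.contains("CON__|REV__", na=False)]        # contaminant/decoy-style strings
      .drop_duplicates(subset=["accession"])                                       # duplicate or merged accessions
)
print(len(df), "reported rows ->", len(usable), "usable, 1-to-1 mappable records")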


8) Report (email-ready text)

As mentioned, the IDs between proteomics and RNA‑seq were matched based on direct sequence alignment (ORF nucleotide sequences translated and aligned to UniProt protein FASTA) to ensure 1‑to‑1 mapping and unambiguous ID correspondence.

Please note that the recorded counts are constrained by the proteomics differentially expressed (DE) tables: for MH medium, 1107 proteins/genes; for TSB medium, 650. The RNA‑seq DE lists are larger overall, but the integration and comparisons are based on entities present in both proteomics and RNA‑seq.

For the overlap between the two datasets, each gene/protein was placed into one of five categories:

  • not_significant_in_either
  • rna_only_significant
  • protein_only_significant
  • both_significant_same_direction
  • both_significant_opposite_direction

The attached Excel file contains per‑comparison merged tables (protein_log2FC, protein_padj, rna_log2FC, rna_padj), category calls for each entry, and summary counts for quick review.


9) Notes & assumptions

  • MH comparison blocks are read in 18h, 4h, 1h order from Ord_609 (three repeated “Log2 difference / q-value (Benjamini FDR)” blocks after intensities). If the template differs, adjust the mapping.
  • TSB comparisons are explicitly labeled (sbp vs Control at 1h/4h/18h) and are used verbatim.
  • Gene symbol preference: use proteomics-provided symbols (MH via GN=; TSB via Gene Symbol); otherwise fall back to RNA Preferred_name by locus to unify merge keys.
  • Significance: default padj < 0.05 for both datasets (set via --alpha). Add effect-size filters if needed.

10) TODOs

  • Cluster by presence across platforms: using the TSB (650) and MH (1319) protein sets, derive 5+1 groups, where the +1 group holds genes that do not occur in the proteomics results (the RNA-only-present category). This is implemented as a per-comparison set difference on gene_symbol (RNA DE minus proteomics DE), which yields concise lists and avoids many-to-many expansions (see the sketch after this list).
  • Re-run proteomics DE with Gunnar’s project scripts to expand DE coverage if needed.
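A minimal sketch of the per-comparison set difference; the column name gene_symbol is assumed on both sides (in practice the proteomics symbols come from GN= for MH and the Gene Symbol column for TSB), and the output file name is illustrative:

import pandas as pd

rna = pd.read_excel("DEG_Annotations_deltasbp_MH_4h_vs_WT_MH_4h-all.xlsx")
prot = pd.read_excel("Aug25_AG_Rohde_Order_609_Michelle-MH-Medium-abgekratzer-Biofilm.xlsx")

rna_symbols = set(rna["gene_symbol"].dropna())
prot_symbols = set(prot["gene_symbol"].dropna())

# RNA-only-present: in the RNA-seq DE table but absent from the proteomics DE table
rna_only = sorted(rna_symbols - prot_symbols)
rna[rna["gene_symbol"].isin(rna_only)].to_excel("MH_4h_rna_only_present.xlsx", index=False)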

Appendix: Deprecated symbol-only mapping (for reference)

# (Deprecated) Symbol-based mapping reached ~35% overlap → switched to sequence-alignment-based mapping.

python map_symbols.py \
  --rna DEG_Annotations_deltasbp_MH_4h_vs_WT_MH_4h-all.xlsx \
  --prot Aug25_AG_Rohde_Order_609_Michelle-MH-Medium-abgekratzer-Biofilm.xlsx \
  --prot_kind mh

# RNA rows total: 2344
# RNA symbols:    1460
# Proteomics symbols: 1280
# Intersect (symbols): 519
# Coverage vs RNA symbols:        35.5%
# Coverage vs Proteomics symbols:  40.5%
# Coverage vs all RNA rows:        22.1%

python map_symbols.py \
  --rna DEG_Annotations_deltasbp_MH_2h_vs_WT_MH_2h-all.xlsx \
  --prot 250618_Report_AG_Rohde_Michelle_Savickis_BS_ND-schalen-zeitreihe-TSB-Medium.xlsx \
  --prot_kind tsb

# RNA rows total: 2344
# RNA symbols:    1460
# Proteomics symbols: 568
# Intersect (symbols): 512
# Coverage vs RNA symbols:        35.1%
# Coverage vs Proteomics symbols:  90.1%
# Coverage vs all RNA rows:        21.8%


Comparison of mainstream AI models for bioinformatics programming (2025)

Model | Coding ability | Reasoning / complex problems | Multimodal support | Context window | Knowledge cutoff | Strengths and typical use | Price
GPT-5 | High (~75% on SWE-Bench) | Excellent; handles complex logical reasoning | Text, images | 400k tokens | Sept 2024 | Complex code writing and debugging; detailed output rendering | Medium (~$20/month)
GPT-5 Thinking | Stronger multi-step thinking and debugging | Better than plain GPT-5 | Same as GPT-5 | Same as GPT-5 | Same as GPT-5 | Bioinformatics work that needs high accuracy and complex reasoning | Same as GPT-5
Grok4 | Very strong (98% on HumanEval) | Excellent; supports multi-agent collaboration | Text, images, video | 256k tokens | Nov 2024 | Creative content and on-the-fly code optimization | Free up to premium (~$300/month)
Claude Sonnet | Strong (~74.5% on SWE-Bench) | Strong, with good error handling | Text and file output only | 200k tokens | July 2025 | Code safety, user experience, and ethics-aware analysis | Higher ($20–250)
Claude Sonnet Thinking | Enhanced thinking and debugging | Stronger multi-step and complex reasoning | Same as Sonnet | Same as Sonnet | Same as Sonnet | Better suited to complex tasks and detailed code analysis | Similar to Sonnet
Gemini 2.5 Pro | Upper-middle (AIME 88%, SWE-Bench ~60%) | Excellent multimodal handling and long-document analysis | Text, audio, video | 1,000,000 tokens (very large) | March 2025 | Large-document analysis and multimodal data integration | Medium-high
Sonar | Limited published data, but used in bioinformatics | Focus on error detection and code safety | Mainly text | Unknown | Unknown | Better error detection and coding assistance than base models | Unknown

Recommendations

  • Highly complex work that needs multi-step reasoning and careful debugging: prefer GPT-5 Thinking or Claude Sonnet Thinking.
  • Multimodal processing or very long bioinformatics documents: choose Gemini 2.5 Pro.
  • Real-time code optimization and creative assistance: Grok4 performs very well and has a free tier.
  • Emphasis on safety and ethical compliance: the Claude Sonnet series is the better fit.
  • Beginners or quick generation of basic code: plain GPT-5 and Sonar are sufficient.

Overall, for complex bioinformatics data analysis and algorithm development, GPT-5 Thinking and Claude Sonnet Thinking are the mainstream choices thanks to their strong reasoning and debugging, while Gemini's very large context window and multimodal support suit large, multi-source biological data.


ChatGPT-5 vs ChatGPT-5 Thinking for Python Programming in Bioinformatics

Summary

  • ChatGPT-5 Thinking is designed for complex, multi-step reasoning and code debugging.
  • It generates higher quality, more efficient, and less error-prone code.
  • Particularly suitable for bioinformatics where complex algorithms and multi-step programming are common.
  • Larger improvement in coding accuracy (~53% to ~75%) and debugging ability compared to standard ChatGPT-5.
  • ChatGPT-5 Standard is suitable for simpler tasks with faster response times.
  • Human feedback and iteration remain important for optimizing code quality in bioinformatics.

Recommendation

  • For Python bioinformatics programming, ChatGPT-5 Thinking is generally better due to its advanced reasoning and debugging capability.
  • It can significantly improve code quality and development efficiency in complex tasks.


Complete guide to using Tongyi Qianwen (Qwen) on Ubuntu

As of September 2025, the official Tongyi app is mainly offered for iOS and Android, plus a web version that runs in the browser.

Regarding Ubuntu (Linux):

Alibaba Cloud has not yet released an official desktop client for Ubuntu or other Linux distributions (e.g., .deb, AppImage, or Snap packages). You can nevertheless use Qwen on Ubuntu in the following ways:


Recommended: use the web version

  • Open a browser (Chrome, Firefox, Edge, etc.)
  • Go to the official site: https://tongyi.aliyun.com/qwen/
  • Log in with an Alibaba Cloud account to use it for free; the features largely match the mobile app
  • Supports voice input, file upload, code generation, and more

Developer option: run the open-source models locally (optional)

If you have the technical background, you can also run the open-source Qwen models locally on Ubuntu:

  • The Qwen series (e.g., Qwen-7B, Qwen-14B, Qwen-Max) is published on Hugging Face and ModelScope
  • The models can be run locally via transformers, vLLM, Ollama, or LM Studio (partially supported)
  • A GPU (e.g., an NVIDIA card) is recommended; CPU inference works but is slow

For example, with Ollama (Linux is supported):


# Install Ollama (Ubuntu)
curl -fsSL https://ollama.com/install.sh | sh

# Run Qwen (if the model is available)
ollama run qwen:7b
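For scripting against the local model, Ollama also exposes an HTTP API on port 11434; a minimal sketch (the prompt is only an example, and the server must already be running):

import requests

# Query the locally running Ollama server; the model name matches the example above.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "qwen:7b", "prompt": "Explain what a FASTQ file contains.", "stream": False},
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["response"])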
Improving the Efficiency of DAMIAN

Abstract

Microbiome research is important in medicine, the environment, and industry, but with the development of high-throughput sequencing the volume of microbial genome and metagenome data is growing exponentially, and existing analysis tools are computationally inefficient on large datasets. DAMIAN, a microbiome analysis tool, performs well for functional annotation and gene prediction, but its sequence-alignment step is time-consuming, which limits large-scale use.

This work aims to develop an efficient microbial bioinformatics analysis tool and to optimize DAMIAN's computational performance. The specific goals are:

  1. Build an integrated analysis platform covering the full workflow: read quality control, assembly, annotation, functional prediction, and community-structure analysis;
  2. Replace BLASTn with DIAMOND to substantially speed up sequence alignment while keeping alignment accuracy high;
  3. Further reduce runtime and memory usage through multithreading and index optimization;
  4. Benchmark the different strategies on datasets of different sizes to provide practical guidance for large-scale microbiome analysis.

The expected outcome is that replacing BLASTn with DIAMOND and streamlining the pipeline will markedly improve speed, accuracy, and resource usage, providing an efficient, scalable solution for large-scale microbiome studies. The work should not only improve DAMIAN itself but also serve as a reference for developing and optimizing other microbial bioinformatics tools.


Introduction

Microbiome research matters for medicine, agriculture, environmental science, and industrial biotechnology. With the rapid development of high-throughput sequencing (HTS), researchers can obtain large amounts of microbial genome and metagenome data and thereby dissect the composition and function of microbial communities and their interactions with the environment and the host. The rapid growth in data volume, however, also brings new challenges, above all computational efficiency and resource consumption. Traditional analysis methods are often too slow on large datasets, which limits their use in large-scale microbiome studies.

DAMIAN, as an integrated microbiome analysis tool, has performed well for functional annotation, gene prediction, and community-structure analysis. However, DAMIAN relies mainly on BLASTn for sequence alignment, and as data volumes increase the runtime grows substantially, becoming the main bottleneck for high-throughput use. Developing efficient analysis strategies and optimizing the performance of existing tools is therefore an urgent need in microbial bioinformatics.

This work aims to remove this bottleneck by developing an efficient microbial bioinformatics tool and optimizing DAMIAN's computational performance. The specific strategies are:

  1. Build an integrated analysis platform covering read quality control, assembly, annotation, functional prediction, and community-structure analysis;
  2. Replace BLASTn with DIAMOND to substantially speed up sequence alignment while keeping accuracy high;
  3. Further reduce runtime and memory usage through parallelization, multithreading, and index optimization;
  4. Benchmark the strategies across data scales and complexities to provide practical guidance for large-scale microbiome analysis.

Beyond improving DAMIAN's efficiency, this work provides a workable blueprint for developing and optimizing microbial bioinformatics tools, supporting large-scale microbiome research with an efficient, scalable analysis solution.


Methods

1. Tool architecture and data flow

The microbial bioinformatics tool developed here uses a modular architecture with the following main modules:

  1. Read quality control: FastQC and Trimmomatic for quality assessment and removal of low-quality reads.
  2. Genome/metagenome assembly: SPAdes or MEGAHIT for short-read assembly, ensuring assembly quality and completeness.
  3. Functional annotation and gene prediction: gene prediction with Prokka combined with functional annotation via EggNOG-mapper.
  4. Community-structure analysis: taxonomic classification and abundance estimation with Kraken2 and Bracken, with plots generated for visualization.

The tool accepts several input formats (FASTA, FASTQ, metagenome assemblies) and produces visual reports, making it easy to study microbial diversity and function in depth.

2. DAMIAN performance-optimization strategies

2.1 Faster sequence alignment

  • The original DAMIAN uses BLASTn for sequence alignment; runtime grows markedly with data size.
  • This work replaces BLASTn with DIAMOND, whose fast alignment algorithm greatly increases alignment speed while keeping accuracy high.
  • DIAMOND and BLASTn are compared across data scales in terms of alignment time, memory usage, and alignment accuracy (a minimal invocation sketch follows this list).
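For illustration, a minimal DIAMOND invocation wrapped in Python; file names are placeholders and the diamond binary is assumed to be on PATH. Note that DIAMOND provides blastp/blastx modes (protein or translated-nucleotide queries against a protein database) rather than a direct blastn mode, so the speed comparison here runs against a protein reference:

import subprocess

# Build a DIAMOND database from a protein reference (placeholder file names).
subprocess.run(["diamond", "makedb", "--in", "ref_proteins.faa", "--db", "ref"], check=True)

# Align assembled contigs (nucleotide) against the protein reference with blastx.
subprocess.run([
    "diamond", "blastx",
    "--db", "ref",
    "--query", "contigs.fna",
    "--out", "contigs_vs_ref.tsv",
    "--outfmt", "6",     # BLAST tabular output, easy to compare with BLAST results
    "--threads", "8",    # multithreading, one of the optimization points discussed above
], check=True)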

2.2 Parallelization and multithreading

  • Key modules (alignment, annotation, assembly) run multithreaded to make full use of multi-core CPUs.
  • For large metagenome datasets, a batch/sharding strategy lowers the memory pressure of individual jobs and improves overall throughput.

2.3 Index and cache optimization

  • Efficient indexes are built for the alignment and annotation steps to avoid repeated reads and computation.
  • A caching mechanism stores intermediate results, speeding up repeated runs.

3. Performance evaluation

  • Datasets: several public microbiome datasets, including human gut metagenomes, environmental samples, and laboratory isolates.
  • Metrics: alignment speed, memory usage, annotation coverage, alignment accuracy, and total pipeline runtime.
  • Comparison: the original DAMIAN and the optimized tool are run on the same datasets to quantify the gains.

4. Extensibility and reproducibility

  • The tool follows modular, extensible design principles so that new algorithms or features can be integrated later.
  • All steps expose configurable parameters and write logs, keeping the analysis reproducible and traceable.

Results and discussion

1. Tool performance

1.1 Alignment speed

  • Replacing BLASTn with DIAMOND markedly increases alignment speed.
  • On a human gut metagenome dataset, DIAMOND is roughly 10-50x faster, and the advantage grows with data size.
  • Multithreading further shortens the total runtime; processing a large metagenome drops from days to hours.

1.2 Memory and compute resources

  • With index optimization and caching, memory usage drops by about 30% when processing millions of sequences.
  • Batch processing prevents out-of-memory failures in single jobs and keeps the pipeline stable.

1.3 Functional annotation and accuracy

  • Compared with the original DAMIAN, the optimized tool keeps gene-prediction and annotation coverage equal or slightly better.
  • DIAMOND's alignments agree closely with BLASTn, so the downstream results remain reliable.

2. Community-structure analysis

  • Taxonomic classification and abundance estimation with Kraken2 and Bracken show that the optimized tool quickly produces abundance matrices and visual reports.
  • The efficiency gains allow more samples to be processed, supporting large-scale community comparisons.

3. Discussion of the strategies

  • DIAMOND instead of BLASTn: a large speed-up, well suited to large-scale microbiome analysis.
  • Multithreading and batching: make full use of the hardware and lower per-job load.
  • Indexing and caching: avoid repeated computation and increase reuse of intermediate results.

4. Applications and value

  • The efficient pipeline provides a reliable tool for large-scale microbiome studies, covering everything from quality control to functional analysis.
  • The modular, extensible design makes it easy to integrate new algorithms and build custom pipelines.
  • The optimization strategies can be transferred to other bioinformatics tools to improve large-scale data analysis in general.

5. Conclusions

  • Replacing BLASTn with DIAMOND and adding parallelization, multithreading, indexing, and caching significantly improves DAMIAN's computational efficiency.
  • The optimized tool performs well in speed, resource consumption, and accuracy and can support high-throughput microbiome analysis.
  • The work provides a workable approach to tool development in microbial bioinformatics and a basis for fast analysis of large microbiome datasets.

Future work and limitations

1. Future work

  1. Algorithm optimization and new techniques

    • Explore further fast aligners (e.g., MMseqs2) and machine-learning methods for functional prediction and classification to improve speed and accuracy.
    • Integrate a metagenome association module based on graph databases or network analysis to study microbial interactions and functional networks.
  2. Large-scale analysis and cloud computing

    • Develop a cloud deployment so the tool can handle TB-scale metagenome data with distributed computing and automated pipeline management.
    • Support automated data updates and synchronization with public databases so the tool keeps pace with the rapidly evolving microbiome data landscape.
  3. Broader applications

    • Apply the tool to diverse environmental samples (soil, water, industrial fermentation) to evaluate performance across ecosystems.
    • Develop a dedicated module for clinical microbiome data, supporting pathogen detection and microbiome-function association analysis.

2. Limitations

  1. Data dependence

    • Performance and results depend on input data quality and on the completeness of reference databases; low-quality or highly incomplete sequences can reduce accuracy.
  2. Hardware constraints

    • Despite the multithreading and indexing optimizations, analyses of extremely large datasets (hundreds of TB of metagenomes) may still be limited by storage and compute resources.
  3. Annotation limitations

    • Functional annotation relies on public databases; unannotated or mis-annotated genes limit how deeply some microbial functions can be resolved.
  4. Generality

    • The tool is optimized mainly for short-read metagenome data; long-read data (PacBio or Nanopore) will need adjusted parameters and algorithms to maintain accuracy.

Despite these limitations, the efficient microbiome analysis tool and the optimization strategies developed here provide a useful reference for large-scale microbiome data analysis and a basis for future algorithm improvements and broader applications.

Suggested figures and tables

Figure 1: Pipeline diagram

  • Content: the full workflow from raw reads to community analysis and functional annotation.
  • Modules:
    1. Data input (FASTQ/FASTA)
    2. Read quality control (FastQC/Trimmomatic)
    3. Assembly (SPAdes/MEGAHIT)
    4. Gene prediction (Prokka)
    5. Functional annotation (EggNOG-mapper)
    6. Alignment optimization (BLASTn → DIAMOND)
    7. Community-structure analysis (Kraken2/Bracken)
    8. Visualization and report output

Legend suggestion: use a different colour per module, arrows for the order of steps, and mark the optimization points (DIAMOND, multithreading, indexing).


Figure 2: Alignment performance comparison

  • Content: DIAMOND vs BLASTn alignment time and memory usage across data scales.
  • X axis: data scale (number of sequences / samples)
  • Y axes:
    • Left: alignment time (hours/minutes)
    • Right: memory usage (GB)
  • Plot type: dual-axis bar or line chart

Figure 3: Annotation coverage

  • Content: annotation coverage before and after optimization.
  • X axis: dataset or sample type
  • Y axis: fraction of annotated genes (%)
  • Plot type: bar or stacked bar chart

Figure 4: Community composition

  • Content: example visualization of the species-abundance matrix produced by the optimized tool.
  • Type: stacked bar chart or pie chart
  • Colours: distinguish microbial taxa
  • Purpose: show how quickly the tool produces visual reports

Table 1: Performance metrics summary

Dataset | Tool version | Sequences | Alignment time (h) | Memory (GB) | Annotation coverage (%) | Threads / optimization
Sample A | original DAMIAN | 1,000,000 | 48 | 120 | 85 | –
Sample A | optimized DAMIAN | 1,000,000 | 2 | 80 | 86 | DIAMOND + 8 threads + cache
Sample B | optimized DAMIAN | 5,000,000 | 10 | 150 | 87 | DIAMOND + 16 threads

Table 2: Future work and planned extensions

Direction | Task | Expected benefit
Algorithm optimization | Integrate MMseqs2 and machine-learning methods | Higher speed and annotation accuracy
Cloud deployment | Distributed computing for TB-scale data | Support very large metagenome analyses
Broader applications | Environmental and clinical samples | Validate applicability and extend use cases

Notes on figure preparation

  • Use vector or high-resolution images (SVG/PNG) for clarity.
  • Give every figure a detailed caption stating the before/after differences.
  • Reference the figures in the text, e.g. “As shown in Figure 2, DIAMOND is substantially faster for large-scale alignments.”

Processing Data_Vero_Kymographs (Step 1 for filtering tracks and Step 2 for binding position analyses)

  1. Calculate the filtering tracks

    (plot-numpy1) python 1_filter_track.py
    
    #Not necessary when reprocessing: (plot-numpy1) ./lake_files/2_generate_orig_lake_files.py
    
    (plot-numpy1) python 3_update_lake.py
    
    #PREPARE DIR position_2s_lakes by: mv updated_lakes/*_position_2s.lake position_2s_lakes
    #DEFINE input_dir = 'position_2s_lakes' and output_filename = 'position_2s.lake' in 4_merge_lake_files.py
    (plot-numpy1) python 4_merge_lake_files.py
    
    #Found 35 .lake files to merge. Merging...
    #✅ Merging complete!
    #   - Total unique H5 files: 35
    #   - Total kymograph entries: 35
    #   - Merged file saved as: 'position_2s.lake'
    
    #Found 37 .lake files to merge. Merging...
    #✅ Merging complete!
    #   - Total unique H5 files: 37
    #   - Total kymograph entries: 37
    #   - Merged file saved as: 'lifetime_5s_only.lake'
  2. Overlap the track signals

    ![p853_250706_p502_averaged_output_events](/wp-content/uploads/2025/09/p853_250706_p502_averaged_output_events-1024x684.png){.alignnone}
    ![p968_250702_p502_averaged_output_events](/wp-content/uploads/2025/09/p968_250702_p502_averaged_output_events-1024x691.png){.alignnone}
    ![p853_250706_p511_averaged_output_events](/wp-content/uploads/2025/09/p853_250706_p511_averaged_output_events-1024x684.png){.alignnone}
    
    # Rename several files to match the naming convention
    cd Binding_positions_60_timebins
    mv p968_250702_502_10pN_ch4_0bar_b5_2_binding_position_blue.csv p968_250702_p502_10pN_ch4_0bar_b5_2_binding_position_blue.csv
    mv p968_250702_502_10pN_ch4_0bar_b5_1_binding_position_blue.csv p968_250702_p502_10pN_ch4_0bar_b5_1_binding_position_blue.csv
    mv p968_250702_502_10pN_ch4_0bar_b4_4_binding_position_blue.csv p968_250702_p502_10pN_ch4_0bar_b4_4_binding_position_blue.csv
    mv p968_250702_502_10pN_ch4_0bar_b4_3_binding_position_blue.csv p968_250702_p502_10pN_ch4_0bar_b4_3_binding_position_blue.csv
    mv p968_250702_502_10pN_ch4_0bar_b4_2_binding_position_blue.csv p968_250702_p502_10pN_ch4_0bar_b4_2_binding_position_blue.csv
    mv p968_250702_502_10pN_ch4_0bar_b4_1_binding_position_blue.csv p968_250702_p502_10pN_ch4_0bar_b4_1_binding_position_blue.csv
    mv p968_250702_502_10pN_ch4_0bar_b3_1_binding_position_blue.csv p968_250702_p502_10pN_ch4_0bar_b3_1_binding_position_blue.csv
    mv p968_250702_502_10pN_ch4_0bar_b2_2_binding_position_blue.csv p968_250702_p502_10pN_ch4_0bar_b2_2_binding_position_blue.csv
    mv p968_250702_502_10pN_ch4_0bar_b2_1_binding_position_blue.csv p968_250702_p502_10pN_ch4_0bar_b2_1_binding_position_blue.csv
    mv p968_250702_502_10pN_ch4_0bar_b1_3_binding_position_blue.csv p968_250702_p502_10pN_ch4_0bar_b1_3_binding_position_blue.csv
    mv p968_250702_502_10pN_ch4_0bar_b1_2_binding_position_blue.csv p968_250702_p502_10pN_ch4_0bar_b1_2_binding_position_blue.csv
    mv p968_250702_502_10pN_ch4_0bar_b1_1_binding_position_blue.csv p968_250702_p502_10pN_ch4_0bar_b1_1_binding_position_blue.csv
    
    # Sort the Binding_positions_60_timebins files into three directories
    mkdir ../tracks_p968_250702_p502/ ../tracks_p853_250706_p511/ ../tracks_p853_250706_p502/
    mv p968_250702_p502*.csv ../tracks_p968_250702_p502/
    mv p853_250706_p511*.csv ../tracks_p853_250706_p511/
    mv p853_250706_p502*.csv ../tracks_p853_250706_p502/
    cd ..; rmdir Binding_positions_60_timebins
    
    # #p853_250706_p511
    # event_id;start_bin;end_bin;avg_position;peak_size
    # 1;57;57;1.706;0.143
    # 2;40;42;2.019;0.142
    # 3;44;45;2.014;0.134
    # 4;0;6;2.146;0.143
    # 5;25;25;3.208;0.143
    # 6;27;27;3.181;0.124
    # 7;42;42;3.559;0.143
    # *8;0;0;3.929;0.143
    # 9;34;38;4.135;0.143
    # 10;42;44;4.591;0.148
    # 11;47;50;4.667;0.144
    # -->
    # event_id;start_bin;end_bin;duration_bins;pos_range;avg_position;peak_size
    # 1;56;56;1;0.256;1.706;0.143
    # 2;39;41;3;0.256;2.019;0.142
    # 3;43;44;2;0.229;2.014;0.134
    # 4;0;5;6;0.243;2.152;0.138
    # 5;24;24;1;0.269;3.208;0.143
    # 6;26;26;1;0.216;3.181;0.124
    # 7;41;41;1;0.269;3.559;0.143
    # 8;33;37;5;0.256;4.135;0.143
    # 9;41;43;3;0.377;4.591;0.148
    # 10;46;49;4;0.404;4.667;0.144
    
    #> (plot-numpy1) jhuang@WS-2290C:~/DATA/Data_Vero_Kymographs/Binding_positions$ python overlap_v16_calculate_average_matix.py
    
    #1. Averaging: Compute the averaged and summed values at each (position, time_bin) point for tracks_p***_2507**_p***.
    
        # ---- Filtering the first time bin 0 for all three groups tracks_p***_2507**_p*** ----
        #rm tracks_p853_250706_p511/*_blue_filtered.csv
        for f in ./tracks_p853_250706_p511/*.csv; do
            python -c "import pandas as pd; \
            df=pd.read_csv('$f', sep=';'); \
            df.drop(columns=['time bin 0']).to_csv('${f%.csv}_filtered.csv', sep=';', index=False)"
        done
        rm tracks_p853_250706_p511/*_binding_position_blue.csv
    
        for f in ./tracks_p853_250706_p502/*.csv; do
            python -c "import pandas as pd; \
            df=pd.read_csv('$f', sep=';'); \
            df.drop(columns=['time bin 0']).to_csv('${f%.csv}_filtered.csv', sep=';', index=False)"
        done
        rm tracks_p853_250706_p502/*_binding_position_blue.csv
    
        for f in ./tracks_p968_250702_p502/*.csv; do
            python -c "import pandas as pd; \
            df=pd.read_csv('$f', sep=';'); \
            df.drop(columns=['time bin 0']).to_csv('${f%.csv}_filtered.csv', sep=';', index=False)"
        done
        rm tracks_p968_250702_p502/*_binding_position_blue.csv
    
                    #hard-coding the path: file_list = glob.glob("tracks_p853_250706_p502/*_binding_position_blue_filtered.csv")
                    python 1_combine_kymographs.py
                    mkdir tracks_p853_250706_p502_averaged
                    mv tracks_p853_250706_p502/averaged_output.csv tracks_p853_250706_p502_averaged
                    #hard-coding the path: file_list = glob.glob("tracks_p853_250706_p502_averaged/averaged_output.csv")
                    python 1_combine_kymographs.py
                    mkdir tracks_p853_250706_p502_sum
                    mv tracks_p853_250706_p502/sum_output.csv tracks_p853_250706_p502_sum
                    #hard-coding the path: file_list = glob.glob("tracks_p853_250706_p502_sum/sum_output.csv")
                    python 1_combine_kymographs.py
    
                    python 2_summarize_binding_events.py tracks_p853_250706_p502_binding_events
                    mv summaries tracks_p853_250706_p502_binding_event_summaries
    
                    # --------------------  The following steps are deprecated --------------------
    
        #After adapting the dir PARAMETERS 'tracks_p853_250706_p511' in the following two python-scripts
        python overlap_v16_calculate_average_matix.py
        python overlap_v17_draw_plot.py
    
        cd tracks_p853_250706_p511/debug_outputs/
        cut -d';' -f1-4 averaged_output_events.csv >f1_4
        cut -d';' -f6-7 averaged_output_events.csv >f6_7
        paste -d';' f1_4 f6_7 > tracks_p853_250706_p511.csv
        mv averaged_output_events.png tracks_p853_250706_p511.png
    
        #After adapting the dir parameter 'tracks_p853_250706_p502' in the following two python-scripts
        python overlap_v16_calculate_average_matix.py
        python overlap_v17_draw_plot.py
    
        cd tracks_p853_250706_p502/debug_outputs/
        cut -d';' -f1-4 averaged_output_events.csv >f1_4
        cut -d';' -f6-7 averaged_output_events.csv >f6_7
        paste -d';' f1_4 f6_7 > tracks_p853_250706_p502.csv
        mv averaged_output_events.png tracks_p853_250706_p502.png
    
        #After adapting the dir parameter 'tracks_p968_250702_p502' in the following two python-scripts
        python overlap_v16_calculate_average_matix.py
        python overlap_v17_draw_plot.py
    
        cd tracks_p968_250702_p502/debug_outputs/
        cut -d';' -f1-4 averaged_output_events.csv >f1_4
        cut -d';' -f6-7 averaged_output_events.csv >f6_7
        paste -d';' f1_4 f6_7 > tracks_p968_250702_p502.csv
        mv averaged_output_events.png tracks_p968_250702_p502.png
    
    ~/Tools/csv2xls-0.4/csv_to_xls.py \
    tracks_p853_250706_p511/debug_outputs/tracks_p853_250706_p511.csv \
    tracks_p853_250706_p502/debug_outputs/tracks_p853_250706_p502.csv \
    tracks_p968_250702_p502/debug_outputs/tracks_p968_250702_p502.csv \
    -d$';' -o binding_events.xls;
    #Correct the sheet names in binding_events.xls.
    # --> The FINAL results are binding_events.xls and tracks_p***_25070*_p5**/debug_outputs/tracks_p***_25070*_p5**.png.
    
    # ------ END ------
    
    #2. Threshold-based detection: Apply a signal threshold (e.g., 0.16) to identify bound states at each position and time bin.
    
    #3. Merging neighboring events: Combine adjacent bound regions in both position and time.
    
    #4. Event filtering: Only merged events that satisfy the following conditions are reported as binding events and annotated in the plot:
    
        * min_duration = 1 # minimum number of time bins
        * min_range = 0.08 μm # minimum position spread for an event
    
    5. Event statistics: Calculate time bin ranges, durations, peak sizes, and average positions for each event.
    
    NEW_ADAPT: I have checked the files again, and those marked with 'my_...' are filtered by position and time. I sent you the unfiltered files so that you could create the nice heat maps that you made. So everything is correct. Could you do me two more favours?
    
            Firstly, would it be possible to send me the heat maps of the '..._averaged_output_events' without the red dot, ID, bins and position?
            Secondly, would it be possible to have a similar binding intensity scale for all three samples?
            #--> Done with the new python code. TODO: update the post in bioinformatics.cc.
            Thirdly, could you create the same heat maps, but with only the tracks above 5 seconds? That would be great.
                #Solution for the 3rd point: delete the second column of each input file, i.e. ignore the first time bin (time bin 0).
                #Additionally: delete the column "duration_bins" in averaged_output_events.csv and convert the CSV to an Excel file using csv2xls; TODO: check in the new output Excel files whether the start_bin and end_bin numbers shift. A sketch for enforcing a common intensity scale across all three samples (the 2nd point above) follows below.
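A hedged sketch for the common-scale request: load the per-group averaged matrices and pass the same vmin/vmax to imshow. The glob pattern and file layout are assumptions based on the tracks_*_averaged directories created above:

import glob
import pandas as pd
import matplotlib.pyplot as plt

# Assumed inputs: the per-group averaged_output.csv files produced by 1_combine_kymographs.py.
files = glob.glob("tracks_*_averaged/averaged_output.csv")
mats = {f: pd.read_csv(f, sep=';').iloc[:, 1:].values for f in files}
if mats:
    vmin = min(m.min() for m in mats.values())   # one shared colour scale for all samples
    vmax = max(m.max() for m in mats.values())
    for f, m in mats.items():
        plt.figure(figsize=(10, 6))
        plt.imshow(m.T, aspect='auto', origin='lower', cmap='viridis', vmin=vmin, vmax=vmax)
        plt.colorbar(label="Binding density (tracks)")
        plt.xlabel("Position index")
        plt.ylabel("Time bin")
        plt.savefig(f.replace(".csv", "_common_scale.png"), dpi=150, bbox_inches="tight")
        plt.close()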
  3. Used python scripts

    • ~/DATA/Data_Vero_Kymographs/Vero_Lakeview/1_filter_track.py
    • ~/DATA/Data_Vero_Kymographs/Vero_Lakeview/lake_files/2_generate_orig_lake_files.py
    • ~/DATA/Data_Vero_Kymographs/Vero_Lakeview/3_update_lake.py
    • ~/DATA/Data_Vero_Kymographs/Vero_Lakeview/4_merge_lake_files.py
    • ~/DATA/Data_Vero_Kymographs/Binding_positions/overlap_v16_calculate_average_matix.py (Deprecated)
    • ~/DATA/Data_Vero_Kymographs/Binding_positions/overlap_v17_draw_plot.py (Deprecated)
    • python 1_combine_kymographs.py
    • python 2_summarize_binding_events.py
  4. Source code of 1_combine_kymographs.py

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import glob
import os
from scipy.ndimage import label

# List of input CSV files
file_list = glob.glob("tracks_p853_250706_p502/*_binding_position_blue_filtered.csv")
#file_list = glob.glob("tracks_p853_250706_p502_averaged/averaged_output.csv")
#file_list = glob.glob("tracks_p853_250706_p502_sum/sum_output.csv")
print(f"Found {len(file_list)} CSV files: {file_list}")

# Ensure output directory exists
output_dir = os.path.dirname(file_list[0]) if file_list else "."
if not os.path.exists(output_dir):
    os.makedirs(output_dir)

# Collect all data
all_data = []
all_positions = []
for file in file_list:
    try:
        df = pd.read_csv(file, sep=';').rename(columns=lambda x: x.replace('# ', ''))
        positions = df['position (μm)'].values
        time_bins = df.columns[1:].values
        time_bin_data = df[df.columns[1:]].values
        all_positions.append(positions)
        all_data.append(time_bin_data)
    except Exception as e:
        print(f"Error processing {file} for data collection: {e}")
        continue

# Determine common position range and interpolate
min_pos = np.min([pos.min() for pos in all_positions if len(pos) > 0])
max_pos = np.max([pos.max() for pos in all_positions if len(pos) > 0])
max_len = max(len(pos) for pos in all_positions if len(pos) > 0)
common_positions = np.linspace(min_pos, max_pos, max_len)

# Interpolate each file's data
interpolated_data = []
for positions, time_bin_data in zip(all_positions, all_data):
    if len(positions) != time_bin_data.shape[0]:
        print("Warning: Mismatch in dimensions, skipping interpolation.")
        continue
    interpolated = np.zeros((time_bin_data.shape[1], len(common_positions)))
    for i in range(time_bin_data.shape[1]):
        interpolated[i] = np.interp(common_positions, positions, time_bin_data[:, i])
    interpolated_data.append(interpolated)

# Compute averaged data
if interpolated_data:
    averaged_data = np.mean(interpolated_data, axis=0)
else:
    print("No data available for averaging.")
    exit()

# Save averaged CSV
output_df = pd.DataFrame(averaged_data.T, columns=[f"time bin {i}" for i in range(averaged_data.shape[0])])
output_df.insert(0, 'position (μm)', common_positions)
output_csv_path = os.path.join(output_dir, 'averaged_output.csv')
output_df.to_csv(output_csv_path, sep=';', index=False)
print(f"Saved averaged output to {output_csv_path}")

# Compute summed data
if interpolated_data:
    sum_data = np.sum(interpolated_data, axis=0)
else:
    print("No data available for summing.")
    exit()

# Save summed CSV
output_df = pd.DataFrame(sum_data.T, columns=[f"time bin {i}" for i in range(sum_data.shape[0])])
output_df.insert(0, 'position (μm)', common_positions)
output_csv_path = os.path.join(output_dir, 'sum_output.csv')
output_df.to_csv(output_csv_path, sep=';', index=False)
print(f"Saved summed output to {output_csv_path}")

def plot_combined_kymograph(matrix, positions, output_dir, filename, title="Combined Kymograph"):
    """
    Plot combined kymograph for either averaged or summed data.

    Parameters
    ----------
    matrix : np.ndarray
        2D array of shape (num_time_bins, num_positions)
    positions : np.ndarray
        1D array of position coordinates
    output_dir : str
        Directory to save the plot
    filename : str
        Name of the output PNG file
    title : str
        Plot title
    """
    if matrix.size == 0:
        print("Empty matrix, skipping plot.")
        return

    plt.figure(figsize=(10, 6))
    max_ptp = np.max([np.ptp(line) for line in matrix]) if matrix.size > 0 else 1
    padding = 0.1 * np.max(matrix) if np.max(matrix) > 0 else 1
    offset_step = max_ptp + padding
    num_bins = matrix.shape[0]

    for i in range(num_bins):
        line = matrix[i]
        offset = i * offset_step
        plt.plot(positions, line + offset, 'b-', linewidth=0.5)

    y_ticks = [i * offset_step for i in range(num_bins)]
    y_labels = [f"time bin {i}" for i in range(num_bins)]
    plt.yticks(y_ticks, y_labels)
    plt.xlabel('Position (μm)')
    plt.ylabel('Binding Density')
    plt.title(title)

    output_path = os.path.join(output_dir, filename)
    plt.savefig(output_path, facecolor='white', edgecolor='none')
    plt.close()
    print(f"Saved combined kymograph plot to {output_path}")

# =============================
# Example usage
# =============================
# Plot averaged matrix
plot_combined_kymograph(averaged_data, common_positions, output_dir, 'combined_average_kymograph.png',
                        title="Averaged Binding Density Across All Files")

# Plot summed matrix
plot_combined_kymograph(sum_data, common_positions, output_dir, 'combined_sum_kymograph.png',
                        title="Summed Binding Density Across All Files")

# ======================================================
# Binding Event Detection + Plotting
# ======================================================

#signal_threshold = 0.2   # min photon density
#min_duration = 1         # min time bins
#min_range = 0.1         # μm, min position spread for an event
signal_threshold = 0.16   # min photon density
min_duration = 1         # min time bins
min_range = 0.08         # μm, min position spread for an event

for file in file_list:
    try:
        df = pd.read_csv(file, sep=';').rename(columns=lambda x: x.replace('# ', ''))
        positions = df['position (μm)'].values
        time_bins = df.columns[1:]
        photons_matrix = df[time_bins].values

        # Step 1: Binary mask
        mask = photons_matrix > signal_threshold

        # Step 2: Connected components
        labeled, num_features = label(mask, structure=np.ones((3, 3)))

        events = []
        for i in range(1, num_features + 1):
            coords = np.argwhere(labeled == i)
            pos_idx = coords[:, 0]
            time_idx = coords[:, 1]

            start_bin = time_idx.min()
            end_bin = time_idx.max()
            duration = end_bin - start_bin + 1

            pos_range = positions[pos_idx].max() - positions[pos_idx].min()

            # Apply filters
            if duration < min_duration or pos_range < min_range:
                continue

            avg_pos = positions[pos_idx].mean()
            peak_size = photons_matrix[pos_idx, time_idx].max()

            events.append({
                "start_bin": int(start_bin),
                "end_bin": int(end_bin),
                "duration_bins": int(duration),
                "pos_range": float(pos_range),
                "avg_position": float(avg_pos),
                "peak_size": float(peak_size)
            })

        # Print results
        print(f"\nBinding events for {os.path.basename(file)}:")
        if not events:
            print("No binding events detected.")
        else:
            for ev in events:
                print(
                    f" - Time bins {ev['start_bin']}–{ev['end_bin']} "
                    f"(duration {ev['duration_bins']}), "
                    f"Pos range: {ev['pos_range']:.2f} μm, "
                    f"Peak: {ev['peak_size']:.2f}, "
                    f"Avg pos: {ev['avg_position']:.2f} μm"
                )

        # Plot heatmap with detected events
        plt.figure(figsize=(10, 6))
        plt.imshow(
            photons_matrix.T,
            aspect='auto',
            origin='lower',
            extent=[positions.min(), positions.max(), 0, len(time_bins)],
            cmap='viridis'
        )
        plt.colorbar(label="Binding density (tracks)")  #Photon counts
        plt.xlabel("Position (μm)")
        plt.ylabel("Time bin")
        plt.title(f"Kymograph with Binding Events: {os.path.basename(file)}")

        # Annotate events on plot
        #for ev in events:
        #    mid_bin = (ev["start_bin"] + ev["end_bin"]) / 2
        #    plt.scatter(ev["avg_position"], mid_bin, color="red", marker="o", s=40)
        #    plt.text(
        #        ev["avg_position"], mid_bin + 0.5,
        #        f"{ev['duration_bins']} bins, {ev['pos_range']:.2f} μm",
        #        color="white", fontsize=7, ha="center"
        #    )

        out_path = os.path.join(output_dir, os.path.basename(file).replace(".csv", "_events.png"))
        plt.savefig(out_path, dpi=150, bbox_inches="tight")
        plt.close()
        print(f"Saved event plot to {out_path}")

    except Exception as e:
        print(f"Error processing {file}: {e}")
        continue
  5. Source code of 2_summarize_binding_events.py
import pandas as pd
import glob
import os
import argparse

# =============================
# Command-line arguments
# =============================
parser = argparse.ArgumentParser(description="Summarize binding-event CSV files in a folder.")
parser.add_argument("input_dir", type=str, help="Input directory containing CSV files")
args = parser.parse_args()
input_dir = args.input_dir

# =============================
# Prepare file list and output
# =============================
file_list = glob.glob(os.path.join(input_dir, "*.csv"))
print(f"Found {len(file_list)} CSV files in {input_dir}")

output_dir = "summaries"
os.makedirs(output_dir, exist_ok=True)

# Excel writer
excel_path = os.path.join(output_dir, "summary_all.xlsx")
with pd.ExcelWriter(excel_path, engine="xlsxwriter") as writer:

    for file in file_list:
        try:
            # Load CSV (ignore header lines with "#")
            df = pd.read_csv(file, sep=";", comment="#", header=None)

            # Assign proper column names
            df.columns = [
                "track index",
                "time (pixels)",
                "coordinate (pixels)",
                "time (seconds)",
                "position (um)",
                "minimum observable duration (seconds)"
            ]

            # Group by track index
            summaries = []
            for track, group in df.groupby("track index"):
                summary = {
                    "track index": track,
                    "n_events": len(group),
                    "time_start_s": round(group["time (seconds)"].min(), 3),
                    "time_end_s": round(group["time (seconds)"].max(), 3),
                    "duration_s": round(group["time (seconds)"].max() - group["time (seconds)"].min(), 3),
                    "pos_min_um": round(group["position (um)"].min(), 3),
                    "pos_max_um": round(group["position (um)"].max(), 3),
                    "pos_range_um": round(group["position (um)"].max() - group["position (um)"].min(), 3),
                    "pos_mean_um": round(group["position (um)"].mean(), 3)
                }
                summaries.append(summary)

            summary_df = pd.DataFrame(summaries)

            # Save each CSV summary individually
            out_csv_path = os.path.join(output_dir, os.path.basename(file).replace(".csv", "_summary.csv"))
            summary_df.to_csv(out_csv_path, sep=";", index=False, float_format="%.3f")
            print(f"Saved summary CSV for {file} -> {out_csv_path}")

            # Write summary to Excel sheet
            sheet_name = os.path.basename(file).replace(".csv", "")
            summary_df.to_excel(writer, sheet_name=sheet_name[:31], index=False, float_format="%.3f")

        except Exception as e:
            print(f"Error processing {file}: {e}")
            continue

print(f"Saved all summaries to Excel file -> {excel_path}")

Most anticipated top AI chatbots of 2025 (with user chat links)

Qwen3-Max vs Qwen3-Coder: Which to Use?

Feature | Qwen3-Max | Qwen3-Coder
Primary purpose | General-purpose reasoning and multi-step problem solving | Specialized for code understanding, generation, and debugging
Strengths | Broad knowledge across domains; strong logical reasoning; ideal for scientific interpretation, experimental design, or literature synthesis | Trained on vast code corpora (including R, Python, etc.); understands S3/S4 objects and Bioconductor conventions; excels at fixing syntax, data-structure mismatches, and debugging errors
Best for | Interpreting biological results; designing analysis pipelines; answering conceptual questions | Debugging R/Bioconductor code; fixing phyloseq/MPSE object issues; writing correct, idiomatic code
Code understanding | Good | Excellent (fine-tuned for programming tasks)
Recommended use case | “Why do my alpha diversity values differ between groups?” | “Why is mpse_abund$Group full of `…` values?”

💡 For debugging R code (e.g., MicrobiotaProcess, phyloseq, S4 objects) → use Qwen3-Coder.
For broader scientific reasoning alongside code → consider Qwen3-Max after fixing the code with Qwen3-Coder.

Most anticipated top AI chatbots of 2025 (with user chat links)

Rank | Name | Developer | Estimated parameters | Chat interface | Notes
1 | ChatGPT (GPT-5) | OpenAI (USA) | ~175B–1T+ | https://chat.openai.com | Free tier available; Plus users get priority access to GPT-4/GPT-5
2 | Claude | Anthropic (USA) | Claude 3 Opus: ~175B–500B | https://claude.ai | Free + Pro; file upload and long-context support; unavailable in some regions (see notes)
3 | Gemini | Google (USA) | Gemini Ultra: ~1T+ (sparse model) | https://gemini.google.com | Free; Google account required
4 | Grok | xAI (Elon Musk) | Grok-1: 314B; Grok-3: ~500B–1T? | https://grok.com/ ; https://x.com/grok | X Premium+ subscribers only (approx. $16/month)
5 | Qwen (Tongyi Qianwen) | Alibaba (China) | Qwen-Max: ~100B–200B; Qwen2-72B: 72B (open source) | https://chat.qwen.ai ; https://tongyi.aliyun.com/qwen | Free after registration; the web page may currently show a region-restriction error (see notes)
6 | DeepSeek | DeepSeek (China) | DeepSeek-V3: ~236B (MoE); DeepSeek-Coder: 33B | https://deepseek.com | Free web version + app; supports DeepSeek-V3 and Coder
7 | Kimi | Moonshot AI (China) | est. 200B+ | https://kimi.moonshot.cn | Free; supports very long document uploads
8 | HunYuan | Tencent (China) | HunYuan-Large: ~100B+ | no public chat interface | Only via the Tencent Cloud API or internal products such as WeChat; enterprise entry point
9 | Llama series | Meta (USA) | Llama 3: 8B / 70B (official); Llama 4 (expected): 400B+ | no official chat interface | Available via third-party platforms (e.g., Perplexity Labs, Hugging Face Chat)
10 | Mistral Chat | Mistral AI (France) | Mistral Large: ~123B | https://chat.mistral.ai | Free; supports Mistral Large and other models
11 | ERNIE Bot (Wenxin Yiyan) | Baidu (China) | ERNIE Bot 4.5: ~100B–200B | https://yiyan.baidu.com | Free registration; web and app
12 | 360 Zhinao | 360 Group (China) | est. 100B+ | https://ai.360.cn | Free; integrated into 360 products such as the 360 Secure Browser
13 | Zhipu Qingyan (ChatGLM) | Zhipu AI (China) | GLM-4: ~100B | https://chatglm.cn | Free registration; GLM-4 chat
14 | Perplexity | Perplexity AI (USA) | based on Claude / GPT / in-house models, est. 70B–200B | https://perplexity.ai | Free + Pro; focused on search with citations
15 | Pi | Inflection AI (USA) | Inflection-2: ~100B+ | https://heypi.com | New sign-ups paused (as of mid-2024); may be folded into Microsoft Copilot
16 | You.com | You.com (USA) | YouLM: ~70B | https://you.com | Free; login required
17 | Command R+ | Cohere (Canada) | Command R+: ~100B | https://cohere.com/chat | Some features require an application; free trial available
18 | Fireworks AI Playground | Fireworks AI (USA) | various models (Llama 3, Mixtral, in-house), 7B–175B | https://app.fireworks.ai/playground | Developer-friendly; regular users can also try the chat models
19 | Tongyi Tingwu | Alibaba (China) | based on Qwen, same as Qwen-Max (~100B–200B) | https://tingwu.aliyun.com | Focused on speech transcription and meeting summaries; Alibaba Cloud account required
20 | DeepSeek-Coder | DeepSeek (China) | 33B (official, open source) | https://deepseek.com/coder | Integrated into the main DeepSeek site; supports code Q&A

💡 Notes

  • “Parameters” refers to total parameters; for MoE (mixture-of-experts) models, active parameters are noted where known.
  • OpenAI, Google, Anthropic, and similar companies usually do not publish exact parameter counts; the figures are community estimates.
  • For the Chinese models (Qwen, GLM, Kimi), some figures come from official GitHub repositories or technical reports.
  • According to the knowledge base, chat.qwen.ai and tongyi.aliyun.com/qwen currently return a location-error page and may only be reachable from mainland China.
  • Claude is unavailable in some countries/regions (the “App unavailable” notice).
  • Tencent HunYuan has no public chat interface and is only available via enterprise API.

All 90 ski areas of Tirol and East Tirol – lifts, piste kilometres, and number of pistes

tirol_snow_card_gebiete_uebersicht_saisonkarte_2020-2021

https://www.snowcard.tirol.at/


A

Name Adresse Lifte / Pisten (km) / Pistenanzahl Telefon Website
Alpbachtal – Ski Juwel Alpbachtal Wildschönau Alpbach 311, 6236 Alpbach 45 Lifte / 113 km / 79 Pisten +43 5336 52 33 SnowCard Detail
Amberglift Walchsee Kaiserweg 6, 6344 Walchsee 3 Lifte / 4 km / 5 Pisten +43 676 841 640 810 SnowCard Detail
Axamer Lizum Lizum 6, 6094 Axam 9 Lifte / 40.7 km / 32 Pisten +43 5234 68240 SnowCard Detail

B

Name Adresse Lifte / Pisten (km) / Pistenanzahl Telefon Website
Bergbahnen Pillersee – Buchensteinwand – St. Ulrich Buchenstein 13, 6393 St. Ulrich am Pillersee 6 Lifte / 25 km / 21 Pisten +43 5354 77077 SnowCard Detail
Bergeralm (Steinach am Brenner) Huebenweg 25, 6150 Steinach am Brenner 5 Lifte / 9 km / 8 Pisten +43 5272 6333 SnowCard Detail
Bergwelt Hahnenkamm (Höfen/Reutte) Bergbahnstraße 18, 6604 Höfen 3 Lifte / 16 km / 12 Pisten +43 5672 62420 SnowCard Detail
Berwang Berwang 120, 6622 Berwang 7 Lifte / 40 km / 30 Pisten +43 5674 8124 SnowCard Detail
Bichllifte Prägraten St. Andrä 35a, 9974 Prägraten am Großvenediger 2 Lifte / 3 km / 4 Pisten +43 50 212 530 SnowCard Detail

C

Name Adresse Lifte / Pisten (km) / Pistenanzahl Telefon Website
Christlum Achenkirch Achenkirch 115, 6215 Achenkirch 7 Lifte / 24 km / 18 Pisten +43 5246 6309 SnowCard Detail
Collalbo – Ritten Dorf 15, 39054 Klobenstein (Südtirol) 3 Lifte / 5 km / 5 Pisten +39 0471 356606 SnowCard Detail

D

Name Adresse Lifte / Pisten (km) / Pistenanzahl Telefon Website
Diedamskopf 6886 Schoppernau, Vorarlberg 10 Lifte / 24 km / 20 Pisten +43 5515 4110 SnowCard Detail

E

Name Adresse Lifte / Pisten (km) / Pistenanzahl Telefon Website
Elferlifte Neustift Moos 12, 6167 Neustift im Stubaital 7 Lifte / 15 km / 14 Pisten +43 5226 2270 SnowCard Detail
Ehrwalder Alm Dr. Ludwig Ganghofer-Str. 66, 6632 Ehrwald 8 Lifte / 26 km / 20 Pisten +43 5673 2468 SnowCard Detail
Ehrwalder Wettersteinbahnen Sonnenhang 1, 6632 Ehrwald 6 Lifte / 12 km / 11 Pisten +43 5673 2501 SnowCard Detail

F

Name Adresse Lifte / Pisten (km) / Pistenanzahl Telefon Website
Fendels – Kaunertal Fendels 62, 6528 Fendels 6 Lifte / 25 km / 18 Pisten +43 5472 6686 SnowCard Detail
Fieberbrunn Lindau 17, 6391 Fieberbrunn 24 Lifte / 44 km / 41 Pisten +43 5354 56333 SnowCard Detail
Finkenberg (Zillertal 3000) Persal 200, 6292 Finkenberg 11 Lifte / 22 km / 21 Pisten +43 5285 62776 SnowCard Detail
Fügen – Spieljochbahn Hochfügenerstraße 77, 6263 Fügen 10 Lifte / 22 km / 12 Pisten +43 5288 62991 SnowCard Detail
Fulpmes – Schlick 2000 Tschaffinis Umgebung 26, 6166 Fulpmes 11 Lifte / 22 km / 21 Pisten +43 5225 62276 SnowCard Detail

G

Name Adresse Lifte / Pisten (km) / Pistenanzahl Telefon Website
Galtür – Silvapark Silvaparkstraße 10, 6563 Galtür 9 Lifte / 43 km / 35 Pisten +43 5445 536 SnowCard Detail
Gerlos – Zillertal Arena Gerlos Nr. 306, 6281 Gerlos 52 Lifte / 150 km / 89 Pisten +43 5284 330 SnowCard Detail

H

Name Adresse Lifte / Pisten (km) / Pistenanzahl Telefon Website
Hintertuxer Gletscher Hintertux 794, 6293 Tux 20 Lifte / 64 km / 29 Pisten +43 5287 87246 SnowCard Detail
Hochfügen – Hochzillertal Hochfügen, 6263 Fügen 41 Lifte / 88 km / 38 Pisten +43 5288 62991 SnowCard Detail
Hochimst Hoch Imst 19, 6460 Imst 8 Lifte / 9 km / 9 Pisten +43 5412 620 SnowCard Detail

I

Name Adresse Lifte / Pisten (km) / Pistenanzahl Telefon Website
Ischgl – Silvretta Arena Silvrettaplatz 2, 6561 Ischgl 46 Lifte / 239 km / 91 Pisten +43 5444 5220 SnowCard Detail
Innsbruck – Patscherkofel Patscherkofelstraße 22, 6081 Igls 9 Lifte / 18 km / 19 Pisten +43 512 390607 SnowCard Detail

K

Name Adresse Lifte / Pisten (km) / Pistenanzahl Telefon Website
Kaunertaler Gletscher Gletscherstraße 240, 6524 Feichten im Kaunertal 8 Lifte / 53 km / 22 Pisten +43 5472 6686 SnowCard Detail
Kappl Au 483, 6555 Kappl 10 Lifte / 40 km / 40 Pisten +43 5448 5221 SnowCard Detail
Kitzbühel Hahnenkammstraße 1a, 6370 Kitzbühel 57 Lifte / 179 km / 85 Pisten +43 5356 62360 SnowCard Detail
Kühtai Kühtai 19, 6176 Kühtai 11 Lifte / 41 km / 23 Pisten +43 5239 505 SnowCard Detail

L

Name Adresse Lifte / Pisten (km) / Pistenanzahl Telefon Website
Lermoos – Grubigstein Juch 3, 6631 Lermoos 6 Lifte / 22 km / 15 Pisten +43 5673 2520 SnowCard Detail
Lechtal – Warth / Schröcken Warth 210, 6767 Warth 15 Lifte / 60 km / 44 Pisten +43 5584 202 SnowCard Detail

M

Name Adresse Lifte / Pisten (km) / Pistenanzahl Telefon Website
Mayrhofen (Penken / Ahorn / Eggalm) Ahornstraße 1, 6290 Mayrhofen 57 Lifte / 159 km / 88 Pisten +43 5285 62200 SnowCard Detail
Mösern-Bergbahnen Mösern 42, 6105 Seefeld 4 Lifte / 5 km / 5 Pisten +43 5212 2501 SnowCard Detail

N

Name Adresse Lifte / Pisten (km) / Pistenanzahl Telefon Website
Nauders – Bergkastel Dorfstraße 25, 6543 Nauders 13 Lifte / 70 km / 35 Pisten +43 5473 5202 SnowCard Detail
Niederau – Wildschönau Dorf 22, 6314 Niederau 10 Lifte / 43 km / 30 Pisten +43 5339 8284 SnowCard Detail

O

Name Adresse Lifte / Pisten (km) / Pistenanzahl Telefon Website
Obergurgl – Hochgurgl Gurglerstraße 93, 6456 Obergurgl 25 Lifte / 112 km / 50 Pisten +43 5256 6230 SnowCard Detail
Oetz – Hochoetz / Kühtai Dorfstraße 48, 6433 Oetz 15 Lifte / 44 km / 26 Pisten +43 5252 6232 SnowCard Detail

P

Name Adresse Lifte / Pisten (km) / Pistenanzahl Telefon Website
Pillersee – St. Jakob Dorf 7, 6392 St. Jakob i.H. 6 Lifte / 25 km / 21 Pisten +43 5354 56333 SnowCard Detail
Pitztal – Hochzeiger Liß 9, 6474 Jerzens 9 Lifte / 40 km / 21 Pisten +43 5413 2352 SnowCard Detail
Pitztaler Gletscher – Rifflsee Rifflsee 1, 6481 St. Leonhard im Pitztal 12 Lifte / 44 km / 28 Pisten +43 5413 2352 SnowCard Detail

R

Name Adresse Lifte / Pisten (km) / Pistenanzahl Telefon Website
Rosshütte – Seefeld Talstation 419, 6100 Seefeld 12 Lifte / 22 km / 21 Pisten +43 5212 2501 SnowCard Detail
Rinn / Rangger Köpfl Rinn 21, 6074 Rinn 6 Lifte / 10 km / 10 Pisten +43 512 62746 SnowCard Detail

S

Name Adresse Lifte / Pisten (km) / Pistenanzahl Telefon Website
Schlick 2000 Tschaffinis Umgebung 26, 6166 Fulpmes 11 Lifte / 22 km / 21 Pisten +43 5225 62276 SnowCard Detail
Serfaus – Fiss – Ladis Dorf 16
Sölden Sölden 5, 6450 Sölden 33 Lifte / 144 km / 31 Pisten +43 5254 2265 SnowCard Detail
Stubaier Gletscherbahnen Talstation, 6167 Neustift 20 Lifte / 42 km / 26 Pisten +43 5226 8141 SnowCard Detail
Steinplatte – Waidring Steinplatte 1, 6384 Waidring 9 Lifte / 24 km / 18 Pisten +43 5354 200 SnowCard Detail

T

Name Adresse Lifte / Pisten (km) / Pistenanzahl Telefon Website
Thiersee – Skiarena Thierseestraße 25, 6335 Thiersee 4 Lifte / 6 km / 7 Pisten +43 5376 2211 SnowCard Detail
Tulfes / Glungezer Tulfes 100, 6166 Tulfes 5 Lifte / 7 km / 8 Pisten +43 5225 620 SnowCard Detail

U–Z

Name Adresse Lifte / Pisten (km) / Pistenanzahl Telefon Website
Venet – Landeck / Zams Venetstraße 12, 6500 Landeck 8 Lifte / 21 km / 15 Pisten +43 5442 6328 SnowCard Detail
Wildschönau – Niederau / Auffach Dorf 22, 6314 Niederau 10 Lifte / 43 km / 30 Pisten +43 5339 8284 SnowCard Detail
Zillertal Arena – Gerlos / Zell am Ziller / Königsleiten Gerlos 306, 6281 Gerlos 52 Lifte / 150 km / 89 Pisten +43 5284 330 SnowCard Detail

Bauder LIQUITEC waterproofing system – materials and tools

Materials (Materialien)

  1. Bauder LIQUITEC PU-D (2.5 kg)

    • Type: single-component liquid polyurethane waterproofing coating
    • Use: waterproofing roofs, balconies, and terraces
    • Function: forms a seamless, highly elastic waterproof membrane
  2. Bauder LIQUITEC Vlies PV165 (0.25 m x 4.0 m)

    • Type: reinforcing fleece/mesh
    • Use: applied together with PU-D
    • Function: reinforces the waterproofing layer and prevents cracking
  3. Bauder LIQUITEC RG 0.25 l

    • Type: primer / surface treatment
    • Use: improves adhesion of the PU-D coating to the substrate
    • Function: prevents the substrate from absorbing too much coating, keeping the film thickness even
  4. Bauder Primer Kunststoff

    • Type: primer for plastic surfaces
    • Use: for plastic or non-absorbent substrates
    • Function: improves adhesion of the PU-D coating

Tools (Werkzeuge)

  1. Rührholz (stirring stick)

    • Use: stir the PU-D or primer until homogeneous
  2. Pinsel (brush)

    • Use: apply PU-D, waterproofing coating, or primer
  3. Schleifpapier (sandpaper)

    • Use: sand the substrate or plastic surface to improve adhesion
  4. Einweghandschuhe (disposable gloves)

    • Use: protect the hands from contact with the coating
  5. Zimmermannsbleistift (carpenter's pencil)

    • Use: mark working areas or measurement lines
  6. Reinigungstuch (cleaning cloth)

    • Use: remove dust, stains, or excess coating
  7. Verlegeanleitung (installation instructions)

    • Use: detailed description of the work steps, quantities, and safety notes

Work sequence (Arbeitsablauf)

  1. Clean the substrate → sand → prime (RG or Primer Kunststoff)
  2. Apply the first PU-D coat → lay the Vlies PV165 → apply the second PU-D coat
  3. Clean the tools afterwards

🏔️ Zillertal & surroundings – ski areas, parking, and accommodation

https://de.wikipedia.org/wiki/Bezirk_Schwaz

  • Karte_A_Tirol_SZ.svg
  • Municipalities_Bezirk_Schwaz.svg

http://www.skimap.info/europe/austria/tirol/

https://prime-skiing.de/wp-content/uploads/2021/09/tirol_snow_card_gebiete_uebersicht_saisonkarte_2020-2021.jpg

https://www.apoplous.blog/the-alps/

Map-of-ski-resorts-in-the-Alps
tirol_snow_card_gebiete_uebersicht_saisonkarte_2020-2021
ski_map_tirol
  • Alpbachtal
  • Axamer Lizum
  • Elfer lifte
  • Fendels
  • Fieberbrunn
  • Fügen-Spieljochbahn
  • Hintertux
  • Hochfügen-Hochzillertal
  • Ischgl
  • Kaunertaler Gletscher
  • Kitzbühel – Kirchberg
  • Kühtai
  • Mayrhofen
  • Nauders
  • Obergurgl – Hochgurgl
  • Oetz
  • Pillersee, Sankt Jakob
  • Pitztal – Hochzeiger
  • Pitztaler Gletscher – Rifflsee
  • Schlick 2000
  • Serfaus – Fiss – Ladis
  • Serles Bahnen
  • Silvapark – Galtür
  • Sölden
  • Steinplatte – Waidring
  • Stubaier Gletscherbahnen
  • Venet – Landeck, Zams
  • Wildschoenau – Niederau, Auffach
  • Zillertal Arena
    • Gerlos
    • Zell am Ziller
    • Königsleiten

🏔️ Zillertal & surroundings – ski areas, parking, and accommodation


🎿 Driving times & parking, with map links

Ski area | Address | Parking | Driving time from Gallzein | Map link
Spieljoch – Fügen | Hochfügener Straße 400, 6263 Fügen, Austria | Free and ample parking directly at the valley station | approx. 40–50 min | Google Maps: Spieljochbahn I Talstation
Kaltenbach – Hochzillertal/Hochfügen (Ski-Optimal) | Kaltenbach 145, 6272 Kaltenbach, Austria | Large parking garage at the valley station; Hochfügen: parking at the lift bases | approx. 45–60 min | Google Maps: Kaltenbach Talstation Hochzillertal
Zillertal Arena – Zell am Ziller / Gerlos / Königsleiten / Hochkrimml | Rohr 23 & Rohr 26a, 6280 Zell am Ziller, Austria | Paved areas and parking garages, partly free of charge | approx. 30–60 min | Google Maps: Zell am Ziller Rosenalmbahn Talstation
Mayrhofen – Penken / Ahorn / Rastkogel / Eggalm (Mountopolis) | Ahornstraße 853, 6290 Mayrhofen, Austria | Large areas at the Ahornbahn (partly paid); few spaces directly at the Penkenbahn → better to park at the Ahorn | approx. 50–65 min | Google Maps: Ahornbahn Talstation, Mayrhofen
Hintertuxer Gletscher | Hintertux 794, 6294 Tux, Austria | Open-air parking + garages, EV charging available | approx. 60–75 min | Google Maps: Talstation Hintertuxer Gletscher

🎿 Ski areas in the Zillertal

The Zillertal region in Tirol offers a wide choice of ski areas. Access is via well-built main roads without any notable gradients, and parking is usually plentiful (parking garages, paved lots, often free of charge).


🚗 Driving times & parking

Ski area | Address | Parking | Driving time from Gallzein
Spieljoch – Fügen | Hochfügener Straße 77, 6263 Fügen | Free parking directly at the valley station | approx. 40–50 min
Kaltenbach – Hochzillertal / Hochfügen (Ski-Optimal) | Postfeldstraße 7, 6272 Kaltenbach (parking garage) / Hochfügen Talboden | Large parking garage (free); Hochfügen: free parking at the lifts | approx. 45–60 min
Zillertal Arena – Zell am Ziller / Gerlos / Königsleiten / Hochkrimml | Rosenalmbahn: Rohr 23, Zell am Ziller / Karspitzbahn: Rohr 26a, Zell am Ziller | Paved areas, parking garages, partly free of charge | approx. 30–60 min
Mayrhofen – Penken / Ahorn / Rastkogel / Eggalm (Mountopolis) | Ahornstraße 853, 6290 Mayrhofen | Parking at the Ahornbahn (partly paid); hardly any parking at the Penkenbahn | approx. 50–65 min
Hintertuxer Gletscher | Hintertux 794, 6294 Tux | Open-air parking + garages, EV charging | approx. 60–75 min

📌 Overview maps & piste maps


🏡 Chalet Rastenhof – accommodation in Gallzein

  • Address: Niederleiten 28, 6222 Gallzein
  • Website: rastenhof.at
  • Size: 280 m², suitable for up to 14 people

Facilities

  • 5 bedrooms, 3 bathrooms (showers + bathtub)
  • Sauna, terrace, balcony, garden, barbecue area
  • Private kitchen with dishwasher, oven, coffee machine, etc.
  • Child-friendly: high chairs, playground, cots on request
  • Living room with sofa, flat-screen TV, minibar

Payment

  • 10% deposit by bank transfer within 7 days of booking
  • Balance by bank transfer 2 weeks before arrival

🌄 The Gallzein & Karwendel region

  • Gallzein is a small, quiet village at about 800 m altitude with roughly 700 inhabitants.
  • It lies in the Silberregion Karwendel, close to the Inn valley and the Zillertal.
  • Ideal for nature holidays, hiking, and relaxation.

📌 Further excursion destinations:


Conclusion
The Zillertal offers ideal conditions for winter sports, with modern ski areas and good parking. Chalet Rastenhof in Gallzein complements this perfectly as a spacious, comfortable base for families and groups.

Austria's "Top 10" ski resorts

There is no single definition of Austria's "Top 10" ski resorts, because rankings use different criteria, such as ski-area size, snow reliability, family-friendliness, or altitude. Some of the most frequently mentioned resorts are:

  • Skicircus Saalbach Hinterglemm Leogang Fieberbrunn
  • SkiWelt Wilder Kaiser-Brixental
  • Serfaus-Fiss-Ladis
  • Sölden

Other popular resorts include Kitzbühel, Zillertal Arena, Ischgl, and Mayrhofen.


Top ski resorts at a glance (by different criteria)

Largest ski areas

  • SkiWelt Wilder Kaiser-Brixental: Austria's largest contiguous ski area, with 270 km of pistes.
  • Skicircus Saalbach Hinterglemm Leogang Fieberbrunn: very large and popular, with reliable snow.
  • Silvretta Arena Ischgl-Samnaun: known for reliable snow and extensive après-ski.

Most family-friendly ski areas

  • Serfaus-Fiss-Ladis: renowned for its family facilities, especially for children.
  • Zillertal Arena: also family-friendly, with a wide variety of pistes.

High-altitude ski areas with good snow

  • Sölden: two glaciers, reaching up to 3,340 m.
  • Hintertuxer Gletscher: skiing 365 days a year.
  • Mölltaler Gletscher & Kaunertaler Gletscher: among Austria's highest ski areas.

Well-known and popular ski areas

  • Kitzbühel (KitzSki): a famous and popular ski destination.
  • Mayrhofen (Zillertal): a skiers' paradise, known for its variety of pistes.
  • Ski Arlberg: a widely popular network of linked resorts spanning Tyrol and Vorarlberg.

Piste lengths at a glance

Ski area   Total pistes (km)   Easy (km)   Intermediate (km)   Advanced (km)   Elevation range (m)   Map link
SkiWelt Wilder Kaiser-Brixental   270   90   120   60   620–2,000   Map
Skicircus Saalbach Hinterglemm Leogang Fieberbrunn   270   80   150   40   840–2,096   Map
Serfaus-Fiss-Ladis   214   70   100   44   1,200–2,820   Map
Sölden   144   45   70   29   1,350–3,340   Map
Zillertal Arena   143   55   60   28   580–2,500   Map
Kitzbühel (KitzSki)   234   60   140   34   800–2,000   Map
Mayrhofen   136   30   70   36   630–2,500   Map
Ski Arlberg   305   90   150   65   1,300–2,811   Map
Silvretta Arena Ischgl-Samnaun   238   60   120   58   1,400–2,872   Map
Hintertuxer Gletscher   60   15   30   15   1,500–3,250   Map

Processing Data_Patricia_Transposon_2025 (Workflow for Structural Variant Calling in Nanopore Sequencing)

  1. Install Mambaforge from https://conda-forge.org/miniforge/ (recommended)

    #download Mambaforge-24.9.2-0-Linux-x86_64.sh from website
    chmod +x Mambaforge-24.9.2-0-Linux-x86_64.sh
    ./Mambaforge-24.9.2-0-Linux-x86_64.sh
    
    To activate this environment, use:
        micromamba activate /home/jhuang/mambaforge
    Or to execute a single command in this environment, use:
        micromamba run -p /home/jhuang/mambaforge mycommand
    installation finished.
    
    Do you wish to update your shell profile to automatically initialize conda?
    This will activate conda on startup and change the command prompt when activated.
    If you'd prefer that conda's base environment not be activated on startup,
      run the following command when conda is activated:
    
    conda config --set auto_activate_base false
    
    You can undo this by running `conda init --reverse $SHELL`? [yes|no]
    [no] >>> yes
    no change     /home/jhuang/mambaforge/condabin/conda
    no change     /home/jhuang/mambaforge/bin/conda
    no change     /home/jhuang/mambaforge/bin/conda-env
    no change     /home/jhuang/mambaforge/bin/activate
    no change     /home/jhuang/mambaforge/bin/deactivate
    no change     /home/jhuang/mambaforge/etc/profile.d/conda.sh
    no change     /home/jhuang/mambaforge/etc/fish/conf.d/conda.fish
    no change     /home/jhuang/mambaforge/shell/condabin/Conda.psm1
    no change     /home/jhuang/mambaforge/shell/condabin/conda-hook.ps1
    no change     /home/jhuang/mambaforge/lib/python3.12/site-packages/xontrib/conda.xsh
    no change     /home/jhuang/mambaforge/etc/profile.d/conda.csh
    modified      /home/jhuang/.bashrc
    ==> For changes to take effect, close and re-open your current shell. <==
    no change     /home/jhuang/mambaforge/condabin/conda
    no change     /home/jhuang/mambaforge/bin/conda
    no change     /home/jhuang/mambaforge/bin/conda-env
    no change     /home/jhuang/mambaforge/bin/activate
    no change     /home/jhuang/mambaforge/bin/deactivate
    no change     /home/jhuang/mambaforge/etc/profile.d/conda.sh
    no change     /home/jhuang/mambaforge/etc/fish/conf.d/conda.fish
    no change     /home/jhuang/mambaforge/shell/condabin/Conda.psm1
    no change     /home/jhuang/mambaforge/shell/condabin/conda-hook.ps1
    no change     /home/jhuang/mambaforge/lib/python3.12/site-packages/xontrib/conda.xsh
    no change     /home/jhuang/mambaforge/etc/profile.d/conda.csh
    no change     /home/jhuang/.bashrc
    No action taken.
    WARNING conda.common.path.windows:_path_to(100): cygpath is not available, fallback to manual path conversion
    WARNING conda.common.path.windows:_path_to(100): cygpath is not available, fallback to manual path conversion
    Added mamba to /home/jhuang/.bashrc
    ==> For changes to take effect, close and re-open your current shell. <==
    Thank you for installing Mambaforge!
    
    Close your terminal window and open a new one, or run:
    #source ~/mambaforge/bin/activate
    conda --version
    mamba --version
    
    https://github.com/conda-forge/miniforge/releases
    Note
    
        * After installation, please make sure that you do not have the Anaconda default channels configured.
            conda config --show channels
            conda config --remove channels defaults
            conda config --add channels conda-forge
            conda config --show channels
            conda config --set channel_priority strict
            #conda clean --all
            conda config --remove channels biobakery
    
        * !!!!Do not install anything into the base environment as this might break your installation. See here for details.!!!!
    
    # --Deprecated method: mamba installing on conda--
    #conda install -n base --override-channels -c conda-forge mamba 'python_abi=*=*cp*'
    #    * Note that installing mamba into any other environment than base is not supported.
    #
    #conda activate base
    #conda install conda
    #conda uninstall mamba
    #conda install mamba

2. Install the required tools in the mamba environment

    * Sniffles2: Detect structural variants, including transposons, from long-read alignments.
    * RepeatModeler2: Identify and classify transposons de novo.
    * RepeatMasker: Annotate known transposable elements using transposon libraries.
    * SVIM: An alternative structural variant caller optimized for long-read sequencing, if needed.
    * SURVIVOR: Consolidate structural variants across samples for comparative analysis.

    mamba deactivate
    # Create a new conda environment
    mamba create -n transposon_long python=3.6 -y

    # Activate the environment
    mamba activate transposon_long

    mamba install -c bioconda sniffles
    mamba install -c bioconda repeatmodeler repeatmasker

    # configure repeatmasker database
    mamba info --envs
    cd /home/jhuang/mambaforge/envs/transposon_long/share/RepeatMasker

    #mamba install python=3.6
    mamba install -c bioconda svim
    mamba install -c bioconda survivor
  1. Test the installed tools

    # Check versions
    sniffles --version
    RepeatModeler -h
    RepeatMasker -h
    svim --help
    SURVIVOR --help
    mamba install -c conda-forge perl r
  2. Data Preparation

    Raw Signal Data: Nanopore devices generate electrical signal data as DNA passes through the nanopore.
    Basecalling: Tools like Guppy or Dorado are used to convert raw signals into nucleotide sequences (FASTQ files).
  3. Preprocessing

    Quality Filtering: Remove low-quality reads using tools like Filtlong or NanoFilt.
    Adapter Trimming: Identify and remove sequencing adapters with tools like Porechop.
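    A minimal, hedged example of this step (flags from Filtlong/Porechop defaults; file names are placeholders):

        # keep reads >= 1 kb and the best 90 % of bases, then trim adapters
        filtlong --min_length 1000 --keep_percent 90 HD46_1.fastq.gz | gzip > HD46_1.filtlong.fastq.gz
        porechop -i HD46_1.filtlong.fastq.gz -o HD46_1.trimmed.fastq.gz --threads 8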
  4. (Optional) Variant Calling for SNP and Indel Detection:

    Tools like Medaka, Longshot, or Nanopolish analyze the aligned reads to identify SNPs and small indels.
  5. (OFFICIAL STARTING POINT) Alignment and structural variant calling: tools such as Sniffles or SVIM detect large insertions, deletions, and other structural variants from long-read alignments (e.g. dispersed repeat/transposon insertions).

      #NOTE that the ./batch1_depth25/trycycler_WT/reads.fastq and F24A430001437_BACctmoD/BGI_result/Separate/${sample}/1.Cleandata/${sample}.filtered_reads.fq.gz are the same!
    
      # -- PREPARE the input FASTQ data: merge the per-barcode fastq.gz chunks and move them to the top-level directory
    
      # Under raw_data/no_sample_id/20250731_0943_MN45170_FBD12615_97f118c2/fastq_pass
      zcat ./barcode01/FBD12615_pass_barcode01_97f118c2_aa46ecf7_0.fastq.gz ./barcode01/FBD12615_pass_barcode01_97f118c2_aa46ecf7_1.fastq.gz ./barcode01/FBD12615_pass_barcode01_97f118c2_aa46ecf7_2.fastq.gz ./barcode01/FBD12615_pass_barcode01_97f118c2_aa46ecf7_3.fastq.gz ... | gzip > HD46_1.fastq.gz
      mv ./raw_data/no_sample_id/20250731_0943_MN45170_FBD12615_97f118c2/fastq_pass/HD46_1.fastq.gz ~/DATA/Data_Patricia_Transposon_2025
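      # Hedged alternative: concatenating the per-barcode chunks directly is equivalent here,
      # since multi-member gzip streams are read transparently by downstream tools.
      #cat ./barcode01/FBD12615_pass_barcode01_*.fastq.gz > HD46_1.fastq.gz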
    
        #these are the corresponding sample names:
        #barcode 1: HD46-1
        #barcode 2: HD46-2
        #barcode 3: HD46-3
        #barcode 4: HD46-4
        mv barcode01.fastq.gz HD46_1.fastq.gz
        mv barcode02.fastq.gz HD46_2.fastq.gz
        mv barcode03.fastq.gz HD46_3.fastq.gz
        mv barcode04.fastq.gz HD46_4.fastq.gz
    
      # -- CALCULATE the coverages
        #!/bin/bash
    
        for bam in barcode*_minimap2.sorted.bam; do
            echo "Processing $bam ..."
            avg_cov=$(samtools depth -a "$bam" | awk '{sum+=$3; cnt++} END {if (cnt>0) print sum/cnt; else print 0}')
            echo -e "${bam}\t${avg_cov}" >> coverage_summary.txt
        done
    
      # ---- !!!! ACTIVATE the appropriate environment !!!! ----
    
      mamba activate transposon_long
    
      # -- TODO: AFTERNOON_DEBUG_THIS: FAILED and not USED: alignment and structural variant detection per sample using SVIM, which uses the ngmlr or minimap2 aligner
      #mamba install -c bioconda ngmlr
      mamba install -c bioconda svim
    
      # ---- Option_1: minimap2 (aligner) + SVIM (structural variant caller) --> SUCCESSFUL ----
      for sample in HD46_Ctrl HD46_1 HD46_2 HD46_3 HD46_4 HD46_5 HD46_6 HD46_7 HD46_8 HD46_13; do
          #INS,INV,DUP:TANDEM,DUP:INT,BND
          svim reads --aligner minimap2 --nanopore minimap2+svim_${sample}    ${sample}.fastq.gz CP020463.fasta  --cores 20 --types INS --min_sv_size 100 --sequence_alleles --insertion_sequences --read_names;
      done
    
      #svim alignment svim_alignment_minmap2_1_re 1.sorted.bam CP020463_.fasta --types INS --sequence_alleles --insertion_sequences --read_names
    
      # ---- Option_2: minimap2 (aligner) + Sniffles2 (structural variant caller) --> SUCCESSFUL ----
      #Minimap2: A commonly used aligner for nanopore sequencing data.
      #    Align Long Reads to the WT Reference using Minimap2
      #sniffles -m WT.sorted.bam -v WT.vcf -s 10 -l 50 -t 60
      #  -s 20: Requires at least 20 reads to support an SV for reporting. --> 10
      #  -l 50: Reports SVs that are at least 50 base pairs long.
      #  -t 60: Uses 60 threads for faster processing.
      for sample in HD46_Ctrl HD46_1 HD46_2 HD46_3 HD46_4 HD46_5 HD46_6 HD46_7 HD46_8 HD46_13; do
          #minimap2 --MD -t 60 -ax map-ont CP020463.fasta ./batch1_depth25/trycycler_${sample}/reads.fastq | samtools sort -o ${sample}.sorted.bam
          minimap2 --MD -t 60 -ax map-ont CP020463.fasta ${sample}.fastq.gz | samtools sort -o ${sample}_minimap2.sorted.bam
          samtools index ${sample}_minimap2.sorted.bam
          sniffles -m ${sample}_minimap2.sorted.bam -v ${sample}_minimap2+sniffles.vcf -s 10 -l 50 -t 60
          #QUAL < 20 ||
          bcftools filter -e "INFO/SVTYPE != 'INS'" ${sample}_minimap2+sniffles.vcf > ${sample}_minimap2+sniffles_filtered.vcf
      done
    
        #Estimating parameter...
        #        Max dist between aln events: 44
        #        Max diff in window: 76
        #        Min score ratio: 2
        #        Avg DEL ratio: 0.0112045
        #        Avg INS ratio: 0.0364027
        #Start parsing... CP020463
        #                # Processed reads: 10000
        #                # Processed reads: 20000
        #        Finalizing  ..
        #Start genotype calling:
        #        Reopening Bam file for parsing coverage
        #        Finalizing  ..
        #Estimating parameter...
        #        Max dist between aln events: 28
        #        Max diff in window: 89
        #        Min score ratio: 2
        #        Avg DEL ratio: 0.013754
        #        Avg INS ratio: 0.17393
        #Start parsing... CP020463
        #                # Processed reads: 10000
        #                # Processed reads: 20000
        #                # Processed reads: 30000
        #                # Processed reads: 40000
    
        # Results:
        # * barcode01_minimap2+sniffles.vcf
        # * barcode01_minimap2+sniffles_filtered.vcf
        # * barcode02_minimap2+sniffles.vcf
        # * barcode02_minimap2+sniffles_filtered.vcf
        # * barcode03_minimap2+sniffles.vcf
        # * barcode03_minimap2+sniffles_filtered.vcf
        # * barcode04_minimap2+sniffles.vcf
        # * barcode04_minimap2+sniffles_filtered.vcf
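        # Hedged sanity check (assumes bcftools is in PATH): confirm that only INS records remain
        # in each filtered VCF by tabulating the SVTYPE values
        #for vcf in *_minimap2+sniffles_filtered.vcf; do
        #    echo "== $vcf"
        #    bcftools query -f '%INFO/SVTYPE\n' "$vcf" | sort | uniq -c
        #done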
    
      #ERROR: No MD string detected! Check bam file! Otherwise generate using e.g. samtools. --> No results!
      #for sample in barcode01 barcode02 barcode03 barcode04; do
      #    sniffles -m svim_reads_minimap2_${sample}/${sample}.fastq.minimap2.coordsorted.bam -v sniffles_minimap2_${sample}.vcf -s 10 -l 50 -t 60
      #    bcftools filter -e "INFO/SVTYPE != 'INS'" sniffles_minimap2_${sample}.vcf > sniffles_minimap2_${sample}_filtered.vcf
      #done
    
      # ---- Option_3: NGMLR (aligner) + SVIM (structural variant caller) --> SUCCESSFUL ----
      for sample in HD46_Ctrl HD46_1 HD46_2 HD46_3 HD46_4 HD46_5 HD46_6 HD46_7 HD46_8 HD46_13; do
          svim reads --aligner ngmlr --nanopore    ngmlr+svim_${sample}       ${sample}.fastq.gz CP020463.fasta  --cores 10;
      done
    
      # ---- Option_4: NGMLR (aligner) + sniffles (structural variant caller) --> SUCCESSFUL ----
      for sample in HD46_Ctrl HD46_1 HD46_2 HD46_3 HD46_4 HD46_5 HD46_6 HD46_7 HD46_8 HD46_13; do
          sniffles -m ngmlr+svim_${sample}/${sample}.fastq.ngmlr.coordsorted.bam -v ${sample}_ngmlr+sniffles.vcf -s 10 -l 50 -t 60
          bcftools filter -e "INFO/SVTYPE != 'INS'" ${sample}_ngmlr+sniffles.vcf > ${sample}_ngmlr+sniffles_filtered.vcf
      done
  6. Compare and integrate all results produced by minimap2+sniffles and ngmlr+sniffles, and check each reported position in IGV!

        mv HD46_Ctrl_minimap2+sniffles_filtered.vcf HD46-Ctrl_minimap2+sniffles_filtered.vcf
        mv HD46_Ctrl_ngmlr+sniffles_filtered.vcf    HD46-Ctrl_ngmlr+sniffles_filtered.vcf
        mv HD46_1_minimap2+sniffles_filtered.vcf    HD46-1_minimap2+sniffles_filtered.vcf
        mv HD46_1_ngmlr+sniffles_filtered.vcf       HD46-1_ngmlr+sniffles_filtered.vcf
        mv HD46_2_minimap2+sniffles_filtered.vcf    HD46-2_minimap2+sniffles_filtered.vcf
        mv HD46_2_ngmlr+sniffles_filtered.vcf       HD46-2_ngmlr+sniffles_filtered.vcf
        mv HD46_3_minimap2+sniffles_filtered.vcf    HD46-3_minimap2+sniffles_filtered.vcf
        mv HD46_3_ngmlr+sniffles_filtered.vcf       HD46-3_ngmlr+sniffles_filtered.vcf
        mv HD46_4_minimap2+sniffles_filtered.vcf    HD46-4_minimap2+sniffles_filtered.vcf
        mv HD46_4_ngmlr+sniffles_filtered.vcf       HD46-4_ngmlr+sniffles_filtered.vcf
        mv HD46_5_minimap2+sniffles_filtered.vcf    HD46-5_minimap2+sniffles_filtered.vcf
        mv HD46_5_ngmlr+sniffles_filtered.vcf       HD46-5_ngmlr+sniffles_filtered.vcf
        mv HD46_6_minimap2+sniffles_filtered.vcf    HD46-6_minimap2+sniffles_filtered.vcf
        mv HD46_6_ngmlr+sniffles_filtered.vcf       HD46-6_ngmlr+sniffles_filtered.vcf
        mv HD46_7_minimap2+sniffles_filtered.vcf    HD46-7_minimap2+sniffles_filtered.vcf
        mv HD46_7_ngmlr+sniffles_filtered.vcf       HD46-7_ngmlr+sniffles_filtered.vcf
        mv HD46_8_minimap2+sniffles_filtered.vcf    HD46-8_minimap2+sniffles_filtered.vcf
        mv HD46_8_ngmlr+sniffles_filtered.vcf       HD46-8_ngmlr+sniffles_filtered.vcf
        mv HD46_13_minimap2+sniffles_filtered.vcf   HD46-13_minimap2+sniffles_filtered.vcf
        mv HD46_13_ngmlr+sniffles_filtered.vcf      HD46-13_ngmlr+sniffles_filtered.vcf
    
      conda activate plot-numpy1
      #python generate_common_vcf.py
      #mv common_variants.xlsx putative_transposons.xlsx
    
      # * Reads each of your VCFs.
      # * Filters variants → only keep those with FILTER == PASS.
      # * Compares the two aligner methods (minimap2+sniffles2 vs ngmlr+sniffles2) per sample.
      # * Keeps only variants that appear in both methods for the same sample.
      # * Outputs: An Excel file with the common variants and a log text file listing which variants were filtered out, and why (not_PASS or not_COMMON_in_two_VCF).
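      # The scripts generate_common_vcf.py / generate_fuzzy_common_vcf_v1.py are not reproduced here;
      # a minimal, hedged sketch of the per-sample fuzzy comparison they perform (PASS records from the
      # two aligners matched within a tolerance window on POS; file names below are examples):
      #
      #   def load_pass_positions(vcf_path):
      #       """Return the sorted POS values of all PASS records in a VCF."""
      #       pos = []
      #       with open(vcf_path) as fh:
      #           for line in fh:
      #               if line.startswith('#'):
      #                   continue
      #               parts = line.rstrip('\n').split('\t')
      #               if parts[6] == 'PASS':
      #                   pos.append(int(parts[1]))
      #       return sorted(pos)
      #
      #   def n_common(pos_a, pos_b, tol=50):
      #       # count PASS positions from method A that have a partner from method B within +/- tol bp
      #       return sum(any(abs(p - q) <= tol for q in pos_b) for p in pos_a)
      #
      #   mm = load_pass_positions("HD46-1_minimap2+sniffles_filtered.vcf")
      #   ng = load_pass_positions("HD46-1_ngmlr+sniffles_filtered.vcf")
      #   print("PASS_minimap2", len(mm), "PASS_ngmlr", len(ng), "COMMON", n_common(mm, ng))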
    
      #python generate_fuzzy_common_vcf_v1.py
      #Sample   PASS_minimap2   PASS_ngmlr  COMMON
      #  HD46-Ctrl_Ctrl 39  29  28
      #  HD46-1 39  32  29
      #  HD46-2 40  32  28
      #  HD46-3 38  30  27
      #  HD46-4 46  35  32
      #  HD46-5 40  35  31
      #  HD46-6 43  35  30
      #  HD46-7 40  33  28
      #  HD46-8 37  20  11
      #  HD46-13    39  38  27
    
    #Sample PASS_minimap2   PASS_ngmlr  COMMON_FINAL
    #HD46-Ctrl_Ctrl 39  29  6
    #HD46-1 39  32  8
    #HD46-2 40  32  8
    #HD46-3 38  30  6
    #HD46-4 46  35  8
    #HD46-5 40  35  9
    #HD46-6 43  35  10
    #HD46-7 40  33  8
    #HD46-8 37  20  4
    #HD46-13    39  38  5
    
    #!!!! Summarize the results of ngmlr+sniffles !!!!
    python merge_ngmlr+sniffles_filtered_results_and_summarize.py
    
    #!!!! Post-Processing !!!!
    #DELETE "2186168    N   

    . PASS” in Sheet HD46-13 and Summary #DELETE “2427785 N CGTCAGAATCGCTGTCTGCGTCCGAGTCACTGTCTGAGTCTGAATCACTATCTGCGTCTGAGTCACTGTCTG . PASS” due to “0/1:169:117” in HD46-13 and Summary #DELETE “2441640 N GCTCATTAAGAATCATTAAATTAC . PASS” due to 0/1:170:152 in HD46-13 and Summary

  7. Source code of merge_ngmlr+sniffles_filtered_results_and_summarize.py

    import os
    import pandas as pd
    
    # List all ngmlr VCF files
    vcf_files = sorted([f for f in os.listdir('.') if f.endswith('_ngmlr+sniffles_filtered.vcf')])
    
    log_lines = []
    df_list = []
    
    # Function to read VCF and filter PASS
    def read_vcf(vcf_path, log_lines, sample):
        variants = []
        with open(vcf_path) as f:
            for line in f:
                if line.startswith('#'):
                    continue
                parts = line.strip().split('\t')
                pos, ref, alt, qual, flt = parts[1], parts[3], parts[4], parts[5], parts[6]
                info = parts[7] if len(parts) > 7 else '.'
                fmt = parts[8] if len(parts) > 8 else '.'
                last_column = parts[9] if len(parts) > 9 else '.'  # Keep original column
                var_id = f"{pos}:{ref}>{alt}"
                if flt != 'PASS':
                    log_lines.append(f"{sample}\t{var_id}\tnot_PASS")
                    continue
                variants.append({
                    'POS': int(pos),
                    'REF': ref,
                    'ALT': alt,
                    'QUAL': qual,
                    'FILTER': flt,
                    'INFO': info,
                    'FORMAT': fmt,
                    sample: last_column,  # Use only genotype for individual sheets
                    f"{sample}_with_alt": alt  # Use only ALT sequence for summary
                })
        return pd.DataFrame(variants)
    
    # Read all VCFs
    sample_names = []
    sample_with_alt_columns = []
    for f in vcf_files:
        sample = os.path.basename(f).replace('_ngmlr+sniffles_filtered.vcf', '')
        sample_names.append(sample)
        sample_with_alt_columns.append(f"{sample}_with_alt")
        df = read_vcf(f, log_lines, sample)
        df_list.append(df)
    
    # Merge all variants into one DataFrame
    all_variants = pd.concat(df_list, ignore_index=True)
    
    # Fill missing sample columns with hyphen for summary
    for sample in sample_names:
        if sample not in all_variants.columns:
            all_variants[sample] = ''
        if f"{sample}_with_alt" not in all_variants.columns:
            all_variants[f"{sample}_with_alt"] = '-'
    
    # Pivot table for summary using exact POS, using columns with ALT
    summary_df = all_variants.groupby(['POS', 'REF'])[sample_with_alt_columns].first().reset_index()
    
    # Rename columns in summary to remove '_with_alt' suffix
    summary_df.columns = ['POS', 'REF'] + sample_names
    
    # Replace NaN or empty strings with hyphen in sample columns
    summary_df[sample_names] = summary_df[sample_names].fillna('-').replace('', '-')
    
    # Count how many samples have a non-hyphen value
    summary_df['Count'] = summary_df[sample_names].apply(lambda row: sum(val != '-' for val in row), axis=1)
    
    # Save Excel with individual sheets and summary
    writer = pd.ExcelWriter('merged_ngmlr+sniffles_variants.xlsx', engine='openpyxl')
    
    # Save individual sample sheets
    for df in df_list:
        sheet_name = df.columns[-2]  # Use sample name (not the _with_alt column)
        # Select only relevant columns for individual sheets (exclude _with_alt)
        df = df[[col for col in df.columns if not col.endswith('_with_alt')]]
        df.to_excel(writer, sheet_name=sheet_name, index=False)
    
    # Save summary sheet
    summary_df.to_excel(writer, sheet_name='Summary', index=False)
    writer.close()
    
    # Write log
    with open('ngmlr_filtering_log.txt', 'w') as logf:
        logf.write('Sample\tVariant\tReason\n')
        for line in log_lines:
            logf.write(line + '\n')
    
    print('Done: merged_ngmlr+sniffles_variants.xlsx with ALT sequences or hyphens in summary and genotype-only in individual sheets.')
  8. ----------- END of the transposon analysis. The following steps (e.g. assembly from the long reads) are not required for the transposon analysis. -----------

  9. (NOT_USED) Filtering low-complexity insertions using RepeatMasker (TODO: how to use RepeatModeler to generate a custom library?)

      python vcf_to_fasta.py variants.vcf variants.fasta
      #python filter_low_complexity.py variants.fasta filtered_variants.fasta retained_variants.fasta
      #Use RepeatMasker to filter the low-complexity sequences; the Dfam HDF5 library used is
      /home/jhuang/mambaforge/envs/transposon_long/share/RepeatMasker/Libraries/Dfam.h5    #1.9G
      python /home/jhuang/mambaforge/envs/transposon_long/share/RepeatMasker/famdb.py -i /home/jhuang/mambaforge/envs/transposon_long/share/RepeatMasker/Libraries/Dfam.h5 names 'bacteria' | head
      Exact Matches
      =============
      2 bacteria (blast name), Bacteria (scientific name), eubacteria (genbank common name), Monera (in-part), Procaryotae (in-part), Prokaryota (in-part), Prokaryotae (in-part), prokaryote (in-part), prokaryotes (in-part)

      Non-exact Matches
      =================
      1783272 Terrabacteria group (scientific name)
      91061 Bacilli (scientific name), Bacilli Ludwig et al. 2010 (authority), Bacillus/Lactobacillus/Streptococcus group (synonym), Firmibacteria (synonym), Firmibacteria Murray 1988 (authority)
      1239 Bacillaeota (synonym), Bacillaeota Oren et al. 2015 (authority), Bacillota (synonym), Bacillus/Clostridium group (synonym), clostridial firmicutes (synonym), Clostridium group firmicutes (synonym), Firmacutes (synonym), firmicutes (blast name), Firmicutes (scientific name), Firmicutes corrig. Gibbons and Murray 1978 (authority), Low G+C firmicutes (synonym), low G+C Gram-positive bacteria (common name), low GC Gram+ (common name)

      Summary of classes within Firmicutes:
        * Bacilli (includes many common pathogenic and non-pathogenic Gram-positive bacteria, taxid=91061)
            * Bacillus (e.g., Bacillus subtilis, Bacillus anthracis)
            * Staphylococcus (e.g., Staphylococcus aureus, Staphylococcus epidermidis)
            * Streptococcus (e.g., Streptococcus pneumoniae, Streptococcus pyogenes)
            * Listeria (e.g., Listeria monocytogenes)
        * Clostridia (includes many anaerobic species like Clostridium and Clostridioides)
        * Erysipelotrichia (intestinal bacteria, some pathogenic)
        * Tissierellia (less-studied, veterinary relevance)
        * Mollicutes (cell wall-less, includes Mycoplasma species)
        * Negativicutes (includes some Gram-negative, anaerobic species)

      # Mask low-complexity/repeat sequences with the Bacilli library, then extract the unmasked insertions
      RepeatMasker -species Bacilli -pa 4 -xsmall variants.fasta
      python extract_unmasked_seq.py variants.fasta.masked unmasked_variants.fasta

      #bcftools filter -i 'QUAL>30 && INFO/SVLEN>100' variants.vcf -o filtered.vcf
      #
      #bcftools view -i 'SVTYPE="INS"' variants.vcf | bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%INFO\n' > insertions.txt
      #mamba install -c bioconda vcf2fasta
      #vcf2fasta variants.vcf -o insertions.fasta
      #grep "SEQS" variants.vcf | awk '{ print $1, $2, $4, $5, $8 }' > insertions.txt
      #python3 filtering_low_complexity.py
      #
      #vcftools --vcf input.vcf --recode --out filtered_output --minSVLEN 100
      #bcftools filter -e 'INFO/SEQS ~ "^(G+|C+|T+|A+){4,}"' variants.vcf -o filtered.vcf

      # -- Calculate the percentage of reads supporting an insertion
      To calculate the percentage of reads that contain the insertion from a VCF entry, use the INFO and FORMAT fields of the record.

      Step 1: Extract the relevant information. In the provided VCF entry:
        RE (Reads Evidence): 733 – the total number of reads supporting the insertion.
        GT (Genotype): 1/1 – a homozygous insertion, meaning all reads covering this region are expected to carry it.
        AF (Allele Frequency): 1 – 100% allele frequency; every read in this sample supports the insertion.
        DR (Depth Reference): 0 – the number of reads supporting the reference allele.
        DV (Depth Variant): 733 – the number of reads supporting the variant allele (insertion).

      Step 2: Calculate the percentage of reads supporting the insertion:
        Percentage of reads with insertion = (DV / (DR + DV)) × 100
        Substituting the values: (733 / (0 + 733)) × 100 = 100%

      Conclusion: 100% of the reads support the insertion, i.e. the insertion is fully present in the sample (homozygous insertion), consistent with the AF=1 and GT=1/1 fields.
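      A one-line check of the same arithmetic (DR/DV values taken from the example record):

          python3 -c "DR, DV = 0, 733; print(f'{DV / (DR + DV) * 100:.1f}% of reads support the insertion')"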
* In your VCF file generated by Sniffles, REF=N in the results has a specific meaning:
    * In a standard VCF, the REF field usually contains the reference base(s) at the variant position.
    * For structural variants (SVs), especially insertions, no reference sequence is replaced; the insertion occurs between reference bases.
    * Therefore, Sniffles uses N as a placeholder in the REF field to indicate "no reference base replaced".
    * The actual inserted sequence is then stored in the ALT field.
  10. Why some records have UNRESOLVED in the FILTER field in the Excel output.

    1. Understanding the format
    
        The data appears to be structural variant (SV) calls from Sniffles, probably in a VCF-like tabular format exported to Excel:
    
            * gi|1176884116|gb|CP020463.1| → reference sequence
            * Positions: 1855752 and 2422820
            * N → insertion event
            * SVLEN=999 → size of the insertion
            * AF → allele frequency
            * GT:DR:DV → genotype, depth reference, depth variant (1/1:0:678, example values for a PASS variant)
            * FILTER → whether the variant passed filters (UNRESOLVED means it didn’t pass)
    
    2. What UNRESOLVED usually means
    
        In Sniffles:
    
        * UNRESOLVED is assigned to SVs when the tool cannot confidently resolve the exact sequence or breakpoint.
        * Reasons include:
            - Low read support (RE, DV) relative to the expected coverage
            - Ambiguous alignment at repetitive regions
            - Conflicting strand or orientation signals
            - Allele frequency inconsistent with expectations
    
    3. Examine your two records
    
        First record
    
            POS: 1855752
            SVTYPE: INS
            SVLEN: 999
            RE: 68
            AF: 1
            GT: 1/1
            FILTER: UNRESOLVED
    
        Observations:
    
        * AF = 1 → allele frequency 100%, homozygous insertion
        * RE = 68 → 68 reads support the variant, decent coverage
        * Still UNRESOLVED → likely because Sniffles could not resolve the inserted sequence precisely; sometimes long insertions in repetitive regions are hard to reconstruct fully even with good read support.
    
        Second record
    
            POS: 2422820
            SVTYPE: INS
            SVLEN: 999
            RE: 22
            AF: 0.025522
            GT: 0/0
            FILTER: UNRESOLVED
    
        Observations:
    
        * AF = 0.0255 → very low allele frequency (~2.5%)
        * RE = 22, DR = 840 → very low variant reads vs reference
        * GT = 0/0 → homozygous reference
        * Sniffles marks it UNRESOLVED because the variant is essentially noise, not confidently detected.
    
    4. Key difference between the two

        Feature                 First record                     Second record
        Allele frequency (AF)   1 (high)                         0.0255 (very low)
        Variant reads (RE)      68                               22
        Genotype (GT)           1/1                              0/0
        Reason for UNRESOLVED   Unresolvable inserted sequence   Very low allele frequency; signal indistinguishable from noise
    
    ✅ 5. Conclusion
    
        * Sniffles marks a variant as UNRESOLVED when the SV cannot be confidently characterized.
        * Even if there is good read support (first record), complex insertions can’t always be reconstructed fully.
        * Very low allele frequency (second record) also triggers UNRESOLVED because the signal is too weak compared to background noise.
        * Essentially: “UNRESOLVED” ≠ bad data, it’s just unresolved uncertainty.
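    A quick way to count how many records fall into each FILTER class per VCF (plain grep/cut; the file glob is an example):

        for vcf in *sniffles.vcf; do
            echo "== $vcf"
            grep -v '^#' "$vcf" | cut -f7 | sort | uniq -c
        done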
  11. (NOT_SURE_HOW_TO_USE) Polishing of assembly: Use tools like Medaka to refine variant calls by leveraging consensus sequences derived from nanopore data.

      mamba install -c bioconda medaka
      # medaka_consensus takes the basecalled reads via -i and the draft/reference via -d (cf. the call in step 20 below)
      medaka_consensus -i reads.fastq -d reference.fasta -o polished_output -t 4
  12. Compare Insertions Across Samples

    Merge Variants Across Samples: Use SURVIVOR to merge and compare the detected insertions in all samples against the WT:
    
    SURVIVOR merge input_vcfs.txt 1000 1 1 1 0 30 merged.vcf
    
        Input: List of VCF files from Sniffles2.
        Output: A consolidated VCF file with shared and unique variants.
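    A hedged sketch of building the VCF list, with the positional parameters annotated (per SURVIVOR merge usage: max breakpoint distance in bp, min number of supporting callers/files, agree on SV type, agree on strand, estimate distance from SV size, min SV size):

        ls *_minimap2+sniffles_filtered.vcf > input_vcfs.txt
        SURVIVOR merge input_vcfs.txt 1000 1 1 1 0 30 merged.vcf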
    
    Filter WT Insertions:
    
        Identify transposons present only in samples 1–9 by subtracting WT variants using bcftools:
    
            bcftools isec WT.vcf merged.vcf -p comparison_results
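            #bcftools isec usually expects bgzip-compressed, indexed inputs; a hedged preparation step:
            #bgzip -c WT.vcf > WT.vcf.gz && tabix -p vcf WT.vcf.gz
            #bgzip -c merged.vcf > merged.vcf.gz && tabix -p vcf merged.vcf.gz
            #bcftools isec WT.vcf.gz merged.vcf.gz -p comparison_results
            #  comparison_results/0000.vcf: records private to the first file, 0001.vcf: private to the second,
            #  0002.vcf/0003.vcf: records shared by both (taken from the first/second file, respectively)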
  13. Validate and Visualize

    Visualize with IGV: Use IGV to inspect insertion sites in the alignment and confirm quality.
    
    igv.sh
    
    Validate Findings:
        Perform PCR or additional sequencing for key transposon insertion sites to confirm results.
  14. Alternatives to TEPID for Long-Read Data

    If you’re looking for transposon-specific tools for long reads:
    
        REPET: A robust transposon annotation tool compatible with assembled genomes.
        EDTA (Extensive de novo TE Annotator):
            A pipeline to identify, classify, and annotate transposons.
            Works directly on your assembled genomes.
    
            perl EDTA.pl --genome WT.fasta --type all
  15. The WT.vcf file in the pipeline is generated by detecting structural variants (SVs) in the wild-type (WT) genome aligned against itself or using it as a baseline reference. Here’s how you can generate the WT.vcf:

    Steps to Generate WT.vcf
    1. Align WT Reads to the WT Reference Genome
    
    The goal here is to create an alignment of the WT sequencing data to the WT reference genome to detect any self-contained structural variations, such as native insertions, deletions, or duplications.
    
    Command using Minimap2:
    
    minimap2 -ax map-ont WT.fasta WT_reads.fastq | samtools sort -o WT.sorted.bam
    
    Index the BAM file:
    
    samtools index WT.sorted.bam
    
    2. Detect Structural Variants with Sniffles2
    
    Run Sniffles2 on the WT alignment to call structural variants:
    
    sniffles --input WT.sorted.bam --vcf WT.vcf
    
    This step identifies:
    
        Native transposons and insertions present in the WT genome.
        Other structural variants that are part of the reference genome or sequencing artifacts.
    
    Key parameters to consider:
    
        --min_support: Adjust based on your WT sequencing coverage.
        --max_distance: Define proximity for merging variants.
        --min_length: Set a minimum SV size (e.g., >50 bp for transposons).
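    For example (hedged: long-form flag names differ between Sniffles versions; the options below are assumed for Sniffles2):

        sniffles --input WT.sorted.bam --vcf WT.vcf --minsupport 10 --minsvlen 50 --threads 8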
  16. Clean and filter the WT.vcf (variant filtering): remove low-confidence variants based on read depth, quality scores, or allele frequency.

    To ensure the WT.vcf only includes relevant transposons or SVs:
    
        Use bcftools or similar tools to filter out low-confidence variants:
    
        bcftools filter -e "QUAL < 20 || INFO/SVTYPE != 'INS'" WT.vcf > WT_filtered.vcf
        bcftools filter -e "QUAL < 1 || INFO/SVTYPE != 'INS'" 1_.vcf > 1_filtered_.vcf
  17. NOTE that in this pipeline, the WT.fasta (reference genome) is typically a high-quality genome sequence from a database or a well-annotated version of your species’ genome. It is not assembled from the WT.fastq sequencing reads in this context. Here’s why:

    Why Use a Reference Genome (WT.fasta) from a Database?
    
        Higher Quality and Completeness:
            Database references (e.g., NCBI, Ensembl) are typically well-assembled, highly polished, and annotated. They serve as a reliable baseline for variant detection.
    
        Consistency:
            Using a standard reference ensures consistent comparisons across your WT and samples (1–9). Variants detected will be relative to this reference, not influenced by possible assembly errors.
    
        Saves Time:
            Assembling a reference genome from WT reads requires significant computational effort. Using an existing reference streamlines the analysis.
    
    Alternative: Assembling WT from FASTQ
    
    If you don’t have a high-quality reference genome (WT.fasta) and must rely on your WT FASTQ reads:
    
        Assemble the genome from your WT.fastq:
            Use long-read assemblers like Flye, Canu, or Shasta to create a draft genome.
    
        flye --nano-raw WT.fastq --out-dir WT_assembly --genome-size 
    Polish the assembly using tools like Racon (with the same reads) or Medaka for higher accuracy, and use the assembled, polished genome as your WT.fasta reference for the further steps.

    Key takeaways:

        If you have access to a reliable, high-quality reference genome, use it as the WT.fasta.
        Only assemble WT.fasta from raw reads (WT.fastq) if no database reference is available for your organism.
  18. Annotate Transposable Elements: Tools like ANNOVAR or SnpEff provide functional insights into the detected variants.

    # -- (successful!) MANUALLY Search for all found insertion sequences at https://tncentral.ncc.unesp.br/blast/ !
    # Or use the program available at https://github.com/danillo-alvarenga/tncomp_finder if there are numerous matches.
    #https://tncentral.ncc.unesp.br/report/te/Tn551-Y13600.1
    
    # -- (failed!) try TEPID for annotation
    mamba install tepid=0.10 -c bioconda
    #(tepid_env)
    for sample in WT 1 2 3 4 5 7 8 9 10; do
        tepid-map-se -x CP020463 -p 10 -n ${sample}_tepid -q  ../batch1_depth25/trycycler_${sample}/reads.fastq;
        tepid-discover -k -i -p 10 -n ${sample}_tepid -c ${sample}_tepid --se;
    done
    
    tepid-discover -k -i -p 10 -n 1_tepid -c 1.sorted.bam --se;
    
    tepid-refine [-h] [--version] [-k] [-i INSERTIONS] [-d DELETIONS]
                [-p PROC] -t TE -n NAME -c CONC -s SPLIT -a AL
    
    # -- (failed!) try EDTA for annotation
    https://github.com/oushujun/EDTA
    (transposon_long) mamba install -c conda-forge -c bioconda edta
    mamba install bioconda::rmblast  # cd RepeatMasker; ./configure
    EDTA.pl --genome CP020463.fasta --species others --threads 40
    
    For general-purpose TE annotation: EDTA, RepeatMasker, or RepeatScout are your best options.
    For de novo repeat identification: RepeatScout is highly effective.
    For LTR retrotransposons: Use LTR_retriever.
    For bacterial-specific annotations: Transposome, TEfinder, and ISfinder can be useful.
  19. Validation: Cross-validate with short-read sequencing data if available.

  20. (Optional) Assemble the genomes from the nanopore sequencing data using Trycycler

    1. merge and prepare fastq.gz
    
    2. calculate the percentage of coding DNA: https://www.ncbi.nlm.nih.gov/nuccore/CP020463 with the additional plasmid.
    
        * Average gene length: 870.4 bp
        * Total coding region length: 2056168 bp
        * Percentage of coding DNA = (Total Coding Region Length / Total Genome Length) × 100 = 2056168 bp / 2454929 bp = 83.8%
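        A quick check of the arithmetic above:

            python3 -c "print(f'{2056168/2454929*100:.1f}%')"   # -> 83.8%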
    
    3. Prepare the fastq.gz
    
        conda activate /home/jhuang/miniconda3/envs/trycycler  # under jhuang@hamm (10.169.63.113).
    
        rsync -a -P *.fastq.gz jhuang@10.169.63.113:/home/jhuang/DATA/Data_Patricia_Transposon_2025/
        for sample in barcode01 barcode02 barcode03 barcode04  HD46_Ctrl HD46_5 HD46_6 HD46_7 HD46_8 HD46_13; do
            mkdir trycycler_${sample};
            mv ${sample}.fastq.gz trycycler_${sample}/reads.fastq.gz;
            gunzip trycycler_${sample}/reads.fastq.gz;
        done
        #mkdir batch3;
        #cd batch3;
        #for sample in WT 1 2 3 4 5 7 8 9 10; do
        #    mkdir trycycler_${sample};
        #    cp ${sample}_nanopore/${sample}.fastq.gz trycycler_${sample}/reads.fastq.gz;
        #    gunzip trycycler_${sample}/reads.fastq.gz;
        #done
    
    4. Run the separate assemblies (6 assemblers x 4 read subsets: canu, flye, miniasm_and_minipolish, necat, nextdenovo, raven) using trycycler_assembly_extra-thorough.sh (run the following steps inside the trycycler environment)
            # batch1: min_read_cov=25; batch2: min_read_cov=50; batch3: min_read_cov=100 (if necessary on the monday!)
            cp ../Data_PaulBongarts_S.epidermidis_HDRNA/trycycler_*.sh ./
            #for sample in trycycler_HDRNA_10 trycycler_HDRNA_13; do
            for sample in barcode01 barcode02 barcode03 barcode04  HD46_Ctrl HD46_5 HD46_6 HD46_7 HD46_8 HD46_13; do
                cd trycycler_${sample};
                ../trycycler_assembly_extra-thorough.sh
                cd ..;
            done
    
    # END!
    #TODO: upload the nanopore sequencing data for HDRNA_10 and HDRNA_13 to NCBI under the corresponding project.
    
    (bengal3_ac3) jhuang@WS-2290C:~/DATA/Data_PaulBongarts_S.epidermidis_HDRNA/trycycler_HDRNA_10/assemblies$ mlst assembly_02.fasta
    assembly_02.fasta       sepidermidis    87      arcC(7) aroE(1) gtr(1)  mutS(2) pyrR(2) tpiA(1) yqiL(1)
    [15:32:46] If you like MLST, you're absolutely going to love wgMLST!
    [15:32:46] Done.
    (bengal3_ac3) jhuang@WS-2290C:~/DATA/Data_PaulBongarts_S.epidermidis_HDRNA/trycycler_HDRNA_10/assemblies$ mlst assembly_20.fasta
    
    assembly_20.fasta       sepidermidis    87      arcC(7) aroE(1) gtr(1)  mutS(2) pyrR(2) tpiA(1) yqiL(1)
    [15:33:01] You can use --label XXX to replace an ugly filename in the output.
    [15:33:01] Done.
    
    (bengal3_ac3) jhuang@WS-2290C:/mnt/md1/DATA/Data_PaulBongarts_S.epidermidis_HDRNA/trycycler_HDRNA_13/assemblies$ mlst assembly_02.fasta
    assembly_02.fasta       sepidermidis    5       arcC(1) aroE(1) gtr(1)  mutS(2) pyrR(2) tpiA(1) yqiL(1)
    
    (bengal3_ac3) jhuang@WS-2290C:/mnt/md1/DATA/Data_PaulBongarts_S.epidermidis_HDRNA/trycycler_HDRNA_13/assemblies$ mlst assembly_08.fasta
    
    assembly_08.fasta       sepidermidis    5       arcC(1) aroE(1) gtr(1)  mutS(2) pyrR(2) tpiA(1) yqiL(1)
    [15:38:22] Remember that --minscore is only used when using automatic scheme detection.
    [15:38:22] Done.
    (bengal3_ac3) jhuang@WS-2290C:/mnt/md1/DATA/Data_PaulBongarts_S.epidermidis_HDRNA/trycycler_HDRNA_13/assemblies$ mlst assembly_14.fasta
    assembly_14.fasta       sepidermidis    5       arcC(1) aroE(1) gtr(1)  mutS(2) pyrR(2) tpiA(1) yqiL(1)
    [15:38:50] Thanks for using mlst, I hope you found it useful.
    [15:38:50] Done.
    (bengal3_ac3) jhuang@WS-2290C:/mnt/md1/DATA/Data_PaulBongarts_S.epidermidis_HDRNA/trycycler_HDRNA_13/assemblies$ mlst assembly_20.fasta
    assembly_20.fasta       sepidermidis    5       arcC(1) aroE(1) gtr(1)  mutS(2) pyrR(2) tpiA(1) yqiL(1)
    [15:38:54] If you like MLST, you're going to absolutely love cgMLST!
    [15:38:54] Done.
    
            #- under the directory batch3
            #for sample in trycycler_WT trycycler_1 trycycler_2 trycycler_3 trycycler_4 trycycler_5 trycycler_7 trycycler_8 trycycler_9 trycycler_10; do
            #    cd ${sample};
            #    ../trycycler_assembly_extra-thorough_threads5_cov100.sh;
            #    cd ..;
            #done
    
            #if ERROR, running separate assembly-methods raven and canu
            #for sample in trycycler_WT trycycler_1 trycycler_2 trycycler_3 trycycler_4 trycycler_5 trycycler_7 trycycler_8 trycycler_9 trycycler_10; do
            #    cd ${sample};
            #    ../trycycler_assembly_extra-thorough_raven.sh;
            #    cd ..;
            #done
            #
            #for sample in trycycler_WT trycycler_1 trycycler_2 trycycler_3 trycycler_4 trycycler_5 trycycler_7 trycycler_8 trycycler_9 trycycler_10; do
            #    cd ${sample};
            #    ../trycycler_assembly_extra-thorough_canu.sh;
            #    cd ..;
            #done
    
    5. trycycler cluster
    
            #Run in two batches:
            #for sample in trycycler_5 trycycler_7 trycycler_8 trycycler_9 trycycler_10; do
            for sample in trycycler_WT trycycler_1 trycycler_2 trycycler_3 trycycler_4; do
                cd ${sample};
                rm -rf trycycler
                trycycler cluster --threads 10 --assemblies assemblies/*.fasta --reads reads.fastq --out_dir trycycler;
                cd ..;
            done
    
    6. trycycler reconcile
    
            cd trycycler_WT
            trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_001
            mkdir trycycler/cluster_001/1_contigs_removed
            mv trycycler/cluster_001/1_contigs/C_utg000001c.fasta trycycler/cluster_001/1_contigs_removed
            mv trycycler/cluster_001/1_contigs/K_ctg000000.fasta trycycler/cluster_001/1_contigs_removed
            mv trycycler/cluster_001/1_contigs/S_tig00000001.fasta trycycler/cluster_001/1_contigs_removed
    
            trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_002
                # -- under batch3 --
                176K Nov  5 12:38 P_bctg00000001.fasta
                122K Nov  5 12:38 D_bctg00000001.fasta
                95K Nov  5 12:38 J_bctg00000001.fasta
                78K Nov  5 12:38 J_bctg00000002.fasta
                #--> Error: unable to find a suitable common sequence
    
                69K Nov  5 17:24 G_tig00000004.fasta
                66K Nov  5 17:24 S_tig00000004.fasta
                66K Nov  5 17:24 M_tig00000004.fasta
                65K Nov  5 17:24 A_tig00000005.fasta
                #--> Error: unable to find a suitable common sequence
                #If repeat M_tig00000004.fasta twice, resulting in the following error: Circularising M_tig00000004:
                #  using S_tig00000004:
                #    unable to circularise: M_tig00000004's start and end were found in multiple places in S_tig00000004
                #Error: failed to circularise sequence M_tig00000004 because its start/end sequences were found in multiple ambiguous places in other sequences. #This is likely because M_tig00000004 starts/ends in a repetitive region. You can either manually repair its circularisation (and ensure it does #not start/end in a repetitive region) or exclude the sequence altogether and try again.
    
                47K Nov  5 17:24 V_bctg00000001.fasta
                45K Nov  5 17:24 C_utg000002c.fasta
                #Error: unable to find a suitable common sequence
    
                34K Nov  5 17:24 U_utg000004c.fasta
                34K Nov  5 17:24 N_contig_2.fasta
                34K Nov  5 17:24 T_contig_2.fasta
                #Error: unable to find a suitable common sequence
    
                23K Nov  5 17:24 I_utg000003c.fasta
                22K Nov  5 17:24 B_contig_2.fasta
                22K Nov  5 17:24 H_contig_2.fasta
                #Error: unable to find a suitable common sequence
    
                13K Nov  5 17:24 G_tig00000006.fasta
                12K Nov  5 17:24 M_tig00000003.fasta
                12K Nov  5 17:24 O_utg000002c.fasta
                12K Nov  5 17:24 S_tig00000003.fasta
                11K Nov  5 17:24 A_tig00000004.fasta
                #trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_002 --max_length_diff 1.2 --max_trim_seq_percent 20.0
                #--> SUCCESSFULLY saving sequences to file: trycycler/cluster_002/2_all_seqs.fasta
    
            trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_003
            trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_004 --max_length_diff 1.3 --max_trim_seq_percent 30.0
            trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_005 #Error: two or more input contigs are required
            trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_006 #Error: two or more input contigs are required
            trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_007 #Error: two or more input contigs are required
    
            cd ../trycycler_1
            trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_001
            mkdir trycycler/cluster_001/1_contigs_removed
            mv trycycler/cluster_001/1_contigs/I_utg000001l.fasta trycycler/cluster_001/1_contigs_removed
            mv trycycler/cluster_001/1_contigs/K_ctg000000.fasta trycycler/cluster_001/1_contigs_removed
            mv trycycler/cluster_001/1_contigs/S_tig00000001.fasta trycycler/cluster_001/1_contigs_removed
            mv trycycler/cluster_001/1_contigs/W_ctg000000.fasta trycycler/cluster_001/1_contigs_removed
            mv trycycler/cluster_001/1_contigs/A_tig00000001.fasta trycycler/cluster_001/1_contigs_removed
            mv trycycler/cluster_001/1_contigs/U_utg000001c.fasta trycycler/cluster_001/1_contigs_removed
    
            trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_002
            #TODO_TOMORROW_HERE!
            trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_003 --max_length_diff 1.7 --max_trim_seq_percent 70.0
            #Error: failed to circularise sequence I_utg000002l because its end could not be found in other sequences.
            (NOTE) Use the plasmid from another isolate to help determine where to cut the sequence.
                #cp ~/DATA/Data_Patricia_Transposon/batch3/trycycler_WT/trycycler/cluster_004/7_final_consensus.fasta .
                #cat *.fasta > all.fasta
                #mafft --clustalout --adjustdirection all.fasta > all.aln
            cluster_004_con -------------------------------------------------------tatga
            I_utg000002l    agaatcagaattaggcgcataatttacaggaggtttgattatggctaagaaaaaatatga
            O_utg000002l    agaatcagaattaggcgcataatttacaggaggtttgattatggctaagaaaaaatatga
                                                                                *****
            cluster_004_con gcagaatcagaattaggcgcataatttacaggaggtttgattatggctaagaaaaaa---
            I_utg000002l    gcagaatcagaattaggcgcataatttacaggaggtttgattatggctaagaaaaatatg
            O_utg000002l    gcagaatcagaattaggcgcataatttacaggaggtttgattatggctaagaaaaatatg
                            ********************************************************
    
            grep "I_utg000002l" all.a_ > I_utg000002l.fasta
            grep "O_utg000002l" all.a_ > O_utg000002l.fasta
            cut -d' ' -f5 I_utg000002l.fasta > I_utg000002l_.fasta
            cut -d' ' -f5 O_utg000002l.fasta > O_utg000002l_.fasta
            sed 's/-//g' I_utg000002l_.fasta > I_utg000002l__.fasta
            sed 's/-//g' O_utg000002l_.fasta > O_utg000002l__.fasta
            seqtk seq I_utg000002l__.fasta -l 80 > I_utg000002l___.fasta
            seqtk seq O_utg000002l__.fasta -l 80 > O_utg000002l___.fasta
    
            #Cut two files Manually with the sequence: AGAATCAGAATTAGGCGCATAATTTACAGGAGGTTTGATTATGGCTAAGAAAAA
            cat I_utg000002l___.fasta O_utg000002l___.fasta > 2_all_seqs.fasta
            trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_004 #Error: two or more input contigs are required
    
            cd ../trycycler_2
            trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_001
            mkdir trycycler/cluster_001/1_contigs_removed
            mv trycycler/cluster_001/1_contigs/O_bctg00000000.fasta trycycler/cluster_001/1_contigs_removed
            trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_002
            trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_003
            trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_004
    
            cd ../trycycler_3
            trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_001
            mkdir trycycler/cluster_001/1_contigs_removed
            mv trycycler/cluster_001/1_contigs/E_ctg000000.fasta trycycler/cluster_001/1_contigs_removed
            trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_002
            trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_003
            trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_004
    
            cd ../trycycler_4
            trycycler reconcile --threads 5 --reads reads.fastq --cluster_dir trycycler/cluster_001
            mkdir trycycler/cluster_001/1_contigs_removed
            mv trycycler/cluster_001/1_contigs/C_utg000001l.fasta trycycler/cluster_001/1_contigs_removed
            mv trycycler/cluster_001/1_contigs/E_ctg000000.fasta trycycler/cluster_001/1_contigs_removed
            mv trycycler/cluster_001/1_contigs/F_Utg670.fasta trycycler/cluster_001/1_contigs_removed
            mv trycycler/cluster_001/1_contigs/V_bctg00000000.fasta trycycler/cluster_001/1_contigs_removed
            trycycler reconcile --threads 5 --reads reads.fastq --cluster_dir trycycler/cluster_002
            trycycler reconcile --threads 5 --reads reads.fastq --cluster_dir trycycler/cluster_003
            trycycler reconcile --threads 5 --reads reads.fastq --cluster_dir trycycler/cluster_004
    
            cd ../trycycler_5
            trycycler reconcile --threads 5 --reads reads.fastq --cluster_dir trycycler/cluster_001
            trycycler reconcile --threads 5 --reads reads.fastq --cluster_dir trycycler/cluster_002
            trycycler reconcile --threads 5 --reads reads.fastq --cluster_dir trycycler/cluster_003
            trycycler reconcile --threads 5 --reads reads.fastq --cluster_dir trycycler/cluster_004
    
            cd ../trycycler_7
            trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_001
            mkdir trycycler/cluster_001/1_contigs_removed
            mv trycycler/cluster_001/1_contigs/C_utg000001l.fasta trycycler/cluster_001/1_contigs_removed
            mv trycycler/cluster_001/1_contigs/D_bctg00000000.fasta trycycler/cluster_001/1_contigs_removed
            trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_002
            trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_003
            trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_004
            trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_005
    
            cd ../trycycler_8
            trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_001
            mkdir trycycler/cluster_001/1_contigs_removed
            mv trycycler/cluster_001/1_contigs/M_utg000001l.fasta trycycler/cluster_001/1_contigs_removed
            trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_002
            trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_003
            trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_004
            trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_005
    
            cd ../trycycler_9
            trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_001
            trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_002
            trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_003
            trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_004
            trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_005
            trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_006
    
            cd ../trycycler_10
            trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_001
            trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_002
            trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_003
            trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_004
            trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_005
    
            #Redo: for sample in trycycler_9 trycycler_10; do
            cd ${sample};
            ../trycycler_assembly_extra-thorough_canu.sh;
            cd ..;
            done
    
    7. (optional) map the circular assemblies to plasmid databases
            #bzip2 -d plsdb.fna.bz2
            #makeblastdb -in plsdb.fna -dbtype nucl
            #blastn -db plsdb.fna -query all.fasta -evalue 1e-50 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 > all_on_plsdb.blastn
    
    8. trycycler msa
    
            trycycler msa --threads 55 --cluster_dir trycycler/cluster_001
            trycycler msa --threads 55 --cluster_dir trycycler/cluster_002
            trycycler msa --threads 55 --cluster_dir trycycler/cluster_003
            trycycler msa --threads 55 --cluster_dir trycycler/cluster_004
            trycycler msa --threads 55 --cluster_dir trycycler/cluster_005
            #--> When finished, trycycler msa will make a 3_msa.fasta file in the cluster directory
    
    9. trycycler partition
    
            #generate 4_reads.fastq for each contig!
            trycycler partition --threads 55 --reads reads.fastq --cluster_dirs trycycler/cluster_*
            #trycycler partition --threads 55 --reads reads.fastq --cluster_dirs trycycler/cluster_001 trycycler/cluster_002 trycycler/cluster_003
            trycycler partition --threads 55 --reads reads.fastq --cluster_dirs trycycler/cluster_003
    
    10. trycycler consensus
    
            trycycler consensus --threads 55 --cluster_dir trycycler/cluster_001
            trycycler consensus --threads 55 --cluster_dir trycycler/cluster_002
            trycycler consensus --threads 55 --cluster_dir trycycler/cluster_003
            trycycler consensus --threads 55 --cluster_dir trycycler/cluster_004
            trycycler consensus --threads 55 --cluster_dir trycycler/cluster_005
            #!!NOTE that we take the isolates of HD05_2_K5 and HD05_2_K6 assembled by Unicycler instead of Trycycler!!
            # TODO (TODAY), generate the 3 datasets below!
            # TODO (IMPORTANT): write an email to Holger saying that the short-read sequencing of HD5_2 is not correct, based on the 3 datasets! However, MTxxxxxxx is confirmed to be absent from K5 and K6!
            TODO: variant calling requires the short-read sequencing; it is not doable without correct short reads (resequencing?). Calling variants from long reads alone is difficult because of their high error rate.
            #TODO: check whether the MT sequence is present in the isolates; more detailed annotations will come later!
            #II. Compare the results of Trycycler with Unicycler.
            #III. If applicable, add the plasmids assembled by Unicycler to the final results, e.g. add the 4 plasmids to K5 and K6.
    
    11. Polishing after Trycycler
    
            #1. Oxford Nanopore sequencer (Ignored due to the samtools version incompatibility!)
            # for c in trycycler/cluster_*; do
            #     medaka_consensus -i "$c"/4_reads.fastq -d "$c"/7_final_consensus.fasta -o "$c"/medaka -m r941_min_sup_g507 -t 12
            #     mv "$c"/medaka/consensus.fasta "$c"/8_medaka.fasta
            #     rm -r "$c"/medaka "$c"/*.fai "$c"/*.mmi  # clean up
            # done
            # cat trycycler/cluster_*/8_medaka.fasta > trycycler/consensus.fasta
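            #(sketch) a dedicated conda environment would avoid the samtools clash; the env name medaka_env is a placeholder:
            # mamba create -n medaka_env -c conda-forge -c bioconda medaka -y
            # mamba run -n medaka_env medaka_consensus -i trycycler/cluster_001/4_reads.fastq -d trycycler/cluster_001/7_final_consensus.fasta -o trycycler/cluster_001/medaka -m r941_min_sup_g507 -t 12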
            #2. Short-read polishing
            #---- 5179_R1 (2) ----
            #  mean read depth: 205.8x
            #  188 bp have a depth of zero (99.9924% coverage)
            #  355 positions changed (0.0144% of total positions)
            #  estimated pre-polishing sequence accuracy: 99.9856% (Q38.42)
            #Step 1: read QC
            fastp --in1 ../../s-epidermidis-5179-r1_R1.fastq.gz --in2 ../../s-epidermidis-5179-r1_R2.fastq.gz --out1 1.fastq.gz --out2 2.fastq.gz --unpaired1 u.fastq.gz --unpaired2 u.fastq.gz
            #Step 2: Polypolish
            for cluster in cluster_001 cluster_002; do
            bwa index ${cluster}/7_final_consensus.fasta
            bwa mem -t 55 -a ${cluster}/7_final_consensus.fasta 1.fastq.gz > ${cluster}/alignments_1.sam
            bwa mem -t 55 -a ${cluster}/7_final_consensus.fasta 2.fastq.gz > ${cluster}/alignments_2.sam
            polypolish polish ${cluster}/7_final_consensus.fasta ${cluster}/alignments_1.sam ${cluster}/alignments_2.sam > ${cluster}/polypolish.fasta
            done
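            #(optional sketch) the Polypolish docs also recommend an insert-size alignment filter before polishing;
            #assuming a Polypolish version that provides the `polypolish filter` subcommand, it would slot into the loop above as:
            # polypolish filter --in1 ${cluster}/alignments_1.sam --in2 ${cluster}/alignments_2.sam --out1 ${cluster}/filtered_1.sam --out2 ${cluster}/filtered_2.sam
            # polypolish polish ${cluster}/7_final_consensus.fasta ${cluster}/filtered_1.sam ${cluster}/filtered_2.sam > ${cluster}/polypolish.fasta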
            #Step 3: POLCA
            for cluster in cluster_001 cluster_002; do
            cd ${cluster}
            polca.sh -a polypolish.fasta -r "../../../s-epidermidis-5179-r1_R1.fastq.gz ../../../s-epidermidis-5179-r1_R2.fastq.gz" -t 55 -m 120G
            cd ..
            done
            #Substitution Errors: 37
            #Insertion/Deletion Errors: 2
            #Assembly Size: 2470001
            #Consensus Quality: 99.9984
            #Substitution Errors: 4
            #Insertion/Deletion Errors: 0
            #Assembly Size: 17748
            #Consensus Quality: 99.9775
            #Step 4: (optional) more rounds and/or other polishers
            #After one round of Polypolish and one round of POLCA, your assembly should be in very good shape!
            #However, there may still be a few lingering errors. You can try running additional rounds of Polypolish or POLCA to see if they make any more changes.
            for cluster in cluster_001 cluster_002; do
            cd ${cluster}
            polca.sh -a polypolish.fasta.PolcaCorrected.fa -r "../../../s-epidermidis-5179-r1_R1.fastq.gz ../../../s-epidermidis-5179-r1_R2.fastq.gz" -t 55 -m 120G
            cd ..
            done
            #Substitution Errors: 13
            #Insertion/Deletion Errors: 0
            #Assembly Size: 2470004
            #Consensus Quality: 99.9995
            #Substitution Errors: 0
            #Insertion/Deletion Errors: 0
            #Assembly Size: 17748
            #Consensus Quality: 100
            for cluster in cluster_001; do
            cd ${cluster}
            polca.sh -a polypolish.fasta.PolcaCorrected.fa.PolcaCorrected.fa -r "../../../s-epidermidis-5179-r1_R1.fastq.gz ../../../s-epidermidis-5179-r1_R2.fastq.gz" -t 55 -m 120G
            cd ..
            done
            #Substitution Errors: 0
            #Insertion/Deletion Errors: 0
            #Assembly Size: 2470004
            #Consensus Quality: 100
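            #(sketch) the manual re-polishing rounds above can be automated: run inside a cluster directory and
            #stop once POLCA reports zero substitution errors; the log-file names are placeholders and the error
            #counts are assumed to be capturable from the console output (as pasted above):
            asm=polypolish.fasta
            for round in 1 2 3 4; do
                polca.sh -a "$asm" -r "../../../s-epidermidis-5179-r1_R1.fastq.gz ../../../s-epidermidis-5179-r1_R2.fastq.gz" -t 55 -m 120G 2>&1 | tee polca_round${round}.log
                subs=$(awk -F': *' '/Substitution Errors/ {print $2; exit}' polca_round${round}.log)
                asm="${asm}.PolcaCorrected.fa"
                [ "${subs:-1}" -eq 0 ] && break
            done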
            #---- 1585 (4) ----
            #  mean read depth: 174.7x
            #  8,297 bp have a depth of zero (99.6604% coverage)
            #  271 positions changed (0.0111% of total positions)
            #  estimated pre-polishing sequence accuracy: 99.9889% (Q39.55)
            #Step 1: read QC
            fastp --in1 ../../s-epidermidis-1585_R1.fastq.gz --in2 ../../s-epidermidis-1585_R2.fastq.gz --out1 1.fastq.gz --out2 2.fastq.gz --unpaired1 u.fastq.gz --unpaired2 u.fastq.gz
            #Step 2: Polypolish
            for cluster in cluster_001 cluster_002 cluster_003 cluster_004; do
            bwa index ${cluster}/7_final_consensus.fasta
            bwa mem -t 55 -a ${cluster}/7_final_consensus.fasta 1.fastq.gz > ${cluster}/alignments_1.sam
            bwa mem -t 55 -a ${cluster}/7_final_consensus.fasta 2.fastq.gz > ${cluster}/alignments_2.sam
            polypolish polish ${cluster}/7_final_consensus.fasta ${cluster}/alignments_1.sam ${cluster}/alignments_2.sam > ${cluster}/polypolish.fasta
            done
            #Step 3: POLCA
            for cluster in cluster_001 cluster_002 cluster_003 cluster_004; do
            cd ${cluster}
            polca.sh -a polypolish.fasta -r "../../../s-epidermidis-1585_R1.fastq.gz ../../../s-epidermidis-1585_R2.fastq.gz" -t 55 -m 120G
            cd ..
            done
            #Substitution Errors: 7
            #Insertion/Deletion Errors: 4
            #Assembly Size: 2443174
            #Consensus Quality: 99.9995
            #Substitution Errors: 0
            #Insertion/Deletion Errors: 0
            #Assembly Size: 9014
            #Consensus Quality: 100
            #Substitution Errors: 0
            #Insertion/Deletion Errors: 0
            #Assembly Size: 9014
            #Consensus Quality: 100
            #Substitution Errors: 0
            #Insertion/Deletion Errors: 0
            #Assembly Size: 2344
            #Consensus Quality: 100
            #Step 4: (optional) more rounds and/or other polishers
            #After one round of Polypolish and one round of POLCA, your assembly should be in very good shape!
            #However, there may still be a few lingering errors. You can try running additional rounds of Polypolish or POLCA to see if they make any more changes.
            for cluster in cluster_001; do
            cd ${cluster}
            polca.sh -a polypolish.fasta.PolcaCorrected.fa -r "../../../s-epidermidis-1585_R1.fastq.gz ../../../s-epidermidis-1585_R2.fastq.gz" -t 55 -m 120G
            cd ..
            done
            #Substitution Errors: 0
            #Insertion/Deletion Errors: 0
            #Assembly Size: 2443176
            #Consensus Quality: 100
            #---- 1585 derived from unicycler, under 1585_normal/unicycler (4) ----
            #Step 0: copy the chromosome and plasmid1, plasmid2, plasmid3 (from Unicycler) to cluster_001/7_final_consensus.fasta, ...
            #Step 1: read QC
            fastp --in1 ../../s-epidermidis-1585_R1.fastq.gz --in2 ../../s-epidermidis-1585_R2.fastq.gz --out1 1.fastq.gz --out2 2.fastq.gz --unpaired1 u.fastq.gz --unpaired2 u.fastq.gz
            #Step 2: Polypolish
            for cluster in cluster_001 cluster_002 cluster_003 cluster_004; do
            bwa index ${cluster}/7_final_consensus.fasta
            bwa mem -t 55 -a ${cluster}/7_final_consensus.fasta 1.fastq.gz > ${cluster}/alignments_1.sam
            bwa mem -t 55 -a ${cluster}/7_final_consensus.fasta 2.fastq.gz > ${cluster}/alignments_2.sam
            polypolish polish ${cluster}/7_final_consensus.fasta ${cluster}/alignments_1.sam ${cluster}/alignments_2.sam > ${cluster}/polypolish.fasta
            done
            #Polishing 1 (2,443,574 bp):
            #mean read depth: 174.7x
            #8,298 bp have a depth of zero (99.6604% coverage)
            #52 positions changed (0.0021% of total positions)
            #estimated pre-polishing sequence accuracy: 99.9979% (Q46.72)
            #Polishing 2 (9,014 bp):
            #mean read depth: 766.5x
            #3 bp have a depth of zero (99.9667% coverage)
            #0 positions changed (0.0000% of total positions)
            #estimated pre-polishing sequence accuracy: 100.0000% (Q∞)
            #Polishing 7 (2,344 bp):
            #mean read depth: 2893.0x
            #4 bp have a depth of zero (99.8294% coverage)
            #0 positions changed (0.0000% of total positions)
            #estimated pre-polishing sequence accuracy: 100.0000% (Q∞)
            #Polishing 8 (2,255 bp):
            #mean read depth: 2719.6x
            #4 bp have a depth of zero (99.8226% coverage)
            #0 positions changed (0.0000% of total positions)
            #estimated pre-polishing sequence accuracy: 100.0000% (Q∞)
            #Step 3: POLCA
            for cluster in cluster_001 cluster_002 cluster_003 cluster_004; do
            cd ${cluster}
            polca.sh -a polypolish.fasta -r "../../../s-epidermidis-1585_R1.fastq.gz ../../../s-epidermidis-1585_R2.fastq.gz" -t 55 -m 120G
            cd ..
            done
            #Substitution Errors: 7
            #Insertion/Deletion Errors: 4
            #Assembly Size: 2443598
            #Consensus Quality: 99.9995
            #Substitution Errors: 0
            #Insertion/Deletion Errors: 0
            #Assembly Size: 9014
            #Consensus Quality: 100
            #Substitution Errors: 0
            #Insertion/Deletion Errors: 0
            #Assembly Size: 2344
            #Consensus Quality: 100
            #Substitution Errors: 0
            #Insertion/Deletion Errors: 0
            #Assembly Size: 2255
            #Consensus Quality: 100
            #Step 4: (optional) more rounds and/or other polishers
            #After one round of Polypolish and one round of POLCA, your assembly should be in very good shape!
            #However, there may still be a few lingering errors. You can try running additional rounds of Polypolish or POLCA to see if they make any more changes.
            for cluster in cluster_001; do
            cd ${cluster}
            polca.sh -a polypolish.fasta.PolcaCorrected.fa -r "../../../s-epidermidis-1585_R1.fastq.gz ../../../s-epidermidis-1585_R2.fastq.gz" -t 55 -m 120G
            cd ..
            done
            #Substitution Errors: 0
            #Insertion/Deletion Errors: 0
            #Assembly Size: 2443600
            #Consensus Quality: 100
            #-- 1585v (1, no short reads, waiting) --
            # TODO!
            #-- 5179 (2) --
            #mean read depth: 120.7x
            #7,547 bp have a depth of zero (99.6946% coverage)
            #356 positions changed (0.0144% of total positions)
            #estimated pre-polishing sequence accuracy: 99.9856% (Q38.41)
            #Step 1: read QC
            fastp --in1 ../../s-epidermidis-5179_R1.fastq.gz --in2 ../../s-epidermidis-5179_R2.fastq.gz --out1 1.fastq.gz --out2 2.fastq.gz --unpaired1 u.fastq.gz --unpaired2 u.fastq.gz
            #Step 2: Polypolish
            for cluster in cluster_001 cluster_002; do
            bwa index ${cluster}/7_final_consensus.fasta
            bwa mem -t 55 -a ${cluster}/7_final_consensus.fasta 1.fastq.gz > ${cluster}/alignments_1.sam
            bwa mem -t 55 -a ${cluster}/7_final_consensus.fasta 2.fastq.gz > ${cluster}/alignments_2.sam
            polypolish polish ${cluster}/7_final_consensus.fasta ${cluster}/alignments_1.sam ${cluster}/alignments_2.sam > ${cluster}/polypolish.fasta
            done
            #Step 3: POLCA
            for cluster in cluster_001 cluster_002; do
            cd ${cluster}
            polca.sh -a polypolish.fasta -r "../../../s-epidermidis-5179_R1.fastq.gz ../../../s-epidermidis-5179_R2.fastq.gz" -t 55 -m 120G
            cd ..
            done
            #Substitution Errors: 49
            #Insertion/Deletion Errors: 23
            #Assembly Size: 2471418
            #Consensus Quality: 99.9971
            #Substitution Errors: 0
            #Insertion/Deletion Errors: 0
            #Assembly Size: 17748
            #Consensus Quality: 100
            #Step 4: (optional) more rounds of POLCA
            for cluster in cluster_001; do
            cd ${cluster}
            polca.sh -a polypolish.fasta.PolcaCorrected.fa -r "../../../s-epidermidis-5179_R1.fastq.gz ../../../s-epidermidis-5179_R2.fastq.gz" -t 55 -m 120G
            cd ..
            done
            #Substitution Errors: 10
            #Insertion/Deletion Errors: 5
            #Assembly Size: 2471442
            #Consensus Quality: 99.9994
            for cluster in cluster_001; do
            cd ${cluster}
            polca.sh -a polypolish.fasta.PolcaCorrected.fa.PolcaCorrected.fa -r "../../../s-epidermidis-5179_R1.fastq.gz ../../../s-epidermidis-5179_R2.fastq.gz" -t 55 -m 120G
            cd ..
            done
            #Substitution Errors: 6
            #Insertion/Deletion Errors: 0
            #Assembly Size: 2471445
            #Consensus Quality: 99.9998
            for cluster in cluster_001; do
            cd ${cluster}
            polca.sh -a polypolish.fasta.PolcaCorrected.fa.PolcaCorrected.fa.PolcaCorrected.fa -r "../../../s-epidermidis-5179_R1.fastq.gz ../../../s-epidermidis-5179_R2.fastq.gz" -t 55 -m 120G
            cd ..
            done
            #Substitution Errors: 0
            #Insertion/Deletion Errors: 0
            #Assembly Size: 2471445
            #Consensus Quality: 100
            #-- HD5_2 (2): without correct short-read data the base calls cannot be corrected! --
            # !ERROR to be REPORTED: the available HD5_2 short reads do not appear to match this assembly (see the polishing statistics below).
            #Polishing cluster_001_consensus (2,504,140 bp):
            #mean read depth: 94.4x
            #240,420 bp have a depth of zero (90.3991% coverage)
            #56,894 positions changed (2.2720% of total positions)
            #estimated pre-polishing sequence accuracy: 97.7280% (Q16.44)
            #Available HD5 short-read files (180821_rohde run):
            /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_1_S37_R1_001.fastq
            /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_1_S37_R2_001.fastq
            /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_2_S38_R1_001.fastq
            /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_2_S38_R2_001.fastq
            /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_3_S39_R1_001.fastq
            /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_3_S39_R2_001.fastq
            /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_4_S40_R1_001.fastq
            /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_4_S40_R2_001.fastq
            /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_5_S41_R1_001.fastq
            /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_5_S41_R2_001.fastq
            /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_6_S42_R1_001.fastq
            /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_6_S42_R2_001.fastq
            /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_7_S43_R1_001.fastq
            /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_7_S43_R2_001.fastq
            /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_8_S44_R1_001.fastq
            /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_8_S44_R2_001.fastq
            /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_9_S45_R1_001.fastq
            /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_9_S45_R2_001.fastq
            /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_10_S46_R1_001.fastq
            /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_10_S46_R2_001.fastq
            #Step 1: read QC
            fastp --in1 ../../HD5_2_S38_R1_001.fastq.gz --in2 ../../HD5_2_S38_R2_001.fastq.gz --out1 1.fastq.gz --out2 2.fastq.gz --unpaired1 u.fastq.gz --unpaired2 u.fastq.gz
            # NOTE that the following steps are not run since the short-reads are not correct!
            # #Step 2: Polypolish
            # for cluster in cluster_001 cluster_005; do
            #   bwa index ${cluster}/7_final_consensus.fasta
            #   bwa mem -t 55 -a ${cluster}/7_final_consensus.fasta 1.fastq.gz > ${cluster}/alignments_1.sam
            #   bwa mem -t 55 -a ${cluster}/7_final_consensus.fasta 2.fastq.gz > ${cluster}/alignments_2.sam
            #   polypolish polish ${cluster}/7_final_consensus.fasta ${cluster}/alignments_1.sam ${cluster}/alignments_2.sam > ${cluster}/polypolish.fasta
            # done
            # #Step 3: POLCA
            # for cluster in cluster_001 cluster_005; do
            #   cd ${cluster}
            #   polca.sh -a polypolish.fasta -r "../../../HD5_2_S38_R1_001.fastq.gz ../../../HD5_2_S38_R2_001.fastq.gz" -t 55 -m 120G
            #   cd ..
            # done
            # #Step 4: (optional) more rounds POLCA
            # for cluster in cluster_001; do
            #   cd ${cluster}
            #   polca.sh -a polypolish.fasta.PolcaCorrected.fa -r "../../../HD5_2_S38_R1_001.fastq.gz ../../../HD5_2_S38_R2_001.fastq.gz" -t 55 -m 120G
            #   cd ..
            # done
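            #(sketch) to substantiate that these short reads do not belong to HD5_2, map them against the
            #Trycycler consensus and inspect the mapping rate and per-contig coverage (the BAM name is a placeholder):
            # bwa index cluster_001/7_final_consensus.fasta
            # bwa mem -t 55 cluster_001/7_final_consensus.fasta 1.fastq.gz 2.fastq.gz | samtools sort -@ 8 -o hd5_2_check.bam -
            # samtools index hd5_2_check.bam
            # samtools flagstat hd5_2_check.bam     #overall mapping rate
            # samtools coverage hd5_2_check.bam     #breadth and depth per contig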
            # NOTE that the plasmids of HD5_2_K5 and HD5_2_K6 were copied from Unicycler!
            #-- HD5_2_K5 (4) --
            #mean read depth: 87.1x
            #25 bp have a depth of zero (99.9990% coverage)
            #1,085 positions changed (0.0433% of total positions)
            #estimated pre-polishing sequence accuracy: 99.9567% (Q33.63)
            #Step 1: read QC
            fastp --in1 ../../275_K5_Holger_S92_R1_001.fastq.gz --in2 ../../275_K5_Holger_S92_R2_001.fastq.gz --out1 1.fastq.gz --out2 2.fastq.gz --unpaired1 u.fastq.gz --unpaired2 u.fastq.gz
            #Step 2: Polypolish
            for cluster in cluster_001 cluster_002 cluster_003 cluster_004; do
            bwa index ${cluster}/7_final_consensus.fasta
            bwa mem -t 55 -a ${cluster}/7_final_consensus.fasta 1.fastq.gz > ${cluster}/alignments_1.sam
            bwa mem -t 55 -a ${cluster}/7_final_consensus.fasta 2.fastq.gz > ${cluster}/alignments_2.sam
            polypolish polish ${cluster}/7_final_consensus.fasta ${cluster}/alignments_1.sam ${cluster}/alignments_2.sam > ${cluster}/polypolish.fasta
            done
            #Step 3: POLCA
            for cluster in cluster_001 cluster_002 cluster_003 cluster_004; do
            cd ${cluster}
            polca.sh -a polypolish.fasta -r "../../../275_K5_Holger_S92_R1_001.fastq.gz ../../../275_K5_Holger_S92_R2_001.fastq.gz" -t 55 -m 120G
            cd ..
            done
            #Substitution Errors: 146
            #Insertion/Deletion Errors: 2
            #Assembly Size: 2504401
            #Consensus Quality: 99.9941
            #Substitution Errors: 41
            #Insertion/Deletion Errors: 0
            #Assembly Size: 41288
            #Consensus Quality: 99.9007
            #Substitution Errors: 0
            #Insertion/Deletion Errors: 0
            #Assembly Size: 9191
            #Consensus Quality: 100
            #Substitution Errors: 0
            #Insertion/Deletion Errors: 0
            #Assembly Size: 2767
            #Consensus Quality: 100
            #Step 4: (optional) more rounds of POLCA
            for cluster in cluster_001 cluster_002; do
            cd ${cluster}
            polca.sh -a polypolish.fasta.PolcaCorrected.fa -r "../../../275_K5_Holger_S92_R1_001.fastq.gz ../../../275_K5_Holger_S92_R2_001.fastq.gz" -t 55 -m 120G
            cd ..
            done
            #Substitution Errors: 41
            #Insertion/Deletion Errors: 0
            #Assembly Size: 2504401
            #Consensus Quality: 99.9984
            #Substitution Errors: 8
            #Insertion/Deletion Errors: 0
            #Assembly Size: 41288
            #Consensus Quality: 99.9806
            for cluster in cluster_001 cluster_002; do
            cd ${cluster}
            polca.sh -a polypolish.fasta.PolcaCorrected.fa.PolcaCorrected.fa -r "../../../275_K5_Holger_S92_R1_001.fastq.gz ../../../275_K5_Holger_S92_R2_001.fastq.gz" -t 55 -m 120G
            cd ..
            done
            #Substitution Errors: 8
            #Insertion/Deletion Errors: 0
            #Assembly Size: 2504401
            #Consensus Quality: 99.9997
            #Substitution Errors: 4
            #Insertion/Deletion Errors: 0
            #Assembly Size: 41288
            #Consensus Quality: 99.9903
            for cluster in cluster_001 cluster_002; do
            cd ${cluster}
            polca.sh -a polypolish.fasta.PolcaCorrected.fa.PolcaCorrected.fa.PolcaCorrected.fa -r "../../../275_K5_Holger_S92_R1_001.fastq.gz ../../../275_K5_Holger_S92_R2_001.fastq.gz" -t 55 -m 120G
            cd ..
            done
            #Substitution Errors: 8
            #Insertion/Deletion Errors: 0
            #Assembly Size: 2504401
            #Consensus Quality: 99.9997
            #Substitution Errors: 4
            #Insertion/Deletion Errors: 0
            #Assembly Size: 41288
            #Consensus Quality: 99.9903
            #-- HD5_2_K6 (4) --
            #mean read depth: 116.7x
            #4 bp have a depth of zero (99.9998% coverage)
            #1,022 positions changed (0.0408% of total positions)
            #estimated pre-polishing sequence accuracy: 99.9592% (Q33.89)
            #Step 1: read QC
            fastp --in1 ../../276_K6_Holger_S95_R1_001.fastq.gz --in2 ../../276_K6_Holger_S95_R2_001.fastq.gz --out1 1.fastq.gz --out2 2.fastq.gz --unpaired1 u.fastq.gz --unpaired2 u.fastq.gz
            #Step 2: Polypolish
            for cluster in cluster_001 cluster_002 cluster_003 cluster_004; do
            bwa index ${cluster}/7_final_consensus.fasta
            bwa mem -t 55 -a ${cluster}/7_final_consensus.fasta 1.fastq.gz > ${cluster}/alignments_1.sam
            bwa mem -t 55 -a ${cluster}/7_final_consensus.fasta 2.fastq.gz > ${cluster}/alignments_2.sam
            polypolish polish ${cluster}/7_final_consensus.fasta ${cluster}/alignments_1.sam ${cluster}/alignments_2.sam > ${cluster}/polypolish.fasta
            done
            #Step 3: POLCA
            for cluster in cluster_001 cluster_002 cluster_003 cluster_004; do
            cd ${cluster}
            polca.sh -a polypolish.fasta -r "../../../276_K6_Holger_S95_R1_001.fastq.gz ../../../276_K6_Holger_S95_R2_001.fastq.gz" -t 55 -m 120G
            cd ..
            done
            #Substitution Errors: 164
            #Insertion/Deletion Errors: 2
            #Assembly Size: 2504398
            #Consensus Quality: 99.9934
            #Substitution Errors: 22
            #Insertion/Deletion Errors: 0
            #Assembly Size: 41288
            #Consensus Quality: 99.9467
            #Substitution Errors: 0
            #Insertion/Deletion Errors: 0
            #Assembly Size: 9191
            #Consensus Quality: 100
            #Substitution Errors: 0
            #Insertion/Deletion Errors: 0
            #Assembly Size: 2767
            #Consensus Quality: 100
            #Step 4: (optional) more rounds of POLCA
            for cluster in cluster_001 cluster_002; do
            cd ${cluster}
            polca.sh -a polypolish.fasta.PolcaCorrected.fa -r "../../../276_K6_Holger_S95_R1_001.fastq.gz ../../../276_K6_Holger_S95_R2_001.fastq.gz" -t 55 -m 120G
            cd ..
            done
            #Substitution Errors: 32
            #Insertion/Deletion Errors: 0
            #Assembly Size: 2504400
            #Consensus Quality: 99.9987
            #Substitution Errors: 0
            #Insertion/Deletion Errors: 0
            #Assembly Size: 41288
            #Consensus Quality: 100
            for cluster in cluster_001; do
            cd ${cluster}
            polca.sh -a polypolish.fasta.PolcaCorrected.fa.PolcaCorrected.fa -r "../../../276_K6_Holger_S95_R1_001.fastq.gz ../../../276_K6_Holger_S95_R2_001.fastq.gz" -t 55 -m 120G
            cd ..
            done
            #Substitution Errors: 4
            #Insertion/Deletion Errors: 0
            #Assembly Size: 2504400
            #Consensus Quality: 99.9998
            for cluster in cluster_001; do
            cd ${cluster}
            polca.sh -a polypolish.fasta.PolcaCorrected.fa.PolcaCorrected.fa.PolcaCorrected.fa -r "../../../276_K6_Holger_S95_R1_001.fastq.gz ../../../276_K6_Holger_S95_R2_001.fastq.gz" -t 55 -m 120G
            cd ..
            done
            #Substitution Errors: 2
            #Insertion/Deletion Errors: 0
            #Assembly Size: 2504400
            #Consensus Quality: 99.9999
  21. (Optional) use platon to confirm the plasmid contigs: https://github.com/oschwengers/platon
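            #(sketch) minimal Platon run; the database path platon_db/ and the output directory are placeholders,
            #see the Platon README for the current options:
            #platon --db platon_db/ --output platon_out --threads 16 trycycler/consensus.fasta
            #Platon labels each contig as plasmid- or chromosome-derived, which can be cross-checked against the plsdb BLAST hits from step 7.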