Author Archives: gene_x

Fitting Models for Boxplot Data

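A boxplot does not normally involve curve fitting, because its purpose is to summarize the statistical features of a data distribution rather than to express a specific functional relationship. A trend line or fitted curve can still be added to a boxplot when you want to give extra context or show how the data change across groups.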

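Boxplots are mainly used to:

  • Summarize a data distribution: the median, the interquartile range (IQR), and outliers.
  • Compare distributions across groups: the height and position of the boxes show how groups differ.
  • Identify outliers: data points that fall beyond the whiskers are flagged as outliers.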

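Although a boxplot itself involves no fitting, a fitted curve can be combined with it in the following situations:

  • Trend analysis:

    • If the data are grouped by time, space, or another continuous variable, you can add a trend line (such as a linear regression line) to the boxplot to show how the data change with the grouping variable.
    • For example, use boxplots to show how a variable changes over time, and fit a curve to the overall trend.
  • Probability distribution or density curves:

    • You can combine a boxplot with a kernel density estimate (KDE) curve to show the density of the data distribution.
  • Mathematical model fitting:

    • If you are studying a functional relationship, you can fit a curve to a summary statistic of each group (such as the median).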
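
The density curve mentioned in the list above can be computed separately from the boxplot. The following is a minimal sketch, assuming SciPy's `gaussian_kde` with its default bandwidth rule; the sample values are the Group A values from the example in this post, and the grid limits are an arbitrary choice made for illustration.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Values of one group (Group A from the example in this post)
values = np.array([12, 15, 14, 19, 22, 17, 15, 24, 13, 18], dtype=float)

# Fit a Gaussian kernel density estimate with the default bandwidth rule
kde = gaussian_kde(values)

# Evaluate the density on a grid that extends well beyond the data range
# (the +/- 15 margin is an assumption chosen to cover the tails)
grid = np.linspace(values.min() - 15, values.max() + 15, 1000)
density = kde(grid)

# Sanity check: a density should integrate to roughly 1 over a wide grid
dx = grid[1] - grid[0]
area = float(np.sum(density) * dx)
print(f"area under the KDE curve: {area:.3f}")
```

The `grid` and `density` arrays can then be drawn with `plt.plot` next to the boxplot. A violin plot (for example `seaborn.violinplot`) is a common ready-made way to show a box summary together with a KDE.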

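With data visualization tools (such as the Python libraries Matplotlib or Seaborn), this can be done in the following steps:

  • Draw the boxplot: show the distribution of each group.

  • Compute the trend line or fitted curve: estimate the curve parameters from a summary statistic of each group (such as the median or mean).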

        Choosing a Fitting Model
    
        Based on the complexity of the data relationships, select an appropriate fitting model:
    
        Linear Model: Assumes a linear relationship between the data feature values.
        Polynomial Model: If the trend is nonlinear, a quadratic or higher-order polynomial is suitable for fitting.
        Nonlinear Model: For example, exponential, logarithmic, or other complex models.
    
        Linear Fitting Formula:        y=mx+b
    
        Where:
        y is the feature value (such as the median or mean).
        x is the group identifier (e.g., A=1, B=2, C=3).
        m is the slope, and b is the intercept.
    
        Polynomial Fitting Formula (example for quadratic):        y=ax^2+bx+c
    
        Where: a,b,c are the fitting parameters.
  • Overlay the curve: draw the fitted curve on top of the boxplot.

        import numpy as np
        import pandas as pd
        import matplotlib.pyplot as plt
        import seaborn as sns
        from scipy.stats import linregress

        # Example data: three groups
        data = {
            'Group A': [12, 15, 14, 19, 22, 17, 15, 24, 13, 18],
            'Group B': [22, 17, 15, 24, 23, 20, 18, 21, 25, 19],
            'Group C': [13, 18, 20, 16, 22, 21, 20, 19, 18, 20]
        }

        # Convert the data to a format suitable for a boxplot
        df = pd.DataFrame(data)

        # Draw the boxplot
        plt.figure(figsize=(8, 6))
        sns.boxplot(data=df)

        # Compute the median (or mean) of each group
        groups = np.array([1, 2, 3])  # numeric codes for 'Group A', 'Group B', 'Group C'
        medians = df.median().values  # use the median

        # Linear fit
        slope, intercept, r_value, p_value, std_err = linregress(groups, medians)

        # Fitted values
        fitted_values = slope * groups + intercept

        # Overlay the fitted line. seaborn draws the boxes at x = 0, 1, 2,
        # so shift the group codes by one to line the trend up with the boxes.
        positions = groups - 1
        plt.plot(positions, fitted_values, label='Linear trend line', color='red', linewidth=2)

        ## The boxplot shows the distribution of each group, including the median, quartiles, and outliers.
        ## The red (or green) line shows the fitted curve, i.e. the trend of the median across groups.
        ## Polynomial fit (e.g. quadratic)
        #coefficients = np.polyfit(groups, medians, 2)  # quadratic fit
        #fitted_curve = np.polyval(coefficients, groups)
        #
        ## Overlay the fitted curve
        #plt.plot(positions, fitted_curve, label='Quadratic fit', color='green', linewidth=2)

        # Set the title and axis labels
        plt.title('Boxplot with linear trend line')
        plt.xlabel('Group')
        plt.ylabel('Value')
        plt.xticks([0, 1, 2], ['Group A', 'Group B', 'Group C'])

        # Show the legend
        plt.legend()

        # Show the figure
        plt.show()
    
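        #Data input: a dictionary `data` holds the values of each group.
        #Boxplot: seaborn.boxplot() draws the boxplot.
        #Medians: df.median().values extracts the median of each group as the reference points for the fit.
        #Linear fit: scipy.stats.linregress computes the slope and intercept of the linear fit.
        #Overlay: plt.plot() draws the fitted line (in red) on top of the boxplot.
        #Title, labels and legend: improve the readability of the figure.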
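
As a numerical check on the fitting step, the sketch below recomputes the group medians and both the linear and the quadratic fit without plotting. It uses only NumPy, and the variable names are illustrative. With three groups a quadratic has three free parameters, so it passes through all three medians exactly; it describes the points rather than testing a trend.

```python
import numpy as np

data = {
    "Group A": [12, 15, 14, 19, 22, 17, 15, 24, 13, 18],
    "Group B": [22, 17, 15, 24, 23, 20, 18, 21, 25, 19],
    "Group C": [13, 18, 20, 16, 22, 21, 20, 19, 18, 20],
}

# Numeric group codes (A=1, B=2, C=3) and the median of each group
x = np.array([1.0, 2.0, 3.0])
medians = np.array([np.median(v) for v in data.values()])

# Degree-1 fit: y = m*x + b
m, b = np.polyfit(x, medians, 1)
linear_residuals = medians - (m * x + b)

# Degree-2 fit: y = a2*x^2 + b2*x + c2
a2, b2, c2 = np.polyfit(x, medians, 2)
quadratic_residuals = medians - np.polyval([a2, b2, c2], x)

print("medians:", medians)
print("linear fit: slope =", round(m, 4), "intercept =", round(b, 4))
print("largest linear residual:", np.abs(linear_residuals).max())
print("largest quadratic residual:", np.abs(quadratic_residuals).max())
```

The quadratic residuals are zero up to floating-point error, which is why a higher-order fit is only informative when there are more groups than fitted parameters.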

Investigation of the BKPyV life cycle and antiviral mechanisms of BKPyV-specific inhibitors in relevant in vitro models

https://grk2771.de/projects/project-p7/

Human polyomaviruses (PyV) are highly prevalent and establish lifelong asymptomatic persistence in the healthy immunocompetent host [1,2]. Under immunosuppression, however, these viruses can reactivate and cause life-threatening disease through uncontrolled viral replication (e.g. BKV-induced PyV-associated nephropathy, PVAN) [1]. Currently, no specific antiviral treatment is available. This is partly due to the lack of small animal models and the availability of only poor surrogate in vitro/in vivo systems. We have recently identified 16 small-molecule inhibitors, C1-16, against BKV using a phenotypic high-throughput screen (Kraus et al., unpublished). For the further development of these inhibitors, it is essential to understand their cellular target structure and/or which part of the viral life cycle they inhibit.

Within this project we will gain a better understanding of the BKV life cycle in relevant infection systems (e.g. primary cells and organoids). We will characterize the previously identified antiviral compounds in terms of their interference with host structures essential for viral reproduction (transport vesicles, nuclear uptake, replication compartments or vesicle-dependent egress). Furthermore, we will characterize specific viral inhibitors at the molecular and structural level.

The project uses organoids and primary cells as infection models and for BKV inhibitor characterization. It applies confocal live-cell microscopy to follow BKV entry and spread, and X-ray crystallography to characterize inhibitor/target interactions.

References

1. Chong S, Antoni M, Macdonald A, Reeves M, Harber M, Magee CN (2019) BK virus: Current understanding of pathogenicity and clinical disease in transplantation. Rev Med Virol 29:e2044.

2. Theiss JM, Günther T, Alawi M, Neumann F, Tessmer U, Fischer N, Grundhoff A (2015) A Comprehensive Analysis of Replicating Merkel Cell Polyomavirus Genomes Delineates the Viral Transcription Program and Suggests a Role for mcv-miR-M1 in Episomal Persistence. PLoS Pathog 11:e1004974.

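Diagnostics revisited: microbial cell-free DNA sequencing: addressing unresolved challenges in implant-associated cardiovascular infections

  • Cardiovascular implant-associated infections (CVIAI) are a major challenge, with a mortality of about 15% for infections of cardiac implantable electronic devices (CIEDs) and of 15% to 80% for prosthetic valve endocarditis (PVE) and vascular graft or endograft infections (VGEI).

  • Rapid pathogen identification is essential for starting effective CVIAI therapy early, and blood culture remains the diagnostic gold standard.

  • However, growth-dependent culture methods can lose sensitivity when the infection is caused by fastidious microorganisms or the patient has already received antibiotics.

  • As a result, the proportion of CVIAI cases in which blood culture establishes the etiology is usually low, at around 50%.

  • Etiological confirmation therefore often depends on low-yield cultures from tissue samples or from implants removed after prolonged antibiotic therapy.

  • Because only a minority of CVIAI cases undergo surgical repair or device removal, the causative microorganism remains unidentified in a large number of cases.

  • Molecular techniques such as broad-range PCR or species-specific PCR can overcome the shortcomings of culture.

  • Indeed, broad-range PCR has been applied successfully to explanted cardiac tissue, with a detection rate significantly higher than that of conventional culture.

  • Broad-range PCR was therefore incorporated into the Duke criteria for the diagnosis of infective endocarditis in 2023.

  • However, methods that detect pathogens directly from blood have largely been unsuccessful in sepsis, probably because the pathogen load lies below the analytical sensitivity of most available PCR assays.

  • Innovative culture-independent assays that work on peripheral blood and offer higher sensitivity are therefore urgently needed.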

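  • In recent years, new strategies have emerged that use unbiased sequencing (clinical metagenomics) for hypothesis-free pathogen detection.

  • Microbial cell-free DNA sequencing (mcfDNA-Seq), which targets small DNA fragments shed by microorganisms into the bloodstream, has shown particularly promising results in terms of coverage and analytical sensitivity.

  • The method allows detection directly from blood samples of patients with sepsis.

  • mcfDNA-Seq is faster than conventional culture, which can take days or weeks to detect slow-growing microorganisms.

  • Although broad-range PCR also speeds up diagnosis, mcfDNA-Seq is hypothesis-free and can detect a broad range of pathogens with a single test.

  • Several commercial platforms offer mcfDNA-Seq, such as the Karius test, widely used in the United States, and DISQVER from Noscendo, used in Europe.

  • A recent study in patients with sepsis reported 93.7% concordance between mcfDNA-Seq and conventional blood culture, with a significantly higher detection rate for the molecular test (169 versus 132 etiologically confirmed cases).

  • This makes mcfDNA-Seq particularly interesting for CVIAI diagnostics, especially because the method does not depend on hard-to-obtain biopsies or on device removal.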

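  • However, experience with mcfDNA-Seq in CVIAI is still limited.

  • In a small proof-of-concept study, mcfDNA-Seq detected a pathogen in 3 of 7 patients with VGEI.

  • In another study of 34 patients with infective endocarditis (IE), 22 of whom (65%) had PVE, mcfDNA-Seq was positive in 24 cases (71%), including 3 of 6 patients (50%) with blood culture-negative endocarditis.

  • Similarly, in 23 confirmed IE cases, including 12 cases of PVE or CIED-associated endocarditis, mcfDNA-Seq and blood culture each had a sensitivity of 87%, even though the mean time from the start of antibiotic therapy to mcfDNA sampling was significantly longer than the time to blood culture sampling (11.7 days versus 0.09 days, p < 0.01).

  • The persistence of mcfDNA-Seq positivity under antibiotic therapy is particularly noteworthy: mcfDNA remained detectable for a median of 38.1 days after the start of treatment, compared with only 3.7 days for blood culture (proportional odds, 2.952; p = 0.028).

  • A recent abstract evaluated whether mcfDNA-Seq can predict CIED involvement in sepsis caused by Staphylococcus aureus or Staphylococcus epidermidis by quantifying mcfDNA-Seq reads in plasma.

  • The authors suggest that combining plasma mcfDNA-Seq with relevant clinical parameters could provide valuable insight for identifying patients who do not need device removal.

  • Similarly, Eichenberger et al. (2015) showed that patients with septic and metastatic infection remained mcfDNA-Seq positive for longer than patients with localized disease (22 days versus 8 days, p = 0.0054), whereas mcfDNA declined faster in patients who underwent surgical debridement or device removal.

  • mcfDNA-Seq may therefore be able to detect persistent infection of vascular grafts.

  • To our knowledge, mcfDNA-Seq has not yet been used to monitor inoperable CVIAI patients receiving suppressive antimicrobial therapy.

  • Based on the results above, however, further studies appear warranted, because the technique might help avoid unnecessary long-term antibiotic use.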

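  • mcfDNA-Seq has some important limitations as a diagnostic tool.

  • A major challenge is the difficulty of correlating sequencing results with clinical findings, because detection of microbial DNA does not always indicate active infection.

  • Particularly at low read counts, common contaminants (such as coagulase-negative staphylococci or cutibacteria) can markedly affect the analytical sensitivity and specificity of cfDNA sequencing, which calls for strict interpretation thresholds set according to the species, the platform, and the intended application.

  • There is also a risk of detecting DNA from unrelated sources (for example, bacterial translocation from the gut or transient bacteremia caused by tooth brushing), which can lead to misdiagnosis and to unnecessary or overly long antibiotic therapy.

  • In addition, like conventional culture, mcfDNA-Seq is prone to contamination, especially during sample collection and subsequent processing.

  • Interdisciplinary teams of microbiologists, bioinformaticians, and infectious disease specialists, together with cfDNA databases from patient and control cohorts, are key to interpreting patient results.

  • This approach makes it possible to set sequencing thresholds and so obtain more accurate results.

  • Furthermore, although the method can detect pathogens, it does not necessarily provide broad genome coverage, which limits its ability to predict antimicrobial susceptibility or to support strain typing.

  • Integrating mcfDNA-Seq into routine diagnostics is a considerable challenge even in large university hospitals, because the workflow is complex, requires specialized expertise, and lacks a clear regulatory framework.

  • This complexity, together with the lack of standardized protocols, can delay the delivery of results, which matters most in complex cases such as CVIAI where rapid results are critical.

  • Ensuring laboratory quality control and accreditation for cfDNA sequencing is complex, particularly when these requirements must be balanced against the need for rapid results.

  • In addition, accumulating experience and data through carefully designed prospective cohort studies is key to professionalizing this emerging field.

  • At present, most studies are case series, a design that is inherently prone to bias.

  • Moreover, many studies are carried out with the help of external commercial providers, which usually lack full access to patient data and medical expertise.

  • This highlights the lack of well-designed studies validating the clinical utility of cfDNA sequencing. These companies should therefore be held accountable for conducting rigorous peer-reviewed studies.

  • As clinical microbiologists and infectious disease specialists, we should demand well-designed studies that better define the added value of cfDNA sequencing before advocating cfDNA testing as a routine method.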

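  • The use of mcfDNA-Seq in regional hospitals is a key issue that also needs further discussion.

  • Large academic centers may have the resources to adopt the technology, but regional hospitals may face cost and logistical challenges that make offering cfDNA testing less feasible.

  • One solution is to use centralized laboratories that offer rapid cfDNA sequencing and deliver results within a short time, supporting timely clinical decisions.

  • However, maintaining data transparency throughout the process, including databases and control results, is essential.

  • This information must be communicated effectively to the microbiology and infectious disease specialists at regional hospitals to ensure that results are interpreted accurately and in the context of the individual patient.

  • A major factor hindering the widespread use of mcfDNA-Seq is its considerable cost, which not only limits its accessibility in routine diagnostics but also poses a major challenge for large-scale independent research in the field.

  • Nevertheless, including mcfDNA-Seq in prospective studies will be a necessary prerequisite for identifying the additional diagnostic value of the test and for justifying reimbursement by insurers.

  • Because the high cost is partly due to monopolies on the relevant sequencing technology and analysis, further research into open sequencing pipelines and databases, and the use of alternative sequencing platforms, appears necessary.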

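  • In summary, mcfDNA-Seq is emerging as a promising, unbiased, non-invasive diagnostic tool for CVIAI.

  • Its potential goes beyond pathogen identification: it may also serve as a marker that clarifies whether the implant is actually involved and that assesses the persistence of infection, particularly in patients who cannot undergo surgery.

  • Using mcfDNA-Seq for personalized medicine could change the diagnostic landscape by providing individualized insight into infection dynamics and treatment response.

  • Realizing these potential benefits, however, will require extensive research on cardiovascular implant-associated infections in well-designed prospective studies with strict, targeted inclusion criteria that focus on populations at high risk of treatment failure or relapse (for example, inoperable CVIAI requiring long-term suppressive therapy).

  • This will require more research initiatives led and supported by university hospitals that promote interdisciplinary collaboration, oversee the process from the indication for mcfDNA-Seq sampling to the interpretation of results, and directly influence patient care.

  • In addition, national or international research funding is essential to support independent research in this field and to provide adequate insurance coverage for inpatient indications.

  • Such a joint effort is essential to define the eventual role of mcfDNA-Seq in future routine diagnostic workflows.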

Variant Calling for Herpes Simplex Virus 1 from Patient Sample Using Capture Probe Sequencing

  1. The following data-cleaning strategies were applied before variant calling:

    * Note that the intrahost results do not contain interhost variants.
    * Interhost variants are reported with two methods (freebayes and spandx); intrahost variants are reported with viral-ngs.
    * The interhost results are released in point 4 and the intrahost results in step 13.
    * Merge the two results, and delete a record from the intrahost variants if it also occurs in the interhost tables.
    * A record in the intrahost table whose frequency is >= 0.5 in one isolate and < 0.5 in another isolate belongs in the interhost table. If the frequency is >= 0.5 in both isolates, or < 0.5 in both, the record does not belong in the interhost table.
    * The correctness of the intrahost variant results can be roughly checked against the table from point 18, generated by "~/Scripts/check_sequence_differences.py aligned_1_.aln".
    * At the end, we should have an interhost variant calling table plus an intrahost variant calling table in which the frequency varies between 0.05 and 0.5.
    * Another control method: merge the two tables and sort them by coordinate, then compare the coordinates with the results of ~/Scripts/check_sequence_differences.py aligned_1_.aln. They should be similar.
    * If a record occurs in the table from point 18 but not in the intrahost+interhost table, the base was wrongly called during the assembly.
    * Correcting the assembly in the steps data/02_assembly and data/03_multialign_to_ref/aligned_1.fasta is actually not very critical, because a wrongly called base shows up as an intrahost record with a frequency >= 0.5.
    * The earlier error: only the intrahost variant calling was reported, with few interhost variants. The interhost variants are found when the bases in the assembly are wrongly called.
    * !!!! IMPORTANT: Delete the last position of the alleles if there are three in the alleles in the intrahost Excel-table before releasing !!!!
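
The frequency rule in the notes above can be written down as a small function. This is a sketch under stated assumptions, not the script used in this analysis: the function name `classify_variant` and the label `"neither"` are made up for illustration, and the thresholds 0.05 and 0.5 are taken from the notes.

```python
def classify_variant(frequencies, low=0.05, high=0.5):
    """Classify one variant position from its per-isolate allele frequencies.

    interhost: frequency >= high in at least one isolate and < high in another.
    intrahost: frequency < high in every isolate and >= low in at least one.
    neither:   everything else (e.g. >= high in all isolates, or below `low`
               everywhere); the label itself is an assumption of this sketch.
    """
    above = [f >= high for f in frequencies]
    if any(above) and not all(above):
        return "interhost"
    if not any(above) and any(f >= low for f in frequencies):
        return "intrahost"
    return "neither"


# Position 65225 from the notes: 0 in HSV1_S1, about 0.89 in HSV-Klinik_S2
print(classify_variant([0.0, 0.891530460624071]))
# Hypothetical frequencies for illustration
print(classify_variant([0.10, 0.30]))
print(classify_variant([0.90, 0.80]))
```

Keeping the rule in one function makes it easier to apply the same thresholds to every record before the interhost and intrahost tables are merged and checked by hand.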
    
    #Report the interhost+intrahost results to Nicole.
    • Input BAM File
      • The original BAM file (e.g., HSV-Klinik_S2.raw.bam) was used as the initial input.
    • BMTagger Depletion
      • BMTagger was employed to remove reads matching contaminants, using specified databases. The resulting file (e.g., HSV-Klinik_S2.bmtagger_depleted.bam) is expected to have potential contaminants removed.
      • Databases Used:
        • metagenomics_contaminants_v3: Contains sequences commonly found as contaminants in metagenomic samples.
        • GRCh38_ncRNA-GRCh38_transcripts-HS_rRNA_mitRNA: A comprehensive database of human RNA, including non-coding RNA (ncRNA), transcripts, ribosomal RNA (rRNA), and mitochondrial RNA (mitRNA).
        • hg38: A version of the human genome.
    • Duplicate Removal
      • Duplicates were removed, producing a file (e.g., HSV-Klinik_S2.rmdup.bam) with reduced PCR duplicates, which helps decrease bias in downstream analyses.
    • BLASTn Depletion
      • After duplicate removal, BLASTn was used to further refine the file by removing remaining non-target reads. The output (e.g., HSV-Klinik_S2.cleaned.bam) should be free of contaminants and duplicates.
      • Databases Used (blastDbs):
        • hybsel_probe_adapters: Contains sequences for hybrid selection probe adapters and synthetic sequences used in sequencing and library preparation.
        • metag_v3.ncRNA.mRNA.mitRNA.consensus: A curated database of consensus sequences, including non-coding RNA, mRNA, and mitochondrial RNA.
    • Taxonomic Filtering
      • HSV-1-specific sequences were isolated by filtering with a custom database of 161 complete HSV-1 genomes from GenBank (see the end of this email). The tool LAST was used (documentation: [https://docs.hpc.qmul.ac.uk/apps/bio/last/]), producing the taxfiltBam file (e.g., HSV-Klinik_S2.taxfilt.bam).
    • Assembly with Taxonomically Filtered Reads
    • Precise Mapping
      • Using the aligner novoalign with alignment options -r Random -l 20 -g 40 -x 20 -t 100 -k, I created a file (HSV-Klinik_S2.mapped.bam) containing the reads aligned to the sample's own assembly (align to self).

          Read Counts for BAM Files:

          File                                   Read Count
          HSV1_S1.raw.bam                        1,816,139 × 2
          HSV1_S1.bmtagger_depleted.bam          1,750,387 × 2
          HSV1_S1.rmdup.bam                      1,278,873 × 2
          HSV1_S1.cleaned.bam                      664,544 × 2
          HSV1_S1.taxfilt.bam                       22,841 × 2
          HSV1_S1.mapped.bam                           131 × 2
          HSV-Klinik_S2.raw.bam                  2,709,058 × 2
          HSV-Klinik_S2.bmtagger_depleted.bam    1,582,923 × 2
          HSV-Klinik_S2.rmdup.bam                  595,066 × 2
          HSV-Klinik_S2.cleaned.bam                442,841 × 2
          HSV-Klinik_S2.taxfilt.bam                400,301 × 2
          HSV-Klinik_S2.mapped.bam                  80,915 × 2

          bin/taxon_filter.py deplete \
            inBam=data/00_raw/HSV-Klinik_S2.bam \
            revertBam=tmp/01_cleaned/HSV-Klinik_S2.raw.bam \
            bmtaggerBam=tmp/01_cleaned/HSV-Klinik_S2.bmtagger_depleted.bam \
            rmdupBam=tmp/01_cleaned/HSV-Klinik_S2.rmdup.bam \
            blastnBam=data/01_cleaned/HSV-Klinik_S2.cleaned.bam \
            bmtaggerDbs=['/home/jhuang/REFs/viral_ngs_dbs/bmtagger_dbs_remove/metagenomics_contaminants_v3', \
              '/home/jhuang/REFs/viral_ngs_dbs/bmtagger_dbs_remove/GRCh37.68_ncRNA-GRCh37.68_transcripts-HS_rRNA_mitRNA', \
              '/home/jhuang/REFs/viral_ngs_dbs/bmtagger_dbs_remove/hg19'] \
            blastDbs=['/home/jhuang/REFs/viral_ngs_dbs/blast_dbs_remove/hybsel_probe_adapters', \
              '/home/jhuang/REFs/viral_ngs_dbs/blast_dbs_remove/metag_v3.ncRNA.mRNA.mitRNA.consensus'] \
            srprism_memory=14250 \
            chunkSize=1000000 \
            clear_tags=False \
            tags_to_clear=['XT', 'X0', 'X1', 'XA', 'AM', 'SM', 'BQ', 'CT', 'XN', 'OC', 'OP'] \
            JVMmemory=50g \
            threads=120 \
            loglevel=INFO \
            tmp_dir=/tmp \
            tmp_dirKeep=False

          inBam        Input BAM file.
          revertBam    Output BAM: read markup reverted with Picard.
          bwaBam       Output BAM: depleted of reads with BWA.
          bmtaggerBam  Output BAM: depleted of reads with BMTagger.
          rmdupBam     Output BAM: bmtaggerBam run through M-Vicuna duplicate removal.
          blastnBam    Output BAM: rmdupBam run through another depletion of reads with BLASTN.
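
To see how much each cleaning step removes, the read-pair counts listed above can be turned into the fraction of raw read pairs that survives each step. A minimal sketch: the counts are copied from the read-count table, while the dictionary layout and the helper name `retention` are illustrative.

```python
# Read-pair counts per cleaning step, copied from the read-count table above
read_pairs = {
    "HSV1_S1": {
        "raw": 1816139,
        "bmtagger_depleted": 1750387,
        "rmdup": 1278873,
        "cleaned": 664544,
        "taxfilt": 22841,
        "mapped": 131,
    },
    "HSV-Klinik_S2": {
        "raw": 2709058,
        "bmtagger_depleted": 1582923,
        "rmdup": 595066,
        "cleaned": 442841,
        "taxfilt": 400301,
        "mapped": 80915,
    },
}


def retention(counts):
    """Return the fraction of raw read pairs remaining after each step."""
    raw = counts["raw"]
    return {step: n / raw for step, n in counts.items()}


for sample, counts in read_pairs.items():
    for step, fraction in retention(counts).items():
        print(f"{sample}\t{step}\t{counts[step]}\t{fraction:.2%}")
```

Expressed this way, the two samples are easy to compare: only a small share of the HSV1_S1 read pairs passes the taxonomic filter, whereas most of the cleaned HSV-Klinik_S2 read pairs do.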

  2. Use the bengal3_ac3 pipeline to get trimmed reads and snippy interhost variants (this does not work for viruses!)

    #using the env bengal3_ac3
    conda activate bengal3_ac3
    mkdir interhost_variants; cd interhost_variants
    
    #prepare scripts
    cp /home/jhuang/Tools/bacto/bacto-0.1.json .
    cp /home/jhuang/Tools/bacto/cluster.json .
    cp /home/jhuang/Tools/bacto/Snakefile .
    ln -s /home/jhuang/Tools/bacto/local .
    ln -s /home/jhuang/Tools/bacto/db .
    ln -s /home/jhuang/Tools/bacto/envs .
    
    #preparing raw_data
    mkdir raw_data; cd raw_data
    ln -s ~/DATA/Data_Nicole_CaptureProbeSequencing/20241028_FS10003086_74_BTR67801-2217/Alignment_Imported_1/20241029_175539/Fastq/HSV1_S1_L001_R1_001.fastq.gz HSV1_S1_R1.fastq.gz
    ln -s ~/DATA/Data_Nicole_CaptureProbeSequencing/20241028_FS10003086_74_BTR67801-2217/Alignment_Imported_1/20241029_175539/Fastq/HSV1_S1_L001_R2_001.fastq.gz HSV1_S1_R2.fastq.gz
    ln -s ~/DATA/Data_Nicole_CaptureProbeSequencing/20241028_FS10003086_74_BTR67801-2217/Alignment_Imported_1/20241029_175539/Fastq/HSV-Klinik_S2_L001_R1_001.fastq.gz HSV-Klinik_S2_R1.fastq.gz
    ln -s ~/DATA/Data_Nicole_CaptureProbeSequencing/20241028_FS10003086_74_BTR67801-2217/Alignment_Imported_1/20241029_175539/Fastq/HSV-Klinik_S2_L001_R2_001.fastq.gz HSV-Klinik_S2_R2.fastq.gz
    #ln -s ~/DATA/Data_Nicole_CaptureProbeSequencing/20241028_FS10003086_74_BTR67801-2217/Alignment_Imported_1/20241029_175539/Fastq/NTC_S3_L001_R1_001.fastq.gz NTC_S3_R1.fastq.gz
    #ln -s ~/DATA/Data_Nicole_CaptureProbeSequencing/20241028_FS10003086_74_BTR67801-2217/Alignment_Imported_1/20241029_175539/Fastq/NTC_S3_L001_R2_001.fastq.gz NTC_S3_R2.fastq.gz
    
    #preparing bacto-0.1.json.
    "fastqc": false,
    "taxonomic_classifier": false,
    "assembly": false,
    "typing_ariba": false,
    "typing_mlst": false,
    "pangenome": false,
    "variants_calling": true,
    "phylogeny_fasttree": true,
    "phylogeny_raxml": true,
    "recombination": true,
    
    "genus": "Herpesvirus",
    "kingdom": "Viruses",
    "species": "human herpesvirus 1",
    "species": "herpes"
    "reference": "db/OP297860.gb",
    
    (bengal3_ac3) /home/jhuang/miniconda3/envs/snakemake_4_3_1/bin/snakemake --printshellcmds
    
    # --DEBUG_1 (this step does not need to be run if the configuration is changed!)--
    prokka --force --outdir prokka/HSV-Klinik_S2 --cpus 2 --usegenus --genus Herpesvirus --kingdom Viruses --species human herpesvirus 1 --addgenes --addmrna --prefix HSV-Klinik_S2 --locustag HSV-Klinik_S2  shovill/HSV-Klinik_S2/contigs.fa -hmm /media/jhuang/Titisee/GAMOLA2/TIGRfam_db/TIGRFAMs_15.0_HMM.LIB
    
    # - use bakta instead because prokka fails with an error (bakta does not work either, because the FASTA file is too large)
    bakta --db /mnt/nvme0n1p1/bakta_db shovill/HSV-Klinik_S2/contigs.fa --prefix HSV-Klinik_S2 --output prokka/HSV-Klinik_S2 --force
    
    # ---- run freebayes directly as follows ----
    
        cd data
        mkdir 02_align_to_OP297860
        ../bin/read_utils.py align_and_fix 01_per_sample/HSV1_S1.cleaned.bam ../refsel_db/refsel.fasta --outBamAll 02_align_to_OP297860/HSV1_S1.bam --outBamFiltered 02_align_to_OP297860/HSV1_S1.mapped.bam --aligner novoalign --aligner_options '-r Random -l 20 -g 40 -x 20 -t 100 -k' --threads 120
        ../bin/read_utils.py align_and_fix 01_per_sample/HSV-Klinik_S2.cleaned.bam ../refsel_db/refsel.fasta --outBamAll 02_align_to_OP297860/HSV-Klinik_S2.bam --outBamFiltered 02_align_to_OP297860/HSV-Klinik_S2.mapped.bam --aligner novoalign --aligner_options '-r Random -l 20 -g 40 -x 20 -t 100 -k' --threads 120
        samtools sort 02_align_to_OP297860/HSV-Klinik_S2.mapped.bam -o HSV-Klinik_S2_reads_aligned_sorted.bam
        samtools index HSV-Klinik_S2_reads_aligned_sorted.bam
        freebayes -f ../ref_genome/reference.fasta -i HSV-Klinik_S2_reads_aligned_sorted.bam --min-coverage 10 --min-alternate-count 3 --vcf freebayes_interhost_out.vcf
    
        #CHROM  POS ID  REF ALT QUAL
        OP297860.1  8885    .   A   G   7.37349e-13
        OP297860.1  8895    .   A   G   6.00837e-05
        OP297860.1  8956    .   A   G   339.579
        OP297860.1  8991    .   ATTGT   CCTGC   3188.1
        OP297860.1  9616    .   C   A   4.44801e-14
        OP297860.1  12748   .   C   A   63475.5
        #HSV-Klinik_S2-1 13203   A       C       0.8466  snp     1.34479 C:1581:1060:1581:1060:1 A:21:15:21:15:1
        * OP297860.1    13203   .   T   C   86820.7
        OP297860.1  13755   .   G   A   107298
        OP297860.1  14114   .   C   A   1.21987e-13
        OP297860.1  46861   .   T   C   710.176
        * OP297860.1    47109   .   T   G   9375.53
        OP297860.1  47170   .   G   T   5942.86
        OP297860.1  47182   .   G   A   6108.66
        OP297860.1  47320   .   A   G   10275.4
        OP297860.1  47377   .   G   T   972.379
        OP297860.1  47516   .   T   C   257.388
        OP297860.1  47563   .   G   A   372.177
        OP297860.1  47660   .   G   A   438.692
        OP297860.1  47707   .   T   C   3252.11
        OP297860.1  47722   .   A   G   5343.39
        OP297860.1  48064   .   G   A   21575.7
        OP297860.1  48113   .   C   T   4284.1
        OP297860.1  48129   .   T   C   1778.66
        OP297860.1  48167   .   T   C   3316.44
        OP297860.1  48219   .   A   C   6892.21
        OP297860.1  48398   .   C   A   5.72805e-16
        OP297860.1  53216   .   G   T   2031
        OP297860.1  53298   .   A   G   465.154
        OP297860.1  53423   .   C   T   5586.37
        OP297860.1  54025   .   A   G   385.75
        OP297860.1  54073   .   G   A   8463.94
        OP297860.1  54408   .   T   G   2923.39
        OP297860.1  54568   .   G   T   1391.08
        OP297860.1  54708   .   TG  GA,TA   840.319
        OP297860.1  54769   .   G   T   1.72979e-14
        * OP297860.1    55501   .   T   C   33158.1
        * OP297860.1    55807   .   C   A   0
        OP297860.1  56493   .   A   G   39336.9
        OP297860.1  56867   .   C   A   7.83521e-14
        OP297860.1  57513   .   C   A   0
        OP297860.1  58047   .   A   T   4.21917e-15
        OP297860.1  58054   .   C   A   0
        OP297860.1  58056   .   ACCA    TCCT    0
        OP297860.1  58075   .   ACTC    GCTT    2947.03
        OP297860.1  63377   .   C   A   0
        OP297860.1  63393   .   G   T   1.39225e-14
        OP297860.1  65179   .   T   C   7903.32
        * OP297860.1    65225   .   G   A   13223.5
        * OP297860.1    65402   .   C   A   1.53811e-13
        OP297860.1  65992   .   T   C   25982.5
        OP297860.1  66677   .   G   A   5.27367e-15
        OP297860.1  67131   .   C   A   225.935
        OP297860.1  67336   .   G   A   8.13698e-15
        OP297860.1  94706   .   C   A   0
        OP297860.1  94709   .   G   T   0
        * OP297860.1    94750   .   G   T   0
        OP297860.1  95750   .   C   A   2.89975e-08
        OP297860.1  95990   .   C   A   0
        OP297860.1  96070   .   G   T   0
        OP297860.1  137360  .   G   T   0
        OP297860.1  137373  .   C   A   0
        OP297860.1  137527  .   A   T   4880.59
        OP297860.1  137569  .   C   T   10142.1
        OP297860.1  137602  .   C   A   19065.3
        OP297860.1  137986  .   A   G   0
        OP297860.1  138170  .   T   C   53588.3
        OP297860.1  138343  .   C   T   7310.38
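
Many of the freebayes records above have a QUAL close to zero, so a quality cutoff is a natural first filter. The sketch below parses a few of the records shown above and keeps those at or above a cutoff. The cutoff `MIN_QUAL = 20` is an assumption chosen for illustration and is not a value used in this analysis; the helper name `parse_records` is also illustrative.

```python
# A few records copied from the freebayes output above (CHROM POS ID REF ALT QUAL)
records_text = """
#CHROM  POS ID  REF ALT QUAL
OP297860.1  8885    .   A   G   7.37349e-13
OP297860.1  8895    .   A   G   6.00837e-05
OP297860.1  8956    .   A   G   339.579
OP297860.1  8991    .   ATTGT   CCTGC   3188.1
OP297860.1  9616    .   C   A   4.44801e-14
OP297860.1  12748   .   C   A   63475.5
OP297860.1  57513   .   C   A   0
"""


def parse_records(text):
    """Parse whitespace-separated VCF-like lines, skipping blanks and '#' lines."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        chrom, pos, _id, ref, alt, qual = line.split()[:6]
        records.append(
            {"chrom": chrom, "pos": int(pos), "ref": ref, "alt": alt, "qual": float(qual)}
        )
    return records


MIN_QUAL = 20.0  # assumed cutoff, for illustration only

records = parse_records(records_text)
passed = [r for r in records if r["qual"] >= MIN_QUAL]
for r in passed:
    print(r["chrom"], r["pos"], r["ref"], r["alt"], r["qual"])
```

On a full VCF file the same filter is usually applied with a dedicated tool; the point here is only to make the effect of a QUAL cutoff on the table above explicit.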
  3. spandx variant calling (see http://xgenes.com/article/article-content/314/call-and-merge-snp-and-indel-results-from-using-two-variant-calling-methods/)

    mkdir ~/miniconda3/envs/spandx/share/snpeff-5.1-2/data/OP297860
    #cp OP297860.gb  ~/miniconda3/envs/spandx/share/snpeff-5.1-2/data/OP297860/genes.gbk
    vim ~/miniconda3/envs/spandx/share/snpeff-5.1-2/snpEff.config
    /home/jhuang/miniconda3/envs/spandx/bin/snpEff build OP297860     -d
    #Protein check:  OP297860        OK: 73  Not found: 0    Errors: 0       Error percentage: 0.0%
    
            ## -- tried to build the database from GFF + FASTA files, but this failed --
            #cp OP297860.gff3 ~/miniconda3/envs/spandx/share/snpeff-5.1-2/data/OP297860/genes.gff
            #cp OP297860.fasta ~/miniconda3/envs/spandx/share/snpeff-5.1-2/data/OP297860//sequences.fa
            #vim ~/miniconda3/envs/spandx/share/snpeff-5.1-2/snpEff.config
            ##OP297860.genome : Herpes_Simplex_Virus_1
            #snpEff build -v OP297860  #-gff3
            #cd /path/to/snpEff/data
            #mkdir NC_001806
            #cp NC_001806.gff3 NC_001806/genes.gff
            #cp NC_001806.fa NC_001806/sequences.fa
    
            ##NC_001806.genome : Herpes_Simplex_Virus_1
            ##bcftools reheader -h new_header.vcf HSV-Klinik_S2.PASS.snps.vcf -o updated_vcf.vcf
            ##table_annovar <input.vcf> <humandb> -buildver <genome_version> -out <output_prefix> -protocol <protocol_list> -operation <operation_list>
    
    cd trimmed
    mv HSV1_S1_trimmed_P_1.fastq HSV1_S1_R1.fastq
    mv HSV1_S1_trimmed_P_2.fastq HSV1_S1_R2.fastq
    mv HSV-Klinik_S2_trimmed_P_1.fastq HSV-Klinik_S2_R1.fastq
    mv HSV-Klinik_S2_trimmed_P_2.fastq HSV-Klinik_S2_R2.fastq
    gzip *_R1.fastq *_R2.fastq
    cp ref_genome/reference.fasta OP297860.fasta
    #Clean the header to only retain the accession-id "OP297860.1"
    
    ln -s /home/jhuang/Tools/spandx/ spandx
    (spandx) nextflow run spandx/main.nf --fastq "trimmed/*_R{1,2}.fastq.gz" --ref OP297860.fasta --annotation --database OP297860 -resume
    
    # -- DEBUG: All_SNPs_indels_annotated.txt is not correctly annotated, manually rerun snpeff-4.1l-8 and related steps --
    ## OPTION_1: copy the viral-ngs4 database to the spandx database; failed because of the version difference --
    #cp /home/jhuang/miniconda3/envs/viral-ngs4/share/snpeff-4.1l-8/data/1158c840951524dbd03a1a055a837d3828f6f29af1ec2771219e77c/genes.gbk .
    ##/home/jhuang/miniconda3/envs/spandx/bin/snpEff build OP297860     -d
    
    # OPTION_2: run via interhost.py (SUCCESSFUL!)
    #repeat the processing in spandx/bin/SNP_matrix.sh to generate All_SNPs_indels_annotated.txt, running the snpEff step with 'bin/interhost.py snpEff' in the env viral-ngs4
    
    cd work/f8/93141f3ef382d7ac9dd40def9c50ce  # last directory sorted by timestamp
    
    #gatk VariantsToTable -V out.vcf -F CHROM -F POS -F REF -F ALT -F TYPE -GF GT -O out.vcf.table.all
    #
    ##clean-up the out.vcf.table.all because GATK outputs A/A
    #sed -i 's#|#/#g' out.vcf.table.all
    #awk ' { for (i=6; i<=NF; i++) {
    #        if ($i == "A/A") $i="A";
    #        if ($i == "G/G") $i="G";
    #        if ($i == "C/C") $i="C";
    #        if ($i == "T/T") $i="T";
    #        if ($i == "*/*") $i="*";
    #        if ($i == "./.") $i=".";
    #        }};
    #        {print $0} ' out.vcf.table.all > out.vcf.table.all.tmp
    #awk ' { for (i=6; i<=NF; i++) {
    #        if ($i ~ /\//) {
    #          split($i, a, "/");
    #        if (a[1] == a[2]) $i=a[1];
    #           }
    #         }
    #       };
    #       {print $0} ' out.vcf.table.all.tmp > out.vcf.table.all
    
    # Switch the env to viral-ngs4 and manually run snpEff
    #(viral-ngs4) jhuang@WS-2290C:~/DATA/Data_Nicole_CaptureProbeSequencing/work/ea/6f30cd5eed0efbbf3e3fe1ddfac0df$ snpEff eff -no-downstream -no-intergenic -ud 100 -formatEff -v 1158c840951524dbd03a1a055a837d3828f6f29af1ec2771219e77c out.vcf > out.annotated.vcf  ##-debug
    # Number of chromosomes      : 1
    # Chromosomes                : Format 'chromo_name size codon_table'
    #               'OP297860'      152526  Standard
    #vim /home/jhuang/miniconda3/envs/viral-ngs4/share/snpeff-4.1l-8/snpEff.config
    #1158c840951524dbd03a1a055a837d3828f6f29af1ec2771219e77c.chromosomes : OP297860
    
    # Alternative snpEff calling with "bin/interhost.py snpEff"
    #(viral-ngs4) jhuang@WS-2290C:~/DATA/Data_Nicole_CaptureProbeSequencing/work/ea/6f30cd5eed0efbbf3e3fe1ddfac0df$ ../../../bin/interhost.py snpEff out.filtered.vcf OP297860.1 out.annotated.vcf j.huang@uke.de --loglevel DEBUG
    
    #remove headers from annotated vcf and out.vcf
    grep -v '#' out.annotated.vcf > out.annotated.vcf.headerless
    #grep -v '#' out.vcf > out.vcf.headerless
    awk '{
        if (match($0,"EFF=")){print substr($0,RSTART)}
        else
        print ""
        }' out.annotated.vcf.headerless > effects
    
    sed -i 's/EFF=//' effects
    sed -i 's/(/ /g' effects
    sed -i 's/|/ /g' effects
    sed -i 's/UPSTREAM MODIFIER /UPSTREAM MODIFIER - /g' effects
    cut -d " " -f -8 effects > effects.mrg
    sed -i 's/ /\t/g' effects.mrg
    rm effects
    
    tail -n+2 out.vcf.table.all > out.vcf.table.all.headerless
    sed -i 's/ /\t/g' out.vcf.table.all.headerless
    paste out.vcf.table.all.headerless effects.mrg > out.vcf.headerless.plus.effects
    head -n1 out.vcf.table.all | sed 's/.GT//g' > header.left
    echo -e "Effect\tImpact\tFunctional_Class\tCodon_change\tProtein_and_nucleotide_change\tAmino_Acid_Length\tGene_name\tBiotype" > header.right
    paste header.left header.right > header
    cat header out.vcf.headerless.plus.effects > All_SNPs_indels_annotated.txt
    echo "SPANDx has finished"
    
    cp All_SNPs_indels_annotated.txt ../../../Outputs/Phylogeny_and_annotation/
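
The genotype clean-up and the extraction of the first eight EFF fields done with awk, sed and cut in the block above can be expressed in Python as follows. This is a sketch of the same idea, not a replacement for the pipeline: the function names are illustrative, the INFO string in the example is hypothetical (modeled on the 65225 record shown in this post), and the `UPSTREAM MODIFIER` substitution from the shell version is not reproduced.

```python
def collapse_genotype(gt):
    """Collapse a homozygous diploid call such as 'A/A' or 'A|A' to 'A'.

    Heterozygous calls such as 'G/A' are returned with '/' as separator.
    """
    gt = gt.replace("|", "/")
    alleles = gt.split("/")
    if len(alleles) == 2 and alleles[0] == alleles[1]:
        return alleles[0]
    return gt


def first_effect_fields(info, n_fields=8):
    """Return the first n_fields entries of the EFF annotation in an INFO string.

    Everything from 'EFF=' onward is taken, '(' is treated as a field
    separator like '|', and empty fields are kept. If no EFF annotation is
    present, n_fields empty strings are returned.
    """
    start = info.find("EFF=")
    if start == -1:
        return [""] * n_fields
    effect = info[start + len("EFF="):]
    fields = effect.replace("(", "|").split("|")[:n_fields]
    return fields + [""] * (n_fields - len(fields))


# Hypothetical INFO string for illustration
info = "DP=100;EFF=intragenic_variant(MODIFIER|||n.65225G>A||UL30|||)"
print(collapse_genotype("A/A"), collapse_genotype("G|G"), collapse_genotype("G/A"))
print(first_effect_fields(info))
```

The eight fields correspond, in order, to the columns named in `header.right`: Effect, Impact, Functional_Class, Codon_change, Protein_and_nucleotide_change, Amino_Acid_Length, Gene_name and Biotype.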
  4. merge the two variant calling

    #Output: interhost_variants/snippy/summary_snps_indels.csv
    python3 ~/Scripts/summarize_snippy_res.py interhost_variants/snippy  #Note: the ALT bases are wrong, but we only need the positions, so the results can still be used for downstream processing!
    
    #Sort summary_snps_indels.csv according to the coordinate positions.
    #merge the following two files summary_snps_indels.csv (70) and All_SNPs_indels_annotated.txt (819) --> merged_variants.csv (69)
    python3 ~/Scripts/merge_snps_indels.py interhost_variants/snippy/summary_snps_indels.csv Outputs/Phylogeny_and_annotation/All_SNPs_indels_annotated.txt merged_variants.csv
    #check whether the number of records in the output file is correct
    comm -12 <(cut -d, -f2 interhost_variants/snippy/summary_snps_indels.csv | sort | uniq) <(cut -f2 Outputs/Phylogeny_and_annotation/All_SNPs_indels_annotated.txt | sort | uniq) | wc -l
    comm -12 <(cut -d, -f2 interhost_variants/snippy/summary_snps_indels.csv | sort | uniq) <(cut -f2 Outputs/Phylogeny_and_annotation/All_SNPs_indels_annotated.txt | sort | uniq)
    #The only difference is 58615
    #Manually check the final results and delete some strange results and save merged_variants.csv as variants.xlsx
    
    #sort interhost_index -u > interhost_index_sorted
    #sort intrahost_index -u > intrahost_index_sorted
    #comm interhost_index_sorted intrahost_index_sorted
    
    # !!!! Manually check the intrahost records: if a record has a frequency > 0.5 in one sample group, it should be an interhost record. Check whether the record is in the spandx result; if it is, copy it to the interhost variant sheet!
    # All records in the intrahost variants should have a frequency < 0.5; a record > 0.5 belongs in the interhost variants. Delete every intrahost record that is > 0.5 and does not occur in All_SNPs_indels_annotated.txt !!!! An exception is a record such as 65225:
    #OP297860   65225   G   A   SNP G   G/A intragenic_variant  MODIFIER            n.65225G>A      UL30
    #OP297860   65225   HSV1_S1 HSV1_S1     G,A 0   intragenic_variant  n.65225G>A              UL30    Gene_63070_67475
    #OP297860   65225   HSV-Klinik_S2   HSV-Klinik_S2       G,A 0.891530460624071   intragenic_variant  n.65225G>A              UL30    Gene_63070_67475
    
    ##improve the header
    #sed -i '1s/_trimmed_P//g' merged_variants.csv
    
    ##check that REF and K1 have the same base and delete the records that differ.
    #cut -f3 -d',' merged_variants.csv > f3
    #cut -f6 -d',' merged_variants.csv > f6
    #diff f3 f6
    #awk -F, '$3 == $6 || NR==1' merged_variants.csv > filtered_merged_variants.csv #(93)
    #cut -f3 -d',' filtered_merged_variants.csv > f3
    #cut -f6 -d',' filtered_merged_variants.csv > f6
    #diff f3 f6
    
    ##MANUALLY REMOVE the column f6 in filtered_merged_variants.csv, and rename CHROM to HDRNA_01_K01 in the header, summarize chr and plasmids SNPs of a sample together to a single list, save as an Excel-file.
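
The `comm -12` check in the step above compares the positions found by the two callers. The same comparison can be sketched in Python with sets. The rows below are illustrative stand-ins for the two tables (only the positions 13203, 47109, 58615 and 65225 occur in this post; the bases in the 58615 row are hypothetical), and unlike `cut -f2`, this version skips the header row.

```python
import csv
import io


def positions(handle, delimiter, column=1):
    """Collect the values of the position column, skipping the header row."""
    reader = csv.reader(handle, delimiter=delimiter)
    next(reader, None)  # skip the header
    return {row[column] for row in reader if len(row) > column}


# Illustrative stand-ins for summary_snps_indels.csv and All_SNPs_indels_annotated.txt
snippy_csv = io.StringIO(
    "CHROM,POS,REF,ALT\n"
    "OP297860,13203,T,C\n"
    "OP297860,47109,T,G\n"
    "OP297860,58615,A,G\n"
)
spandx_tsv = io.StringIO(
    "CHROM\tPOS\tREF\tALT\n"
    "OP297860\t13203\tT\tC\n"
    "OP297860\t47109\tT\tG\n"
    "OP297860\t65225\tG\tA\n"
)

snippy_positions = positions(snippy_csv, ",")
spandx_positions = positions(spandx_tsv, "\t")

shared = sorted(snippy_positions & spandx_positions, key=int)
only_snippy = sorted(snippy_positions - spandx_positions, key=int)

print("shared positions:", shared)
print("only in the snippy table:", only_snippy)
```

With real files, `open(path, newline="")` can be passed instead of the `io.StringIO` objects; the set difference lists the positions that need the manual check described above.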
  5. (Optional; this step is currently used only for intrahost variant calling) Filter low-complexity reads

    fastp -i HSV1_S1_trimmed_P_1.fastq -I HSV1_S1_trimmed_P_2.fastq -o HSV1_S1_trimmed_R1.fastq -O HSV1_S1_trimmed_R2.fastq --low_complexity_filter --complexity_threshold 30
    fastp -i HSV-Klinik_S2_trimmed_P_1.fastq -I HSV-Klinik_S2_trimmed_P_2.fastq -o HSV-Klinik_S2_trimmed_R1.fastq -O HSV-Klinik_S2_trimmed_R2.fastq --low_complexity_filter --complexity_threshold 30
    
        Read1 before filtering:
        total reads: 1755209
        total bases: 163663141
        Q20 bases: 162306612(99.1711%)
        Q30 bases: 159234526(97.2941%)
    
        Read2 before filtering:
        total reads: 1755209
        total bases: 163045950
        Q20 bases: 161178082(98.8544%)
        Q30 bases: 157052184(96.3239%)
    
        Read1 after filtering:
        total reads: 1733241
        total bases: 161547828
        Q20 bases: 160217907(99.1768%)
        Q30 bases: 157196236(97.3063%)
    
        Read2 aftering filtering:
        total reads: 1733241
        total bases: 160825521
        Q20 bases: 159057902(98.9009%)
        Q30 bases: 155354052(96.5979%)
    
        Filtering result:
        reads passed filter: 3466482
        reads failed due to low quality: 550
        reads failed due to too many N: 0
        reads failed due to too short: 0
        reads failed due to low complexity: 43386
        reads with adapter trimmed: 21424
        bases trimmed due to adapters: 159261
    
        Duplication rate: 14.2379%
    
        Insert size peak (evaluated by paired-end reads): 41
    
        JSON report: fastp.json
        HTML report: fastp.html
    
        fastp -i HSV1_S1_trimmed_P_1.fastq -I HSV1_S1_trimmed_P_2.fastq -o HSV1_S1_trimmed_R1.fastq -O HSV1_S1_trimmed_R2.fastq --low_complexity_filter --complexity_threshold 30
        fastp v0.20.1, time used: 7 seconds
        Read1 before filtering:
        total reads: 2688264
        total bases: 330035144
        Q20 bases: 326999269(99.0801%)
        Q30 bases: 320136918(97.0009%)
    
        Read2 before filtering:
        total reads: 2688264
        total bases: 327364405
        Q20 bases: 323331005(98.7679%)
        Q30 bases: 314500076(96.0703%)
    
        Read1 after filtering:
        total reads: 2660598
        total bases: 326564634
        Q20 bases: 323572956(99.0839%)
        Q30 bases: 316783667(97.0049%)
    
        Read2 aftering filtering:
        total reads: 2660598
        total bases: 324709841
        Q20 bases: 320840657(98.8084%)
        Q30 bases: 312570288(96.2614%)
    
        Filtering result:
        reads passed filter: 5321196
        reads failed due to low quality: 1110
        reads failed due to too many N: 0
        reads failed due to too short: 0
        reads failed due to low complexity: 54222
        reads with adapter trimmed: 39080
        bases trimmed due to adapters: 357915
    
        Duplication rate: 9.91821%
    
        Insert size peak (evaluated by paired-end reads): 96
    
        JSON report: fastp.json
        HTML report: fastp.html
    
        fastp -i HSV-Klinik_S2_trimmed_P_1.fastq -I HSV-Klinik_S2_trimmed_P_2.fastq -o HSV-Klinik_S2_trimmed_R1.fastq -O HSV-Klinik_S2_trimmed_R2.fastq --low_complexity_filter --complexity_threshold 30
        fastp v0.20.1, time used: 15 seconds
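To see how much the low-complexity filter actually removed, the "Filtering result" block can be parsed and turned into a fraction. The following is a small sketch that works on the pasted text of the first run; the function names are my own, and it reads the console summary rather than `fastp.json`.

```python
import re

# Filtering summary copied from the first fastp run above
# (1755209 read pairs before filtering).
FASTP_SUMMARY = """\
reads passed filter: 3466482
reads failed due to low quality: 550
reads failed due to too many N: 0
reads failed due to too short: 0
reads failed due to low complexity: 43386
"""


def parse_filtering_result(text):
    """Map each 'reads <category>: N' line to an integer count."""
    counts = {}
    for line in text.splitlines():
        match = re.match(r"\s*reads (.+?):\s*(\d+)\s*$", line)
        if match:
            counts[match.group(1)] = int(match.group(2))
    return counts


def total_input_reads(counts):
    """Passed reads plus all reads that failed for any reason."""
    failed = sum(n for key, n in counts.items() if key.startswith("failed due to"))
    return counts["passed filter"] + failed


counts = parse_filtering_result(FASTP_SUMMARY)
total = total_input_reads(counts)
fraction = counts["failed due to low complexity"] / total
print(f"input reads: {total}, removed as low complexity: {fraction:.2%}")
```

The total reproduces twice the number of read pairs reported before filtering, which is a quick sanity check that no category was missed.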
  6. Use vrap to assemble and annotate the contigs. The SPAdes step was replaced with the IDBA assembly from DAMIAN; DAMIAN’s host-removal steps can also serve as a confirmation step for viral-ngs.

    # Starting data: ln -s interhost_variants/trimmed .
    ln -s ~/Tools/vrap/ .
    #CHANGE the txid10298 in download_db.py: txid10298[Organism] AND complete genome[Title]
    gzip trimmed/*_R1.fastq trimmed/*_R2.fastq
    mv trimmed/*.gz ./
    #--host /home/jhuang/REFs/genome.fa --nt=/mnt/nvme0n1p1/blast/nt --nr=/mnt/nvme0n1p1/blast/nr
    
    vrap/vrap.py  -1 trimmed/HSV1_S1_R1.fastq.gz -2 trimmed/HSV1_S1_R2.fastq.gz  -o HSV1_S1_vrap_out_v3 --bt2idx=/home/jhuang/REFs/genome    -t 100 -l 200  -g
    vrap/vrap.py  -1 trimmed/HSV-Klinik_S2_R1.fastq.gz -2 trimmed/HSV-Klinik_S2_R2.fastq.gz  -o HSV-Klinik_S2_vrap_out_v3 --bt2idx=/home/jhuang/REFs/genome    -t 100 -l 200  -g
    
    #--> If the spades assembly fails with an ERROR, use the idba assembly from DAMIAN instead: copy that assembly to the spades directory, then rerun vrap.py above!
    
    # * 4 nt_dbs (--virus, --host, download_db.py(nucleotide), nt), 2 prot_db (download_db.py(protein), nr) for blast, save under ./blast/db/virus, ./blast/db/host, vrap/database/viral_db/viral_nucleotide, vrap/database/viral_db/viral_protein
    # * 1 bowtie_database for host removal (--host), save under ./bowtie/host.
    # * bowtie runs before assembly
    # * blast runs after assembly, on the contigs; therefore vrap has no taxfilt step.
    # * TODO: check the order of the databases in the annotation step, namely which database is used first for annotation after setting --virus.
    # * --host is used for both bowtie and blastn; if only --bt2idx is defined, only bowtie runs, no blastn! --> with --host=/home/jhuang/REFs/genome.fa commented out, the host-removal step still runs!
    # * "--virus=vrap/database/viral_db/nucleotide.fa" does not need to be given, since it is already defined in ./blast/db/virus
    # * the process: lighter (fast, memory-efficient tool for correcting sequencing errors) --> flash (tool to find the correct overlap between paired-end reads and extend the reads by stitching them together) --> bowtie (delete the host reads) --> spades --> cap3 (CAP3: A DNA sequence assembly program and it has a capability to clip 5' and 3' low-quality regions of reads) --> calculating orf density --> hmmer --> blast
    
    # Download all virus genomes
    mv datasets /usr/local/bin/
    chmod +x /usr/local/bin/datasets
    #datasets download virus genome --complete-only --assembly-source refseq
    datasets download virus genome taxon "Viruses" --complete-only --refseq
    #To check for RefSeq data only, look for NC_, NM_, or similar prefixes in sequence headers and identifiers.
    wget -r -np -nH --cut-dirs=3 ftp://ftp.ncbi.nlm.nih.gov/genomes/Viruses/
    
    # The commands for more comprehensive blast annotation
    vrap/vrap.py  -1 trimmed/HSV1_S1_R1.fastq.gz -2 trimmed/HSV1_S1_R2.fastq.gz  -o HSV1_S1_vrap_out_v4 --bt2idx=/home/jhuang/REFs/genome  --host=/home/jhuang/REFs/genome.fa --virus=/home/jhuang/DATA/Data_Nicole_CaptureProbeSequencing/vrap/database/ncbi_dataset/data/genomic.fna --nt=/mnt/nvme0n1p1/blast/nt --nr=/mnt/nvme0n1p1/blast/nr  -t 100 -l 200  -g
    vrap/vrap.py  -1 trimmed/HSV-Klinik_S2_R1.fastq.gz -2 trimmed/HSV-Klinik_S2_R2.fastq.gz  -o HSV-Klinik_S2_vrap_out_v4 --bt2idx=/home/jhuang/REFs/genome  --host=/home/jhuang/REFs/genome.fa --virus=/home/jhuang/DATA/Data_Nicole_CaptureProbeSequencing/vrap/database/ncbi_dataset/data/genomic.fna --nt=/mnt/nvme0n1p1/blast/nt --nr=/mnt/nvme0n1p1/blast/nr  -t 100 -l 200  -g
    #END
    
    #using the bowtie of vrap to map the reads on ref_genome/reference.fasta
    vrap/vrap.py  -1 trimmed/HSV1_S1_R1.fastq.gz -2 trimmed/HSV1_S1_R2.fastq.gz  -o HSV1_S1_vrap_out_v5 --host ref_genome/reference.fasta   -t 100 -l 200  -g
    vrap/vrap.py  -1 trimmed/HSV-Klinik_S2_R1.fastq.gz -2 trimmed/HSV-Klinik_S2_R2.fastq.gz  -o HSV-Klinik_S2_vrap_out_v5 --host ref_genome/reference.fasta    -t 100 -l 200  -g
    cd bowtie
    mv mapped mapped.sam
    samtools view -S -b mapped.sam > mapped.bam
    samtools sort mapped.bam -o mapped_sorted.bam
    samtools index mapped_sorted.bam
    samtools view -H mapped_sorted.bam
    samtools flagstat mapped_sorted.bam
    #106435 + 0 mapped (3.11% : N/A)
    #106435 + 0 primary mapped (3.11% : N/A)
    #8204 + 0 properly paired (0.26% : N/A)
    #63742 + 0 with itself and mate mapped
      8204+63742
    #1144448 + 0 mapped (26.25% : N/A)
    #1144448 + 0 primary mapped (26.25% : N/A)
    #124068 + 0 properly paired (3.76% : N/A)
    #581256 + 0 with itself and mate mapped
      124068+581256
    bamCoverage -b mapped_sorted.bam -o ../../HSV1_S1_reads_coverage2.bw
    bamCoverage -b mapped_sorted.bam -o ../../HSV-Klinik_S2_reads_coverage2.bw
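For filling tables it can be handy to turn `samtools flagstat` lines into numbers instead of copying them by hand. The sketch below parses the first flagstat block pasted above; the function name and the derived ratio are my own choices, and it only reads the QC-passed count (the number before the `+`).

```python
import re

# First flagstat block from the notes above, without the leading '#'.
FLAGSTAT_TEXT = """\
106435 + 0 mapped (3.11% : N/A)
106435 + 0 primary mapped (3.11% : N/A)
8204 + 0 properly paired (0.26% : N/A)
63742 + 0 with itself and mate mapped
"""

LINE_PATTERN = re.compile(r"(\d+) \+ (\d+) (.+?)(?: \(.*\))?$")


def parse_flagstat(text):
    """Return {category: QC-passed read count} for each flagstat line."""
    counts = {}
    for line in text.splitlines():
        match = LINE_PATTERN.match(line.strip().lstrip("#"))
        if match:
            counts[match.group(3)] = int(match.group(1))
    return counts


flagstat = parse_flagstat(FLAGSTAT_TEXT)
paired_share = flagstat["properly paired"] / flagstat["mapped"]
print(flagstat)
print(f"properly paired among mapped reads: {paired_share:.1%}")
```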
    
    #Command line spades:
        /home/jhuang/miniconda3/envs/vrap/bin/spades.py   -1      /mnt/md1/DATA_md1/Data_Nicole_CaptureProbeSequencing/HSV-Klinik_S2_vrap_out_v3/bowtie/bowtie.un.1.fastq -2      /mnt/md1/DATA_md1/Data_Nicole_CaptureProbeSequencing/HSV-Klinik_S2_vrap_out_v3/bowtie/bowtie.un.2.fastq --s1    /mnt/md1/DATA_md1/Data_Nicole_CaptureProbeSequencing/HSV-Klinik_S2_vrap_out_v3/bowtie/bowtie.un.fastq   -k      33,55,77,99,127 --cov-cutoff    off     --only-assembler        --careful       -t      100     -o      /mnt/md1/DATA_md1/Data_Nicole_CaptureProbeSequencing/HSV-Klinik_S2_vrap_out_v3/spades
    #Command line cap3:
        /home/jhuang/Tools/vrap/external_tools/cap3 /mnt/md1/DATA_md1/Data_Nicole_CaptureProbeSequencing/HSV-Klinik_S2_vrap_out_v3/spades/contigs.fasta -y 100
    
    damian.rb --host human3 --type dna -1 /home/jhuang/DATA/Data_Nicole_CaptureProbeSequencing/HSV1_S1_trimmed_R1.fastq.gz -2 /home/jhuang/DATA/Data_Nicole_CaptureProbeSequencing/HSV1_S1_trimmed_R2.fastq.gz --sample HSV1_S1_megablast --blastn never --blastp never --min_contiglength 100 --threads 56 --force
    damian.rb --host human3 --type dna -1 /home/jhuang/DATA/Data_Nicole_CaptureProbeSequencing/HSV-Klinik_S2_trimmed_R1.fastq.gz -2 /home/jhuang/DATA/Data_Nicole_CaptureProbeSequencing/HSV-Klinik_S2_trimmed_R2.fastq.gz --sample HSV-Klinik_S2_megablast --blastn never --blastp never --min_contiglength 100 --threads 56 --force
    [16:42:55 2024-11-12]   Removing adapter and host sequences
    Trimming readpair 1:  /home/jhuang/DATA/Data_Nicole_CaptureProbeSequencing/trimmed/HSV1_S1_R1.fastq.gz and /home/jhuang/DATA/Data_Nicole_CaptureProbeSequencing/trimmed/HSV1_S1_R2.fastq.gz
    Host reads:           11.71%
    Fragment size:        212 (sd:64)
    Subtracting host:     human3 (Homo_sapiens_UCSC_hg38 (dna))
    Alignment rate:       0.52%
    Subtracting host:     human3 (Homo sapiens (cdna))
    Alignment rate:       0.02%
    Subtracting host:     human3 (Homo sapiens (ncrna))
    Alignment rate:       0.01%
    
    [17:20:31 2024-11-12]   Removing adapter and host sequences
    Trimming readpair 1:  /home/jhuang/DATA/Data_Nicole_CaptureProbeSequencing/trimmed/HSV-Klinik_S2_R1.fastq.gz and /home/jhuang/DATA/Data_Nicole_CaptureProbeSequencing/trimmed/HSV-Klinik_S2_R2.fastq.gz
    Host reads:           44.47%
    Fragment size:        236 (sd:77)
    Subtracting host:     human3 (Homo_sapiens_UCSC_hg38 (dna))
    Alignment rate:       29.34%
    Subtracting host:     human3 (Homo sapiens (cdna))
    Alignment rate:       0.66%
    Subtracting host:     human3 (Homo sapiens (ncrna))
    Alignment rate:       0.64%
    [17:25:27 2024-11-12]   Assembling
    [17:38:39 2024-11-12]   Parsing assembly
    Large contigs (500bp and longer):     259
    Large orfs (75bp and longer):         843
    [17:38:58 2024-11-12]   Seeking protein domains
    Contigs with domains: 162
    [17:40:36 2024-11-12]   Annotating contigs
    
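    # Run each copy inside the spades directory of the matching vrap output (e.g. HSV1_S1_vrap_out_v3/spades); if both are run in the same directory, the second copy overwrites the first.
    cp ~/rtpd_files/HSV1_S1_megablast/idba_ud_assembly/contig.fa contigs.fasta
    cp ~/rtpd_files/HSV-Klinik_S2_megablast/idba_ud_assembly/contig.fa contigs.fasta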
    
    #RERUN vrap/vrap.py again with the replaced contigs.fasta!
    #vrap/vrap.py  -1 Affe30_trimmed_R1.fastq.gz -2 Affe30_trimmed_R2.fastq.gz -o Affe30_trimmed_vrap_out   -t 40 -l 100
    #vrap/vrap.py  -1 Affe31_trimmed_R1.fastq.gz -2 Affe31_trimmed_R2.fastq.gz -o Affe31_trimmed_vrap_out   -t 40 -l 100
    
    # -- DEBUG_1 --
    #DO NOT use '-l 100' in command line
    #name 'generic_dna' is not defined
    mamba install biopython=1.77 python=3.9  #for supporting "generic_dna"
    
    # SET all records from vrap/database/viral_db/nucleotide.fa as lastal.acids; choose the accession that occurs most often in vrap_out as refsel.acids and as accessions_for_ref_genome_build in config.yaml.
    #   Query coverage
    Query sequence name Query length    ORF density Percentage identity Subject sequence length Subject accession   Subject name    E-value
    
    grep "Human alphaherpesvirus 1" HSV-Klinik_S2_contigs_summary.csv > HSV-Klinik_S2_contigs_summary_.csv
    
    #--> ON960057.1
    
        name: The name or identifier of the query sequence. This is typically the header from the input sequence file, such as a FASTA file.
    
        qleng: Query length, or the total length of the input sequence (in nucleotides or amino acids, depending on the input type).
    
        orf_d: ORF (Open Reading Frame) density of the query sequence, matching the "ORF density" column of the summary header above; vrap calculates it in the ORF-density step that runs before hmmer and blast.
    
        hmmer_eval: The E-value from the HMMER (Hidden Markov Model) search. This represents the statistical significance of the match between the identified ORF and the reference HMM model. Lower values indicate more significant matches.
    
        hmm_model: The name of the HMM (Hidden Markov Model) profile matched by the ORF. This typically corresponds to a specific viral or protein family model from an HMM database, such as Pfam or custom models used by VRAP.
    
        ident: Percentage identity between the query sequence and the target model or database entry. This measures the similarity of the ORF to the matched model.
    
        qcov: Query coverage, or the percentage of the query sequence that aligns to the target model. This indicates how much of the ORF sequence aligns with the HMM profile.
    
        tcov: Target coverage, or the percentage of the target HMM profile that aligns with the query. This helps assess how well the ORF represents the entire HMM model.
    
        tlength: Target length, or the length of the HMM model sequence in the database. This value can be used to understand how much of the target model was covered by the ORF alignment.
    
        tid: Target identifier, often an accession or ID number for the matched HMM model. This is used to uniquely identify the model within the HMM database.
    
        tname: Target name or description, which provides more information about the HMM model or protein family that the ORF matches.
    
        mean_eval: Mean E-value for the HMMER match, averaged over multiple potential alignments (if any). Lower values imply higher significance, with the mean providing an aggregate metric if there were multiple HMM matches.
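Choosing "the accession that occurs most often" among the HSV-1 hits can also be done with a few lines of Python instead of `grep` plus counting by eye. This is only a sketch: the tab-separated layout, the use of `tid`/`tname` as column names in the summary file, and the example rows are assumptions, so adjust them to the real `*_contigs_summary.csv`.

```python
import csv
import io
from collections import Counter


def most_common_accession(handle, organism="Human alphaherpesvirus 1",
                          name_col="tname", acc_col="tid", delimiter="\t"):
    """Count subject accessions among rows whose subject name contains `organism`.

    Returns (accession, count) for the most frequent one, or None if no row matches.
    """
    reader = csv.DictReader(handle, delimiter=delimiter)
    hits = Counter(row[acc_col] for row in reader if organism in row[name_col])
    return hits.most_common(1)[0] if hits else None


# Made-up rows in a hypothetical layout; only the accessions appear in the notes.
EXAMPLE_SUMMARY = (
    "name\tqleng\ttid\ttname\n"
    "contig_1\t5200\tON960057.1\tHuman alphaherpesvirus 1\n"
    "contig_2\t900\tON960057.1\tHuman alphaherpesvirus 1\n"
    "contig_3\t700\tNC_001806.2\tHuman alphaherpesvirus 1\n"
    "contig_4\t650\tXX000000.0\tHomo sapiens\n"
)

best = most_common_accession(io.StringIO(EXAMPLE_SUMMARY))
print(best)
```

With the real file, pass `open("HSV-Klinik_S2_contigs_summary.csv")` instead of the `io.StringIO` object.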
    
    #reads_and_contigs_on_JX878414.png
    
    #using the assembly for the calling!
    #TODO_TOMORROW: In the final results, only mark the SNPs in the contigs > 500 nt (as shown in the figure); otherwise we have too many results! Then merge the SNPs (currently there is an ERROR during merging!)
  7. Analyses using viral-ngs

    conda activate viral3
    #conda install -c anaconda openjdk=8
    
    ln -s ~/Tools/viral-ngs/Snakefile Snakefile
    ln -s ~/Tools/viral-ngs/bin bin
    
    cp  ~/Tools/viral-ngs/refsel.acids refsel.acids
    cp  ~/Tools/viral-ngs/lastal.acids lastal.acids
    cp  ~/Tools/viral-ngs/config.yaml config.yaml
    cp  ~/Tools/viral-ngs/samples-runs.txt samples-runs.txt
    cp  ~/Tools/viral-ngs/samples-depletion.txt samples-depletion.txt
    cp  ~/Tools/viral-ngs/samples-metagenomics.txt samples-metagenomics.txt
    cp  ~/Tools/viral-ngs/samples-assembly.txt samples-assembly.txt
    cp  ~/Tools/viral-ngs/samples-assembly-failures.txt samples-assembly-failures.txt
    mkdir -p data/00_raw
  8. Prepare lastal.acids, refsel.acids and accessions_for_ref_genome_build in config.yaml

    #Herpes simplex virus 1 (HSV-1) and Human alphaherpesvirus 1 (also known as Simplexvirus humanalpha1) are indeed the same virus.
    #The different names result from varied naming conventions:
    #    * Herpes simplex virus 1 (HSV-1) is the common name, often used in clinical and general contexts.
    #    * Human alphaherpesvirus 1 is the official taxonomic name, as defined by the International Committee on Taxonomy of Viruses (ICTV). This name is used in scientific classifications and databases like NCBI to specify its place in the Herpesviridae family under the Alphaherpesvirinae subfamily.
    #In some databases or references, it also appears as Simplexvirus humanalpha1, the binomial species name (genus Simplexvirus, species epithet humanalpha1). All these terms refer to the same virus, commonly known as HSV-1.
    
    #https://www.uniprot.org/taxonomy?query=Human+herpesvirus
    #https://www.uniprot.org/taxonomy/3050292
    esearch -db nuccore -query "txid3050292[Organism]" | efetch -format fasta > taxon_3050292_sequences.fasta
    esearch -db nuccore -query "txid3050292[Organism]" | efetch -format acc > taxon_3050292_accessions.txt
    esearch -db nuccore -query "txid10298[Organism] AND complete genome[Title]" | efetch -format fasta > taxon_10298_complete_genomes.fasta
    esearch -db nuccore -query "txid10298[Organism] AND complete genome[Title]" | efetch -format acc > taxon_10298_complete_genomes.acc  # 161 genomes
    mv taxon_10298_complete_genomes.acc lastal.acids
    
    https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=10298
    Human alphaherpesvirus 1 (Herpes simplex virus type 1)
        Human alphaherpesvirus 1 strain 17
        Human alphaherpesvirus 1 strain A44
        Human alphaherpesvirus 1 strain Angelotti
        Human alphaherpesvirus 1 strain CL101
        Human alphaherpesvirus 1 strain CVG-2
        Human alphaherpesvirus 1 strain F
        Human alphaherpesvirus 1 strain H129
        Human alphaherpesvirus 1 strain HFEM
        Human alphaherpesvirus 1 strain HZT
        Human alphaherpesvirus 1 strain KOS
        Human alphaherpesvirus 1 strain MGH-10
        Human alphaherpesvirus 1 strain MP
        Human alphaherpesvirus 1 strain Patton
        Human alphaherpesvirus 1 strain R-15
        Human alphaherpesvirus 1 strain R19
        Human alphaherpesvirus 1 strain RH2
        Human alphaherpesvirus 1 strain SC16
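Before using the downloaded accession list as `lastal.acids`, it may be worth checking it for blank lines and duplicates. A minimal sketch, with a hypothetical helper name and a tiny in-memory example instead of the real file:

```python
def clean_accessions(lines):
    """Strip whitespace, drop blank lines and duplicates, keep first-seen order."""
    seen = set()
    cleaned = []
    for line in lines:
        accession = line.strip()
        if accession and accession not in seen:
            seen.add(accession)
            cleaned.append(accession)
    return cleaned


# Small example; the two accessions are the ones mentioned in these notes.
example_lines = ["NC_001806.2\n", "ON960057.1\n", "\n", "NC_001806.2\n"]
accessions = clean_accessions(example_lines)
print(len(accessions), accessions)
```

With the real file, `len(clean_accessions(open("lastal.acids")))` can be compared with the 161 genomes noted above.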
  9. Trimming using trimmomatic

    # Starting data: ln -s interhost_variants/raw_data .
    mkdir -p trimmed bams
    #for sample in HSV1_S1; do    # alternative: a single sample
    for sample in HSV1_S1 HSV-Klinik_S2 NTC_S3; do
        java -jar /home/jhuang/Tools/Trimmomatic-0.36/trimmomatic-0.36.jar PE -threads 16 ./raw_data/${sample}_R1.fastq.gz ./raw_data/${sample}_R2.fastq.gz trimmed/${sample}_R1.fastq.gz trimmed/${sample}_unpaired_R1.fastq.gz trimmed/${sample}_R2.fastq.gz trimmed/${sample}_unpaired_R2.fastq.gz  ILLUMINACLIP:/home/jhuang/Tools/Trimmomatic-0.36/adapters/TruSeq3-PE-2.fa:2:30:10:8:TRUE LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36 AVGQUAL:20
    done
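The colon-separated Trimmomatic step arguments are easy to misread, so here is a small sketch that labels each value of the steps used above. The field labels are my own wording for the parameters described in the Trimmomatic manual (for ILLUMINACLIP: adapter file, seed mismatches, palindrome clip threshold, simple clip threshold, minimum adapter length, keep both reads); the helper itself is hypothetical and does not call Trimmomatic.

```python
# Labels for the positional values of each trimming step used in the command above.
STEP_FIELDS = {
    "ILLUMINACLIP": ["adapter_fasta", "seed_mismatches", "palindrome_clip_threshold",
                     "simple_clip_threshold", "min_adapter_length", "keep_both_reads"],
    "LEADING": ["quality"],
    "TRAILING": ["quality"],
    "SLIDINGWINDOW": ["window_size", "required_quality"],
    "MINLEN": ["length"],
    "AVGQUAL": ["quality"],
}


def describe_step(step):
    """Split 'NAME:v1:v2:...' into (name, {label: value}); values stay strings."""
    name, *values = step.split(":")
    fields = STEP_FIELDS[name]
    if len(values) > len(fields):
        raise ValueError(f"too many values for {name}: {values}")
    return name, dict(zip(fields, values))


steps = [
    "ILLUMINACLIP:/home/jhuang/Tools/Trimmomatic-0.36/adapters/TruSeq3-PE-2.fa:2:30:10:8:TRUE",
    "LEADING:3", "TRAILING:3", "SLIDINGWINDOW:4:15", "MINLEN:36", "AVGQUAL:20",
]
described = [describe_step(step) for step in steps]
for name, params in described:
    print(name, params)
```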
  10. Mapping

    cd trimmed
    seqtk sample -s100 HSV1_S1_R1.fastq.gz 0.1 > HSV1_S1_sampled_R1.fastq
    seqtk sample -s100 HSV1_S1_R2.fastq.gz 0.1 > HSV1_S1_sampled_R2.fastq
    gzip HSV1_S1_sampled_R1.fastq HSV1_S1_sampled_R2.fastq
    cd ..

    ref_fa="NC_001806.fasta"
    bwa index ${ref_fa}    # index the reference once, before the loop
    #for sample in HSV1_S1; do            # alternative: a single sample
    #for sample in HSV1_S1_sampled; do    # alternative: the 10% subsample
    for sample in HSV1_S1 HSV-Klinik_S2 NTC_S3; do
        bwa mem -M -t 16 ${ref_fa} trimmed/${sample}_R1.fastq.gz trimmed/${sample}_R2.fastq.gz | samtools view -bS - > bams/${sample}_genome_alignment.bam
        #for table filling using the following commands! -->3000000
        #bwa mem -M -t 14 ${ref_fa} ${sample}_R1.fastq.gz ${sample}_R2.fastq.gz | samtools view -bS -F 256 - > bams/${sample}_uniqmap.bam
    done
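The two `seqtk sample` calls use the same seed (`-s100`) so that the same read pairs are kept in R1 and R2. The idea can be illustrated in Python as below. This is a conceptual sketch only: it uses Python's `random` module, not seqtk's random number generator, so it will not select the same reads as seqtk, and the read names are made up.

```python
import random


def keep_mask(n_records, fraction, seed):
    """One keep/drop decision per record, determined only by the seed."""
    rng = random.Random(seed)
    return [rng.random() < fraction for _ in range(n_records)]


def subsample(records, fraction, seed):
    """Keep the records whose position is selected by the seeded mask."""
    mask = keep_mask(len(records), fraction, seed)
    return [record for record, keep in zip(records, mask) if keep]


# Made-up read names; mate files list the pairs in the same order.
r1 = [f"read{i}/1" for i in range(1000)]
r2 = [f"read{i}/2" for i in range(1000)]

kept_r1 = subsample(r1, 0.1, seed=100)
kept_r2 = subsample(r2, 0.1, seed=100)

# Same seed and same record order give the same positions, so pairs stay intact.
pairs_intact = [name[:-2] for name in kept_r1] == [name[:-2] for name in kept_r2]
print(len(kept_r1), len(kept_r2), pairs_intact)
```

Using different seeds for the two files would select different positions and break the pairing that `bwa mem` relies on.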
  11. AddOrReplaceReadGroups is an IMPORTANT step; without it, viral-ngs cannot run correctly

    #for sample in HSV1_S1; do            # alternative: a single sample
    #for sample in HSV1_S1_sampled; do    # alternative: the 10% subsample
    for sample in HSV1_S1 HSV-Klinik_S2 NTC_S3; do
        picard AddOrReplaceReadGroups I=bams/${sample}_genome_alignment.bam O=data/00_raw/${sample}.bam SORT_ORDER=coordinate CREATE_INDEX=true RGPL=illumina RGID=$sample RGSM=$sample RGLB=standard RGPU=$sample VALIDATION_STRINGENCY=LENIENT
    done
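A quick way to confirm that the read groups were written is to inspect the BAM header (for example the text printed by `samtools view -H data/00_raw/HSV1_S1.bam`). The sketch below parses such header text; the example header is hand-written from the `RG*` values in the picard command above, not real output, and the helper names are hypothetical.

```python
REQUIRED_TAGS = ("ID", "SM", "PL", "LB", "PU")


def read_groups(header_text):
    """Return one {tag: value} dict per @RG line of a SAM header."""
    groups = []
    for line in header_text.splitlines():
        if line.startswith("@RG"):
            fields = dict(field.split(":", 1) for field in line.split("\t")[1:])
            groups.append(fields)
    return groups


def missing_tags(group, required=REQUIRED_TAGS):
    """List the required read-group tags that are absent from one @RG entry."""
    return [tag for tag in required if tag not in group]


# Hand-written example header mirroring the picard parameters above.
EXAMPLE_HEADER = (
    "@HD\tVN:1.6\tSO:coordinate\n"
    "@RG\tID:HSV1_S1\tLB:standard\tPL:illumina\tSM:HSV1_S1\tPU:HSV1_S1\n"
)

groups = read_groups(EXAMPLE_HEADER)
print(groups)
print("missing tags:", [missing_tags(group) for group in groups])
```

An empty list from `read_groups` would mean the BAM has no read group at all, which is the situation this step is meant to prevent.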
  12. Configure the viral-ngs conda environment

    conda config --add channels r
    conda config --add channels defaults
    conda config --add channels conda-forge
    conda config --add channels bioconda
    conda config --add channels broad-viral
    
    # -- Does not work correctly --
    #conda list --export > environment2.yml
    #mamba create --name viral-ngs4 --file environment2.yml

    # Option 1: create the environment and install the packages in one step
    mamba env remove -n viral-ngs4
    mamba create -n viral-ngs4 python=3.6 blast=2.6.0 bmtagger biopython pysam pyyaml picard mvicuna pybedtools fastqc matplotlib spades last=876 -c conda-forge -c bioconda
    conda activate viral-ngs4

    # Option 2: create the environment first, then install the packages
    mamba create -n viral-ngs4 python=3.6
    conda activate viral-ngs4
    #vim requirements-conda.txt
    mamba install blast=2.6.0 bmtagger biopython pysam pyyaml picard mvicuna pybedtools fastqc matplotlib spades last=876 -c conda-forge -c bioconda
    # -- DEBUG, if needed --
    #mamba remove picard
    #mamba clean --packages
    #mamba install -c bioconda picard
    ##mamba install libgfortran=5 sqlite=3.46.0
    ##mamba install picard --clobber
    ##mamba create -n viral-ngs-fresh -c bioconda -c conda-forge picard python=3.6 sqlite=3.46.0 libgfortran=5
    
    mamba install cd-hit cd-hit-auxtools diamond gap2seq=2.1 mafft=7.221 mummer4 muscle=3.8 parallel pigz prinseq samtools=1.6 tbl2asn trimmomatic trinity unzip vphaser2 bedtools -c r -c defaults -c conda-forge -c bioconda  #-c broad-viral
    mamba install snpeff=4.1l
    mamba install gatk=3.6
    mamba install bwa
    #IMPORTANT_REPLACE "sudo cp /home/jhuang/miniconda3/envs/viral-ngs4/bin/gatk3 /usr/local/bin/gatk"
    #IMPORTANT_UPDATE jar_file in the file with "/home/jhuang/Tools/GenomeAnalysisTK-3.6/GenomeAnalysisTK.jar"
    #IMPORTANT_SET /home/jhuang/Tools/GenomeAnalysisTK-3.6 as GATK_PATH in config.yaml
    #IMPORTANT_CHECK if it works
    #        java -jar /home/jhuang/Tools/GenomeAnalysisTK-3.6/GenomeAnalysisTK.jar -T RealignerTargetCreator --help
    #        /usr/local/bin/gatk -T RealignerTargetCreator --help
    #IMPORTANT_NOTE that the env viral-ngs4 cannot be activated from the base env due to the python3 conflict!
    mamba install vphaser2=2.0
    
    # -- NO ERROR --> INSTALL END HERE --
    
    # -- DEBUG: ClobberError: This transaction has incompatible packages due to a shared path. --
    # SafetyError: The package for snpeff located at /home/jhuang/miniconda3/pkgs/snpeff-4.1l-hdfd78af_8
    # appears to be corrupted. The path 'share/snpeff-4.1l-8/snpEff.config'
    # has an incorrect size.
    # reported size: 9460047 bytes
    # actual size: 9460357 bytes
    #
    # ClobberError: This transaction has incompatible packages due to a shared path.
    # packages: bioconda/linux-64::bowtie2-2.5.4-h7071971_4, bioconda/linux-64::bowtie-1.3.1-py36h769816f_3
    # path: 'bin/scripts/convert_quals.pl'
    
    # solve the conflict between bowtie, bowtie2 and snpeff
    mamba remove bowtie
    mamba install bowtie2
    mamba remove snpeff
    mamba install snpeff=4.1l
    
    # -- WITH ERROR caused by bowtie and snpeff --> INSTALL END HERE --
    
    #mamba install -c bioconda viral-ngs  #so that gatk3-register and novoalign-license-register available --> ERROR
        #Due to license restrictions, the viral-ngs conda package cannot distribute and install GATK directly. To fully install GATK, you must download a licensed copy of GATK v3.8 from the Broad Institute, and call “gatk3-register,” which will copy GATK into your viral-ngs conda environment:
            mkdir -p /path/to/gatk_dir
            wget -O - 'https://software.broadinstitute.org/gatk/download/auth?package=GATK-archive&version=3.6-0-g89b7209' | tar -xjvC /path/to/gatk_dir
            gatk3-register /path/to/gatk_dir/GenomeAnalysisTK.jar
        #The single-threaded version of Novoalign is installed by default. If you have a license for Novoalign to enable multi-threaded operation, viral-ngs will copy it to the viral-ngs conda environment if the NOVOALIGN_LICENSE_PATH environment variable is set. Alternatively, the conda version of Novoalign can be overridden if the NOVOALIGN_PATH environment variable is set. If you obtain a Novoalign license after viral-ngs has already been installed, it can be added to the conda environment by calling:
            # obtain a Novoalign license file: novoalign.lic
            novoalign-license-register /path/to/novoalign.lic
    
    # # --We don't have registers, so we have to manually install novoalign and gatk--
    # #At first install novoalign, then samtools
    # mamba remove samtools
    # mamba install -c bioconda novoalign  # Possibly not necessary, since the path is defined in config.yaml NOVOALIGN_PATH: "/home/jhuang/Tools/novocraft_v3", and novoalign.lic is also in the same path.
    # mamba install -c bioconda samtools
    #
    # mamba install -c bioconda gatk #(3.8)  #IN /usr/local/bin/gatk FROM /home/jhuang/Tools/SPANDx_v3.2/GenomeAnalysisTK.jar
    # #UPDATED TO: '/home/jhuang/Tools/GenomeAnalysisTK-3.6/GenomeAnalysisTK.jar'
    
    # # If necessary, clean up the conda cache. This will remove any partially installed or corrupted packages.
    # conda clean --all
    
    ## reinstall samtools 1.6 --> NOT RELEVANT
    #mamba install samtools=1.6
  13. Run snakemake

    #Set values in samples-*.txt before running viral-ngs
    
    rm -rf ref_genome refsel_db lastal_db
    mv data data_v1;
    mv tmp tmp_v1;
    mkdir data tmp
    mv data_v1/00_raw data
    snakemake --printshellcmds --cores 10
    
    #Manually remove the records from the intrahost results when they occur in the interhost tables, and save the final intrahost results as an Excel sheet in variants.xlsx.
    
    /usr/local/bin/gatk
    https://software.broadinstitute.org/gatk/documentation/tooldocs/org_broadinstitute_gatk_tools_walkers_indels_RealignerTargetCreator.php
    
    java -jar ~/Tools/GenomeAnalysisTK-3.8/GenomeAnalysisTK.jar -T RealignerTargetCreator --help #--> CORRECT
    java -jar ~/Tools/GenomeAnalysisTK-3.6/GenomeAnalysisTK.jar -T RealignerTargetCreator --help #--> CORRECT
    /usr/local/bin/gatk -T RealignerTargetCreator --help
    
    https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_indels_RealignerTargetCreator.php
    -Djava.io.tmpdir=/tmp/tmp-assembly-refine_assembly-2d9z3pcr
    java -jar ~/Tools/GenomeAnalysisTK-2.8-1/GenomeAnalysisTK.jar -T RealignerTargetCreator -I /tmp/tmp-assembly-refine_assembly-2d9z3pcr/tmp0_vh27ji.rmdup.bam -R /tmp/tmp-assembly-refine_assembly-2d9z3pcr/tmppwwyriob.deambig.fasta -o /tmp/tmp-assembly-refine_assembly-2d9z3pcr/tmp_o2f2e0o.intervals --num_threads 120
    java -jar ~/Tools/GenomeAnalysisTK-3.8/GenomeAnalysisTK.jar -T RealignerTargetCreator -I /tmp/tmp-assembly-refine_assembly-2d9z3pcr/tmp0_vh27ji.rmdup.bam -R /tmp/tmp-assembly-refine_assembly-2d9z3pcr/tmppwwyriob.deambig.fasta -o /tmp/tmp-assembly-refine_assembly-2d9z3pcr/tmp_o2f2e0o.intervals --num_threads 120
    ~/Tools/GenomeAnalysisTK-4.1.2.0/gatk -T RealignerTargetCreator -I /tmp/tmp-assembly-refine_assembly-2d9z3pcr/tmp0_vh27ji.rmdup.bam -R /tmp/tmp-assembly-refine_assembly-2d9z3pcr/tmppwwyriob.deambig.fasta -o /tmp/tmp-assembly-refine_assembly-2d9z3pcr/tmp_o2f2e0o.intervals --num_threads 120  #--> NOTE: GATK 4 no longer provides RealignerTargetCreator or the -T option
    
    # -- DEBUG_1: Configure the Conda Environment to Use Host's Java (version 17) while keeping BLAST 2.6.0+ --
    
    bin/taxon_filter.py deplete data/00_raw/HSV1_S1.bam tmp/01_cleaned/HSV1_S1.raw.bam tmp/01_cleaned/HSV1_S1.bmtagger_depleted.bam tmp/01_cleaned/HSV1_S1.rmdup.bam data/01_cleaned/HSV1_S1.cleaned.bam --bmtaggerDbs /home/jhuang/REFs/viral_ngs_dbs/bmtagger_dbs_remove/metagenomics_contaminants_v3 /home/jhuang/REFs/viral_ngs_dbs/bmtagger_dbs_remove/GRCh37.68_ncRNA-GRCh37.68_transcripts-HS_rRNA_mitRNA /home/jhuang/REFs/viral_ngs_dbs/bmtagger_dbs_remove/hg19 --blastDbs /home/jhuang/REFs/viral_ngs_dbs/blast_dbs_remove/metag_v3.ncRNA.mRNA.mitRNA.consensus /home/jhuang/REFs/viral_ngs_dbs/blast_dbs_remove/hybsel_probe_adapters --threads 120 --srprismMemory 142500000 --JVMmemory 256g --loglevel DEBUG
    
    #2024-11-06 15:55:01,162 - __init__:444:_attempt_install - DEBUG - Currently installed version of blast: 2.16.0-hc155240_2
    #2024-11-06 15:55:01,162 - __init__:448:_attempt_install - DEBUG - Expected version of blast:            2.6.0
    #2024-11-06 15:55:01,162 - __init__:449:_attempt_install - DEBUG - Incorrect version of blast installed. Removing it...
    
    #  + (blast 2.6.0 needs java 17, therefore java="/usr/lib/jvm/java-17-openjdk-amd64/bin/java" in /home/jhuang/miniconda3/envs/viral-ngs2/bin/picard) blast                             2.6.0  boost1.64_2          bioconda        Cached
    #  + (bmtagger 3.101 needs blast 2.6.0) blast=2.6.0 + bmtagger 3.101  h470a237_4           bioconda        Cached
    #  + pango                            1.50.7  hbd2fdc8_0           conda-forge     Cached
    #  + openjdk                         11.0.15  hc6918da_0           conda-forge     Cached
    #  + r-base                            4.2.0  h1ae530e_0           pkgs/r          Cached
    #  + picard                            3.0.0  hdfd78af_0           bioconda        Cached
    #  + java -version                     openjdk version "11.0.15-internal" 2022-04-19
    
    # Then edit the following file so that picard 3.0.0 in the viral-ngs2 env uses the host Java (version 17):
    vim /home/jhuang/miniconda3/envs/viral-ngs2/bin/picard
    
    # ---------------------------------------------------------
    # Use Java installed with Anaconda to ensure correct version
    java="$ENV_PREFIX/bin/java"
    
    # if JAVA_HOME is set (non-empty), use it. Otherwise keep "java"
    if [ -n "${JAVA_HOME:=}" ]; then
    if [ -e "$JAVA_HOME/bin/java" ]; then
        java="$JAVA_HOME/bin/java"
    fi
    fi
    # -------------------------------------------------------->
    #COMMENTED
    # Use Java installed with Anaconda to ensure correct version
    #java="$ENV_PREFIX/bin/java"
    
    #MODIFIED
    ## if JAVA_HOME is set (non-empty), use it. Otherwise keep "java"
    #if [ -n "${JAVA_HOME:=}" ]; then
    #  if [ -e "$JAVA_HOME/bin/java" ]; then
    #      java="$JAVA_HOME/bin/java"
    #  fi
    #fi
    java="/usr/lib/jvm/java-17-openjdk-amd64/bin/java"
    # ---------------------------------------------------------
    
    # -- DEBUG_2: lastal version not compatible --
    bin/ncbi.py fetch_fastas j.huang@uke.de lastal_db NC_001806.2 --combinedFilePrefix lastal --removeSeparateFiles --forceOverwrite --chunkSize 300
    bin/taxon_filter.py filter_lastal_bam data/01_cleaned/HSV1_S1.cleaned.bam lastal_db/lastal.fasta data/01_cleaned/HSV1_S1.taxfilt.bam --threads 120 --JVMmemory 256g --loglevel DEBUG
    mamba remove last
    mamba install -c bioconda last=876
    lastal -V
    bin/taxon_filter.py filter_lastal_bam data/01_cleaned/HSV1_S1.cleaned.bam lastal_db/lastal.fasta data/01_cleaned/HSV1_S1.taxfilt.bam --threads 120 --JVMmemory 256g --loglevel DEBUG
    
    # -- DEBUG_3: gap2seq not installed --
    bin/assembly.py gapfill_gap2seq tmp/02_assembly/HSV1_S1_sampled.assembly2-scaffolded.fasta data/01_per_sample/HSV1_S1_sampled.cleaned.bam tmp/02_assembly/HSV1_S1_sampled.assembly2-gapfilled.fasta --memLimitGb 12 --maskErrors --randomSeed 0 --loglevel DEBUG
    
    #2024-11-07 12:34:14,732 - __init__:460:_attempt_install - DEBUG - Attempting install...
    #2024-11-07 12:34:14,733 - __init__:545:install_package - DEBUG - Creating conda environment and installing package gap2seq=2.1
    mamba install gap2seq=2.1
    
    # -- DEBUG_4 --
    bin/assembly.py impute_from_reference tmp/02_assembly/HSV1_S1_sampled.assembly2-gapfilled.fasta tmp/02_assembly/HSV1_S1_sampled.assembly2-scaffold_ref.fasta tmp/02_assembly/HSV1_S1_sampled.assembly3-modify.fasta --newName HSV1_S1_sampled --replaceLength 55 --minLengthFraction 0.05 --minUnambig 0.05 --index --loglevel DEBUG
    
    2024-11-07 14:05:20,438 - __init__:445:_attempt_install - DEBUG - Currently installed version of muscle: 5.2-h4ac6f70_0
    2024-11-07 14:05:20,438 - __init__:448:_attempt_install - DEBUG - Expected version of muscle:            3.8.1551
    2024-11-07 14:05:20,438 - __init__:449:_attempt_install - DEBUG - Incorrect version of muscle installed. Removing it...
    mamba install muscle=3.8
    #- muscle       5.2  h4ac6f70_0  bioconda     Cached
    #+ muscle  3.8.1551  h7d875b9_6  bioconda     Cached
    
    #/home/jhuang/Tools/novocraft_v3/novoalign -f data/01_per_sample/HSV1_S1.cleaned.bam -r Random -l 20 -g 40 -x 20 -t 100 -F BAM -d tmp/02_assembly/HSV1_S1.assembly4-refined.nix -o SAM
    
    # -- DEBUG_5 --
    bin/assembly.py refine_assembly tmp/02_assembly/HSV1_S1_sampled.assembly3-modify.fasta data/01_per_sample/HSV1_S1_sampled.cleaned.bam tmp/02_assembly/HSV1_S1_sampled.assembly4-refined.fasta --outVcf tmp/02_assembly/HSV1_S1_sampled.assembly3.vcf.gz --min_coverage 2 --novo_params '-r Random -l 20 -g 40 -x 20 -t 502' --threads 120 --loglevel DEBUG
    #Shebang in /usr/local/bin/gatk is corrupt.
    
    # -- DEBUG_6 --
    bin/interhost.py multichr_mafft ref_genome/reference.fasta data/02_assembly/HSV1_S1_sampled.fasta data/03_multialign_to_ref --ep 0.123 --maxiters 1000 --preservecase --localpair --outFilePrefix aligned --sampleNameListFile data/03_multialign_to_ref/sampleNameList.txt --threads 120 --loglevel DEBUG
    2024-11-07 14:47:34,163 - __init__:445:_attempt_install - DEBUG - Currently installed version of mafft: 7.526-h4bc722e_0
    2024-11-07 14:47:34,163 - __init__:448:_attempt_install - DEBUG - Expected version of mafft:            7.221
    2024-11-07 14:47:34,164 - __init__:449:_attempt_install - DEBUG - Incorrect version of mafft installed. Removing it...
    mamba install mafft=7.221
    
    # -- DEBUG_7 --
    bin/intrahost.py vphaser_one_sample data/02_align_to_self/HSV1_S1_sampled.mapped.bam data/02_assembly/HSV1_S1_sampled.fasta data/04_intrahost/vphaser2.HSV1_S1_sampled.txt.gz --vphaserNumThreads 120 --removeDoublyMappedReads --minReadsEach 5 --maxBias 10 --loglevel DEBUG
    
    export TMPDIR=/home/jhuang/tmp
    (viral-ngs) jhuang@WS-2290C:~/DATA/Data_Nicole_CaptureProbeSequencing$ /home/jhuang/miniconda3/envs/viral-ngs/bin/vphaser2 -i /home/jhuang/tmp/tmp_bq17yoq.mapped-withdoublymappedremoved.bam -o /home/jhuang/tmp/tmpyg8mlj5qvphaser2
    
    samtools depth /home/jhuang/tmp/tmp_bq17yoq.mapped-withdoublymappedremoved.bam > coverage.txt
    
    # -- DEBUG_8 --
    snakemake --printshellcmds --cores 100
    bin/intrahost.py merge_to_vcf ref_genome/reference.fasta data/04_intrahost/isnvs.vcf.gz --samples HSV1_S1 --isnvs data/04_intrahost/vphaser2.HSV1_S1.txt.gz --alignments data/03_multialign_to_ref/aligned_1.fasta --strip_chr_version --parse_accession
    snakemake --printshellcmds --cores 100
    
    # -- DEBUG_9 --
    bin/assembly.py refine_assembly tmp/02_assembly/HSV-Klinik_S2.assembly3-modify.fasta data/01_per_sample/HSV-Klinik_S2.cleaned.bam tmp/02_assembly/HSV-Klinik_S2.assembly4-refined.fasta --outVcf tmp/02_assembly/HSV-Klinik_S2.assembly3.vcf.gz --min_coverage 2 --novo_params '-r Random -l 20 -g 40 -x 20 -t 502' --threads 120 --loglevel DEBUG
    /usr/local/bin/gatk -Xmx20g -Djava.io.tmpdir=/home/jhuang/tmp/tmp-assembly-refine_assembly-dx3dr73p -T RealignerTargetCreator -I /home/jhuang/tmp/tmp-assembly-refine_assembly-dx3dr73p/tmpwbzvjo9j.rmdup.bam -R /home/jhuang/tmp/tmp-assembly-refine_assembly-dx3dr73p/tmpxq4obe29.deambig.fasta -o /home/jhuang/tmp/tmp-assembly-refine_assembly-dx3dr73p/tmptkw8zcf3.intervals --num_threads 120
    
    mamba install gatk=3.6
    #IMPORTANT_REPLACE "sudo cp /home/jhuang/miniconda3/envs/viral-ngs4/bin/gatk3 /usr/local/bin/gatk"
    #IMPORTANT_UPDATE jar_file in the file with "/home/jhuang/Tools/GenomeAnalysisTK-3.6/GenomeAnalysisTK.jar"
    #IMPORTANT_SET /home/jhuang/Tools/GenomeAnalysisTK-3.6 as GATK_PATH in config.yaml
    #IMPORTANT_CHECK if it works
    #        java -jar /home/jhuang/Tools/GenomeAnalysisTK-3.6/GenomeAnalysisTK.jar -T RealignerTargetCreator --help
    #        /usr/local/bin/gatk -T RealignerTargetCreator --help
    #IMPORTANT_NOTE: the env viral-ngs4 cannot be activated from the base env due to a python3 conflict!
    
    # -- DEBUG_10 (if the sequencing is too shallow, run vphaser2 separately) --
    /home/jhuang/miniconda3/envs/viral-ngs4/bin/vphaser2 -i /tmp/tmp2jl4plhy.mapped-withdoublymappedremoved.bam -o /tmp/tmp1x6jsiu_vphaser2
    [EXIT]: gather_alignments: Failed to set region for reference HSV-Klinik_S2-1 in file /tmp/tmp2jl4plhy.mapped-withdoublymappedremoved.bam
    # Running intrahost.py separately --> no error:
    #342 reads
    
    2024-11-08 14:27:33,575 - intrahost:223:compute_library_bias - DEBUG - LB:standard has 161068 reads in 1 read group(s) (HSV-Klinik_S2)
    2024-11-08 14:27:34,875 - __init__:445:_attempt_install - DEBUG - Currently installed version of vphaser2: 2.0-h7a259b3_14
    
    samtools index HSV1_S1.mapped.bam
    samtools index HSV-Klinik_S2.mapped.bam
    bin/intrahost.py vphaser_one_sample data/02_align_to_self/HSV1_S1.mapped.bam data/02_assembly/HSV1_S1.fasta data/04_intrahost/vphaser2.HSV1_S1.txt.gz --vphaserNumThreads 120 --removeDoublyMappedReads --minReadsEach 5 --maxBias 10 --loglevel DEBUG
    bin/intrahost.py vphaser_one_sample data/02_align_to_self/HSV-Klinik_S2.mapped.bam data/02_assembly/HSV-Klinik_S2.fasta data/04_intrahost/vphaser2.HSV-Klinik_S2.txt.gz   --minReadsEach 1 --maxBias 2 --loglevel DEBUG   # --vphaserNumThreads 120 --removeDoublyMappedReads
    /home/jhuang/miniconda3/envs/viral-ngs4/bin/vphaser2 -i data/02_align_to_self/HSV-Klinik_S2.mapped.bam -o /tmp/tmpgacpc6eqvphaser2
    
    samtools idxstats data/02_align_to_self/HSV-Klinik_S2.mapped.bam
    samtools index data/02_align_to_self/HSV-Klinik_S2.mapped.bam
    samtools view -H data/02_align_to_self/HSV-Klinik_S2.mapped.bam
    /home/jhuang/miniconda3/envs/viral-ngs4/bin/vphaser2 -i data/02_align_to_self/HSV-Klinik_S2.mapped.bam -o /tmp/output_dir
    /home/jhuang/miniconda3/envs/viral-ngs4/bin/vphaser2 -i data/02_align_to_self/HSV-Klinik_S2.mapped.bam -o /tmp/tmpgacpc6eqvphaser2
    
    samtools view -b data/02_align_to_self/HSV-Klinik_S2.mapped.bam "HSV-Klinik_S2-1" > subset.bam
    samtools index subset.bam
    @SQ     SN:HSV-Klinik_S2-1      LN:141125       AS:tmp35_s3ghx.ref_copy.nix
    samtools view -b subset.bam "HSV-Klinik_S2-1:1-10000" > small_subset.bam
    samtools index small_subset.bam
    /home/jhuang/miniconda3/envs/viral-ngs4/bin/vphaser2 -i small_subset.bam -o /tmp/output_dir
    /home/jhuang/miniconda3/envs/viral-ngs4/bin/vphaser2 -i subset.bam -o vphaser2_out
    
    # -- DEBUG_11 in step multi_align_mafft: aligned_1.fasta is always empty; we need to generate it manually with mafft and mark it as complete --
    
    #[Fri Nov  8 14:51:45 2024]
    #rule multi_align_mafft:
    #    input: data/02_assembly/HSV1_S1.fasta, data/02_assembly/HSV-Klinik_S2.fasta, ref_genome/reference.fasta
    #    output: data/03_multialign_to_ref/sampleNameList.txt, data/03_multialign_to_ref/aligned_1.fasta, data/03_multialign_to_ref/aligned_2.fasta, ... data/03_multialign_to_ref/aligned_161.fasta
    #    jobid: 24
    #    resources: tmpdir=/tmp, mem=8, threads=120
    
    bin/interhost.py multichr_mafft ref_genome/reference.fasta data/02_assembly/HSV1_S1.fasta data/02_assembly/HSV-Klinik_S2.fasta data/03_multialign_to_ref --ep 0.123 --maxiters 1000 --preservecase --localpair --outFilePrefix aligned --sampleNameListFile data/03_multialign_to_ref/sampleNameList.txt --threads 120 --loglevel DEBUG
    #b'/home/jhuang/miniconda3/envs/viral-ngs4/bin/python\n'
    #-------
    #2024-11-08 14:51:46,324 - cmd:193:main_argparse - INFO - software version: 1522433800, python version: 3.6.7 | packaged by conda-forge | (default, Feb 28 2019, 09:07:38)
    #[GCC 7.3.0]
    #2024-11-08 15:00:26,375 - cmd:195:main_argparse - INFO - command: bin/interhost.py multichr_mafft inFastas=['ref_genome/reference.fasta', 'data/02_assembly/HSV1_S1.fasta', 'data/02_assembly/HSV-Klinik_S2.fasta'] localpair=True globalpair=None preservecase=True reorder=None gapOpeningPenalty=1.53 ep=0.123 verbose=False outputAsClustal=None maxiters=1000 outDirectory=data/03_multialign_to_ref outFilePrefix=aligned sampleRelationFile=None sampleNameListFile=data/03_multialign_to_ref/sampleNameList.txt threads=120 loglevel=DEBUG tmp_dir=/tmp tmp_dirKeep=False
    #2024-11-08 15:00:26,375 - cmd:209:main_argparse - DEBUG - using tempDir: /tmp/tmp-interhost-multichr_mafft-sw91_svl
    #2024-11-08 15:00:27,718 - __init__:445:_attempt_install - DEBUG - Currently installed version of mafft: 7.221-0
    #2024-11-08 15:00:27,719 - mafft:141:execute - DEBUG - /home/jhuang/miniconda3/envs/viral-ngs4/bin/mafft --thread 120 --localpair --preservecase --op 1.53 --ep 0.123 --quiet --maxiterate 1000 /tmp/tmp-interhost-multichr_mafft-sw91_svl/tmp68_ln_ha.fasta
    
    snakemake --cleanup-metadata 03_multialign_to_ref --cores 4
    
    # -- DEBUG_12 --
    #[EXIT]: gather_alignments: Failed to set region for reference HSV-Klinik_S2-1 in file data/02_align_to_self/HSV-Klinik_S2.mapped.bam
    
    #DEBUG_PROCESS1: rm temp/*.region
    /home/jhuang/miniconda3/envs/viral-ngs4/bin/vphaser2 -w 5000 -i data/02_align_to_self/HSV-Klinik_S2.mapped.bam -o /home/jhuang/DATA/Data_Nicole_CaptureProbeSequencing/temp
    #                        5209 snp
    #                        21 lv
    
    #SOLUTION: MODIFIED AS 'cmd = [self.install_and_get_path(), '-w 5000', '-i', inBam, '-o', outDir]' in bin/tools/vphaser2.py
    #ADDED
    cmd.append('-w')
    cmd.append('25000')
    
    bin/intrahost.py vphaser_one_sample data/02_align_to_self/HSV-Klinik_S2.mapped.bam data/02_assembly/HSV-Klinik_S2.fasta data/04_intrahost/vphaser2.HSV-Klinik_S2.txt.gz --vphaserNumThreads 120 --removeDoublyMappedReads  --minReadsEach 5 --maxBias 10 --loglevel DEBUG
    
    #BEFORE_CHANGE:
    b'\n--------------------------------------------------------\nProgram runs with the following Parameter setting:\n\n\tinput BAM file\t=\t/tmp/tmpt6fgovqk.mapped-withdoublymappedremoved.bam\n\toutput Directory\t=\t/tmp/tmp53_oxecyvphaser2\n\terrModel\t\t=\tpileup + phase\n\talpha\t\t=\t0.05\n\tignoreBases \t=\t0\n\t(var_matepair, var_cycle, var_dt, var_qt)\t=\t1,1,1,20\n\tpSample\t\t=\t30%\n\twindowSz\t=\t500\n\tdelta\t=\t2\n\n
    
    #AFTER_CHANGE:
    windowSz=5000
    
    #mkdir 02_align_to_ref
    bin/read_utils.py align_and_fix data/01_per_sample/HSV1_S1.cleaned.bam refsel_db/refsel.fasta --outBamAll data/02_align_to_ref/HSV1_S1.bam --outBamFiltered data/02_align_to_ref/HSV1_S1.mapped.bam --aligner novoalign --aligner_options '-r Random -l 20 -g 40 -x 20 -t 100 -k' --threads 120
    
    bin/read_utils.py align_and_fix data/01_per_sample/HSV-Klinik_S2.cleaned.bam refsel_db/refsel.fasta --outBamAll data/02_align_to_ref/HSV-Klinik_S2.bam --outBamFiltered data/02_align_to_ref/HSV-Klinik_S2.mapped.bam --aligner novoalign --aligner_options '-r Random -l 20 -g 40 -x 20 -t 100 -k' --threads 120
    
    bin/intrahost.py vphaser_one_sample data/02_align_to_ref/HSV1_S1.mapped.bam refsel_db/refsel.fasta data/04_intrahost/vphaser2.HSV1_S1_on_ref.txt.gz --vphaserNumThreads 120 --removeDoublyMappedReads --minReadsEach 5 --maxBias 10
    
    bin/intrahost.py vphaser_one_sample data/02_align_to_ref/HSV-Klinik_S2.mapped.bam refsel_db/refsel.fasta data/04_intrahost/vphaser2.HSV-Klinik_S2_on_ref.txt.gz --vphaserNumThreads 120 --removeDoublyMappedReads --minReadsEach 5 --maxBias 10
    
    /home/jhuang/miniconda3/envs/viral-ngs4/bin/vphaser2 -w 10000 -i data/02_align_to_ref/HSV-Klinik_S2.mapped.bam -o /home/jhuang/DATA/Data_Nicole_CaptureProbeSequencing/temp
    
    mkdir 02_align_to_NC_001806
    bin/read_utils.py align_and_fix data/01_per_sample/HSV1_S1.cleaned.bam refsel_db/NC_001806.2.fasta --outBamAll data/02_align_to_NC_001806/HSV1_S1.bam --outBamFiltered data/02_align_to_NC_001806/HSV1_S1.mapped.bam --aligner novoalign --aligner_options '-r Random -l 20 -g 40 -x 20 -t 100 -k' --threads 120
    
    bin/read_utils.py align_and_fix data/01_per_sample/HSV-Klinik_S2.cleaned.bam refsel_db/NC_001806.2.fasta --outBamAll data/02_align_to_NC_001806/HSV-Klinik_S2.bam --outBamFiltered data/02_align_to_NC_001806/HSV-Klinik_S2.mapped.bam --aligner novoalign --aligner_options '-r Random -l 20 -g 40 -x 20 -t 100 -k' --threads 120
    
    bin/intrahost.py vphaser_one_sample data/02_align_to_NC_001806/HSV1_S1.mapped.bam refsel_db/NC_001806.2.fasta data/04_intrahost/vphaser2.HSV1_S1_on_NC_001806.txt.gz --vphaserNumThreads 120 --removeDoublyMappedReads --minReadsEach 5 --maxBias 10
    
    bin/intrahost.py vphaser_one_sample data/02_align_to_NC_001806/HSV-Klinik_S2.mapped.bam refsel_db/NC_001806.2.fasta data/04_intrahost/vphaser2.HSV-Klinik_S2_on_NC_001806.txt.gz --vphaserNumThreads 120 --removeDoublyMappedReads --minReadsEach 5 --maxBias 10
    
    #align to self
    #-rw-rw-r-- 1 jhuang jhuang  47M Nov  8 18:24 HSV-Klinik_S2.bam
    #-rw-rw-r-- 1 jhuang jhuang 6,3M Nov  8 18:24 HSV-Klinik_S2.mapped.bam
    #-rw-rw-r-- 1 jhuang jhuang  74M Nov  8 17:25 HSV1_S1.bam
    #-rw-rw-r-- 1 jhuang jhuang  25K Nov  8 17:25 HSV1_S1.mapped.bam
    
    #align to NC_001806
    #-rw-rw-r-- 1 jhuang jhuang  48M Nov 11 13:26 HSV-Klinik_S2.bam
    #-rw-rw-r-- 1 jhuang jhuang 4,9M Nov 11 13:26 HSV-Klinik_S2.mapped.bam
    #-rw-rw-r-- 1 jhuang jhuang  74M Nov 11 13:31 HSV1_S1.bam
    #-rw-rw-r-- 1 jhuang jhuang  34K Nov 11 13:31 HSV1_S1.mapped.bam
    
    #align to OP297860
    #-rw-rw-r-- 1 jhuang jhuang  47M Nov 12 12:35 HSV-Klinik_S2.bam
    #-rw-rw-r-- 1 jhuang jhuang 5,3M Nov 12 12:35 HSV-Klinik_S2.mapped.bam
    #-rw-rw-r-- 1 jhuang jhuang  74M Nov 12 12:31 HSV1_S1.bam
    #-rw-rw-r-- 1 jhuang jhuang  34K Nov 12 12:31 HSV1_S1.mapped.bam
    
    #align to self
    #-rw-rw-r-- 1 jhuang jhuang  47M Nov 11 21:44 HSV-Klinik_S2.bam
    #-rw-rw-r-- 1 jhuang jhuang 6,3M Nov 11 21:44 HSV-Klinik_S2.mapped.bam
    #-rw-rw-r-- 1 jhuang jhuang  74M Nov 11 21:09 HSV1_S1.bam
    #-rw-rw-r-- 1 jhuang jhuang  25K Nov 11 21:09 HSV1_S1.mapped.bam
    
    # -- DEBUG_13 --
    [Mon Nov 11 15:36:54 2024]
    rule isnvs_vcf:
        input: data/04_intrahost/vphaser2.HSV1_S1.txt.gz, data/04_intrahost/vphaser2.HSV-Klinik_S2.txt.gz, data/03_multialign_to_ref/aligned_1.fasta, ref_genome/reference.fasta
        output: data/04_intrahost/isnvs.vcf.gz, data/04_intrahost/isnvs.vcf.gz.tbi, data/04_intrahost/isnvs.annot.vcf.gz, data/04_intrahost/isnvs.annot.txt.gz, data/04_intrahost/isnvs.annot.vcf.gz.tbi
        jobid: 21
        resources: tmpdir=/tmp, mem=4
    
    b'/home/jhuang/miniconda3/envs/viral-ngs4/bin/python\n'
    -------
    bin/intrahost.py merge_to_vcf ref_genome/reference.fasta data/04_intrahost/isnvs.vcf.gz --samples HSV1_S1 HSV-Klinik_S2 --isnvs data/04_intrahost/vphaser2.HSV1_S1.txt.gz data/04_intrahost/vphaser2.HSV-Klinik_S2.txt.gz --alignments data/03_multialign_to_ref/aligned_1.fasta --strip_chr_version --parse_accession
    b'/home/jhuang/miniconda3/envs/viral-ngs4/bin/python\n'
    -------
    2024-11-11 15:36:55,581 - cmd:193:main_argparse - INFO - software version: 1522433800, python version: 3.6.7 | packaged by conda-forge | (default, Feb 28 2019, 09:07:38)
    [GCC 7.3.0]
    2024-11-11 15:36:55,581 - cmd:195:main_argparse - INFO - command: bin/intrahost.py merge_to_vcf refFasta=ref_genome/reference.fasta outVcf=data/04_intrahost/isnvs.vcf.gz samples=['HSV1_S1', 'HSV-Klinik_S2'] isnvs=['data/04_intrahost/vphaser2.HSV1_S1.txt.gz', 'data/04_intrahost/vphaser2.HSV-Klinik_S2.txt.gz'] alignments=['data/03_multialign_to_ref/aligned_1.fasta'] strip_chr_version=True naive_filter=False parse_accession=True loglevel=INFO
    2024-11-11 15:36:55,581 - intrahost:476:merge_to_vcf - INFO - loaded CoordMapper for all genomes, starting VCF merge...
    Traceback (most recent call last):
      File "bin/intrahost.py", line 1152, in <module>
        util.cmd.main_argparse(__commands__, __doc__)
      File "/home/jhuang/Tools/viral-ngs/bin/util/cmd.py", line 221, in main_argparse
        ret = args.func_main(args)
      File "/home/jhuang/Tools/viral-ngs/bin/util/cmd.py", line 102, in _main
        mainfunc(**args2)
      File "bin/intrahost.py", line 677, in merge_to_vcf
        (sample, (s_pos, samp_offsets[sample]), ref_sequence.id, pos))
    NotImplementedError: Sample HSV-Klinik_S2-1 has variants at 2 positions (8704, 8703) mapped to same reference position (AB291960.1:63)
    [Mon Nov 11 15:36:56 2024]
    Error in rule isnvs_vcf:
        jobid: 0
        output: data/04_intrahost/isnvs.vcf.gz, data/04_intrahost/isnvs.vcf.gz.tbi, data/04_intrahost/isnvs.annot.vcf.gz, data/04_intrahost/isnvs.annot.txt.gz, data/04_intrahost/isnvs.annot.vcf.gz.tbi
    
    RuleException:
    CalledProcessError in line 61 of /mnt/md1/DATA_md1/Data_Nicole_CaptureProbeSequencing/bin/pipes/rules/intrahost.rules:
    Command 'set -euo pipefail; bin/intrahost.py merge_to_vcf ref_genome/reference.fasta data/04_intrahost/isnvs.vcf.gz --samples HSV1_S1 HSV-Klinik_S2 --isnvs data/04_intrahost/vphaser2.HSV1_S1.txt.gz data/04_intrahost/vphaser2.HSV-Klinik_S2.txt.gz --alignments data/03_multialign_to_ref/aligned_1.fasta --strip_chr_version --parse_accession' returned non-zero exit status 1.
      File "/mnt/md1/DATA_md1/Data_Nicole_CaptureProbeSequencing/bin/pipes/rules/intrahost.rules", line 61, in __rule_isnvs_vcf
      File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    Exiting because a job execution failed. Look above for error message
    Shutting down, this might take some time.
    Exiting because a job execution failed. Look above for error message
    Complete log: /mnt/md1/DATA_md1/Data_Nicole_CaptureProbeSequencing/.snakemake/log/2024-11-11T151925.063825.snakemake.log
    
    # -- DEBUG_14 --
    bin/interhost.py multichr_mafft ref_genome/reference.fasta data/02_assembly/HSV-Klinik_S2.fasta data/03_multialign_to_ref --ep 0.123 --maxiters 1000 --preservecase --localpair --outFilePrefix aligned --sampleNameListFile data/03_multialign_to_ref/sampleNameList.txt --threads 120
    
    bin/read_utils.py bwamem_idxstats inBam=data/01_cleaned/HSV-Klinik_S2.cleaned.bam refFasta=/home/jhuang/REFs/viral_ngs_dbs/spikeins/ercc_spike-ins.fasta outBam=None outStats=reports/spike_count/HSV-Klinik_S2.spike_count.txt min_score_to_filter=60 aligner_options=None loglevel=INFO tmp_dir=/tmp tmp_dirKeep=False loglevel=DEBUG
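The DEBUG_12 change to `bin/tools/vphaser2.py` adds the window-size option `-w` to the vphaser2 command list. A minimal sketch of that idea follows, assuming the command is an argv list passed to a subprocess call. The helper name `build_vphaser2_cmd` and its parameters are hypothetical and not part of viral-ngs; only the `-i`, `-o`, and `-w` flags are taken from the commands above.

```python
import shlex


def build_vphaser2_cmd(vphaser2_path, in_bam, out_dir, window_size=None):
    """Return an argv list for vphaser2 (hypothetical helper, not viral-ngs code)."""
    cmd = [vphaser2_path, "-i", in_bam, "-o", out_dir]
    if window_size is not None:
        # The flag and its value are appended as two separate argv elements,
        # mirroring the cmd.append('-w'); cmd.append('25000') change in the notes.
        cmd += ["-w", str(window_size)]
    return cmd


cmd = build_vphaser2_cmd(
    "vphaser2",
    "data/02_align_to_self/HSV-Klinik_S2.mapped.bam",
    "temp",
    window_size=25000,
)
# Print the command as a single shell-style line for inspection.
print(shlex.join(cmd))
```

Keeping `-w` and its value as separate list elements is the conventional form for an argv list and makes the window size easy to change per sample.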
  14. Assembly results (check how the four assembly versions differ, e.g. 15K vs 73K for assembly1-spades, in ~/DATA/Data_Nicole_CaptureProbeSequencing/tmp/02_assembly)

    HSV1_S1.assembly2-gapfilled.fasta vs HSV-Klinik_S2.assembly2-gapfilled.fasta
    
    -rw-rw-r-- 1 jhuang jhuang  15K Nov  8 17:12 HSV1_S1.assembly1-spades.fasta
    -rw-rw-r-- 1 jhuang jhuang 155K Nov  8 17:12 HSV1_S1.assembly2-scaffold_ref.fasta
    -rw-rw-r-- 1 jhuang jhuang 130K Nov  8 17:12 HSV1_S1.assembly2-scaffolded.fasta
    -rw-rw-r-- 1 jhuang jhuang  176 Nov  8 17:12 HSV1_S1.assembly2-alternate_sequences.fasta
    -rw-rw-r-- 1 jhuang jhuang 130K Nov  8 17:14 HSV1_S1.assembly2-gapfilled.fasta
    -rw-rw-r-- 1 jhuang jhuang   26 Nov  8 17:18 HSV1_S1.assembly3-modify.fasta.fai
    -rw-rw-r-- 1 jhuang jhuang  182 Nov  8 17:18 HSV1_S1.assembly3-modify.dict
    -rw-r--r-- 1 jhuang jhuang 1,7M Nov  8 17:18 HSV1_S1.assembly3-modify.nix
    -rw-rw-r-- 1 jhuang jhuang 155K Nov  8 17:18 HSV1_S1.assembly3-modify.fasta
    -rw-rw-r-- 1 jhuang jhuang  212 Nov  8 17:21 HSV1_S1.assembly3.vcf.gz.tbi
    -rw-rw-r-- 1 jhuang jhuang  183 Nov  8 17:21 HSV1_S1.assembly4-refined.dict
    -rw-rw-r-- 1 jhuang jhuang   26 Nov  8 17:21 HSV1_S1.assembly4-refined.fasta.fai
    -rw-r--r-- 1 jhuang jhuang 1,2M Nov  8 17:21 HSV1_S1.assembly4-refined.nix
    -rw-rw-r-- 1 jhuang jhuang 137K Nov  8 17:21 HSV1_S1.assembly4-refined.fasta
    -rw-rw-r-- 1 jhuang jhuang 494K Nov  8 17:21 HSV1_S1.assembly3.vcf.gz
    -rw-rw-r-- 1 jhuang jhuang  203 Nov  8 17:22 HSV1_S1.assembly4.vcf.gz.tbi
    -rw-rw-r-- 1 jhuang jhuang 428K Nov  8 17:22 HSV1_S1.assembly4.vcf.gz
    -rw-rw-r-- 1 jhuang jhuang  73K Nov  8 18:03 HSV-Klinik_S2.assembly1-spades.fasta
    -rw-rw-r-- 1 jhuang jhuang 144K Nov  8 18:03 HSV-Klinik_S2.assembly2-scaffolded.fasta
    -rw-rw-r-- 1 jhuang jhuang    0 Nov  8 18:03 HSV-Klinik_S2.assembly2-alternate_sequences.fasta
    -rw-rw-r-- 1 jhuang jhuang 155K Nov  8 18:03 HSV-Klinik_S2.assembly2-scaffold_ref.fasta
    -rw-rw-r-- 1 jhuang jhuang 144K Nov  8 18:07 HSV-Klinik_S2.assembly2-gapfilled.fasta
    -rw-rw-r-- 1 jhuang jhuang   32 Nov  8 18:12 HSV-Klinik_S2.assembly3-modify.fasta.fai
    -rw-rw-r-- 1 jhuang jhuang  194 Nov  8 18:12 HSV-Klinik_S2.assembly3-modify.dict
    -rw-r--r-- 1 jhuang jhuang 1,7M Nov  8 18:12 HSV-Klinik_S2.assembly3-modify.nix
    -rw-rw-r-- 1 jhuang jhuang 155K Nov  8 18:12 HSV-Klinik_S2.assembly3-modify.fasta
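File sizes alone (15K vs 73K) do not show how many records an assembly contains or how many bases are ambiguous. The sketch below is one possible way to compare the FASTA files listed above, assuming plain uncompressed FASTA text. The function name `fasta_stats` and the example records are illustrative only.

```python
def fasta_stats(fasta_text):
    """Return {record_name: (length, n_count)} for FASTA-formatted text."""
    stats = {}
    name = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the name.
            name = (line[1:].split() or ["unnamed"])[0]
            stats[name] = (0, 0)
        elif name is not None:
            length, n_count = stats[name]
            stats[name] = (length + len(line), n_count + line.upper().count("N"))
    return stats


# Made-up example records, not taken from the assemblies above.
example = ">contig_1 demo\nACGTNNAC\nGGNN\n>contig_2\nACGT\n"
for record, (length, n_count) in fasta_stats(example).items():
    print(record, length, n_count)
```

For a real file, the text could be read with `open(path).read()` and passed to the same function.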
  15. Draw coverage plots

    * Map the contigs to the reference JX878414
        bowtie2-build refsel_db/refsel.fasta refsel_index
    
        #spades/contigs.fasta
        #bowtie2 -f -x refsel_index -U HSV1_S1_vrap_out/HSV1_S1_contigs.fasta -N 1 --score-min L,0,-1 --rdg 5,3 --rfg 5,3 -S HSV1_S1_contigs_aligned.sam
        bowtie2 -f -x refsel_index -U HSV1_S1_vrap_out/HSV1_S1_contigs.fasta -S HSV1_S1_contigs_aligned.sam
        samtools view -bS -F 4 HSV1_S1_contigs_aligned.sam > HSV1_S1_contigs_aligned.bam
        #samtools view -S -b HSV1_S1_contigs_aligned.sam > HSV1_S1_contigs_aligned.bam
        samtools sort HSV1_S1_contigs_aligned.bam -o HSV1_S1_contigs_aligned_sorted.bam
        samtools index HSV1_S1_contigs_aligned_sorted.bam
        samtools view HSV1_S1_contigs_aligned_sorted.bam > HSV1_S1_contigs_aligned_sorted.sam
    
        Query sequence name Query length    ORF density Percentage identity Subject sequence length Subject accession   Subject name    E-value
        #TODO: In the next analysis, consider keeping the query_coverage column for quality control.
    
        #2486 reads; of these:
        #  2486 (100.00%) were unpaired; of these:
        #    2407 (96.82%) aligned 0 times
        #    79 (3.18%) aligned exactly 1 time
        #    0 (0.00%) aligned >1 times
        #3.18% overall alignment rate
    
        11 reads; of these:
        11 (100.00%) were unpaired; of these:
            8 (72.73%) aligned 0 times
            3 (27.27%) aligned exactly 1 time
            0 (0.00%) aligned >1 times
        27.27% overall alignment rate
    
        NODE_14_length_862_cov_192.742857
        NODE_19_length_621_cov_61.380567
        CAP_16_length_559
    
        gi|946552631|gb|KT425109.1| Human alphaherpesvirus 1 strain KOS79
        gi|2549839763|gb|OQ724891.1|    Human alphaherpesvirus 1 strain BP-K5
        gi|2228071600|gb|ON007132.1|    Human alphaherpesvirus 1 strain v40_unk_gen
    
        samtools faidx HSV1_S1_contigs.fasta 'NODE_14_length_862_cov_192.742857' > HSV1_S1_contigs_.fasta
        samtools faidx HSV1_S1_contigs.fasta 'NODE_19_length_621_cov_61.380567' >> HSV1_S1_contigs_.fasta
        samtools faidx HSV1_S1_contigs.fasta 'CAP_16_length_559' >> HSV1_S1_contigs_.fasta
    
        bowtie2 -f -x refsel_index -U HSV-Klinik_S2_vrap_out/HSV-Klinik_S2_contigs.fasta -S HSV-Klinik_S2_contigs_aligned.sam
        samtools view -bS -F 4 HSV-Klinik_S2_contigs_aligned.sam > HSV-Klinik_S2_contigs_aligned.bam
        #samtools view -S -b HSV-Klinik_S2_contigs_aligned.sam > HSV-Klinik_S2_contigs_aligned.bam
        samtools sort HSV-Klinik_S2_contigs_aligned.bam -o HSV-Klinik_S2_contigs_aligned_sorted.bam
        samtools index HSV-Klinik_S2_contigs_aligned_sorted.bam
        samtools view HSV-Klinik_S2_contigs_aligned_sorted.bam > HSV-Klinik_S2_contigs_aligned_sorted.sam
    
        31 reads; of these:
        31 (100.00%) were unpaired; of these:
            8 (25.81%) aligned 0 times
            21 (67.74%) aligned exactly 1 time
            2 (6.45%) aligned >1 times
        74.19% overall alignment rate
    
        samtools faidx HSV-Klinik_S2_contigs.fasta 'NODE_14_length_2544_cov_467.428217' > HSV-Klinik_S2_contigs_.fasta
        samtools faidx HSV-Klinik_S2_contigs.fasta 'NODE_81_length_1225_cov_1080.820583' >> HSV-Klinik_S2_contigs_.fasta
        samtools faidx HSV-Klinik_S2_contigs.fasta 'NODE_114_length_1046_cov_1018.474429' >> HSV-Klinik_S2_contigs_.fasta
        samtools faidx HSV-Klinik_S2_contigs.fasta 'NODE_117_length_1033_cov_1618.421858' >> HSV-Klinik_S2_contigs_.fasta
        samtools faidx HSV-Klinik_S2_contigs.fasta 'NODE_152_length_927_cov_105.347500' >> HSV-Klinik_S2_contigs_.fasta
        samtools faidx HSV-Klinik_S2_contigs.fasta 'NODE_161_length_900_cov_3.283312' >> HSV-Klinik_S2_contigs_.fasta
        samtools faidx HSV-Klinik_S2_contigs.fasta 'NODE_220_length_795_cov_0.748879' >> HSV-Klinik_S2_contigs_.fasta
        samtools faidx HSV-Klinik_S2_contigs.fasta 'NODE_245_length_763_cov_900.518868' >> HSV-Klinik_S2_contigs_.fasta
        samtools faidx HSV-Klinik_S2_contigs.fasta 'NODE_352_length_664_cov_61.363128' >> HSV-Klinik_S2_contigs_.fasta
        samtools faidx HSV-Klinik_S2_contigs.fasta 'NODE_368_length_644_cov_489.846591' >> HSV-Klinik_S2_contigs_.fasta
        samtools faidx HSV-Klinik_S2_contigs.fasta 'NODE_373_length_653_cov_0.340304' >> HSV-Klinik_S2_contigs_.fasta
        samtools faidx HSV-Klinik_S2_contigs.fasta 'NODE_414_length_634_cov_2501.944773' >> HSV-Klinik_S2_contigs_.fasta
        samtools faidx HSV-Klinik_S2_contigs.fasta 'NODE_626_length_568_cov_1.630385' >> HSV-Klinik_S2_contigs_.fasta
        samtools faidx HSV-Klinik_S2_contigs.fasta 'NODE_1026_length_506_cov_2.593668' >> HSV-Klinik_S2_contigs_.fasta
        samtools faidx HSV-Klinik_S2_contigs.fasta 'CAP_7_length_1389' >> HSV-Klinik_S2_contigs_.fasta
        samtools faidx HSV-Klinik_S2_contigs.fasta 'CAP_8_length_1267' >> HSV-Klinik_S2_contigs_.fasta
        samtools faidx HSV-Klinik_S2_contigs.fasta 'CAP_9_length_1581' >> HSV-Klinik_S2_contigs_.fasta
        samtools faidx HSV-Klinik_S2_contigs.fasta 'CAP_18_length_896' >> HSV-Klinik_S2_contigs_.fasta
        samtools faidx HSV-Klinik_S2_contigs.fasta 'CAP_25_length_841' >> HSV-Klinik_S2_contigs_.fasta
        samtools faidx HSV-Klinik_S2_contigs.fasta 'CAP_52_length_1849' >> HSV-Klinik_S2_contigs_.fasta
        samtools faidx HSV-Klinik_S2_contigs.fasta 'CAP_53_length_665' >> HSV-Klinik_S2_contigs_.fasta
        samtools faidx HSV-Klinik_S2_contigs.fasta 'CAP_54_length_820' >> HSV-Klinik_S2_contigs_.fasta
        samtools faidx HSV-Klinik_S2_contigs.fasta 'CAP_56_length_1189' >> HSV-Klinik_S2_contigs_.fasta
    
        gi|1059802459|gb|LT594105.1|
        gi|2315197778|gb|OP297860.1|
        gi|2549840487|gb|OQ724911.1|
        gi|1059802767|gb|LT594109.1|
        gi|2620238293|gb|OR771685.1|
        gi|2620238293|gb|OR771685.1|
        gi|2315199769|gb|OP297886.1|
        gi|2549841171|gb|OQ724933.1|
        gi|2620238293|gb|OR771685.1|
        gi|1809626902|gb|MN925871.1|
        gi|2618798953|gb|OR723971.1|
        gi|2315197778|gb|OP297860.1|
        gi|2277963097|gb|ON960059.1|
        gi|2620238293|gb|OR771685.1|
        gi|2549599151|gb|OQ724836.1|
        gi|1717903527|gb|MN136523.1|
        gi|1059802459|gb|LT594105.1|
        gi|2549841171|gb|OQ724933.1|
        gi|2315197778|gb|OP297860.1|
        gi|2620238293|gb|OR771685.1|
        gi|2315197778|gb|OP297860.1|
        gi|1809626902|gb|MN925871.1|
    
        Human herpesvirus 1 isolate 172_2010 genome assembly
        Human alphaherpesvirus 1 strain HSV1-v60_d3_cu_gen_les
        Human alphaherpesvirus 1 strain BP-K12
        Human herpesvirus 1 isolate 270_2007 genome assembly
        Human alphaherpesvirus 1 isolate HSV1/USA/WA-UW-2L9/2020
        Human alphaherpesvirus 1 isolate HSV1/USA/WA-UW-2L9/2020
        Human alphaherpesvirus 1 strain HSV1-v72_d53_cu_gen_les
        Human alphaherpesvirus 1 strain BP-L2
        Human alphaherpesvirus 1 isolate HSV1/USA/WA-UW-2L9/2020
        UNVERIFIED: Human alphaherpesvirus 1 strain Sample4_DOCK8
        Mutant Human alphaherpesvirus 1 isolate dsncRNA12
        Human alphaherpesvirus 1 strain HSV1-v60_d3_cu_gen_les
        Human alphaherpesvirus 1 strain HSV1-San-Francisco-USA-1974-HTZ
        Human alphaherpesvirus 1 isolate HSV1/USA/WA-UW-2L9/2020
        UNVERIFIED: Human alphaherpesvirus 1 strain BP-C8
        Human alphaherpesvirus 1 strain MacIntyre
        Human herpesvirus 1 isolate 172_2010 genome assembly
        Human alphaherpesvirus 1 strain BP-L2
        Human alphaherpesvirus 1 strain HSV1-v60_d3_cu_gen_les
        Human alphaherpesvirus 1 isolate HSV1/USA/WA-UW-2L9/2020
        Human alphaherpesvirus 1 strain HSV1-v60_d3_cu_gen_les
        UNVERIFIED: Human alphaherpesvirus 1 strain Sample4_DOCK8
        Human alphaherpesvirus 1 strain HSV1-v67_d346_cu_gen_les
    
        #-->OR771685.1
    
        #8278 reads; of these:
        #  8278 (100.00%) were unpaired; of these:
        #    3775 (45.60%) aligned 0 times
        #    4500 (54.36%) aligned exactly 1 time
        #    3 (0.04%) aligned >1 times
        #54.40% overall alignment rate
    
    * Generate Coverage Profile for Reads (from FASTQ): Align the trimmed FASTQ reads to the reference genome. Using a mapper like BWA or Bowtie2 here was WRONG; we should use novoalign (the BWA commands are kept below, commented out).
        #bwa index refsel_db/refsel.fasta
        #bwa mem refsel_db/refsel.fasta trimmed/HSV1_S1_R1.fastq.gz trimmed/HSV1_S1_R2.fastq.gz > HSV1_S1_reads_aligned.sam
        #samtools view -Sb HSV1_S1_reads_aligned.sam | samtools sort -o HSV1_S1_reads_aligned_sorted.bam
        #samtools index HSV1_S1_reads_aligned_sorted.bam
        #bwa mem refsel_db/refsel.fasta trimmed/HSV-Klinik_S2_R1.fastq.gz trimmed/HSV-Klinik_S2_R2.fastq.gz > HSV-Klinik_S2_reads_aligned.sam
        #samtools view -Sb HSV-Klinik_S2_reads_aligned.sam | samtools sort -o HSV-Klinik_S2_reads_aligned_sorted.bam
        #samtools index HSV-Klinik_S2_reads_aligned_sorted.bam
    
        cd data
        mkdir 02_align_to_OP297860
        ../bin/read_utils.py align_and_fix 01_per_sample/HSV1_S1.cleaned.bam ../refsel_db/refsel.fasta --outBamAll 02_align_to_OP297860/HSV1_S1.bam --outBamFiltered 02_align_to_OP297860/HSV1_S1.mapped.bam --aligner novoalign --aligner_options '-r Random -l 20 -g 40 -x 20 -t 100 -k' --threads 120
        ../bin/read_utils.py align_and_fix 01_per_sample/HSV-Klinik_S2.cleaned.bam ../refsel_db/refsel.fasta --outBamAll 02_align_to_OP297860/HSV-Klinik_S2.bam --outBamFiltered 02_align_to_OP297860/HSV-Klinik_S2.mapped.bam --aligner novoalign --aligner_options '-r Random -l 20 -g 40 -x 20 -t 100 -k' --threads 120
        samtools sort 02_align_to_OP297860/HSV1_S1.mapped.bam -o HSV1_S1_reads_aligned_sorted.bam
        samtools index HSV1_S1_reads_aligned_sorted.bam
        samtools sort 02_align_to_OP297860/HSV-Klinik_S2.mapped.bam -o HSV-Klinik_S2_reads_aligned_sorted.bam
        samtools index HSV-Klinik_S2_reads_aligned_sorted.bam
        mv 02_align_to_OP297860/*.bam ..
        rmdir 02_align_to_OP297860
    
    * Generate Coverage Tracks: Use bamCoverage to generate coverage files (in bigWig format) for both the reads and contigs.
        #find . -name "*_aligned_sorted.bam"
        bamCoverage -b ./HSV1_S1_reads_aligned_sorted.bam -o HSV1_S1_reads_coverage.bw
        bamCoverage -b ./HSV1_S1_contigs_aligned_sorted.bam -o HSV1_S1_contigs_coverage.bw
        bamCoverage -b ./HSV-Klinik_S2_reads_aligned_sorted.bam -o HSV-Klinik_S2_reads_coverage.bw
        bamCoverage -b ./HSV-Klinik_S2_contigs_aligned_sorted.bam -o HSV-Klinik_S2_contigs_coverage.bw
    
    * Visualize Alignments: Use tools like IGV (Integrative Genomics Viewer)
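Besides viewing the coverage tracks in IGV, the coverage can be summarized numerically from `samtools depth` output, which is tab-separated with reference name, 1-based position, and depth. The sketch below assumes a single reference sequence and a reference length supplied by the caller. The function name `summarize_depth` and the example values are hypothetical.

```python
def summarize_depth(depth_text, ref_length):
    """Summarize `samtools depth` text for one reference of known length."""
    covered = 0
    total_depth = 0
    for line in depth_text.splitlines():
        if not line.strip():
            continue
        _ref, _pos, depth = line.split("\t")[:3]
        depth = int(depth)
        total_depth += depth
        if depth > 0:
            covered += 1
    # Positions missing from the output count as zero depth, because
    # `samtools depth` omits uncovered positions unless run with -a.
    return {
        "mean_depth": total_depth / ref_length,
        "breadth": covered / ref_length,
    }


# Made-up depth lines for a reference of length 10.
example = "ref\t1\t4\nref\t2\t6\nref\t5\t0\n"
summary = summarize_depth(example, ref_length=10)
print(summary)
```

Dividing by the reference length rather than by the number of reported lines keeps the mean comparable between samples with very different breadth of coverage.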
  16. Reproduce 03_multialign_to_ref by generating consensus fasta

    #bedtools bamtobed -i HSV-Klinik_S2_contigs_aligned_sorted.bam > contigs.bed
    bedtools bamtobed -i HSV1_S1_vrap_out_v5/bowtie/mapped_sorted.bam > contigs.bed
    bedtools merge -i contigs.bed > merged_contigs_coverage.bed
    awk '{sum += $3 - $2} END {print sum}' merged_contigs_coverage.bed
    #20916
    
    #generate alignment from contigs.bam and refsel.fasta
    bcftools mpileup -f refsel_db/refsel.fasta -d 1000000 HSV-Klinik_S2_contigs_aligned_sorted.bam | bcftools call -mv --ploidy 1 -Ov -o contigs_variants.vcf
    bgzip contigs_variants.vcf
    tabix -p vcf contigs_variants.vcf.gz
    cat refsel_db/refsel.fasta | bcftools consensus contigs_variants.vcf.gz > aligned_contigs_to_reference.fasta
    
    #Note: the --sample option not given, applying all records regardless of the genotype
    #Applied 30 variants
    cat refsel_db/refsel.fasta aligned_contigs_to_reference.fasta > aligned_1.fasta
    #Header of the 2nd record is >HSV-Klinik_S2-1
    mafft aligned_1.fasta | sed '/^>/! s/[a-z]/\U&/g' > data/03_multialign_to_ref/aligned_1.fasta
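
The `bedtools merge` plus `awk` step in this section reports how many reference bases the contigs cover. Below is a minimal Python sketch of the same arithmetic, assuming half-open BED intervals on a single chromosome. `merge_intervals` and `covered_bases` are hypothetical helper names, not part of bedtools, and the intervals are toy values rather than the content of contigs.bed.

```python
def merge_intervals(intervals):
    """Merge overlapping or book-ended (start, end) intervals (BED-style, half-open)."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous interval: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(iv) for iv in merged]


def covered_bases(intervals):
    """Total number of covered bases: sum of (end - start) after merging."""
    return sum(end - start for start, end in merge_intervals(intervals))


if __name__ == "__main__":
    # Toy intervals for illustration only.
    toy = [(0, 100), (50, 150), (150, 200), (300, 350)]
    print(merge_intervals(toy))
    print(covered_bases(toy))
```

Merging first matters because overlapping contigs would otherwise be counted twice in the sum.
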
  17. Reproduce 04_intrahost. #DEBUG_IMPORTANT_NOT_SAME_BETWEEN_VPHASER2_AND_FREEBAYES: why does the intrahost variant calling report no frequencies between 0.2 and 0.8? The list also differs completely from the freebayes results. Try different combinations of "--removeDoublyMappedReads --minReadsEach 5 --maxBias 0".

    awk '$6 >= 0.05' isnvs.annot.txt > isnvs.annot_.txt
    
    chr     pos     sample  patient time    alleles iSNV_freq       Hw      Hs      eff_type        eff_codon_dna   eff_aa  eff_aa_pos      eff_prot_len    eff_gene        eff_protein
    * OP297860        13203   HSV-Klinik_S2   HSV-Klinik_S2           T,C,A   1       0.0165025249227804      1       synonymous_variant,intragenic_variant   1614A>G,1614A>T,n.13203T>C,n.13203T>A   Val538Val       538     882     UL5     UXY89136.1,Gene_11440_14815
    * OP297860        47109   HSV-Klinik_S2   HSV-Klinik_S2           T,G     0.992975413948821       0.0139504824839776      1       missense_variant        1126A>C Asn376His       376     376     UL23    UXY89153.1
    OP297860        47989   HSV-Klinik_S2   HSV-Klinik_S2           G,T     0.0537303216974675      0.101686748455508       1       synonymous_variant      246C>A  Pro82Pro        82      376     UL23    UXY89153.1
    OP297860        55501   HSV-Klinik_S2   HSV-Klinik_S2           T,C,A   1       0.0204843614284831      1       synonymous_variant,intragenic_variant   720A>G,720A>T,n.55501T>C,n.55501T>A     Ala240Ala       240     904     UL27,UL28       UXY89158.1,Gene_53483_58584
    OP297860        55807   HSV-Klinik_S2   HSV-Klinik_S2           C,A     0.0622837370242215      0.116808946253038       1       missense_variant,intragenic_variant     414G>T,n.55807C>A       Glu138Asp       138     904     UL27,UL28       UXY89158.1,Gene_53483_58584
    * OP297860        65225   HSV-Klinik_S2   HSV-Klinik_S2           G,A     0.891530460624071       0.193407796807005       1       intragenic_variant      n.65225G>A                              UL30    Gene_63070_67475
    * OP297860        65402   HSV-Klinik_S2   HSV-Klinik_S2           C,A     0.102222222222222       0.183545679012346       1       intragenic_variant      n.65402C>A                              UL30    Gene_63070_67475
    OP297860        66570   HSV-Klinik_S2   HSV-Klinik_S2           G,T     0.0518433179723502      0.0983111767079359      1       intragenic_variant      n.66570G>T                              UL30    Gene_63070_67475
    OP297860        94750   HSV-Klinik_S2   HSV-Klinik_S2           G,T     0.0528511821974965      0.100115869475647       1       missense_variant        108G>T  Gln36His        36      488     UL42    UXY89171.1
    
    samtools faidx aligned_1.fasta "OP297860.1":13203-13203  #T
    samtools faidx aligned_1.fasta HSV-Klinik_S2-1:13203-13203  #T
    samtools faidx aligned_1.fasta "OP297860.1":47109-47109  #T
    samtools faidx aligned_1.fasta HSV-Klinik_S2-1:47109-47109  #T
    samtools faidx aligned_1.fasta "OP297860.1":47989-47989  #G
    samtools faidx aligned_1.fasta HSV-Klinik_S2-1:47989-47989  #G
    samtools faidx aligned_1.fasta "OP297860.1":65225-65225  #G
    samtools faidx aligned_1.fasta HSV-Klinik_S2-1:65225-65225  #A
    
    #DEBUG_IMPORTANT_NOT_SAME_BETWEEN_VPHASER2_AND_FREEBAYES: why does the intrahost variant calling report no variants with frequencies around 0.6 or 0.4?
    
    vim bin/tools/vphaser2.py  # set w=25000
    rm -rf data/04_intrahost
    snakemake --printshellcmds --cores 10
    
    samtools index data/02_align_to_self/HSV1_S1.mapped.bam
    samtools index data/02_align_to_self/HSV-Klinik_S2.mapped.bam
    bin/interhost.py multichr_mafft ref_genome/reference.fasta data/02_assembly/HSV-Klinik_S2.fasta data/03_multialign_to_ref --ep 0.123 --maxiters 1000 --preservecase --localpair --outFilePrefix aligned --sampleNameListFile data/03_multialign_to_ref/sampleNameList.txt --threads 120 --loglevel DEBUG
    
    #Interhost variant calling: the numbers below should not match the intrahost variant calling. (Intrahost variants are called from the isolate's reads against its own consensus assembly, which is why the frequency should theoretically stay under 0.5. In the intrahost output, however, REF refers to the base in OP297860.1, so an ALT allele can reach 90% in the clinical sample --> every position with a frequency > 0.5 means the consensus sequence differs from the CHROM. The frequencies range from 0.00000001 to 1.0, since positions with a frequency of 0.0 are not reported.)
    #The contigs contain many wrongly assembled positions, so far fewer of the following positions are actually interhost variants.
    
    samtools index HSV1_S1.mapped.bam
    samtools index HSV-Klinik_S2.mapped.bam
    bin/intrahost.py vphaser_one_sample data/02_align_to_self/HSV1_S1.mapped.bam data/02_assembly/HSV1_S1.fasta data/04_intrahost/vphaser2.HSV1_S1.txt.gz --vphaserNumThreads 120  --minReadsEach 5 --maxBias 0 --loglevel DEBUG
    awk '$7 >= 5' vphaser2.HSV-Klinik_S2_v2.txt > vphaser2.HSV-Klinik_S2_v2_.txt
    
    bin/intrahost.py vphaser_one_sample data/02_align_to_self/HSV-Klinik_S2.mapped.bam data/02_assembly/HSV-Klinik_S2.fasta data/04_intrahost/vphaser2.HSV-Klinik_S2.txt.gz --vphaserNumThreads 120 --removeDoublyMappedReads --minReadsEach 5 --maxBias 0 --loglevel DEBUG
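
The `awk '$6 >= 0.05'` filter in this step appears to work only because the `time` column is empty, so whitespace splitting shifts `iSNV_freq` into field 6. Selecting the column by its header name is less fragile. The sketch below assumes a tab-separated table with the header shown in this step. `filter_isnvs` is a hypothetical helper and the rows are toy values.

```python
import csv
import io


def filter_isnvs(lines, min_freq=0.05, column="iSNV_freq"):
    """Return (header, rows), keeping rows whose `column` value is >= min_freq."""
    reader = csv.DictReader(lines, delimiter="\t")
    kept = [row for row in reader if float(row[column]) >= min_freq]
    return reader.fieldnames, kept


if __name__ == "__main__":
    # Toy table with the first seven columns of isnvs.annot.txt; values are illustrative.
    toy = io.StringIO(
        "chr\tpos\tsample\tpatient\ttime\talleles\tiSNV_freq\n"
        "REF\t100\tS2\tS2\t\tT,G\t0.99\n"
        "REF\t200\tS2\tS2\t\tG,T\t0.01\n"
        "REF\t300\tS2\tS2\t\tC,A\t0.10\n"
    )
    header, rows = filter_isnvs(toy)
    for row in rows:
        print(row["pos"], row["iSNV_freq"])
```

For a real file, an open file handle can be passed in place of the `io.StringIO` object.
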
  18. Manually complete the assemblies with the reference genome, recreate 02_assembly, then rerun the pipeline steps after 02_align_to_self

    ~/Scripts/convert_fasta_to_clustal.py aligned_1.fasta_orig aligned_1.aln
    ~/Scripts/convert_clustal_to_clustal.py aligned_1.aln aligned_1_.aln
    #manually delete the positions that contain only '-' in aligned_1_.aln
    ~/Scripts/check_sequence_differences.py aligned_1_.aln
    
    #Differences found at the following positions (150):
    Position 8956: OP297860.1 = A, HSV1_S1-1 = A, HSV-Klinik_S2-1 = G
    Position 8991: OP297860.1 = A, HSV1_S1-1 = A, HSV-Klinik_S2-1 = C
    Position 8992: OP297860.1 = T, HSV1_S1-1 = C, HSV-Klinik_S2-1 = C
    Position 8995: OP297860.1 = T, HSV1_S1-1 = T, HSV-Klinik_S2-1 = C
    Position 9190: OP297860.1 = T, HSV1_S1-1 = A, HSV-Klinik_S2-1 = T
    Position 9294: OP297860.1 = A, HSV1_S1-1 = G, HSV-Klinik_S2-1 = A
    Position 9298: OP297860.1 = A, HSV1_S1-1 = G, HSV-Klinik_S2-1 = A
    Position 9319: OP297860.1 = A, HSV1_S1-1 = G, HSV-Klinik_S2-1 = A
    Position 9324: OP297860.1 = T, HSV1_S1-1 = G, HSV-Klinik_S2-1 = T
    Position 9352: OP297860.1 = C, HSV1_S1-1 = T, HSV-Klinik_S2-1 = C
    Position 9368: OP297860.1 = A, HSV1_S1-1 = G, HSV-Klinik_S2-1 = A
    Position 10036: OP297860.1 = G, HSV1_S1-1 = G, HSV-Klinik_S2-1 = T
    Position 12006: OP297860.1 = C, HSV1_S1-1 = T, HSV-Klinik_S2-1 = C
    Position 12131: OP297860.1 = C, HSV1_S1-1 = M, HSV-Klinik_S2-1 = C
    Position 12748: OP297860.1 = C, HSV1_S1-1 = C, HSV-Klinik_S2-1 = A
    Position 12753: OP297860.1 = A, HSV1_S1-1 = G, HSV-Klinik_S2-1 = A
    * Position 13203: OP297860.1 = T, HSV1_S1-1 = C, HSV-Klinik_S2-1 = C
    * Position 13522: OP297860.1 = G, HSV1_S1-1 = T, HSV-Klinik_S2-1 = G
    Position 13557: OP297860.1 = A, HSV1_S1-1 = G, HSV-Klinik_S2-1 = A
    Position 13637: OP297860.1 = A, HSV1_S1-1 = G, HSV-Klinik_S2-1 = A
    * Position 13659: OP297860.1 = G, HSV1_S1-1 = T, HSV-Klinik_S2-1 = G
    Position 13731: OP297860.1 = T, HSV1_S1-1 = C, HSV-Klinik_S2-1 = T
    Position 13755: OP297860.1 = G, HSV1_S1-1 = G, HSV-Klinik_S2-1 = A
    Position 13778: OP297860.1 = T, HSV1_S1-1 = C, HSV-Klinik_S2-1 = T
    Position 14835: OP297860.1 = A, HSV1_S1-1 = A, HSV-Klinik_S2-1 = G
    Position 34549: OP297860.1 = C, HSV1_S1-1 = C, HSV-Klinik_S2-1 = T
    Position 34705: OP297860.1 = G, HSV1_S1-1 = G, HSV-Klinik_S2-1 = A
    Position 41118: OP297860.1 = G, HSV1_S1-1 = G, HSV-Klinik_S2-1 = A
    Position 41422: OP297860.1 = G, HSV1_S1-1 = G, HSV-Klinik_S2-1 = A
    Position 44110: OP297860.1 = A, HSV1_S1-1 = A, HSV-Klinik_S2-1 = C
    Position 44137: OP297860.1 = T, HSV1_S1-1 = T, HSV-Klinik_S2-1 = C
    Position 44190: OP297860.1 = G, HSV1_S1-1 = G, HSV-Klinik_S2-1 = A
    Position 44227: OP297860.1 = C, HSV1_S1-1 = C, HSV-Klinik_S2-1 = G
    Position 44295: OP297860.1 = A, HSV1_S1-1 = A, HSV-Klinik_S2-1 = C
    Position 46861: OP297860.1 = T, HSV1_S1-1 = T, HSV-Klinik_S2-1 = C
    # Position 47109: OP297860.1 = T, HSV1_S1-1 = T, HSV-Klinik_S2-1 = G
    Position 47170: OP297860.1 = G, HSV1_S1-1 = T, HSV-Klinik_S2-1 = T
    Position 47182: OP297860.1 = G, HSV1_S1-1 = G, HSV-Klinik_S2-1 = A
    Position 47320: OP297860.1 = A, HSV1_S1-1 = A, HSV-Klinik_S2-1 = G
    Position 47375: OP297860.1 = G, HSV1_S1-1 = A, HSV-Klinik_S2-1 = G
    Position 47377: OP297860.1 = G, HSV1_S1-1 = G, HSV-Klinik_S2-1 = T
    Position 47393: OP297860.1 = T, HSV1_S1-1 = C, HSV-Klinik_S2-1 = C
    Position 47433: OP297860.1 = G, HSV1_S1-1 = G, HSV-Klinik_S2-1 = T
    Position 47436: OP297860.1 = C, HSV1_S1-1 = C, HSV-Klinik_S2-1 = A
    Position 47484: OP297860.1 = C, HSV1_S1-1 = C, HSV-Klinik_S2-1 = A
    Position 47516: OP297860.1 = T, HSV1_S1-1 = C, HSV-Klinik_S2-1 = C
    Position 47563: OP297860.1 = G, HSV1_S1-1 = G, HSV-Klinik_S2-1 = A
    Position 47660: OP297860.1 = G, HSV1_S1-1 = G, HSV-Klinik_S2-1 = A
    Position 47707: OP297860.1 = T, HSV1_S1-1 = T, HSV-Klinik_S2-1 = C
    Position 47722: OP297860.1 = A, HSV1_S1-1 = G, HSV-Klinik_S2-1 = G
    * Position 47969: OP297860.1 = C, HSV1_S1-1 = T, HSV-Klinik_S2-1 = C
    Position 48064: OP297860.1 = G, HSV1_S1-1 = A, HSV-Klinik_S2-1 = A
    Position 48113: OP297860.1 = C, HSV1_S1-1 = C, HSV-Klinik_S2-1 = T
    Position 48129: OP297860.1 = T, HSV1_S1-1 = C, HSV-Klinik_S2-1 = C
    Position 48167: OP297860.1 = T, HSV1_S1-1 = C, HSV-Klinik_S2-1 = C
    Position 48219: OP297860.1 = A, HSV1_S1-1 = A, HSV-Klinik_S2-1 = C
    Position 48255: OP297860.1 = G, HSV1_S1-1 = T, HSV-Klinik_S2-1 = G
    Position 48384: OP297860.1 = C, HSV1_S1-1 = G, HSV-Klinik_S2-1 = C
    Position 53216: OP297860.1 = G, HSV1_S1-1 = G, HSV-Klinik_S2-1 = T
    Position 53254: OP297860.1 = C, HSV1_S1-1 = G, HSV-Klinik_S2-1 = C
    Position 53265: OP297860.1 = G, HSV1_S1-1 = T, HSV-Klinik_S2-1 = G
    Position 53291: OP297860.1 = C, HSV1_S1-1 = T, HSV-Klinik_S2-1 = C
    Position 53298: OP297860.1 = A, HSV1_S1-1 = G, HSV-Klinik_S2-1 = G
    Position 53403: OP297860.1 = C, HSV1_S1-1 = T, HSV-Klinik_S2-1 = C
    Position 53423: OP297860.1 = C, HSV1_S1-1 = C, HSV-Klinik_S2-1 = T
    Position 53445: OP297860.1 = C, HSV1_S1-1 = A, HSV-Klinik_S2-1 = C
    Position 53450: OP297860.1 = C, HSV1_S1-1 = A, HSV-Klinik_S2-1 = C
    Position 53460: OP297860.1 = A, HSV1_S1-1 = G, HSV-Klinik_S2-1 = A
    Position 53659: OP297860.1 = A, HSV1_S1-1 = C, HSV-Klinik_S2-1 = A
    * Position 53691: OP297860.1 = G, HSV1_S1-1 = T, HSV-Klinik_S2-1 = G
    Position 54007: OP297860.1 = G, HSV1_S1-1 = G, HSV-Klinik_S2-1 = A
    Position 54013: OP297860.1 = G, HSV1_S1-1 = G, HSV-Klinik_S2-1 = A
    Position 54025: OP297860.1 = A, HSV1_S1-1 = A, HSV-Klinik_S2-1 = G
    Position 54073: OP297860.1 = G, HSV1_S1-1 = G, HSV-Klinik_S2-1 = A
    Position 54408: OP297860.1 = T, HSV1_S1-1 = T, HSV-Klinik_S2-1 = G
    Position 54568: OP297860.1 = G, HSV1_S1-1 = G, HSV-Klinik_S2-1 = T
    Position 54708: OP297860.1 = T, HSV1_S1-1 = T, HSV-Klinik_S2-1 = G
    Position 54709: OP297860.1 = G, HSV1_S1-1 = G, HSV-Klinik_S2-1 = A
    * Position 55501: OP297860.1 = T, HSV1_S1-1 = C, HSV-Klinik_S2-1 = C
    Position 55507: OP297860.1 = G, HSV1_S1-1 = A, HSV-Klinik_S2-1 = G
    Position 55543: OP297860.1 = G, HSV1_S1-1 = A, HSV-Klinik_S2-1 = G
    Position 56493: OP297860.1 = A, HSV1_S1-1 = A, HSV-Klinik_S2-1 = G
    Position 56753: OP297860.1 = G, HSV1_S1-1 = T, HSV-Klinik_S2-1 = G
    Position 56981: OP297860.1 = A, HSV1_S1-1 = A, HSV-Klinik_S2-1 = G
    Position 58075: OP297860.1 = A, HSV1_S1-1 = A, HSV-Klinik_S2-1 = G
    Position 58078: OP297860.1 = C, HSV1_S1-1 = C, HSV-Klinik_S2-1 = T
    Position 58526: OP297860.1 = T, HSV1_S1-1 = C, HSV-Klinik_S2-1 = C
    Position 58550: OP297860.1 = C, HSV1_S1-1 = C, HSV-Klinik_S2-1 = A
    Position 58604: OP297860.1 = C, HSV1_S1-1 = C, HSV-Klinik_S2-1 = A
    Position 58615: OP297860.1 = C, HSV1_S1-1 = C, HSV-Klinik_S2-1 = A
    Position 58789: OP297860.1 = C, HSV1_S1-1 = C, HSV-Klinik_S2-1 = A
    * Position 63248: OP297860.1 = G, HSV1_S1-1 = T, HSV-Klinik_S2-1 = G
    Position 63799: OP297860.1 = T, HSV1_S1-1 = C, HSV-Klinik_S2-1 = T
    * Position 64328: OP297860.1 = C, HSV1_S1-1 = A, HSV-Klinik_S2-1 = C
    Position 65179: OP297860.1 = T, HSV1_S1-1 = T, HSV-Klinik_S2-1 = C
    * Position 65225: OP297860.1 = G, HSV1_S1-1 = G, HSV-Klinik_S2-1 = A
    Position 65992: OP297860.1 = T, HSV1_S1-1 = C, HSV-Klinik_S2-1 = C
    Position 66677: OP297860.1 = G, HSV1_S1-1 = G, HSV-Klinik_S2-1 = A
    Position 67336: OP297860.1 = G, HSV1_S1-1 = G, HSV-Klinik_S2-1 = A
    Position 87848: OP297860.1 = T, HSV1_S1-1 = T, HSV-Klinik_S2-1 = C
    Position 87866: OP297860.1 = C, HSV1_S1-1 = C, HSV-Klinik_S2-1 = G
    Position 87942: OP297860.1 = C, HSV1_S1-1 = C, HSV-Klinik_S2-1 = G
    Position 87949: OP297860.1 = A, HSV1_S1-1 = A, HSV-Klinik_S2-1 = G
    * Position 95302: OP297860.1 = C, HSV1_S1-1 = A, HSV-Klinik_S2-1 = C
    Position 95320: OP297860.1 = G, HSV1_S1-1 = K, HSV-Klinik_S2-1 = G
    Position 95992: OP297860.1 = G, HSV1_S1-1 = A, HSV-Klinik_S2-1 = G
    Position 96124: OP297860.1 = G, HSV1_S1-1 = A, HSV-Klinik_S2-1 = G
    Position 96138: OP297860.1 = T, HSV1_S1-1 = C, HSV-Klinik_S2-1 = T
    Position 96145: OP297860.1 = C, HSV1_S1-1 = A, HSV-Klinik_S2-1 = C
    Position 100159: OP297860.1 = T, HSV1_S1-1 = T, HSV-Klinik_S2-1 = C
    Position 107885: OP297860.1 = C, HSV1_S1-1 = C, HSV-Klinik_S2-1 = G
    Position 114972: OP297860.1 = G, HSV1_S1-1 = G, HSV-Klinik_S2-1 = T
    Position 117663: OP297860.1 = T, HSV1_S1-1 = T, HSV-Klinik_S2-1 = C
    Position 117802: OP297860.1 = T, HSV1_S1-1 = T, HSV-Klinik_S2-1 = A
    Position 117834: OP297860.1 = C, HSV1_S1-1 = C, HSV-Klinik_S2-1 = A
    Position 117841: OP297860.1 = A, HSV1_S1-1 = A, HSV-Klinik_S2-1 = G
    Position 118616: OP297860.1 = T, HSV1_S1-1 = T, HSV-Klinik_S2-1 = G
    Position 119486: OP297860.1 = A, HSV1_S1-1 = A, HSV-Klinik_S2-1 = G
    Position 119519: OP297860.1 = A, HSV1_S1-1 = A, HSV-Klinik_S2-1 = G
    Position 120688: OP297860.1 = T, HSV1_S1-1 = T, HSV-Klinik_S2-1 = C
    Position 120690: OP297860.1 = G, HSV1_S1-1 = G, HSV-Klinik_S2-1 = T
    Position 120711: OP297860.1 = T, HSV1_S1-1 = T, HSV-Klinik_S2-1 = C
    Position 120714: OP297860.1 = T, HSV1_S1-1 = T, HSV-Klinik_S2-1 = C
    Position 133842: OP297860.1 = A, HSV1_S1-1 = A, HSV-Klinik_S2-1 = G
    Position 133894: OP297860.1 = G, HSV1_S1-1 = G, HSV-Klinik_S2-1 = T
    Position 134778: OP297860.1 = C, HSV1_S1-1 = C, HSV-Klinik_S2-1 = T
    Position 134788: OP297860.1 = C, HSV1_S1-1 = C, HSV-Klinik_S2-1 = A
    Position 134867: OP297860.1 = T, HSV1_S1-1 = T, HSV-Klinik_S2-1 = G
    Position 134895: OP297860.1 = A, HSV1_S1-1 = A, HSV-Klinik_S2-1 = G
    Position 134898: OP297860.1 = T, HSV1_S1-1 = T, HSV-Klinik_S2-1 = C
    Position 134942: OP297860.1 = T, HSV1_S1-1 = T, HSV-Klinik_S2-1 = G
    Position 136436: OP297860.1 = C, HSV1_S1-1 = C, HSV-Klinik_S2-1 = A
    Position 136900: OP297860.1 = G, HSV1_S1-1 = G, HSV-Klinik_S2-1 = A
    Position 137047: OP297860.1 = T, HSV1_S1-1 = T, HSV-Klinik_S2-1 = C
    Position 137155: OP297860.1 = C, HSV1_S1-1 = C, HSV-Klinik_S2-1 = G
    Position 137527: OP297860.1 = A, HSV1_S1-1 = A, HSV-Klinik_S2-1 = T
    Position 137569: OP297860.1 = C, HSV1_S1-1 = C, HSV-Klinik_S2-1 = T
    Position 137602: OP297860.1 = C, HSV1_S1-1 = C, HSV-Klinik_S2-1 = A
    Position 137944: OP297860.1 = T, HSV1_S1-1 = A, HSV-Klinik_S2-1 = T
    Position 138170: OP297860.1 = T, HSV1_S1-1 = C, HSV-Klinik_S2-1 = C
    Position 138343: OP297860.1 = C, HSV1_S1-1 = C, HSV-Klinik_S2-1 = T
    Position 138880: OP297860.1 = A, HSV1_S1-1 = A, HSV-Klinik_S2-1 = G
    Position 139104: OP297860.1 = T, HSV1_S1-1 = T, HSV-Klinik_S2-1 = C
    Position 140457: OP297860.1 = C, HSV1_S1-1 = C, HSV-Klinik_S2-1 = M
    Position 141865: OP297860.1 = A, HSV1_S1-1 = A, HSV-Klinik_S2-1 = G
    Position 141889: OP297860.1 = C, HSV1_S1-1 = C, HSV-Klinik_S2-1 = A
    Position 141937: OP297860.1 = G, HSV1_S1-1 = G, HSV-Klinik_S2-1 = C
    Position 142056: OP297860.1 = C, HSV1_S1-1 = C, HSV-Klinik_S2-1 = G
    Position 144444: OP297860.1 = C, HSV1_S1-1 = C, HSV-Klinik_S2-1 = T
    
    ~/Scripts/convert_clustal_to_fasta.py aligned_1_.aln aligned_1.fasta
    samtools faidx aligned_1.fasta
    samtools faidx aligned_1.fasta OP297860.1 > OP297860.1.fasta
    samtools faidx aligned_1.fasta HSV1_S1-1 > HSV1_S1-1.fasta
    samtools faidx aligned_1.fasta HSV-Klinik_S2-1 > HSV-Klinik_S2-1.fasta
    seqkit seq OP297860.1.fasta -w 70 > OP297860.1_w70.fasta
    diff OP297860.1_w70.fasta ../../refsel_db/refsel.fasta
    #< >OP297860.1
    #---
    #> >OP297860.1 Human alphaherpesvirus 1 strain HSV1-v60_d3_cu_gen_les, complete genome
    #2180c2180,2181
    #< ACGGGCCCCCCCCCGAAACACACCCCCCGGGGGTCGCGCGCGGCCCTT
    #---
    #> ACGGGCCCCCCCCCGAAACACACCCCCCGGGGGTCGCGCGCGGCCCTTTAAAAAGGCGGGGCGGGT
    mv 02_assembly 02_assembly_v1
    mv 02_align_to_self 02_align_to_self_v1
    mv 03_multialign_to_ref/ 03_multialign_to_ref_v1
    mv 04_intrahost 04_intrahost_v1
    mkdir 02_assembly
    cp 03_multialign_to_ref_v1/HSV1_S1-1.fasta 02_assembly/HSV1_S1.fasta
    cp 03_multialign_to_ref_v1/HSV-Klinik_S2-1.fasta 02_assembly/HSV-Klinik_S2.fasta
    
    samtools faidx HSV1_S1.fasta
    picard CreateSequenceDictionary R=HSV1_S1.fasta O=HSV1_S1.dict
    ~/Tools/novocraft_v3/novoindex HSV1_S1.nix HSV1_S1.fasta
    samtools faidx HSV-Klinik_S2.fasta
    picard CreateSequenceDictionary R=HSV-Klinik_S2.fasta O=HSV-Klinik_S2.dict
    ~/Tools/novocraft_v3/novoindex HSV-Klinik_S2.nix HSV-Klinik_S2.fasta
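
The position list in this step comes from comparing the aligned sequences column by column. `check_sequence_differences.py` itself is not listed in these notes, so the following is only a guess at the idea, with a hypothetical function name and a toy alignment instead of aligned_1_.aln.

```python
def find_differences(aligned):
    """Yield (position, {name: base}) for alignment columns where sequences disagree.

    `aligned` maps sequence names to equal-length aligned strings.
    Positions are 1-based alignment columns.
    """
    names = list(aligned)
    lengths = {len(seq) for seq in aligned.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    for i in range(lengths.pop()):
        column = {name: aligned[name][i].upper() for name in names}
        if len(set(column.values())) > 1:
            yield i + 1, column


if __name__ == "__main__":
    # Toy alignment for illustration only.
    toy = {"ref": "ACGTA", "s1": "ACGTA", "s2": "ATGTC"}
    for pos, column in find_differences(toy):
        bases = ", ".join(f"{name} = {base}" for name, base in column.items())
        print(f"Position {pos}: {bases}")
```

Note that the reported numbers are alignment columns. They only equal reference coordinates when the reference row contains no gaps before that column.
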
  19. If mapped.bam contains too few reads, we can manually rerun the following steps with a custom BAM file, for example the cleaned.bam or taxfilt.bam files (see point 1).

    # -- Adjust Novoalign parameter to increase the mapped reads in 02_align_to_self --
    
    If you are working with NovoAlign for virus variant calling and find that very few reads are retained, you can adjust certain parameters to increase the read count while still maintaining high mapping quality. Here are some suggestions for tuning the parameters in NovoAlign:
    
        Adjust the Alignment Score Threshold (-t):
            Current Setting: -t 100
            Suggestion: Compare the mapped-read counts at -t 90 or -t 80 with the counts at -t 100 and at a higher value.
            Explanation: According to the NovoAlign documentation, -t sets the maximum alignment score accepted for the best alignment, and the score is a penalty, so lower is better. Lowering -t therefore makes the filter stricter, and a higher value admits more alignments. Check which direction actually changes read retention before settling on a value.
    
        Adjust the Gap Opening Penalty (-g):
            Current Setting: -g 40
            Suggestion: Try using a slightly lower gap opening penalty, such as -g 20 or -g 30.
            Explanation: Lowering the gap opening penalty allows reads with minor gaps to align more easily, which may be beneficial for viral genomes with regions that might induce small indels. This adjustment should increase read retention without sacrificing too much mapping quality.
    
        Lower the Gap Extension Penalty (-x):
            Current Setting: -x 20
            Suggestion: Try reducing this to -x 15 or -x 10.
            Explanation: In NovoAlign, -x is the gap extension penalty, not a mismatch penalty (mismatch costs come from the base qualities). A lower value makes longer indels cheaper, so more reads that span insertions or deletions can map. For viral genomes, this can be helpful because some variability is expected, especially in variant-calling workflows.
    
        Experiment with the Repeat Reporting Option (-r):
            Current Setting: -r Random
            Suggestion: If applicable, try another strategy such as -r All, or -r None to drop multi-mapping reads, and compare the read counts and mapping quality.
            Explanation: This option controls how NovoAlign reports reads that align equally well to multiple sites. With Random, one of the candidate locations is chosen at random. Changing the strategy may affect read retention, especially if many reads align equally well to multiple sites.
    
        Adjust the Minimum Number of Good-Quality Bases (-l):
            Current Setting: -l 20
            Suggestion: Compare -l 20 with -l 30 or -l 40.
            Explanation: In NovoAlign, -l sets the minimum number of good-quality bases a read needs in order to be aligned. It is not a soft-clipping parameter. Raising it filters out more short or low-quality reads, which tends to improve alignment quality but reduces the number of retained reads, so keep the lower value if read count is the priority.
    
        Consider Using Paired-End Data:
            Suggestion: If you have paired-end data available, align the reads as paired rather than single-ended.
            Explanation: Paired-end alignment can improve mapping quality and retention since the additional read information helps resolve ambiguous mappings. It also increases the reliability of alignments by adding context from both read ends.
    
    Sample Adjusted Command
    
    Here’s an example of a modified NovoAlign command incorporating the above suggestions:
    
    novoalign -d reference.nix -f reads.fq -o SAM -r Random -l 40 -g 30 -x 15 -t 80 -k > output.sam
    
    Additional Steps for High-Quality Variant Calling:
    
        Use a Quality Filter Post-Alignment: After alignment, apply a quality filter on the mapped reads using a tool like Samtools to ensure that only high-confidence alignments are used for variant calling.
        Optimize Read Processing: Trim low-quality bases and remove adapters prior to alignment to ensure that only high-quality reads are used. This will increase both the retention rate and the quality of the alignments.
        Review Variant Calling Parameters: After alignment, check the variant-calling parameters to ensure they are suitable for low-complexity viral genomes and that high-quality mapping is prioritized.
    
    These adjustments should help you retain more reads while still maintaining alignment quality suitable for variant calling. Adjust these parameters iteratively, evaluating the resulting alignments and variant calls to find the best balance between read count and quality.
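
When several NovoAlign settings are compared, it helps to derive the option string and the output label from one place. This small sketch assumes the label convention used in these notes (for example l20_g40_x20_t100). `novoalign_options` and `run_label` are hypothetical helpers that only build strings and do not call NovoAlign.

```python
def novoalign_options(l, g, x, t, repeats="Random", calibrate=True):
    """Build the option string passed via --aligner_options."""
    parts = ["-r", repeats, "-l", str(l), "-g", str(g), "-x", str(x), "-t", str(t)]
    if calibrate:
        parts.append("-k")
    return " ".join(parts)


def run_label(l, g, x, t):
    """Build a directory label such as l20_g40_x20_t100."""
    return f"l{l}_g{g}_x{x}_t{t}"


if __name__ == "__main__":
    # The two settings discussed in these notes.
    for params in [(20, 40, 20, 100), (40, 30, 15, 80)]:
        print(run_label(*params), "->", novoalign_options(*params))
```

Keeping both strings in one helper avoids a result directory whose name no longer matches the options that produced it.
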
    
    # (TODO) check which configuration gives results similar to freebayes
    vim bin/tools/vphaser2.py  # edit '-w 22000'
    mkdir 04_intrahost
    #[E::idx_find_and_load] Could not retrieve index file for 'data/02_align_to_self/HSV-Klinik_S2.mapped.bam'
    #[E::idx_find_and_load] Could not retrieve index file for 'data/02_align_to_self/HSV-Klinik_S2.mapped.bam'
    samtools index data/02_align_to_self/HSV1_S1.mapped.bam
    samtools index data/02_align_to_self/HSV-Klinik_S2.mapped.bam
    
    bin/intrahost.py vphaser_one_sample data/02_align_to_self/HSV-Klinik_S2.mapped.bam data/02_assembly/HSV-Klinik_S2.fasta data/04_intrahost/vphaser2.HSV-Klinik_S2_removeDoubly_min5_max1000000_w22000.txt.gz --vphaserNumThreads 120 --removeDoublyMappedReads --minReadsEach 5 --maxBias 1000000 --loglevel DEBUG
    
    bin/intrahost.py merge_to_vcf ref_genome/reference.fasta data/04_intrahost/isnvs.vcf.gz --samples HSV-Klinik_S2 --isnvs data/04_intrahost/vphaser2.HSV-Klinik_S2_removeDoubly_min5_max1000000_w22000.txt --alignments data/03_multialign_to_ref/aligned_1.fasta --strip_chr_version --parse_accession
    
    #---- (Maybe next time; not necessary this time): run once with l20_g40_x20_t100 and once with l40_g30_x15_t80, which are the novoalign options in config.yaml. Note that 02_align_to_self must be rerun for each setting.
    # -- 04_intrahost_--removeDoublyMappedReads_--minReadsEach5_--maxBias10 --
    mkdir data/04_intrahost
    bin/intrahost.py vphaser_one_sample data/02_align_to_self/HSV1_S1.mapped.bam data/02_assembly/HSV1_S1.fasta data/04_intrahost/vphaser2.HSV1_S1.txt.gz --vphaserNumThreads 120 --removeDoublyMappedReads --minReadsEach 5 --maxBias 10
    bin/intrahost.py vphaser_one_sample data/02_align_to_self/HSV-Klinik_S2.mapped.bam data/02_assembly/HSV-Klinik_S2.fasta data/04_intrahost/vphaser2.HSV-Klinik_S2.txt.gz --vphaserNumThreads 120 --removeDoublyMappedReads --minReadsEach 5 --maxBias 10
    #bin/read_utils.py bwamem_idxstats data/01_cleaned/HSV-Klinik_S2.cleaned.bam /home/jhuang/REFs/viral_ngs_dbs/spikeins/ercc_spike-ins.fasta --outStats reports/spike_count/HSV-Klinik_S2.spike_count.txt --minScoreToFilter 60
    bin/intrahost.py merge_to_vcf ref_genome/reference.fasta data/04_intrahost/isnvs.vcf.gz --samples HSV1_S1 HSV-Klinik_S2 --isnvs data/04_intrahost/vphaser2.HSV1_S1.txt.gz data/04_intrahost/vphaser2.HSV-Klinik_S2.txt.gz --alignments data/03_multialign_to_ref/aligned_1.fasta --strip_chr_version --parse_accession --loglevel=DEBUG
    bin/interhost.py snpEff data/04_intrahost/isnvs.vcf.gz OP297860.1 data/04_intrahost/isnvs.annot.vcf.gz j.huang@uke.de
    bin/intrahost.py iSNV_table data/04_intrahost/isnvs.annot.vcf.gz data/04_intrahost/isnvs.annot.txt.gz
    
    mv data/04_intrahost data/04_intrahost_l20_g40_x20_t100_removeDoublyMappedReads_minReadsEach5_maxBias10
    cd data/04_intrahost_l20_g40_x20_t100_removeDoublyMappedReads_minReadsEach5_maxBias10
    gunzip isnvs.annot.txt.gz
    ~/Scripts/filter_isnv.py isnvs.annot.txt 0.05
    cut -d$'\t' filtered_isnvs.annot.txt -f1-7
    chr     pos     sample  patient time    alleles iSNV_freq
    OP297860        13203   HSV1_S1 HSV1_S1         T,C,A   1.0
    OP297860        13203   HSV-Klinik_S2   HSV-Klinik_S2           T,C,A   1.0
    OP297860        13522   HSV1_S1 HSV1_S1         G,T     1.0
    OP297860        13522   HSV-Klinik_S2   HSV-Klinik_S2           G,T     0.008905554253573941
    OP297860        13659   HSV1_S1 HSV1_S1         G,T     1.0
    OP297860        13659   HSV-Klinik_S2   HSV-Klinik_S2           G,T     0.008383233532934131
    OP297860        47109   HSV1_S1 HSV1_S1         T,G     0.0
    OP297860        47109   HSV-Klinik_S2   HSV-Klinik_S2           T,G     0.9929754139488208
    OP297860        47969   HSV1_S1 HSV1_S1         C,T,A   1.0
    OP297860        47969   HSV-Klinik_S2   HSV-Klinik_S2           C,T,A   0.017707985299031073
    OP297860        47989   HSV1_S1 HSV1_S1         G,T     0.0
    OP297860        47989   HSV-Klinik_S2   HSV-Klinik_S2           G,T     0.053730321697467484
    OP297860        53691   HSV1_S1 HSV1_S1         G,T     1.0
    OP297860        53691   HSV-Klinik_S2   HSV-Klinik_S2           G,T     0.023529411764705882
    OP297860        55501   HSV1_S1 HSV1_S1         T,C,A   1.0
    OP297860        55501   HSV-Klinik_S2   HSV-Klinik_S2           T,C,A   1.0
    OP297860        55807   HSV1_S1 HSV1_S1         C,A     0.0
    OP297860        55807   HSV-Klinik_S2   HSV-Klinik_S2           C,A     0.062176165803108814
    OP297860        63248   HSV1_S1 HSV1_S1         G,T     1.0
    OP297860        63248   HSV-Klinik_S2   HSV-Klinik_S2           G,T     0.016983016983016984
    OP297860        64328   HSV1_S1 HSV1_S1         C,A     1.0
    OP297860        64328   HSV-Klinik_S2   HSV-Klinik_S2           C,A     0.008469449485783423
    OP297860        65225   HSV1_S1 HSV1_S1         G,A     0.0
    OP297860        65225   HSV-Klinik_S2   HSV-Klinik_S2           G,A     0.8915304606240714
    OP297860        65402   HSV1_S1 HSV1_S1         C,A     0.0
    OP297860        65402   HSV-Klinik_S2   HSV-Klinik_S2           C,A     0.10222222222222224
    OP297860        66570   HSV1_S1 HSV1_S1         G,T     0.0
    OP297860        66570   HSV-Klinik_S2   HSV-Klinik_S2           G,T     0.05144291091593475
    OP297860        94750   HSV1_S1 HSV1_S1         G,T     0.0
    OP297860        94750   HSV-Klinik_S2   HSV-Klinik_S2           G,T     0.052851182197496516
    OP297860        95302   HSV1_S1 HSV1_S1         C,A     1.0
    OP297860        95302   HSV-Klinik_S2   HSV-Klinik_S2           C,A     0.01276595744680851
    
    #mv data/04_intrahost data/04_intrahost_l40_g30_x15_t80_removeDoublyMappedReads_minReadsEach5_maxBias10
    #cd data/04_intrahost_l40_g30_x15_t80_removeDoublyMappedReads_minReadsEach5_maxBias10
    #gunzip isnvs.annot.txt.gz
    #Keep groups where at least one record has iSNV_freq >= 0.05
    #~/Scripts/filter_isnv.py isnvs.annot.txt 0.05
    #cut -d$'\t' filtered_isnvs.annot.txt -f1-7
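
The filter rule noted in these commands keeps every position group in which at least one sample reaches the threshold, which is why rows with very low frequencies still appear next to a 1.0 row in the table. `filter_isnv.py` is not listed in these notes, so this is a hypothetical re-implementation of that rule, grouping by `chr` and `pos` and using toy rows.

```python
import csv
import io
from collections import defaultdict


def filter_groups(lines, min_freq=0.05):
    """Keep all rows of a (chr, pos) group if any row in it has iSNV_freq >= min_freq."""
    reader = csv.DictReader(lines, delimiter="\t")
    groups = defaultdict(list)
    for row in reader:
        groups[(row["chr"], row["pos"])].append(row)
    kept = []
    for rows in groups.values():  # dicts preserve insertion order
        if any(float(r["iSNV_freq"]) >= min_freq for r in rows):
            kept.extend(rows)
    return kept


if __name__ == "__main__":
    # Toy rows for illustration only.
    toy = io.StringIO(
        "chr\tpos\tsample\tiSNV_freq\n"
        "REF\t10\tS1\t1.0\n"
        "REF\t10\tS2\t0.008\n"
        "REF\t20\tS1\t0.0\n"
        "REF\t20\tS2\t0.01\n"
    )
    for row in filter_groups(toy):
        print(row["pos"], row["sample"], row["iSNV_freq"])
```
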
    
    # -- 04_intrahost_--minReadsEach5_--maxBias10 --
    mkdir data/04_intrahost
    bin/intrahost.py vphaser_one_sample data/02_align_to_self/HSV1_S1.mapped.bam data/02_assembly/HSV1_S1.fasta data/04_intrahost/vphaser2.HSV1_S1.txt.gz --vphaserNumThreads 120 --minReadsEach 5 --maxBias 10
    bin/intrahost.py vphaser_one_sample data/02_align_to_self/HSV-Klinik_S2.mapped.bam data/02_assembly/HSV-Klinik_S2.fasta data/04_intrahost/vphaser2.HSV-Klinik_S2.txt.gz --vphaserNumThreads 120 --minReadsEach 5 --maxBias 10
    #bin/read_utils.py bwamem_idxstats data/01_cleaned/HSV-Klinik_S2.cleaned.bam /home/jhuang/REFs/viral_ngs_dbs/spikeins/ercc_spike-ins.fasta --outStats reports/spike_count/HSV-Klinik_S2.spike_count.txt --minScoreToFilter 60
    bin/intrahost.py merge_to_vcf ref_genome/reference.fasta data/04_intrahost/isnvs.vcf.gz --samples HSV1_S1 HSV-Klinik_S2 --isnvs data/04_intrahost/vphaser2.HSV1_S1.txt.gz data/04_intrahost/vphaser2.HSV-Klinik_S2.txt.gz --alignments data/03_multialign_to_ref/aligned_1.fasta --strip_chr_version --parse_accession --loglevel=DEBUG
    bin/interhost.py snpEff data/04_intrahost/isnvs.vcf.gz OP297860.1 data/04_intrahost/isnvs.annot.vcf.gz j.huang@uke.de
    bin/intrahost.py iSNV_table data/04_intrahost/isnvs.annot.vcf.gz data/04_intrahost/isnvs.annot.txt.gz
    
    mv data/04_intrahost data/04_intrahost_l20_g40_x20_t100_minReadsEach5_maxBias10
    cd data/04_intrahost_l20_g40_x20_t100_minReadsEach5_maxBias10
    gunzip isnvs.annot.txt.gz
    ~/Scripts/filter_isnv.py isnvs.annot.txt 0.05
    cut -d$'\t' filtered_isnvs.annot.txt -f1-7
    
    chr     pos     sample  patient time    alleles iSNV_freq
    OP297860        13203   HSV1_S1 HSV1_S1         T,C,A   1.0
    OP297860        13203   HSV-Klinik_S2   HSV-Klinik_S2           T,C,A   1.0
    OP297860        13522   HSV1_S1 HSV1_S1         G,T     1.0
    OP297860        13522   HSV-Klinik_S2   HSV-Klinik_S2           G,T     0.008888888888888889
    OP297860        13659   HSV1_S1 HSV1_S1         G,T     1.0
    OP297860        13659   HSV-Klinik_S2   HSV-Klinik_S2           G,T     0.008359207069500836
    OP297860        47109   HSV1_S1 HSV1_S1         T,G     0.0
    OP297860        47109   HSV-Klinik_S2   HSV-Klinik_S2           T,G     0.9930174563591022
    OP297860        47969   HSV1_S1 HSV1_S1         C,T,A   1.0
    OP297860        47969   HSV-Klinik_S2   HSV-Klinik_S2           C,T,A   0.01828457446808511
    OP297860        47989   HSV1_S1 HSV1_S1         G,T     0.0
    OP297860        47989   HSV-Klinik_S2   HSV-Klinik_S2           G,T     0.053474114441416885
    OP297860        53691   HSV1_S1 HSV1_S1         G,T     1.0
    OP297860        53691   HSV-Klinik_S2   HSV-Klinik_S2           G,T     0.02342786683107275
    OP297860        55501   HSV1_S1 HSV1_S1         T,C,A   1.0
    OP297860        55501   HSV-Klinik_S2   HSV-Klinik_S2           T,C,A   1.0
    OP297860        55807   HSV1_S1 HSV1_S1         C,A     0.0
    OP297860        55807   HSV-Klinik_S2   HSV-Klinik_S2           C,A     0.061538461538461535
    OP297860        63248   HSV1_S1 HSV1_S1         G,T     1.0
    OP297860        63248   HSV-Klinik_S2   HSV-Klinik_S2           G,T     0.016815034619188922
    OP297860        64328   HSV1_S1 HSV1_S1         C,A     1.0
    OP297860        64328   HSV-Klinik_S2   HSV-Klinik_S2           C,A     0.008433734939759036
    OP297860        65225   HSV1_S1 HSV1_S1         G,A     0.0
    OP297860        65225   HSV-Klinik_S2   HSV-Klinik_S2           G,A     0.8916728076639646
    OP297860        65402   HSV1_S1 HSV1_S1         C,A     0.0
    OP297860        65402   HSV-Klinik_S2   HSV-Klinik_S2           C,A     0.1018149623727313
    OP297860        66570   HSV1_S1 HSV1_S1         G,T     0.0
    OP297860        66570   HSV-Klinik_S2   HSV-Klinik_S2           G,T     0.05112219451371571
    OP297860        94750   HSV1_S1 HSV1_S1         G,T     0.0
    OP297860        94750   HSV-Klinik_S2   HSV-Klinik_S2           G,T     0.052851182197496516
    OP297860        95302   HSV1_S1 HSV1_S1         C,A     1.0
    OP297860        95302   HSV-Klinik_S2   HSV-Klinik_S2           C,A     0.012725344644750796
    
    # -- 04_intrahost_--minReadsEach5_--maxBias1000000 --
    mkdir data/04_intrahost
    bin/intrahost.py vphaser_one_sample data/02_align_to_self/HSV1_S1.mapped.bam data/02_assembly/HSV1_S1.fasta data/04_intrahost/vphaser2.HSV1_S1.txt.gz --vphaserNumThreads 120 --minReadsEach 5 --maxBias 1000000
    bin/intrahost.py vphaser_one_sample data/02_align_to_self/HSV-Klinik_S2.mapped.bam data/02_assembly/HSV-Klinik_S2.fasta data/04_intrahost/vphaser2.HSV-Klinik_S2.txt.gz --vphaserNumThreads 120 --minReadsEach 5 --maxBias 1000000
    #bin/read_utils.py bwamem_idxstats data/01_cleaned/HSV-Klinik_S2.cleaned.bam /home/jhuang/REFs/viral_ngs_dbs/spikeins/ercc_spike-ins.fasta --outStats reports/spike_count/HSV-Klinik_S2.spike_count.txt --minScoreToFilter 60
    bin/intrahost.py merge_to_vcf ref_genome/reference.fasta data/04_intrahost/isnvs.vcf.gz --samples HSV1_S1 HSV-Klinik_S2 --isnvs data/04_intrahost/vphaser2.HSV1_S1.txt.gz data/04_intrahost/vphaser2.HSV-Klinik_S2.txt.gz --alignments data/03_multialign_to_ref/aligned_1.fasta --strip_chr_version --parse_accession --loglevel=DEBUG
    bin/interhost.py snpEff data/04_intrahost/isnvs.vcf.gz OP297860.1 data/04_intrahost/isnvs.annot.vcf.gz j.huang@uke.de
    bin/intrahost.py iSNV_table data/04_intrahost/isnvs.annot.vcf.gz data/04_intrahost/isnvs.annot.txt.gz
    
    mv data/04_intrahost data/04_intrahost_l20_g40_x20_t100_minReadsEach5_maxBias1000000
    cd data/04_intrahost_l20_g40_x20_t100_minReadsEach5_maxBias1000000
    gunzip isnvs.annot.txt.gz
    ~/Scripts/filter_isnv.py isnvs.annot.txt 0.05
    cut -d$'\t' filtered_isnvs.annot.txt -f1-7
    
    chr     pos     sample  patient time    alleles iSNV_freq
    OP297860        13203   HSV1_S1 HSV1_S1         T,C,A   1.0
    OP297860        13203   HSV-Klinik_S2   HSV-Klinik_S2           T,C,A   1.0
    OP297860        13522   HSV1_S1 HSV1_S1         G,T     1.0
    OP297860        13522   HSV-Klinik_S2   HSV-Klinik_S2           G,T     0.008888888888888889
    OP297860        13659   HSV1_S1 HSV1_S1         G,T     1.0
    OP297860        13659   HSV-Klinik_S2   HSV-Klinik_S2           G,T     0.008359207069500836
    OP297860        47109   HSV1_S1 HSV1_S1         T,G     0.0
    OP297860        47109   HSV-Klinik_S2   HSV-Klinik_S2           T,G     0.9930174563591022
    OP297860        47778   HSV1_S1 HSV1_S1         G,T     0.0
    OP297860        47778   HSV-Klinik_S2   HSV-Klinik_S2           G,T     0.05263157894736842
    OP297860        47969   HSV1_S1 HSV1_S1         C,T,A   1.0
    OP297860        47969   HSV-Klinik_S2   HSV-Klinik_S2           C,T,A   0.01828457446808511
    OP297860        47989   HSV1_S1 HSV1_S1         G,T     0.0
    OP297860        47989   HSV-Klinik_S2   HSV-Klinik_S2           G,T     0.053474114441416885
    OP297860        53691   HSV1_S1 HSV1_S1         G,T     1.0
    OP297860        53691   HSV-Klinik_S2   HSV-Klinik_S2           G,T     0.02342786683107275
    OP297860        55501   HSV1_S1 HSV1_S1         T,C,A   1.0
    OP297860        55501   HSV-Klinik_S2   HSV-Klinik_S2           T,C,A   1.0
    OP297860        55807   HSV1_S1 HSV1_S1         C,A     0.0
    OP297860        55807   HSV-Klinik_S2   HSV-Klinik_S2           C,A     0.061538461538461535
    OP297860        63248   HSV1_S1 HSV1_S1         G,T     1.0
    OP297860        63248   HSV-Klinik_S2   HSV-Klinik_S2           G,T     0.016815034619188922
    OP297860        64328   HSV1_S1 HSV1_S1         C,A     1.0
    OP297860        64328   HSV-Klinik_S2   HSV-Klinik_S2           C,A     0.008433734939759036
    OP297860        65225   HSV1_S1 HSV1_S1         G,A     0.0
    OP297860        65225   HSV-Klinik_S2   HSV-Klinik_S2           G,A     0.8916728076639646
    OP297860        65402   HSV1_S1 HSV1_S1         C,A     0.0
    OP297860        65402   HSV-Klinik_S2   HSV-Klinik_S2           C,A     0.1018149623727313
    OP297860        66570   HSV1_S1 HSV1_S1         G,T     0.0
    OP297860        66570   HSV-Klinik_S2   HSV-Klinik_S2           G,T     0.05112219451371571
    OP297860        94750   HSV1_S1 HSV1_S1         G,T     0.0
    OP297860        94750   HSV-Klinik_S2   HSV-Klinik_S2           G,T     0.052851182197496516
    OP297860        95302   HSV1_S1 HSV1_S1         C,A     1.0
    OP297860        95302   HSV-Klinik_S2   HSV-Klinik_S2           C,A     0.012725344644750796
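The two filtered tables above (maxBias 10 and maxBias 1000000) can be compared by position. The sketch below is only an illustration and not part of the pipeline: the helper name `positions` and the abbreviated inline tables are assumptions made for the example. In practice the two `filtered_isnvs.annot.txt` files would be read instead.

```python
import csv
import io

HEADER = "chr\tpos\tsample\tpatient\ttime\talleles\tiSNV_freq\n"

def positions(tsv_text):
    """Return the set of (chr, pos) pairs found in an iSNV table."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {(row["chr"], int(row["pos"])) for row in reader}

# Abbreviated copies of the two tables (columns 1-7 as printed by `cut`).
max_bias_10 = HEADER + (
    "OP297860\t47109\tHSV-Klinik_S2\tHSV-Klinik_S2\t\tT,G\t0.9930174563591022\n"
    "OP297860\t47969\tHSV-Klinik_S2\tHSV-Klinik_S2\t\tC,T,A\t0.01828457446808511\n"
)
max_bias_1e6 = HEADER + (
    "OP297860\t47109\tHSV-Klinik_S2\tHSV-Klinik_S2\t\tT,G\t0.9930174563591022\n"
    "OP297860\t47778\tHSV-Klinik_S2\tHSV-Klinik_S2\t\tG,T\t0.05263157894736842\n"
    "OP297860\t47969\tHSV-Klinik_S2\tHSV-Klinik_S2\t\tC,T,A\t0.01828457446808511\n"
)

# Positions reported only with the relaxed strand-bias setting.
only_relaxed = sorted(positions(max_bias_1e6) - positions(max_bias_10))
print(only_relaxed)
```

For the full tables shown above, position 47778 is the only one that appears with maxBias 1000000 but not with maxBias 10.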
  20. Install a new viral-ngs including interhost and annotation steps (Failed!)

    #https://viral-ngs.readthedocs.io/en/latest/install.html
    wget https://raw.githubusercontent.com/broadinstitute/viral-ngs/master/easy-deploy-script/easy-deploy-viral-ngs.sh && chmod a+x ./easy-deploy-viral-ngs.sh && reuse UGER && qrsh -l h_vmem=10G -cwd -N "viral-ngs_deploy" -q interactive ./easy-deploy-viral-ngs.sh setup
    source ./easy-deploy-viral-ngs.sh load
    ./easy-deploy-viral-ngs.sh create-project HSV1_Capture
    
    #docker installation
    sudo usermod -aG docker jhuang
    #newgrp docker
    groups jhuang
    docker pull quay.io/broadinstitute/viral-ngs
    
    docker run -it quay.io/broadinstitute/viral-ngs /bin/bash
  21. Note that the intrahost results do not include the interhost results. Checking the process.

    #Under data/02_assembly
    cp ../../ref_genome/reference.fasta HSV1_S1.fasta        #>HSV1_S1-1
    cp ../../ref_genome/reference.fasta HSV-Klinik_S2.fasta  #>HSV-Klinik_S2-1
    samtools faidx HSV1_S1.fasta
    picard CreateSequenceDictionary R=HSV1_S1.fasta O=HSV1_S1.dict
    ~/Tools/novocraft_v3/novoindex HSV1_S1.nix HSV1_S1.fasta
    samtools faidx HSV-Klinik_S2.fasta
    picard CreateSequenceDictionary R=HSV-Klinik_S2.fasta O=HSV-Klinik_S2.dict
    ~/Tools/novocraft_v3/novoindex HSV-Klinik_S2.nix HSV-Klinik_S2.fasta
    
    #total 128140
    #-rw-rw-r-- 1 jhuang jhuang 76693037 Nov 13 09:59 HSV1_S1.bam
    #-rw-rw-r-- 1 jhuang jhuang    34590 Nov 13 09:59 HSV1_S1.mapped.bam
    #-rw-rw-r-- 1 jhuang jhuang 48946378 Nov 13 10:03 HSV-Klinik_S2.bam
    #-rw-rw-r-- 1 jhuang jhuang  5537247 Nov 13 10:03 HSV-Klinik_S2.mapped.bam
    # vs
    #total 128140
    #-rw-rw-r-- 1 jhuang jhuang 76693095 Nov 15 12:47 HSV1_S1.bam
    #-rw-rw-r-- 1 jhuang jhuang    34587 Nov 15 12:47 HSV1_S1.mapped.bam
    #-rw-rw-r-- 1 jhuang jhuang 48946337 Nov 15 12:48 HSV-Klinik_S2.bam
    #-rw-rw-r-- 1 jhuang jhuang  5537246 Nov 15 12:48 HSV-Klinik_S2.mapped.bam
    
    #Manually generate aligned_1.fasta because the alignment step takes too long to run.
    cat ../../ref_genome/reference.fasta ../02_assembly/HSV1_S1.fasta ../02_assembly/HSV-Klinik_S2.fasta > aligned_1.fasta
    #>OP297860.1 Human alphaherpesvirus 1 strain HSV1-v60_d3_cu_gen_les, complete genome
    #>HSV1_S1-1
    #>HSV-Klinik_S2-1
    
    #If these results are similar to the freebayes results, they successfully include the interhost results.
    #TODO: In the next step, we should feed other BAM files, e.g. the cleaned BAM file, into the pipeline!
    #DOESN'T WORK: snakemake --cleanup-metadata data/03_multialign_to_ref/sampleNameList.txt data/03_multialign_to_ref/aligned_1.fasta  --cores 1
    snakemake --printshellcmds --cores all
    
    mkdir data/04_intrahost
    bin/intrahost.py vphaser_one_sample data/02_align_to_self/HSV1_S1.mapped.bam data/02_assembly/HSV1_S1.fasta data/04_intrahost/vphaser2.HSV1_S1.txt.gz --vphaserNumThreads 120 --removeDoublyMappedReads --minReadsEach 5 --maxBias 10
    bin/intrahost.py vphaser_one_sample data/02_align_to_self/HSV-Klinik_S2.mapped.bam data/02_assembly/HSV-Klinik_S2.fasta data/04_intrahost/vphaser2.HSV-Klinik_S2.txt.gz --vphaserNumThreads 120 --removeDoublyMappedReads --minReadsEach 5 --maxBias 10
    bin/intrahost.py merge_to_vcf ref_genome/reference.fasta data/04_intrahost/isnvs.vcf.gz --samples HSV1_S1 HSV-Klinik_S2 --isnvs data/04_intrahost/vphaser2.HSV1_S1.txt.gz data/04_intrahost/vphaser2.HSV-Klinik_S2.txt.gz --alignments data/03_multialign_to_ref/aligned_1.fasta --strip_chr_version --parse_accession --loglevel=DEBUG
    bin/interhost.py snpEff data/04_intrahost/isnvs.vcf.gz OP297860.1 data/04_intrahost/isnvs.annot.vcf.gz j.huang@uke.de
    bin/intrahost.py iSNV_table data/04_intrahost/isnvs.annot.vcf.gz data/04_intrahost/isnvs.annot.txt.gz
    
    mv data/04_intrahost data/04_intrahost_including_interhost
    cd data/04_intrahost_including_interhost
    gunzip isnvs.annot.txt.gz
    ~/Scripts/filter_isnv.py isnvs.annot.txt 0.05
    cut -d$'\t' filtered_isnvs.annot.txt -f1-7
    
    bin/intrahost.py merge_to_vcf ref_genome/reference.fasta data/04_intrahost/isnvs.vcf.gz --samples HSV1_S1 HSV-Klinik_S2 --isnvs data/04_intrahost/vphaser2.HSV1_S1.txt.gz data/04_intrahost/vphaser2.HSV-Klinik_S2.txt.gz --alignments data/03_multialign_to_ref/aligned_1.fasta --strip_chr_version --parse_accession
    
    awk '$7 >= 5' vphaser2.HSV-Klinik_S2_removeDoubly_min5_max1000000_w25000.txt > vphaser2.HSV-Klinik_S2_removeDoubly_min5_max1000000_w25000_0.05.txt
    awk '$7 >= 50' vphaser2.HSV-Klinik_S2_removeDoubly_min5_max1000000_w25000.txt > vphaser2.HSV-Klinik_S2_removeDoubly_min5_max1000000_w25000_0.5.txt
    # How many SNPs?
    
    #bin/intrahost.py vphaser_one_sample data_v2/02_align_to_self/HSV-Klinik_S2.mapped.bam data_v2/02_assembly/HSV-Klinik_S2.fasta data_v2/04_intrahost/vphaser2.HSV-Klinik_S2_v2.txt.gz --vphaserNumThreads 120 --minReadsEach 5 --maxBias 1000000 --loglevel DEBUG
    #mv vphaser2.HSV-Klinik_S2.txt.gz
    # How many SNPs?
    awk '$7 >= 5' vphaser2.HSV-Klinik_S2_v2.txt > vphaser2.HSV-Klinik_S2_v2_.txt
    
    bin/intrahost.py vphaser_one_sample data_v2/02_align_to_self/HSV-Klinik_S2.mapped.bam data_v2/02_assembly/HSV-Klinik_S2.fasta data_v2/04_intrahost/vphaser2.HSV-Klinik_S2_v3.txt.gz --vphaserNumThreads 120 --minReadsEach 5 --maxBias 10 --loglevel DEBUG
    # How many SNPs?
    awk '$6 >= 0.05' isnvs.annot.txt > isnvs.annot_.txt
    
    -------
    
    #I used viral-ngs to get a table as follows:
            chr     pos     sample  patient time    alleles iSNV_freq       Hw      Hs      eff_type        eff_codon_dna   eff_aa  eff_aa_pos      eff_prot_len    eff_gene        eff_protein
    OP297860        9012    HSV-Klinik_S2   HSV-Klinik_S2           C,A     0.0155954631379962      0.0307044893350152      1       intergenic_region       n.9012C>A                               RL2-UL1 Gene_1996_5580-Gene_9025_10636
    OP297860        9017    HSV-Klinik_S2   HSV-Klinik_S2           C,A     0.0408905043162199      0.0784369419459701      1       intergenic_region       n.9017C>A                               RL2-UL1 Gene_1996_5580-Gene_9025_10636
    intrahost.py was used in this process (intrahost.py: within-host genetic variation, iSNVs).
    As I understand it, the output contains only intrahost variants. The chr OP297860 is used only for the annotation. If a position in my clinical sample HSV-Klinik_S2 differs from OP297860, it is not reported and does not appear in the table.
    
    Column Descriptions in the Output Table
    
    The output table generated by this script will contain the following columns:
    
        chr: Chromosome or contig where the variant is located.
        pos: Position on the chromosome/contig of the variant.
        sample: The sample identifier for this variant.
        patient: Patient ID extracted from the sample name (assumes the format sample.patient).
        time: Time point of sample collection, extracted from the sample name (if present).
        alleles: The alleles involved in the variant. For example, C,A means Cytosine (C) and Adenine (A).
        iSNV_freq: Frequency of the variant in the sample. This is the sum of the frequencies of the variant alleles.
        Hw: Heterozygosity within the sample at this position, not a Hardy-Weinberg p-value. The example rows above are consistent with Hw = 1 - sum(p_i^2) over the allele frequencies: for iSNV_freq 0.0156, 2 x 0.0156 x 0.9844 = 0.0307.
        Hs: Heterozygosity in the population based on consensus genotypes. It measures genetic diversity based on observed genotypes.
        eff_type: The type of effect the variant has on the gene, such as intergenic_region, start_lost, etc.
        eff_codon_dna: The effect of the variant at the DNA level (e.g., n.9012C>A).
        eff_aa: The amino acid effect of the variant (e.g., a change from one amino acid to another or a frameshift).
        eff_aa_pos: The position of the amino acid affected by the variant.
        eff_prot_len: The length of the protein after the variant is applied, which may be truncated if the variant causes a frameshift or a stop codon.
        eff_gene: The gene affected by the variant.
        eff_protein: The protein affected by the variant (e.g., a protein identifier like UXY89132.1).
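The Hw values in the two example rows above are numerically consistent with a within-sample heterozygosity of the form 1 - sum(p_i^2) over the allele frequencies. The following sketch checks this for those rows. The function name `within_sample_heterozygosity` is hypothetical, and the formula is inferred from the numbers in the table rather than taken from the viral-ngs source.

```python
import math

def within_sample_heterozygosity(isnv_freq):
    """1 - sum(p_i^2) for a biallelic site whose variant allele has frequency isnv_freq."""
    freqs = [1.0 - isnv_freq, isnv_freq]
    return 1.0 - sum(p * p for p in freqs)

# (pos, iSNV_freq, Hw) copied from the example table rows.
rows = [
    (9012, 0.0155954631379962, 0.0307044893350152),
    (9017, 0.0408905043162199, 0.0784369419459701),
]

for pos, freq, reported_hw in rows:
    hw = within_sample_heterozygosity(freq)
    # Compare the recomputed value with the Hw column of the table.
    print(pos, hw, math.isclose(hw, reported_hw, rel_tol=1e-6))
```

Both rows agree with the reported Hw within the chosen tolerance.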
    
    b'/home/jhuang/miniconda3/envs/viral-ngs4/bin/python\n'
    -------
    2024-11-12 13:22:47,892 - cmd:193:main_argparse - INFO - software version: 1522433800, python version: 3.6.7 | packaged by conda-forge | (default, Feb 28 2019, 09:07:38)
    [GCC 7.3.0]
    2024-11-12 13:22:47,893 - cmd:195:main_argparse - INFO - command: bin/intrahost.py merge_to_vcf refFasta=ref_genome/reference.fasta outVcf=data/04_intrahost/isnvs.vcf.gz samples=['HSV-Klinik_S2'] isnvs=['data/04_intrahost/vphaser2.HSV-Klinik_S2.txt.gz'] alignments=['data/03_multialign_to_ref/aligned_1.fasta'] strip_chr_version=True naive_filter=False parse_accession=True loglevel=INFO
    2024-11-12 13:22:47,893 - intrahost:476:merge_to_vcf - INFO - loaded CoordMapper for all genomes, starting VCF merge...
    Traceback (most recent call last):
      File "bin/intrahost.py", line 1152, in <module>
        util.cmd.main_argparse(__commands__, __doc__)
      File "/home/jhuang/Tools/viral-ngs/bin/util/cmd.py", line 221, in main_argparse
        ret = args.func_main(args)
      File "/home/jhuang/Tools/viral-ngs/bin/util/cmd.py", line 102, in _main
        mainfunc(**args2)
      File "bin/intrahost.py", line 530, in merge_to_vcf
        raise LookupError("Not all reference sequences found in alignments.")
    LookupError: Not all reference sequences found in alignments.
    [Tue Nov 12 13:22:47 2024]
    Error in rule isnvs_vcf:
        jobid: 0
        output: data/04_intrahost/isnvs.vcf.gz, data/04_intrahost/isnvs.vcf.gz.tbi, data/04_intrahost/isnvs.annot.vcf.gz, data/04_intrahost/isnvs.annot.txt.gz, data/04_intrahost/isnvs.annot.vcf.gz.tbi
    RuleException:
    CalledProcessError in line 61 of /mnt/md1/DATA_md1/Data_Nicole_CaptureProbeSequencing/bin/pipes/rules/intrahost.rules:
    Command 'set -euo pipefail; bin/intrahost.py merge_to_vcf ref_genome/reference.fasta data/04_intrahost/isnvs.vcf.gz --samples HSV-Klinik_S2 --isnvs data/04_intrahost/vphaser2.HSV-Klinik_S2.txt.gz --alignments data/03_multialign_to_ref/aligned_1.fasta --strip_chr_version --parse_accession' returned non-zero exit status 1.
      File "/mnt/md1/DATA_md1/Data_Nicole_CaptureProbeSequencing/bin/pipes/rules/intrahost.rules", line 61, in __rule_isnvs_vcf
      File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    Exiting because a job execution failed. Look above for error message

    Columns Breakdown (V-Phaser 2 output)

        Ref_Pos (e.g., 55, 104, 210): The position in the genome where the variant occurs.
        Var (e.g., T, G, T): The variant allele observed at that position.
        Cons (e.g., C, A, C): The consensus allele at that position. For example, at position 55 the consensus base is C and the variant base is T; at position 104 the consensus is A and the variant is G; at position 210 the consensus is C and the variant is T.
        Strd_bias_pval (e.g., 0.8156, 0.1674, 0.1065): The strand bias p-value for the variant. It tests whether the reads supporting the variant are unevenly distributed between the forward and reverse strands. A higher p-value suggests no significant strand bias. A lower p-value suggests possible strand bias, meaning the variant might be called incorrectly because its reads come mostly from one strand.
        Type (e.g., snp): The type of variant. snp means that a single nucleotide differs; lp means a length polymorphism (insertion or deletion).
        Var_perc (e.g., 16.1, 14.07, 10.58): The percentage of reads supporting the variant allele. For example, the variant T at position 55 is observed in 16.1% of the reads, the variant G at position 104 in 14.07%, and the variant T at position 210 in 10.58%.
        SNP_or_LP_Profile (e.g., C:65:34 T:13:6): The read counts per allele in the format allele:forward:reverse. At position 55, C (consensus) has 65 reads on the forward strand and 34 on the reverse strand, and T (variant) has 13 reads on the forward strand and 6 on the reverse strand, so T accounts for 19 of 118 reads (16.1%).

    Summary: SNPV and LPV

        The last line of the output summarizes the total number of SNPs and LPs (length polymorphisms):
        # Summary: SNPV: 132; LPV: 0
        SNPV: 132: The total number of SNP variants detected, here 132.
        LPV: 0: The number of length polymorphism variants. A value of 0 means that no length polymorphisms were called.

    Relevant settings in the viral-ngs config:

        # Minimum number of reads on each strand
        vphaser_min_reads_each: 5
        # Maximum allowable ratio of number of reads on the two
        # strands. Ignored if vphaser_max_bins=0.
        vphaser_max_bins: 10
        # A simple filter for the VCF merge step.
        # If set to true, keep only the alleles that have at least two
        # independent libraries of support and
        # allele freq > 0.005. If false, no filtering is performed.
        vcf_merge_naive_filter: false
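The example row at position 55 can be checked by hand: the profile `C:65:34 T:13:6` gives 99 reads for C and 19 for T, so T accounts for 19/118 = 16.1% of the reads, which is the Var_perc value. The sketch below parses such a profile and applies a strand filter in the spirit of the `vphaser_min_reads_each` and `vphaser_max_bins` settings. The function names and the exact pass/fail rule are assumptions for illustration; this is not the viral-ngs implementation.

```python
def parse_profile(profile):
    """Parse 'C:65:34 T:13:6' into {allele: (forward_reads, reverse_reads)}."""
    counts = {}
    for field in profile.split():
        allele, fwd, rev = field.split(":")
        counts[allele] = (int(fwd), int(rev))
    return counts

def allele_percentages(counts):
    """Percentage of all reads that support each allele."""
    total = sum(fwd + rev for fwd, rev in counts.values())
    return {a: 100.0 * (fwd + rev) / total for a, (fwd, rev) in counts.items()}

def passes_strand_filter(fwd, rev, min_reads_each=5, max_bias=10):
    """Assumed rule: enough reads on each strand and a bounded strand ratio."""
    if fwd < min_reads_each or rev < min_reads_each:
        return False
    return max(fwd, rev) <= max_bias * min(fwd, rev)

counts = parse_profile("C:65:34 T:13:6")
percentages = allele_percentages(counts)
print(round(percentages["T"], 2))
print({a: passes_strand_filter(fwd, rev) for a, (fwd, rev) in counts.items()})
```

Under these assumed thresholds both alleles of the example row pass, since each has at least 5 reads per strand and a strand ratio well below 10.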
  22. (Optional)

    152526
    GapFiller.pl -l libraries_p2564.txt -s data/02_assembly/p2564.fasta
    #parainfluenza bwa /home/jhuang/DATA/Data_parainfluenza/trimmed/p2564_R1.fastq.gz /home/jhuang/DATA/Data_parainfluenza/trimmed/p2564_R2.fastq.gz 300 1.0 FR
    
    #Since HSV1 and HSV-Klinik_S2 have different regions covered --> multialign_to_ref is none!
    bin/intrahost.py vphaser_one_sample data/02_align_to_self/HSV-Klinik_S2.mapped.bam data/02_assembly/HSV-Klinik_S2.fasta data/04_intrahost/vphaser2.HSV-Klinik_S2.txt.gz --vphaserNumThreads 120 --removeDoublyMappedReads --minReadsEach 5 --maxBias 10
    
    (viral-ngs4) jhuang@WS-2290C:~/DATA/Data_Nicole_CaptureProbeSequencing/data/02_align_to_self$ samtools flagstat HSV-Klinik_S2.mapped.bam
    162156 + 0 in total (QC-passed reads + QC-failed reads)
    0 + 0 secondary
    0 + 0 supplementary
    0 + 0 duplicates
    162156 + 0 mapped (100.00% : N/A)
    162156 + 0 paired in sequencing
    81048 + 0 read1
    81108 + 0 read2
    161068 + 0 properly paired (99.33% : N/A)
    161630 + 0 with itself and mate mapped
    526 + 0 singletons (0.32% : N/A)
    0 + 0 with mate mapped to a different chr
    0 + 0 with mate mapped to a different chr (mapQ>=5)
    
    (viral-ngs4) jhuang@WS-2290C:~/DATA/Data_Nicole_CaptureProbeSequencing/data/01_per_sample$ samtools flagstat HSV-Klinik_S2.taxfilt.bam
    800454 + 0 in total (QC-passed reads + QC-failed reads)
    0 + 0 secondary
    0 + 0 supplementary
    0 + 0 duplicates
    0 + 0 mapped (0.00% : N/A)
    800454 + 0 paired in sequencing
    400227 + 0 read1
    400227 + 0 read2
    0 + 0 properly paired (0.00% : N/A)
    0 + 0 with itself and mate mapped
    0 + 0 singletons (0.00% : N/A)
    0 + 0 with mate mapped to a different chr
    0 + 0 with mate mapped to a different chr (mapQ>=5)
    
    (viral-ngs4) jhuang@WS-2290C:~/DATA/Data_Nicole_CaptureProbeSequencing/data/02_align_to_self$ samtools flagstat HSV-Klinik_S2.bam
    885528 + 0 in total (QC-passed reads + QC-failed reads)
    0 + 0 secondary
    0 + 0 supplementary
    191932 + 0 duplicates
    354088 + 0 mapped (39.99% : N/A)
    885528 + 0 paired in sequencing
    442764 + 0 read1
    442764 + 0 read2
    323502 + 0 properly paired (36.53% : N/A)
    324284 + 0 with itself and mate mapped
    29804 + 0 singletons (3.37% : N/A)
    0 + 0 with mate mapped to a different chr
    0 + 0 with mate mapped to a different chr (mapQ>=5)
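The three flagstat reports above are easier to compare once the counts are extracted. The following sketch parses the text that `samtools flagstat` prints, in the layout shown above, and computes the mapped fraction. The function name `parse_flagstat` and the abbreviated inline report are assumptions for the example.

```python
import re

def parse_flagstat(text):
    """Map each flagstat category to its QC-passed read count."""
    counts = {}
    for line in text.splitlines():
        match = re.match(r"\s*(\d+) \+ (\d+) (.+)", line)
        if not match:
            continue
        passed, _failed, label = match.groups()
        # Drop a trailing parenthetical such as '(39.99% : N/A)',
        # but keep '(mapQ>=5)' because it distinguishes two categories.
        key = re.sub(r"\s*\((?!mapQ)[^()]*\)$", "", label).strip()
        counts[key] = int(passed)
    return counts

# Lines copied from the HSV-Klinik_S2.bam report.
report = """885528 + 0 in total (QC-passed reads + QC-failed reads)
191932 + 0 duplicates
354088 + 0 mapped (39.99% : N/A)
323502 + 0 properly paired (36.53% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
"""

counts = parse_flagstat(report)
mapped_pct = 100.0 * counts["mapped"] / counts["in total"]
print(f"{mapped_pct:.2f}% mapped")
```

The recomputed percentage matches the 39.99% that flagstat itself reports for this file.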
  23. Summarize statistics from snakemake-output

    samples-runs.txt
    
    samtools flagstat data/02_align_to_self/838_S1.mapped.bam
    samtools flagstat data/02_align_to_self/840_S2.mapped.bam
    samtools flagstat data/02_align_to_self/820_S3.mapped.bam
    samtools flagstat data/02_align_to_self/828_S4.mapped.bam
    samtools flagstat data/02_align_to_self/815_S5.mapped.bam
    samtools flagstat data/02_align_to_self/834_S6.mapped.bam
    samtools flagstat data/02_align_to_self/808_S7.mapped.bam
    samtools flagstat data/02_align_to_self/811_S8.mapped.bam
    samtools flagstat data/02_align_to_self/837_S9.mapped.bam
    samtools flagstat data/02_align_to_self/768_S10.mapped.bam
    samtools flagstat data/02_align_to_self/773_S11.mapped.bam
    samtools flagstat data/02_align_to_self/767_S12.mapped.bam
    samtools flagstat data/02_align_to_self/810_S13.mapped.bam
    samtools flagstat data/02_align_to_self/814_S14.mapped.bam
    samtools flagstat data/02_align_to_self/10121-16_S15.mapped.bam     #--> 3c (reference: "Origin of hepatitis C virus genotype 3 in Africa as estimated through an evolutionary analysis of the full-length genomes of nine subtypes, including the newly sequenced 3d and 3e")
    
    samtools flagstat data/02_align_to_self/7510-15_S16.mapped.bam      -->
    samtools flagstat data/02_align_to_self/828-17_S17.mapped.bam
    samtools flagstat data/02_align_to_self/8806-15_S18.mapped.bam
    samtools flagstat data/02_align_to_self/9881-16_S19.mapped.bam
    samtools flagstat data/02_align_to_self/8981-14_S20.mapped.bam
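The repeated `samtools flagstat` calls above follow one pattern, so they can be generated from a sample list. This is a minimal sketch: the function names are hypothetical, only the first three sample names are listed, and `samtools` is assumed to be on the PATH. Samples whose BAM file is missing are skipped.

```python
import subprocess
from pathlib import Path

SAMPLES = ["838_S1", "840_S2", "820_S3"]  # extend with the remaining sample names

def flagstat_command(sample, bam_dir="data/02_align_to_self"):
    """Build the argument list for one `samtools flagstat` call."""
    return ["samtools", "flagstat", str(Path(bam_dir) / f"{sample}.mapped.bam")]

def run_flagstat(samples, bam_dir="data/02_align_to_self"):
    """Run flagstat for every sample whose BAM exists; return {sample: report text}."""
    reports = {}
    for sample in samples:
        cmd = flagstat_command(sample, bam_dir)
        if not Path(cmd[-1]).exists():
            continue  # skip samples without a mapped BAM
        result = subprocess.run(cmd, check=True, capture_output=True, text=True)
        reports[sample] = result.stdout
    return reports

if __name__ == "__main__":
    for sample, report in run_flagstat(SAMPLES).items():
        print(f"== {sample} ==")
        print(report)
```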
  24. Consensus sequences of each and of all isolates

    cp data/02_assembly/*.fasta ./
    #Use exactly one of the two sample lists; the second one is kept as an alternative.
    #for sample in p953-84660-tsek p938-16972-nra p942-88507-nra p943-98523-nra p944-103323-nra p947-105565-nra p948-112830-nra; do
    for sample in 838_S1 840_S2 820_S3 828_S4 815_S5 834_S6 808_S7 811_S8 837_S9 768_S10 773_S11 767_S12 810_S13 814_S14 10121-16_S15 7510-15_S16 828-17_S17 8806-15_S18 9881-16_S19 8981-14_S20; do
      mv ${sample}.fasta ${sample}.fa
      #Append only the sample file; all.fa must not be both input and output of cat.
      cat ${sample}.fa >> all.fa
    done
    
    cat RSV_dedup.fa all.fa > RSV_all.fa
    mafft --adjustdirection RSV_all.fa > RSV_all.aln
    snp-sites RSV_all.aln -o RSV_all_.aln
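The concatenation of per-sample FASTA files into `all.fa` can also be sketched in Python, which avoids reading from and appending to the same file in one command. This is a sketch under assumptions: the function name `concat_fasta` is hypothetical, and the demo uses small temporary files rather than the real assemblies.

```python
import tempfile
from pathlib import Path

def concat_fasta(fasta_paths, out_path):
    """Write the contents of all input FASTA files to out_path, in order."""
    with open(out_path, "w") as out:
        for path in fasta_paths:
            text = Path(path).read_text()
            # Make sure each file ends with a newline so records do not run together.
            if text and not text.endswith("\n"):
                text += "\n"
            out.write(text)

with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    inputs = []
    for name, seq in [("838_S1", "ACGT"), ("840_S2", "GGCC")]:
        path = tmp / f"{name}.fa"
        path.write_text(f">{name}\n{seq}")  # deliberately no trailing newline
        inputs.append(path)
    concat_fasta(inputs, tmp / "all.fa")
    combined = (tmp / "all.fa").read_text()

print(combined.count(">"))
```

The output file is opened once for writing and is never one of the inputs, so rerunning the step does not duplicate records.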
  25. Finding the closest strain with phylogenetics: send both HCV231_all.png and HCV231_all.pdf to Nicole

    #1, generate tree
    cat SARS-CoV-2_len25000_w60_newheader.fa ~/rtpd_files/2029-AW_S5/idba_ud_assembly/gapped_contig.fa > CoV2_all.fa
    mafft --adjustdirection CoV2_all.fa > CoV2_all.aln
    snp-sites CoV2_all.aln -o CoV2_all_.aln
    fasttree -gtr -nt RSV_all_.aln > RSV_all.tree
    fasttree -gtr -nt Ortho_SNP_matrix_RAxML.fa > Ortho_SNP_matrix_RAxML.tree
    raxml-ng --all --model GTR+G+ASC_LEWIS --prefix CoV2_all_raxml.aln --threads 1 --msa CoV2_all_.aln --bs-trees 1000 --redo
    #raxml-ng --all --model GTR+G+ASC_LEWIS --prefix raxml-ng/snippy.core.aln --threads 1 --msa variants/snippy.core.aln --bs-trees 1000 --redo
    
    #2, open the tree in Dendroscope and read the genotype references off the phylogenetic tree as follows:
    1a: S10, S11, 814_S14(3-->1a?), S18 --> 1a_EF407457
    1b: S12 --> 1b_M58335
    2a: 815_S5(3-->2a?) --> 2a_D00944
    2c: S20 --> 2c_D50409
    3a: S3, S7, S8, S13, S15, S16, S19 --> 3c_KY620605
    4d: S1, S2, S9 --> 4d_EU392172
    4k: S4, S6 --> 4k_EU392173
    
    --> KX249682.1
    --> KX765935.1
    --> KM517573.1
    
    cd data/02_assembly/
    cat p2.fasta p3e.fasta p4e.fasta p5e.fasta > all.fasta
    sed -i -e 's/-1//g' all.fasta
    #sed -i -e 's/e-1//g' all.fasta
    mafft --adjustdirection --clustalout all.fasta > all.aln
    # MANUAL CORRECTION!
    
    ##POLISH the assembled contigs
    #for sample in p953 p938 p942 p943 p944 p947 p948  p955 p954 p952 p951 p946 p945 p940; do
    #  rm ${sample}_polished.fa
    #  #seqtk sample ../../trimmed/${sample}_R1.fastq.gz 0.1 > ${sample}_0.1_R1.fastq
    #  #seqtk sample ../../trimmed/${sample}_R2.fastq.gz 0.1 > ${sample}_0.1_R2.fastq
    #  polish_viral_ref.sh -1 ../../trimmed/${sample}_R1.fastq.gz -2 ../../trimmed/${sample}_R2.fastq.gz -r ${sample}.fasta -o ${sample}_polished.fa -t 6
    #done
    
    #Alternative sample lists (use exactly one for-line per run):
    #for sample in p946 p954 p952 p948 p945 p947  p955 p943 p951 p942; do  #all.aln
    #for sample in p944 p938 p953 p940; do  #all2.aln
    for sample in p2 p3 p4 p5; do
    grep "${sample}" all.aln > REF${sample}.fasta
    #cut -f2-2 -d$'\t' REF${sample}.fasta > REF${sample}.fast
    sed -i -e "s/${sample}              //g" REF${sample}.fasta
    sed -i -e "s/${sample}-1            //g" REF${sample}.fasta
    sed -i -e 's/-//g' REF${sample}.fasta
    echo ">REF${sample}" > REF${sample}.header
    cat REF${sample}.header REF${sample}.fasta > REF${sample}.fas
    seqkit seq -u REF${sample}.fas -o REF${sample}.fa
    cp REF${sample}.fa ${sample}.fa
    mv REF${sample}.fa ../..
    sed -i -e "s/REF//g" ${sample}.fa    #still under data/02_assembly/
    done
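The loop above pulls each sample's rows out of the Clustal-format alignment, removes the name column and the gap characters, and upper-cases the result. The same idea is sketched below in Python. The function name `ungapped_from_clustal` and the tiny inline alignment are assumptions for illustration.

```python
def ungapped_from_clustal(aln_text, name):
    """Join the alignment rows of `name` and drop gap characters."""
    chunks = []
    for line in aln_text.splitlines():
        fields = line.split()
        # Alignment rows look like '<name> <sequence block>'.
        if len(fields) >= 2 and fields[0] == name:
            chunks.append(fields[1])
    return "".join(chunks).replace("-", "").upper()

# A made-up two-block alignment in Clustal layout.
alignment = """CLUSTAL format alignment by MAFFT

p2              acgt--acgt
p3              acgtttacgt
                ****  ****

p2              ggcc
p3              gg-c
                ** *
"""

fasta = ">REFp2\n" + ungapped_from_clustal(alignment, "p2") + "\n"
print(fasta)
```

Matching the first field exactly avoids picking up rows whose name merely contains the sample string, which a plain `grep` would also return.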
    
    #ReferenceSeeker determines closely related reference genomes
    #https://github.com/oschwengers/referenceseeker
    (referenceseeker) jhuang@hamburg:~/DATA/Data_Holger_Efaecium$ ~/Tools/referenceseeker/bin/referenceseeker -v ~/REFs/bacteria-refseq/ shovill/noAB_wildtype/contigs.fasta
    
    # Annotating the fasta using VAPiD
    makeblastdb -in *.fasta -dbtype nucl
    python ~/Tools/VAPiD/vapid3.py --db ~/REFs/all_virus/all_virus.fasta p946R.fa ~/REFs/template_Less.sbt
    python ~/Tools/VAPiD/vapid3.py --db ~/REFs/all_virus/all_virus.fasta REFp944.fa ~/REFs/template_Less.sbt   # KT581445.1 selected!
    python ~/Tools/VAPiD/vapid3.py --db ~/REFs/all_virus/all_virus.fasta contigs_final.fasta ~/REFs/template_Amir.sbt
    python ~/Tools/VAPiD/vapid3.py --online contigs_final.fasta ~/REFs/template_Amir.sbt
  26. All packages in the viral-ngs4 env. Note that novoalign is not installed in this env. The Novoalign used is /home/jhuang/Tools/novocraft_v3/novoalign; the gatk used is /usr/local/bin/gatk, which uses /home/jhuang/Tools/GenomeAnalysisTK-3.6/GenomeAnalysisTK.jar (see point 9).

    mamba remove viral-ngs --all
    mamba remove viral-ngs-env --all
    conda remove viral-ngs-java7 --all
    conda remove viral-ngs-java8 --all
    conda remove viral-ngs-py36 --all
    conda remove viral-ngs2 --all
    conda remove viral-ngs3 --all
    jhuang@WS-2290C:~$ conda activate viral-ngs4
    (viral-ngs4) jhuang@WS-2290C:~$ conda list
    # packages in environment at /home/jhuang/miniconda3/envs/viral-ngs4:
    #
    # Name                    Version                   Build  Channel
    _libgcc_mutex             0.1                 conda_forge    conda-forge
    _openmp_mutex             4.5                       2_gnu    conda-forge
    _r-mutex                  1.0.1               anacondar_1    conda-forge
    alsa-lib                  1.2.3.2              h166bdaf_0    conda-forge
    bamtools                  2.5.2                hdcf5f25_5    bioconda
    bedtools                  2.31.1               hf5e1c6e_2    bioconda
    binutils_impl_linux-64    2.43                 h4bf12b8_2    conda-forge
    binutils_linux-64         2.43                 h4852527_2    conda-forge
    biopython                 1.79             py36h8f6f2f9_0    conda-forge
    blast                     2.6.0               boost1.64_2    bioconda
    bmfilter                  3.101                h4ac6f70_5    bioconda
    bmtagger                  3.101                h470a237_4    bioconda
    bmtool                    3.101                hdbdd923_5    bioconda
    boost                     1.64.0                   py36_4    conda-forge
    boost-cpp                 1.64.0                        1    conda-forge
    bowtie                    1.3.1            py36h769816f_3    bioconda
    bowtie2                   2.5.4                h7071971_4    bioconda
    bwa                       0.7.18               he4a0461_1    bioconda
    bwidget                   1.9.14               ha770c72_1    conda-forge
    bzip2                     1.0.8                h4bc722e_7    conda-forge
    c-ares                    1.34.2               heb4867d_0    conda-forge
    ca-certificates           2024.9.24            h06a4308_0
    cairo                     1.16.0            h18b612c_1001    conda-forge
    cd-hit                    4.8.1               h43eeafb_10    bioconda
    cd-hit-auxtools           4.8.1                h4ac6f70_3    bioconda
    certifi                   2021.5.30        py36h5fab9bb_0    conda-forge
    curl                      7.68.0               hf8cf82a_0    conda-forge
    cycler                    0.11.0             pyhd8ed1ab_0    conda-forge
    dbus                      1.13.6               hfdff14a_1    conda-forge
    diamond                   2.1.10               h43eeafb_2    bioconda
    expat                     2.6.4                h5888daf_0    conda-forge
    extract_fullseq           3.101                h4ac6f70_5    bioconda
    fastqc                    0.12.1               hdfd78af_0    bioconda
    font-ttf-dejavu-sans-mono 2.37                 hab24e00_0    conda-forge
    fontconfig                2.14.1               hef1e5e3_0
    freetype                  2.12.1               h267a509_2    conda-forge
    fribidi                   1.0.10               h36c2ea0_0    conda-forge
    future                    0.18.2           py36h5fab9bb_3    conda-forge
    gap2seq                   2.1                 boost1.64_1    bioconda
    gatk                      3.6                 hdfd78af_11    bioconda
    gcc_impl_linux-64         14.2.0               h6b349bd_1    conda-forge
    gcc_linux-64              14.2.0               h5910c8f_5    conda-forge
    gettext                   0.22.5               he02047a_3    conda-forge
    gettext-tools             0.22.5               he02047a_3    conda-forge
    gfortran_impl_linux-64    14.2.0               hc73f493_1    conda-forge
    gfortran_linux-64         14.2.0               hda50785_5    conda-forge
    giflib                    5.2.2                hd590300_0    conda-forge
    glib                      2.66.3               h58526e2_0    conda-forge
    graphite2                 1.3.13            h59595ed_1003    conda-forge
    gsl                       2.4               h294904e_1006    conda-forge
    gst-plugins-base          1.14.5               h0935bb2_2    conda-forge
    gstreamer                 1.14.5               h36ae1b5_2    conda-forge
    gxx_impl_linux-64         14.2.0               h2c03514_1    conda-forge
    gxx_linux-64              14.2.0               h9423afd_5    conda-forge
    harfbuzz                  2.4.0                h37c48d4_1    conda-forge
    icu                       58.2              hf484d3e_1000    conda-forge
    jpeg                      9e                   h0b41bf4_3    conda-forge
    kernel-headers_linux-64   3.10.0              he073ed8_18    conda-forge
    keyutils                  1.6.1                h166bdaf_0    conda-forge
    kiwisolver                1.3.1            py36h605e78d_1    conda-forge
    kmer-jellyfish            2.3.1                h4ac6f70_2    bioconda
    krb5                      1.16.4               h2fd8d38_0    conda-forge
    last                      876                      py36_0    bioconda
    lcms2                     2.12                 hddcbb42_0    conda-forge
    ld_impl_linux-64          2.43                 h712a8e2_2    conda-forge
    libasprintf               0.22.5               he8f35ee_3    conda-forge
    libasprintf-devel         0.22.5               he8f35ee_3    conda-forge
    libblas                   3.9.0           25_linux64_openblas    conda-forge
    libcblas                  3.9.0           25_linux64_openblas    conda-forge
    libcurl                   7.68.0               hda55be3_0    conda-forge
    libdeflate                1.21                 h4bc722e_0    conda-forge
    libedit                   3.1.20191231         he28a2e2_2    conda-forge
    libev                     4.33                 hd590300_2    conda-forge
    libexpat                  2.6.4                h5888daf_0    conda-forge
    libffi                    3.2.1             he1b5a44_1007    conda-forge
    libgcc                    14.2.0               h77fa898_1    conda-forge
    libgcc-devel_linux-64     14.2.0             h41c2201_101    conda-forge
    libgcc-ng                 14.2.0               h69a702a_1    conda-forge
    libgettextpo              0.22.5               he02047a_3    conda-forge
    libgettextpo-devel        0.22.5               he02047a_3    conda-forge
    libgfortran-ng            7.5.0               h14aa051_20    conda-forge
    libgfortran4              7.5.0               h14aa051_20    conda-forge
    libgfortran5              14.2.0               hd5240d6_1    conda-forge
    libglib                   2.66.3               hbe7bbb4_0    conda-forge
    libgomp                   14.2.0               h77fa898_1    conda-forge
    libiconv                  1.17                 hd590300_2    conda-forge
    libidn11                  1.33                 h7b6447c_0
    liblapack                 3.9.0           25_linux64_openblas    conda-forge
    libnghttp2                1.51.0               hdcd2b5c_0    conda-forge
    libnsl                    2.0.1                hd590300_0    conda-forge
    libopenblas               0.3.28          pthreads_h94d23a6_0    conda-forge
    libpng                    1.6.43               h2797004_0    conda-forge
    libsanitizer              14.2.0               h2a3dede_1    conda-forge
    libsqlite                 3.46.0               hde9e2c9_0    conda-forge
    libssh2                   1.10.0               haa6b8db_3    conda-forge
    libstdcxx                 14.2.0               hc0a3c3a_1    conda-forge
    libstdcxx-devel_linux-64  14.2.0             h41c2201_101    conda-forge
    libstdcxx-ng              14.2.0               h4852527_1    conda-forge
    libtiff                   4.2.0                hf544144_3    conda-forge
    libuuid                   1.0.3                h7f8727e_2
    libwebp-base              1.4.0                hd590300_0    conda-forge
    libxcb                    1.17.0               h8a09558_0    conda-forge
    libxcrypt                 4.4.36               hd590300_1    conda-forge
    libxml2                   2.9.14               h74e7548_0
    libzlib                   1.2.13               h4ab18f5_6    conda-forge
    llvm-openmp               8.0.1                hc9558a2_0    conda-forge
    mafft                     7.221                         0    bioconda
    make                      4.4.1                hb9d3cd8_2    conda-forge
    matplotlib                3.3.4            py36h5fab9bb_0    conda-forge
    matplotlib-base           3.3.4            py36hd391965_0    conda-forge
    mummer4                   4.0.0rc1        pl5321hdbdd923_7    bioconda
    muscle                    3.8.1551             h7d875b9_6    bioconda
    mvicuna                   1.0                 h4ac6f70_10    bioconda
    ncurses                   6.5                  he02047a_1    conda-forge
    numpy                     1.19.5           py36hfc0c790_2    conda-forge
    olefile                   0.46               pyh9f0ad1d_1    conda-forge
    openjdk                   8.0.412              hd590300_1    conda-forge
    openjpeg                  2.4.0                hb52868f_1    conda-forge
    openmp                    8.0.1                         0    conda-forge
    openssl                   1.1.1w               hd590300_0    conda-forge
    pandas                    1.1.5            py36h284efc9_0    conda-forge
    pango                     1.42.4               h7062337_4    conda-forge
    parallel                  20240922             ha770c72_0    conda-forge
    pcre                      8.45                 h9c3ff4c_0    conda-forge
    perl                      5.32.1          7_hd590300_perl5    conda-forge
    picard                    3.0.0                hdfd78af_0    bioconda
    pigz                      2.6                  h27cfd23_0
    pillow                    8.2.0            py36ha6010c0_1    conda-forge
    pip                       21.3.1             pyhd8ed1ab_0    conda-forge
    pixman                    0.38.0            h516909a_1003    conda-forge
    prinseq                   0.20.4               hdfd78af_5    bioconda
    pthread-stubs             0.4               hb9d3cd8_1002    conda-forge
    pybedtools                0.9.0            py36h7281c5b_1    bioconda
    pyparsing                 3.1.4              pyhd8ed1ab_0    conda-forge
    pyqt                      5.9.2            py36hcca6a23_4    conda-forge
    pysam                     0.16.0           py36h873a209_0    bioconda
    python                    3.6.7             h381d211_1004    conda-forge
    python-dateutil           2.8.2              pyhd8ed1ab_0    conda-forge
    python_abi                3.6                     2_cp36m    conda-forge
    pytz                      2023.3.post1       pyhd8ed1ab_0    conda-forge
    pyyaml                    5.4.1            py36h8f6f2f9_1    conda-forge
    qt                        5.9.7                h52cfd70_2    conda-forge
    r-assertthat              0.2.1             r36h6115d3f_2    conda-forge
    r-backports               1.2.1             r36hcfec24a_0    conda-forge
    r-base                    3.6.1                h9bb98a2_1
    r-bitops                  1.0_7             r36hcfec24a_0    conda-forge
    r-brio                    1.1.2             r36hcfec24a_0    conda-forge
    r-callr                   3.7.0             r36hc72bb7e_0    conda-forge
    r-catools                 1.18.2            r36h03ef668_0    conda-forge
    r-cli                     2.5.0             r36hc72bb7e_0    conda-forge
    r-colorspace              2.0_1             r36hcfec24a_0    conda-forge
    r-crayon                  1.4.1             r36hc72bb7e_0    conda-forge
    r-desc                    1.3.0             r36hc72bb7e_0    conda-forge
    r-diffobj                 0.3.4             r36hcfec24a_0    conda-forge
    r-digest                  0.6.27            r36h03ef668_0    conda-forge
    r-ellipsis                0.3.2             r36hcfec24a_0    conda-forge
    r-evaluate                0.14              r36h6115d3f_2    conda-forge
    r-fansi                   0.4.2             r36hcfec24a_0    conda-forge
    r-farver                  2.1.0             r36h03ef668_0    conda-forge
    r-ggplot2                 3.3.3             r36hc72bb7e_0    conda-forge
    r-glue                    1.4.2             r36hcfec24a_0    conda-forge
    r-gplots                  3.1.1             r36hc72bb7e_0    conda-forge
    r-gsalib                  2.1                    r36_1002    conda-forge
    r-gtable                  0.3.0             r36h6115d3f_3    conda-forge
    r-gtools                  3.8.2             r36hcdcec82_1    conda-forge
    r-isoband                 0.2.4             r36h03ef668_0    conda-forge
    r-jsonlite                1.7.2             r36hcfec24a_0    conda-forge
    r-kernsmooth              2.23_20           r36h742201e_0    conda-forge
    r-labeling                0.4.2             r36h142f84f_0    conda-forge
    r-lattice                 0.20_44           r36hcfec24a_0    conda-forge
    r-lifecycle               1.0.0             r36hc72bb7e_0    conda-forge
    r-magrittr                2.0.1             r36hcfec24a_1    conda-forge
    r-mass                    7.3_54            r36hcfec24a_0    conda-forge
    r-matrix                  1.3_3             r36he454529_0    conda-forge
    r-mgcv                    1.8_35            r36he454529_0    conda-forge
    r-munsell                 0.5.0           r36h6115d3f_1003    conda-forge
    r-nlme                    3.1_152           r36h859d828_0    conda-forge
    r-pillar                  1.6.1             r36hc72bb7e_0    conda-forge
    r-pkgconfig               2.0.3             r36h6115d3f_1    conda-forge
    r-pkgload                 1.2.1             r36h03ef668_0    conda-forge
    r-plyr                    1.8.6             r36h0357c0b_1    conda-forge
    r-praise                  1.0.0           r36h6115d3f_1004    conda-forge
    r-processx                3.5.2             r36hcfec24a_0    conda-forge
    r-ps                      1.6.0             r36hcfec24a_0    conda-forge
    r-r6                      2.5.0             r36hc72bb7e_0    conda-forge
    r-rcolorbrewer            1.1_2           r36h6115d3f_1003    conda-forge
    r-rcpp                    1.0.6             r36h03ef668_0    conda-forge
    r-rematch2                2.1.2             r36h6115d3f_1    conda-forge
    r-reshape                 0.8.8             r36hcdcec82_2    conda-forge
    r-rlang                   0.4.11            r36hcfec24a_0    conda-forge
    r-rprojroot               2.0.2             r36hc72bb7e_0    conda-forge
    r-rstudioapi              0.13              r36hc72bb7e_0    conda-forge
    r-scales                  1.1.1             r36h6115d3f_0    conda-forge
    r-testthat                3.0.2             r36h03ef668_0    conda-forge
    r-tibble                  3.1.2             r36hcfec24a_0    conda-forge
    r-utf8                    1.2.1             r36hcfec24a_0    conda-forge
    r-vctrs                   0.3.8             r36hcfec24a_1    conda-forge
    r-viridislite             0.4.0             r36hc72bb7e_0    conda-forge
    r-waldo                   0.2.5             r36hc72bb7e_0    conda-forge
    r-withr                   2.4.2             r36hc72bb7e_0    conda-forge
    readline                  7.0               hf8c457e_1001    conda-forge
    salmon                    0.14.2               ha0cc327_0    bioconda
    samtools                  1.6                  h244ad75_5    bioconda
    setuptools                58.0.4           py36h5fab9bb_2    conda-forge
    sip                       4.19.8          py36hf484d3e_1000    conda-forge
    six                       1.16.0             pyh6c4a22f_0    conda-forge
    snpeff                    4.1l                 hdfd78af_8    bioconda
    spades                    3.15.5               h95f258a_1    bioconda
    sqlite                    3.28.0               h8b20d00_0    conda-forge
    srprism                   2.4.24               h6a68c12_5    bioconda
    sysroot_linux-64          2.17                h4a8ded7_18    conda-forge
    tbb                       2020.3               hfd86e86_0
    tbl2asn                   25.7                 h9ee0642_1    bioconda
    tk                        8.6.13          noxft_h4845f30_101    conda-forge
    tktable                   2.10                 h8bc8fbc_6    conda-forge
    tornado                   6.1              py36h8f6f2f9_1    conda-forge
    trimmomatic               0.39                 hdfd78af_2    bioconda
    trinity                   2.8.5                h8b12597_5    bioconda
    tzdata                    2024b                hc8b5060_0    conda-forge
    unzip                     6.0                  h611a1e1_0
    vphaser2                  2.0                 h7a259b3_14    bioconda
    wheel                     0.37.1             pyhd8ed1ab_0    conda-forge
    xorg-libice               1.0.10               h7f98852_0    conda-forge
    xorg-libsm                1.2.2                h470a237_5    conda-forge
    xorg-libx11               1.8.10               h4f16b4b_0    conda-forge
    xorg-libxau               1.0.11               hb9d3cd8_1    conda-forge
    xorg-libxdmcp             1.1.5                hb9d3cd8_0    conda-forge
    xorg-libxext              1.3.6                hb9d3cd8_0    conda-forge
    xorg-libxfixes            6.0.1                hb9d3cd8_0    conda-forge
    xorg-libxi                1.8.2                hb9d3cd8_0    conda-forge
    xorg-libxrender           0.9.11               hb9d3cd8_1    conda-forge
    xorg-libxtst              1.2.5                hb9d3cd8_3    conda-forge
    xorg-xorgproto            2024.1               hb9d3cd8_1    conda-forge
    xz                        5.2.6                h166bdaf_0    conda-forge
    yaml                      0.2.5                h7f98852_2    conda-forge
    zlib                      1.2.13               h4ab18f5_6    conda-forge
    zstd                      1.5.6                ha6fb4c9_0    conda-forge
  27. commands of viral-ngs

    bin/interhost.py
    
    Enter a subcommand to view additional information:
    
    interhost.py snpEff [...]
        Annotate variants in VCF file with translation consequences
        using snpEff.
    
    interhost.py align_mafft [...]
        Run the mafft alignment on the input FASTA file.
    
    interhost.py multichr_mafft [...]
        Run the mafft alignment on a series of chromosomes provided
        in sample-partitioned FASTA files. Output as FASTA. (i.e.
        file1.fasta would contain chr1, chr2, chr3; file2.fasta
        would also contain chr1, chr2, chr3)
    
    bin/ncbi.py
    
    Enter a subcommand to view additional information:
    
    ncbi.py tbl_transfer [...]
        This function takes an NCBI TBL file describing features on
        a genome (genes, etc.) and transfers them to a new genome.
    
    ncbi.py tbl_transfer_prealigned [...]
        This breaks out the ref and alt sequences into separate
        fasta files, and then creates unified files containing the
        reference sequence first and the alt second. Each of these
        unified files is then passed as a cmap to
        tbl_transfer_common. This function expects to receive one
        fasta file containing a multialignment of a single
        segment/chromosome along with the respective reference
        sequence for that segment/chromosome. It also expects a
        reference containing all reference segments/chromosomes, so
        that the reference sequence can be identified in the input
        file by name. It also expects a list of reference tbl files,
        where each file is named according to the ID present for
        its corresponding sequence in the refFasta. For each
        non-reference sequence present in the inputFasta, two files
        are written: a fasta containing the segment/chromosome for
        the same, along with its corresponding feature table
        as created by tbl_transfer_common.
    
    ncbi.py fetch_fastas [...]
        This function downloads and saves the FASTA files from the
        Genbank CoreNucleotide database given a given list of
        accession IDs.
    
    ncbi.py fetch_feature_tables [...]
        This function downloads and saves feature tables from the
        Genbank CoreNucleotide database given a given list of
        accession IDs.
    
    ncbi.py fetch_genbank_records [...]
        This function downloads and saves full flat text records from
        Genbank CoreNucleotide database given a given list of
        accession IDs.
    
    ncbi.py prep_genbank_files [...]
        Prepare genbank submission files. Requires .fasta and .tbl
        files as input, as well as numerous other metadata files for
        the submission. Creates a directory full of files (.sqn in
        particular) that can be sent to GenBank.
    
    ncbi.py prep_sra_table [...]
        This is a very lazy hack that creates a basic table that can
        be pasted into various columns of an SRA submission
        spreadsheet. It probably doesn't work in all cases.
  28. ~/Scripts/check_sequence_differences.py

    #!/usr/bin/env python3
    
    from Bio import AlignIO
    import sys
    
    # Check if correct arguments are provided
    if len(sys.argv) != 2:
        print("Usage: python check_sequence_differences.py <input_file>")
        sys.exit(1)

    # Get the input file name from the command-line arguments
    input_file = sys.argv[1]

    # Read the alignment from the input CLUSTAL file
    alignment = AlignIO.read(input_file, "clustal")

    # Extract the sequences for easy comparison
    seq_op = alignment[0].seq
    seq_hsv1 = alignment[1].seq
    seq_hsv_klinik = alignment[2].seq

    # Initialize a list to store positions with differences
    differences = []

    # Iterate over each position in the alignment
    for i in range(len(seq_op)):
        op_base = seq_op[i]
        hsv1_base = seq_hsv1[i]
        hsv_klinik_base = seq_hsv_klinik[i]

        # Compare the sequences at the current position
        if op_base != hsv1_base or op_base != hsv_klinik_base or hsv1_base != hsv_klinik_base:
            differences.append((i + 1, op_base, hsv1_base, hsv_klinik_base))

    # Print the differences
    if differences:
        print("Differences found at the following positions:")
        for diff in differences:
            pos, op_base, hsv1_base, hsv_klinik_base = diff
            print(f"Position {pos}: OP297860.1 = {op_base}, HSV1_S1-1 = {hsv1_base}, HSV-Klinik_S2-1 = {hsv_klinik_base}")
    else:
        print("No differences found between the sequences.")
  29. ~/Scripts/summarize_snippy_res.py

    import pandas as pd
    import glob
    import argparse
    import os
    
    #python3 summarize_snps_indels.py snippy_HDRNA_01/snippy
    
    # Known issue: the record for position 2365295 is wrong. In HDRNA_01_K010
    # the base should be 'G', because HDRNA_01_K010.csv contains:
    #   CP133676,2365295,snp,A,G,G:178 A:0
    #
    # The current output is as follows:
    #   CP133676,2365295,A,snp,A,A,A,A,A,A,A,A,A,A,None,,,,,,None,None
    #   CP133676,2365295,A,snp,A,A,A,A,A,A,A,A,A,A,nan,,,,,,nan,nan
    # grep -v "None,,,,,,None,None" summary_snps_indels.csv > summary_snps_indels_.csv
    # BUG: CP133676,2365295,A,snp,A,A,A,A,A,A,A,A,A,A,nan,,,,,,nan,nan
    
    def main(base_directory):
        # List of isolate identifiers
        isolates = ["HSV1_S1", "HSV-Klinik_S2"]
        expected_columns = ["CHROM", "POS", "REF", "ALT", "TYPE", "EFFECT", "LOCUS_TAG", "GENE", "PRODUCT"]
    
        # Find all CSV files in the directory and its subdirectories
        csv_files = glob.glob(os.path.join(base_directory, '**', '*.csv'), recursive=True)
    
        # Create an empty DataFrame to store the summary
        summary_df = pd.DataFrame()
    
        # Iterate over each CSV file
        for file_path in csv_files:
            # Extract isolate identifier from the file name
            isolate = os.path.basename(file_path).replace('.csv', '')
            df = pd.read_csv(file_path)
    
            # Ensure all expected columns are present, adding missing ones as empty columns
            for col in expected_columns:
                if col not in df.columns:
                    df[col] = None
    
            # Extract relevant columns
            df = df[expected_columns]
    
            # Ensure consistent data types
            df = df.astype({"CHROM": str, "POS": int, "REF": str, "ALT": str, "TYPE": str, "EFFECT": str, "LOCUS_TAG": str, "GENE": str, "PRODUCT": str})
    
            # Add the isolate column with the ALT allele
            df[isolate] = df["ALT"]
    
            # Drop columns that are not needed for the summary
            df = df.drop(["ALT"], axis=1)
    
            if summary_df.empty:
                summary_df = df
            else:
                summary_df = pd.merge(summary_df, df, on=["CHROM", "POS", "REF", "TYPE", "EFFECT", "LOCUS_TAG", "GENE", "PRODUCT"], how="outer")
    
        # Fill missing values with the REF allele for each isolate column
        for isolate in isolates:
            if isolate in summary_df.columns:
                summary_df[isolate] = summary_df[isolate].fillna(summary_df["REF"])
            else:
                summary_df[isolate] = summary_df["REF"]
    
        # Rename columns to match the required format
        summary_df = summary_df.rename(columns={
            "CHROM": "CHROM",
            "POS": "POS",
            "REF": "REF",
            "TYPE": "TYPE",
            "EFFECT": "Effect",
            "LOCUS_TAG": "Gene_name",
            "GENE": "Biotype",
            "PRODUCT": "Product"
        })
    
        # Replace any remaining None or NaN values in the non-isolate columns with empty strings
        summary_df = summary_df.fillna("")
    
        # Add empty columns for Impact, Functional_Class, Codon_change, Protein_and_nucleotide_change, Amino_Acid_Length
        summary_df["Impact"] = ""
        summary_df["Functional_Class"] = ""
        summary_df["Codon_change"] = ""
        summary_df["Protein_and_nucleotide_change"] = ""
        summary_df["Amino_Acid_Length"] = ""
    
        # Reorder columns
        cols = ["CHROM", "POS", "REF", "TYPE"] + isolates + ["Effect", "Impact", "Functional_Class", "Codon_change", "Protein_and_nucleotide_change", "Amino_Acid_Length", "Gene_name", "Biotype"]
        summary_df = summary_df.reindex(columns=cols)
    
        # Remove duplicate rows
        summary_df = summary_df.drop_duplicates()
    
        # Save the summary DataFrame to a CSV file
        output_file = os.path.join(base_directory, "summary_snps_indels.csv")
        summary_df.to_csv(output_file, index=False)
    
        print("Summary CSV file created successfully at:", output_file)
    
    if __name__ == "__main__":
        parser = argparse.ArgumentParser(description="Summarize SNPs and Indels from CSV files.")
        parser.add_argument("directory", type=str, help="Base directory containing the CSV files in subdirectories.")
        args = parser.parse_args()
    
        main(args.directory)
  30. ~/Scripts/merge_snps_indels.py

    import pandas as pd
    import argparse
    import os
    
    def merge_files(input_file1, input_file2, output_file):
        # Read the input files
        df1 = pd.read_csv(input_file1)
        df2 = pd.read_csv(input_file2, sep='\t')
        # Merge the dataframes on the 'POS' column, keeping only the rows that have common 'POS' values
        merged_df = pd.merge(df2, df1[['POS']], on='POS', how='inner')
        # Remove duplicate rows
        merged_df.drop_duplicates(inplace=True)
        # Save the merged dataframe to the output file
        merged_df.to_csv(output_file, index=False)
        print("Merged file created successfully at:", output_file)
    if __name__ == "__main__":
        parser = argparse.ArgumentParser(description="Merge two SNP and Indel files based on the 'POS' column.")
        parser.add_argument("input_file1", type=str, help="Path to the first input file (summary_snps_indels.csv).")
        parser.add_argument("input_file2", type=str, help="Path to the second input file (All_SNPs_indels_annotated.txt).")
        parser.add_argument("output_file", type=str, help="Path to the output file.")
        args = parser.parse_args()
        merge_files(args.input_file1, args.input_file2, args.output_file)
  31. ~/Scripts/convert_fasta_to_clustal.py

    #!/usr/bin/env python3
    
    from Bio import AlignIO
    import sys
    
    # Check if the correct number of arguments is provided
    if len(sys.argv) != 3:
        print("Usage: python convert_fasta_to_clustal.py <input_file> <output_file>")
        sys.exit(1)

    # Get the input and output file names from command-line arguments
    input_file = sys.argv[1]
    output_file = sys.argv[2]

    # Read the input FASTA file
    alignment = AlignIO.read(input_file, "fasta")

    # Write the alignment to the output CLUSTAL file
    with open(output_file, "w") as out_file:
        AlignIO.write(alignment, out_file, "clustal")

    print(f"Conversion complete! The CLUSTAL file is saved as {output_file}.")
  32. ~/Scripts/convert_clustal_to_clustal.py

    #!/usr/bin/env python3
    
    from Bio import AlignIO
    import sys
    
    # Check if correct arguments are provided
    if len(sys.argv) != 3:
        print("Usage: python convert_clustal_to_clustal.py <input_file> <output_file>")
        sys.exit(1)

    # Get the input and output file names from command-line arguments
    input_file = sys.argv[1]
    output_file = sys.argv[2]

    # Read the CLUSTAL alignment
    alignment = AlignIO.read(input_file, "clustal")

    # Extract sequences (assuming three sequences)
    op_seq = alignment[0].seq
    hsv1_seq = alignment[1].seq
    hsv_klinik_seq = alignment[2].seq

    # Make sure the sequences have the same length
    if len(op_seq) != len(hsv1_seq) or len(op_seq) != len(hsv_klinik_seq):
        print("Error: Sequences have different lengths!")
        sys.exit(1)

    # Prepare new sequences for HSV1 and HSV-Klinik
    new_hsv1_seq = []
    new_hsv_klinik_seq = []

    # Iterate through each position of the sequences
    for i in range(len(op_seq)):
        op_base = op_seq[i]
        hsv1_base = hsv1_seq[i]
        hsv_klinik_base = hsv_klinik_seq[i]

        # Apply the rules for replacing bases in HSV1_S1-1 and HSV-Klinik_S2-1
        if hsv1_base in ['N', '-']:
            # Replace with OP297860.1 base
            new_hsv1_seq.append(op_base)
        else:
            # Otherwise, keep the original base
            new_hsv1_seq.append(hsv1_base)

        if hsv_klinik_base in ['N', '-']:
            # Replace with OP297860.1 base
            new_hsv_klinik_seq.append(op_base)
        else:
            # Otherwise, keep the original base
            new_hsv_klinik_seq.append(hsv_klinik_base)

    # Update the sequences in the alignment
    alignment[1].seq = "".join(new_hsv1_seq)
    alignment[2].seq = "".join(new_hsv_klinik_seq)

    # Write the modified alignment back to a file in CLUSTAL format
    with open(output_file, "w") as out_file:
        AlignIO.write(alignment, out_file, "clustal")

    print(f"Conversion complete! The modified CLUSTAL file is saved as {output_file}.")
  33. ~/Scripts/convert_clustal_to_fasta.py

    #!/usr/bin/env python3
    
    from Bio import AlignIO
    import sys
    
    # Check if the correct number of arguments is provided
    if len(sys.argv) != 3:
        print("Usage: python convert_clustal_to_fasta.py <input_file> <output_file>")
        sys.exit(1)

    # Get the input and output file names from command-line arguments
    input_file = sys.argv[1]
    output_file = sys.argv[2]

    # Read the input CLUSTAL file
    alignment = AlignIO.read(input_file, "clustal")

    # Write the alignment to the output FASTA file
    with open(output_file, "w") as out_file:
        AlignIO.write(alignment, out_file, "fasta")

    print(f"Conversion complete! The FASTA file is saved as {output_file}.")
  34. ~/Scripts/filter_isnv.py

    #!/usr/bin/env python3
    
    import sys
    import pandas as pd
    
    # Check for correct command-line arguments
    if len(sys.argv) != 3:
        print("Usage: python filter_isnv.py <input_file> <min_freq>")
        sys.exit(1)

    input_file = sys.argv[1]
    min_freq = float(sys.argv[2])

    # Load the data into a pandas DataFrame
    data = pd.read_csv(input_file, sep='\t')

    # Filter out records where all records at the same position have iSNV_freq < min_freq
    def filter_isnv(data, min_freq):
        # Group data by 'chr' and 'pos' to check records at each position
        grouped = data.groupby(['chr', 'pos'])

        # Keep groups where at least one record has iSNV_freq >= min_freq
        filtered_data = grouped.filter(lambda x: any(x['iSNV_freq'] >= min_freq))
        return filtered_data

    # Apply the filter
    filtered_data = filter_isnv(data, min_freq)

    # Output the filtered data
    output_file = "filtered_" + input_file
    filtered_data.to_csv(output_file, sep='\t', index=False)

    print(f"Filtered data saved to {output_file}")
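
The position-level rule used by `filter_isnv.py` is easy to misread: a record is kept when any record at the same `(chr, pos)` site reaches `min_freq`, so low-frequency records at a passing site survive too. The sketch below restates that rule without pandas, purely as an illustration. The function name `filter_by_position`, the `sample` field, and the toy values are assumptions and are not part of the original script.

```python
def filter_by_position(records, min_freq):
    """Keep every record at a (chr, pos) site if any record there has iSNV_freq >= min_freq."""
    # Sites where at least one record reaches the threshold.
    passing_sites = {
        (rec["chr"], rec["pos"])
        for rec in records
        if rec["iSNV_freq"] >= min_freq
    }
    # Keep all records (in input order) that belong to a passing site.
    return [rec for rec in records if (rec["chr"], rec["pos"]) in passing_sites]


# Toy records (hypothetical values, for illustration only).
records = [
    {"chr": "chr1", "pos": 10, "sample": "A", "iSNV_freq": 0.02},
    {"chr": "chr1", "pos": 10, "sample": "B", "iSNV_freq": 0.30},
    {"chr": "chr1", "pos": 25, "sample": "A", "iSNV_freq": 0.01},
    {"chr": "chr1", "pos": 25, "sample": "B", "iSNV_freq": 0.03},
]

kept = filter_by_position(records, min_freq=0.05)
for rec in kept:
    print(rec["chr"], rec["pos"], rec["sample"], rec["iSNV_freq"])
```

With these toy values, both records at position 10 are kept (including the 0.02 one), while position 25 is dropped entirely because no record there reaches the threshold.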

Updated List of nf-core Pipelines (Released) Sorted by Stars (as of November 22, 2024)

Released

  • rnaseq: RNA sequencing analysis pipeline using STAR, RSEM, HISAT2, or Salmon with gene/isoform counts and extensive quality control (921 stars).

  • sarek: Analysis pipeline to detect germline or somatic variants from WGS/targeted sequencing (410 stars).

  • mag: Assembly and binning of metagenomes (217 stars).

  • scrnaseq: Single-cell RNA-seq pipeline for 10x Genomics data (214 stars).

  • chipseq: ChIP-seq peak-calling, QC, and differential analysis pipeline (195 stars).

  • ampliseq: Amplicon sequencing analysis using DADA2 and QIIME2 (188 stars).

  • atacseq: ATAC-seq peak-calling and QC analysis pipeline (188 stars).

  • nanoseq: Nanopore demultiplexing, QC, and alignment pipeline (180 stars).

  • fetchngs: Fetch metadata and raw FastQ files from public databases (151 stars).

  • eager: Ancient DNA analysis pipeline (148 stars).

  • rnafusion: RNA-seq analysis for gene-fusion detection (144 stars).

  • methylseq: Methylation analysis pipeline with Bismark or bwa-meth (140 stars).

  • taxprofiler: Multi-taxonomic profiling of metagenomic data (128 stars).

  • viralrecon: Viral assembly and intrahost variant calling (125 stars).

  • hic: Chromosome Conformation Capture data analysis (92 stars).

  • raredisease: Variant analysis for rare disease patients (90 stars).

  • cutandrun: Analysis pipeline for CUT&RUN and CUT&TAG experiments (81 stars).

  • pangenome: Renders sequences into a pangenome graph (78 stars).

  • smrnaseq: Small-RNA sequencing analysis pipeline (74 stars).

  • funcscan: Genome screening for functional and natural product gene sequences (74 stars).

  • differentialabundance: Differential abundance analysis (64 stars).

  • bacass: Bacterial assembly and annotation pipeline (63 stars).

  • hlatyping: HLA typing from NGS data (63 stars).

  • proteinfold: Protein 3D structure prediction pipeline (58 stars).

  • airrflow: Adaptive Immune Receptor Repertoire sequencing analysis (54 stars).

  • bactmap: Phylogeny from bacterial genomes (52 stars).

  • oncoanalyser: Cancer DNA/RNA analysis pipeline (50 stars).

  • rnasplice: RNA-seq alternative splicing analysis (46 stars).

  • demultiplex: Demultiplexing pipeline for sequencing data (44 stars).

  • epitopeprediction: Epitope prediction and annotation pipeline (42 stars).

  • rnavar: RNA variant calling pipeline (37 stars).

  • mhcquant: Quantify MHC-eluted peptides from mass spectrometry data (33 stars).

  • proteomicslfq: Proteomics label-free quantification pipeline (33 stars).

  • crisprseq: Analyze CRISPR edited data (31 stars).

  • isoseq: PacBio Iso-Seq genome annotation (29 stars).

  • circdna: Detect extrachromosomal circular DNA (28 stars).

  • readsimulator: Simulate sequencing reads (27 stars).

  • imcyto: Image Mass Cytometry analysis (25 stars).

  • multiplesequencealign: Multiple Sequence Alignment pipeline (22 stars).

  • bamtofastq: Convert BAM/CRAM to FastQ (22 stars).

  • metatdenovo: De novo assembly for metagenomics data (22 stars).

  • scnanoseq: Single-cell/nuclei pipeline with Nanopore data (21 stars).

Under development

  • spatialvi: Spatially-resolved gene counts for 10x Genomics Visium (52 stars).
  • circrna: Quantify circRNA, differential expression, and miRNA target prediction (45 stars).
  • scdownstream: Single-cell transcriptomics QC, integration, and visualization (41 stars).
  • lncpipe: Long non-coding RNA analysis from RNA-seq data (under development) (33 stars).
  • deepmodeloptim: Optimizes deep learning models for genomic applications (23 stars).
  • gwas: Genome Wide Association Studies pipeline (under construction) (22 stars).
  • genomeannotator: Gene structure identification in draft genomes (18 stars).
  • phaseimpute: Phases and imputes genetic data (17 stars).
  • genomeassembler: Genome assembly pipeline (16 stars).
  • pathogensurveillance: Population genomics for pathogen monitoring (13 stars).
  • variantbenchmarking: Benchmarking for variant-calling pipelines (premature) (12 stars).
  • omicsgenetraitassociation: Multi-omic data integration for trait analysis (10 stars).
  • phageannotator: Identifies and annotates phage sequences in metagenomic data (10 stars).
  • tfactivity: Differentially active transcription factor identification (9 stars).
  • createpanelrefs: Generate reference panels from sample datasets (8 stars).
  • datasync: Automation and system operation tasks (8 stars).
  • mcmicro: Processes multi-channel whole-slide images into single-cell data (8 stars).
  • metapep: Processes metagenomes to epitopes and beyond (8 stars).
  • variantcatalogue: Creates variant catalogues for populations (8 stars).
  • tbanalyzer: Analysis pipeline for Mycobacterium tuberculosis complex (7 stars).
  • radseq: Variant-calling for RADseq sequencing data (6 stars).
  • meerpipe: Processes MeerKAT pulsar data for astronomy applications (5 stars).
  • rnadnavar: RNA and DNA integration for somatic mutation detection (5 stars).
  • spatialxe: (Details not specified) (5 stars).
  • drugresponseeval: Evaluates drug-response prediction models (4 stars).
  • rangeland: Analyzes satellite imagery for land-cover trends (4 stars).
  • genomeqc: Compares genome quality and annotations (3 stars).
  • methylarray: Processes Illumina methylation data (3 stars).
  • spinningjenny: Simulates industrial revolution with agent-based models (2 stars).
  • troughgraph: Quantitative permafrost landscape analysis (2 stars).
  • pacvar: Processes PacBio sequencing for WGS and targeted data (1 star).
  • sammyseq: Analyzes chromatin accessibility with SAMMY-seq data (1 star).
  • fastqrepair: Repairs and reorders corrupted FASTQ.gz files (0 stars).

Archived

  • deepvariant: Variant calling pipeline leveraging Google’s DeepVariant (40 stars).
  • quantms: Quantitative mass spectrometry workflow supporting DDA-LFQ, DDA-Isobaric, and DIA-LFQ (31 stars).
  • scflow: RNA-seq analysis for single-cell and single-nuclei data (recommended: nf-core/scdownstream) (23 stars).
  • exoseq: Exome sequencing and variant calling pipeline (recommended: nf-core/sarek) (16 stars).
  • smartseq2: Processes single-cell RNA-seq data from the SmartSeq2 protocol (15 stars).
  • vipr: Viral genome assembly and low-frequency variant calling (14 stars).
  • denovohybrid: Hybrid genome assembly combining long and short reads (under construction) (8 stars).
  • crisprvar: Evaluates genome editing experiment outcomes (WIP) (5 stars).
  • ddamsproteomics: Quantitative shotgun mass spectrometry for proteomics (4 stars).
  • neutronstar: De novo assembly pipeline for 10X linked-reads using Supernova (3 stars).
  • ssds: Single-stranded DNA sequencing pipeline (1 star).
  • liverctanalysis: Pipeline for liver CT image analysis (under construction) (0 stars).

The 5 Most Important Insurance Policies for Homeowners

https://www.dieversicherer.de/versicherer/wohnen/news/versicherungen-fuer-hausbesitzer-144090

  1. Residential building insurance (Wohngebäudeversicherung): a must for every homeowner
  2. Natural hazards insurance (Elementarschadenversicherung): extreme weather events are on the rise
  3. Liability for homeowners: who exactly needs home and landowner liability insurance (Haus- und Grundbesitzerhaftpflichtversicherung)?
    • Owners of multi-family houses
    • Landlords of single-family houses
    • Owners of undeveloped plots of land
    • Owners of single-family houses with a granny flat (Einliegerwohnung). Private liability insurance only provides cover if the policyholder uses the whole single-family house themselves, except for up to three rooms. Otherwise the owner needs home and landowner liability insurance.
    • Owners of apartments in buildings constructed for a community of owners (Eigentümergemeinschaft). The apartment owner's liability insurance covers only the risks arising from the apartment, the associated cellar room, and any demarcated parking space.
  4. Household contents insurance (Hausratversicherung)
  5. Additional cover for photovoltaic systems, glazing (glass breakage insurance), and oil tanks

Transposon analysis for nanopore sequencing

  1. install mambaforge https://conda-forge.org/miniforge/ (recommended)

    #download Mambaforge-24.9.2-0-Linux-x86_64.sh from website
    chmod +x Mambaforge-24.9.2-0-Linux-x86_64.sh
    ./Mambaforge-24.9.2-0-Linux-x86_64.sh
    
    To activate this environment, use:
        micromamba activate /home/jhuang/mambaforge
    Or to execute a single command in this environment, use:
        micromamba run -p /home/jhuang/mambaforge mycommand
    installation finished.
    
    Do you wish to update your shell profile to automatically initialize conda?
    This will activate conda on startup and change the command prompt when activated.
    If you'd prefer that conda's base environment not be activated on startup,
      run the following command when conda is activated:
    
    conda config --set auto_activate_base false
    
    You can undo this by running `conda init --reverse $SHELL`? [yes|no]
    [no] >>> yes
    no change     /home/jhuang/mambaforge/condabin/conda
    no change     /home/jhuang/mambaforge/bin/conda
    no change     /home/jhuang/mambaforge/bin/conda-env
    no change     /home/jhuang/mambaforge/bin/activate
    no change     /home/jhuang/mambaforge/bin/deactivate
    no change     /home/jhuang/mambaforge/etc/profile.d/conda.sh
    no change     /home/jhuang/mambaforge/etc/fish/conf.d/conda.fish
    no change     /home/jhuang/mambaforge/shell/condabin/Conda.psm1
    no change     /home/jhuang/mambaforge/shell/condabin/conda-hook.ps1
    no change     /home/jhuang/mambaforge/lib/python3.12/site-packages/xontrib/conda.xsh
    no change     /home/jhuang/mambaforge/etc/profile.d/conda.csh
    modified      /home/jhuang/.bashrc
    ==> For changes to take effect, close and re-open your current shell. <==
    no change     /home/jhuang/mambaforge/condabin/conda
    no change     /home/jhuang/mambaforge/bin/conda
    no change     /home/jhuang/mambaforge/bin/conda-env
    no change     /home/jhuang/mambaforge/bin/activate
    no change     /home/jhuang/mambaforge/bin/deactivate
    no change     /home/jhuang/mambaforge/etc/profile.d/conda.sh
    no change     /home/jhuang/mambaforge/etc/fish/conf.d/conda.fish
    no change     /home/jhuang/mambaforge/shell/condabin/Conda.psm1
    no change     /home/jhuang/mambaforge/shell/condabin/conda-hook.ps1
    no change     /home/jhuang/mambaforge/lib/python3.12/site-packages/xontrib/conda.xsh
    no change     /home/jhuang/mambaforge/etc/profile.d/conda.csh
    no change     /home/jhuang/.bashrc
    No action taken.
    WARNING conda.common.path.windows:_path_to(100): cygpath is not available, fallback to manual path conversion
    WARNING conda.common.path.windows:_path_to(100): cygpath is not available, fallback to manual path conversion
    Added mamba to /home/jhuang/.bashrc
    ==> For changes to take effect, close and re-open your current shell. <==
    Thank you for installing Mambaforge!
    
    Close your terminal window and open a new one, or run:
    #source ~/mambaforge/bin/activate
    conda --version
    mamba --version
    
    https://github.com/conda-forge/miniforge/releases
    Note
    
        * After installation, please make sure that you do not have the Anaconda default channels configured.
            conda config --show channels
            conda config --remove channels defaults
            conda config --add channels conda-forge
            conda config --show channels
            conda config --set channel_priority strict
            #conda clean --all
            conda config --remove channels biobakery
    
        * !!!!Do not install anything into the base environment as this might break your installation. See here for details.!!!!
    
    # --Deprecated method: mamba installing on conda--
    #conda install -n base --override-channels -c conda-forge mamba 'python_abi=*=*cp*'
    #    * Note that installing mamba into any other environment than base is not supported.
    #
    #conda activate base
    #conda install conda
    #conda uninstall mamba
    #conda install mamba

  2. Install the required tools in the mamba environment

    * Sniffles2: Detect structural variants, including transposons, from long-read alignments.
    * RepeatModeler2: Identify and classify transposons de novo.
    * RepeatMasker: Annotate known transposable elements using transposon libraries.
    * SVIM: An alternative structural variant caller optimized for long-read sequencing, if needed.
    * SURVIVOR: Consolidate structural variants across samples for comparative analysis.

    mamba deactivate
    # Create a new conda environment
    mamba create -n transposon_long python=3.6 -y

    # Activate the environment
    mamba activate transposon_long

    mamba install -c bioconda sniffles
    mamba install -c bioconda repeatmodeler repeatmasker

    # configure repeatmasker database
    mamba info --envs
    cd /home/jhuang/mambaforge/envs/transposon_long/share/RepeatMasker

    #mamba install python=3.6
    mamba install -c bioconda svim
    mamba install -c bioconda survivor
  1. Configuring RepeatMasker after installation involves setting up a transposable element (TE) database, such as Dfam or RepBase. Here’s a detailed guide:

    1. Locate the RepeatMasker Directory
    
    After installing RepeatMasker (via Conda or manually), the main program will reside in its installation directory.
    
        conda info --envs
        #Locate the path for your environment (transposon_long), and then:
        cd /home/jhuang/mambaforge/envs/transposon_long/share/RepeatMasker
    
    2. Install and choose a TE Database
    
        Dfam: A freely available TE database. Preferred for most users due to open access.
    
        cd /mnt/nvme0n1p1/ref
        #Download the Preprocessed Library:
        wget https://www.dfam.org/releases/Dfam_3.8/families/Dfam-RepeatMasker.lib.gz
        #Move the File to RepeatMasker Libraries:
        mv Dfam-RepeatMasker.lib.gz /home/jhuang/mambaforge/envs/transposon_long/share/RepeatMasker/Libraries/
        #Configure RepeatMasker to Use Dfam: Re-run the RepeatMasker configuration script and specify this library.
    
        # Move Dfam data to the RepeatMasker directory
        mv Dfam.h5 /path/to/RepeatMasker/Libraries/
        mv Dfam.embl /path/to/RepeatMasker/Libraries/
    
        # The Dfam library has been already installed.
    
        RepBase: A more comprehensive TE database but requires a license for academic or commercial use.
            Download the library files (e.g., .tar.gz) provided by RepBase.
            Extract them to the directory of your choice.
            https://www.girinst.org/server/RepBase/index.php
            tar -zxvf repbase_library.tar.gz -C /path/to/repbase/
    
    3. Configure RepeatMasker
    
        #Run the configuration script in the RepeatMasker installation directory:
        cd /home/jhuang/mambaforge/envs/transposon_long/share/RepeatMasker
        ./configure
    
        Enter Selection: 5
        Building FASTA version of RepeatMasker.lib .........
        Building RMBlast frozen libraries..
        The program is installed with a the following repeat libraries:
        File: /home/jhuang/mambaforge/envs/transposon_long/share/RepeatMasker/Libraries/Dfam.h5
        Database: Dfam
        Version: 3.3
        Date: 2020-11-09
    
        Dfam - A database of transposable element (TE) sequence alignments and HMMs.
    
        Total consensus sequences: 6953
        Total HMMs: 6915
    
        conda activate transposon_long
        #When using HMMER with RepeatMasker, it automatically looks for the Dfam.h5 file in the Libraries/ directory, not a custom library name specified with -lib.
        #If you're using HMMER and the Dfam.h5 file, the -lib option should not be used. Simply run RepeatMasker like this:
        RepeatMasker -species "YourSpecies" -pa 4 CP020463.fasta
  2. Test the installed tools

    # Check versions
    sniffles --version
    RepeatModeler -h
    RepeatMasker -h
    svim --help
    SURVIVOR --help
    mamba install -c conda-forge perl r
  3. Align Long Reads to the WT Reference

    Use Minimap2 for aligning your reads:
    
      #for sample in 1 2 3 4 5 7 8 9 10; do
      for sample in WT; do
          minimap2 --MD -t 60 -ax map-ont CP020463.fasta ./batch1_depth25/trycycler_${sample}/reads.fastq | samtools sort -o ${sample}.sorted.bam
          samtools index ${sample}.sorted.bam
      done
  4. Call Structural Variants with Sniffles2: a fast structural variant caller for long-read sequencing. Sniffles2 accurately detects SVs at the germline, somatic, and population level for PacBio and Oxford Nanopore read data.

    Detect structural variants in each sample using Sniffles2:
    
        sniffles -m WT.sorted.bam -v WT.vcf -s 10 -l 50 -t 60
    
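        -s 10: Requires at least 10 reads to support an SV for reporting.
        -l 50: Reports SVs that are at least 50 base pairs long.
        -t 60: Uses 60 threads for faster processing.
        Note: -m/-v/-s/-l are the Sniffles 1.x option names; Sniffles2 uses --input, --vcf, --minsupport, and --minsvlen instead.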
    
      for sample in WT 1 2 3 4 5 7 8 9 10; do
          minimap2 --MD -t 60 -ax map-ont CP020463.fasta ./batch1_depth25/trycycler_${sample}/reads.fastq | samtools sort -o ${sample}.sorted.bam
          samtools index ${sample}.sorted.bam
          sniffles -m ${sample}.sorted.bam -v ${sample}.vcf -s 10 -l 50 -t 60
      done
      for sample in WT 1 2 3 4 5 7 8 9 10; do
          bcftools filter -e "QUAL < 20 || INFO/SVTYPE != 'INS'" ${sample}.vcf > ${sample}_filtered.vcf
      done
      #!!!!WT has only one record as expected!!!!
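The bcftools filter above can also be written out in a few lines of Python, which makes it easier to see exactly which records survive. This is only an illustrative sketch and not part of the original pipeline: the function names and the three example records are made up, and it assumes the caller writes SVTYPE=INS into the INFO column. Records whose QUAL is missing (".") are kept, which is a choice made for this sketch.

```python
def parse_info(info_field):
    """Turn a VCF INFO string such as 'SVTYPE=INS;SVLEN=1200' into a dict."""
    info = {}
    for entry in info_field.split(";"):
        if "=" in entry:
            key, value = entry.split("=", 1)
            info[key] = value
        elif entry:
            info[entry] = True  # flag without a value, e.g. PRECISE
    return info


def keep_insertions(vcf_lines, min_qual=20.0):
    """Yield header lines plus records with SVTYPE=INS and QUAL >= min_qual.

    Roughly mirrors: bcftools filter -e "QUAL < 20 || INFO/SVTYPE != 'INS'"
    """
    for line in vcf_lines:
        line = line.rstrip("\n")
        if line.startswith("#"):
            yield line
            continue
        fields = line.split("\t")
        qual, info = fields[5], parse_info(fields[7])
        if qual != "." and float(qual) < min_qual:
            continue  # low-quality call
        if info.get("SVTYPE") != "INS":
            continue  # not an insertion
        yield line


# Hypothetical records, for illustration only.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "CP020463\t529416\t1\tN\t<INS>\t60\tPASS\tPRECISE;SVTYPE=INS;SVLEN=1300",
    "CP020463\t700000\t2\tN\t<DEL>\t60\tPASS\tSVTYPE=DEL;SVLEN=-200",
    "CP020463\t900000\t3\tN\t<INS>\t5\tPASS\tSVTYPE=INS;SVLEN=80",
]
kept = list(keep_insertions(example))
for record in kept:
    print(record)
```

Only the two header lines and the insertion at position 529416 pass; the deletion and the low-quality insertion are dropped.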
  5. Annotate Transposable Elements

    Build a Custom Transposon Library (Optional but Recommended):
        Use RepeatModeler2 to identify and classify transposable elements in your WT genome.
    
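        #NOTE: RepeatModeler is normally given a database built with its own BuildDatabase tool
        #(BuildDatabase -name CP020463_db CP020463.fasta); the makeblastdb call below is the variant tried here.
        makeblastdb -in CP020463.fasta -dbtype nucl -out CP020463_db -parse_seqids
        blastdbcmd -db CP020463_db -info
        RepeatModeler -database CP020463_db -pa 8
    
        #esearch -db nucleotide -query "CP020463" | efetch -format gb | grep -A 1 "translation" > CP020463_proteins.fasta
        #awk '{if(NR%2==1){print ">acc"NR/2} else {print $0}}' CP020463.protein.faa > CP020463_with_accessions.faa
    
        #TODO: DEBUG_NEXT_MONDAY! (RepeatModeler works on nucleotide sequences, so the protein database below is unlikely to be a valid input.)
        makeblastdb -in CP020463.protein.faa -dbtype prot -out CP020463_protein_db
        blastdbcmd -db CP020463_protein_db -info
        RepeatModeler -database CP020463_protein_db -pa 8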
    
    This creates a transposon library in FASTA format.
    
    Annotate Insertions with RepeatMasker:
    
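        Use the transposon library (or a database like Dfam) to annotate the detected insertions. RepeatMasker reads FASTA, not VCF, so the inserted sequences have to be written to a FASTA file first (sample1_insertions.fasta is a placeholder name):
    
    RepeatMasker -lib transposons.fasta -dir output/ sample1_insertions.fasta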
    
    This step determines if detected insertions match known or de novo transposons.
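Because RepeatMasker takes FASTA input, the inserted sequences have to be pulled out of the VCF before they can be annotated. The sketch below shows one way to do that. It assumes the caller wrote the inserted bases into the ALT column; records with a symbolic allele such as <INS> carry no sequence and are skipped. The function name, the sample name, and the two example records are hypothetical.

```python
def insertions_to_fasta(vcf_lines, sample_name):
    """Return FASTA text for every insertion whose ALT column holds real bases."""
    records = []
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and empty lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, alt, info = fields[0], fields[1], fields[4], fields[7]
        if "SVTYPE=INS" not in info.split(";"):
            continue  # only insertions are of interest here
        if alt.startswith("<") or set(alt.upper()) - set("ACGTN"):
            continue  # symbolic or missing allele: no sequence to annotate
        records.append(f">{sample_name}_{chrom}_{pos}\n{alt}")
    return "\n".join(records) + ("\n" if records else "")


# Hypothetical records, for illustration only.
example = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "CP020463\t529416\t1\tN\tACGTTGCA\t.\tPASS\tSVTYPE=INS;SVLEN=8",
    "CP020463\t1855752\t2\tN\t<INS>\t.\tPASS\tSVTYPE=INS;SVLEN=1300",
]
fasta = insertions_to_fasta(example, "sample1")
print(fasta, end="")
```

The resulting text can be saved to a file and passed to RepeatMasker as the sequence argument in place of the VCF.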
  6. Compare Insertions Across Samples

    Merge Variants Across Samples: Use SURVIVOR to merge and compare the detected insertions in all samples against the WT:
    
    SURVIVOR merge input_vcfs.txt 1000 1 1 1 0 30 merged.vcf
    
        Input: List of VCF files from Sniffles2.
        Output: A consolidated VCF file with shared and unique variants.
    
    Filter WT Insertions:
    
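        Identify transposons present only in the non-WT samples (1–5 and 7–10) by subtracting WT variants using bcftools. bcftools isec expects bgzip-compressed, indexed VCFs, so compress and index both files first:
    
            bcftools isec WT.vcf.gz merged.vcf.gz -p comparison_results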
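The SURVIVOR merge call above takes a list file followed by six positional values. As I read the SURVIVOR usage text, these are: the maximum distance between breakpoints (1000 bp), the minimum number of supporting input files (1), whether the SV type must agree (1), whether the strand must agree (1), whether to estimate the distance from the SV size (0 = off), and the minimum SV size (30 bp). The sketch below writes the list file and assembles the same command with named arguments. The helper names are hypothetical, and the per-sample file names assume the ${sample}_filtered.vcf outputs from the filtering step.

```python
import tempfile
from pathlib import Path

# Sample names as used in the loops above (there is no sample 6).
SAMPLES = ["WT", "1", "2", "3", "4", "5", "7", "8", "9", "10"]


def write_vcf_list(list_file, samples=SAMPLES, suffix="_filtered.vcf"):
    """Write one VCF path per line, the format SURVIVOR merge reads."""
    paths = [f"{sample}{suffix}" for sample in samples]
    Path(list_file).write_text("\n".join(paths) + "\n")
    return paths


def survivor_merge_command(list_file, out_vcf, max_distance=1000, min_callers=1,
                           same_type=1, same_strand=1, estimate_distance=0,
                           min_size=30):
    """Build the argument list in the positional order SURVIVOR merge expects."""
    values = [max_distance, min_callers, same_type, same_strand,
              estimate_distance, min_size]
    return ["SURVIVOR", "merge", str(list_file), *map(str, values), str(out_vcf)]


# Write the list into a temporary directory so the demo leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    list_path = Path(tmp) / "input_vcfs.txt"
    write_vcf_list(list_path)
    listed = list_path.read_text().splitlines()

command = survivor_merge_command("input_vcfs.txt", "merged.vcf")
print(" ".join(command))
# SURVIVOR merge input_vcfs.txt 1000 1 1 1 0 30 merged.vcf
```

In a real run the list would be passed to subprocess.run(command, check=True) from the directory that holds the VCFs.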
  7. Validate and Visualize

    Visualize with IGV: Use IGV to inspect insertion sites in the alignment and confirm quality.
    
    igv.sh
    
    Validate Findings:
        Perform PCR or additional sequencing for key transposon insertion sites to confirm results.
  8. Alternatives to TEPID for Long-Read Data

    If you’re looking for transposon-specific tools for long reads:
    
        REPET: A robust transposon annotation tool compatible with assembled genomes.
        EDTA (Extensive de novo TE Annotator):
            A pipeline to identify, classify, and annotate transposons.
            Works directly on your assembled genomes.
    
            perl EDTA.pl --genome WT.fasta --type all
  9. The WT.vcf file in the pipeline is generated by detecting structural variants (SVs) in the wild-type (WT) genome aligned against itself or using it as a baseline reference. Here’s how you can generate the WT.vcf:

    Steps to Generate WT.vcf
    1. Align WT Reads to the WT Reference Genome
    
    The goal here is to create an alignment of the WT sequencing data to the WT reference genome to detect any self-contained structural variations, such as native insertions, deletions, or duplications.
    
    Command using Minimap2:
    
    minimap2 -ax map-ont WT.fasta WT_reads.fastq | samtools sort -o WT.sorted.bam
    
    Index the BAM file:
    
    samtools index WT.sorted.bam
    
    2. Detect Structural Variants with Sniffles2
    
    Run Sniffles2 on the WT alignment to call structural variants:
    
    sniffles --input WT.sorted.bam --vcf WT.vcf
    
    This step identifies:
    
        Native transposons and insertions present in the WT genome.
        Other structural variants that are part of the reference genome or sequencing artifacts.
    
    Key parameters to consider:
    
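        --minsupport (Sniffles2; -s/--min_support in Sniffles 1.x): Adjust based on your WT sequencing coverage.
        --max_distance (Sniffles 1.x): Define proximity for merging variants.
        --minsvlen (Sniffles2; -l/--min_length in Sniffles 1.x): Set a minimum SV size (e.g., >50 bp for transposons).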
  10. Clean and Filter the WT.vcf

    To ensure the WT.vcf only includes relevant transposons or SVs:
    
        Use bcftools or similar tools to filter out low-confidence variants:
    
        bcftools filter -e "QUAL < 20 || INFO/SVTYPE != 'INS'" WT.vcf > WT_filtered.vcf
    
            This removes low-quality calls and focuses on insertions (INS) relevant for transposon detection.
        Optionally, annotate the WT.vcf with known transposons using tools like RepeatMasker.
    
    The WT.vcf acts as a baseline for comparison:
    Variants detected in your samples (1–5 and 7–10) are compared to those in the WT to identify novel insertions (potential transposons).
    Shared insertions between the WT and samples are excluded as native to the WT genome.
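The baseline comparison described above boils down to a set difference with some positional slack, because the same insertion is rarely reported at exactly the same coordinate in every sample. A minimal sketch, assuming a 10 bp tolerance (an arbitrary choice) and using illustrative positions; the function name is hypothetical:

```python
def novel_insertions(sample_positions, wt_positions, tolerance=10):
    """Return sample positions that have no WT insertion within `tolerance` bp."""
    return [
        pos for pos in sample_positions
        if all(abs(pos - wt) > tolerance for wt in wt_positions)
    ]


# Illustrative positions: one insertion is shared with the WT, two are not.
wt = [1855752]
sample_1 = [529416, 1855752, 2422820]
print(novel_insertions(sample_1, wt))  # [529416, 2422820]
```

The insertion at 1855752 is treated as native to the WT and dropped, leaving the two candidate transposon sites.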
  11. In this pipeline, the WT.fasta (reference genome) is typically a high-quality genome sequence from a database or a well-annotated version of your species’ genome. It is not assembled from the WT.fastq sequencing reads in this context. Here’s why:

    Why Use a Reference Genome (WT.fasta) from a Database?
    
        Higher Quality and Completeness:
            Database references (e.g., NCBI, Ensembl) are typically well-assembled, highly polished, and annotated. They serve as a reliable baseline for variant detection.
    
        Consistency:
            Using a standard reference ensures consistent comparisons across your WT and samples (1–5 and 7–10). Variants detected will be relative to this reference, not influenced by possible assembly errors.
    
        Saves Time:
            Assembling a reference genome from WT reads requires significant computational effort. Using an existing reference streamlines the analysis.
    
    Alternative: Assembling WT from FASTQ
    
    If you don’t have a high-quality reference genome (WT.fasta) and must rely on your WT FASTQ reads:
    
        Assemble the genome from your WT.fastq:
            Use long-read assemblers like Flye, Canu, or Shasta to create a draft genome.
    
        flye --nano-raw WT.fastq --out-dir WT_assembly --genome-size 
    Polish the assembly using tools like Racon (with the same reads) or Medaka for higher accuracy.
    Use the assembled and polished genome as your WT.fasta reference for further steps.

    Key Takeaways:
        If you have access to a reliable, high-quality reference genome, use it as the WT.fasta.
        Only assemble WT.fasta from raw reads (WT.fastq) if no database reference is available for your organism.

Structural Variant Calling for Nanopore Sequencing (edited)

  1. install mambaforge https://conda-forge.org/miniforge/ (recommended)

    #download Mambaforge-24.9.2-0-Linux-x86_64.sh from website
    chmod +x Mambaforge-24.9.2-0-Linux-x86_64.sh
    ./Mambaforge-24.9.2-0-Linux-x86_64.sh
    
    To activate this environment, use:
        micromamba activate /home/jhuang/mambaforge
    Or to execute a single command in this environment, use:
        micromamba run -p /home/jhuang/mambaforge mycommand
    installation finished.
    
    Do you wish to update your shell profile to automatically initialize conda?
    This will activate conda on startup and change the command prompt when activated.
    If you'd prefer that conda's base environment not be activated on startup,
      run the following command when conda is activated:
    
    conda config --set auto_activate_base false
    
    You can undo this by running `conda init --reverse $SHELL`? [yes|no]
    [no] >>> yes
    no change     /home/jhuang/mambaforge/condabin/conda
    no change     /home/jhuang/mambaforge/bin/conda
    no change     /home/jhuang/mambaforge/bin/conda-env
    no change     /home/jhuang/mambaforge/bin/activate
    no change     /home/jhuang/mambaforge/bin/deactivate
    no change     /home/jhuang/mambaforge/etc/profile.d/conda.sh
    no change     /home/jhuang/mambaforge/etc/fish/conf.d/conda.fish
    no change     /home/jhuang/mambaforge/shell/condabin/Conda.psm1
    no change     /home/jhuang/mambaforge/shell/condabin/conda-hook.ps1
    no change     /home/jhuang/mambaforge/lib/python3.12/site-packages/xontrib/conda.xsh
    no change     /home/jhuang/mambaforge/etc/profile.d/conda.csh
    modified      /home/jhuang/.bashrc
    ==> For changes to take effect, close and re-open your current shell. <==
    no change     /home/jhuang/mambaforge/condabin/conda
    no change     /home/jhuang/mambaforge/bin/conda
    no change     /home/jhuang/mambaforge/bin/conda-env
    no change     /home/jhuang/mambaforge/bin/activate
    no change     /home/jhuang/mambaforge/bin/deactivate
    no change     /home/jhuang/mambaforge/etc/profile.d/conda.sh
    no change     /home/jhuang/mambaforge/etc/fish/conf.d/conda.fish
    no change     /home/jhuang/mambaforge/shell/condabin/Conda.psm1
    no change     /home/jhuang/mambaforge/shell/condabin/conda-hook.ps1
    no change     /home/jhuang/mambaforge/lib/python3.12/site-packages/xontrib/conda.xsh
    no change     /home/jhuang/mambaforge/etc/profile.d/conda.csh
    no change     /home/jhuang/.bashrc
    No action taken.
    WARNING conda.common.path.windows:_path_to(100): cygpath is not available, fallback to manual path conversion
    WARNING conda.common.path.windows:_path_to(100): cygpath is not available, fallback to manual path conversion
    Added mamba to /home/jhuang/.bashrc
    ==> For changes to take effect, close and re-open your current shell. <==
    Thank you for installing Mambaforge!
    
    Close your terminal window and open a new one, or run:
    #source ~/mambaforge/bin/activate
    conda --version
    mamba --version
    
    https://github.com/conda-forge/miniforge/releases
    Note
    
        * After installation, please make sure that you do not have the Anaconda default channels configured.
            conda config --show channels
            conda config --remove channels defaults
            conda config --add channels conda-forge
            conda config --show channels
            conda config --set channel_priority strict
            #conda clean --all
            conda config --remove channels biobakery
    
        * !!!!Do not install anything into the base environment as this might break your installation. See here for details.!!!!
    
    # --Deprecated method: mamba installing on conda--
    #conda install -n base --override-channels -c conda-forge mamba 'python_abi=*=*cp*'
    #    * Note that installing mamba into any other environment than base is not supported.
    #
    #conda activate base
    #conda install conda
    #conda uninstall mamba
    #conda install mamba

  2. Install the required tools in the mamba environment

    * Sniffles2: Detect structural variants, including transposons, from long-read alignments.
    * RepeatModeler2: Identify and classify transposons de novo.
    * RepeatMasker: Annotate known transposable elements using transposon libraries.
    * SVIM: An alternative structural variant caller optimized for long-read sequencing, if needed.
    * SURVIVOR: Consolidate structural variants across samples for comparative analysis.

    mamba deactivate
    # Create a new conda environment
    mamba create -n transposon_long python=3.6 -y

    # Activate the environment
    mamba activate transposon_long

    mamba install -c bioconda sniffles
    mamba install -c bioconda repeatmodeler repeatmasker

    # configure repeatmasker database
    mamba info --envs
    cd /home/jhuang/mambaforge/envs/transposon_long/share/RepeatMasker

    #mamba install python=3.6
    mamba install -c bioconda svim
    mamba install -c bioconda survivor
  1. Test the installed tools

    # Check versions
    sniffles --version
    RepeatModeler -h
    RepeatMasker -h
    svim --help
    SURVIVOR --help
    mamba install -c conda-forge perl r
  2. Data Preparation

    Raw Signal Data: Nanopore devices generate electrical signal data as DNA passes through the nanopore.
    Basecalling: Tools like Guppy or Dorado are used to convert raw signals into nucleotide sequences (FASTQ files).
  3. Preprocessing

    Quality Filtering: Remove low-quality reads using tools like Filtlong or NanoFilt.
    Adapter Trimming: Identify and remove sequencing adapters with tools like Porechop.
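To make the quality-filtering step concrete, the sketch below shows the kind of test such a filter applies to each read: a minimum length and a minimum mean quality. It is not how Filtlong or NanoFilt are implemented. It assumes Phred+33 encoded qualities and averages them via error probabilities instead of raw scores, which is my understanding of how these tools average; the thresholds and the three reads are made up.

```python
import math


def mean_quality(quality_string, offset=33):
    """Average Phred quality, computed via error probabilities."""
    if not quality_string:
        return 0.0
    probs = [10 ** (-(ord(char) - offset) / 10) for char in quality_string]
    return -10 * math.log10(sum(probs) / len(probs))


def filter_reads(fastq_lines, min_length=1000, min_quality=10.0):
    """Yield (header, sequence, quality) for reads passing both thresholds."""
    lines = [line.rstrip("\n") for line in fastq_lines]
    # A FASTQ record is four lines: header, sequence, '+', quality.
    for i in range(0, len(lines) - 3, 4):
        header, seq, _, qual = lines[i:i + 4]
        if len(seq) >= min_length and mean_quality(qual) >= min_quality:
            yield header, seq, qual


# Hypothetical reads: one good, one low quality, one too short.
reads = [
    "@read_good", "ACGT" * 3, "+", "I" * 12,
    "@read_lowq", "ACGT" * 3, "+", "#" * 12,
    "@read_short", "ACG", "+", "III",
]
passed = list(filter_reads(reads, min_length=10, min_quality=10.0))
print([header for header, _, _ in passed])  # ['@read_good']
```

Averaging error probabilities penalises a handful of very poor bases more strongly than a plain mean of the Phred scores would.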
  4. (Optional) Variant Calling for SNP and Indel Detection:

    Tools like Medaka, Longshot, or Nanopolish analyze the aligned reads to identify SNPs and small indels.
  5. Alignment and Structural Variant Calling: Long-read tools such as Sniffles or SVIM detect large insertions, deletions, and other structural variants (e.g., interspersed repeats).

      #NOTE that the ./batch1_depth25/trycycler_WT/reads.fastq and F24A430001437_BACctmoD/BGI_result/Separate/${sample}/1.Cleandata/${sample}.filtered_reads.fq.gz are the same!
      ./4/1.Cleandata/4.filtered_reads.fq.gz
      ./3/1.Cleandata/3.filtered_reads.fq.gz
      ./2/1.Cleandata/2.filtered_reads.fq.gz
      ./8/1.Cleandata/8.filtered_reads.fq.gz
      ./5/1.Cleandata/5.filtered_reads.fq.gz
      ./WT/1.Cleandata/WT.filtered_reads.fq.gz
      ./9/1.Cleandata/9.filtered_reads.fq.gz
      ./10/1.Cleandata/10.filtered_reads.fq.gz
      ./7/1.Cleandata/7.filtered_reads.fq.gz
      ./1/1.Cleandata/1.filtered_reads.fq.gz
    
      # -- Alignment and Detect structural variants in each sample using SVIM (failed due to the strange output from SVIM!)
      #mamba install -c bioconda ngmlr
      mamba install -c bioconda svim
      for sample in WT 1 2 3 4 5 7 8 9 10; do
          svim reads --aligner ngmlr --nanopore svim_reads_ngmlr_${sample} F24A430001437_BACctmoD/BGI_result/Separate/${sample}/1.Cleandata/${sample}.filtered_reads.fq.gz CP020463.fasta  --cores 10;
      done
      #for sample in WT 1 2 3 4 5 7 8 9 10; do
      for sample in 1; do
          #INS,INV,DUP:TANDEM,DUP:INT,BND
          svim reads svim_reads_minimap2_${sample} F24A430001437_BACctmoD/BGI_result/Separate/${sample}/1.Cleandata/${sample}.filtered_reads.fq.gz CP020463.fasta --aligner minimap2 --nanopore --cores 20 --types INS --min_sv_size 100 --sequence_alleles --insertion_sequences --read_names;
      done
      #svim alignment svim_alignment_minmap2_1_re 1.sorted.bam CP020463_.fasta --types INS --sequence_alleles --insertion_sequences --read_names
    
      # -- Results1: Detect structural variants using Minimap2+Sniffles2:
    
      Minimap2: A commonly used aligner for nanopore sequencing data.
          Align Long Reads to the WT Reference using Minimap2
    
      for sample in WT 1 2 3 4 5 7 8 9 10; do
          minimap2 --MD -t 60 -ax map-ont CP020463.fasta ./batch1_depth25/trycycler_${sample}/reads.fastq | samtools sort -o ${sample}.sorted.bam
          samtools index ${sample}.sorted.bam
      done
    
      #sniffles -m WT.sorted.bam -v WT.vcf -s 10 -l 50 -t 60
      #  -s 20: Requires at least 20 reads to support an SV for reporting. --> 10
      #  -l 50: Reports SVs that are at least 50 base pairs long.
      #  -t 4: Uses 4 threads for faster processing. --> 60
    
      for sample in WT 1 2 3 4 5 7 8 9 10; do
          sniffles -m ${sample}.sorted.bam -v ${sample}.vcf -s 10 -l 50 -t 60
          #QUAL < 20 ||
          bcftools filter -e "INFO/SVTYPE != 'INS'" ${sample}.vcf > ${sample}_filtered.vcf
      done
    
      # -- Results2: Detect structural variants using NGMLR+Sniffles2
    
      for sample in WT 1 2 3 4 5 7 8 9 10; do
          #ERROR: No MD string detected! Check bam file! Otherwise generate using e.g. samtools. --> No results!
          #sniffles -m svim_reads_minimap2_${sample}/${sample}.filtered_reads.fq.minimap2.coordsorted.bam -v sniffles_minimap2_${sample}.vcf -s 10 -l 50 -t 60
          #bcftools filter -e "INFO/SVTYPE != 'INS'" sniffles_minimap2_${sample}.vcf > sniffles_minimap2_${sample}_filtered.vcf
          #Using the NGMLR alignments written by SVIM instead:
          sniffles -m svim_reads_ngmlr_${sample}/${sample}.filtered_reads.fq.ngmlr.coordsorted.bam -v sniffles_ngmlr_${sample}.vcf -s 10 -l 50 -t 60
          bcftools filter -e "INFO/SVTYPE != 'INS'" sniffles_ngmlr_${sample}.vcf > sniffles_ngmlr_${sample}_filtered.vcf
      done
    
      # -- Compare the results1 and results2, and check them each position in IGV!
    
      #minimap2+sniffles2
      (base) jhuang@WS-2290C:/mnt/md1/DATA_md1/Data_Patricia_Transposon$ cut -d$'\t' -f2 WT_filtered.vcf | grep -v "##"
      POS
      1855752
      (base) jhuang@WS-2290C:/mnt/md1/DATA_md1/Data_Patricia_Transposon$ cut -d$'\t' -f2 1_filtered.vcf | grep -v "##"
      POS
      529416
      1855752
      2422820
      (base) jhuang@WS-2290C:/mnt/md1/DATA_md1/Data_Patricia_Transposon$ cut -d$'\t' -f2 2_filtered.vcf | grep -v "##"
      POS
      529416
      1855752
      2422820
      2424590
      (base) jhuang@WS-2290C:/mnt/md1/DATA_md1/Data_Patricia_Transposon$ cut -d$'\t' -f2 3_filtered.vcf | grep -v "##"
      POS
      529416
      529416
      529418
      1855752
      2422820
      (base) jhuang@WS-2290C:/mnt/md1/DATA_md1/Data_Patricia_Transposon$ cut -d$'\t' -f2 4_filtered.vcf | grep -v "##"
      POS
      55682
      529416
      1855752
      2422820
      2424590
      (base) jhuang@WS-2290C:/mnt/md1/DATA_md1/Data_Patricia_Transposon$ cut -d$'\t' -f2 5_filtered.vcf | grep -v "##"
      POS
      529416
      1855752
      2422820
      (base) jhuang@WS-2290C:/mnt/md1/DATA_md1/Data_Patricia_Transposon$ cut -d$'\t' -f2 7_filtered.vcf | grep -v "##"
      POS
      518217
      1855752
      2424590
      (base) jhuang@WS-2290C:/mnt/md1/DATA_md1/Data_Patricia_Transposon$ cut -d$'\t' -f2 8_filtered.vcf | grep -v "##"
      POS
      529416
      1855752
      2422820
      (base) jhuang@WS-2290C:/mnt/md1/DATA_md1/Data_Patricia_Transposon$ cut -d$'\t' -f2 9_filtered.vcf | grep -v "##"
      POS
      529416
      1855752
      2422820
      2424590
      (base) jhuang@WS-2290C:/mnt/md1/DATA_md1/Data_Patricia_Transposon$ cut -d$'\t' -f2 10_filtered.vcf | grep -v "##"
      POS
      529416
      1855752
      2422818
      2424590
    
      #ngmlr+sniffles2
      (base) jhuang@WS-2290C:/mnt/md1/DATA_md1/Data_Patricia_Transposon$ cut -d$'\t' -f2 sniffles_ngmlr_WT_filtered.vcf | grep -v "##"
      POS
      (base) jhuang@WS-2290C:/mnt/md1/DATA_md1/Data_Patricia_Transposon$ cut -d$'\t' -f2 sniffles_ngmlr_1_filtered.vcf | grep -v "##"
      POS
      529419
      2422819
      (base) jhuang@WS-2290C:/mnt/md1/DATA_md1/Data_Patricia_Transposon$ cut -d$'\t' -f2 sniffles_ngmlr_2_filtered.vcf | grep -v "##"
      POS
      529418
      2422820
      (base) jhuang@WS-2290C:/mnt/md1/DATA_md1/Data_Patricia_Transposon$ cut -d$'\t' -f2 sniffles_ngmlr_3_filtered.vcf | grep -v "##"
      POS
      529418
      2422820
      (base) jhuang@WS-2290C:/mnt/md1/DATA_md1/Data_Patricia_Transposon$ cut -d$'\t' -f2 sniffles_ngmlr_4_filtered.vcf | grep -v "##"
      POS
      529419
      2422820
      (base) jhuang@WS-2290C:/mnt/md1/DATA_md1/Data_Patricia_Transposon$ cut -d$'\t' -f2 sniffles_ngmlr_5_filtered.vcf | grep -v "##"
      POS
      529419
      2422820
      (base) jhuang@WS-2290C:/mnt/md1/DATA_md1/Data_Patricia_Transposon$ cut -d$'\t' -f2 sniffles_ngmlr_7_filtered.vcf | grep -v "##"
      POS
      518219
      2422820
      (base) jhuang@WS-2290C:/mnt/md1/DATA_md1/Data_Patricia_Transposon$ cut -d$'\t' -f2 sniffles_ngmlr_8_filtered.vcf | grep -v "##"
      POS
      529419
      2422820
      (base) jhuang@WS-2290C:/mnt/md1/DATA_md1/Data_Patricia_Transposon$ cut -d$'\t' -f2 sniffles_ngmlr_9_filtered.vcf | grep -v "##"
      POS
      529419
      2422820
      (base) jhuang@WS-2290C:/mnt/md1/DATA_md1/Data_Patricia_Transposon$ cut -d$'\t' -f2 sniffles_ngmlr_10_filtered.vcf | grep -v "##"
      POS
      529418
      2422820
    
      #~/Tools/csv2xls-0.4/csv_to_xls.py sniffles_ngmlr_WT_filtered.vcf sniffles_ngmlr_1_filtered.vcf sniffles_ngmlr_2_filtered.vcf sniffles_ngmlr_3_filtered.vcf sniffles_ngmlr_4_filtered.vcf sniffles_ngmlr_5_filtered.vcf sniffles_ngmlr_7_filtered.vcf sniffles_ngmlr_8_filtered.vcf sniffles_ngmlr_9_filtered.vcf sniffles_ngmlr_10_filtered.vcf -d$'\t' -o putative_transposons2.xls
    
      # -- Filtering low-complexity insertions using RepeatMasker (TODO: how to use RepeatModeler to generate own lib?)
    
      python vcf_to_fasta.py variants.vcf variants.fasta
      #python filter_low_complexity.py variants.fasta filtered_variants.fasta retained_variants.fasta
      #Using RepeatMasker to filter the low-complexity fasta, the used h5 lib is
      /home/jhuang/mambaforge/envs/transposon_long/share/RepeatMasker/Libraries/Dfam.h5    #1.9G
      python /home/jhuang/mambaforge/envs/transposon_long/share/RepeatMasker/famdb.py -i /home/jhuang/mambaforge/envs/transposon_long/share/RepeatMasker/Libraries/Dfam.h5 names 'bacteria' | head
      Exact Matches
      =============
      2 bacteria (blast name), Bacteria (scientific name), eubacteria (genbank common name), Monera (in-part), Procaryotae (in-part), Prokaryota (in-part), Prokaryotae (in-part), prokaryote (in-part), prokaryotes (in-part)

      Non-exact Matches
      =================
      1783272 Terrabacteria group (scientific name)
      91061 Bacilli (scientific name), Bacilli Ludwig et al. 2010 (authority), Bacillus/Lactobacillus/Streptococcus group (synonym), Firmibacteria (synonym), Firmibacteria Murray 1988 (authority)
      1239 Bacillaeota (synonym), Bacillaeota Oren et al. 2015 (authority), Bacillota (synonym), Bacillus/Clostridium group (synonym), clostridial firmicutes (synonym), Clostridium group firmicutes (synonym), Firmacutes (synonym), firmicutes (blast name), Firmicutes (scientific name), Firmicutes corrig. Gibbons and Murray 1978 (authority), Low G+C firmicutes (synonym), low G+C Gram-positive bacteria (common name), low GC Gram+ (common name)

      Summary of Classes within Firmicutes:
        * Bacilli (includes many common pathogenic and non-pathogenic Gram-positive bacteria, taxid=91061)
            * Bacillus (e.g., Bacillus subtilis, Bacillus anthracis)
            * Staphylococcus (e.g., Staphylococcus aureus, Staphylococcus epidermidis)
            * Streptococcus (e.g., Streptococcus pneumoniae, Streptococcus pyogenes)
            * Listeria (e.g., Listeria monocytogenes)
        * Clostridia (includes many anaerobic species like Clostridium and Clostridioides)
        * Erysipelotrichia (intestinal bacteria, some pathogenic)
        * Tissierellia (less-studied, veterinary relevance)
        * Mollicutes (cell wall-less, includes Mycoplasma species)
        * Negativicutes (includes some Gram-negative, anaerobic species)

      RepeatMasker -species Bacilli -pa 4 -xsmall variants.fasta
      python extract_unmasked_seq.py variants.fasta.masked unmasked_variants.fasta

      #bcftools filter -i 'QUAL>30 && INFO/SVLEN>100' variants.vcf -o filtered.vcf
      #
      #bcftools view -i 'SVTYPE="INS"' variants.vcf | bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%INFO\n' > insertions.txt
      #mamba install -c bioconda vcf2fasta
      #vcf2fasta variants.vcf -o insertions.fasta
      #grep "SEQS" variants.vcf | awk '{ print $1, $2, $4, $5, $8 }' > insertions.txt
      #python3 filtering_low_complexity.py
      #
      #vcftools --vcf input.vcf --recode --out filtered_output --minSVLEN 100
      #bcftools filter -e 'INFO/SEQS ~ "^(G+|C+|T+|A+){4,}"' variants.vcf -o filtered.vcf

      # -- calculate the percentage of reads

      To calculate the percentage of reads that contain the insertion from the VCF entry, use the INFO and FORMAT fields provided in the VCF record.

      Step 1: Extract Relevant Information

      In the provided VCF entry:
        RE (Reads Evidence): 733, the total number of reads supporting the insertion.
        GT (Genotype): 1/1, which indicates a homozygous insertion, meaning all reads covering this region are expected to have the insertion.
        AF (Allele Frequency): 1, a 100% allele frequency, indicating that every read in this sample supports the insertion.
        DR (Depth Reference): 0, the number of reads supporting the reference allele.
        DV (Depth Variant): 733, the number of reads supporting the variant allele (insertion).

      Step 2: Calculate Percentage of Reads Supporting the Insertion

      Using the formula:
        \text{Percentage of reads with insertion} = \frac{DV}{DR + DV} \times 100

      Substitute the values:
        \text{Percentage} = \frac{733}{0 + 733} \times 100 = 100\%

      Conclusion

      Based on the VCF record, 100% of the reads support the insertion, indicating that the insertion is fully present in the sample (homozygous insertion). This is consistent with the AF=1 and GT=1/1 fields.
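The read-support calculation above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original pipeline: the function names and the `GT:DR:DV` FORMAT layout are assumptions, and only the values DR=0 and DV=733 come from the record discussed above.

```python
def percent_supporting(dr: int, dv: int) -> float:
    """Return the percentage of reads supporting the variant: DV / (DR + DV) * 100."""
    total = dr + dv
    if total == 0:
        raise ValueError("no reads cover this site (DR + DV == 0)")
    return 100.0 * dv / total


def parse_format_fields(format_keys: str, sample_values: str) -> dict:
    """Pair the colon-separated FORMAT keys with the sample column values."""
    return dict(zip(format_keys.split(":"), sample_values.split(":")))


# Hypothetical FORMAT and sample columns carrying the values quoted above.
fields = parse_format_fields("GT:DR:DV", "1/1:0:733")
pct = percent_supporting(int(fields["DR"]), int(fields["DV"]))
print(f"{pct:.1f}% of reads support the insertion")
```

For a heterogeneous population the same function gives the mixed fraction, for example `percent_supporting(1, 3)` returns 75.0.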
  6. (failed) Analyzing the BAM file directly with custom scripts via the CIGAR string (failed due to too many short insertions!)

    transposons.fasta is a file containing the transposon sequences in FASTA format.
    python your_script.py input.bam reference.fasta transposons.fasta
    #Transposon_Sequence    Insertion_Frequency
    #Tn5                    10
    #Tn10                   5
    #Unknown                3
    
    python putative_transposons_with_counts.py mapping_WT.sorted.bam CP020463.fasta
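Since this attempt failed because of too many short insertions, a length threshold on the CIGAR `I` operations is one way to reduce the noise. The sketch below is not the author's `putative_transposons_with_counts.py`; it is a hypothetical, pure-Python illustration of the idea. The function name and the 50 bp default are assumptions.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
# CIGAR operations that advance the position on the reference.
REF_CONSUMING = set("MDN=X")


def insertions_from_cigar(pos: int, cigar: str, min_len: int = 50):
    """Yield (reference_position, length) for each insertion of at least min_len bases.

    pos is the 1-based leftmost mapping position (SAM column 4).
    The reported position is the last reference base before the inserted sequence.
    """
    ref_pos = pos
    for length_str, op in CIGAR_RE.findall(cigar):
        length = int(length_str)
        if op == "I" and length >= min_len:
            yield ref_pos - 1, length
        if op in REF_CONSUMING:
            ref_pos += length


# Hypothetical alignment: a 5 bp and a 1200 bp insertion in a read mapped at position 100.
example_cigar = "10M5I20M1200I30M2D5M"
print(list(insertions_from_cigar(100, example_cigar)))             # [(129, 1200)]
print(list(insertions_from_cigar(100, example_cigar, min_len=1)))  # [(109, 5), (129, 1200)]
```

With the default threshold only the 1200 bp insertion is reported; lowering `min_len` brings back the short insertions that swamped the original attempt.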
    
    rule trim_short_reads:
        input:
            "/data/short-reads.fq.gz"
        output:
            "/data/trimmed-short-reads.fasta"
        shell:
            "python3 trim_by_tag_length.py /data/short-reads.fq.gz 10 > /data/trimmed-short-reads.fasta"
    
    rule trim_long_reads:
        input:
            "/data/long-reads.fq.gz"
        output:
            "/data/trimmed-long-reads.fasta"
        shell:
            "python3 trim_by_tag_length.py /data/long-reads.fq.gz 92 > /data/trimmed-long-reads.fasta"
    
    rule install_bwa:
        output:
            "bwa-mem2-2.0pre2_x64-linux/bwa-mem2"
        shell:
            "curl -L https://github.com/bwa-mem2/bwa-mem2/releases/download/v2.0pre2/bwa-mem2-2.0pre2_x64-linux.tar.bz2 | tar jxf -"
    
    rule map_short_reads:
        input:
            "bwa-mem2-2.0pre2_x64-linux/bwa-mem2",
            "/data/reference.fasta",
            "/data/trimmed-short-reads.fasta"
        output:
            "/data/mapping.sam"
        shell:
            """
            bwa-mem2-2.0pre2_x64-linux/bwa-mem2 index /data/reference.fasta
            bwa-mem2-2.0pre2_x64-linux/bwa-mem2 mem /data/reference.fasta /data/trimmed-short-reads.fasta > /data/mapping.sam
            """
    
    rule map_long_reads:
        input:
            "/data/reference.fasta",
            "/data/trimmed-long-reads.fasta"
        output:
            "/data/mapping.bam"
        conda:
            "minimap2.yml"
        shell:
            """
            minimap2 -x map-ont -d reference /data/reference.fasta > /dev/null 2>&1
            minimap2 -c -a -o /data/mapping.nonunique.sam -N 1 -x map-ont reference /data/trimmed-long-reads.fasta
            samtools view -bq 1 /data/mapping.nonunique.sam > /data/mapping.bam
            """
    
    rule convert_sam_to_bam:
        input:
            "/data/mapping.sam"
        output:
            "/data/mapping.bam",
        conda:
            "samtools.yml"
        shell:
            "samtools view -S -b /data/mapping.sam > /data/mapping.bam"
    
    rule get_unmapped_reads:
        input:
            "/data/mapping.bam"
        output:
            "/data/mapping.sorted.bam"
        conda:
            "samtools.yml"
        shell:
            """
    #        samtools view -f 4 /data/mapping.bam > /data/unmapped.sam
    #        samtools view -S -b /data/unmapped.sam > /data/unmapped.bam
    #        samtools bam2fq /data/unmapped.bam | seqtk seq -A - > /data/unmapped.fa
            samtools sort  /data/mapping.bam -o /data/mapping.sorted.bam
            samtools index /data/mapping.sorted.bam
            """
    
    rule create_insertion_plot:
        input:
            "/data/mapping.sorted.bam"
        output:
            "/data/summary-stats.tsv"
        shell:
            """
            python3 ~/Scripts/sam_to_insert_plot.py /data/mapping.sorted.bam /data/reference.fasta > /data/summary-stats.tsv
            """
  7. Polishing of assembly: Use tools like Medaka to refine variant calls by leveraging consensus sequences derived from nanopore data.

      mamba install -c bioconda medaka
      # medaka_consensus takes the basecalled reads (-i) and the draft/reference sequence (-d)
      medaka_consensus -i reads.fastq -d reference.fasta -o polished_output -t 4
  8. Compare Insertions Across Samples

    Merge Variants Across Samples: Use SURVIVOR to merge and compare the detected insertions in all samples against the WT:
    
    SURVIVOR merge input_vcfs.txt 1000 1 1 1 0 30 merged.vcf
    
        Input: List of VCF files from Sniffles2.
        Output: A consolidated VCF file with shared and unique variants.
    
    Filter WT Insertions:
    
        Identify transposons present only in samples 1–9 by subtracting WT variants using bcftools:
    
            bcftools isec WT.vcf merged.vcf -p comparison_results
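Because the same insertion is often reported a few base pairs apart (for example 529416 vs. 529419 in the listings of step 5), an exact-position comparison can miss matches. The sketch below subtracts WT positions from a sample with a tolerance window. It is an illustration only: the function name and the 10 bp tolerance are assumptions, it treats the genome as a single contig, and the positions are copied from the `cut` outputs in step 5.

```python
from bisect import bisect_left


def novel_positions(sample_pos, wt_pos, tol=10):
    """Return the sample positions that have no WT position within tol bp."""
    wt_sorted = sorted(wt_pos)
    novel = []
    for p in sample_pos:
        # First WT position that is >= p - tol; it matches if it is also <= p + tol.
        i = bisect_left(wt_sorted, p - tol)
        if i < len(wt_sorted) and wt_sorted[i] <= p + tol:
            continue
        novel.append(p)
    return novel


# minimap2 + Sniffles positions from step 5 (WT_filtered.vcf and 1_filtered.vcf).
wt = [1855752]
sample_1 = [529416, 1855752, 2422820]
print(novel_positions(sample_1, wt))  # [529416, 2422820]

# The same helper shows which minimap2 calls lack an NGMLR call nearby (sample 1).
sample_1_ngmlr = [529419, 2422819]
print(novel_positions(sample_1, sample_1_ngmlr))  # [1855752]
```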
  9. Validate and Visualize

    Visualize with IGV: Use IGV to inspect insertion sites in the alignment and confirm quality.
    
    igv.sh
    
    Validate Findings:
        Perform PCR or additional sequencing for key transposon insertion sites to confirm results.
  10. Alternatives to TEPID for Long-Read Data

    If you’re looking for transposon-specific tools for long reads:
    
        REPET: A robust transposon annotation tool compatible with assembled genomes.
        EDTA (Extensive de novo TE Annotator):
            A pipeline to identify, classify, and annotate transposons.
            Works directly on your assembled genomes.
    
            perl EDTA.pl --genome WT.fasta --type all
  11. The WT.vcf file in the pipeline is generated by detecting structural variants (SVs) in the wild-type (WT) genome aligned against itself or using it as a baseline reference. Here’s how you can generate the WT.vcf:

    Steps to Generate WT.vcf
    1. Align WT Reads to the WT Reference Genome
    
    The goal here is to create an alignment of the WT sequencing data to the WT reference genome to detect any self-contained structural variations, such as native insertions, deletions, or duplications.
    
    Command using Minimap2:
    
    minimap2 -ax map-ont WT.fasta WT_reads.fastq | samtools sort -o WT.sorted.bam
    
    Index the BAM file:
    
    samtools index WT.sorted.bam
    
    2. Detect Structural Variants with Sniffles2
    
    Run Sniffles2 on the WT alignment to call structural variants:
    
    sniffles --input WT.sorted.bam --vcf WT.vcf
    
    This step identifies:
    
        Native transposons and insertions present in the WT genome.
        Other structural variants that are part of the reference genome or sequencing artifacts.
    
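    Key parameters to consider (the names below are the Sniffles1 long options; Sniffles2 uses --minsupport and --minsvlen for the first and third):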
    
        --min_support: Adjust based on your WT sequencing coverage.
        --max_distance: Define proximity for merging variants.
        --min_length: Set a minimum SV size (e.g., >50 bp for transposons).
  12. Clean and filter the WT.vcf (variant filtering): remove low-confidence variants based on read depth, quality scores, or allele frequency.

    To ensure the WT.vcf only includes relevant transposons or SVs:
    
        Use bcftools or similar tools to filter out low-confidence variants:
    
        bcftools filter -e "QUAL < 20 || INFO/SVTYPE != 'INS'" WT.vcf > WT_filtered.vcf
        bcftools filter -e "QUAL < 1 || INFO/SVTYPE != 'INS'" 1_.vcf > 1_filtered_.vcf
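To make the logic of the first filter expression explicit, here is a minimal pure-Python sketch that keeps only insertion records with QUAL of at least 20. It is not a replacement for bcftools: the function names and example records are hypothetical, and the sketch drops records with a missing QUAL (`.`), which may differ from how bcftools treats missing values.

```python
def keep_record(line: str, min_qual: float = 20.0, svtype: str = "INS") -> bool:
    """Return True if a VCF data line has QUAL >= min_qual and INFO/SVTYPE == svtype."""
    cols = line.rstrip("\n").split("\t")
    qual = cols[5]
    # INFO is a semicolon-separated list of KEY=VALUE pairs and bare flags.
    info = dict(
        item.split("=", 1) if "=" in item else (item, "")
        for item in cols[7].split(";")
    )
    if qual == ".":
        return False  # assumption: treat missing QUAL as low confidence
    return float(qual) >= min_qual and info.get("SVTYPE") == svtype


def filter_vcf(lines, **kwargs):
    """Pass header lines through unchanged and keep only the data lines that pass."""
    for line in lines:
        if line.startswith("#") or keep_record(line, **kwargs):
            yield line


# Hypothetical records for illustration only.
records = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "CP020463\t529416\tins1\tN\t<INS>\t60\tPASS\tPRECISE;SVTYPE=INS;SVLEN=1300",
    "CP020463\t600000\tins2\tN\t<INS>\t10\tPASS\tSVTYPE=INS;SVLEN=80",
    "CP020463\t700000\tdel1\tN\t<DEL>\t60\tPASS\tSVTYPE=DEL;SVLEN=-500",
]
for kept in filter_vcf(records):
    print(kept)
```

Passing `min_qual=1` reproduces the looser threshold used for the sample files in the second command.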
  13. NOTE that in this pipeline, the WT.fasta (reference genome) is typically a high-quality genome sequence from a database or a well-annotated version of your species’ genome. It is not assembled from the WT.fastq sequencing reads in this context. Here’s why:

    Why Use a Reference Genome (WT.fasta) from a Database?
    
        Higher Quality and Completeness:
            Database references (e.g., NCBI, Ensembl) are typically well-assembled, highly polished, and annotated. They serve as a reliable baseline for variant detection.
    
        Consistency:
            Using a standard reference ensures consistent comparisons across your WT and samples (1–9). Variants detected will be relative to this reference, not influenced by possible assembly errors.
    
        Saves Time:
            Assembling a reference genome from WT reads requires significant computational effort. Using an existing reference streamlines the analysis.
    
    Alternative: Assembling WT from FASTQ
    
    If you don’t have a high-quality reference genome (WT.fasta) and must rely on your WT FASTQ reads:
    
        Assemble the genome from your WT.fastq:
            Use long-read assemblers like Flye, Canu, or Shasta to create a draft genome.
    
        flye --nano-raw WT.fastq --out-dir WT_assembly --genome-size 
    Polish the assembly using tools like Racon (with the same reads) or Medaka for higher accuracy.
    Use the assembled and polished genome as your WT.fasta reference for further steps.

    Key Takeaways:

        If you have access to a reliable, high-quality reference genome, use it as the WT.fasta.
        Only assemble WT.fasta from raw reads (WT.fastq) if no database reference is available for your organism.
  14. Annotate Transposable Elements: Tools like ANNOVAR or SnpEff provide functional insights into the detected variants.

    # -- (successful!) MANUALLY Search for all found insertion sequences at https://tncentral.ncc.unesp.br/blast/ !
    # Or use the program available at https://github.com/danillo-alvarenga/tncomp_finder if there are numerous matches.
    #https://tncentral.ncc.unesp.br/report/te/Tn551-Y13600.1
    
    # -- (failed!) try TEPID for annotation
    mamba install tepid=0.10 -c bioconda
    #(tepid_env)
    for sample in WT 1 2 3 4 5 7 8 9 10; do
        tepid-map-se -x CP020463 -p 10 -n ${sample}_tepid -q  ../batch1_depth25/trycycler_${sample}/reads.fastq;
        tepid-discover -k -i -p 10 -n ${sample}_tepid -c ${sample}_tepid --se;
    done
    
    tepid-discover -k -i -p 10 -n 1_tepid -c 1.sorted.bam --se;
    
    tepid-refine [-h] [--version] [-k] [-i INSERTIONS] [-d DELETIONS]
                [-p PROC] -t TE -n NAME -c CONC -s SPLIT -a AL
    
    # -- (failed!) try EDTA for annotation
    https://github.com/oushujun/EDTA
    (transposon_long) mamba install -c conda-forge -c bioconda edta
    mamba install bioconda::rmblast  # cd RepeatMasker; ./configure
    EDTA.pl --genome CP020463.fasta --species others --threads 40
    
    For general-purpose TE annotation: EDTA, RepeatMasker, or RepeatScout are your best options.
    For de novo repeat identification: RepeatScout is highly effective.
    For LTR retrotransposons: Use LTR_retriever.
    For bacterial-specific annotations: Transposome, TEfinder, and ISfinder can be useful.
  15. Validation: Cross-validate with short-read sequencing data if available.

Whole Genome Sequencing: Pricing and Services from Dante Labs and Other Leading Providers

Whole genome sequencing (WGS) is an advanced genetic test that decodes your entire DNA, offering a detailed look at your genetic makeup. Several companies offer whole genome sequencing services, with varying levels of coverage, insights, and prices. Here’s a breakdown of the pricing and services from Dante Labs and other top providers.


1. Dante Labs

  • Price:
    • Basic Package: $599
    • Extended Package: $1,299
  • Service: Dante Labs provides true whole genome sequencing, covering all 3 billion base pairs in your DNA.
  • Additional Features: The extended package includes detailed health and wellness reports, ancestry insights, and traits analysis.
  • Data Access: You’ll receive full access to your raw genetic data, which can be downloaded for further analysis.
  • Turnaround Time: Typically a few weeks.

2. Veritas Genetics

  • Price: Approx. $999 to $1,200
  • Service: Offers true whole genome sequencing with 30x coverage, providing comprehensive health, ancestry, and trait insights.
  • Additional Features: Includes health-related reports and genetic trait analysis.
  • Data Access: Full genome data is provided, and updates are available as new research emerges.
  • Turnaround Time: A few weeks.

3. Nebula Genomics

  • Price:
    • Basic Whole Genome Sequencing: $299 (for 30x coverage)
    • Upgraded Services: Up to $999 or more.
  • Service: Provides whole genome sequencing and delivers periodic updates on new findings from genetic research.
  • Additional Features: Includes health insights and disease risk reports, with a focus on privacy and control over your genetic data.
  • Data Access: Full access to raw genome data and new insights as research progresses.
  • Turnaround Time: Typically a few weeks.

4. Genos

  • Price: Approx. $599 to $1,000
  • Service: Genos offers whole genome sequencing with a focus on health insights, disease risks, and traits.
  • Additional Features: Provides in-depth health reports based on your genetic data.
  • Data Access: Full access to your raw genomic data.
  • Turnaround Time: A few weeks.

5. Helix

  • Price: $999 to $1,500
  • Service: Helix offers whole genome sequencing in collaboration with other companies to provide genetic insights.
  • Additional Features: Works with partners to deliver reports on health, ancestry, and traits.
  • Data Access: You get access to your genetic data, though it’s mainly used for specific third-party reports.
  • Turnaround Time: A few weeks.

6. SelfDecode

  • Price: Approx. $299 (for basic sequencing) to $899 for full services.
  • Service: Whole genome sequencing through third-party labs with a focus on health and wellness insights.
  • Additional Features: Reports include genetic predispositions to diseases and traits.
  • Data Access: Full genome data with health insights.
  • Turnaround Time: A few weeks.

7. Fulgent Genetics

  • Price: Approx. $600 to $1,500
  • Service: Fulgent Genetics provides whole genome sequencing with a focus on medical and health genetics.
  • Additional Features: Includes health risk and disease reports, with genetic counseling services available.
  • Data Access: Full genome sequencing data with medical reports.
  • Turnaround Time: A few weeks.

8. 23andMe

  • Price:
    • Health + Ancestry Service: $199
    • Ancestry Service: $99
  • Service: 23andMe focuses on genotyping (not full sequencing) and provides ancestry and health-related insights.
  • Additional Features: Offers reports on ancestry, genetic traits, and some health information.
  • Data Access: Limited to selected markers rather than whole genome sequencing.
  • Turnaround Time: 2-3 weeks.

9. AncestryDNA

  • Price: Approx. $99 to $199
  • Service: Focuses on ancestry testing through genotyping (not full genome sequencing).
  • Additional Features: Provides ethnicity estimates and genealogical insights.
  • Data Access: Only selected markers, not full genome sequencing.
  • Turnaround Time: 6-8 weeks.

Summary of Pricing and Features:

| Provider | Price | Service | Full Genome Coverage | Additional Features |
|---|---|---|---|---|
| Dante Labs | $599 (Basic), $1,299 (Extended) | Whole genome sequencing | Yes | Health, ancestry, traits reports |
| Veritas Genetics | $999-$1,200 | Whole genome sequencing | Yes | Health, ancestry, traits reports |
| Nebula Genomics | $299 (Basic), $999 (Advanced) | Whole genome sequencing | Yes | Health insights, periodic updates |
| Genos | $599-$1,000 | Whole genome sequencing | Yes | Health, disease risk, traits reports |
| Helix | $999-$1,500 | Whole genome sequencing | Yes | Partnered reports (health, ancestry) |
| SelfDecode | $299-$899 | Whole genome sequencing | Yes | Health, wellness, disease risk reports |
| Fulgent Genetics | $600-$1,500 | Whole genome sequencing | Yes | Health, medical, disease risk reports |
| 23andMe | $99-$199 | Genotyping (not WGS) | No | Ancestry, health reports |
| AncestryDNA | $99-$199 | Genotyping (not WGS) | No | Ancestry, ethnicity reports |

Conclusion:

If you’re looking for true whole genome sequencing, companies like Dante Labs ($599), Nebula Genomics ($299), Genos ($599), and Veritas Genetics ($999) offer comprehensive services with full genome coverage. 23andMe and AncestryDNA are more affordable but provide genotyping services, which focus on selected genetic markers and do not offer a full analysis of your DNA.

For a complete genome analysis, Dante Labs, Nebula Genomics, and Genos are great options at various price points, while Veritas Genetics offers additional health insights at a higher cost.

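A PhD Job-Hunter's Guide to Not Getting Fleeced: What Exactly Are All Those Fancy Titles (Assistant Researcher, Distinguished Researcher, Full-Time Researcher...)?

https://www.163.com/dy/article/G7LQR1V105349YKB.html

Reform of China's higher education is in full swing, and positions under every conceivable name keep appearing.

  Many PhD graduates sign what amounts to an indenture contract without really understanding it, and regret it bitterly once they realize what they have agreed to. Frankly, with so many kinds of positions it is hard to make sense of them:

  Positions with the same name can be completely different at different universities, positions with different names can be exactly the same at different universities, and even a position with the same name at the same university can gradually drift into something else.

  With this in mind, we have put together a (not exhaustive) guide to job positions for PhDs. It mainly distinguishes the different kinds of postdoc and university positions, to help job-seeking PhDs see what is really on offer and avoid the pitfalls.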

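  01 Postdoctoral Positions

  After finishing a PhD, many people habitually look for a postdoc position for one term. For well-known recent reasons, going abroad for a postdoc may be harder than it used to be. The domestic postdoc system is more complicated than those abroad, and many people who want to do a postdoc in China are still unclear about it. Let us sort it out.

  China's postdoctoral system is less than forty years old and was first proposed by the Nobel laureate and Chinese-American physicist Tsung-Dao Lee (李政道). In July 1985 the State Council approved the establishment of postdoctoral research mobile stations, and China's postdoctoral system was formally established.

  In China and abroad alike, postdoctoral positions are special short-term posts requiring highly educated talent, set up at universities or research institutes, or at large enterprises, high-tech enterprises, start-up parks for returned overseas scholars, or research-and-production public institutions. They recruit PhD holders specifically and require them to carry out specific scientific research within a fixed term. A postdoc is only a form of work experience; it is neither a professional degree nor an administrative post.

  Most postdoc terms last two years; a few can be extended to three. Under national rules, postdocs are formal employees of the host institution but are not counted in its permanent staff quota (编制, bianzhi), and they must move on and leave the station when the term ends.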

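  I. Mobile-Station Postdocs

  1. Position Overview

  Precisely because postdocs are in effect in a state of flux until they find a permanent job, the first institutions to host postdoctoral research were called mobile stations (流动站). A mobile station is a body that manages postdocs, established under a first-level discipline at a university or research institute, and mobile-station postdocs are the mainstream of China's postdoc population.

  2. Salary and Benefits

  Most universities offer postdocs fairly good terms that are indeed on a par with those of regular staff, generally including a base salary, research performance allowances and dedicated funds, a housing subsidy, and childcare and school enrollment for children.

  In the early days of the postdoctoral system, pay was at one point very attractive, reportedly at the level of professors at the time. Later the situation deteriorated sharply and postdocs became poor value. As a result a large share of Chinese PhDs went abroad for postdocs, which hurt the return of talent. Over the last decade, after appeals from many sides, postdoc pay has risen rapidly, to a maximum of 600,000 RMB and an average of 300,000 RMB.

  [Figure: Support policies in a research institute's postdoc recruitment notice]

  However, this money does not necessarily all come from the university. Postdocs receive national funding of 100,000 RMB per person over two years; many provinces and cities have talent policies aimed specifically at postdocs, which not only pay money but also build dedicated talent apartments to solve the housing problem; the university certainly contributes a share as well, covering some of the benefits; and the research group can usually add a little more. Added together, it is a genuinely good deal.

  3. Performance Requirements

  Postdoc work also has performance requirements, which are essentially embodied in the requirements for leaving the station (出站). Postdocs are short-term staff dedicated to research; whether or not they are concurrently employed elsewhere, and whether or not they are full-time, they do not have to teach.

  In the end, performance evaluation still comes down to which projects you take on and how many papers you publish. As the number of PhD graduates grows, competition for postdoc positions has intensified and exit requirements have risen with it. Universities with a conscience at least raise the matching performance pay accordingly; those without simply pull the "withdraw from the station and refund the money" trick, fleecing postdocs at no cost to themselves.

  4. Career Prospects

  Although a postdoc is not a faculty position, many people still regard it as the starting point of an academic career, and indeed many PhDs begin independent research during their postdoc. Moreover, postdocs in China are centrally managed by the National Postdoctoral Management Committee (博管办), so the postdoc is not a degree but carries almost as much weight as one.

  So how well a postdoc goes matters a great deal for a PhD's academic career. In practice there are many cases of people whose doctoral work was unremarkable, who suddenly found their stride as postdocs and then landed a university faculty position.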

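  II. Workstation Postdocs

  1. Position Overview

  The claim that the postdoc stage is important applies only to mobile stations, not to workstations.

  A workstation (工作站) is a body that manages postdocs, set up within an enterprise, a research-and-production public institution, or a special regional organization. Although workstations were only introduced in 1997, there are now more of them than mobile stations, and many PhDs choose them after graduation.

  Workstations also solve the "same station, same discipline" problem that mobile stations cannot: they allow a PhD graduating from a university to enter a workstation and do research in the same first-level discipline.

  The term for workstation postdocs is generally two years, though it may need to be extended, for example because of research project progress. Enterprises are flexible in staffing: an extension is not necessarily made under postdoc status and can instead be made under a new employment contract.

  [Image source: Alibaba official website]

  2. Salary and Benefits

  The salary and benefits of workstation postdocs are borne by the host organization. Actual income is comparable to that of the enterprise's staff in the same post with the same seniority, and may be topped up through local government talent policies. Postdocs in the station also enjoy the same other welfare benefits as the organization's regular employees, including bonuses, publicly funded medical care, hardship subsidies, home-visit leave, and book and newspaper allowances. Because most enterprises able to set up a workstation are well qualified and willing to pay to retain people, the take-home pay of workstation postdocs remains competitive.

  3. Performance Requirements

  Workstation postdocs also have exit requirements, set by the host organization. Because they usually have to balance work duties with academic requirements, and because, generally speaking, enterprise research platforms are weaker than those of universities and institutes and the overall research atmosphere cannot compare with a university's, the exit requirements are on the low side.

  As a result a workstation postdoc carries little weight on a CV and may be looked down on in later job searches.

  4. Career Prospects

  Talent and credentials are an enterprise's calling card, and it cannot be ruled out that some enterprises set up workstations just for show or even to obtain funds fraudulently. PhDs who end up there are in for a miserable time.

  Some other enterprises manage postdocs poorly: nobody helps when a postdoc needs support, and when trouble does arise they pass the buck and drag their feet.

  Worse still, some enterprises appropriate the funding that postdocs have won, leaving them unable to do any research at all.

  Worst of all, once such an enterprise is investigated and its postdoc workstation is revoked, the postdocs in it can find themselves unemployed overnight. So investigate carefully before you go.

  [Image source: Chinese government official website]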

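  III. Faculty-Track Postdocs

  1. Position Overview

  The faculty-track postdoc (师资博后) is one of the products of higher education with Chinese characteristics.

  In the new century, against the background of university enrollment expansion and the rapid development of higher education, the shortage of high-quality teaching staff became increasingly apparent. To raise the academic level of young faculty and lengthen the probation period for new teachers, Zhejiang University pioneered the "faculty-track postdoc" policy in 2005, bringing some postdocs under the management of the teaching staff.

  The entry bar for faculty-track postdocs is higher than for ordinary postdocs: candidates must apply for the postdoc and for a lecturer post at the same time. After entering the station, in addition to research they also take on some teaching, in preparation for a future teaching career.

  2. Salary and Benefits

  The contract for a faculty-track postdoc is generally still two years. While in the station they receive the salary and benefits stipulated by the state, and a few universities give a token research start-up grant. In return, however, the exit requirements are higher than for mobile-station postdocs; otherwise, on what grounds would they stay on as faculty?

  [Figure: Faculty-track postdoc benefits at a certain university. Image source: the university's official website]

  3. Performance Requirements

  Viewed historically, Zhejiang University's introduction of the faculty-track postdoc was an exploration of the tenure track. In essence it selects outstanding PhDs as a reserve of future faculty; those whose performance falls short leave the station when the two years are up and no longer occupy one of the university's faculty slots.

  Now that full-time research positions under all sorts of names have sprung up across China, the faculty-track postdoc is no longer the best option for building a faculty and raising research output.

  It must be said, though, that compared with today's fleecing full-time research posts, Zhejiang University was quite decent: in the four years in which it ran the faculty-track postdoc scheme, about 70% of faculty-track postdocs became regular Zhejiang University faculty after leaving the station, a very high rate of promises kept.

  4. Career Prospects

  Of course, in today's hyper-competitive environment the faculty-track postdoc at some universities has become a serious trap: the performance required to exit must reach the level of an associate or even full professor, and the exit report must be written to the standard of a doctoral dissertation. Under such competition no one is spared; with no hope of being retained, faculty-track postdocs too end up getting fleeced.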

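  IV. Overseas Postdocs

  1. Position Overview

  Compared with domestic postdocs, postdocs abroad are very simple... except for the names.

  Abroad, a postdoc is simply a short-term senior employee tied to a project, and as long as there is enough money, the title can be anything:

  Postdoctoral, Postdoctoral Scientist, Postdoctoral Fellow, Postdoc Associate, Postdoctoral Researcher, Postdoctoral Scholar, Postdoc Research Scientist, Research Scientist, (Senior) Research Fellow, (Senior) Research Associate...

  All of these can be postdocs.

  [Figure: Postdoc recruitment at the US Centers for Disease Control and Prevention]

  2. Salary and Benefits

  A senior employee is still only a short-term hire: the term is generally two years, and the annual salary is mostly 40,000 to 50,000 US dollars. None of this really matters, though. If the boss likes you, a raise will follow and a new contract can even be signed; if the boss does not like you, you can be dismissed with a little advance notice.

  The much-vaunted contractual spirit of a commercial society still rests on the boss having the final say. After all, the boss is the one paying!

  For this reason there is no real postdoc management authority, and benefits such as research start-up funds, talent policies, or childcare simply do not exist; that is just not how things are done abroad.

  3. Performance Requirements

  Because the boss has the final say, evaluation has no hard standards at all. It really is a case of "if they say you are good, you are good, even if you are not; if they say you are not, you are not, even if you are". Whoever has the money calls the shots, and academia is no exception.

  4. Career Prospects

  In terms of career development, doing a postdoc in the group of a leading researcher abroad is still excellent: it makes for a shining academic CV. Perform well and leave with a strong recommendation letter from that researcher, and you have a trump card for the job market. If you can also publish a few papers in top journals, you can almost count on a place in one of the "X Talents Programs".

  Summary

  A postdoc is a springboard for a research career, and there are far too many examples of people whose careers were derailed by choosing the wrong one, so choose carefully. Pay particular attention to the text of the agreement: do not look only at the entry conditions and benefits, but make sure you understand the exit requirements and the terms for withdrawing from the station, so that you do not waste your time and energy and end up frustrated as well.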

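  02 University Positions

  Before discussing university positions, here is the "level-up" chart for university teachers.

  

  [Image source: the author]

  A PhD without any independent research experience has essentially no chance of being appointed professor. Usually the offer is a lecturer post, and only a few outstanding candidates get associate professor. Many universities have now stopped recruiting lecturers and replaced the post with a host of new titles that carry no bianzhi. There are plenty of traps here, so PhDs must take care to tell them apart when job hunting.

  I. Lecturer

  1. Position Overview

  A teaching assistant cannot teach a course independently, whereas a lecturer can independently teach one or more courses. Universities assume that PhDs are able to teach a course independently, so they start at lecturer or above.

  2. Salary and Benefits

  At present lecturers at most universities have no bianzhi, and the length of the contract depends on the agreement.

  A lecturer's salary is paid entirely by the university. Base pay is generally low, a few thousand RMB a month (there is no way around this, since it follows the national standard), but fortunately lecturers can earn teaching fees per class hour, which amount to the performance-related part. Teaching fees vary between universities, from a few dozen RMB to three-digit amounts.

  Lecturers are generally not hired under talent-recruitment schemes, but a PhD taking a lecturer post may still receive a modest research start-up grant and settling-in allowance. If social insurance is transferred to the locality, they can also benefit from local talent policies, whose generosity varies from place to place. Lecturers are regular university employees, so benefits such as childcare for their children are available; beyond that, do not expect much.

  There may also be lecturer posts with very generous packages; PhDs need to look carefully for them when job hunting.

  [Figure: Benefits offered by a certain university in recruiting lecturers. Image source: the university's official website]

  3. Performance Requirements

  In theory a lecturer only needs to teach well, but to level up and advance their career, lecturers still have to do research; otherwise how would they be promoted to associate professor? The specific requirements for title evaluation differ between universities.

  Note that a lecturer cannot set up an independent research group. In the humanities one can get by doing research alone; in science and engineering, where experiments are needed, most choose to join a leading researcher's group as a junior group leader, helping to supervise that researcher's students and getting a share of the credit for the results, which can later be used for title evaluation.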

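  II. Associate Professor / Associate Researcher

  1. Position Overview

  A few universities that recruit outstanding PhDs or postdocs under talent-recruitment schemes will offer associate professor / associate researcher directly, and some are generous enough to grant bianzhi straight away. This is the kind of post every PhD hoping to enter a university dreams of.

  2. Salary and Benefits

  With bianzhi, the contract term is just a formality and does not matter. The salary is also paid by the university and is somewhat higher than a lecturer's, as are the teaching fees. On the performance side, because they run research projects, there are bonuses for completing projects and for publishing papers.

  Under talent recruitment, a research start-up grant and a settling-in allowance are a must, though across regions and universities these two sums can differ by a factor of ten. If someone else's are bigger, there is no point being envious.

  Because the post carries bianzhi, social insurance must be transferred to the locality, so the holder naturally benefits from local talent policies, and mostly not at the lowest tier. Associate professors / associate researchers are also regular university employees, with the full range of benefits such as childcare, plus other benefits, stated and unstated.

  [Figure: A certain university's recruitment notice for lecturers / associate professors. Image source: the university's official website]

  3. Performance Requirements

  A university that grants an associate senior title (副高) certainly hopes the PhD will keep doing research and producing results, so research performance targets are set fairly high. This matches what the vast majority of PhDs want; the interests of both sides are aligned.

  There is one problem, however: holding an associate senior title does not necessarily make one a master's supervisor, so even someone who can set up an independent group may not be able to recruit graduate students. Some universities use the PI (Principal Investigator) system to coordinate this. In theory the PI follows the project, and whoever leads the project is the PI, but this still does not resolve the issue that graduate supervisors in China need certain qualifications.

  One approach is to register all graduate students nominally under a senior figure in the school while each associate-professor PI supervises students fully independently. Some universities find this too much trouble and simply have these associate professors join a senior professor's group as lecturers do; the advantage is that arrangements such as joint supervision are easier for them than for lecturers.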

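  III. Distinguished (特聘) Professor / Associate Professor, Distinguished Researcher / Associate Researcher, or Full-Time (专职) Researcher / Associate Researcher

  1. Position Overview

  As the old saying goes: a "tong jinshi" (同进士, "equivalent to a jinshi") is not the same as a jinshi, and a "ru furen" (如夫人, "like a wife") is not as good as a wife. A prefix on a title lowers its value, and the same holds today.

  Whether "distinguished" or "full-time", these posts sit outside the regular university post system. The prefix merely serves to set the title apart, and the key difference is: no bianzhi!

  [Figure: A certain university's recruitment of distinguished researchers / associate researchers. Image source: the university's official website]

  2. Salary and Benefits

  Generally speaking, no bianzhi means temporary worker, and for a temporary worker the contract is crucial. These research-only posts usually come as a single contract covering two appointment periods of equal length, for example "3+3". Do not assume that six years is a long time in which you can hold one job while taking your time to look for a better one. Bear in mind that the short-term opportunism of some universities knows no bottom line.

  As for pay, ordinary temporary workers earn less than regular employees, but distinguished or full-time posts often pay tolerably, at least 200,000 to 300,000 RMB a year; having a PhD does count for something.

  In theory the salary is paid entirely by the university, but as noted above, it depends on how the specific agreement is written. However it is written, the appointment is not made under talent recruitment, so a research start-up grant and settling-in allowance are almost certainly absent; it is already quite good if the salary on paper can be guaranteed not to be the all-inclusive cost of employment.

  With no bianzhi, social insurance and the like matter little, and no one will chase up talent policies on the PhD's behalf. Although nominally regular university staff, holders seldom actually get benefits such as childcare; only a few more decent universities will sort out part of them.

  [Figure: A certain university's recruitment of distinguished researchers / associate researchers. Image source: the university's official website]

  3. Performance Requirements

  Since the university created the post specifically to lure recruits, the purpose is clear: do research. These dedicated posts cannot set up an independent group and can only work within a senior professor's group. Usually there is no need to worry about teaching; the job is simply to concentrate on projects and papers. On this point the needs of the university, the group leader, and the distinguished associate professors coincide.

  4. Career Prospects

  You might think that putting your head down and publishing lots of papers should lead to good career prospects. That depends on where. There is essentially no chance of staying on at the original university, but using the papers published during this time to apply elsewhere may lead to a good career. Note: only may! Rumor has it that some universities have already stopped hiring PhDs who have held full-time research posts.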

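  IV. Assistant Professor

  1. Position Overview

  Assistant professor is a recent import in university hiring. In the US, the Assistant Professor ranks below the Associate Professor and the Full Professor, and in the tenure-track system it is the starting point for leveling up.

  Since being introduced in China the title has been used somewhat inconsistently: it can mean a lecturer or a distinguished researcher; either way it comes without bianzhi.

  2. Salary and Benefits

  Although it has no bianzhi, it is a very good academic starting point. Assistant professors are well paid, reportedly on a par with full professors, and receive a settling-in allowance and research start-up grant, all paid by the university. The university will also do its best to apply for local talent policies on their behalf and see them implemented. The start-up grant is often substantial, enough to build an independent laboratory.

  The PI system mentioned earlier also fits well with the assistant professor. For example, at a certain university in the testing ground of reform and opening-up, newly recruited assistant professors are PIs and doctoral supervisors; they can recruit graduate students as well as research assistants and postdocs. This is essentially the full American model, and perhaps even better, because university benefits such as childcare generally do not exist abroad but do here!

  [Image source: Shenzhen University official website]

  3. Performance Requirements

  Given that assistant professors also have no bianzhi, the contract is very important. How many years the post lasts and what standard must be met for promotion to associate professor should ideally all be written into the agreement. Assistant professors usually carry a very light teaching load and can concentrate on research to meet their performance targets.

  4. Career Prospects

  Judging by the current situation at Chinese universities, the assistant professor post is not yet a trap, and those who work hard have a good chance of staying. After all, the university has already spent so many resources on setting up an independent group for the assistant professor that the sunk cost of dismissing them at the end of the appointment would be rather large. Still, it cannot be ruled out that some universities will get "innovative" with staffing and go on to ruin the assistant professor title as well.

  Summary

  The information above is only a broad, rough overview and will certainly differ from specific posts at specific universities. In short, among university posts, those without bianzhi are many and those with it are few. For posts without bianzhi, you also need to read the employer's recruitment notice.

  Finally, a reminder to every PhD: read the contract carefully before taking up a post, confirm the salary and benefits, performance requirements, evaluation criteria, and promotion conditions again and again, and ideally get all of them in writing.

  Even if some rogue universities change the rules at will, at least we will have something to point to, right?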

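How Hard Is It for a PhD to Get a University Teaching Post Today?

Let us look at some figures:

The 2019 yearbook of the National Bureau of Statistics shows that the number of fresh domestic PhD graduates has already exceeded 60,000; adding returnees from abroad, it is inevitable that the total will pass 100,000. Meanwhile regular higher-education institutions are taking on full-time teaching staff at a rate of 30,000 to 40,000 additional people per year. This means that about 30% of PhDs are bound to be shut out of regular university posts, and most of the rest must go through harsh, even brutal, rounds of selection before they have a chance of becoming the final "winners".

Faced with the awkward reality of too many candidates for too few posts, the "army of 100,000 PhDs" has begun looking for roundabout routes, for example using the special position of postdoc to get a foot in the university door first and then using it as a springboard to work their way step by step into the "regular army". In reality, though, such rosy expectations often conceal dangers, and a moment's carelessness can land you in a deep pit...

What Are the Common Types of Postdoc?

Outsiders are still stuck in the misconception that a postdoc is "a degree even higher than a PhD". PhD students inside academia know that a postdoc is a job, yet they too struggle to see what lies behind its glamorous mask.

In China's existing sequence of university faculty posts with bianzhi (编制, the permanent staff quota), the postdoc originally had no place. In other words, the postdoc exists as the universities' way of coping with the tension between "a short supply of faculty bianzhi" and "the need to bring in high-level talent". Of course, off-quota postdoc posts, which have no cap on their number, do also ease employment pressure effectively while nurturing researchers with potential.

At present postdoc posts have gradually developed into the following three types:

  • Project postdoc. As the name suggests, once in the station you work on the research of a specific project group. Pay is slightly higher than for an ordinary postdoc, but the day the project ends is the day you leave the station, with no chance at all of staying at the university.

  • Research postdoc. Similar to a full-time research post, this is another way for universities to compete for top talent. They typically use high salaries to attract PhD graduates with a solid foundation, who do research full time with no teaching duties, the aim being quick results and a run at the "Four Youth" talent titles (四青). The painful part is that research postdocs also have no bianzhi, are appointed for three years at a time, and face the up-or-out dilemma. The requirements for staying are very strict and associate-professor slots are limited, so the competitive pressure is enormous. Still, someone with outstanding ability can produce good results and publish good papers, and to some extent build up bargaining chips for moving elsewhere.

  • Faculty-track postdoc (师资博后). This is probably the "sinkhole post" among postdocs. "Faculty-track" essentially means that the university wants to hire teachers but has no bianzhi slots, so for now it uses part of its postdoc quota to bring PhDs in. They shoulder both teaching and research, for a service period of roughly two to three years.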

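Requirements Keep Changing; Don't Expect Much from the Benefits

The faculty-track postdoc is called a "sinkhole" post not only because it means carrying the twin burdens of teaching and research, but even more because the ever-shifting policies are a headache, leaving you feeling like a lamb to the slaughter with no means of resistance.

Take research requirements, for example. At first you are promised that one paper in a core journal is enough to qualify for staying on. Before long, various leaders tactfully let it be known that a provincial- or ministerial-level project plus papers is required to be appointed lecturer. Fine, with effort that is not impossible; but before the provincial or ministerial project application has even been submitted, a new leader takes office and the evaluation criteria rise again, to a national-level project plus papers in designated journals.

At this point you realize that this ought to be the standard for an associate professor... And here is the problem: the contract is signed, the other offers have been turned down, the hiring season is over, and you are riding a tiger you cannot dismount. In a fit of anger the thought of resigning flashes by, but one look at the steep penalty for leaving and you can only sigh weakly to yourself that your trust was misplaced after all.

In addition, for faculty-track postdocs effort and reward are out of proportion. Besides teaching and research, they sometimes have to take on all sorts of odd jobs, with no boundaries; and because they are not the university's "own people", the benefits are hard to sum up in a word. Basic teaching fees and salary may be higher than a lecturer's, but there is no guarantee of equal treatment for social insurance and the housing fund (五险一金), year-end bonuses, holiday bonuses, and other benefits.

In short, unlike project and research postdocs, whose terms are clearly priced, the faculty-track postdoc is full of "gray areas". Inexperienced fresh PhDs are easily taken in by an appearance that seems full of hope and end up paying for being "too young, too simple".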

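Staying On Takes "Real Ability + a Good Circle"

Back when competition in universities was less brutal, the contractual promise that "after two years, those who pass evaluation and meet condition XX can stay and be appointed lecturer" usually still held. But as inflation in the research world grows ever more severe, it can in practice no longer be trusted. Most PhDs fail to meet the research requirements within the service period and regretfully drop out. As a joke on one forum puts it: "Two years later, carrying the scars of '007' (working around the clock, seven days a week), you walk away alone and dejected, while the university's research GDP booms ever more."

At this point some high achievers may speak up from personal experience to disagree. Indeed, although the faculty-track postdoc has many faults, the chance that "a star fights their way through" cannot be ruled out. After all, in theory the faculty-track postdoc is meant to cultivate the perfect-score talent who "has bianzhi, produces results, and can teach". If you meet the following conditions, it may be worth a try.

  • Many outputs, few signature works. PhDs like this are not rare. Because of their educational background, age limits, overseas experience, and so on, they struggle to get through faculty selection, but they built a solid research foundation as graduate students, produced a series of ordinary papers, and have deep insight into and a forward-looking view of their research direction. Such promising candidates, if they keep up their drive, do in time have a chance of using the faculty-track postdoc as a springboard to a regular post with bianzhi.

  • Strong academic connections. A bright academic future takes not only hard work but also the support of a "circle", and sometimes the latter matters more. If you can follow a good boss and a good supervisor, are reasonably capable yourself, and are willing to put in the effort and get noticed, your future is basically secure.

  • Mentally prepared to juggle teaching and research. You have excellent time-management skills, can complete research to a high standard while still carving out of your already scarce leisure time the hours to prepare and teach classes, and have energy to the max plus an extra dose of responsibility.

Nothing is absolute; whether to take a faculty-track postdoc as a first job varies from person to person.

Fresh PhDs with a strong educational pedigree and a good number of signature works should not sell themselves short for the sake of an immediate temptation. It is better to wait patiently and apply for faculty posts, because "re-employment" after leaving the station is highly likely, and by then the postdoc stint may count against you, like "an undergraduate degree from Peking University, then being reassigned for a master's to the 'Wudaokou Technical College'". PhDs of average overall strength can gain experience through an overseas postdoc, or take a faculty-track postdoc after weighing the pros and cons. With determination and the ability to endure the grind, a comeback is not just a dream, like "going from a vocational college all the way to Peking University: extremely hard, but someone always pulls off a miracle".

Regarding faculty-track postdocs, every university is different, and these tips will not hold everywhere, but they can effectively keep newcomers from blindly "falling into the pit" and regretting it later.

Epilogue

It is true that the pressure of the PhD job hunt is increasingly suffocating, but do not for that reason sign an "indenture contract" casually in the spirit of "taking a gamble". During the critical job-hunting period: first target faculty posts; those who can afford it can consider an overseas postdoc; and once you choose a domestic faculty-track postdoc or another postdoc post, know what you are getting into and have a fallback plan!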