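List of the 93 Hang Seng Index Constituents

According to Wikipedia and the latest information from Hang Seng Indexes Company, the Hang Seng Index has had 93 constituents since 8 June 2026 [[2]]. The full list of the 93 constituents, grouped by industry, follows: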

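Financials (10 constituents)

  1. HSBC Holdings (0005)
  2. AIA Group (1299)
  3. China Construction Bank (0939)
  4. ICBC (1398)
  5. Hong Kong Exchanges and Clearing (0388)
  6. Ping An Insurance (2318)
  7. Bank of China (3988)
  8. China Life Insurance (2628)
  9. China Merchants Bank (3968)
  10. BOC Hong Kong (2388)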

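Consumer Discretionary and Consumer Staples (30 constituents)

  1. Alibaba-SW (9988)
  2. Meituan-W (3690)
  3. BYD Company (1211)
  4. JD.com-SW (9618)
  5. Baidu-SW (9888)
  6. Techtronic Industries (0669)
  7. Kuaishou-W (1024)
  8. Geely Automobile (0175)
  9. Pop Mart (9992)
  10. ANTA Sports (2020)
  11. Trip.com Group-S (9961)
  12. Nongfu Spring (9633)
  13. Li Auto-W (2015)
  14. WH Group (0288)
  15. Galaxy Entertainment (0027)
  16. MTR Corporation (0066)
  17. Midea Group (0300)
  18. China Mengniu Dairy (2319)
  19. Haier Smart Home (6690)
  20. China Resources Beer (0291)
  21. Li Ning (2331)
  22. Shenzhou International (2313)
  23. Sands China (1928)
  24. Laopu Gold (6181)
  25. New Oriental-S (9901)
  26. Haidilao (6862)
  27. Tingyi (Master Kong) (0322)
  28. Chow Tai Fook (1929)
  29. Hengan International (1044)
  30. Budweiser APAC (1876)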

Information Technology (6 constituents)

  1. Tencent Holdings (0700)
  2. Xiaomi-W (1810)
  3. SMIC (0981)
  4. NetEase-S (9999)
  5. Lenovo Group (0992)
  6. BYD Electronic (0285)

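Energy, Materials, Industrials and Conglomerates (18 constituents)

  1. CNOOC (0883)
  2. PetroChina (0857)
  3. Zijin Mining (2899)
  4. CK Hutchison Holdings (0001)
  5. China Shenhua Energy (1088)
  6. CATL (3750)
  7. China Hongqiao (1378)
  8. Sinopec Corp (0386)
  9. CITIC Limited (0267)
  10. ZTO Express (2057)
  11. CMOC Group (3993)
  12. J&T Express (1519) (new)
  13. Sunny Optical Technology (2382)
  14. Chalco (2600) (new)
  15. JD Logistics (2618)
  16. Xinyi Glass (0868)
  17. Orient Overseas (International) (0316)
  18. Xinyi Solar (0968)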

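Telecommunications and Utilities (9 constituents)

  1. China Mobile (0941)
  2. CLP Holdings (0002)
  3. Power Assets Holdings (0006)
  4. Hong Kong and China Gas (0003)
  5. China Telecom (0728)
  6. China Unicom (0762)
  7. ENN Energy (2688)
  8. China Resources Power (0836)
  9. CK Infrastructure Holdings (1038)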

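Properties and Construction (10 constituents)

  1. Sun Hung Kai Properties (0016)
  2. China Resources Land (1109)
  3. Link REIT (0823)
  4. CK Asset Holdings (1113)
  5. China Overseas Land & Investment (0688)
  6. Henderson Land Development (0012)
  7. Wharf Real Estate Investment (1997)
  8. China Resources Mixc Lifestyle (1209)
  9. Longfor Group (0960)
  10. Hang Lung Properties (0101)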

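Healthcare (10 constituents)

  1. BeiGene (6160) (new)
  2. Innovent Biologics (1801)
  3. WuXi Biologics (2269)
  4. WuXi AppTec (2359)
  5. CSPC Pharmaceutical (1093)
  6. Sino Biopharmaceutical (1177)
  7. Hansoh Pharmaceutical (3692)
  8. JD Health (6618)
  9. Alibaba Health (0241)
  10. Sinopharm Group (1099)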

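Latest change: In the quarterly review of 22 May 2026, Hang Seng Indexes Company announced three new blue-chip constituents: J&T Express (1519), Chalco (2600) and BeiGene (6160). This raises the number of constituents from 90 to 93, effective 8 June 2026 [[3]][[7]].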



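In the Hong Kong market, the letter suffixes that follow a stock's short name, such as -W, -S, -SW or -B, are "identity tags" that HKEX uses to flag a company's special shareholding structure or listing type.

They let investors see a company's governance structure or listing background at a glance. The meanings are as follows:

1. -W: Weighted Voting Rights

  • Full name: Weighted Voting Rights (WVR).
  • Meaning: the company uses a dual-class share structure, with ordinary shares and shares carrying special voting rights. The shares that ordinary investors buy carry one vote each, while the special shares held by founders or management carry multiple votes each (for example, 10 votes per share).
  • Purpose: to let the founding team keep firm control of the company and its business decisions even after several funding rounds have diluted its economic stake. This is very common among internet and technology companies.
  • Examples: Meituan-W (3690), Xiaomi-W (1810).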

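2. -S: Secondary Listing

  • Full name: Secondary Listing.
  • Meaning: the company already has its primary listing overseas (for example on the NYSE or Nasdaq) and has additionally listed its shares for trading on HKEX.
  • Purpose: typically to broaden funding channels, attract investors in Asia and Greater China, improve liquidity, or reduce exposure to the regulatory risk of a single overseas market. Many US-listed Chinese companies that have returned to Hong Kong in recent years fall into this category.
  • Example: Baidu-SW (9888) carries the -S marker as part of its combined -SW suffix.
  • (Note: if a secondary-listed company later converts to a dual primary listing, its -S marker is normally removed.)

3. Combined suffix: -SW

  • Meaning: a name such as Alibaba-SW (9988) or JD.com-SW (9618) indicates that the company is both a secondary listing (-S) and uses a weighted voting rights structure (-W).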

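Supplement: other common suffixes

Besides -W and -S, you may also see the following in Hong Kong stock lists:

  • -B: stands for Biotech (biotechnology company).
    • Meaning: the company is a biotech firm that has not passed HKEX's regular profit test but has core R&D capability. Because it is still in the R&D stage and may not yet be profitable, the -B marker is the exchange's way of alerting investors to that risk.
    • Example: Ascletis Pharma-B (1672).

Summary: these letter suffixes are only informational markers from the exchange, a kind of "special ID" for the company. They do not affect normal trading of the stock, and they say nothing directly about the quality of the company's fundamentals. Ordinary investors only need to know that they signal a special shareholding structure or listing background.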



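No, BYD is not listed only in Hong Kong. It is a typical "A+H" dual-listed company, traded both on the mainland China A-share market and as H-shares in Hong Kong.

Details:

1. A-shares (mainland China)

  • Exchange: Shenzhen Stock Exchange (SZSE)
  • Short name: 比亚迪 (BYD)
  • Stock code: 002594
  • Listed: June 2011, on what was then the SME Board (later merged into the Main Board). The A-share listing came almost nine years after the Hong Kong listing.

2. H-shares (Hong Kong)

  • Exchange: Stock Exchange of Hong Kong (HKEX)
  • Short name: 比亚迪股份 (BYD Company)
  • Stock code: 01211 (often displayed as 1211 in Hong Kong trading apps)
  • Listed: July 2002 (BYD listed in Hong Kong first, almost nine years before its A-share listing).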

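Why does it look as if BYD is "listed only in Hong Kong"?

The Hang Seng Index constituent list you saw earlier shows "BYD Company (1211)". The Hang Seng Index is a Hong Kong index, so the list can only carry the company's Hong Kong code and name. That can give the impression that it has only a Hong Kong listing.

How are the A-shares and H-shares related?

  • Same company: 002594 (A-share) and 01211 (H-share) represent one and the same company (BYD Company Limited), and each share carries the same shareholder rights.
  • Different prices: A-shares and H-shares trade in two separate markets (mainland China and Hong Kong) with different investor bases, liquidity and sentiment, so the two prices usually differ even after conversion into the same currency. This gap is the so-called "A/H premium".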
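The A/H premium mentioned above can be written as a one-line calculation. The sketch below is only an illustration: the function name `ah_premium`, the prices and the exchange rate are all hypothetical values chosen for the example, not market data.

```python
def ah_premium(a_price_cny: float, h_price_hkd: float, hkd_to_cny: float) -> float:
    """Return the A-share premium over the H-share as a fraction (0.25 means 25%)."""
    # Convert the H-share price into CNY so both prices share one currency.
    h_price_cny = h_price_hkd * hkd_to_cny
    return a_price_cny / h_price_cny - 1.0


# Hypothetical inputs, for illustration only.
premium = ah_premium(a_price_cny=110.0, h_price_hkd=100.0, hkd_to_cny=0.92)
print(f"A/H premium: {premium:.1%}")
```

A positive result means the A-share trades above the H-share after currency conversion; a negative result means it trades at a discount.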

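Supplement: is it listed anywhere else?

Apart from mainland China (A-shares) and Hong Kong (H-shares), BYD is not listed on any other overseas exchange, whether in the US (Nasdaq, NYSE) or in Europe (for example the London Stock Exchange).

Summary: whether you search for "比亚迪" (002594) in a mainland trading app or "比亚迪股份" (01211) in a Hong Kong trading app, you are buying shares in the same company. Only the trading currency (CNY vs HKD) and the market differ.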



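This question has to be answered at two levels:

By public perception and actual market dominance, the US has 2 core stock exchanges. By official registration with the US Securities and Exchange Commission (SEC), it currently has 16 national securities exchanges.

To make this clear, they can be divided into three tiers: the "two giants", the "other specialized exchanges" and the "over-the-counter market".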

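Tier 1: the two giants (home to the listings of the vast majority of US market capitalization)

These are the two exchanges that ordinary investors know best and that are the most famous worldwide. The great majority of well-known US listed companies are listed on one of them:

  1. New York Stock Exchange (NYSE)
    • Features: the oldest (founded in 1792), located at 11 Wall Street. It traditionally relied on a "specialist" system for matching orders, although trading is now largely electronic. Its listing requirements are very high.
    • Typical companies: mainly traditional industry leaders, mega-cap blue chips and large financial institutions, such as Berkshire Hathaway, Walmart, Coca-Cola, JPMorgan Chase and Johnson & Johnson.
  2. Nasdaq
    • Features: founded in 1971 as the world's first fully electronic stock market. It has no physical trading floor, its listing requirements are somewhat lower than the NYSE's, and it is very friendly to technology and innovative companies.
    • Typical companies: the world's leading technology giants and growth companies, such as Apple, Microsoft, Nvidia, Tesla, Amazon, Meta (Facebook) and Google.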

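Tier 2: other national securities exchanges (about 14)

Besides the NYSE and Nasdaq, the SEC has registered more than a dozen other securities exchanges. Retail investors rarely notice them directly; they mainly handle institutional orders, provide liquidity, or promote a specific market design. They include:

  • Other exchanges in the NYSE group
    • NYSE Arca: formerly the Pacific Exchange, now the dominant venue for ETF (exchange-traded fund) trading.
    • NYSE American: formerly the American Stock Exchange (AMEX), now aimed mainly at small and mid-cap companies and start-ups.
    • NYSE Chicago / NYSE National: mainly handle part of the electronic order flow and provide trading access.
  • Equities exchanges operated by Cboe
    • Cboe itself is an options giant, but it also operates 4 stock exchanges (Cboe BZX, BYX, EDGA, EDGX). Many brokers (such as Robinhood and Interactive Brokers) route retail orders to these exchanges for matching.
  • Specialty / independent exchanges
    • IEX (Investors Exchange): made famous by the bestseller "Flash Boys". Its signature feature is a deliberate "speed bump" (a 350-microsecond delay) that keeps high-frequency traders from using their speed advantage to jump ahead of ordinary investors.
    • MEMX (Members Exchange): founded with funding from several large Wall Street firms (such as Charles Schwab and Morgan Stanley) to break the dominance of the NYSE and Nasdaq and to lower trading fees.
    • LTSE (Long-Term Stock Exchange): proposed by a well-known Silicon Valley figure, it aims to reward long-term shareholders (for example, voting power that grows with holding time) and to counter Wall Street's short-termism.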

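Supplement: the over-the-counter (OTC) market

Besides the formal exchanges above, the US also has a large over-the-counter market. Strictly speaking it is not an exchange but a network of dealers.

  • OTC Markets: the well-known "Pink Sheets" market belongs here.
  • Features: companies quoted here usually do not meet the financial standards of the NYSE or Nasdaq. The market is full of penny stocks (priced at a few dimes or even a few cents), delisted companies, companies in bankruptcy reorganization, and micro-cap firms that do not want to bear high compliance costs. The risk is extremely high.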

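Summary

  • If you are asking about "the main US stock trading venues", the answer is 2 (the NYSE and Nasdaq).
  • If you are asking about "the number of officially recognized US securities exchanges", the answer is 16. The US has this many exchanges because its National Market System (NMS) lets exchanges compete with one another by lowering fees, offering faster connectivity or providing fairer mechanisms, which keeps overall trading costs very low.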

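About the wagering requirements (Umsatzbedingungen) for Betano's 80-euro bonus offer

Here is the key information about the Betano offer that gives you 80 euros when you deposit 80 euros:

💰 Can the 80 euros be withdrawn?

Yes, but only under certain conditions.

The 80 euros is bonus credit (Bonusguthaben) and cannot be withdrawn directly. You first have to meet the wagering requirements (Umsatzbedingungen).

Specifically, you have to wager 5 times the sum of deposit and bonus (80 + 80 = 160 euros), that is, 800 euros in total.

In addition, every qualifying bet must have minimum odds of 1.65. Only after these conditions are met are the bonus and the winnings it produced converted into a withdrawable balance.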
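The arithmetic above can be checked with a few lines of code. This is a minimal sketch under the terms quoted in this answer (5x turnover on deposit plus bonus, minimum odds 1.65); the function names are hypothetical, and the operator's current terms may differ.

```python
def wagering_requirement(deposit: float, bonus: float, multiplier: int = 5) -> float:
    """Total stake that must be wagered before the bonus becomes withdrawable."""
    return (deposit + bonus) * multiplier


def bet_counts(odds: float, min_odds: float = 1.65) -> bool:
    """A bet contributes to the turnover only if its odds reach the minimum."""
    return odds >= min_odds


print(wagering_requirement(80, 80))       # 5 x (80 + 80)
print(bet_counts(1.50), bet_counts(1.80))
```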

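⏳ Is there a time limit?

Yes, and there are two separate deadlines to keep in mind:

  • Deadline for the wagering requirement: you have 90 days to complete the 800 euros of turnover described above. If you have not completed it after 90 days, the bonus and the related winnings may be forfeited.
  • Deadline for using the bonus after activation: once the bonus has been credited, it also has to be used for betting within a set period, usually 90 days.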

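📝 Other important rules

  • New-customer offer: it is usually available only to newly registered users.
  • Minimum deposit: to receive the 100% bonus, the minimum deposit is 10 euros.
  • Payment method matters: not all payment methods qualify. For example, a deposit made with Skrill may not earn the bonus.
  • Some bet types do not count: not every bet counts towards the turnover. Special markets such as "double chance" may be excluded.
  • Possible extra gift: the offer sometimes also includes a 20-euro free bet (Freebet). It usually has its own validity period of 7 to 14 days, and a winning free bet pays out only the net profit, not the stake.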

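💎 Summary

In short, the Betano bonus can be withdrawn, but only after you have wagered a total of 800 euros within 90 days at odds of at least 1.65.

Before taking part, log in to the official Betano website and read the current bonus terms (Bonusbedingungen) carefully, because the specific rules may change.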

How to Convert GenBank Files to GTF/FASTA for the nf-core/rnaseq Pipeline (Data_Nina_RNAseq_2026)

Abstract: When running the nf-core/rnaseq pipeline on custom or non-model viral genomes (like VZV or HSV), standard GenBank (.gb) files often cause indexing errors in STAR or quantification failures in Salmon due to missing GTF attributes. In this post, I share a robust Python (Biopython) workflow to convert GenBank files into perfectly formatted FASTA and GTF files. We specifically address how to inject required features like exon, gene_type, and properly formatted transcript_id and gene_id attributes to ensure seamless execution of the star_salmon alignment and quantification steps.

You are absolutely right! The duplicate exon was not the root cause. The real issue is that Salmon and the nf-core/rnaseq pipeline strictly expect specific attributes (like gene_type "protein_coding") to be present in every corresponding record to correctly build the transcript-to-gene mapping and index.

By comparing your original BB1528_nanopore_consensus.gtf with the HSV1 structure, we can see the missing “additional IDs/attributes”:

  1. gene_type "protein_coding" is missing in all records.
  2. The transcript_id format should ideally end with _RNA (e.g., UL33_RNA) instead of _tx.
  3. CDS records should have an orf_id.
  4. gene records should not have gene_name, and exon records should not have exon_number (to match the clean HSV1 structure).

Instead of re-converting from the GenBank file, we can write a quick Python script to directly fix your existing BB1528_nanopore_consensus.gtf and reformat its attributes to perfectly match the HSV1 structure.

Step 1: Save and Run the GTF Formatting Script

Save the following code as fix_gtf_to_hsv1_format.py in your working directory:

import sys

def fix_gtf(input_gtf, output_gtf):
    seen_exons = set()

    with open(input_gtf, 'r') as fin, open(output_gtf, 'w') as fout:
        for line in fin:
            if line.startswith('#'):
                continue
            parts = line.strip().split('\t')
            if len(parts) < 9:
                continue

            seqname, source, feature, start, end, score, strand, frame, attributes = parts

            # Parse existing attributes into a dictionary
            attrs_dict = {}
            for attr in attributes.split(';'):
                attr = attr.strip()
                if not attr:
                    continue
                if ' ' in attr:
                    key, val = attr.split(' ', 1)
                    attrs_dict[key] = val.strip('"')

            gene_id = attrs_dict.get('gene_id', 'unknown_gene')
            # Standardize transcript_id to gene_id_RNA to match HSV1 format
            new_transcript_id = f"{gene_id}_RNA"

            # Rebuild attributes strictly matching HSV1 structure
            new_attrs = ""
            if feature == "gene":
                new_attrs = f'gene_id "{gene_id}"; gene_type "protein_coding"'
            elif feature == "transcript":
                new_attrs = f'gene_id "{gene_id}"; transcript_id "{new_transcript_id}"; gene_type "protein_coding"'
            elif feature == "exon":
                # Avoid duplicate exons
                exon_key = (seqname, start, end, strand, gene_id, new_transcript_id)
                if exon_key in seen_exons:
                    continue
                seen_exons.add(exon_key)
                new_attrs = f'gene_id "{gene_id}"; transcript_id "{new_transcript_id}"; gene_type "protein_coding"'
            elif feature == "CDS":
                orf_id = f"{gene_id}_CDS"
                new_attrs = f'gene_id "{gene_id}"; transcript_id "{new_transcript_id}"; orf_id "{orf_id}"; gene_type "protein_coding"'
            else:
                # For any other features, keep them but ensure gene_type is present
                if 'transcript_id' in attrs_dict:
                    attrs_dict['transcript_id'] = new_transcript_id
                if 'gene_type' not in attrs_dict:
                    attrs_dict['gene_type'] = 'protein_coding'
                new_attrs = '; '.join([f'{k} "{v}"' for k, v in attrs_dict.items()])

            # Write the fixed line
            fout.write(f"{seqname}\t{source}\t{feature}\t{start}\t{end}\t{score}\t{strand}\t{frame}\t{new_attrs}\n")

if __name__ == "__main__":
    fix_gtf("BB1528_nanopore_consensus.gtf", "BB1528_final.gtf")
    print("✅ GTF successfully reformatted to match HSV1 structure -> BB1528_final.gtf")

Run it in your terminal:

python3 fix_gtf_to_hsv1_format.py

Step 2: Verify the Output

If you check the new BB1528_final.gtf, the structure will now perfectly mirror your HSV1 example:

BB1528  GenBank gene    88  561 .   +   .   gene_id "ORF_S_L_transcribed"; gene_type "protein_coding"
BB1528  GenBank transcript  88  561 .   +   .   gene_id "ORF_S_L_transcribed"; transcript_id "ORF_S_L_transcribed_RNA"; gene_type "protein_coding"
BB1528  GenBank exon    88  561 .   +   .   gene_id "ORF_S_L_transcribed"; transcript_id "ORF_S_L_transcribed_RNA"; gene_type "protein_coding"
BB1528  GenBank CDS 88  561 .   +   0   gene_id "ORF_S_L_transcribed"; transcript_id "ORF_S_L_transcribed_RNA"; orf_id "ORF_S_L_transcribed_CDS"; gene_type "protein_coding"
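Rather than checking the file by eye, the expected layout can be verified with a short script. The sketch below is an assumption about what "matches the HSV1 structure" means in practice (every record has gene_id and gene_type, and every non-gene record has a transcript_id ending in _RNA); the helper name `check_gtf_lines` is hypothetical and not part of any pipeline.

```python
import re

# Matches GTF attributes of the form: key "value"
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def check_gtf_lines(lines):
    """Return (line_number, message) tuples for records that break the expected layout."""
    problems = []
    for n, line in enumerate(lines, start=1):
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            problems.append((n, f"expected 9 tab-separated fields, found {len(fields)}"))
            continue
        feature = fields[2]
        attrs = dict(ATTR_RE.findall(fields[8]))
        for key in ("gene_id", "gene_type"):
            if key not in attrs:
                problems.append((n, f"missing {key}"))
        if feature != "gene" and not attrs.get("transcript_id", "").endswith("_RNA"):
            problems.append((n, "transcript_id missing or not ending in _RNA"))
    return problems


# Two illustrative records: the first is well formed, the second is deliberately broken.
sample = [
    'BB1528\tGenBank\tgene\t88\t561\t.\t+\t.\tgene_id "ORF_S_L_transcribed"; gene_type "protein_coding"',
    'BB1528\tGenBank\texon\t88\t561\t.\t+\t.\tgene_id "ORF_S_L_transcribed"; transcript_id "ORF_S_L_transcribed_tx"',
]
for line_no, message in check_gtf_lines(sample):
    print(f"line {line_no}: {message}")
```

To check the real file, pass an open file handle instead of the sample list, for example `check_gtf_lines(open("BB1528_final.gtf"))`; an empty result means no record violated these checks.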

Step 3: Re-run Nextflow with -resume

Now, update your Nextflow command to point to the newly formatted BB1528_final.gtf.

Important: Keep the -resume flag! Because the GTF file’s content has changed, Nextflow will automatically detect the new file hash and re-run only the genome preparation steps (like SALMON_INDEX and GTF2BED) without re-processing your FASTQ files.

nextflow run rnaseq/main.nf \
    --input samplesheet.csv \
    --outdir results_BB1528 \
    --fasta BB1528_nanopore_consensus.fasta \
    --gtf BB1528_final.gtf \
    --with_umi \
    --umitools_extract_method regex \
    --umitools_bc_pattern '^(?P<UMI>.{12}).*' \
    --umitools_dedup_stats \
    -profile test_full \
    -resume \
    --max_memory 256.GB \
    --max_time 2400.h \
    --save_reference \
    --aligner star_salmon \
    --gtf_extra_attributes gene_id \
    --gtf_group_features transcript_id \
    --featurecounts_group_type gene_id \
    --featurecounts_feature_type transcript \
    --skip_rseqc \
    --skip_dupradar \
    --skip_preseq \
    --skip_biotype_qc \
    --skip_deseq2_qc \
    --skip_multiqc \
    --min_mapped_reads 0

This should completely resolve the SALMON_QUANT indexing error!



Good news: Your GTF and FASTA files are perfectly formatted!

We can prove this from your log:

[info] Index contained 78 targets

This means Salmon successfully built the transcriptome index and found all 78 transcripts from your BB1528_final.gtf. The reference files are correct.

What is causing the error?

The error is purely a biological/sample-specific issue:

[warning] salmon was only able to assign 9 fragments to transcripts in the index, 
but the minimum number of required assigned fragments (--minAssignedFrags) was 10.

The nf-core/rnaseq pipeline automatically subsamples your reads (in this case, 1,000,000 fragments) to run a quick Salmon quantification and check the library strandedness (--libType=A).

Because you are mapping to a specific viral reference (BB1528), it is highly likely that control_r3 is a negative control or a sample with an extremely low viral load. Out of the 1,000,000 subsampled reads, only 9 reads actually belonged to the virus.

Salmon has a built-in safety threshold (--minAssignedFrags 10) that intentionally crashes the pipeline if fewer than 10 reads map to the transcriptome, to prevent “garbage” quantification of empty samples. Since 9 < 10, the pipeline stops.

How to fix it

You can easily bypass this by creating a custom Nextflow configuration file that tells Salmon to accept samples with even just 1 mapped fragment.

Step 1: Create a config file

In your working directory, create a file named salmon_fix.config and add the following code:

process {
    withName: 'SALMON_QUANT' {
        // Append --minAssignedFrags 1 to any existing Salmon arguments
        ext.args = { (task.ext.args ?: '') + " --minAssignedFrags 1" }
    }
}

Step 2: Run the pipeline with the config file

Add -c salmon_fix.config to your Nextflow command. This will apply the fix to both the subsampled QC step and the main quantification step, ensuring your negative controls don't crash the pipeline.

nextflow run /home/jhuang/Tools/nf-core-rnaseq-3.12.0/main.nf -profile docker \
    -c salmon_fix.config \
    --input samplesheet.csv \
    --outdir results_BB1528 \
    --fasta BB1528_nanopore_consensus.fasta \
    --gtf BB1528_final.gtf \
    --with_umi \
    --umitools_extract_method regex \
    --umitools_bc_pattern '^(?P<umi_1>.{12}).*' \
    --trimmer fastp \
    --extra_fastp_args "--trim_front2 10" \
    -resume \
    --max_cpus 54 \
    --max_memory 120.GB \
    --max_time 2400.h \
    --save_align_intermeds \
    --save_unaligned \
    --save_reference \
    --aligner star_salmon \
    --gtf_extra_attributes gene_id \
    --gtf_group_features transcript_id \
    --featurecounts_group_type gene_id \
    --featurecounts_feature_type transcript \
    --skip_rseqc --skip_dupradar --skip_preseq --skip_biotype_qc --skip_deseq2_qc --skip_multiqc \
    --min_mapped_reads 0

Summary of what will happen now:

  1. The pipeline will resume from where it left off (thanks to -resume).
  2. Salmon will process control_r3, see that it only has 9 viral reads, but instead of crashing, it will output a warning and successfully generate an output file with 0 or near-0 counts.
  3. The rest of your infected samples (which likely have plenty of viral reads) will quantify normally.
  4. At the end of the pipeline, you will get a complete count matrix where your negative controls correctly show 0 expression, which is exactly what you want for downstream differential expression analysis!


Yes, all detected configuration files are being loaded and merged by Nextflow, so your custom settings will be applied.

What does this warning actually mean?

This is not an error; it is a best-practice reminder built into the nf-core pipeline framework. nf-core asks that pipeline parameters (the arguments starting with --, like --input, --fasta, --max_memory) not be defined inside a custom .config file. They should be passed directly on the command line (CLI) or in a dedicated -params-file. The warning is simply the pipeline saying:

"Hey, I see you are using custom .config files. Just a reminder: do not put your --parameters inside them! Use the command line for parameters, and use .config files only for Nextflow settings (like process resources, arguments, or executor settings)."

Is your setup correct?

Yes. The salmon_fix.config file we created in the previous step only contains a process block to modify Salmon's arguments (ext.args). It does not contain any params, so you are using the config file as intended.

Why did it trigger?

Nextflow merges every configuration file it loads: the pipeline's own nextflow.config, a nextflow.config in the launch directory if one exists, and any file passed with -c (here, salmon_fix.config). Because more than one file was loaded, the pipeline printed this informational warning.

What should you do?

You can safely ignore this warning. The pipeline will continue running normally, and your fix to allow Salmon to quantify samples with fewer than 10 assigned fragments (--minAssignedFrags 1) will be applied.

Did the pipeline pass the SALMON_QUANT step and finish running this time? If it did not crash at that step, the config file worked.



Based on the error log and the reference files you provided, here is the diagnosis of the problem and how to fix it.

🔍 The Root Cause

The pipeline is failing during the FASTQ_SUBSAMPLE_FQ_SALMON step. nf-core/rnaseq uses this step to subsample 1 million reads and run Salmon to automatically detect your library’s strandedness (e.g., forward, reverse, unstranded) and verify the mapping rate.

The critical error from Salmon is:

salmon was only able to assign 8 fragments to transcripts in the index, but the minimum number of required assigned fragments (--minAssignedFrags) was 10. This could be indicative of a mismatch between the reference and sample...

Why is this happening?

  1. Your Reference is Virus-Only: The reference you provided (BB1528_nanopore_consensus.fasta) is a Varicella-Zoster Virus (VZV) genome (~127 kb).
  2. Your Controls are Likely Host-Only: You have samples named control_r1, control_r2, and control_r3. These are almost certainly uninfected host cells (e.g., human or monkey cell lines).
  3. The Mismatch: Because host RNA-seq reads do not match the VZV viral genome, the mapping rate for your control samples is essentially 0% (only 8 random reads mapped out of 1,000,000). Salmon aborts because it cannot confidently determine the library type with fewer than 10 mapped reads.

🛠️ How to Fix It

Depending on your experimental goals, choose one of the following solutions:

Solution 1: Add the Host Genome to your Reference (Recommended)

If you want to analyze host gene expression or simply want the pipeline to run smoothly on uninfected controls, you must include the host genome (e.g., Human GRCh38 or Macaque) alongside the VZV genome.

  • Action: Concatenate the Host FASTA and VZV FASTA into a single combined_genome.fa. Do the same for the Host GTF and VZV GTF (combined_annotation.gtf).
  • Why: This allows the host reads in your control samples to map properly, allowing Salmon to easily detect the library strandedness and pass the QC threshold.
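The concatenation in Solution 1 can be done with `cat`, but a small script makes one common pitfall explicit: a file that does not end with a newline would fuse its last line with the next file's first record. The sketch below is a minimal illustration; the function name `concatenate` is hypothetical.

```python
import os
import shutil


def concatenate(inputs, output):
    """Stream each input file into `output`, adding a newline between files if one is missing."""
    with open(output, "wb") as out:
        for path in inputs:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)
                # If the file is non-empty and lacks a trailing newline, add one,
                # so the next file's first line starts on its own line.
                if src.tell() > 0:
                    src.seek(-1, os.SEEK_END)
                    if src.read(1) != b"\n":
                        out.write(b"\n")
```

For example, `concatenate(["host_genome.fa", "BB1528_nanopore_consensus.fasta"], "combined_genome.fa")` would build the combined FASTA, where host_genome.fa is a placeholder name for your host reference. The same call works for the two GTF files. Sequence names must be unique across host and virus, and the names used in the combined GTF must match those in the combined FASTA.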

Solution 2: Remove the Control Samples from the Samplesheet

If you are only interested in viral transcription and intentionally built a virus-only reference, the uninfected controls will yield zero useful data anyway.

  • Action: Open your samplesheet.csv and delete the rows for control_r1, control_r2, and control_r3.
  • Why: The pipeline will only run on the VZV.dXX infected samples, which will have high mapping rates to the VZV reference, allowing Salmon to successfully detect the library type.

Solution 3: Manually Specify Strandedness

If you know the strandedness of your library preparation kit (e.g., Illumina TruSeq Stranded mRNA is usually reverse), you can tell the pipeline to skip the auto-detection step.

  • Action: Set the strandedness column in your samplesheet.csv to reverse, forward, or unstranded (instead of auto) for all samples, adding the column if your samplesheet does not have it yet.
  • Note: While this bypasses the library-type inference failure, Salmon might still throw a warning about the low mapping rate for the controls, but it usually prevents the hard crash.
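Editing the samplesheet by hand is fine for a few rows; for larger sheets the change can be scripted. The sketch below assumes a comma-separated samplesheet with a header row; the function name `set_strandedness` and the file names in the example row are hypothetical.

```python
import csv
import io

ALLOWED = {"forward", "reverse", "unstranded"}


def set_strandedness(csv_text: str, value: str) -> str:
    """Return the samplesheet text with the strandedness column set to `value` on every row."""
    if value not in ALLOWED:
        raise ValueError(f"strandedness must be one of {sorted(ALLOWED)}")
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = list(reader.fieldnames or [])
    if "strandedness" not in fieldnames:
        fieldnames.append("strandedness")  # add the column if it is missing
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        row["strandedness"] = value
        writer.writerow(row)
    return out.getvalue()


sheet = "sample,fastq_1,fastq_2,strandedness\ncontrol_r1,c1_R1.fastq.gz,c1_R2.fastq.gz,auto\n"
print(set_strandedness(sheet, "reverse"))
```

Choose the value that matches your library preparation kit; the script only writes what you tell it and does not infer strandedness.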

Solution 4: Verify Sample Identities

If your control samples are actually supposed to be VZV-infected (e.g., a different viral strain or a positive control), then an 8-read mapping rate indicates a severe issue.

  • Action: Check your FASTQ files. Ensure the control FASTQ files aren’t corrupted, empty, or accidentally swapped with a completely different organism’s sequencing data. Ensure the viral strain in the control isn’t highly divergent from the Oka parental strain reference you are using.

💡 Summary Recommendation

For viral RNA-seq experiments, Solution 1 is the standard best practice. Creating a “chimeric” reference (Host + Virus) ensures that mapping rates are accurate, host-response pathways can be evaluated, and nf-core/rnaseq’s internal QC tools function exactly as intended.

Comprehensive Reproducible Pipeline for Longitudinal Nasal Microbiome and S. epidermidis Epidome Analysis (DATA_B/Data_Luise_Epidome_longitudinal_nose)

1. Study Design & Metadata

This pipeline analyzes longitudinal nasal swab samples from patients undergoing neurosurgery.

  • Cohort A (Aneurysm): 15 patients (45 samples)
  • Cohort H (Hypophysis): 20 patients (60 samples)
  • Total: 108 patient samples + 3 extraction controls.
  • Timepoints:
    • .1 = Admission (Baseline)
    • .2 = Surgery (Intraoperative/Immediate post-op)
    • .3 = Discharge (Recovery)

Targeted Sequencing Approaches:

  1. 16S rRNA Gene Amplicon Sequencing: For overall nasal microbiome profiling (processed via QIIME1 open-reference picking against SILVA 132).
  2. Epidome Method (S. epidermidis): Targeted amplicon sequencing of the g216 and yycH genes for high-resolution Sequence Type (ST) tracking (processed via DADA2 and custom Epidome Python/R scripts).

2. R Environment & Dependencies

To reproduce this analysis, ensure you are using R (v4.1+) and have the following packages installed.

# CRAN packages: community ecology and phylogenetics
install.packages(c("picante", "vegan", "ape"))
# Bioconductor packages (phyloseq, microbiome, DESeq2 and MicrobiotaProcess are not on CRAN)
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install(c("phyloseq", "microbiome", "DESeq2", "MicrobiotaProcess"))

# Advanced Processing & Visualization
install.packages(c("microeco", "ggplot2", "ggpubr", "ggrepel", "aplot"))
install.packages(c("ggh4x", "gghalves", "ggalluvial", "RColorBrewer", "heatmaply", "gplots"))

# Statistical Testing & Data Wrangling
install.packages(c("dplyr", "tidyr", "rstatix", "RVAideMemoire", "openxlsx", "knitr", "kableExtra", "tibble", "reshape2", "coin", "BSDA"))

3. Workflow Part I: Overall Microbiome (16S) Analysis

Scripts involved: MicrobiotaProcess.R (Primary engine for diversity/ordination) & Phyloseq.Rmd (Reporting, DESeq2, and specific statistical tests).

3.1 Data Import & Normalization Strategy

The pipeline splits the raw OTU table (table_even9893.biom) into distinct objects based on the downstream statistical requirements:

  • ps_filt: Raw counts, low-depth samples removed (min 1000 reads). Used for DESeq2 and Hellinger transformation.
  • ps_rarefied: Rarefied to an even depth of 9,893 reads (set.seed(9242)). Used for Alpha diversity and UniFrac/Jaccard metrics.
  • ps_abund_rel: Relative abundance, filtered to keep only taxa with mean relative abundance > 0.1%. Used for clean taxonomic composition plots.
  • hellinger (via MicrobiotaProcess): Hellinger-transformed ps_filt. Used for Bray-Curtis distances and PCoA.
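To make the Hellinger and Bray-Curtis steps concrete, here is a minimal pure-Python sketch of both calculations on a single pair of samples. It illustrates the standard formulas (Hellinger: square root of relative abundance; Bray-Curtis: sum of absolute differences over sum of totals) and is not the MicrobiotaProcess implementation; the function names and toy counts are hypothetical.

```python
import math


def hellinger(counts):
    """Hellinger transform: square root of each taxon's relative abundance in the sample."""
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [math.sqrt(c / total) for c in counts]


def bray_curtis(x, y):
    """Bray-Curtis dissimilarity between two abundance vectors of equal length."""
    numerator = sum(abs(a - b) for a, b in zip(x, y))
    denominator = sum(a + b for a, b in zip(x, y))
    return numerator / denominator if denominator else 0.0


# Toy OTU counts for two samples (not real data).
sample_1 = [10, 30, 0]
sample_2 = [5, 5, 30]
distance = bray_curtis(hellinger(sample_1), hellinger(sample_2))
print(round(distance, 3))
```

Because the transform works on relative abundances, it can be applied to non-rarefied counts, which is why it is paired with ps_filt rather than ps_rarefied.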

3.2 Alpha Diversity (Longitudinal & Cross-Sectional)

  • Metrics: Observed OTUs, Chao1, ACE, Shannon, Simpson, Pielou.
  • Longitudinal Analysis (Within Cohorts A and H):
    • Overall temporal shift: Friedman test (non-parametric repeated measures) accounting for PatientID pairing.
    • Post-hoc pairwise: Paired t-tests and Paired Wilcoxon signed-rank tests with Benjamini-Hochberg (BH) correction.
  • Cross-Sectional Analysis (Cohort A vs. Cohort H):
    • All timepoints pooled: Independent t-test / Wilcoxon rank-sum test.
    • Surgery timepoint ONLY: To prevent dilution of the surgical effect by baseline/discharge samples, a specific filter (SampleType == "surgery") is applied before running the independent t-test/Wilcoxon.

3.3 Beta Diversity & Ordination

  • Distance Metric: Bray-Curtis dissimilarity calculated on Hellinger-transformed non-rarefied counts.
  • Ordination: Principal Coordinates Analysis (PCoA).
  • Statistical Testing:
    • Overall: PERMANOVA (vegan::adonis2) with 9,999 permutations.
    • Post-hoc Pairwise: All 15 possible pairwise group comparisons (e.g., A1 vs A2, A1 vs H1, H2 vs H3) are calculated using adonis2. P-values are adjusted using BH (FDR) and Bonferroni corrections.
  • Outputs: PCoA plots (colored by group, sized by Shannon, alpha by Observed), exported as PNG/PDF/SVG. Pairwise PERMANOVA results exported to Bray_pairwise_PERMANOVA.xlsx.
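The figure of 15 pairwise comparisons follows from choosing 2 groups out of 6 (2 cohorts x 3 timepoints). The sketch below enumerates the pairs and applies the Bonferroni adjustment; the group labels follow the A1/H3 naming used above, and the helper name `bonferroni` is hypothetical. It does not run PERMANOVA itself.

```python
from itertools import combinations

# Six groups: cohort (A or H) combined with timepoint (1, 2, 3).
groups = ["A1", "A2", "A3", "H1", "H2", "H3"]
pairs = list(combinations(groups, 2))
print(len(pairs))  # number of pairwise comparisons


def bonferroni(pvalues):
    """Bonferroni adjustment: multiply each p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]
```

Bonferroni is the more conservative of the two corrections reported; with 15 tests, a raw p-value has to be below about 0.0033 to stay under 0.05 after adjustment.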

3.4 Taxonomic Composition

  • Plots: Stacked barplots and Heatmaps at the Class level (Top 20 taxa).
  • Grouping: Plots generated for individual samples, grouped by Cohort (A vs H), and grouped by Timepoint (Admission, Surgery, Discharge).

3.5 Differential Abundance Analysis (DESeq2)

  • Method: Negative Binomial Generalized Linear Model (Wald test) using raw, non-rarefied integer counts (ps_filt).
  • Comparisons: Six pairwise comparisons in total, three per cohort (run separately for Cohort A and Cohort H):
    1. Admission vs. Surgery
    2. Admission vs. Discharge
    3. Surgery vs. Discharge
  • Threshold: Adjusted P-value (BH) < 0.05.
  • Outputs: Volcano-style plots (log2FoldChange by Family/Class) and Excel tables of significant DEGs for each comparison.

4. Workflow Part II: S. epidermidis Epidome Analysis

Script involved: Phyloseq.Rmd (Sections: ST summary, Binary Prevalence Analysis).

Input Data: count_table_seq31_seq37_ST.txt (ASVs classified into STs via g216 and yycH BLAST against the Epidome DB). Counts are normalized to the median depth (56,191).

4.1 ST Composition & Alpha Diversity

  • Composition: Stacked barplots of relative ST abundance across the 108 samples.
  • Alpha Diversity: Shannon index and Observed STs compared between Cohorts (t-test) and across Timepoints (t-test).
  • goeBURST: Clonal complex (CC) visualization based on PubMLST allele profiles.

4.2 Continuous Abundance Testing (Specific STs)

  • Method: Wilcoxon signed-rank test (paired, Admission .1 vs. Discharge .3).
  • Targets: Analyzed individually for clinically relevant STs (ST2, ST5, ST8, ST23, ST59, ST60, ST73, ST100, ST130, ST215) and in combined clinical groups (e.g., ST5+ST87+ST130+ST210+ST384).
  • Handling Ties: Evaluated using exact calculations via the coin package and the Sign Test (BSDA) to ensure robustness against zero-inflation.

4.3 🌟 Novel Binary Prevalence Analysis (Cochran’s Q & McNemar)

Because continuous abundance of specific high-risk STs (ST2, ST5, ST23) showed no significant shifts, we converted the data to a binary presence/absence matrix to track strain acquisition and clearance.

  1. Data Transformation: Abundance > 0 becomes 1 (Present); Abundance = 0 becomes 0 (Absent).
  2. Overall Temporal Shift: Cochran’s Q test (RVAideMemoire::cochran.qtest) to evaluate if the prevalence proportion changes across the 3 timepoints (Admission, Surgery, Discharge) for a given ST.
  3. Pairwise Transitions: Robust McNemar tests for paired binary data (Admission vs Surgery, Surgery vs Discharge, Admission vs Discharge).
    • Robustness: Custom R function handles zero-variance (no discordant pairs) by assigning $p = 1.0$ instead of crashing.
  4. Correction: Benjamini-Hochberg (BH) FDR correction applied to the 3 pairwise p-values.
  5. Visualization: Line charts with scatter points showing the Prevalence (%) trend across the 3 timepoints, annotated with $n/N$ (number of positive patients / total patients).
  6. Export: The binary matrix is exported to ST_binary_matrix.xlsx.
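The first steps of this analysis (binarization, prevalence as n/N, and counting the discordant pairs that feed the McNemar test) can be sketched in a few lines. This is an illustration in Python rather than the R code in Phyloseq.Rmd; the function names and the toy abundances are hypothetical.

```python
def binarize(abundances):
    """Convert abundances to presence (1) / absence (0)."""
    return [1 if a > 0 else 0 for a in abundances]


def prevalence(flags):
    """Return (n positive, N total, prevalence in percent)."""
    n, total = sum(flags), len(flags)
    return n, total, (100.0 * n / total if total else 0.0)


def discordant_pairs(before, after):
    """Count patients who lost the ST (1 -> 0) and who gained it (0 -> 1)."""
    lost = sum(1 for x, y in zip(before, after) if x == 1 and y == 0)
    gained = sum(1 for x, y in zip(before, after) if x == 0 and y == 1)
    return lost, gained


# Toy abundances for one ST in four patients, same patient order at both timepoints.
admission = binarize([0.0, 12.5, 0.3, 0.0])
discharge = binarize([4.0, 0.0, 0.8, 1.1])
print(prevalence(admission), prevalence(discharge))
print(discordant_pairs(admission, discharge))
```

The pairing only makes sense if both vectors list the same patients in the same order, so sorting by PatientID before this step is essential.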

5. Execution Instructions (How to Reproduce)

Step 1: Prepare the Directory

Ensure your working directory contains the following raw files:

  • table_even9893.biom and ../clustering/rep_set.tre (16S data)
  • ../map3_corrected.txt (Metadata)
  • ../count_table_seq31_seq37_ST.txt (Epidome ST counts)
  • adiv_even.txt, adiv_even_A.txt, adiv_even_H.txt (Pre-calculated 16S alpha metrics from QIIME)
  • alpha_diversity_metrics_samples_ST.csv (Pre-calculated ST alpha metrics)

Step 2: Run the MicrobiotaProcess Pipeline (Beta/Alpha/Composition)

Open your terminal or RStudio and run the MicrobiotaProcess.R script. This will generate the figures_MP/ and figures_All_Combined/ directories.

Rscript MicrobiotaProcess.R

Outputs: PCoA plots, rarefaction curves, Bray-Curtis distance matrices, pairwise PERMANOVA Excel files, and Class-level heatmaps.

Step 3: Render the Comprehensive HTML Report

Open RStudio, load Phyloseq.Rmd, and render it. This integrates the outputs from Step 2, runs DESeq2, performs the Epidome statistical tests, and generates the final interactive report.

# In R Console:
rmarkdown::render('Phyloseq.Rmd', output_file = 'Phyloseq.html')

Outputs: Phyloseq.html (the master report), figures/ directory (DESeq2 plots, ST prevalence plots, goeBURST), and various .xlsx files for ST binary matrices and DEG tables.


6. Summary of Key Output Files

| File / Directory | Description |
|---|---|
| Phyloseq.html | Master Report. Contains all 16S and Epidome statistics, tables, and embedded figures. |
| figures_MP/ | High-res PCoA, rarefaction, alpha diversity, and composition heatmaps from MicrobiotaProcess. |
| figures_All_Combined/ | Pairwise PERMANOVA results (Bray_pairwise_PERMANOVA.xlsx) and combined PCoA plots. |
| figures/ | DESeq2 volcano plots, ST prevalence trend lines, and Alpha diversity boxplots. |
| DEGs_*.xlsx | Excel tables of differentially abundant OTUs for all 6 pairwise comparisons (3 per cohort). |
| ST_binary_matrix.xlsx | The 0/1 presence/absence matrix for ST2, ST5, and ST23 used for the McNemar analysis. |

Pipeline maintained by Jiabin Huang. For questions regarding the Epidome Python scripts or upstream QIIME1/DADA2 processing, refer to the respective bash/R scripts in the epidome/scripts/ directory.

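Statistical analysis: Cochran's Q test + paired McNemar tests (with BH correction)

Both tests are classic non-parametric methods designed for binary data (yes/no, positive/negative, detected/not detected) from paired or repeated-measures designs.

In your project, the same patient is sampled repeatedly at three timepoints (admission, surgery, discharge), which creates a natural paired/longitudinal structure. Your collaborator suggested converting the data to a 0/1 matrix precisely so that this family of tests can be used.

Below is a full explanation of how the two tests work, the logic behind them, and how they apply to your project.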


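I. McNemar's Test

1. What it is for

It tests whether the frequency of a binary outcome differs between two related (paired) samples. In your project, it compares the detection rate of a given ST (for example ST2) between two timepoints (for example admission vs. surgery).

2. Core logic: only the samples that changed matter

Many beginners assume that McNemar's test compares the overall positive rates of two groups. It does not. Its core logic is: if there is no difference between the two timepoints, the number of patients who switch from negative to positive should equal the number who switch from positive to negative.

Suppose we have $N$ patients and compare Time 1 (admission) with Time 2 (surgery). The data can be arranged in the following 2×2 paired contingency table:

| | Time 2: detected (1) | Time 2: not detected (0) | Total |
|---|---|---|---|
| Time 1: detected (1) | $a$ (detected at both) | $b$ (positive to negative) | $a+b$ |
| Time 1: not detected (0) | $c$ (negative to positive) | $d$ (not detected at either) | $c+d$ |
| Total | $a+c$ | $b+d$ | $N$ |

  • $a$ and $d$: patients whose status did not change. They carry no information about whether the timepoints differ.
  • $b$ and $c$: patients whose status changed (discordant pairs).
  • The essence of McNemar's test: it tests whether $b$ and $c$ are equal. If $b$ is much larger than $c$, the ST tended to be cleared or lost around surgery; if $c$ is much larger than $b$, the ST tended to be newly acquired around surgery.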

3. 统计量计算 在 $b+c$ 足够大(通常 $>25$)时,服从卡方分布: $$ \chi^2 = \frac{(b – c)^2}{b + c} $$ (注:如果 $b+c < 25$,卡方近似不准确,必须使用 McNemar’s Exact Test 精确检验,基于二项分布计算。)
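As a worked illustration of the statistic above, the following minimal Python sketch computes both the chi-square statistic and the exact two-sided binomial p-value from the two discordant counts. The function name `mcnemar` and the example counts are hypothetical; they are not taken from the project data or from the pipeline's R code.

```python
from math import comb


def mcnemar(b, c):
    """Return (chi-square statistic, exact two-sided p-value) for discordant counts b and c."""
    n = b + c
    if n == 0:
        return 0.0, 1.0  # no discordant pairs: nothing to test
    chi2 = (b - c) ** 2 / n
    # Exact test: under H0 each discordant pair is equally likely to go either way,
    # so the smaller count follows Binomial(n, 0.5).
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    p_exact = min(1.0, 2 * tail)
    return chi2, p_exact


# Hypothetical counts: 2 patients lost the ST (b), 10 patients gained it (c).
chi2, p_exact = mcnemar(b=2, c=10)
print(f"chi2 = {chi2:.3f}, exact p = {p_exact:.4f}")
```

Because $b+c = 12$ is below 25 in this example, the exact p-value is the one to report.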


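II. Cochran's Q Test

1. Purpose. McNemar's test can only compare two time points. When the study has three or more related time points (admission, surgery, discharge), running three uncorrected McNemar tests inflates the false-positive rate (the multiple-testing problem). Cochran's Q test extends McNemar's test to three or more time points and assesses whether the detection rate differs overall across them.

2. Core logic. It is analogous to a repeated-measures ANOVA, but for binary data. It measures how far the total number of positives at each time point deviates from the expected total and builds a statistic $Q$ that approximately follows a chi-square distribution with $k-1$ degrees of freedom for $k$ time points.

3. Strict assumptions (pitfalls to avoid)

  • Complete data (complete blocks): Cochran's Q test requires every subject to have data at all time points. If a patient has admission and discharge data but no surgery data, a standard Cochran's Q implementation will either raise an error or drop that patient's entire row.
  • Sample size: the number of patients must not be too small, otherwise the statistical power is very low.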
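To make the construction of $Q$ concrete, here is a minimal Python sketch using the standard formula $Q = (k-1)\,[k\sum_j C_j^2 - T^2] / [kT - \sum_i R_i^2]$, where $C_j$ are the per-time-point totals, $R_i$ the per-patient totals, and $T$ the grand total. The function name `cochran_q` and the 0/1 matrix are hypothetical. The p-value line relies on the closed form $e^{-Q/2}$, which is valid only for 2 degrees of freedom (three time points).

```python
from math import exp


def cochran_q(rows):
    """rows: one 0/1 list per patient, one entry per time point (complete cases only).

    Returns (Q, degrees of freedom).
    """
    k = len(rows[0])
    col_totals = [sum(r[j] for r in rows) for j in range(k)]
    row_totals = [sum(r) for r in rows]
    total = sum(row_totals)
    denom = k * total - sum(t * t for t in row_totals)
    if denom == 0:
        return 0.0, k - 1  # no patient changes status, so there is nothing to test
    q = (k - 1) * (k * sum(c * c for c in col_totals) - total * total) / denom
    return q, k - 1


# Hypothetical presence/absence matrix: admission, surgery, discharge.
matrix = [
    [0, 1, 1],
    [0, 1, 1],
    [0, 0, 1],
    [1, 1, 1],
    [0, 1, 0],
    [0, 0, 0],
]
q, df = cochran_q(matrix)
p = exp(-q / 2)  # chi-square survival function, valid for df == 2 only
print(f"Q = {q:.2f}, df = {df}, p = {p:.4f}")
```

With only two time points the same formula reduces to McNemar's chi-square statistic.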

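III. Standard Statistical Workflow for Your Project

For the three target STs (ST2, ST5, ST23) and the three time points, a rigorous analysis and reporting workflow looks like this:

Step 1: Overall assessment (omnibus test)

For each ST (e.g., ST2), run Cochran's Q test.

  • Null hypothesis ($H_0$): the detection rate of ST2 is identical at admission, surgery, and discharge.
  • Purpose: answers the question "Did the ST2 colonization status change over the course of the hospital stay?"
  • Reporting: if $p < 0.05$, the detection rate differs for at least one time point; if $p \geq 0.05$, there is usually considered to be no overall difference, but you may still proceed to Step 2 (post-hoc comparisons) when a clinical hypothesis justifies it.

Step 2: Post-hoc pairwise comparisons

Use McNemar's test (or the exact McNemar test) for the three pairwise comparisons:

  1. Admission vs. Surgery
  2. Surgery vs. Discharge
  3. Admission vs. Discharge

Step 3: Multiple-testing correction

Because three McNemar tests are run for ST2, the false-positive rate is inflated. Apply the Benjamini-Hochberg (BH) procedure to the three raw p-values to obtain $p_{adj}$ (FDR-adjusted).

  • Only when $p_{adj} < 0.05$ can the difference in detection rate between the two time points be declared statistically significant.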
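The BH adjustment in Step 3 can be sketched in a few lines of Python. This is a minimal illustration of the standard step-up procedure (adjusted p = raw p × m / rank, made monotone from the largest p-value downwards), not the project's actual R code. The function name `bh_adjust` and the three raw p-values are hypothetical.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices from smallest to largest p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each adjusted value
    # never exceeds the adjusted value of a larger raw p-value.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw McNemar p-values for one ST:
# admission vs. surgery, surgery vs. discharge, admission vs. discharge.
raw = [0.012, 0.300, 0.020]
for name, p_raw, p_adj in zip(["A vs S", "S vs D", "A vs D"], raw, bh_adjust(raw)):
    print(f"{name}: raw p = {p_raw:.3f}, BH-adjusted p = {p_adj:.3f}")
```

In this example the smallest raw p-value (0.012) is adjusted to 0.030 rather than 0.036, because the monotonicity step caps it at the adjusted value of the next-ranked test.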

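IV. Why the Collaborator's Suggestion Is Well Founded

In microbiome research and epidemiology, abundance and prevalence (detection rate) capture two distinct biological dimensions:

  1. Overcoming zero inflation and extreme values: some Staphylococcus epidermidis STs (e.g., ST2) may be at very low abundance or undetectable in the nasal samples of many patients (many zeros). If you run a Wilcoxon test or t-test on the abundance values, the result is easily skewed by a few "super carriers" (for example, one patient whose ST2 abundance jumps to 50%), which can invalidate the test. Converting to a 0/1 matrix removes the magnitude of abundance and asks only whether the ST is present, which is often more robust to such outliers.
  2. Direct clinical relevance (infection control): ST2 and ST5 are internationally recognized high-risk clonal lineages strongly associated with hospital-acquired infection (HAI), biofilm formation, and multidrug resistance (e.g., MRSE). From a clinical point of view, whether a patient is colonized by a high-risk ST matters more than how many bacteria are present. Analyzing the detection proportion (prevalence %) at the three time points addresses clinically valuable epidemiological questions:
    • "Does surgical intervention (and the hospital environment) increase the patient's risk of acquiring high-risk hospital ST clones such as ST2/ST5?" (if the detection rate at surgery or discharge is significantly higher than at admission).
    • "Did surgery or antibiotic use clear the patient's existing colonization?" (if the detection rate drops significantly).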

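V. Advanced Suggestion: What If the Data Contain Missing Values?

As noted above, Cochran's Q test is strict about missing values. In real clinical longitudinal studies, missing data caused by patient dropout or failed sample collection are common.

If some patients in your data lack a time point (e.g., only .1 and .3, but no .2), then in addition to the tryCatch fallback used in the code provided in the previous round, you can mention a generalized linear mixed model (GLMM) in the Methods section as a complement or alternative.

  • Advantages of a GLMM: it handles binary data (binomial distribution, family=binomial), with time point as a fixed effect and patient ID as a random intercept ((1|PatientID)).
  • Main benefit: a GLMM is fitted by maximum likelihood and tolerates data that are missing at random, so incomplete patients do not have to be dropped as they are for Cochran's Q. Reviewers of longitudinal cohort studies often prefer this approach.

Summary: the collaborator's suggestion moves your analysis from a descriptive account of microbial community changes to a study of pathogen colonization dynamics with clear clinical and epidemiological meaning. The combination of Cochran's Q + paired McNemar + BH correction is statistically sound and well suited to answering the questions reviewers are likely to raise.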

Processing Data_Patricia_Transposon_2025 v2 (Workflow for Structural Variant Calling in Nanopore Sequencing)

  1. Generate the HD46_Ctrl annotation

     mamba activate trycycler
     cd trycycler_HD46_Ctrl;
     trycycler cluster --threads 55 --assemblies assemblies/*.fasta --reads reads.fastq --out_dir trycycler;
    
     trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_001
     mv trycycler/cluster_001/1_contigs/J_ctg000010.fasta .
     mv trycycler/cluster_001/1_contigs/L_tig00000016.fasta .
     mv trycycler/cluster_001/1_contigs/R_tig00000001.fasta .
     mv trycycler/cluster_001/1_contigs/H_utg000001c.fasta .
    
     trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_002
     mv trycycler/cluster_002/1_contigs/*00000*.fasta .
     # --> Error: unable to find a suitable common sequence
    
     trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_003
     mv trycycler/cluster_003/1_contigs/F_tig00000004.fasta .
     mv trycycler/cluster_003/1_contigs/L_tig00000003.fasta .
    
     trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_004
     mv trycycler/cluster_004/1_contigs/J_ctg000000.fasta .
     mv trycycler/cluster_004/1_contigs/P_ctg000000.fasta .
     mv trycycler/cluster_004/1_contigs/S_contig_2.fasta .
    
     # Reconcile the remaining clusters (cluster_005 to cluster_029)
     for i in $(seq -w 5 29); do
         trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_0${i}
     done
    
     trycycler msa --threads 55 --cluster_dir trycycler/cluster_001
     trycycler msa --threads 55 --cluster_dir trycycler/cluster_004
    
     trycycler partition --threads 55 --reads reads.fastq --cluster_dirs trycycler/cluster_001
     trycycler partition --threads 55 --reads reads.fastq --cluster_dirs trycycler/cluster_004
    
     trycycler consensus --threads 55 --cluster_dir trycycler/cluster_001
     trycycler consensus --threads 55 --cluster_dir trycycler/cluster_004
    
     #Polish --> TODO: Need to be Debugged!
     for c in trycycler/cluster_001 trycycler/cluster_004; do
         medaka_consensus -i "$c"/4_reads.fastq -d "$c"/7_final_consensus.fasta -o "$c"/medaka  -m r941_min_sup_g507 -t 12
         mv "$c"/medaka/consensus.fasta "$c"/8_medaka.fasta
         rm -r "$c"/medaka "$c"/*.fai "$c"/*.mmi  # clean up
     done
     # cat trycycler/cluster_*/8_medaka.fasta > trycycler/consensus.fasta
    
     cp trycycler/cluster_001/7_final_consensus.fasta HD46_Ctrl_chr.fasta
     cp trycycler/cluster_004/7_final_consensus.fasta HD46_Ctrl_plasmid.fasta
  2. Install Mambaforge (recommended): https://conda-forge.org/miniforge/

     #download Mambaforge-24.9.2-0-Linux-x86_64.sh from website
     chmod +x Mambaforge-24.9.2-0-Linux-x86_64.sh
     ./Mambaforge-24.9.2-0-Linux-x86_64.sh
    
     To activate this environment, use:
         micromamba activate /home/jhuang/mambaforge
     Or to execute a single command in this environment, use:
         micromamba run -p /home/jhuang/mambaforge mycommand
     installation finished.
    
     Do you wish to update your shell profile to automatically initialize conda?
     This will activate conda on startup and change the command prompt when activated.
     If you'd prefer that conda's base environment not be activated on startup,
       run the following command when conda is activated:
    
     conda config --set auto_activate_base false
    
     You can undo this by running `conda init --reverse $SHELL`? [yes|no]
     [no] >>> yes
     no change     /home/jhuang/mambaforge/condabin/conda
     no change     /home/jhuang/mambaforge/bin/conda
     no change     /home/jhuang/mambaforge/bin/conda-env
     no change     /home/jhuang/mambaforge/bin/activate
     no change     /home/jhuang/mambaforge/bin/deactivate
     no change     /home/jhuang/mambaforge/etc/profile.d/conda.sh
     no change     /home/jhuang/mambaforge/etc/fish/conf.d/conda.fish
     no change     /home/jhuang/mambaforge/shell/condabin/Conda.psm1
     no change     /home/jhuang/mambaforge/shell/condabin/conda-hook.ps1
     no change     /home/jhuang/mambaforge/lib/python3.12/site-packages/xontrib/conda.xsh
     no change     /home/jhuang/mambaforge/etc/profile.d/conda.csh
     modified      /home/jhuang/.bashrc
     ==> For changes to take effect, close and re-open your current shell. <==
     no change     /home/jhuang/mambaforge/condabin/conda
     no change     /home/jhuang/mambaforge/bin/conda
     no change     /home/jhuang/mambaforge/bin/conda-env
     no change     /home/jhuang/mambaforge/bin/activate
     no change     /home/jhuang/mambaforge/bin/deactivate
     no change     /home/jhuang/mambaforge/etc/profile.d/conda.sh
     no change     /home/jhuang/mambaforge/etc/fish/conf.d/conda.fish
     no change     /home/jhuang/mambaforge/shell/condabin/Conda.psm1
     no change     /home/jhuang/mambaforge/shell/condabin/conda-hook.ps1
     no change     /home/jhuang/mambaforge/lib/python3.12/site-packages/xontrib/conda.xsh
     no change     /home/jhuang/mambaforge/etc/profile.d/conda.csh
     no change     /home/jhuang/.bashrc
     No action taken.
     WARNING conda.common.path.windows:_path_to(100): cygpath is not available, fallback to manual path conversion
     WARNING conda.common.path.windows:_path_to(100): cygpath is not available, fallback to manual path conversion
     Added mamba to /home/jhuang/.bashrc
     ==> For changes to take effect, close and re-open your current shell. <==
     Thank you for installing Mambaforge!
    
     Close your terminal window and open a new one, or run:
     #source ~/mambaforge/bin/activate
     conda --version
     mamba --version
    
     https://github.com/conda-forge/miniforge/releases
     Note
    
         * After installation, please make sure that you do not have the Anaconda default channels configured.
             conda config --show channels
             conda config --remove channels defaults
             conda config --add channels conda-forge
             conda config --show channels
             conda config --set channel_priority strict
             #conda clean --all
             conda config --remove channels biobakery
    
         * !!!!Do not install anything into the base environment as this might break your installation. See here for details.!!!!
    
     # --Deprecated method: mamba installing on conda--
     #conda install -n base --override-channels -c conda-forge mamba 'python_abi=*=*cp*'
     #    * Note that installing mamba into any other environment than base is not supported.
     #
     #conda activate base
     #conda install conda
     #conda uninstall mamba
     #conda install mamba

2. Install the required tools in the mamba environment

    * Sniffles2: Detect structural variants, including transposons, from long-read alignments.
    * RepeatModeler2: Identify and classify transposons de novo.
    * RepeatMasker: Annotate known transposable elements using transposon libraries.
    * SVIM: An alternative structural variant caller optimized for long-read sequencing, if needed.
    * SURVIVOR: Consolidate structural variants across samples for comparative analysis.

    mamba deactivate
    # Create a new conda environment
    mamba create -n transposon_long python=3.6 -y

    # Activate the environment
    mamba activate transposon_long

    mamba install -c bioconda sniffles
    mamba install -c bioconda repeatmodeler repeatmasker

    # configure repeatmasker database
    mamba info --envs
    cd /home/jhuang/mambaforge/envs/transposon_long/share/RepeatMasker

    #mamba install python=3.6
    mamba install -c bioconda svim
    mamba install -c bioconda survivor
  1. Test the installed tools

     # Check versions
     sniffles --version
     RepeatModeler -h
     RepeatMasker -h
     svim --help
     SURVIVOR --help
     mamba install -c conda-forge perl r
  2. Data Preparation

     Raw Signal Data: Nanopore devices generate electrical signal data as DNA passes through the nanopore.
     Basecalling: Tools like Guppy or Dorado are used to convert raw signals into nucleotide sequences (FASTQ files).
  3. Preprocessing

     Quality Filtering: Remove low-quality reads using tools like Filtlong or NanoFilt.
     Adapter Trimming: Identify and remove sequencing adapters with tools like Porechop.
  4. (Optional) Variant Calling for SNP and Indel Detection:

     Tools like Medaka, Longshot, or Nanopolish analyze the aligned reads to identify SNPs and small indels.
  5. (OFFICIAL STARTING POINT) Alignment and Structural Variant Calling: Long-read tools such as Sniffles or SVIM detect large insertions, deletions, and other structural variants (e.g., interspersed repeats).

       #NOTE that the ./batch1_depth25/trycycler_WT/reads.fastq and F24A430001437_BACctmoD/BGI_result/Separate/${sample}/1.Cleandata/${sample}.filtered_reads.fq.gz are the same!
    
       # -- PREPARING the input fastq data: merge the fastq.gz files and move the result to the top-level directory
    
       # Under raw_data/no_sample_id/20250731_0943_MN45170_FBD12615_97f118c2/fastq_pass
       zcat ./barcode01/FBD12615_pass_barcode01_97f118c2_aa46ecf7_0.fastq.gz ./barcode01/FBD12615_pass_barcode01_97f118c2_aa46ecf7_1.fastq.gz ./barcode01/FBD12615_pass_barcode01_97f118c2_aa46ecf7_2.fastq.gz ./barcode01/FBD12615_pass_barcode01_97f118c2_aa46ecf7_3.fastq.gz ... | gzip > HD46_1.fastq.gz
       mv ./raw_data/no_sample_id/20250731_0943_MN45170_FBD12615_97f118c2/fastq_pass/HD46_1.fastq.gz ~/DATA/Data_Patricia_Transposon_2025
    
         #these are the corresponding sample names:
         #barcode 1: HD46-1
         #barcode 2: HD46-2
         #barcode 3: HD46-3
         #barcode 4: HD46-4
         mv barcode01.fastq.gz HD46_1.fastq.gz
         mv barcode02.fastq.gz HD46_2.fastq.gz
         mv barcode03.fastq.gz HD46_3.fastq.gz
         mv barcode04.fastq.gz HD46_4.fastq.gz
    
       # -- CALCULATE the coverages
         #!/bin/bash
    
         for bam in barcode*_minimap2.sorted.bam; do
             echo "Processing $bam ..."
             avg_cov=$(samtools depth -a "$bam" | awk '{sum+=$3; cnt++} END {if (cnt>0) print sum/cnt; else print 0}')
             echo -e "${bam}\t${avg_cov}" >> coverage_summary.txt
         done
    
       # ---- !!!! LOGIN the suitable environment !!!! ----
       # !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! #
       mamba activate transposon_long
    
       # -- TODO: AFTERNOON_DEBUG_THIS: FAILED and not_USED: Alignment and detection of structural variants in each sample using SVIM with the aligner ngmlr or minimap2
       #mamba install -c bioconda ngmlr
       mamba install -c bioconda svim
    
       #SEARCH FOR "HD46_Ctrl_chr_plasmid.fasta" for finding the insertion-calling-commands
       # !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! for all 4 options #
       # ---- Option_1: minimap2 (aligner) + SVIM (structural variant caller) --> SUCCESSFUL ----
    
       for sample in HD46_1 HD46_2 HD46_3 HD46_4 HD46_5 HD46_6 HD46_7 HD46_8 HD46_13; do
           #INS,INV,DUP:TANDEM,DUP:INT,BND
           svim reads --aligner minimap2 --nanopore minimap2+svim_${sample}    ${sample}.fastq.gz HD46_Ctrl_chr_plasmid.fasta  --cores 20 --types INS --min_sv_size 100 --sequence_alleles --insertion_sequences --read_names;
       done
    
       #svim alignment svim_alignment_minmap2_1_re 1.sorted.bam CP020463_.fasta --types INS --sequence_alleles --insertion_sequences --read_names
    
       # ---- Option_2: minimap2 (aligner) + Sniffles2 (structural variant caller) --> SUCCESSFUL ----
       #Minimap2: A commonly used aligner for nanopore sequencing data.
       #    Align Long Reads to the WT Reference using Minimap2
       #sniffles -m WT.sorted.bam -v WT.vcf -s 10 -l 50 -t 60
       #  -s 10: Requires at least 10 reads to support an SV for reporting (lowered from 20).
       #  -l 50: Reports SVs that are at least 50 base pairs long.
       #  -t 60: Uses 60 threads for faster processing.
       #  NOTE: -m/-v/-s/-l are Sniffles 1.x options; Sniffles2 uses --input/--vcf/--minsupport/--minsvlen instead.
       for sample in HD46_1 HD46_2 HD46_3 HD46_4 HD46_5 HD46_6 HD46_7 HD46_8 HD46_13; do
           #minimap2 --MD -t 60 -ax map-ont HD46_Ctrl_chr_plasmid.fasta ./batch1_depth25/trycycler_${sample}/reads.fastq | samtools sort -o ${sample}.sorted.bam
           minimap2 --MD -t 60 -ax map-ont HD46_Ctrl_chr_plasmid.fasta ${sample}.fastq.gz | samtools sort -o ${sample}_minimap2.sorted.bam
           samtools index ${sample}_minimap2.sorted.bam
           sniffles -m ${sample}_minimap2.sorted.bam -v ${sample}_minimap2+sniffles.vcf -s 10 -l 50 -t 60
           #QUAL < 20 ||
           bcftools filter -e "INFO/SVTYPE != 'INS'" ${sample}_minimap2+sniffles.vcf > ${sample}_minimap2+sniffles_filtered.vcf
       done
    
         #Estimating parameter...
         #        Max dist between aln events: 44
         #        Max diff in window: 76
         #        Min score ratio: 2
         #        Avg DEL ratio: 0.0112045
         #        Avg INS ratio: 0.0364027
         #Start parsing... CP020463
         #                # Processed reads: 10000
         #                # Processed reads: 20000
         #        Finalizing  ..
         #Start genotype calling:
         #        Reopening Bam file for parsing coverage
         #        Finalizing  ..
         #Estimating parameter...
         #        Max dist between aln events: 28
         #        Max diff in window: 89
         #        Min score ratio: 2
         #        Avg DEL ratio: 0.013754
         #        Avg INS ratio: 0.17393
         #Start parsing... CP020463
         #                # Processed reads: 10000
         #                # Processed reads: 20000
         #                # Processed reads: 30000
         #                # Processed reads: 40000
    
         # Results:
         # * barcode01_minimap2+sniffles.vcf
         # * barcode01_minimap2+sniffles_filtered.vcf
         # * barcode02_minimap2+sniffles.vcf
         # * barcode02_minimap2+sniffles_filtered.vcf
         # * barcode03_minimap2+sniffles.vcf
         # * barcode03_minimap2+sniffles_filtered.vcf
         # * barcode04_minimap2+sniffles.vcf
         # * barcode04_minimap2+sniffles_filtered.vcf
    
       #ERROR: No MD string detected! Check bam file! Otherwise generate using e.g. samtools. --> No results!
       #for sample in barcode01 barcode02 barcode03 barcode04; do
       #    sniffles -m svim_reads_minimap2_${sample}/${sample}.fastq.minimap2.coordsorted.bam -v sniffles_minimap2_${sample}.vcf -s 10 -l 50 -t 60
       #    bcftools filter -e "INFO/SVTYPE != 'INS'" sniffles_minimap2_${sample}.vcf > sniffles_minimap2_${sample}_filtered.vcf
       #done
    
       # ---- Option_3: NGMLR (aligner) + SVIM (structural variant caller) --> SUCCESSFUL ----
       for sample in HD46_1 HD46_2 HD46_3 HD46_4 HD46_5 HD46_6 HD46_7 HD46_8 HD46_13; do
           svim reads --aligner ngmlr --nanopore    ngmlr+svim_${sample}       ${sample}.fastq.gz HD46_Ctrl_chr_plasmid.fasta  --cores 10;
       done
    
       # ---- Option_4: NGMLR (aligner) + sniffles (structural variant caller) --> SUCCESSFUL ----
       for sample in HD46_1 HD46_2 HD46_3 HD46_4 HD46_5 HD46_6 HD46_7 HD46_8 HD46_13; do
           sniffles -m ngmlr+svim_${sample}/${sample}.fastq.ngmlr.coordsorted.bam -v ${sample}_ngmlr+sniffles.vcf -s 10 -l 50 -t 60
           bcftools filter -e "INFO/SVTYPE != 'INS'" ${sample}_ngmlr+sniffles.vcf > ${sample}_ngmlr+sniffles_filtered.vcf
       done
    
       #END
  6. Compare and integrate all results produced by minimap2+sniffles and ngmlr+sniffles, and check each position in IGV!

     # !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! #
     # Rename HD46_<n>_* to HD46-<n>_* for the filtered VCFs of both pipelines
     for n in 1 2 3 4 5 6 7 8 13; do
         for caller in minimap2+sniffles ngmlr+sniffles; do
             mv HD46_${n}_${caller}_filtered.vcf HD46-${n}_${caller}_filtered.vcf
         done
     done
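Before opening IGV, the two call sets for a sample can be compared programmatically. The sketch below is one hedged way to do this in Python: it reads insertion records from two VCFs and reports which positions are shared (within a tolerance window) and which are unique to one pipeline. The function names, the 50 bp window, and the example records are assumptions made for illustration; real input would be the lines of files such as HD46-1_minimap2+sniffles_filtered.vcf and HD46-1_ngmlr+sniffles_filtered.vcf.

```python
def read_insertions(vcf_lines):
    """Return (chrom, pos) for every record whose INFO field has SVTYPE=INS."""
    calls = []
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        # INFO is column 8; flag-style entries without "=" are ignored.
        info = dict(kv.split("=", 1) for kv in fields[7].split(";") if "=" in kv)
        if info.get("SVTYPE") == "INS":
            calls.append((fields[0], int(fields[1])))
    return calls


def near(call, others, window):
    """True if another call lies on the same contig within `window` bp."""
    chrom, pos = call
    return any(chrom == c and abs(pos - p) <= window for c, p in others)


def compare_calls(calls_a, calls_b, window=50):
    """Split two call sets into shared, only-in-a, and only-in-b."""
    shared = [c for c in calls_a if near(c, calls_b, window)]
    only_a = [c for c in calls_a if not near(c, calls_b, window)]
    only_b = [c for c in calls_b if not near(c, calls_a, window)]
    return shared, only_a, only_b


# Hypothetical records standing in for the two filtered VCFs of one sample.
minimap2_vcf = [
    "##fileformat=VCFv4.2",
    "CP020463\t1855752\t1\tN\t<INS>\t.\tPASS\tSVTYPE=INS;SVLEN=999",
    "CP020463\t2422820\t2\tN\t<INS>\t.\tPASS\tSVTYPE=INS;SVLEN=999",
]
ngmlr_vcf = [
    "CP020463\t1855760\t1\tN\t<INS>\t.\tPASS\tSVTYPE=INS;SVLEN=1001",
    "CP020463\t500000\t2\tN\t<INS>\t.\tPASS\tPRECISE;SVTYPE=INS;SVLEN=120",
]
shared, only_minimap2, only_ngmlr = compare_calls(
    read_insertions(minimap2_vcf), read_insertions(ngmlr_vcf)
)
print("shared:", shared)
print("minimap2 only:", only_minimap2)
print("ngmlr only:", only_ngmlr)
```

To run it on real files, pass an open file handle instead of a list, for example `read_insertions(open(path))`. Positions reported by only one pipeline are the ones most worth checking in IGV.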
  7. (NOT_USED) Filtering low-complexity insertions using RepeatMasker (TODO: how to use RepeatModeler to generate own lib?)

       python vcf_to_fasta.py variants.vcf variants.fasta
       #python filter_low_complexity.py variants.fasta filtered_variants.fasta retained_variants.fasta
       #Using RepeatMasker to filter the low-complexity fasta, the used h5 lib is
       /home/jhuang/mambaforge/envs/transposon_long/share/RepeatMasker/Libraries/Dfam.h5    #1.9G
       python /home/jhuang/mambaforge/envs/transposon_long/share/RepeatMasker/famdb.py -i /home/jhuang/mambaforge/envs/transposon_long/share/RepeatMasker/Libraries/Dfam.h5 names 'bacteria' | head
       Exact Matches
       =============
       2 bacteria (blast name), Bacteria (scientific name), eubacteria (genbank common name), Monera (in-part), Procaryotae (in-part), Prokaryota (in-part), Prokaryotae (in-part), prokaryote (in-part), prokaryotes (in-part)

       Non-exact Matches
       =================
       1783272 Terrabacteria group (scientific name)
       91061 Bacilli (scientific name), Bacilli Ludwig et al. 2010 (authority), Bacillus/Lactobacillus/Streptococcus group (synonym), Firmibacteria (synonym), Firmibacteria Murray 1988 (authority)
       1239 Bacillaeota (synonym), Bacillaeota Oren et al. 2015 (authority), Bacillota (synonym), Bacillus/Clostridium group (synonym), clostridial firmicutes (synonym), Clostridium group firmicutes (synonym), Firmacutes (synonym), firmicutes (blast name), Firmicutes (scientific name), Firmicutes corrig. Gibbons and Murray 1978 (authority), Low G+C firmicutes (synonym), low G+C Gram-positive bacteria (common name), low GC Gram+ (common name)

       Summary of Classes within Firmicutes:
           * Bacilli (includes many common pathogenic and non-pathogenic Gram-positive bacteria, taxid=91061)
               * Bacillus (e.g., Bacillus subtilis, Bacillus anthracis)
               * Staphylococcus (e.g., Staphylococcus aureus, Staphylococcus epidermidis)
               * Streptococcus (e.g., Streptococcus pneumoniae, Streptococcus pyogenes)
               * Listeria (e.g., Listeria monocytogenes)
           * Clostridia (includes many anaerobic species like Clostridium and Clostridioides)
           * Erysipelotrichia (intestinal bacteria, some pathogenic)
           * Tissierellia (less-studied, veterinary relevance)
           * Mollicutes (cell wall-less, includes Mycoplasma species)
           * Negativicutes (includes some Gram-negative, anaerobic species)

       RepeatMasker -species Bacilli -pa 4 -xsmall variants.fasta
       python extract_unmasked_seq.py variants.fasta.masked unmasked_variants.fasta

       #bcftools filter -i 'QUAL>30 && INFO/SVLEN>100' variants.vcf -o filtered.vcf
       #
       #bcftools view -i 'SVTYPE="INS"' variants.vcf | bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%INFO\n' > insertions.txt
       #mamba install -c bioconda vcf2fasta
       #vcf2fasta variants.vcf -o insertions.fasta
       #grep "SEQS" variants.vcf | awk '{ print $1, $2, $4, $5, $8 }' > insertions.txt
       #python3 filtering_low_complexity.py
       #
       #vcftools --vcf input.vcf --recode --out filtered_output --minSVLEN 100
       #bcftools filter -e 'INFO/SEQS ~ "^(G+|C+|T+|A+){4,}"' variants.vcf -o filtered.vcf

       # -- calculate the percentage of reads
       To calculate the percentage of reads that contain the insertion from the VCF entry, use the INFO and FORMAT fields provided in the VCF record.

       Step 1: Extract Relevant Information
       In the provided VCF entry:
           * RE (Reads Evidence) = 733: the total number of reads supporting the insertion.
           * GT (Genotype) = 1/1: a homozygous insertion, meaning all reads covering this region are expected to have the insertion.
           * AF (Allele Frequency) = 1: a 100% allele frequency, indicating that every read in this sample supports the insertion.
           * DR (Depth Reference) = 0: the number of reads supporting the reference allele.
           * DV (Depth Variant) = 733: the number of reads supporting the variant allele (insertion).

       Step 2: Calculate Percentage of Reads Supporting the Insertion
       Using the formula:
           $$ \text{Percentage of reads with insertion} = \frac{DV}{DR + DV} \times 100 $$
       Substitute the values:
           $$ \text{Percentage} = \frac{733}{0 + 733} \times 100 = 100\% $$

       Conclusion
       Based on the VCF record, 100% of the reads support the insertion, indicating that the insertion is fully present in the sample (homozygous insertion). This is consistent with the AF=1 and GT=1/1 fields.

       * In your VCF file generated by Sniffles, the REF=N in the results has a specific meaning:
           * In a standard VCF, the REF field usually contains the reference base(s) at the variant position.
           * For structural variants (SVs), especially insertions, there is no reference sequence replaced; the insertion occurs between reference bases.
           * Therefore, Sniffles uses N as a placeholder in the REF field to indicate "no reference base replaced".
           * The actual inserted sequence is then stored in the ALT field.
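The read-percentage calculation can be applied to whole VCF records with a short Python sketch. It assumes a single-sample VCF whose FORMAT column contains DR and DV, as in the GT:DR:DV layout described in these notes. The function name `insertion_read_percentage` and the two example records are illustrative stand-ins, not lines copied from a real output file.

```python
def insertion_read_percentage(vcf_line):
    """Percentage of reads supporting the variant, computed as DV / (DR + DV) * 100."""
    fields = vcf_line.rstrip("\n").split("\t")
    # Column 9 holds the FORMAT keys, column 10 the values for the (single) sample.
    sample = dict(zip(fields[8].split(":"), fields[9].split(":")))
    dr, dv = int(sample["DR"]), int(sample["DV"])
    total = dr + dv
    return 100.0 * dv / total if total else float("nan")


# Illustrative records: one fully supported insertion, one weakly supported call.
full = "CP020463\t1855752\t0\tN\t<INS>\t.\tPASS\tSVTYPE=INS;RE=733;AF=1\tGT:DR:DV\t1/1:0:733"
weak = "CP020463\t2422820\t1\tN\t<INS>\t.\tUNRESOLVED\tSVTYPE=INS;RE=22;AF=0.025522\tGT:DR:DV\t0/0:840:22"

for record in (full, weak):
    print(f"{insertion_read_percentage(record):.2f}% of reads support the insertion")
```

For the second record the result (about 2.55%) agrees with its AF value of 0.025522, which is a quick consistency check on the parsed fields.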
  8. Why some records have UNRESOLVED in the FILTER field in the Excel output.

     1. Understanding the format
    
         The data appears to be structural variant (SV) calls from Sniffles, probably in a VCF-like tabular format exported to Excel:
    
             * gi|1176884116|gb|CP020463.1| → reference sequence
             * Positions: 1855752 and 2422820
             * N → placeholder in the REF column (the record is an insertion)
             * SVLEN=999 → size of the insertion
             * AF → allele frequency
             * GT:DR:DV → genotype, depth reference, depth variant (1/1:0:678, example values for a PASS variant)
             * FILTER → whether the variant passed filters (UNRESOLVED means it didn’t pass)
    
     2. What UNRESOLVED usually means
    
         In Sniffles:
    
         * UNRESOLVED is assigned to SVs when the tool cannot confidently resolve the exact sequence or breakpoint.
         * Reasons include:
             - Low read support (RE, DV) relative to the expected coverage
             - Ambiguous alignment at repetitive regions
             - Conflicting strand or orientation signals
             - Allele frequency inconsistent with expectations
    
     3. Examine your two records
    
         First record
    
             POS: 1855752
             SVTYPE: INS
             SVLEN: 999
             RE: 68
             AF: 1
             GT: 1/1
             FILTER: UNRESOLVED
    
         Observations:
    
         * AF = 1 → allele frequency 100%, homozygous insertion
         * RE = 68 → 68 reads support the variant, decent coverage
         * Still UNRESOLVED → likely because Sniffles could not resolve the inserted sequence precisely; sometimes long insertions in repetitive regions are hard to reconstruct fully even with good read support.
    
         Second record
    
             POS: 2422820
             SVTYPE: INS
             SVLEN: 999
             RE: 22
             AF: 0.025522
             GT: 0/0
             FILTER: UNRESOLVED
    
         Observations:
    
         * AF = 0.0255 → very low allele frequency (~2.5%)
         * RE = 22, DR = 840 → very low variant reads vs reference
         * GT = 0/0 → homozygous reference
         * Sniffles marks it UNRESOLVED because the variant is essentially noise, not confidently detected.
    
     4. Key difference between the two
         Feature                 First record                     Second record
         Allele frequency (AF)   1 (high)                         0.0255 (very low)
         Variant reads (RE)      68                               22
         Genotype (GT)           1/1                              0/0
         Reason for UNRESOLVED   Unresolvable inserted sequence   Signal too weak relative to the reference reads (likely noise)
    
     5. Conclusion
    
         * Sniffles marks a variant as UNRESOLVED when the SV cannot be confidently characterized.
         * Even if there is good read support (first record), complex insertions can’t always be reconstructed fully.
         * Very low allele frequency (second record) also triggers UNRESOLVED because the signal is too weak compared to background noise.
         * Essentially: “UNRESOLVED” ≠ bad data, it’s just unresolved uncertainty.
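The distinction drawn above between the two kinds of UNRESOLVED records can be turned into a small triage step. The Python sketch below separates UNRESOLVED calls with high allele frequency (worth inspecting in IGV) from those with very low allele frequency (likely noise). The function name `triage_unresolved` and the 0.5 cut-off are assumptions chosen for illustration, not Sniffles defaults. The first two records use the POS, RE, and AF values discussed above, and the third is a hypothetical PASS record.

```python
def triage_unresolved(vcf_lines, min_af=0.5):
    """Split UNRESOLVED records into (positions to inspect, positions likely noise) by AF."""
    inspect, noise = [], []
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        if fields[6] != "UNRESOLVED":
            continue  # FILTER is column 7; other records are left alone
        info = dict(kv.split("=", 1) for kv in fields[7].split(";") if "=" in kv)
        if "AF" not in info:
            continue  # cannot classify without an allele frequency
        target = inspect if float(info["AF"]) >= min_af else noise
        target.append(int(fields[1]))
    return inspect, noise


records = [
    "CP020463\t1855752\t1\tN\t<INS>\t.\tUNRESOLVED\tSVTYPE=INS;SVLEN=999;RE=68;AF=1",
    "CP020463\t2422820\t2\tN\t<INS>\t.\tUNRESOLVED\tSVTYPE=INS;SVLEN=999;RE=22;AF=0.025522",
    "CP020463\t100000\t3\tN\t<INS>\t.\tPASS\tSVTYPE=INS;SVLEN=300;RE=50;AF=1",  # hypothetical
]
inspect, noise = triage_unresolved(records)
print("inspect in IGV:", inspect)
print("likely noise:", noise)
```

The cut-off should be adapted to the expected biology; a mixed population could legitimately carry an insertion at intermediate allele frequency.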
  9. (NOT_SURE_HOW_TO_USE) Polishing of assembly: Use tools like Medaka to refine variant calls by leveraging consensus sequences derived from nanopore data.

       mamba install -c bioconda medaka
       medaka_consensus -i reads.fastq -d reference.fasta -o polished_output -t 4
  10. Compare Insertions Across Samples

     Merge Variants Across Samples: Use SURVIVOR to merge and compare the detected insertions in all samples against the WT:
    
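     # Arguments: VCF list, max breakpoint distance (1000 bp), min supporting callers (1), require same SV type (1), require same strand (1), estimate distance from SV size (0 = off), min SV size (30 bp), output VCF
     SURVIVOR merge input_vcfs.txt 1000 1 1 1 0 30 merged.vcf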
    
         Input: List of VCF files from Sniffles2.
         Output: A consolidated VCF file with shared and unique variants.
    
     Filter WT Insertions:
    
         Identify transposons present only in samples 1–9 by subtracting WT variants using bcftools:
    
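             # bcftools isec requires bgzip-compressed, indexed inputs; records private to the second file are written to comparison_results/0001.vcf
             bcftools isec WT.vcf.gz merged.vcf.gz -p comparison_results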
  11. Validate and Visualize

     Visualize with IGV: Use IGV to inspect insertion sites in the alignment and confirm quality.
    
     igv.sh
    
     Validate Findings:
         Perform PCR or additional sequencing for key transposon insertion sites to confirm results.
  12. Alternatives to TEPID for Long-Read Data

     If you’re looking for transposon-specific tools for long reads:
    
         REPET: A robust transposon annotation tool compatible with assembled genomes.
         EDTA (Extensive de novo TE Annotator):
             A pipeline to identify, classify, and annotate transposons.
             Works directly on your assembled genomes.
    
             perl EDTA.pl --genome WT.fasta --type all
  13. The WT.vcf file in the pipeline is generated by detecting structural variants (SVs) in the wild-type (WT) genome aligned against itself or using it as a baseline reference. Here’s how you can generate the WT.vcf:

     Steps to Generate WT.vcf
     1. Align WT Reads to the WT Reference Genome
    
     The goal here is to create an alignment of the WT sequencing data to the WT reference genome to detect any self-contained structural variations, such as native insertions, deletions, or duplications.
    
     Command using Minimap2:
    
     minimap2 -ax map-ont WT.fasta WT_reads.fastq | samtools sort -o WT.sorted.bam
    
     Index the BAM file:
    
     samtools index WT.sorted.bam
    
     2. Detect Structural Variants with Sniffles2
    
     Run Sniffles2 on the WT alignment to call structural variants:
    
     sniffles --input WT.sorted.bam --vcf WT.vcf
    
     This step identifies:
    
         Native transposons and insertions present in the WT genome.
         Other structural variants that are part of the reference genome or sequencing artifacts.
    
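     Key parameters to consider (Sniffles2 names; Sniffles1 used --min_support, --max_distance and --min_length, so confirm against sniffles --help for your version):
    
         --minsupport: Adjust based on your WT sequencing coverage.
         --cluster-merge-pos: Define proximity for merging variants.
         --minsvlen: Set a minimum SV size (e.g., >50 bp for transposons).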
  14. Clean and Filter the WT.vcf: Remove low-confidence variants based on read depth, quality scores, or allele frequency.

     To ensure the WT.vcf only includes relevant transposons or SVs:
    
         Use bcftools or similar tools to filter out low-confidence variants:
    
         bcftools filter -e "QUAL < 20 || INFO/SVTYPE != 'INS'" WT.vcf > WT_filtered.vcf
         bcftools filter -e "QUAL < 1 || INFO/SVTYPE != 'INS'" 1_.vcf > 1_filtered_.vcf
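To make the exclusion logic of the bcftools expression above easier to read, here is a minimal Python sketch of the same rule: drop a record when QUAL is below a threshold or when SVTYPE is not INS. It is an illustration only. The function names and toy records are hypothetical, and keeping records with a missing QUAL (".") is a choice made for this sketch.

```python
from typing import List


def keep_record(line: str, min_qual: float = 20.0) -> bool:
    """Return True if a VCF body line is an insertion with QUAL >= min_qual."""
    fields = line.rstrip("\n").split("\t")
    qual, info = fields[5], fields[7]
    # Turn "KEY=VALUE;KEY=VALUE;FLAG" into a dict, ignoring flag-only entries.
    info_map = dict(item.split("=", 1) for item in info.split(";") if "=" in item)
    if info_map.get("SVTYPE") != "INS":
        return False
    if qual != "." and float(qual) < min_qual:
        return False
    return True


def filter_vcf(lines: List[str], min_qual: float = 20.0) -> List[str]:
    """Keep all header lines and only those records that pass keep_record."""
    return [ln for ln in lines if ln.startswith("#") or keep_record(ln, min_qual)]


# Toy records (hypothetical).
vcf = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "contig1\t100\t.\tN\t<INS>\t35\tPASS\tSVTYPE=INS;SVLEN=1300",
    "contig1\t200\t.\tN\t<INS>\t12\tPASS\tSVTYPE=INS;SVLEN=900",
    "contig1\t300\t.\tN\t<DEL>\t50\tPASS\tSVTYPE=DEL;SVLEN=-400",
    "contig1\t400\t.\tN\t<INS>\t.\tPASS\tSVTYPE=INS;SVLEN=700",
]

filtered = filter_vcf(vcf)
kept_positions = [int(ln.split("\t")[1]) for ln in filtered if not ln.startswith("#")]
print(kept_positions)
```

Lowering min_qual to 1 reproduces the more permissive threshold used for the sample VCF in the second command.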
  15. NOTE that in this pipeline, the WT.fasta (reference genome) is typically a high-quality genome sequence from a database or a well-annotated version of your species’ genome. It is not assembled from the WT.fastq sequencing reads in this context. Here’s why:

     Why Use a Reference Genome (WT.fasta) from a Database?
    
         Higher Quality and Completeness:
             Database references (e.g., NCBI, Ensembl) are typically well-assembled, highly polished, and annotated. They serve as a reliable baseline for variant detection.
    
         Consistency:
             Using a standard reference ensures consistent comparisons across your WT and samples (1–9). Variants detected will be relative to this reference, not influenced by possible assembly errors.
    
         Saves Time:
             Assembling a reference genome from WT reads requires significant computational effort. Using an existing reference streamlines the analysis.
    
     Alternative: Assembling WT from FASTQ
    
     If you don’t have a high-quality reference genome (WT.fasta) and must rely on your WT FASTQ reads:
    
         Assemble the genome from your WT.fastq:
             Use long-read assemblers like Flye, Canu, or Shasta to create a draft genome.
    
         flye --nano-raw WT.fastq --out-dir WT_assembly --genome-size 
         Polish the assembly:
             Use tools like Racon (with the same reads) or Medaka for higher accuracy.
    
         Use the assembled and polished genome as your WT.fasta reference for further steps.
    
     Key Takeaways
    
         If you have access to a reliable, high-quality reference genome, use it as the WT.fasta.
         Only assemble WT.fasta from raw reads (WT.fastq) if no database reference is available for your organism.
  16. Annotate Transposable Elements: Tools like ANNOVAR or SnpEff provide functional insights into the detected variants.

     # !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! #
     #Using snpEff to annotate the insertion!
     conda activate /home/jhuang/miniconda3/envs/spandx
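# --> BUG: in the original GenBank header the date is wrapped onto a second line and ACCESSION/VERSION are empty: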
LOCUS       HD46_Ctrl 2707468 bp    DNA     circular BCT
    02-OCT-2025
DEFINITION  Staphylococcus epidermidis strain HD46-ctrl chromosome, whole
            genome shotgun sequence.
ACCESSION
VERSION

# --> DEBUG: adapt the genbank-file header as follows:
LOCUS       HD46_Ctrl 2707468 bp    DNA     circular BCT 02-OCT-2025
DEFINITION  Staphylococcus epidermidis strain HD46-ctrl chromosome, whole
            genome shotgun sequence.
ACCESSION   HD46_Ctrl
VERSION     HD46_Ctrl.1
DBLINK      BioProject: PRJNA1337321
            BioSample: SAMN52215988
KEYWORDS    .
SOURCE      Staphylococcus epidermidis
  ORGANISM  Staphylococcus epidermidis
            Bacteria; Firmicutes; Bacilli; Bacillales; Staphylococcaceae;
            Staphylococcus.
COMMENT     Annotated genome for HD46_Ctrl.
...
    # !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! #
    mkdir ~/miniconda3/envs/spandx/share/snpeff-5.1-2/data/HD46_Ctrl
    cp HD46_Ctrl_chr.gb ~/miniconda3/envs/spandx/share/snpeff-5.1-2/data/HD46_Ctrl/genes.gbk

    vim ~/miniconda3/envs/spandx/share/snpeff-5.1-2/snpEff.config  # add the line: HD46_Ctrl.genome : HD46_Ctrl
    /home/jhuang/miniconda3/envs/spandx/bin/snpEff build -genbank HD46_Ctrl -d

    sed -i 's/^cluster_001_consensus/HD46_Ctrl.1/' HD46-8_ngmlr+sniffles_filtered.vcf
    sed -i 's/^cluster_001_consensus/HD46_Ctrl.1/' HD46-13_ngmlr+sniffles_filtered.vcf
    #snpEff eff -nodownload -no-downstream -no-intergenic -ud 100 -v HD46_Ctrl HD46-8_ngmlr+sniffles_filtered.vcf > HD46-8_ngmlr+sniffles_filtered.annotated.vcf
    #snpEff eff -nodownload -no-downstream -no-intergenic -ud 100 -v HD46_Ctrl HD46-13_ngmlr+sniffles_filtered.vcf > HD46-13_ngmlr+sniffles_filtered.annotated.vcf

    # HD46-8
    snpEff ann -Xmx8g -v -hgvs -canon -ud 200 \
    -stats HD46-8_snpeff_stats.html \
    HD46_Ctrl \
    HD46-8_ngmlr+sniffles_filtered.vcf \
    > HD46-8_ngmlr+sniffles_filtered.annotated.vcf

    # HD46-13
    snpEff ann -Xmx8g -v -hgvs -canon -ud 200 \
    -stats HD46-13_snpeff_stats.html \
    HD46_Ctrl \
    HD46-13_ngmlr+sniffles_filtered.vcf \
    > HD46-13_ngmlr+sniffles_filtered.annotated.vcf
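The GenBank header problem noted in this step (date wrapped off the LOCUS line, empty ACCESSION and VERSION) can be checked before running snpEff build. The following is a minimal sketch of such a check; check_genbank_header is a hypothetical helper, not part of snpEff, and it only looks at the three fields that were repaired by hand above.

```python
import re
from typing import List


def check_genbank_header(text: str) -> List[str]:
    """Return a list of problems found in the LOCUS, ACCESSION and VERSION lines."""
    problems = []
    lines = text.splitlines()

    locus = next((ln for ln in lines if ln.startswith("LOCUS")), None)
    if locus is None:
        problems.append("missing LOCUS line")
    elif not re.search(r"\d{2}-[A-Z]{3}-\d{4}\s*$", locus):
        # The date (e.g. 02-OCT-2025) is expected at the end of the LOCUS line itself.
        problems.append("LOCUS line does not end with a date")

    for key in ("ACCESSION", "VERSION"):
        entry = next((ln for ln in lines if ln.startswith(key)), None)
        if entry is None or not entry[len(key):].strip():
            problems.append(f"{key} is empty or missing")
    return problems


broken = """LOCUS       HD46_Ctrl 2707468 bp    DNA     circular BCT
    02-OCT-2025
DEFINITION  Staphylococcus epidermidis strain HD46-ctrl chromosome, whole
            genome shotgun sequence.
ACCESSION
VERSION
"""

fixed = """LOCUS       HD46_Ctrl 2707468 bp    DNA     circular BCT 02-OCT-2025
DEFINITION  Staphylococcus epidermidis strain HD46-ctrl chromosome, whole
            genome shotgun sequence.
ACCESSION   HD46_Ctrl
VERSION     HD46_Ctrl.1
"""

print(check_genbank_header(broken))
print(check_genbank_header(fixed))
```

The VERSION value matters later in this step because the VCF contig name is rewritten to HD46_Ctrl.1 with sed so that it matches the annotation database.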
  1. Summarize the results as an Excel file

    # !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! #
    conda activate plot-numpy1
    #python generate_common_vcf.py
    #mv common_variants.xlsx putative_transposons.xlsx
    
    # * Reads each of your VCFs.
    # * Filters variants → only keep those with FILTER == PASS.
    # * Compares the two aligner methods (minimap2+sniffles2 vs ngmlr+sniffles2) per sample.
    # * Keeps only variants that appear in both methods for the same sample.
    # * Outputs: An Excel file with the common variants and a log text file listing which variants were filtered out, and why (not_PASS or not_COMMON_in_two_VCF).
    
    #python generate_fuzzy_common_vcf_v1.py
    #Sample          PASS_minimap2  PASS_ngmlr  COMMON
    #HD46-Ctrl_Ctrl  39             29          28
    #HD46-1          39             32          29
    #HD46-2          40             32          28
    #HD46-3          38             30          27
    #HD46-4          46             35          32
    #HD46-5          40             35          31
    #HD46-6          43             35          30
    #HD46-7          40             33          28
    #HD46-8          37             20          11
    #HD46-13         39             38          27
    
    #Sample          PASS_minimap2  PASS_ngmlr  COMMON_FINAL
    #HD46-Ctrl_Ctrl  39             29          6
    #HD46-1          39             32          8
    #HD46-2          40             32          8
    #HD46-3          38             30          6
    #HD46-4          46             35          8
    #HD46-5          40             35          9
    #HD46-6          43             35          10
    #HD46-7          40             33          8
    #HD46-8          37             20          4
    #HD46-13         39             38          5
    
    # !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! #
    #!!!! Summarize the results of ngmlr+sniffles !!!!
    python merge_ngmlr+sniffles_filtered_results_and_summarize.py
    
    #!!!! Post-Processing !!!!
    #DELETE "2186168    N    . PASS" in Sheet HD46-13 and Summary
    #DELETE "2427785 N CGTCAGAATCGCTGTCTGCGTCCGAGTCACTGTCTGAGTCTGAATCACTATCTGCGTCTGAGTCACTGTCTG . PASS" due to "0/1:169:117" in HD46-13 and Summary
    #DELETE "2441640 N GCTCATTAAGAATCATTAAATTAC . PASS" due to "0/1:170:152" in HD46-13 and Summary
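The comparison described in this step (keep only PASS calls, then keep only calls found by both minimap2+sniffles2 and ngmlr+sniffles2) can be sketched as below. This is not the source of generate_fuzzy_common_vcf_v1.py, which is not shown here. The position tolerance, the one-to-one matching rule, the function name and the toy calls are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Call = Tuple[str, int, str]  # (CHROM, POS, FILTER)
LogEntry = Tuple[str, str, int, str]  # (method, CHROM, POS, reason)


def compare_callsets(minimap2_calls: List[Call], ngmlr_calls: List[Call], tol: int = 10):
    """Count PASS calls per method and the calls shared within `tol` bp (assumed tolerance)."""
    log: List[LogEntry] = []

    def passing(calls: List[Call], method: str) -> List[Tuple[str, int]]:
        kept = []
        for chrom, pos, flt in calls:
            if flt == "PASS":
                kept.append((chrom, pos))
            else:
                log.append((method, chrom, pos, "not_PASS"))
        return kept

    a = passing(minimap2_calls, "minimap2")
    b = passing(ngmlr_calls, "ngmlr")

    unused = list(b)  # each ngmlr call may be matched at most once
    common = []
    for chrom, pos in a:
        hit = next((v for v in unused if v[0] == chrom and abs(v[1] - pos) <= tol), None)
        if hit is None:
            log.append(("minimap2", chrom, pos, "not_COMMON_in_two_VCF"))
        else:
            unused.remove(hit)
            common.append((chrom, pos))
    for chrom, pos in unused:
        log.append(("ngmlr", chrom, pos, "not_COMMON_in_two_VCF"))

    counts: Dict[str, int] = {
        "PASS_minimap2": len(a),
        "PASS_ngmlr": len(b),
        "COMMON": len(common),
    }
    return counts, common, log


# Toy calls (hypothetical).
mm2 = [("contig1", 100, "PASS"), ("contig1", 500, "PASS"), ("contig1", 900, "q5")]
ngm = [("contig1", 104, "PASS"), ("contig1", 2000, "PASS")]

counts, common, log = compare_callsets(mm2, ngm)
print(counts)
for entry in log:
    print(entry)
```

The returned counts correspond to one row of the per-sample tables above, and the log carries the two rejection reasons named in the comments.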

  2. Source code of add_ann_to_excel.py

    python add_ann_to_excel.py \
        --excel merged_ngmlr+sniffles_variants.xlsx \
        --sheet8 "HD46-8" \
        --sheet13 "HD46-13" \
        --vcf8 HD46-8_ngmlr+sniffles_filtered.annotated.vcf \
        --vcf13 HD46-13_ngmlr+sniffles_filtered.annotated.vcf \
        --out merged_ngmlr+sniffles_variants_with_ANN.xlsx
    
    #!/usr/bin/env python3
    # -*- coding: utf-8 -*-
    """
    Add SnpEff ANN columns (for SVTYPE=INS) from annotated VCFs into an Excel workbook,
    with detailed debug about why CHROM+POS may not match.
    
    Key improvements:
    - Stronger CHROM/POS normalization (strip 'chr', unify MT naming, coerce numbers).
    - Explicit detection and logging of sheet key columns used.
    - Debug block prints:
    * Unique key counts in sheet vs VCF (before/after normalization)
    * Example non-matching keys from the sheet and from the VCF (top N)
    * Chromosome naming diagnostics (e.g., 'chr' presence, 'MT'/'M' harmonization)
    * Off-by-N diagnostics via --pos_tolerance (counts for would-match @ ±N)
    * Optional preview of the sheet's SV type column, if present
    - Safer ANN parsing and aggregation.
    - Command-line options: --debug_examples, --pos_tolerance
    
    ANN filling is done only for **exact** (CHROM, POS) equality (as before).
    Tolerance is used only for *diagnostics*, not for filling, to avoid incorrect merges.
    """
    
    import argparse
    import gzip
    import io
    import re
    from pathlib import Path
    from typing import List, Tuple, Dict, Iterable, Set
    
    import pandas as pd
    
    FALLBACK_ANN_FIELDS: List[str] = [
        'Allele','Annotation','Annotation_Impact','Gene_Name','Gene_ID',
        'Feature_Type','Feature_ID','Transcript_BioType','Rank','HGVS.c',
        'HGVS.p','cDNA.pos/cDNA.length','CDS.pos/CDS.length','AA.pos/AA.length',
        'Distance','Errors_Warnings_Info'
    ]
    
    def open_text_maybe_gzip(path: Path):
        if str(path).endswith('.gz'):
            return io.TextIOWrapper(gzip.open(path, 'rb'), encoding='utf-8', errors='ignore')
        return open(path, 'r', encoding='utf-8', errors='ignore')
    
    def normalize_chrom(col: pd.Series) -> pd.Series:
        s = col.astype(str).str.strip()
        s = s.str.replace(r'^(chr|CHR)', '', regex=True)
        # Standardize mitochondrial names to "MT"
        s = s.str.replace(r'^(M|MtDNA|MTDNA|Mito|Mitochondrion)$', 'MT', regex=True, case=False)
        return s.str.upper()
    
    def normalize_pos(col: pd.Series) -> pd.Series:
        # Excel can make ints look like floats; coerce then Int64
        # (We keep Int64 nullable for robustness; we never compare NaNs.)
        s = pd.to_numeric(col, errors='coerce')
        # If people had 0-based starts in the sheet (rare for INS), this won't fix it,
        # but the tolerance debug will reveal a +1 shift if present.
        return s.astype('Int64')
    
    def parse_vcf_ann(vcf_path: Path) -> Tuple[pd.DataFrame, List[str]]:
        ann_fields = None
        header_cols = None
        records = []
    
        with open_text_maybe_gzip(vcf_path) as f:
            for line in f:
                if line.startswith('##INFO=<ID=ANN'):
                    m = re.search(r'Format:\s*([^">]+)', line)
                    if m:
                        ann_fields = [s.strip() for s in m.group(1).split('|')]
                if line.startswith('#CHROM'):
                    header_cols = line.strip().lstrip('#').split('\t')
                    break
    
            if not header_cols:
                raise RuntimeError(f"Could not find VCF header line (#CHROM ...) in {vcf_path}")
    
            if not ann_fields:
                ann_fields = FALLBACK_ANN_FIELDS
    
            ann_cols = [f'ANN_{x}' for x in ann_fields]
    
            for line in f:
                if not line or line[0] == '#':
                    continue
                parts = line.rstrip('\n').split('\t')
                if len(parts) < len(header_cols):
                    continue
                row = dict(zip(header_cols, parts))
                info = row.get('INFO', '')
    
                # Only INS
                if not re.search(r'(?:^|;)SVTYPE=INS(?:;|$)', info):
                    continue
    
                chrom = row.get('#CHROM') or row.get('CHROM')
                pos_str = row.get('POS')
                try:
                    pos = int(pos_str)
                except Exception:
                    continue
    
                # Extract ANN entries
                ann_match = re.search(r'(?:^|;)ANN=([^;]+)', info)
                ann_entries = ann_match.group(1).split(',') if ann_match else []
    
                field_values: Dict[str, List[str]] = {k: [] for k in ann_fields}
                for ann in ann_entries:
                    items = ann.split('|')
                    if len(items) < len(ann_fields):
                        items += [''] * (len(ann_fields) - len(items))
                    elif len(items) > len(ann_fields):
                        items = items[:len(ann_fields)]
                    for k, v in zip(ann_fields, items):
                        field_values[k].append(v)
    
                joined = {f'ANN_{k}': (';'.join(v) if v else '') for k, v in field_values.items()}
                records.append({'CHROM': chrom, 'POS': pos, **joined})
    
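        # Always create the key and ANN columns so a VCF without INS records
        # still yields an empty but well-formed frame for the later merge.
        df = pd.DataFrame.from_records(records, columns=['CHROM', 'POS'] + ann_cols)
        df['POS'] = pd.to_numeric(df['POS'], errors='coerce').astype('Int64')
        df['CHROM'] = normalize_chrom(df['CHROM'])
        return df, ann_cols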
    
    def detect_key_columns(df: pd.DataFrame) -> Dict[str, str]:
        chrom_candidates = ['CHROM', '#CHROM', 'Chrom', 'Chromosome', 'chrom', 'chr', 'Chr']
        pos_candidates   = ['POS', 'Position', 'position', 'pos', 'Start', 'start']
        mapping = {}
        for c in chrom_candidates:
            if c in df.columns:
                mapping['CHROM'] = c
                break
        for p in pos_candidates:
            if p in df.columns:
                mapping['POS'] = p
                break
        return mapping
    
    def normalize_chrom_pos_df(df: pd.DataFrame, keys: Dict[str, str]) -> pd.DataFrame:
        out = df.copy()
        out[keys['CHROM']] = normalize_chrom(out[keys['CHROM']])
        out[keys['POS']]   = normalize_pos(out[keys['POS']])
        return out
    
    def summarize_chr_formats(series: pd.Series, label: str):
        raw = series.astype(str)
        has_chr_prefix = raw.str.startswith(('chr','CHR')).sum()
        mt_like = raw.str.fullmatch(r'(M|MtDNA|MTDNA|Mito|Mitochondrion)', case=False).sum()
        print(f"[{label}] CHROM diagnostics:")
        print(f"  total rows: {len(raw)}")
        print(f"  with 'chr'/'CHR' prefix: {has_chr_prefix}")
        print(f"  mitochondrial names like M/MtDNA/etc: {mt_like}")
    
    def keys_set(df: pd.DataFrame, chrom_col: str, pos_col: str) -> Set[Tuple[str, int]]:
        # Drop NA POS, NA CHROM
        sub = df[[chrom_col, pos_col]].dropna()
        # Ensure ints (drop NA after coercion)
        sub = sub[(sub[pos_col].astype('Int64').notna())]
        return set(zip(sub[chrom_col].astype(str), sub[pos_col].astype('int64')))
    
    def tolerance_match_count(sheet_keys: Iterable[Tuple[str,int]],
                            vcf_keys: Set[Tuple[str,int]],
                            tol: int) -> int:
        if tol <= 0:
            return sum(1 for k in sheet_keys if k in vcf_keys)
        cnt = 0
        for chrom, pos in sheet_keys:
            if (chrom, pos) in vcf_keys:
                cnt += 1
            else:
                matched = False
                # check +/- 1..tol
                for d in range(1, tol+1):
                    if (chrom, pos - d) in vcf_keys or (chrom, pos + d) in vcf_keys:
                        matched = True
                        break
                if matched:
                    cnt += 1
        return cnt
    
    def debug_match_report(df_sheet: pd.DataFrame,
                        vcf_df: pd.DataFrame,
                        keys: Dict[str, str],
                        debug_examples: int = 15,
                        pos_tolerance: int = 1):
        print("\n=== DEBUG: Matching overview ===")
        # Raw diagnostics
        summarize_chr_formats(df_sheet[keys['CHROM']], label="SHEET (raw)")
        summarize_chr_formats(vcf_df['CHROM'], label="VCF (normalized)")
    
        # Normalize sheet
        df_norm = normalize_chrom_pos_df(df_sheet, keys)
        print(f"Detected key columns -> CHROM: '{keys['CHROM']}'  POS: '{keys['POS']}'")
        # Basic stats
        n_sheet_all = len(df_sheet)
        n_sheet_key_nonnull = df_norm[keys['CHROM']].notna().sum()
        n_sheet_pos_nonnull = df_norm[keys['POS']].notna().sum()
        print(f"SHEET rows total: {n_sheet_all}")
        print(f"SHEET rows with non-null CHROM: {n_sheet_key_nonnull}, non-null POS: {n_sheet_pos_nonnull}")
    
        # Unique key counts
        sheet_norm_keys_df = df_norm.rename(columns={keys['CHROM']: 'CHROM', keys['POS']: 'POS'})
        sheet_norm_keys_df = sheet_norm_keys_df.dropna(subset=['CHROM','POS'])
        sheet_norm_keys_df['POS'] = sheet_norm_keys_df['POS'].astype('Int64')
        sheet_keys_unique = keys_set(sheet_norm_keys_df, 'CHROM', 'POS')
        vcf_keys_unique   = keys_set(vcf_df, 'CHROM', 'POS')
    
        print(f"Unique (CHROM,POS) keys -> SHEET: {len(sheet_keys_unique)}  VCF(INS): {len(vcf_keys_unique)}")
    
        # Exact match count
        exact_matches = len(sheet_keys_unique & vcf_keys_unique)
        print(f"Exact key matches (SHEET∩VCF): {exact_matches}")
    
        # Tolerance diagnostics (diagnose off-by-one etc.)
        if pos_tolerance > 0:
            approx_matches = tolerance_match_count(sheet_keys_unique, vcf_keys_unique, pos_tolerance)
            print(f"Keys that would match within ±{pos_tolerance}: {approx_matches}")
    
        # Show some examples of non-matching keys from SHEET
        if debug_examples > 0:
            not_in_vcf = sorted(k for k in sheet_keys_unique if k not in vcf_keys_unique)
            not_in_sheet = sorted(k for k in vcf_keys_unique if k not in sheet_keys_unique)
            print(f"\nExamples of SHEET keys not found in VCF (showing up to {debug_examples}):")
            for k in not_in_vcf[:debug_examples]:
                print("  SHEET-only:", k)
            print(f"\nExamples of VCF keys not found in SHEET (showing up to {debug_examples}):")
            for k in not_in_sheet[:debug_examples]:
                print("  VCF-only:", k)
    
        # Try to detect a type column and report counts
        type_cols = [c for c in df_sheet.columns if c.lower() in ('svtype','type','variant_type','sv_type')]
        if type_cols:
            tcol = type_cols[0]
            is_ins = df_sheet[tcol].astype(str).str.upper() == 'INS'
            print(f"\nType column detected: '{tcol}'. SHEET rows with INS: {int(is_ins.sum())} / {len(df_sheet)}")
            # Of the INS rows, how many have keys that match?
            ins_keys = keys_set(df_norm[is_ins], keys['CHROM'], keys['POS'])
            exact_ins_matches = len(ins_keys & vcf_keys_unique)
            print(f"  INS-only exact key matches: {exact_ins_matches} / {len(ins_keys)}")
            if pos_tolerance > 0:
                approx_ins_matches = tolerance_match_count(ins_keys, vcf_keys_unique, pos_tolerance)
                print(f"  INS-only matches within ±{pos_tolerance}: {approx_ins_matches} / {len(ins_keys)}")
        else:
            print("\nNo explicit type column found in SHEET.")
    
    def merge_ann_into_sheet(df_sheet: pd.DataFrame, vcf_df: pd.DataFrame, ann_cols: List[str],
                            pos_tolerance: int = 1, debug_examples: int = 15) -> pd.DataFrame:
        df = df_sheet.copy()
    
        keys = detect_key_columns(df)
        if 'CHROM' not in keys or 'POS' not in keys:
            print("WARNING: Could not detect CHROM/POS columns in sheet; ANN columns will be empty.")
            for c in ann_cols:
                if c not in df.columns:
                    df[c] = ''
            return df
    
        # DEBUG: run a comprehensive match report
        debug_match_report(df, vcf_df, keys, debug_examples=debug_examples, pos_tolerance=pos_tolerance)
    
        # Normalize sheet keys for merge
        df_norm = normalize_chrom_pos_df(df, keys)
    
        # Prepare VCF map (unique by CHROM,POS), aggregate ANN fields
        vcf_use = vcf_df.copy()
        if vcf_use.empty:
            print("NOTE: No INS records found in VCF; ANN columns will be created but empty.")
        else:
            agg = {c: lambda s: ';'.join([x for x in s.astype(str).tolist() if x]) for c in ann_cols}
            vcf_use = vcf_use.groupby(['CHROM', 'POS'], as_index=False).agg(agg)
    
        # Identify potential type column in sheet
        type_cols = [c for c in df.columns if c.lower() in ('svtype','type','variant_type','sv_type')]
        has_type = bool(type_cols)
        if has_type:
            tcol = type_cols[0]
            is_ins = df[tcol].astype(str).str.upper() == 'INS'
            print(f"\nMERGE: using type column '{tcol}' -> rows marked INS: {int(is_ins.sum())} / {len(df)}")
        else:
            is_ins = pd.Series([False]*len(df), index=df.index)
            print("\nMERGE: no type column -> will fill ANN wherever exact (CHROM,POS) matches VCF INS.")
    
        # Left merge on exact keys only (do not use tolerance for filling, just for diagnostics)
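        left = df_norm.rename(columns={keys['CHROM']: 'CHROM', keys['POS']: 'POS'})
        # Drop ANN columns left over from an earlier run so the merge cannot create duplicate column names
        left = left.drop(columns=[c for c in ann_cols if c in left.columns])
        merged = left.merge(vcf_use[['CHROM','POS'] + ann_cols], on=['CHROM','POS'], how='left')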
    
        # Initialize ANN columns on original df
        for c in ann_cols:
            if c not in df.columns:
                df[c] = ''
    
        # Fill values:
        for c in ann_cols:
            values = merged[c]
            if has_type:
                df.loc[is_ins, c] = values[is_ins].fillna('').astype(str).values
            else:
                df[c] = values.fillna('').astype(str).values
    
        # Report matching stats on the actual merge
        matched = merged[ann_cols].notna().any(axis=1).sum()
        print(f"\nMERGE RESULT: rows with any ANN filled (exact VCF match): {int(matched)} / {len(df)}")
    
        # Additional hint if tolerance suggests many near-misses
        if pos_tolerance > 0:
            sheet_keys = keys_set(left, 'CHROM', 'POS')
            vcf_keys_unique = keys_set(vcf_use, 'CHROM', 'POS')
            approx = tolerance_match_count(sheet_keys, vcf_keys_unique, pos_tolerance)
            if approx > matched:
                print(f"NOTE: There appear to be {approx - matched} additional rows that would match within ±{pos_tolerance}.")
                print("      This often indicates a 0-based vs 1-based position shift or use of END instead of POS in the sheet.")
    
        return df
    
    def main():
        ap = argparse.ArgumentParser()
        ap.add_argument('--excel', default='merged_ngmlr+sniffles_variants.xlsx', help='Input Excel workbook')
        ap.add_argument('--sheet8', default='HD46-8', help='Sheet name for HD46-8 sample')
        ap.add_argument('--sheet13', default='HD46-13', help='Sheet name for HD46-13 sample')
        ap.add_argument('--vcf8', default='HD46-8_ngmlr+sniffles_filtered.annotated.vcf', help='Annotated VCF for HD46-8')
        ap.add_argument('--vcf13', default='HD46-13_ngmlr+sniffles_filtered.annotated.vcf', help='Annotated VCF for HD46-13')
        ap.add_argument('--out', default='merged_ngmlr+sniffles_variants_with_ANN.xlsx', help='Output Excel path')
        ap.add_argument('--debug_examples', type=int, default=15, help='How many non-match examples to print from each side')
        ap.add_argument('--pos_tolerance', type=int, default=1, help='Diagnostic tolerance (±N bp) for off-by-N checks (used for debug only)')
        args = ap.parse_args()
    
        excel_path = Path(args.excel)
        vcf8_path = Path(args.vcf8)
        vcf13_path = Path(args.vcf13)
        out_path = Path(args.out)
    
        # Load sheets (resolve case-insensitive names)
        xls = pd.ExcelFile(excel_path)
        def resolve_sheet(name: str) -> str:
            if name in xls.sheet_names:
                return name
            lower_map = {s.lower(): s for s in xls.sheet_names}
            return lower_map.get(name.lower(), name)
    
        sheet8 = resolve_sheet(args.sheet8)
        sheet13 = resolve_sheet(args.sheet13)
    
        df8 = pd.read_excel(excel_path, sheet_name=sheet8)
        df13 = pd.read_excel(excel_path, sheet_name=sheet13)
    
        # Parse VCFs (INS only)
        vcf8_df, ann_cols = parse_vcf_ann(vcf8_path)
        vcf13_df, _ = parse_vcf_ann(vcf13_path)
    
        print(f"VCF8 INS variants: {len(vcf8_df)}; VCF13 INS variants: {len(vcf13_df)}")
        print(f"ANN subfields ({len(ann_cols)}): {', '.join(ann_cols)}")
    
        # Merge with diagnostics
        df8_out = merge_ann_into_sheet(df8, vcf8_df, ann_cols,
                                    pos_tolerance=args.pos_tolerance,
                                    debug_examples=args.debug_examples)
        df13_out = merge_ann_into_sheet(df13, vcf13_df, ann_cols,
                                        pos_tolerance=args.pos_tolerance,
                                        debug_examples=args.debug_examples)
    
        # Save
        with pd.ExcelWriter(out_path, engine='xlsxwriter') as writer:
            df8_out.to_excel(writer, sheet_name=sheet8, index=False)
            df13_out.to_excel(writer, sheet_name=sheet13, index=False)
    
        print(f"\nDone. Wrote: {out_path.resolve()}")
    
    if __name__ == '__main__':
        main()
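As a small usage illustration of the ANN handling in the script above, the sketch below splits one INFO string into per-field columns the same way: each comma-separated ANN entry is split on "|", padded or truncated to the expected number of fields, and the values of each field are joined with ";". The field list is the fallback list from the script. The function name split_ann and the toy INFO string are hypothetical.

```python
import re
from typing import Dict, List

ANN_FIELDS: List[str] = [
    'Allele', 'Annotation', 'Annotation_Impact', 'Gene_Name', 'Gene_ID',
    'Feature_Type', 'Feature_ID', 'Transcript_BioType', 'Rank', 'HGVS.c',
    'HGVS.p', 'cDNA.pos/cDNA.length', 'CDS.pos/CDS.length', 'AA.pos/AA.length',
    'Distance', 'Errors_Warnings_Info',
]


def split_ann(info: str, fields: List[str] = ANN_FIELDS) -> Dict[str, str]:
    """Return {'ANN_<field>': 'v1;v2;...'} for all ANN entries in a VCF INFO string."""
    match = re.search(r'(?:^|;)ANN=([^;]+)', info)
    entries = match.group(1).split(',') if match else []
    columns: Dict[str, List[str]] = {name: [] for name in fields}
    for entry in entries:
        items = entry.split('|')
        # Pad short entries with empty strings and cut long ones to the field count.
        items = (items + [''] * len(fields))[:len(fields)]
        for name, value in zip(fields, items):
            columns[name].append(value)
    return {f'ANN_{name}': ';'.join(values) for name, values in columns.items()}


# Toy INFO string (hypothetical) with two truncated ANN entries.
info = ("SVTYPE=INS;"
        "ANN=N|upstream_gene_variant|MODIFIER|geneA,N|downstream_gene_variant|MODIFIER|geneB;"
        "SVLEN=1332")

row = split_ann(info)
print(row['ANN_Annotation'])
print(row['ANN_Gene_Name'])
```

Because values are joined per field, the n-th item of every ANN_ column belongs to the same annotation entry, which is worth keeping in mind when columns are deleted or merged by hand in the next step.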
  3. Manually merge all contents of ANN=? into a separate column ‘ANN’ in the isolate-specific sheets of the Excel file.

    #Add a CHROM column with the value HD46_Ctrl.1 as the first column of the input Excel file
    (plot-numpy1) jhuang@WS-2290C:~/DATA/Data_Patricia_Transposon_2025$ python add_ann_to_excel.py \
        --excel merged_ngmlr+sniffles_variants.xlsx \
        --sheet8 "HD46-8" \
        --sheet13 "HD46-13" \
        --vcf8 HD46-8_ngmlr+sniffles_filtered.annotated.vcf \
        --vcf13 HD46-13_ngmlr+sniffles_filtered.annotated.vcf \
        --out merged_ngmlr+sniffles_variants_with_ANN.xlsx
    #DEL some columns (INFO, ANN_Allele, ANN_Rank, ANN_Errors_Warnings_Info) from the table, and COPY the summary-sheet to the final table.
  4. Run nextflow bacass

    # -- samplesheet_bacass.tsv --
    #ID R1  R2  LongFastQ   Fast5   GenomeSize
    #HD46_Ctrl          HD46_Ctrl.fastq.gz  NA  NA
    #HD46_1         HD46_1.fastq.gz NA  NA
    #An6    /mnt/md1/DATA/Data_Tam_DNAseq_2026_An6_BG5/X101SC26036392-Z01-J002/clean_data/An6/An6_L1_1.clean.rd.fq.gz   /mnt/md1/DATA/Data_Tam_DNAseq_2026_An6_BG5/X101SC26036392-Z01-J002/clean_data/An6/An6_L1_2.clean.rd.fq.gz   /mnt/md1/DATA/Data_Tam_DNAseq_2026_An6_BG5/X101SC26036392-Z01-J003/Release-X101SC26036392-Z01-J003-20260513_01/Data-X101SC26036392-Z01-J003/An6/2157_4C_PBK79106_7ec05c46/merged_An6_longreads.fastq.gz NA  2.7m
    #BG5    /mnt/md1/DATA/Data_Tam_DNAseq_2026_An6_BG5/X101SC26036392-Z01-J002/clean_data/BG5/BG5_L1_1.clean.rd.fq.gz   /mnt/md1/DATA/Data_Tam_DNAseq_2026_An6_BG5/X101SC26036392-Z01-J002/clean_data/BG5/BG5_L1_2.clean.rd.fq.gz   /mnt/md1/DATA/Data_Tam_DNAseq_2026_An6_BG5/X101SC26036392-Z01-J003/Release-X101SC26036392-Z01-J003-20260513_01/Data-X101SC26036392-Z01-J003/BG5/2157_4C_PBK79106_7ec05c46/merged_BG5_longreads.fastq.gz NA  6.5m
    
    conda deactivate
    # DEBUG: --kmerfinderdb /mnt/nvme1n1p1/REFs/kmerfinder/bacteria/ not working, maybe due to the version, since 2.5.0 was working (see below)!
    #nextflow run nf-core/bacass -r 2.5.0 -profile docker \
    #--input samplesheet.tsv \
    #--outdir bacass_out \
    #--assembly_type long \
    #--kraken2db /mnt/nvme1n1p1/REFs/k2_standard_08_GB_20251015.tar.gz \
    #--kmerfinderdb /mnt/nvme1n1p1/REFs/kmerfinder/bacteria/ \
    #-resume
    
    #For hybrid-assembly: --assembly_type hybrid --assembler unicycler,dragonflye [unicycler,autocycler,canu,dragonflye,flye,miniasm,raven,megahit] \
    nextflow run nf-core/bacass -r 2.6.0 -profile docker --help
    nextflow run nf-core/bacass -r 2.6.0 -profile docker \
      --input samplesheet_bacass.tsv \
      --outdir bacass_out \
      --assembly_type long \
      --assembler unicycler,dragonflye \
      --kraken2db /mnt/nvme1n1p1/REFs/k2_standard_08_GB_20251015.tar.gz \
      --skip_kmerfinder \
      -resume \
      -c unicycler.config \
      -work-dir bacass_out/work
    
    #SAVE bacass_out/Kmerfinder/kmerfinder_summary.csv to bacass_out/Kmerfinder/An6?/An6?_kmerfinder_results.xlsx
    
    #busco example results:
    Input_file      Dataset Complete        Single  Duplicated      Fragmented      Missing n_markers       Scaffold N50    Contigs N50     Percent gaps    Number of scaffolds
    wt_cef.scaffolds.fa     bacteria_odb10  98.4    98.4    0.0     1.6     0.0     124     285852  285852  0.000%  45
    wt_cipro.scaffolds.fa   bacteria_odb10  90.3    89.5    0.8     8.1     1.6     124     7434    7434    0.000%  1699
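For many long-read-only samples, the samplesheet shown in this step can be generated rather than typed. The sketch below writes the same six columns (ID, R1, R2, LongFastQ, Fast5, GenomeSize) as tab-separated text. Filling unused fields with NA is an assumption of this sketch, and the function names and file names are hypothetical, so check the result against the nf-core/bacass documentation for the pipeline version in use.

```python
import csv
import io
from typing import Dict, List

COLUMNS = ['ID', 'R1', 'R2', 'LongFastQ', 'Fast5', 'GenomeSize']


def samplesheet_rows(long_reads: Dict[str, str], genome_size: str = 'NA') -> List[Dict[str, str]]:
    """Build one row per sample for long-read-only input (short reads and Fast5 set to NA)."""
    return [
        {'ID': sample_id, 'R1': 'NA', 'R2': 'NA',
         'LongFastQ': fastq, 'Fast5': 'NA', 'GenomeSize': genome_size}
        for sample_id, fastq in sorted(long_reads.items())
    ]


def to_tsv(rows: List[Dict[str, str]]) -> str:
    """Serialize rows as a tab-separated table with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=COLUMNS, delimiter='\t', lineterminator='\n')
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Hypothetical mapping of sample IDs to long-read FASTQ files.
reads = {'HD46_Ctrl': 'HD46_Ctrl.fastq.gz', 'HD46_1': 'HD46_1.fastq.gz'}
sheet = to_tsv(samplesheet_rows(reads))
print(sheet, end='')
```

Writing sheet to samplesheet_bacass.tsv gives a file that can be passed to --input.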
  5. Detecting the next closest genome

    mamba activate gtdbtk
    
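    # Verify that the environment variable was loaded
    echo $GTDBTK_DATA_PATH
    # Expected output: /mnt/nvme4n1p1/gtdb_data/release232
    
    # 3. Run the classification (the command you provided plus practical parameters)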
    gtdbtk classify_wf \
      --genome_dir ./bacass_out/Medaka \
      --out_dir gtdb_out \
      --cpus 64 \
      --extension .fa \
      --prefix mygenome
    
    # 4. View the results
    cat gtdb_out/classify/mygenome.bac120.summary.tsv   # bacterial results
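Instead of reading the summary table by eye, the species assignment per genome can be pulled out with a few lines of Python. This is a minimal sketch that relies only on the user_genome and classification columns of the GTDB-Tk summary; the function name and the toy table (including its taxonomy strings) are hypothetical.

```python
import csv
import io
from typing import Dict


def species_by_genome(tsv_text: str) -> Dict[str, str]:
    """Map each genome to the species rank (s__) of its GTDB classification string."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter='\t')
    result = {}
    for row in reader:
        # "d__Bacteria;...;s__Genus species" -> {'d': 'Bacteria', ..., 's': 'Genus species'}
        ranks = dict(
            item.split('__', 1)
            for item in row['classification'].split(';')
            if '__' in item
        )
        result[row['user_genome']] = ranks.get('s') or 'unclassified at species level'
    return result


# Toy summary table (hypothetical content).
toy_summary = (
    "user_genome\tclassification\n"
    "genomeA\td__Bacteria;g__Staphylococcus;s__Staphylococcus epidermidis\n"
    "genomeB\td__Bacteria;g__Staphylococcus;s__\n"
)

species = species_by_genome(toy_summary)
for genome, name in species.items():
    print(genome, '->', name)
```

For the real file, pass the contents of gtdb_out/classify/mygenome.bac120.summary.tsv to species_by_genome.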
  6. Structural variant calling

    conda activate sv_assembly
    
    # MLST calling
    for sample in HD46_Ctrl HD46_1 HD46_2 HD46_3 HD46_4 HD46_5 HD46_6 HD46_7 HD46_8 HD46_13 _WT _1 _2 _3 _4 _5 _7 _8 _9 _10; do
        mlst bacass_out/Medaka/${sample}-unicycler-medaka_polished_genome.fa >> mlst_res
    done
    
    # After running MLST and genome taxonomy checks, I found that CP020463 is only a suitable reference for the first dataset (_WT, _1–_10), because all of these samples share ST 86 — the same sequence type as CP020463. For the second dataset (HD46 series), CP020463 is not an appropriate reference. MLST and GTDB-Tk classification results (see mygenome.bac120.summary2.xlsx and mlst_res.xlsx) show that these genomes are genetically distinct.
    for sample in _WT _1 _2 _3 _4 _5 _7 _8 _9 _10; do
            nucmer --maxmatch -l 100 -c 500 CP020463.fasta bacass_out/Medaka/${sample}-dragonflye-medaka_polished_genome.fa -p ${sample};
            delta-filter -1 -q ${sample}.delta > ${sample}.filtered.delta;
            #Usage: Assemblytics delta output_prefix    unique_length_required    min_size    max_size
            Assemblytics ${sample}.filtered.delta ${sample}_assemblytics 1000 100 500000;
    done
    samtools faidx bacass_out/Medaka/HD46_Ctrl-dragonflye-medaka_polished_genome.fa contig00001 > bacass_out/Medaka/HD46_Ctrl_chrom.fa
    for sample in HD46_1 HD46_2 HD46_5 HD46_6 HD46_7; do
            nucmer --maxmatch -l 100 -c 500 bacass_out/Medaka/HD46_Ctrl_chrom.fa bacass_out/Medaka/${sample}-dragonflye-medaka_polished_genome.fa -p ${sample};
            delta-filter -1 -q ${sample}.delta > ${sample}.filtered.delta;
            #Usage: Assemblytics delta output_prefix    unique_length_required    min_size    max_size; Note that we use a large threshold 500,000 nt.
            Assemblytics ${sample}.filtered.delta ${sample}_assemblytics 1000 100 500000;
    done
    
    ./merge_variants.sh
            #!/bin/bash
            # Define the output file name
            OUTPUT="merged_assemblytics_variants.txt"
            # 1. Write the header with the new 'Sample' column
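            # We read the header from the first file only (head on several files would emit one header per file), strip the '#', and append 'Sample'
            first_file=$(ls *_assemblytics.variant_preview.txt | head -n 1)
            head -n 1 "$first_file" | grep '^#' | sed 's/^#//' | awk '{print $0 "\tSample"}' > "$OUTPUT"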
            # 2. Loop through all matching files
            for file in *_assemblytics.variant_preview.txt; do
                # Extract sample name (e.g., "HD46_Ctrl" from "HD46_Ctrl_assemblytics.variant_preview.txt")
                sample=$(echo "$file" | sed 's/_assemblytics\.variant_preview\.txt//')
    
                # Append data lines, skipping the header line (lines starting with '#')
                # and append the sample name as the last column
                grep -v '^#' "$file" | awk -v samp="$sample" '{print $0 "\t" samp}' >> "$OUTPUT"
            done
            echo "✅ Successfully merged $(ls *_assemblytics.variant_preview.txt | wc -l) files into $OUTPUT"
    
    #The generated file merged_assemblytics_variants.txt was then manually split into two parts: the HD46 series and the _* series.
    
    unicycler -l HD46_Ctrl.fastq.gz --mode normal -t 40 -o HD46_Ctrl_unicycler_normal
    unicycler -l HD46_3.fastq.gz --mode normal -t 40 -o HD46_3_unicycler_normal
    unicycler -l HD46_4.fastq.gz --mode normal -t 40 -o HD46_4_unicycler_normal
    unicycler -l HD46_13.fastq.gz --mode normal -t 40 -o HD46_13_unicycler_normal

From bioBakery Processing to Paired Differential Analysis (Data_Tam_Metagenomics_2026_Wastewater)

Direct Answer

Yes and No.

  • Yes, you can use biobakery_workflows wmgx to process all your FASTQ files into taxonomic and functional profiles, and wmgx_vis to generate basic visualizations (like alpha/beta diversity).
  • No, wmgx_vis is not recommended for the actual differential analysis. Because your study is a paired/longitudinal design (Pre- vs. Post-treatment from the same subjects across multiple time points), the model must account for the pairing by including the subject as a random effect. The command-line wrapper for wmgx_vis does not support specifying random effects, so its statistics would treat paired samples as independent and would be statistically flawed.

Instead, the standard practice in the bioBakery ecosystem is to use wmgx for processing, and then run MaAsLin 2 directly for the differential analysis.

Here is the complete workflow on how to do this:


Step 1: Process the data with wmgx

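By default, wmgx identifies paired-end reads by looking for .R1 and .R2 in the file names. Since your files follow the _1.fastq.gz and _2.fastq.gz naming convention, pass --pair-identifier _1 so that they are detected and paired (check biobakery_workflows wmgx --help for your installed version).

Assuming all your symlinks are in a folder called /data/fastq_files/:

    biobakery_workflows wmgx \
      --input /data/fastq_files/ \
      --output /data/wmgx_results/ \
      --pair-identifier _1 \
      --threads 20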

(Note: If you don’t need strain profiling, you can speed this up by adding --bypass-strain-profiling)


Step 2: Basic Visualization with wmgx_vis (Optional)

You can use wmgx_vis to generate standard diversity plots and heatmaps. You will need a metadata file (see Step 3 for the format).

    biobakery_workflows wmgx_vis \
      --input /data/wmgx_results/ \
      --project-name Metagenome_Study \
      --input-metadata /data/metadata.tsv \
      --metadata-categorical Group Timepoint \
      --output /data/visualization_results/

Step 3: Proper Differential Analysis (Using MaAsLin 2 directly)

To correctly compare Pre vs. Post while accounting for the fact that the samples are paired (and potentially accounting for the different time points), you should use MaAsLin 2 directly on the output tables generated by wmgx.

1. Prepare your Metadata File (metadata.tsv)

You need a tab-separated file where the SampleID exactly matches the prefix of your FASTQ files (e.g., 2025_Nov_Pre_A1).

    SampleID            Group   Timepoint   Subject_ID
    2025_Nov_Pre_A1     Pre     Nov         Subject_1
    2025_Nov_Post_B1    Post    Nov         Subject_1
    2026_Jan_Pre_A      Pre     Jan         Subject_2
    2026_Jan_Post_B     Post    Jan         Subject_2

(Note: You must correctly map which “Pre” sample belongs to which “Post” sample in the Subject_ID column. This is how the model knows they are paired).
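
The pairing requirement can be checked before any model is run. The following is a minimal sketch, assuming the metadata layout shown above (tab-separated, with the columns SampleID, Group, Timepoint, Subject_ID in that order). The function name check_pairs is hypothetical and is not part of bioBakery or MaAsLin 2.

```shell
#!/usr/bin/env bash
# check_pairs (hypothetical helper): report every Subject_ID that does not have
# exactly one "Pre" and one "Post" row in a tab-separated metadata file.
# Assumed columns: SampleID, Group, Timepoint, Subject_ID (header on line 1).
check_pairs() {
    awk -F'\t' '
        NR == 1 { next }                 # skip the header line
        { seen[$4] = 1 }                 # remember every subject
        $2 == "Pre"  { pre[$4]++ }
        $2 == "Post" { post[$4]++ }
        END {
            bad = 0
            for (s in seen) {
                if (pre[s] != 1 || post[s] != 1) {
                    printf "UNPAIRED\t%s\tPre=%d\tPost=%d\n", s, pre[s], post[s]
                    bad = 1
                }
            }
            exit bad                     # non-zero exit status if any subject is unpaired
        }
    ' "$1"
}

# Demo on a throwaway file in which Subject_2 lacks its Post sample.
demo=$(mktemp)
printf 'SampleID\tGroup\tTimepoint\tSubject_ID\n'     >  "$demo"
printf '2025_Nov_Pre_A1\tPre\tNov\tSubject_1\n'       >> "$demo"
printf '2025_Nov_Post_B1\tPost\tNov\tSubject_1\n'     >> "$demo"
printf '2026_Jan_Pre_A\tPre\tJan\tSubject_2\n'        >> "$demo"
check_pairs "$demo" || echo "metadata needs fixing"
rm -f "$demo"
```

On real data the call would be check_pairs /data/metadata.tsv. If a subject is sampled at several time points, the "exactly one Pre and one Post" rule is too strict and the counts should be compared per time point instead.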

2. Run MaAsLin 2 in R

You can run MaAsLin 2 (which is installed with bioBakery) using R. This allows you to define Group (Pre/Post) as a fixed effect and Subject_ID as a random effect, which is the appropriate way to model paired data because samples from the same subject are not independent.

    # Install Maaslin2 if you haven't already
    # BiocManager::install("Maaslin2")
    library(Maaslin2)

    # Run the differential abundance analysis
    fit_data <- Maaslin2(
        input_data     = "/data/wmgx_results/wmgx/humann/genefamilies.tsv", # or pathabundance.tsv
        input_metadata = "/data/metadata.tsv",
        output         = "/data/maaslin2_results/",

        # Define your variables
        fixed_effects  = c("Group", "Timepoint"), # Comparing Pre vs Post, while controlling for Timepoint
        random_effects = c("Subject_ID"),         # CRUCIAL: Accounts for the paired Pre/Post design!

        # Data processing parameters
        normalization  = "NONE",                  # Use "NONE" only for tables already in relative abundance (e.g. a *_relab table); raw HUMAnN output is in RPK
        transform      = "LOG",
        min_abundance  = 0.01,                    # Adjust to the scale of the input table
        min_prevalence = 0.10,
        correction     = "BH"                     # Benjamini-Hochberg for multiple testing (FDR)
    )

Summary

  1. Use wmgx to process your raw FASTQs.
  2. Use wmgx_vis if you just want quick, automated PCA/PCoA plots and heatmaps.
  3. Do not use wmgx_vis for the differential statistics. Extract the genefamilies.tsv or metaphlan merged tables and run MaAsLin 2 directly so you can include Subject_ID as a random effect to properly handle your Pre/Post paired design.

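World Cup Prediction Game (Tippspiel)

Because the 2026 World Cup in the USA, Canada and Mexico has been expanded to 48 teams and a total of 104 matches, predicting the exact score of every group-stage match would take a great deal of space. If you are taking part in a World Cup prediction game (Tippspiel) at work or among friends, here is a score prediction sheet covering the most important matches.

The predictions cover the key group-stage matches, the full knockout path (from the round of 16 to the final) and the bonus questions, so you can copy them straight into your prediction sheet:

⚽ 1. Key Group-Stage Predictions (Gruppenphase)

In the group stage, strong teams usually play it safe, so the scores are rarely extreme.

🇺🇸🇨🇦🇲🇽 Opening matches of the hosts:

  • 🇲🇽 Mexico vs South Africa 🇿🇦 ➔ 2:0
  • 🇨🇦 Canada vs Bosnia and Herzegovina 🇧🇦 ➔ 1:0
  • 🇺🇸 USA vs Paraguay 🇵🇾 ➔ 2:1

🏆 Typical group-stage scores for title favorites and strong teams:

  • 🇩🇪 Germany vs Ecuador/Australia ➔ 3:1
  • 🇪🇸 Spain vs Croatia/New Zealand ➔ 2:0
  • 🇫🇷 France vs Denmark/Canada ➔ 2:1
  • 🇦🇷 Argentina vs Australia/Peru ➔ 2:0
  • 🏴󠁧󠁢󠁥󠁮󠁧󠁿 England vs Serbia/Slovenia ➔ 2:0
  • 🇧🇷 Brazil vs Scotland/Morocco ➔ 2:1

🏆 2. Full Knockout Path Predictions (K.O.-Phase)

Defenses are usually tighter in the knockout stage: a one-goal or two-goal margin is the norm, and some matches go to a penalty shoot-out.

🥊 Round of 16 (Achtelfinale): key matches

  • 🇪🇸 Spain 2:0 Japan 🇯🇵
  • 🇩🇪 Germany 2:1 Mexico 🇲🇽
  • 🇫🇷 France 3:0 Poland 🇵🇱
  • 🇦🇷 Argentina 2:1 Australia 🇦🇺
  • 🏴󠁧󠁢󠁥󠁮󠁧󠁿 England 1:0 Colombia 🇨🇴
  • 🇧🇷 Brazil 2:0 Switzerland 🇨🇭
  • 🇵🇹 Portugal 1:0 Uruguay 🇺🇾
  • 🇳🇱 Netherlands 2:1 USA 🇺🇸 (the co-hosts bow out in the round of 16)

🔥 Quarter-finals (Viertelfinale)

  • 🇪🇸 Spain 2:1 Germany 🇩🇪 (marquee match)
  • 🇫🇷 France 2:1 Argentina 🇦🇷 (defending champions eliminated)
  • 🏴󠁧󠁢󠁥󠁮󠁧󠁿 England 1:1 (4:3 on penalties) Brazil 🇧🇷
  • 🇵🇹 Portugal 2:1 Netherlands 🇳🇱

🚀 Semi-finals (Halbfinale)

  • 🇪🇸 Spain 2:1 France 🇫🇷
  • 🏴󠁧󠁢󠁥󠁮󠁧󠁿 England 2:1 Portugal 🇵🇹

🥉 Third-place match (Spiel um Platz 3)

  • 🇫🇷 France 2:0 Portugal 🇵🇹

🏆 Final (Finale): MetLife Stadium, New Jersey

  • 🇪🇸 Spain 2:1 England 🏴󠁧󠁢󠁥󠁮󠁧󠁿 (Prediction: with stronger midfield control (e.g. Rodri, Pedri) and a game-breaker on the wing (Yamal), Spain score a late winner against England in normal time or extra time and lift the trophy for the second time in their history!)

🎯 3. Bonus Questions (Rahmenwetten)

If your prediction sheet includes these questions, you can fill in the following answers:

  1. 🏆 World champion (Weltmeister): Spain (Spanien)
  2. 🥇 Top scorer / Golden Boot (Torschützenkönig): Kylian Mbappé; predicted goals: 6
  3. 🌟 Best young player (Bester junger Spieler): Lamine Yamal (Spain)
  4. 🧤 Best goalkeeper (Bester Torwart): Emiliano Martínez (Argentina) or Unai Simón (Spain)
  5. ⚽ Score of the final (Ergebnis des Finales): 2:1
  6. 📊 Total tournament goals (Anzahl der Turniertore): 268 (48 teams, 104 matches, about 2.5 to 2.8 goals per match)

💡 Tip (Tipp): When filling in your sheet, favor low scores such as 1:0, 2:0, 2:1 and 1:1 for the knockout stage; these are far more common in World Cup knockout matches than high scores such as 3:2 or 4:1. Good luck with your prediction game (Viel Glück beim Tippspiel)!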

RNAseq processing (Data_Tam_RNAseq_2026_Dicl_Mero_Azith_Rifa_on_AYE)

complete_deg_pipeline.R

complete_deg_pipeline_custom_cutoff.R

1.R

  1. Preparing raw data

     mkdir raw_data; cd raw_data
    
     # control samples (8)
     ln -s ../X101SC26025981-Z01-J001/01.RawData/1/1_1.fq.gz AYE-WT_ctr_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/1/1_2.fq.gz AYE-WT_ctr_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/2/2_1.fq.gz AYE-WT_ctr_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/2/2_2.fq.gz AYE-WT_ctr_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/3/3_1.fq.gz AYE-WT_ctr_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/3/3_2.fq.gz AYE-WT_ctr_r3_R2.fastq.gz
    
     ln -s ../X101SC26025981-Z01-J001/01.RawData/4/4_1.fq.gz AYE-T_ctr_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/4/4_2.fq.gz AYE-T_ctr_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/5/5_1.fq.gz AYE-T_ctr_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/5/5_2.fq.gz AYE-T_ctr_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/6/6_1.fq.gz AYE-T_ctr_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/6/6_2.fq.gz AYE-T_ctr_r3_R2.fastq.gz
    
     ln -s ../X101SC26025981-Z01-J001/01.RawData/7/7_1.fq.gz AYE-O_ctr_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/7/7_2.fq.gz AYE-O_ctr_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/8/8_1.fq.gz AYE-O_ctr_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/8/8_2.fq.gz AYE-O_ctr_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/9/9_1.fq.gz AYE-O_ctr_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/9/9_2.fq.gz AYE-O_ctr_r3_R2.fastq.gz
    
     ln -s ../X101SC26025981-Z01-J001/01.RawData/10/10_1.fq.gz O-Trans_ctr_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/10/10_2.fq.gz O-Trans_ctr_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/11/11_1.fq.gz O-Trans_ctr_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/11/11_2.fq.gz O-Trans_ctr_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/12/12_1.fq.gz O-Trans_ctr_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/12/12_2.fq.gz O-Trans_ctr_r3_R2.fastq.gz
    
     ln -s ../X101SC26025981-Z01-J001/01.RawData/1new/1new_1.fq.gz WT-Trans_ctr_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/1new/1new_2.fq.gz WT-Trans_ctr_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/2new/2new_1.fq.gz WT-Trans_ctr_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/2new/2new_2.fq.gz WT-Trans_ctr_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/3new/3new_1.fq.gz WT-Trans_ctr_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/3new/3new_2.fq.gz WT-Trans_ctr_r3_R2.fastq.gz
    
     ln -s ../X101SC26025981-Z01-J001/01.RawData/49/49_1.fq.gz AYE-WT_ctr_solid_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/49/49_2.fq.gz AYE-WT_ctr_solid_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/50/50_1.fq.gz AYE-WT_ctr_solid_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/50/50_2.fq.gz AYE-WT_ctr_solid_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/51/51_1.fq.gz AYE-WT_ctr_solid_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/51/51_2.fq.gz AYE-WT_ctr_solid_r3_R2.fastq.gz
    
     ln -s ../X101SC26025981-Z01-J001/01.RawData/52/52_1.fq.gz AYE-O_ctr_solid_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/52/52_2.fq.gz AYE-O_ctr_solid_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/53/53_1.fq.gz AYE-O_ctr_solid_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/53/53_2.fq.gz AYE-O_ctr_solid_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/54/54_1.fq.gz AYE-O_ctr_solid_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/54/54_2.fq.gz AYE-O_ctr_solid_r3_R2.fastq.gz
    
     ln -s ../X101SC26025981-Z01-J001/01.RawData/55/55_1.fq.gz AYE-T_ctr_solid_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/55/55_2.fq.gz AYE-T_ctr_solid_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/56/56_1.fq.gz AYE-T_ctr_solid_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/56/56_2.fq.gz AYE-T_ctr_solid_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/57/57_1.fq.gz AYE-T_ctr_solid_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/57/57_2.fq.gz AYE-T_ctr_solid_r3_R2.fastq.gz
    
     # Diclofenac treatment (6)
     ln -s ../X101SC26025981-Z01-J001/01.RawData/25/25_1.fq.gz AYE-WT_Diclo750_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/25/25_2.fq.gz AYE-WT_Diclo750_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/26/26_1.fq.gz AYE-WT_Diclo750_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/26/26_2.fq.gz AYE-WT_Diclo750_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/27/27_1.fq.gz AYE-WT_Diclo750_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/27/27_2.fq.gz AYE-WT_Diclo750_r3_R2.fastq.gz
    
     ln -s ../X101SC26025981-Z01-J001/01.RawData/28/28_1.fq.gz AYE-T_Diclo375_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/28/28_2.fq.gz AYE-T_Diclo375_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/29/29_1.fq.gz AYE-T_Diclo375_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/29/29_2.fq.gz AYE-T_Diclo375_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/30/30_1.fq.gz AYE-T_Diclo375_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/30/30_2.fq.gz AYE-T_Diclo375_r3_R2.fastq.gz
    
     ln -s ../X101SC26025981-Z01-J001/01.RawData/31/31_1.fq.gz AYE-O_Diclo375_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/31/31_2.fq.gz AYE-O_Diclo375_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/32/32_1.fq.gz AYE-O_Diclo375_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/32/32_2.fq.gz AYE-O_Diclo375_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/33/33_1.fq.gz AYE-O_Diclo375_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/33/33_2.fq.gz AYE-O_Diclo375_r3_R2.fastq.gz
    
     ln -s ../X101SC26025981-Z01-J001/01.RawData/34/34_1.fq.gz O-Trans_Diclo375_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/34/34_2.fq.gz O-Trans_Diclo375_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/35/35_1.fq.gz O-Trans_Diclo375_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/35/35_2.fq.gz O-Trans_Diclo375_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/36/36_1.fq.gz O-Trans_Diclo375_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/36/36_2.fq.gz O-Trans_Diclo375_r3_R2.fastq.gz
    
     ln -s ../X101SC26025981-Z01-J001/01.RawData/4new/4new_1.fq.gz WT-Trans_Diclo750_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/4new/4new_2.fq.gz WT-Trans_Diclo750_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/5new/5new_1.fq.gz WT-Trans_Diclo750_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/5new/5new_2.fq.gz WT-Trans_Diclo750_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/6new/6new_1.fq.gz WT-Trans_Diclo750_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/6new/6new_2.fq.gz WT-Trans_Diclo750_r3_R2.fastq.gz
    
     ln -s ../X101SC26025981-Z01-J001/01.RawData/73/73_1.fq.gz AYE-WT_Diclo1250_solid_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/73/73_2.fq.gz AYE-WT_Diclo1250_solid_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/74/74_1.fq.gz AYE-WT_Diclo1250_solid_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/74/74_2.fq.gz AYE-WT_Diclo1250_solid_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/75/75_1.fq.gz AYE-WT_Diclo1250_solid_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/75/75_2.fq.gz AYE-WT_Diclo1250_solid_r3_R2.fastq.gz
    
     # Rifampicin treatment (4)
     ln -s ../X101SC26025981-Z01-J001/01.RawData/13/13_1.fq.gz AYE-WT_Rifampicin1.5_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/13/13_2.fq.gz AYE-WT_Rifampicin1.5_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/14/14_1.fq.gz AYE-WT_Rifampicin1.5_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/14/14_2.fq.gz AYE-WT_Rifampicin1.5_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/15/15_1.fq.gz AYE-WT_Rifampicin1.5_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/15/15_2.fq.gz AYE-WT_Rifampicin1.5_r3_R2.fastq.gz
    
     ln -s ../X101SC26025981-Z01-J001/01.RawData/16/16_1.fq.gz AYE-T_Rifampicin2_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/16/16_2.fq.gz AYE-T_Rifampicin2_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/17/17_1.fq.gz AYE-T_Rifampicin2_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/17/17_2.fq.gz AYE-T_Rifampicin2_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/18/18_1.fq.gz AYE-T_Rifampicin2_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/18/18_2.fq.gz AYE-T_Rifampicin2_r3_R2.fastq.gz
    
     ln -s ../X101SC26025981-Z01-J001/01.RawData/19/19_1.fq.gz AYE-O_Rifampicin2_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/19/19_2.fq.gz AYE-O_Rifampicin2_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/20/20_1.fq.gz AYE-O_Rifampicin2_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/20/20_2.fq.gz AYE-O_Rifampicin2_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/21/21_1.fq.gz AYE-O_Rifampicin2_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/21/21_2.fq.gz AYE-O_Rifampicin2_r3_R2.fastq.gz
    
     ln -s ../X101SC26025981-Z01-J001/01.RawData/22/22_1.fq.gz O-Trans_Rifampicin2_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/22/22_2.fq.gz O-Trans_Rifampicin2_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/23/23_1.fq.gz O-Trans_Rifampicin2_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/23/23_2.fq.gz O-Trans_Rifampicin2_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/24/24_1.fq.gz O-Trans_Rifampicin2_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/24/24_2.fq.gz O-Trans_Rifampicin2_r3_R2.fastq.gz
    
     # Meropenem treatment (4)
     ln -s ../X101SC26025981-Z01-J001/01.RawData/37/37_1.fq.gz AYE-WT_Mero0.35-0.5_r1_R1.fastq.gz  #AYE-WT_Mero0.5_r1
     ln -s ../X101SC26025981-Z01-J001/01.RawData/37/37_2.fq.gz AYE-WT_Mero0.35-0.5_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/38/38_1.fq.gz AYE-WT_Mero0.35-0.5_r2_R1.fastq.gz  #AYE-WT_YX_Mero0.35_r2
     ln -s ../X101SC26025981-Z01-J001/01.RawData/38/38_2.fq.gz AYE-WT_Mero0.35-0.5_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/39/39_1.fq.gz AYE-WT_Mero0.35-0.5_r3_R1.fastq.gz  #AYE-WT_public_Mero0.35_r3
     ln -s ../X101SC26025981-Z01-J001/01.RawData/39/39_2.fq.gz AYE-WT_Mero0.35-0.5_r3_R2.fastq.gz
    
     ln -s ../X101SC26025981-Z01-J001/01.RawData/40/40_1.fq.gz AYE-T_Mero0.15_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/40/40_2.fq.gz AYE-T_Mero0.15_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/41/41_1.fq.gz AYE-T_Mero0.15_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/41/41_2.fq.gz AYE-T_Mero0.15_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/42/42_1.fq.gz AYE-T_Mero0.15_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/42/42_2.fq.gz AYE-T_Mero0.15_r3_R2.fastq.gz
    
     ln -s ../X101SC26025981-Z01-J001/01.RawData/43/43_1.fq.gz AYE-O_Mero0.5_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/43/43_2.fq.gz AYE-O_Mero0.5_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/44/44_1.fq.gz AYE-O_Mero0.5_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/44/44_2.fq.gz AYE-O_Mero0.5_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/45/45_1.fq.gz AYE-O_Mero0.5_r3_R1.fastq.gz  #Mero0.45
     ln -s ../X101SC26025981-Z01-J001/01.RawData/45/45_2.fq.gz AYE-O_Mero0.5_r3_R2.fastq.gz
    
     ln -s ../X101SC26025981-Z01-J001/01.RawData/46/46_1.fq.gz O-Trans_Mero0.25_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/46/46_2.fq.gz O-Trans_Mero0.25_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/47/47_1.fq.gz O-Trans_Mero0.25_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/47/47_2.fq.gz O-Trans_Mero0.25_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/48/48_1.fq.gz O-Trans_Mero0.25_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/48/48_2.fq.gz O-Trans_Mero0.25_r3_R2.fastq.gz
    
     # Azithromycin treatment (5); among these, F_ctr_solid is a clinical isolate.
     ln -s ../X101SC26025981-Z01-J001/01.RawData/58/58_1.fq.gz F_ctr_solid_r1_R1.fastq.gz  #clinical
     ln -s ../X101SC26025981-Z01-J001/01.RawData/58/58_2.fq.gz F_ctr_solid_r1_R2.fastq.gz  #clinical
     ln -s ../X101SC26025981-Z01-J001/01.RawData/59/59_1.fq.gz F_ctr_solid_r2_R1.fastq.gz  #clinical
     ln -s ../X101SC26025981-Z01-J001/01.RawData/59/59_2.fq.gz F_ctr_solid_r2_R2.fastq.gz  #clinical
     ln -s ../X101SC26025981-Z01-J001/01.RawData/60/60_1.fq.gz F_ctr_solid_r3_R1.fastq.gz  #clinical
     ln -s ../X101SC26025981-Z01-J001/01.RawData/60/60_2.fq.gz F_ctr_solid_r3_R2.fastq.gz  #clinical
    
     ln -s ../X101SC26025981-Z01-J001/01.RawData/61/61_1.fq.gz AYE-WT_Azi20_solid_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/61/61_2.fq.gz AYE-WT_Azi20_solid_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/62/62_1.fq.gz AYE-WT_Azi20_solid_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/62/62_2.fq.gz AYE-WT_Azi20_solid_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/63/63_1.fq.gz AYE-WT_Azi20_solid_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/63/63_2.fq.gz AYE-WT_Azi20_solid_r3_R2.fastq.gz
    
     ln -s ../X101SC26025981-Z01-J001/01.RawData/67/67_1.fq.gz AYE-T_Azi20_solid_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/67/67_2.fq.gz AYE-T_Azi20_solid_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/68/68_1.fq.gz AYE-T_Azi20_solid_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/68/68_2.fq.gz AYE-T_Azi20_solid_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/69/69_1.fq.gz AYE-T_Azi20_solid_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/69/69_2.fq.gz AYE-T_Azi20_solid_r3_R2.fastq.gz
    
     ln -s ../X101SC26025981-Z01-J001/01.RawData/64/64_1.fq.gz AYE-O_Azi20_solid_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/64/64_2.fq.gz AYE-O_Azi20_solid_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/65/65_1.fq.gz AYE-O_Azi20_solid_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/65/65_2.fq.gz AYE-O_Azi20_solid_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/66/66_1.fq.gz AYE-O_Azi20_solid_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/66/66_2.fq.gz AYE-O_Azi20_solid_r3_R2.fastq.gz
    
     ln -s ../X101SC26025981-Z01-J001/01.RawData/70/70_1.fq.gz F_Azi20_solid_r1_R1.fastq.gz  #clinical
     ln -s ../X101SC26025981-Z01-J001/01.RawData/70/70_2.fq.gz F_Azi20_solid_r1_R2.fastq.gz  #clinical
     ln -s ../X101SC26025981-Z01-J001/01.RawData/71/71_1.fq.gz F_Azi20_solid_r2_R1.fastq.gz  #clinical
     ln -s ../X101SC26025981-Z01-J001/01.RawData/71/71_2.fq.gz F_Azi20_solid_r2_R2.fastq.gz  #clinical
     ln -s ../X101SC26025981-Z01-J001/01.RawData/72/72_1.fq.gz F_Azi20_solid_r3_R1.fastq.gz  #clinical
     ln -s ../X101SC26025981-Z01-J001/01.RawData/72/72_2.fq.gz F_Azi20_solid_r3_R2.fastq.gz  #clinical
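
With this many hand-written links, a mistyped path is easy to miss until trimming fails. The following is a minimal sketch of a sanity check, assuming the <sample>_R1.fastq.gz / <sample>_R2.fastq.gz naming used above. The function name check_links is hypothetical.

```shell
#!/usr/bin/env bash
# check_links (hypothetical helper): for every *_R1.fastq.gz in a directory,
# confirm that the link target exists and that the matching *_R2.fastq.gz
# is present and resolves too. Prints one line per problem; returns 1 if any.
check_links() {
    local dir=$1 r1 r2 status=0
    for r1 in "$dir"/*_R1.fastq.gz; do
        # Skip the literal pattern when the glob matched nothing
        [ -e "$r1" ] || [ -L "$r1" ] || continue
        r2=${r1%_R1.fastq.gz}_R2.fastq.gz
        # -e follows symlinks, so it is false for a link whose target is missing
        if [ ! -e "$r1" ]; then echo "BROKEN_R1 $r1"; status=1; fi
        if [ ! -e "$r2" ]; then echo "MISSING_OR_BROKEN_R2 $r2"; status=1; fi
    done
    return "$status"
}

# Demo in a throwaway directory: R1 resolves, R2 points to a missing file.
demo=$(mktemp -d)
touch "$demo/data_1.fq.gz"
ln -s "$demo/data_1.fq.gz"       "$demo/S1_R1.fastq.gz"
ln -s "$demo/no_such_file.fq.gz" "$demo/S1_R2.fastq.gz"
check_links "$demo" || echo "fix the links above before trimming"
rm -rf "$demo"
```

Inside raw_data the call would simply be check_links . after all ln -s commands have run.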
  2. Preparing the directory trimmed

     mkdir trimmed trimmed_unpaired
     # To re-run a single sample, shorten the list, e.g.: for sample_id in AYE-T_Diclo375_r2; do
     # (the two for-lines must not be nested, otherwise that one sample is trimmed once per outer iteration)
     for sample_id in AYE-O_Azi20_solid_r1 AYE-O_Azi20_solid_r2 AYE-O_Azi20_solid_r3 AYE-O_ctr_r1 AYE-O_ctr_r2 AYE-O_ctr_r3 AYE-O_ctr_solid_r1 AYE-O_ctr_solid_r2 AYE-O_ctr_solid_r3 AYE-O_Diclo375_r1 AYE-O_Diclo375_r2 AYE-O_Diclo375_r3 AYE-O_Mero0.5_r1 AYE-O_Mero0.5_r2 AYE-O_Mero0.5_r3 AYE-O_Rifampicin2_r1 AYE-O_Rifampicin2_r2 AYE-O_Rifampicin2_r3 AYE-T_Azi20_solid_r1 AYE-T_Azi20_solid_r2 AYE-T_Azi20_solid_r3 AYE-T_ctr_r1 AYE-T_ctr_r2 AYE-T_ctr_r3 AYE-T_ctr_solid_r1 AYE-T_ctr_solid_r2 AYE-T_ctr_solid_r3 AYE-T_Diclo375_r1 AYE-T_Diclo375_r2 AYE-T_Diclo375_r3 AYE-T_Mero0.15_r1 AYE-T_Mero0.15_r2 AYE-T_Mero0.15_r3 AYE-T_Rifampicin2_r1 AYE-T_Rifampicin2_r2 AYE-T_Rifampicin2_r3 AYE-WT_Azi20_solid_r1 AYE-WT_Azi20_solid_r2 AYE-WT_Azi20_solid_r3 AYE-WT_ctr_r1 AYE-WT_ctr_r2 AYE-WT_ctr_r3 AYE-WT_ctr_solid_r1 AYE-WT_ctr_solid_r2 AYE-WT_ctr_solid_r3 AYE-WT_Diclo1250_solid_r1 AYE-WT_Diclo1250_solid_r2 AYE-WT_Diclo1250_solid_r3 AYE-WT_Diclo750_r1 AYE-WT_Diclo750_r2 AYE-WT_Diclo750_r3 AYE-WT_Mero0.35-0.5_r1 AYE-WT_Mero0.35-0.5_r2 AYE-WT_Mero0.35-0.5_r3 AYE-WT_Rifampicin1.5_r1 AYE-WT_Rifampicin1.5_r2 AYE-WT_Rifampicin1.5_r3 F_Azi20_solid_r1 F_Azi20_solid_r2 F_Azi20_solid_r3 F_ctr_solid_r1 F_ctr_solid_r2 F_ctr_solid_r3 O-Trans_ctr_r1 O-Trans_ctr_r2 O-Trans_ctr_r3 O-Trans_Diclo375_r1 O-Trans_Diclo375_r2 O-Trans_Diclo375_r3 O-Trans_Mero0.25_r1 O-Trans_Mero0.25_r2 O-Trans_Mero0.25_r3 O-Trans_Rifampicin2_r1 O-Trans_Rifampicin2_r2 O-Trans_Rifampicin2_r3 WT-Trans_ctr_r1 WT-Trans_ctr_r2 WT-Trans_ctr_r3 WT-Trans_Diclo750_r1 WT-Trans_Diclo750_r2 WT-Trans_Diclo750_r3; do
             java -jar /home/jhuang/Tools/Trimmomatic-0.36/trimmomatic-0.36.jar PE -threads 100 raw_data/${sample_id}_R1.fastq.gz raw_data/${sample_id}_R2.fastq.gz trimmed/${sample_id}_R1.fastq.gz trimmed_unpaired/${sample_id}_R1.fastq.gz trimmed/${sample_id}_R2.fastq.gz trimmed_unpaired/${sample_id}_R2.fastq.gz ILLUMINACLIP:/home/jhuang/Tools/Trimmomatic-0.36/adapters/TruSeq3-PE-2.fa:2:30:10:8:TRUE LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36 AVGQUAL:20
     done 2> trimmomatic_pe.log
  3. (Optional) Using Trinity to find the closest reference

     #Trinity --seqType fq --max_memory 50G --left trimmed/wt_r1_R1.fastq.gz  --right trimmed/wt_r1_R2.fastq.gz --CPU 12
    
     #https://www.genome.jp/kegg/tables/br08606.html#prok
     acb     KGB     Acinetobacter baumannii ATCC 17978  2007    GenBank
     abm     KGB     Acinetobacter baumannii SDF     2008    GenBank
     aby     KGB     Acinetobacter baumannii AYE     2008    GenBank --> *
     abc     KGB     Acinetobacter baumannii ACICU   2008    GenBank
     abn     KGB     Acinetobacter baumannii AB0057  2008    GenBank
     abb     KGB     Acinetobacter baumannii AB307-0294  2008    GenBank
     abx     KGB     Acinetobacter baumannii 1656-2  2012    GenBank
     abz     KGB     Acinetobacter baumannii MDR-ZJ06    2012    GenBank
     abr     KGB     Acinetobacter baumannii MDR-TJ  2012    GenBank
     abd     KGB     Acinetobacter baumannii TCDC-AB0715     2012    GenBank
     abh     KGB     Acinetobacter baumannii TYTH-1  2012    GenBank
     abad    KGB     Acinetobacter baumannii D1279779    2013    GenBank
     abj     KGB     Acinetobacter baumannii BJAB07104   2013    GenBank
     abab    KGB     Acinetobacter baumannii BJAB0715    2013    GenBank
     abaj    KGB     Acinetobacter baumannii BJAB0868    2013    GenBank
     abaz    KGB     Acinetobacter baumannii ZW85-1  2013    GenBank
     abk     KGB     Acinetobacter baumannii AbH12O-A2   2014    GenBank
     abau    KGB     Acinetobacter baumannii AB030   2014    GenBank
     abaa    KGB     Acinetobacter baumannii AB031   2014    GenBank
     abw     KGB     Acinetobacter baumannii AC29    2014    GenBank
     abal    KGB     Acinetobacter baumannii LAC-4   2015    GenBank
     #Note that the Acinetobacter baumannii strain AYE chromosome, complete genome (GenBank: CU459141.1), was chosen as the reference!
  4. Preparing samplesheet.csv

     sample,fastq_1,fastq_2,strandedness
     Urine_r1,Urine_r1_R1.fq.gz,Urine_r1_R2.fq.gz,auto
     ...
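
The elided rows all follow the same pattern, so the sheet can be generated rather than typed. The following is a minimal sketch, assuming the <sample>_R1.fastq.gz / <sample>_R2.fastq.gz naming from step 1 and that the pipeline is launched from the directory containing the FASTQ files, so bare file names are enough. The function name make_samplesheet is hypothetical.

```shell
#!/usr/bin/env bash
# make_samplesheet (hypothetical helper): print a samplesheet with the columns
# sample,fastq_1,fastq_2,strandedness for every <sample>_R1.fastq.gz found in
# a directory. Strandedness is left as "auto", as in the example above.
make_samplesheet() {
    local dir=$1 r1 sample
    echo "sample,fastq_1,fastq_2,strandedness"
    for r1 in "$dir"/*_R1.fastq.gz; do
        [ -e "$r1" ] || continue                    # glob matched nothing
        sample=$(basename "$r1" _R1.fastq.gz)       # strip directory and suffix
        echo "${sample},${sample}_R1.fastq.gz,${sample}_R2.fastq.gz,auto"
    done
}

# Demo in a throwaway directory with one made-up sample.
demo=$(mktemp -d)
touch "$demo/AYE-WT_ctr_r1_R1.fastq.gz" "$demo/AYE-WT_ctr_r1_R2.fastq.gz"
make_samplesheet "$demo"
rm -rf "$demo"
```

For the real data the call would be make_samplesheet . > samplesheet.csv. The sketch does not verify that each R2 file exists, so it is worth running after the link check rather than instead of it.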
  5. Downloading CU459141.fasta and CU459141.gff from GenBank and preparing CU459141_m.gff

     #Example1: http://xgenes.com/article/article-content/157/prepare-virus-gtf-for-nextflow-run/
     #Default NOT_WORKING: --gtf_group_features 'gene_id'  --gtf_extra_attributes 'gene_name' --featurecounts_group_type 'gene_biotype' --featurecounts_feature_type 'exon'
     #(host_env) !NOT_WORKING! jhuang@WS-2290C:~/DATA/Data_Tam_RNAseq_2024$ /usr/local/bin/nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results    --fasta "/home/jhuang/DATA/Data_Tam_RNAseq_2024/CU459141.fasta" --gff "/home/jhuang/DATA/Data_Tam_RNAseq_2024/CU459141.gff"        -profile docker -resume  --max_cpus 55 --max_memory 512.GB --max_time 2400.h    --save_align_intermeds --save_unaligned --save_reference    --aligner 'star_salmon'    --gtf_group_features 'gene_id'  --gtf_extra_attributes 'gene_name' --featurecounts_group_type 'gene_biotype' --featurecounts_feature_type 'transcript'
    
     # -- DEBUG_1 (CDS --> exon in CP059040.gff) --
     #Checking the record (see below) in results/genome/CP059040.gtf
     #In ./results/genome/CP059040.gtf e.g. "CP059040.1      Genbank transcript      1       1398    .       +       .       transcript_id "gene-H0N29_00005"; gene_id "gene-H0N29_00005"; gene_name "dnaA"; Name "dnaA"; gbkey "Gene"; gene "dnaA"; gene_biotype "protein_coding"; locus_tag "H0N29_00005";"
     #--featurecounts_feature_type 'transcript' returns only the tRNA results
     #Reason: the tRNA records have "transcript" and "exon" features, whereas the gene records have "transcript" and "CDS". Solution: replace CDS with exon.
    
     grep -P "\texon\t" CP059040.gff | sort | wc -l    #96
     grep -P "cmsearch\texon\t" CP059040.gff | wc -l    #=10  signal recognition particle sRNA small type, transfer-messenger RNA, 5S ribosomal RNA
     grep -P "Genbank\texon\t" CP059040.gff | wc -l    #=12  16S and 23S ribosomal RNA
     grep -P "tRNAscan-SE\texon\t" CP059040.gff | wc -l    #tRNA 74
     wc -l star_salmon/AUM_r3/quant.genes.sf  #--featurecounts_feature_type 'transcript' results in 96 records!
    
     grep -P "\tCDS\t" CU459141.gff3 | wc -l  #3659
     sed 's/\tCDS\t/\texon\t/g' CU459141.gff3 > CU459141_m.gff
     grep -P "\texon\t" CU459141_m.gff | sort | wc -l  #3760
    
     # -- DEBUG_2: combination of 'CU459141_m.gff' and 'exon' results in ERROR, using 'transcript' instead!
     --gff "/home/jhuang/DATA/Data_Tam_RNAseq_2026_on_AYE/CU459141_m.gff" --featurecounts_feature_type 'transcript'
    
     # -- DEBUG_3: make sure the FASTA header (sequence ID) matches the sequence ID in column 1 of the *_m.gff file
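
The DEBUG_3 check can be automated. The following is a minimal sketch, assuming a standard FASTA file (ID is the first word after ">") and a tab-separated GFF file (ID in column 1). The function name check_seqids is hypothetical.

```shell
#!/usr/bin/env bash
# check_seqids (hypothetical helper): compare the sequence IDs in a FASTA file
# with the IDs in column 1 of a GFF file. Prints any ID that occurs in only
# one of the two files and returns 1 in that case.
check_seqids() {
    local fasta=$1 gff=$2 diff_ids
    diff_ids=$(
        {
            # FASTA side: header lines, without ">" and without the description
            grep '^>' "$fasta" | sed 's/^>//; s/[[:space:]].*$//' | sort -u
            # GFF side: column 1 of tab-separated feature lines, comments skipped
            awk -F'\t' '!/^#/ && NF > 1 { print $1 }' "$gff" | sort -u
        } | sort | uniq -u      # IDs present in both lists appear twice and are dropped
    )
    if [ -n "$diff_ids" ]; then
        echo "IDs present in only one file:"
        echo "$diff_ids"
        return 1
    fi
}

# Demo with made-up files: the FASTA ID carries a version suffix, the GFF ID does not.
demo=$(mktemp -d)
printf '>chr_demo.1 made-up description\nACGT\n' > "$demo/demo.fasta"
printf '##gff-version 3\nchr_demo\tsrc\texon\t1\t4\t.\t+\t.\tID=e1\n' > "$demo/demo.gff"
check_seqids "$demo/demo.fasta" "$demo/demo.gff" || echo "rename the FASTA header or the GFF column 1"
rm -rf "$demo"
```

For this project the call would be check_seqids CU459141.fasta CU459141_m.gff. A sequence with no annotated features is also reported, which is harmless but worth knowing when a replicon carries no genes.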
  6. nextflow run

     # ---- SUCCESSFUL with directly downloaded gff3 and fasta from NCBI using docker after replacing 'CDS' with 'exon' ----
     (host_env) mv trimmed/*.fastq.gz .
     (host_env) nextflow run nf-core/rnaseq -r 3.14.0 -profile docker \
         --input samplesheet.csv --outdir results \
         --fasta "/home/jhuang/DATA/Data_Tam_RNAseq_2026_on_AYE/CU459141.fasta" \
         --gff "/home/jhuang/DATA/Data_Tam_RNAseq_2026_on_AYE/CU459141_m.gff" \
         -resume --max_cpus 90 --max_memory 900.GB --max_time 2400.h \
         --save_align_intermeds --save_unaligned --save_reference \
         --aligner 'star_salmon' \
         --gtf_group_features 'gene_id' --gtf_extra_attributes 'gene_name' \
         --featurecounts_group_type 'gene_biotype' --featurecounts_feature_type 'transcript'

  7. Import data and pca-plot

     #mamba activate r_env
    
     #install.packages("ggfun")
     # Import the required libraries (each package is loaded once)
     library("AnnotationDbi")
     library("clusterProfiler")
     library("ReactomePA")
     library(gplots)
     library(tximport)
     library(DESeq2)
     #library("org.Hs.eg.db")
     library(dplyr)
     library(tidyverse)
     #install.packages("devtools")
     #devtools::install_version("gtable", version = "0.3.0")
     library("RColorBrewer")
     #install.packages("ggrepel")
     library("ggrepel")
     # install.packages("openxlsx")
     library(openxlsx)
     library(EnhancedVolcano)
     library(edgeR)
    
     setwd("~/DATA/Data_Tam_RNAseq_2026_on_AYE/results/star_salmon")
     # Define paths to your Salmon output quantification files
    
     # Store sample names in a character vector
     samples <- c(
         "AYE-O_Azi20_solid_r1", "AYE-O_Azi20_solid_r2", "AYE-O_Azi20_solid_r3", "AYE-O_ctr_r1", "AYE-O_ctr_r2",
         "AYE-O_ctr_r3", "AYE-O_ctr_solid_r1", "AYE-O_ctr_solid_r2", "AYE-O_ctr_solid_r3",
         "AYE-O_Diclo375_r1", "AYE-O_Diclo375_r2", "AYE-O_Diclo375_r3", "AYE-O_Mero0.5_r1",
         "AYE-O_Mero0.5_r2", "AYE-O_Mero0.5_r3", "AYE-O_Rifampicin2_r1", "AYE-O_Rifampicin2_r2",
         "AYE-O_Rifampicin2_r3", "AYE-T_Azi20_solid_r1", "AYE-T_Azi20_solid_r2", "AYE-T_Azi20_solid_r3",
         "AYE-T_ctr_r1", "AYE-T_ctr_r2", "AYE-T_ctr_r3", "AYE-T_ctr_solid_r1", "AYE-T_ctr_solid_r2",
         "AYE-T_ctr_solid_r3", "AYE-T_Diclo375_r1", "AYE-T_Diclo375_r2", "AYE-T_Diclo375_r3",
         "AYE-T_Mero0.15_r1", "AYE-T_Mero0.15_r2", "AYE-T_Mero0.15_r3", "AYE-T_Rifampicin2_r1",
         "AYE-T_Rifampicin2_r2", "AYE-T_Rifampicin2_r3", "AYE-WT_Azi20_solid_r1", "AYE-WT_Azi20_solid_r2",
         "AYE-WT_Azi20_solid_r3", "AYE-WT_ctr_r1", "AYE-WT_ctr_r2", "AYE-WT_ctr_r3", "AYE-WT_ctr_solid_r1",
         "AYE-WT_ctr_solid_r2", "AYE-WT_ctr_solid_r3", "AYE-WT_Diclo1250_solid_r1", "AYE-WT_Diclo1250_solid_r2",
         "AYE-WT_Diclo1250_solid_r3", "AYE-WT_Diclo750_r1", "AYE-WT_Diclo750_r2", "AYE-WT_Diclo750_r3",
         "AYE-WT_Mero0.35-0.5_r1", "AYE-WT_Mero0.35-0.5_r2", "AYE-WT_Mero0.35-0.5_r3", "AYE-WT_Rifampicin1.5_r1",
         "AYE-WT_Rifampicin1.5_r2", "AYE-WT_Rifampicin1.5_r3", "F_Azi20_solid_r1", "F_Azi20_solid_r2",
         "F_Azi20_solid_r3", "F_ctr_solid_r1", "F_ctr_solid_r2", "F_ctr_solid_r3", "O-Trans_ctr_r1",
         "O-Trans_ctr_r2", "O-Trans_ctr_r3", "O-Trans_Diclo375_r1", "O-Trans_Diclo375_r2", "O-Trans_Diclo375_r3",
         "O-Trans_Mero0.25_r1", "O-Trans_Mero0.25_r2", "O-Trans_Mero0.25_r3", "O-Trans_Rifampicin2_r1",
         "O-Trans_Rifampicin2_r2", "O-Trans_Rifampicin2_r3", "WT-Trans_ctr_r1", "WT-Trans_ctr_r2",
         "WT-Trans_ctr_r3", "WT-Trans_Diclo750_r1", "WT-Trans_Diclo750_r2", "WT-Trans_Diclo750_r3"
     )
    
     ## Automatically generate the named vector
     files <- setNames(paste0("./", samples, "/quant.sf"), samples)
    
     # -----------------------------------------------------------------
     # ---- Step 1: Create Detailed Metadata from Your Sample Names ----
    
     # Extract metadata from sample names
     samples <- names(files)
    
     # Parse the complex sample names
     metadata <- data.frame(
     sample = samples,
     stringsAsFactors = FALSE
     )
    
     # Extract strain (everything before first underscore or hyphen treatment)
     metadata$strain <- sapply(strsplit(samples, "[-_]"), function(x) {
     if(x[1] %in% c("AYE", "O", "WT", "F")) {
         if(x[1] == "AYE" && length(x) > 1 && x[2] %in% c("WT", "T", "O")) {
         paste(x[1:2], collapse = "-")
         } else if(x[1] %in% c("O", "WT") && x[2] == "Trans") {
         paste(x[1:2], collapse = "-")
         } else {
         x[1]
         }
     } else {
         x[1]
     }
     })
    
     # Extract treatment type
     metadata$treatment <- sapply(samples, function(x) {
     if(grepl("_ctr", x)) return("ctrl")
     if(grepl("Diclo", x)) return("Diclo")
     if(grepl("Mero", x)) return("Mero")
     if(grepl("Azi", x)) return("Azi")
     if(grepl("Rifampicin", x)) return("Rifampicin")
     return("ctrl")
     })
    
     # Extract concentration
     metadata$concentration <- sapply(samples, function(x) {
     if(grepl("Diclo1250", x)) return("1250")
     if(grepl("Diclo750", x)) return("750")
     if(grepl("Diclo375", x)) return("375")
     if(grepl("Mero0.5", x)) return("0.5")
     if(grepl("Mero0.35", x)) return("0.35")
     if(grepl("Mero0.25", x)) return("0.25")
     if(grepl("Mero0.15", x)) return("0.15")
     if(grepl("Azi20", x)) return("20")
     if(grepl("Rifampicin2", x)) return("2")
     if(grepl("Rifampicin1.5", x)) return("1.5")
     return("0")
     })
    
     # Extract condition (solid vs liquid)
     metadata$condition <- ifelse(grepl("_solid", samples), "solid", "liquid")
    
     # Extract replicate
     metadata$replicate <- sapply(strsplit(samples, "_"), function(x) {
     rep_part <- x[length(x)]
     gsub("r", "", rep_part)
     })
    
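     # Create a preliminary combined group (redefined in Step 2 to include condition)
     metadata$group <- paste(metadata$strain, metadata$treatment, metadata$concentration, sep = "_")
    
     # Set row names
     rownames(metadata) <- metadata$sample
    
     # Import the Salmon quantifications.
     # Assumption: txOut = TRUE keeps the features exactly as listed in quant.sf;
     # pass a tx2gene table instead if gene-level summarization is needed.
     txi <- tximport(files, type = "salmon", txOut = TRUE)
    
     # Reorder to match txi columns
     metadata <- metadata[colnames(txi$counts), ]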
    
     # ---------------------------------------------
     # ---- Step 2: Choose Your Design Strategy ----
    
     # Strategy A: Full Factorial Design (only if the design is balanced; not used here)
     #dds <- DESeqDataSetFromTximport(txi, metadata,
     #                         design = ~ strain + treatment + condition)
    
     # --> Strategy B: Combined Group Factor ⭐ RECOMMENDED
     metadata$group <- factor(paste(metadata$strain,
                                     metadata$treatment,
                                     metadata$concentration,
                                     metadata$condition,
                                     sep = "_"))
    
     dds <- DESeqDataSetFromTximport(txi, metadata,
                                     design = ~ group)
     dds <- DESeq(dds)
    
     # See all available comparisons
     resultsNames(dds)
    
     # -------------------------------------------------------------
     # ---- Step 3: Set Up Specific Comparisons from Your Notes ----
     # ==========================================
     # 1. Define Exact Comparisons from Your Notes
     # ==========================================
     planned_comparisons <- list(
     # --- Baseline / Strain Controls ---
     AYE_T_ctr_vs_AYE_WT_ctr            = list(treat = "AYE-T_ctrl_0_liquid",   ctrl = "AYE-WT_ctrl_0_liquid"),
     AYE_O_ctr_vs_AYE_WT_ctr            = list(treat = "AYE-O_ctrl_0_liquid",   ctrl = "AYE-WT_ctrl_0_liquid"),
     O_Trans_ctr_vs_AYE_WT_ctr          = list(treat = "O-Trans_ctrl_0_liquid", ctrl = "AYE-WT_ctrl_0_liquid"),
     WT_Trans_ctr_vs_AYE_WT_ctr         = list(treat = "WT-Trans_ctrl_0_liquid",ctrl = "AYE-WT_ctrl_0_liquid"),
     AYE_O_ctr_vs_AYE_T                 = list(treat = "AYE-O_ctrl_0_liquid",   ctrl = "AYE-T_ctrl_0_liquid"),
     O_Trans_ctr_vs_AYE_T               = list(treat = "O-Trans_ctrl_0_liquid", ctrl = "AYE-T_ctrl_0_liquid"),
     WT_Trans_ctr_vs_AYE_T              = list(treat = "WT-Trans_ctrl_0_liquid",ctrl = "AYE-T_ctrl_0_liquid"),
    
     # --- Condition Effects (Solid vs Liquid) ---
     AYE_WT_ctr_solid_vs_AYE_WT_ctr     = list(treat = "AYE-WT_ctrl_0_solid",   ctrl = "AYE-WT_ctrl_0_liquid"),
     AYE_O_ctr_solid_vs_AYE_O_ctr       = list(treat = "AYE-O_ctrl_0_solid",    ctrl = "AYE-O_ctrl_0_liquid"),
     AYE_T_ctr_solid_vs_AYE_T_ctr       = list(treat = "AYE-T_ctrl_0_solid",    ctrl = "AYE-T_ctrl_0_liquid"),
     AYE_O_ctr_solid_vs_AYE_WT_ctr_solid= list(treat = "AYE-O_ctrl_0_solid",    ctrl = "AYE-WT_ctrl_0_solid"),
     AYE_T_ctr_solid_vs_AYE_WT_ctr_solid= list(treat = "AYE-T_ctrl_0_solid",    ctrl = "AYE-WT_ctrl_0_solid"),
    
     # --- Diclofenac ---
     AYE_WT_Diclo750_vs_AYE_WT_ctr      = list(treat = "AYE-WT_Diclo_750_liquid",   ctrl = "AYE-WT_ctrl_0_liquid"),
     AYE_T_Diclo375_vs_AYE_WT_ctr       = list(treat = "AYE-T_Diclo_375_liquid",    ctrl = "AYE-WT_ctrl_0_liquid"),
     AYE_O_Diclo375_vs_AYE_WT_ctr       = list(treat = "AYE-O_Diclo_375_liquid",    ctrl = "AYE-WT_ctrl_0_liquid"),
     O_Trans_Diclo375_vs_AYE_WT_ctr     = list(treat = "O-Trans_Diclo_375_liquid",  ctrl = "AYE-WT_ctrl_0_liquid"),
     WT_Trans_Diclo750_vs_AYE_WT_ctr    = list(treat = "WT-Trans_Diclo_750_liquid", ctrl = "AYE-WT_ctrl_0_liquid"),
     Diclo_AYE_WT_1250_solid_vs_solid_ctr = list(treat = "AYE-WT_Diclo_1250_solid", ctrl = "AYE-WT_ctrl_0_solid"),
    
     # --- Meropenem ---
     AYE_WT_Mero_vs_AYE_WT_ctr          = list(treat = "AYE-WT_Mero_0.35_liquid", ctrl = "AYE-WT_ctrl_0_liquid"),
     AYE_T_Mero_vs_AYE_WT_ctr           = list(treat = "AYE-T_Mero_0.15_liquid",      ctrl = "AYE-WT_ctrl_0_liquid"),
     AYE_O_Mero_vs_AYE_WT_ctr           = list(treat = "AYE-O_Mero_0.5_liquid",       ctrl = "AYE-WT_ctrl_0_liquid"),
     O_Trans_Mero_vs_AYE_WT_ctr         = list(treat = "O-Trans_Mero_0.25_liquid",    ctrl = "AYE-WT_ctrl_0_liquid"),
     AYE_T_Mero_vs_AYE_T_ctr            = list(treat = "AYE-T_Mero_0.15_liquid",      ctrl = "AYE-T_ctrl_0_liquid"),
    
     # --- Azithromycin (Solid) ---
     AYE_WT_Azi_vs_solid_ctr            = list(treat = "AYE-WT_Azi_20_solid", ctrl = "AYE-WT_ctrl_0_solid"),
     AYE_T_Azi_vs_solid_ctr             = list(treat = "AYE-T_Azi_20_solid",  ctrl = "AYE-T_ctrl_0_solid"),
     AYE_O_Azi_vs_solid_ctr             = list(treat = "AYE-O_Azi_20_solid",  ctrl = "AYE-O_ctrl_0_solid"),
     F_Azi_vs_F_solid_ctr               = list(treat = "F_Azi_20_solid",      ctrl = "F_ctrl_0_solid"),
    
     # --- Rifampicin ---
     AYE_WT_Rif_vs_AYE_WT_ctr           = list(treat = "AYE-WT_Rifampicin_1.5_liquid", ctrl = "AYE-WT_ctrl_0_liquid"),
     AYE_T_Rif_vs_AYE_T_ctr             = list(treat = "AYE-T_Rifampicin_2_liquid",    ctrl = "AYE-T_ctrl_0_liquid"),
     AYE_O_Rif_vs_AYE_O_ctr             = list(treat = "AYE-O_Rifampicin_2_liquid",    ctrl = "AYE-O_ctrl_0_liquid"),
     O_Trans_Rif_vs_O_Trans_ctr         = list(treat = "O-Trans_Rifampicin_2_liquid",  ctrl = "O-Trans_ctrl_0_liquid")
     )
    
     # ==========================================
     # 2. Verification & Validation Script
     # ==========================================
     # Identify which column in colData holds your group names
     group_col <- if("group" %in% colnames(colData(dds))) "group" else
                 if("treatment" %in% colnames(colData(dds))) "treatment" else
                 stop("❌ Please specify the correct colData column containing group names.")
    
     actual_groups <- unique(colData(dds)[[group_col]])
    
     cat("\n", paste(rep("=", 85), collapse=""), "\n")
     cat("📋 VERIFICATION OF NOTE-DERIVED COMPARISONS\n")
     cat(paste(rep("=", 85), collapse=""), "\n\n")
    
     validation_results <- data.frame(
     Comparison_Name = character(),
     Treatment_String = character(),
     Control_String = character(),
     Status = character(),
     Suggested_Contrast = character(),
     stringsAsFactors = FALSE
     )
    
     for(name in names(planned_comparisons)) {
     trt <- planned_comparisons[[name]]$treat
     ctl <- planned_comparisons[[name]]$ctrl
    
     # Require exact matches to the group labels (substring matches could pair the wrong groups)
     trt_match <- actual_groups[actual_groups == trt]
     ctl_match <- actual_groups[actual_groups == ctl]
    
     status <- if(length(trt_match) > 0 && length(ctl_match) > 0) "✅ VALID" else "⚠️  CHECK"
     contrast_str <- if(status == "✅ VALID")
         paste0('c("', group_col, '", "', trt_match[1], '", "', ctl_match[1], '")') else "N/A"
    
     validation_results <- rbind(validation_results, data.frame(
         Comparison_Name = name,
         Treatment_String = trt,
         Control_String = ctl,
         Status = status,
         Suggested_Contrast = contrast_str,
         stringsAsFactors = FALSE
     ))
    
     cat(sprintf("%-45s | T:%-25s C:%-20s | %s\n", name, trt, ctl, status))
     if(status == "⚠️  CHECK") {
         if(length(trt_match) == 0) cat("   🔍 Treat not found. Closest: ", paste(head(actual_groups[grepl(strsplit(trt, "_")[[1]][1], actual_groups)], 3), collapse=", "), "\n")
         if(length(ctl_match) == 0) cat("   🔍 Ctrl not found.  Closest: ", paste(head(actual_groups[grepl(strsplit(ctl, "_")[[1]][1], actual_groups)], 3), collapse=", "), "\n")
     }
     }
    
     # ==========================================
     # 3. Auto-Generate DESeq2 results() Calls (Optional)
     # ==========================================
     valid_comparisons <- validation_results[validation_results$Status == "✅ VALID", ]
     if(nrow(valid_comparisons) > 0) {
     cat("\n📜 READY-TO-RUN DESeq2 CONTRASTS:\n")
     cat(paste(rep("-", 60), collapse=""), "\n")
     for(i in seq_len(nrow(valid_comparisons))) {
         cat(sprintf('res_%s <- results(dds, contrast = %s)\n',
                     gsub("[^A-Za-z0-9]", "_", valid_comparisons$Comparison_Name[i]),
                     valid_comparisons$Suggested_Contrast[i]))
     }
     } else {
     cat("\n⚠️  No exact matches found. Check your colData group naming convention.\n")
     }
    
     # -----------------------------
     # ---- Step 4: PCA figures ----
    
     # 🔍 What each figure shows:
     #
     #    01_PCA_by_Strain.png → Tests if genetic background (AYE-WT, AYE-T, AYE-O, Trans, F) is the dominant source of variation.
     #    02_PCA_by_Treatment.png → Shows clustering by antibiotic/drug exposure (ctrl, Diclo, Mero, Azi, Rifampicin).
     #    03_PCA_by_Condition.png → Reveals batch/growth media effects (solid vs liquid).
     #    04_PCA_CombinedGroups.png → Full experimental grouping with labeled sample names for quick outlier detection.
     #    05_PCA_Ellipses.png → Adds 95% confidence boundaries per strain to visualize group spread and overlap.
     #
     # ⚠️ Quick Checklist Before Running:
     #
     #    Ensure metadata columns (strain, treatment, condition, group) are attached to colData(dds).
     #    If ggrepel is missing, run install.packages("ggrepel").
     #    All PNGs will save to your current working directory (getwd()).
    
     # Install if missing: install.packages(c("ggplot2", "ggrepel"))
     library(DESeq2)
     library(ggplot2)
     library(ggrepel)
    
     # 1. Variance Stabilizing Transformation & Extract PCA Data
     vsd <- vst(dds, blind = FALSE)
     pca_data <- plotPCA(vsd, intgroup = c("strain", "treatment", "condition", "group"), returnData = TRUE)
     percent_var <- round(100 * attr(pca_data, "percentVar"))
    
     # Consistent theme for all plots
     base_theme <- theme_bw(base_size = 12) +
     theme(plot.title = element_text(hjust = 0.5, face = "bold", size = 13),
             legend.position = "right",
             legend.title = element_text(face = "bold"),
             panel.grid.major = element_line(color = "grey90"),
             panel.grid.minor = element_blank())
    
     # --- Plot 1: Colored by Strain ---
     p1 <- ggplot(pca_data, aes(x = PC1, y = PC2, color = strain, shape = condition)) +
     geom_point(size = 3, alpha = 0.8) +
     geom_text_repel(aes(label = name), size = 2.5, max.overlaps = 20, show.legend = FALSE) +
     labs(x = paste0("PC1: ", percent_var[1], "% variance"),
         y = paste0("PC2: ", percent_var[2], "% variance"),
         title = "PCA: Samples Colored by Strain",
         color = "Strain", shape = "Condition") +
     base_theme
     ggsave("01_PCA_by_Strain.png", p1, width = 8, height = 6, dpi = 300)
    
     # --- Plot 2: Colored by Treatment ---
     p2 <- ggplot(pca_data, aes(x = PC1, y = PC2, color = treatment, shape = condition)) +
     geom_point(size = 3, alpha = 0.8) +
     labs(x = paste0("PC1: ", percent_var[1], "% variance"),
         y = paste0("PC2: ", percent_var[2], "% variance"),
         title = "PCA: Samples Colored by Treatment",
         color = "Treatment", shape = "Condition") +
     base_theme
     ggsave("02_PCA_by_Treatment.png", p2, width = 8, height = 6, dpi = 300)
    
     # --- Plot 3: Colored by Condition (Solid vs Liquid) ---
     p3 <- ggplot(pca_data, aes(x = PC1, y = PC2, color = condition, shape = strain)) +
     geom_point(size = 3, alpha = 0.8) +
     labs(x = paste0("PC1: ", percent_var[1], "% variance"),
         y = paste0("PC2: ", percent_var[2], "% variance"),
         title = "PCA: Samples Colored by Growth Condition",
         color = "Condition", shape = "Strain") +
     base_theme
     ggsave("03_PCA_by_Condition.png", p3, width = 8, height = 6, dpi = 300)
    
     # --- Plot 4: Combined Groups with Sample Labels ---
     p4 <- ggplot(pca_data, aes(x = PC1, y = PC2, color = group)) +
     geom_point(size = 3, alpha = 0.8) +
     geom_text_repel(aes(label = name), size = 2, max.overlaps = 30, box.padding = 0.3) +
     labs(x = paste0("PC1: ", percent_var[1], "% variance"),
         y = paste0("PC2: ", percent_var[2], "% variance"),
         title = "PCA: Combined Experimental Groups",
         color = "Group") +
     base_theme + theme(legend.position = "none")
     ggsave("04_PCA_CombinedGroups.png", p4, width = 9, height = 7, dpi = 300)
    
     # --- Plot 5: 95% Confidence Ellipses (by Strain) ---
     p5 <- ggplot(pca_data, aes(x = PC1, y = PC2, color = strain, fill = strain)) +
     geom_point(size = 3, alpha = 0.7) +
     stat_ellipse(level = 0.95, alpha = 0.2, geom = "polygon", show.legend = FALSE) +
     labs(x = paste0("PC1: ", percent_var[1], "% variance"),
         y = paste0("PC2: ", percent_var[2], "% variance"),
         title = "PCA: 95% Confidence Ellipses by Strain",
         color = "Strain", fill = "Strain") +
     base_theme
     ggsave("05_PCA_Ellipses.png", p5, width = 8, height = 6, dpi = 300)
    
     message("✅ All 5 PCA plots saved to working directory!")
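
For reference, `plotPCA()` ranks genes by their variance across samples, keeps the top `ntop` genes (500 by default), and runs `prcomp()` on the transposed matrix; the axis percentages are each component's share of the total variance. The sketch below reproduces that calculation in base R on a simulated matrix, so it runs without DESeq2. The names `sim_mat` and `top_var_pca` are illustrative and not part of the pipeline.

```r
# Minimal sketch of the calculation behind plotPCA(), using base R only.
# sim_mat stands in for assay(vsd); all values are simulated.
set.seed(1)
sim_mat <- matrix(rnorm(200 * 6), nrow = 200,
                  dimnames = list(paste0("gene", 1:200), paste0("s", 1:6)))

top_var_pca <- function(mat, ntop = 500) {
  # Rank genes by variance across samples and keep the top ntop
  gene_var <- apply(mat, 1, var)
  keep <- order(gene_var, decreasing = TRUE)[seq_len(min(ntop, length(gene_var)))]
  # PCA with samples as rows and genes as columns; prcomp centers by default
  pca <- prcomp(t(mat[keep, , drop = FALSE]))
  # Share of total variance explained by each component
  percent_var <- pca$sdev^2 / sum(pca$sdev^2)
  list(scores = pca$x[, 1:2], percent_var = percent_var)
}

res <- top_var_pca(sim_mat, ntop = 50)
round(100 * res$percent_var[1:2])
```

With the real data, `assay(vsd)` would replace `sim_mat`, and the resulting percentages should agree with `attr(pca_data, "percentVar")` when the same `ntop` is used.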
  8. Run the complete differential expression and PCA analysis

     (r_env) cd ~/DATA/Data_Tam_RNAseq_2026_Dicl_Mero_Azith_Rifa_on_AYE/results/star_salmon/
     #(r_env) Rscript complete_deg_pipeline.R  # standard-cutoff version; this project uses complete_deg_pipeline_custom_cutoff.R instead
    
     # Adapted the script to the following requests:
     # (a) Rifampicin: use genes with a cutoff of log2 fold change > 1.2 and < -1.2 for the KEGG and GO analyses.
     # (b) Baseline / Strain Controls: use genes with a cutoff of log2 fold change > 1.4 and < -1.4 for the KEGG and GO analyses.
     # (c) All other comparisons: please retain the same selection criteria as in the previous analysis you sent to me.
    
     # How it works:
     #   * Rifampicin: The script looks for "Rif" in the comparison name (e.g., 28_AYE_WT_Rif_vs_Ctrl) and applies |log2FC| >= 1.2.
     #   * Baseline/Strain Controls: The script looks for "_ctr_vs_" in the comparison name (e.g., 01_AYE_T_ctr_vs_AYE_WT_ctr) and applies |log2FC| >= 1.4.
     #   * All Others: Falls back to the original 2.0 cutoff.
     #   * The console output will now explicitly print which cutoff is being used for each specific comparison.
    
     (r_env) Rscript complete_deg_pipeline_custom_cutoff.R
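
The cutoff rules described above can be sketched as a small helper. This is only an illustration of the name-based logic, not the code in complete_deg_pipeline_custom_cutoff.R; the function name `select_lfc_cutoff` is hypothetical, and the comparison names are taken from these notes.

```r
# Hypothetical helper mirroring the name-based cutoff rules described above.
select_lfc_cutoff <- function(comparison_name) {
  if (grepl("Rif", comparison_name, fixed = TRUE)) {
    1.2   # Rifampicin comparisons
  } else if (grepl("_ctr_vs_", comparison_name, fixed = TRUE)) {
    1.4   # baseline / strain controls
  } else {
    2.0   # all other comparisons keep the original cutoff
  }
}

comparisons <- c("01_AYE_T_ctr_vs_AYE_WT_ctr",
                 "08_AYE_WT_ctr_solid_vs_liquid",
                 "28_AYE_WT_Rif_vs_Ctrl")
cutoffs <- vapply(comparisons, select_lfc_cutoff, numeric(1))

# Print which cutoff applies to each comparison
for (n in names(cutoffs)) {
  message(sprintf("%s: |log2FC| >= %.1f", n, cutoffs[[n]]))
}
```

The Rifampicin rule is checked first, so a comparison whose name contained both "Rif" and "_ctr_vs_" would get the 1.2 cutoff.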
  9. KEGG and GO annotations in non-model organisms

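    (a) Rifampicin: use genes with a cutoff of log2 fold change > 1.2 and < -1.2 for the KEGG and GO analyses.
    (b) Baseline / Strain Controls: use genes with a cutoff of log2 fold change > 1.4 and < -1.4 for the KEGG and GO analyses.
    (c) All other comparisons: please retain the same selection criteria as in the previous analysis you sent to me.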

https://www.biobam.com/functional-analysis/

10.1. Assign KEGG and GO Terms (see diagram above)

Since your organism is non-model, standard R databases (org.Hs.eg.db, etc.) won’t work. You’ll need to manually retrieve KEGG and GO annotations.

* Preparing file 1, eggnog_out.emapper.annotations.txt, for the R code below (KEGG terms): eggNOG-mapper annotation based on orthology and phylogeny

    EggNOG-mapper assigns both KEGG Orthology (KO) IDs and GO terms.

    Install EggNOG-mapper:

        mamba create -n eggnog_env python=3.8 eggnog-mapper -c conda-forge -c bioconda  #eggnog-mapper_2.1.12
        mamba activate eggnog_env

    Run annotation:

        #diamond makedb --in eggnog6.prots.faa -d eggnog_proteins.dmnd
        mkdir /home/jhuang/mambaforge/envs/eggnog_env/lib/python3.8/site-packages/data/
        download_eggnog_data.py --dbname eggnog.db -y --data_dir /home/jhuang/mambaforge/envs/eggnog_env/lib/python3.8/site-packages/data/
        #NOT_WORKING: emapper.py -i CP059040_gene.fasta -o eggnog_dmnd_out --cpu 60 -m diamond[hmmer,mmseqs] --dmnd_db /home/jhuang/REFs/eggnog_data/data/eggnog_proteins.dmnd
        #Download CU459141_protein_.fasta from NCBI
        python ~/Scripts/update_fasta_header.py CU459141_protein_.fasta CU459141_protein.fasta
        emapper.py -i CU459141_protein.fasta -o eggnog_out --cpu 60 --resume
        #----> result eggnog_out.emapper.annotations: contains KEGG, GO, and other functional annotations.
        #----> Example seed ortholog 470.IX87_14445:
            * 470 is the NCBI Taxonomy ID of the source organism (470 = Acinetobacter baumannii).
            * IX87_14445 is the locus tag of the matched gene/protein in that genome.

    Extract KEGG KO IDs from eggnog_out.emapper.annotations.
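
A minimal sketch of the KO extraction in R, assuming the eggNOG-mapper 2.1.x table layout in which the `KEGG_ko` column holds comma-separated entries such as `ko:K00001` and uses `-` when nothing is assigned. The three rows are made-up stand-ins so the block runs on its own; with real data, the table would be read from the annotation file (its header line starts with `#query`, and `##` lines are comments).

```r
# Sketch: convert the KEGG_ko column of an emapper annotation table into a
# long gene-to-KO table. Gene names and KO IDs below are made up.
annot <- data.frame(
  query   = c("geneA", "geneB", "geneC"),
  KEGG_ko = c("ko:K00001,ko:K00002", "-", "ko:K01990"),
  stringsAsFactors = FALSE
)

# Drop genes without a KO assignment ("-" in the emapper output)
annot <- annot[!is.na(annot$KEGG_ko) & annot$KEGG_ko != "-", ]

# Split comma-separated KO lists and strip the "ko:" prefix
ko_list <- strsplit(annot$KEGG_ko, ",", fixed = TRUE)
gene2ko <- data.frame(
  gene = rep(annot$query, lengths(ko_list)),
  ko   = sub("^ko:", "", unlist(ko_list)),
  stringsAsFactors = FALSE
)
print(gene2ko)
```

Keeping one row per gene and KO pair makes it straightforward to translate enriched KO IDs back to gene IDs later.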

* Preparing file 2 blast2go_annot.annot2_ for the R-code below:

  - Basic (GO Terms from 'Blast2GO 5 Basic', saved in blast2go_annot.annot): Using Blast/Diamond + Blast2GO_GUI based on sequence alignment + GO mapping

    * 'Load protein sequences' (Tags: NONE, generated columns: Nr, SeqName) -->
    * Buttons 'blast' (Tags: BLASTED, generated columns: Description, Length, #Hits, e-Value, sim mean),
    * Button 'mapping' (Tags: MAPPED, generated columns: #GO, GO IDs, GO Names), "Mapping finished - Please proceed now to annotation."
    * Button 'annot' (Tags: ANNOTATED, generated columns: Enzyme Codes, Enzyme Names), "Annotation finished."
            * Used parameter 'Annotation CutOff': The Blast2GO Annotation Rule seeks to find the most specific GO annotations with a certain level of reliability. An annotation score is calculated for each candidate GO term; it is composed of the sequence similarity of the Blast hit, the evidence code of the source GO term, and the position of the particular GO term in the Gene Ontology hierarchy. The annotation score cutoff selects the most specific GO term for a given GO branch that lies above this value.
            * Used parameter 'GO Weight': a value that is added to the Annotation Score of a more general/abstract Gene Ontology term for each of its more specific, original source GO terms. As a result, more general GO terms that summarise many original source terms (those directly associated with the Blast hits) will have a higher Annotation Score.

  - Advanced (GO Terms from 'Blast2GO 5 Basic'): Interpro based protein families / domains --> Button interpro

    * Button 'interpro' (Tags: INTERPRO, generated columns: InterPro IDs, InterPro GO IDs, InterPro GO Names) --> "InterProScan Finished - You can now merge the obtained GO Annotations."

  - MERGE the results of InterPro GO IDs (advanced) to GO IDs (basic) and generate final GO IDs, saved in blast2go_annot.annot2

    * Button 'interpro'/'Merge InterProScan GOs to Annotation' --> "Merge (add and validate) all GO terms retrieved via InterProScan to the already existing GO annotation." --> "Finished merging GO terms from InterPro with annotations. Maybe you want to run ANNEX (Annotation Augmentation)."
    * (NOT_USED) Button 'annot'/'ANNEX' --> "ANNEX finished. Maybe you want to do the next step: Enzyme Code Mapping."

  - PREPARING go_terms and ec_terms: annot_* file (NOTE that blast2go_annot.annot2 is after merging InterPro_GO_IDs and GO_IDs):

    cut -f1-2 -d$'\t' blast2go_annot.annot2 > blast2go_annot.annot2_
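
As a sketch of how the resulting two-column file can be used in R, the block below splits a (sequence ID, term) table into GO and EC tables with the term in the first column and the gene in the second, which is the TERM2GENE layout that `clusterProfiler::enricher()` expects. The three input lines are made up so the example is self-contained; the object names are illustrative.

```r
# Sketch: split a two-column annotation table (sequence ID, term) into GO and
# EC tables in TERM2GENE layout (term first, gene second). IDs are made up.
annot_lines <- c("H0N29_00005\tGO:0005524",
                 "H0N29_00005\tEC:3.6.1.3",
                 "H0N29_00010\tGO:0016020")
annot2 <- read.delim(text = annot_lines, header = FALSE,
                     col.names = c("gene", "term"),
                     stringsAsFactors = FALSE)

# Separate GO terms from enzyme codes by their prefix
go_term2gene <- annot2[grepl("^GO:", annot2$term), c("term", "gene")]
ec_term2gene <- annot2[grepl("^EC:", annot2$term), c("term", "gene")]

print(go_term2gene)
print(ec_term2gene)
```

With the real file, `read.delim("blast2go_annot.annot2_", header = FALSE, ...)` would replace the `text =` argument.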

10.2. Perform KEGG and GO Enrichment in R

      (r_env) cd /mnt/md1/DATA/Data_Tam_RNAseq_2026_Dicl_Mero_Azith_Rifa_on_AYE/results/star_salmon/DEG_Results_Complete

      #For |deg_cutoff_log_foldchange| >=1.4
      sed "s/01_AYE_T_ctr_vs_AYE_WT_ctr/02_AYE_O_ctr_vs_AYE_WT_ctr/g" 1.R > 2.R
      ...

      #For |deg_cutoff_log_foldchange| >=2.0
      sed "s/08_AYE_WT_ctr_solid_vs_liquid/09_AYE_O_ctr_solid_vs_liquid/g" 8.R > 9.R
      sed "s/08_AYE_WT_ctr_solid_vs_liquid/10_AYE_T_ctr_solid_vs_liquid/g" 8.R > 10.R
      sed "s/08_AYE_WT_ctr_solid_vs_liquid/11_AYE_O_ctr_solid_vs_AYE_WT_solid/g" 8.R > 11.R
      sed "s/08_AYE_WT_ctr_solid_vs_liquid/12_AYE_T_ctr_solid_vs_AYE_WT_solid/g" 8.R > 12.R
      sed "s/08_AYE_WT_ctr_solid_vs_liquid/13_AYE_WT_Diclo750_vs_Ctrl/g" 8.R > 13.R
      sed "s/08_AYE_WT_ctr_solid_vs_liquid/14_AYE_T_Diclo375_vs_Ctrl/g" 8.R > 14.R
      sed "s/08_AYE_WT_ctr_solid_vs_liquid/15_AYE_O_Diclo375_vs_Ctrl/g" 8.R > 15.R
      sed "s/08_AYE_WT_ctr_solid_vs_liquid/16_O_Trans_Diclo375_vs_Ctrl/g" 8.R > 16.R
      sed "s/08_AYE_WT_ctr_solid_vs_liquid/17_WT_Trans_Diclo750_vs_Ctrl/g" 8.R > 17.R
      sed "s/08_AYE_WT_ctr_solid_vs_liquid/18_AYE_WT_Diclo1250_solid_vs_Ctrl_solid/g" 8.R > 18.R
      sed "s/08_AYE_WT_ctr_solid_vs_liquid/19_AYE_WT_Mero_vs_Ctrl/g" 8.R > 19.R
      sed "s/08_AYE_WT_ctr_solid_vs_liquid/20_AYE_T_Mero_vs_Ctrl/g" 8.R > 20.R
      sed "s/08_AYE_WT_ctr_solid_vs_liquid/21_AYE_O_Mero_vs_Ctrl/g" 8.R > 21.R
      sed "s/08_AYE_WT_ctr_solid_vs_liquid/22_O_Trans_Mero_vs_Ctrl/g" 8.R > 22.R
      sed "s/08_AYE_WT_ctr_solid_vs_liquid/23_AYE_T_Mero_vs_AYE_T_Ctrl/g" 8.R > 23.R
      sed "s/08_AYE_WT_ctr_solid_vs_liquid/24_AYE_WT_Azi_solid_vs_Ctrl_solid/g" 8.R > 24.R
      sed "s/08_AYE_WT_ctr_solid_vs_liquid/25_AYE_T_Azi_solid_vs_Ctrl_solid/g" 8.R > 25.R
      sed "s/08_AYE_WT_ctr_solid_vs_liquid/26_AYE_O_Azi_solid_vs_Ctrl_solid/g" 8.R > 26.R
      sed "s/08_AYE_WT_ctr_solid_vs_liquid/27_F_Azi_solid_vs_Ctrl_solid/g" 8.R > 27.R

      #For |deg_cutoff_log_foldchange| >=1.2
      sed "s/28_AYE_WT_Rif_vs_Ctrl/29_AYE_T_Rif_vs_Ctrl/g" 28.R > 29.R
      sed "s/28_AYE_WT_Rif_vs_Ctrl/30_AYE_O_Rif_vs_Ctrl/g" 28.R > 30.R
      sed "s/28_AYE_WT_Rif_vs_Ctrl/31_O_Trans_Rif_vs_Ctrl/g" 28.R > 31.R

      (r_env) jhuang@WS-2290C:/mnt/md1/DATA/Data_Tam_RNAseq_2026_Dicl_Mero_Azith_Rifa_on_AYE/results/star_salmon/DEG_Results_Complete$ Rscript 1.R
      #=== SUMMARY ===
      #Up-regulated genes: 16
      #  Valid KEGG IDs: 4
      #  Enriched pathways: 0
      #Down-regulated genes: 151
      #  Valid KEGG IDs: 50
      #  Enriched pathways: 4
      #'select()' returned 1:1 mapping between keys and columns
      #'select()' returned 1:1 mapping between keys and columns
      #'select()' returned 1:1 mapping between keys and columns
      #=== SUMMARY ===
      #Up-regulated genes: 16
      #  Valid GO IDs: 16
      #  Enriched GO-terms: 0
      #Down-regulated genes: 151
      #  Valid KEGG IDs: 151
      #  Enriched GO-terms: 3
      #...
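
To make the "Enriched pathways" and "Enriched GO-terms" counts easier to interpret: over-representation analysis of this kind is based on a one-sided hypergeometric test per term, followed by multiple-testing correction. The sketch below shows that calculation in base R. All counts and the extra p-values are hypothetical and are not taken from the results above.

```r
# Sketch of the one-sided hypergeometric test behind over-representation
# analysis. All counts are hypothetical.
N <- 3000  # annotated background genes
M <- 40    # background genes annotated to the pathway
n <- 50    # annotated DEGs
k <- 6     # DEGs annotated to the pathway

# P(X >= k): probability of drawing k or more pathway genes among n DEGs
p_value <- phyper(k - 1, M, N - M, n, lower.tail = FALSE)

# Benjamini-Hochberg correction across all tested pathways
p_all <- c(p_value, 0.20, 0.65)  # hypothetical p-values for two other pathways
p_adj <- p.adjust(p_all, method = "BH")

print(p_value)
print(p_adj)
```

A term is reported as enriched only if its adjusted p-value passes the cutoff, which is why a gene list with many valid IDs can still yield zero enriched terms.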

  10.3. Finalizing the KEGG and GO Enrichment table

        1. NOTE (already implemented in the code): the gene IDs in KEGG_Enrichment have already been translated from KO IDs to gene IDs in H0N29_* format. If not, redo the translation using the eggNOG results, because eggNOG contains the 1:1 mapping between KO name and gene ID.
        2. NEED_MANUAL_DELETION (cutoff already set in the code): p.adjust values have been calculated, so all records in the GO_Enrichment results would have to be filtered by p.adjust <= 0.05. This is no longer needed because pvalueCutoff = 0.05 is set in enricher. Alternative: use pvalueCutoff = 1.0 and mark records with p.adjust <= 0.05 in yellow in GO_enrichment.
        3. NOTE (not occurring in the new dataset): in rare cases the description is missing for some IDs, e.g. GO:0006807 (fill in "obsolete nitrogen compound metabolic process") or ko00975 (fill in "Metabolism; Biosynthesis of other secondary metabolites").