Author Archives: gene_x

Automated β-Lactamase Gene Detection with NCBI AMRFinderPlus (Data_Patricia_AMRFinderPlus_2025, v2)

1. Installation and Database Setup

To install and prepare NCBI AMRFinderPlus in the bacto environment:

mamba activate bacto
mamba install ncbi-amrfinderplus
mamba update ncbi-amrfinderplus

mamba activate bacto
amrfinder -u

This will:
- Download and install the latest AMRFinderPlus version and its database.
- Create /home/jhuang/mambaforge/envs/bacto/share/amrfinderplus/data/.
- Symlink the latest database version for use.

Check available organism options for annotation:

amrfinder --list_organisms
#Available --organism options: Acinetobacter_baumannii, Burkholderia_cepacia, Burkholderia_mallei, Burkholderia_pseudomallei, Campylobacter, Citrobacter_freundii, Clostridioides_difficile, Corynebacterium_diphtheriae, Enterobacter_asburiae, Enterobacter_cloacae, Enterococcus_faecalis, Enterococcus_faecium, Escherichia, Klebsiella_oxytoca, Klebsiella_pneumoniae, Neisseria_gonorrhoeae, Neisseria_meningitidis, Pseudomonas_aeruginosa, Salmonella, Serratia_marcescens, Staphylococcus_aureus, Staphylococcus_pseudintermedius, Streptococcus_agalactiae, Streptococcus_pneumoniae, Streptococcus_pyogenes, Vibrio_cholerae, Vibrio_parahaemolyticus, Vibrio_vulnificus

Supported values include species such as Escherichia, Klebsiella_pneumoniae, Enterobacter_cloacae, Pseudomonas_aeruginosa and many others.

2. Batch Analysis: Bash Script for Genome Screening

Use the following script to screen multiple genomes using AMRFinderPlus and output only β-lactam/beta-lactamase hits from a metadata table.

Input: genome_metadata.tsv — tab-separated columns: filename_TAB_organism, with header.

filename    organism
58.fasta    Escherichia coli
92.fasta    Klebsiella pneumoniae
125.fasta   Enterobacter cloacae complex
128.fasta   Enterobacter cloacae complex
130.fasta   Enterobacter cloacae complex
147.fasta   Citrobacter freundii
149.fasta   Citrobacter freundii
160.fasta   Citrobacter braakii
161.fasta   Citrobacter braakii
168.fasta   Providencia stuartii
184.fasta   Klebsiella aerogenes
65.fasta    Pseudomonas aeruginosa
201.fasta   Pseudomonas aeruginosa
209.fasta   Pseudomonas aeruginosa
167.fasta   Serratia marcescens

Run:

cd ~/DATA/Data_Patricia_AMRFinderPlus_2025/genomes
./run_amrfinder_and_summarize.sh genome_metadata.tsv
#./run_amrfinder_and_summarize.sh genome_metadata_149.tsv
#OR_DETECT_RUN: amrfinder -n 92.fasta -o amrfinder_results/92.amrfinder.tsv --plus --organism Klebsiella_pneumoniae --threads 1

python summarize_from_amrfinder_results.py amrfinder_results
# or, since that's the default:
# python summarize_from_amrfinder_results.py

Produce

AMRFinder-wide outputs
- amrfinder_all.tsv
- amrfinder_summary_by_isolate_gene.tsv
- amrfinder_summary_by_gene.tsv
- amrfinder_summary_by_class.tsv (if a class column exists)
- amrfinder_summary.xlsx (with multiple sheets)
β-lactam-only outputs (if Class and Subclass are present)
- beta_lactam_all.tsv
- beta_lactam_summary_by_gene.tsv
- beta_lactam_summary_by_isolate_gene.tsv
- beta_lactam_all.xlsx
- beta_lactam_summary.xlsx

Report

Please find attached the updated AMRFinderPlus summary files, now including isolate 167. For β-lactam–specific results, please see beta_lactam_all.xlsx and beta_lactam_summary.xlsx. In particular, beta_lactam_summary.xlsx contains two sheets:

by_gene – aggregated counts and isolate lists for each β-lactam gene
by_isolate_gene – per-isolate overview of detected β-lactam genes

Script:

run_amrfinder_and_summarize.sh

    #!/usr/bin/env bash
    set -euo pipefail

    META_FILE="${1:-}"

    if [[ -z "$META_FILE" || ! -f "$META_FILE" ]]; then
    echo "Usage: $0 genome_metadata.tsv" >&2
    exit 1
    fi

    OUTDIR="amrfinder_results"
    mkdir -p "$OUTDIR"

    echo ">>> Checking AMRFinder installation..."
    amrfinder -V || { echo "ERROR: amrfinder not working"; exit 1; }
    echo

    echo ">>> Running AMRFinderPlus on all genomes listed in $META_FILE"

    # --- loop over metadata file ---
    # expected columns: filename
<TAB>organism
    tail -n +2 "$META_FILE" | while IFS=$'\t' read -r fasta organism; do
    # skip empty lines
    [[ -z "$fasta" ]] && continue

    if [[ ! -f "$fasta" ]]; then
    echo "WARN: FASTA file '$fasta' not found, skipping."
    continue
    fi

    isolate_id="${fasta%.fasta}"

    # map free-text organism to AMRFinder --organism names (optional)
    org_opt=""
    case "$organism" in
    "Escherichia coli")              org_opt="--organism Escherichia" ;;
    "Klebsiella pneumoniae")         org_opt="--organism Klebsiella_pneumoniae" ;;
    "Enterobacter cloacae complex")  org_opt="--organism Enterobacter_cloacae" ;;
    "Citrobacter freundii")          org_opt="--organism Citrobacter_freundii" ;;
    "Citrobacter braakii")           org_opt="--organism Citrobacter_freundii" ;;
    "Pseudomonas aeruginosa")        org_opt="--organism Pseudomonas_aeruginosa" ;;
    "Serratia marcescens")           org_opt="--organism Serratia_marcescens" ;;
    # others (Providencia stuartii, Klebsiella aerogenes)
    # currently have no organism-specific rules in AMRFinder, so we omit --organism
    *)                               org_opt="" ;;
    esac

    out_tsv="${OUTDIR}/${isolate_id}.amrfinder.tsv"

    echo "  - ${fasta} (${organism}) -> ${out_tsv} ${org_opt}"
    amrfinder -n "$fasta" -o "$out_tsv" --plus $org_opt
    done

    echo ">>> AMRFinderPlus runs finished."
    echo ">>> All done."
    echo "   - Individual reports: ${OUTDIR}/*.amrfinder.tsv"

summarize_from_amrfinder_results.py

    #!/usr/bin/env python3
    """
    summarize_from_amrfinder_results.py

    Usage:
    python summarize_from_amrfinder_results.py [amrfinder_results_dir]

    Default directory is "amrfinder_results" (relative to current working dir).

    This script:
    1) Reads all *.amrfinder.tsv in the given directory
    2) Merges them into a combined table
    3) Generates AMRFinder-wide summaries (amrfinder_* files)
    4) Applies a β-lactam filter:

            Element type == "AMR" (case-insensitive)
    AND Class or Subclass contains "beta-lactam" (case-insensitive)

    and generates β-lactam-only summaries (beta_lactam_* files).

    It NEVER re-runs AMRFinder; it only uses existing TSV files.
    """

    import sys
    import os
    import glob
    import re

    try:
    import pandas as pd
    except ImportError:
    sys.stderr.write(
            "ERROR: pandas is not installed.\n"
            "Install with something like:\n"
            "  mamba install pandas openpyxl -c conda-forge -c bioconda\n"
    )
    sys.exit(1)

    # ---------------------------------------------------------------------
    # Helpers
    # ---------------------------------------------------------------------

    def read_one(path):
    """Read one *.amrfinder.tsv and add an 'isolate_id' column from the filename."""
    df = pd.read_csv(path, sep="\t", dtype=str)
    df.columns = [c.strip() for c in df.columns]
    isolate_id = os.path.basename(path).replace(".amrfinder.tsv", "")
    df["isolate_id"] = isolate_id
    return df

    def pick(df, *candidates):
    """Return the first existing column name among candidates (normalized names)."""
    for c in candidates:
            if c in df.columns:
            return c
    return None

    # ---------------------------------------------------------------------
    # AMRFinder-wide summaries (no β-lactam filter)
    # ---------------------------------------------------------------------

    def make_amrfinder_summaries(
    df_all,
    col_gene,
    col_seq,
    col_class,
    col_subcls,
    col_ident,
    col_cov,
    col_iso,
    ):
    """Summaries for ALL AMRFinder hits (no β-lactam filter)."""
    if df_all.empty:
            print("[amrfinder] No rows in merged table, skipping summaries.")
            return

    # full merged table
    df_all.to_csv("amrfinder_all.tsv", sep="\t", index=False)
    print(">>> Full AMRFinder table written to: amrfinder_all.tsv")

    # ---- summary by isolate × gene ----
    rows = []
    for (iso, gene), sub in df_all.groupby([col_iso, col_gene], dropna=False):
            row = {
            "isolate_id": iso,
            "Gene_symbol": sub[col_gene].iloc[0],
            "n_hits": len(sub),
            }
            if col_seq is not None:
            row["Sequence_name"] = sub[col_seq].iloc[0]
            if col_class is not None:
            row["Class"] = sub[col_class].iloc[0]
            if col_subcls is not None:
            row["Subclass"] = sub[col_subcls].iloc[0]
            if col_ident is not None:
            vals = pd.to_numeric(sub[col_ident], errors="coerce")
            row["%identity_min"] = vals.min()
            row["%identity_max"] = vals.max()
            if col_cov is not None:
            vals = pd.to_numeric(sub[col_cov], errors="coerce")
            row["%coverage_min"] = vals.min()
            row["%coverage_max"] = vals.max()
            rows.append(row)

    summary_iso_gene = pd.DataFrame(rows)
    summary_iso_gene.to_csv(
            "amrfinder_summary_by_isolate_gene.tsv", sep="\t", index=False
    )
    print(">>> Isolate × gene summary written to: amrfinder_summary_by_isolate_gene.tsv")

    # ---- summary by gene ----
    def join(vals):
            uniq = sorted(set(vals.dropna().astype(str)))
            return ",".join(uniq)

    rows = []
    for gene, sub in df_all.groupby(col_gene, dropna=False):
            row = {
            "Gene_symbol": sub[col_gene].iloc[0],
            "n_isolates": sub[col_iso].nunique(),
            "isolates": ",".join(sorted(set(sub[col_iso].dropna().astype(str)))),
            "n_hits": len(sub),
            }
            if col_seq is not None:
            row["Sequence_name"] = join(sub[col_seq])
            if col_class is not None:
            row["Class"] = join(sub[col_class])
            if col_subcls is not None:
            row["Subclass"] = join(sub[col_subcls])
            rows.append(row)

    summary_gene = pd.DataFrame(rows)
    summary_gene = summary_gene.sort_values("n_isolates", ascending=False)
    summary_gene.to_csv("amrfinder_summary_by_gene.tsv", sep="\t", index=False)
    print(">>> Gene-level summary written to: amrfinder_summary_by_gene.tsv")

    # ---- summary by class/subclass ----
    summary_class = None
    if col_class is not None:
            group_cols = [col_class]
            if col_subcls is not None:
            group_cols.append(col_subcls)

            summary_class = (
            df_all.groupby(group_cols, dropna=False)
            .agg(
                    n_isolates=(col_iso, "nunique"),
                    n_hits=(col_iso, "size"),
            )
            .reset_index()
            )
            summary_class.to_csv("amrfinder_summary_by_class.tsv", sep="\t", index=False)
            print(">>> Class-level summary written to: amrfinder_summary_by_class.tsv")
    else:
            print(">>> No 'class' column found; amrfinder_summary_by_class.tsv not created.")

    # ---- Excel workbook ----
    try:
            with pd.ExcelWriter("amrfinder_summary.xlsx") as xw:
            df_all.to_excel(xw, sheet_name="amrfinder_all", index=False)
            summary_iso_gene.to_excel(xw, sheet_name="by_isolate_gene", index=False)
            summary_gene.to_excel(xw, sheet_name="by_gene", index=False)
            if summary_class is not None:
                    summary_class.to_excel(xw, sheet_name="by_class", index=False)
            print(">>> Excel workbook written: amrfinder_summary.xlsx")
    except Exception as e:
            print("WARNING: could not write amrfinder_summary.xlsx:", e)

    # ---------------------------------------------------------------------
    # β-lactam summaries
    # ---------------------------------------------------------------------

    def make_beta_lactam_summaries(
    df_beta,
    col_gene,
    col_seq,
    col_subcls,
    col_ident,
    col_cov,
    col_iso,
    ):
    """Summaries for β-lactam subset (after mask)."""
    if df_beta.empty:
            print("[beta_lactam] No β-lactam hits in subset, skipping.")
            return

    # full β-lactam table
    beta_all_tsv = "beta_lactam_all.tsv"
    df_beta.to_csv(beta_all_tsv, sep="\t", index=False)
    print(">>> β-lactam / β-lactamase hits written to: %s" % beta_all_tsv)

    # -------- summary by gene (with list of isolates) --------
    group_cols = [col_gene]
    if col_seq is not None:
            group_cols.append(col_seq)
    if col_subcls is not None:
            group_cols.append(col_subcls)

    def join_isolates(vals):
            uniq = sorted(set(vals.dropna().astype(str)))
            return ",".join(uniq)

    summary_gene = (
            df_beta.groupby(group_cols, dropna=False)
            .agg(
            n_isolates=(col_iso, "nunique"),
            isolates=(col_iso, join_isolates),
            n_hits=(col_iso, "size"),
            )
            .reset_index()
    )

    rename_map = {}
    if col_gene is not None:
            rename_map[col_gene] = "Gene_symbol"
    if col_seq is not None:
            rename_map[col_seq] = "Sequence_name"
    if col_subcls is not None:
            rename_map[col_subcls] = "Subclass"
    summary_gene.rename(columns=rename_map, inplace=True)

    sum_gene_tsv = "beta_lactam_summary_by_gene.tsv"
    summary_gene.to_csv(sum_gene_tsv, sep="\t", index=False)
    print(">>> Gene-level β-lactam summary written to: %s" % sum_gene_tsv)
    print("    (includes 'isolates' = comma-separated isolate_ids)")

    # -------- summary by isolate & gene (with annotation) --------
    rows = []
    for (iso, gene), sub in df_beta.groupby([col_iso, col_gene], dropna=False):
            row = {
            "isolate_id": iso,
            "Gene_symbol": sub[col_gene].iloc[0],
            "n_hits": len(sub),
            }
            if col_seq is not None:
            row["Sequence_name"] = sub[col_seq].iloc[0]
            if col_subcls is not None:
            row["Subclass"] = sub[col_subcls].iloc[0]

            if col_ident is not None:
            vals = pd.to_numeric(sub[col_ident], errors="coerce")
            row["%identity_min"] = vals.min()
            row["%identity_max"] = vals.max()
            if col_cov is not None:
            vals = pd.to_numeric(sub[col_cov], errors="coerce")
            row["%coverage_min"] = vals.min()
            row["%coverage_max"] = vals.max()

            rows.append(row)

    summary_iso_gene = pd.DataFrame(rows)
    sum_iso_gene_tsv = "beta_lactam_summary_by_isolate_gene.tsv"
    summary_iso_gene.to_csv(sum_iso_gene_tsv, sep="\t", index=False)
    print(">>> Isolate × gene β-lactam summary written to: %s" % sum_iso_gene_tsv)
    print("    (includes 'Gene_symbol' and 'Sequence_name' annotation columns)")

    # -------- optional Excel exports --------
    try:
            with pd.ExcelWriter("beta_lactam_all.xlsx") as xw:
            df_beta.to_excel(xw, sheet_name="beta_lactam_all", index=False)
            with pd.ExcelWriter("beta_lactam_summary.xlsx") as xw:
            summary_gene.to_excel(xw, sheet_name="by_gene", index=False)
            summary_iso_gene.to_excel(xw, sheet_name="by_isolate_gene", index=False)
            print(">>> Excel workbooks written: beta_lactam_all.xlsx, beta_lactam_summary.xlsx")
    except Exception as e:
            print("WARNING: could not write β-lactam Excel files:", e)

    # ---------------------------------------------------------------------
    # Main
    # ---------------------------------------------------------------------

    def main():
    outdir = sys.argv[1] if len(sys.argv) > 1 else "amrfinder_results"

    if not os.path.isdir(outdir):
            sys.stderr.write("ERROR: directory '%s' not found.\n" % outdir)
            sys.exit(1)

    files = sorted(glob.glob(os.path.join(outdir, "*.amrfinder.tsv")))
    if not files:
            sys.stderr.write("ERROR: no *.amrfinder.tsv files found in '%s'.\n" % outdir)
            sys.exit(1)

    print(">>> Found %d AMRFinder TSV files in: %s" % (len(files), outdir))
    for f in files:
            print("   -", os.path.basename(f))

    dfs = [read_one(f) for f in files]
    df = pd.concat(dfs, ignore_index=True)

    # normalize column names for internal use
    norm_cols = {c: c.strip().lower().replace(" ", "_") for c in df.columns}
    df.rename(columns=norm_cols, inplace=True)

    # locate columns (handles your Element type / subtype + older formats)
    col_gene       = pick(df, "gene_symbol", "genesymbol")
    col_seq        = pick(df, "sequence_name", "sequencename")
    col_elemtype   = pick(df, "element_type")
    col_elemsub    = pick(df, "element_subtype")
    col_class      = pick(df, "class")
    col_subcls     = pick(df, "subclass")
    col_ident      = pick(df, "%identity_to_reference_sequence", "identity")
    col_cov        = pick(df, "%coverage_of_reference_sequence", "coverage_of_reference_sequence")
    col_iso        = "isolate_id"

    print("\nDetected columns:")
    for label, col in [
            ("gene", col_gene),
            ("sequence", col_seq),
            ("element_type", col_elemtype),
            ("element_subtype", col_elemsub),
            ("class", col_class),
            ("subclass", col_subcls),
            ("%identity", col_ident),
            ("%coverage", col_cov),
            ("isolate_id", col_iso),
    ]:
            print("  %-14s: %s" % (label, col))

    if col_gene is None:
            sys.stderr.write(
            "ERROR: could not find a gene symbol column "
            "(expected something like 'Gene symbol' in the original AMRFinder output).\n"
            )
            sys.exit(1)

    print("\n=== Generating AMRFinder-wide summaries (all hits) ===")
    make_amrfinder_summaries(
            df_all=df,
            col_gene=col_gene,
            col_seq=col_seq,
            col_class=col_class,
            col_subcls=col_subcls,
            col_ident=col_ident,
            col_cov=col_cov,
            col_iso=col_iso,
    )

    # -----------------------------------------------------------------
    # β-lactam subset
    #
    # New logic for your current data:
    #   Element type == "AMR"
    #   AND Class or Subclass contains "beta-lactam"
    #
    # Falls back to just Class/Subclass if Element type not present.
    # -----------------------------------------------------------------
    if (col_elemtype is not None) or (col_class is not None or col_subcls is not None):

            # element type AMR (if column exists, otherwise all True)
            if col_elemtype is not None:
            mask_amr = df[col_elemtype].str.contains("AMR", case=False, na=False)
            else:
            mask_amr = pd.Series(True, index=df.index)

            # beta-lactam pattern (handles BETA-LACTAM, beta lactam, etc.)
            beta_pattern = re.compile(r"beta[- ]?lactam", re.IGNORECASE)

            mask_beta = pd.Series(False, index=df.index)
            if col_class is not None:
            mask_beta |= df[col_class].fillna("").str.contains(beta_pattern)
            if col_subcls is not None:
            mask_beta |= df[col_subcls].fillna("").str.contains(beta_pattern)

            mask = mask_amr & mask_beta
            df_beta = df.loc[mask].copy()

            if df_beta.empty:
            print(
                    "\nWARNING: No β-lactam hits found "
                    "(Element type == 'AMR' AND Class/Subclass contains 'beta-lactam')."
            )
            else:
            print(
                    "\n=== β-lactam subset ===\n"
                    "  kept %d of %d rows where Element type is 'AMR' and "
                    "Class/Subclass contains 'beta-lactam'\n"
                    % (len(df_beta), len(df))
            )
            make_beta_lactam_summaries(
                    df_beta=df_beta,
                    col_gene=col_gene,
                    col_seq=col_seq,
                    col_subcls=col_subcls,
                    col_ident=col_ident,
                    col_cov=col_cov,
                    col_iso=col_iso,
            )
    else:
            print(
            "\nWARNING: Cannot apply β-lactam filter because Element type and/or "
            "class/subclass columns were not found. Only amrfinder_* "
            "outputs were generated."
            )

    if __name__ == "__main__":
    main()

Automated Kymograph Track Filtering & Lake File Generation (Data_Vero_Kymographs)

Leave a reply

Title: Automated Kymograph Track Filtering & Lake File Generation (kymograph轨迹自动过滤与Lake文件生成流程)

Step 1 – Track Filtering with 1_filter_track.py

（用1_filter_track.py进行轨迹过滤）运行命令：

python 1_filter_track.py

核心思路：对每个原始*_blue.csv轨迹文件，根据位置和寿命（lifetime）进行过滤，将保留的轨迹和被剔除的轨迹分别存放于两个目录：

filtered/ → 通过过滤条件保留下来的轨迹
separated/ → 不满足过滤条件被剔除的轨迹共有74个原始*_blue.csv文件。确保对每个原始blue文件，针对每种过滤条件输出对应文件：
有轨迹通过过滤时，生成正常的filtered CSV（含数据行）
无轨迹通过过滤时，生成占位placeholder文件，格式正确，仅含header和注释，无数据此设计确保后续2_update_lakes.py能正常读取，并判定该条件下无轨迹，保证流水线完整一致。

Step 2 – Organize filtered CSVs and Fix p940 Naming Bug

（整理过滤结果CSV，修正文件名命名错误）创建文件夹：

mkdir filtered_blue_position filtered_blue_position_1s filtered_blue_position_5s filtered_blue_lifetime_5s_only

移动对应过滤文件：

1) 绑定位置2.2–3.8 µm

mv filtered/*_blue_position.csv filtered_blue_position

2) 绑定位置且寿命≥1s

mv filtered/*_blue_position_1s.csv filtered_blue_position_1s

3) 绑定位置且寿命≥5s

mv filtered/*_blue_position_5s.csv filtered_blue_position_5s

4) 寿命≥5s不限制位置

mv filtered/*_blue_lifetime_5s_only.csv filtered_blue_lifetime_5s_only

每个目录保留74个CSV文件（包含真实轨迹和header-only占位符）。修正p940命名bug（文件名中p940与lake文件中940不匹配），统一去除多余的p：

find filtered_blue_position -type f -name 'p*_p[0-9][0-9][0-9]_*.csv' -exec rename 's/_p([0-9]{3})/_$1/' {} +
（同理在其它三个目录执行相同命令）

保证轨迹CSV名与lake文件中kymograph名称一一对应。

Step 3 – Write filtered tracks back to lake files

（把过滤后轨迹写回lake文件）运行命令更新lake文件（每组过滤条件对应一组输出目录）：

python 2_update_lakes.py --merged_lake_folder lakes_raw --filtered_folder filtered_blue_position --output_folder lakes_blue_position_2.2-3.8 | tee blue_position_2.2-3.8.LOG

python 2_update_lakes.py --merged_lake_folder lakes_raw --filtered_folder filtered_blue_position_1s --output_folder lakes_blue_position_2.2-3.8_length_min_1s | tee blue_position_2.2-3.8_length_min_1s.LOG

python 2_update_lakes.py --merged_lake_folder lakes_raw --filtered_folder filtered_blue_position_5s --output_folder lakes_blue_position_2.2-3.8_length_min_5s | tee blue_position_2.2-3.8_length_min_5s.LOG

python 2_update_lakes.py --merged_lake_folder lakes_raw --filtered_folder filtered_blue_lifetime_5s_only --output_folder lakes_blue_length_min_5s | tee blue_length_min_5s.LOG

处理逻辑：

通过kymograph名称匹配filtered_*目录对应CSV
根据CSV内容重建blue轨迹文本，写回lake JSON
分类日志输出三种情况：

Updated：找到CSV且≥1条轨迹，更新保存轨迹
CSV存在但无轨迹或读取失败，移除kymograph及关联H5链接
无匹配CSV，移除kymograph及H5链接
- 日志统计统计各case数量、CSV总数、未使用“孤儿”CSV

最终实现每个replicate拥有多组更新的lake文件，各文件中kymographs、experiments[…].dataset、file_viewer的H5链接一致对应，确保完整性和可追踪性。

此流程自动化实现kymograph轨迹质量控制与lake文件二次生成，支持多样过滤条件，保证下游数据分析准确可靠。

Protected: Top 50

Enter your password to view comments.

FAU“身体活动与健康”硕士项目：申请指南与入学要求

Leave a reply

FAU“身体活动与健康”硕士项目：申请指南与入学要求

如何申请 (How to apply)

“身体活动与健康”硕士项目（MA Programme Physical Activity and Health）只能在冬季学期开始（课程于2024年10月开课），针对2025/26冬季学期的申请将于2025年2月15日开始。申请截止日期为2025年5月31日。我们建议非欧盟公民最迟于2025年3月31日前提交申请，以便有充足时间办理签证手续。所有所需申请材料必须通过线上系统[Campo(https://www.campo.fau.de/qisserver/pages/cs/sys/portal/hisinoneStartPage.faces)提交。（请不要邮寄任何申请材料到FAU，所有文件须通过Campo平台在线上传。）

所需材料 (Required Documents)

在申请“身体活动与健康”硕士项目时，需要提交以下文件：

个人简历
动机信（1–2页），说明你申请该项目的兴趣、动机与资质
德国高校毕业生：提交所有教育阶段的毕业证书及成绩单（如成绩单、Studienbuch等）复印件。
国际高校毕业生：提交所有教育阶段的经认证的毕业证书及成绩单复印件。
若你的学位为体育教育、心理学、社会学、政治学、人类学或医学：请提交与你本项目高度相关课程的清单，并附上至少一年全职的相关领域（运动科学/康复科学/治疗科学/公共卫生）的工作经验证明。
对于母语非英语且本科/硕士授课语言不是英语者：至少需提供B2级别英语能力证明。
对于母语非德语者（如有）：至少A1级别的德语语言证书。

动机信 (Cover letter)

动机信是你申请材料的重要部分。请说明为什么想加入本项目，以及你未来的职业规划。此外，应提及你先前在身体活动、物理治疗或公共卫生等主题领域的经验。篇幅应为1至2页。

个人简历 (Curriculum vitae/Resume)

简历应简要说明你的中学和大学学习经历，列出最近就读的所有学校或大学。包括与你申请项目相关的实习、兼职或全职工作经历。同时应注明出生日期与地点、国籍及现居地点。可以使用Europass简历模板（下载模板及说明，或访问Europass主页）。

经认证的复印件（仅限国际学位申请者）Certified copies (applicants with international degrees only)

需提交中学和大学期间所有学历及成绩单的经认证复印件。这些文件仅通过电子邮件提交（不接受邮寄或传真）。所有复印件必须经过正式认证。认证文件须：

含有认证机构印章；
由认证人员签字；
明确标注认证日期；
认证机构及人员具备认证资格。通常，学校管理部门有权办理认证；如不确定，可咨询就近的德国大使馆或领事馆。

课程清单（适用于体育教育、心理学、社会学、政治学、人类学、医学等学位申请者）

项目对非运动科学、物理治疗、康复科学、健康教育等背景的学生开放。请列出与你本项目相关的所有课程，例如运动科学、体育教育、物理治疗、康复科学、老年学、公共卫生、流行病学、研究方法或统计学等。

Listing of courses/classes with high relevance to our programme (for applicants with degrees in Physical Education, Psychology, Sociology, Political Science, Anthropology, or Medicine only) The programme is open to students who do not have degrees in Sport Science, Kinesiology /Exercise Science, Physiotherapy, Rehabilitation Science, Health Education, Health Science/Public Health.

Such other degrees can be e.g. Physical Education, Psychology, Sociology, Political Science, Anthropology, or Medicine). This list should provide us with a brief summary of all classes or coursework that you have attended and that are relevant to the subject areas of physical activity and/or (public) health. Potential examples include courses/classes covering the topics of sport science, physical education, physical therapy, rehabilitation science, kinesiology, gerontology, public health, epidemiology, research methods, or statistics.

工作经验（适用于体育教育、心理学、社会学、政治学、人类学、医学等学位申请者）

具有上述专业背景的学生，需提供至少一年全职相关工作经验（运动科学、康复科学、治疗科学或公共卫生领域）证明，可由相关机构出具证明信。

Documentation of 1 year work experience in the fields of Sport Science/ Rehabilitation Science or Therapeutic Science/ Public Health (for applicants with degrees in Physical Education, Psychology, Sociology, Political Science, Anthropology, or Medicine only) The programme is open to students who do not have degrees in Sport Science, Kinesiology /Exercise Science, Physiotherapy, Rehabilitation Science, Therapeutic Science, Health Education, or Health Science/Public Health.

Such other degrees can be e.g. Physical Education, Psychology, Sociology, Political Science, Anthropology, or Medicine. Students with such degrees need to document at least 1 year of work experience (full-time) in the fields of Sport Science, Reahbilitation Science or Therapeutic Science, or Public Health in order to be eligible to apply to the programme. The documentation can be an attached letter from the institution/company.

英语语言证书（仅适用于母语非英语者）

本项目以英语授课，需要具备足够的听、说、读、写能力。若母语非英语且本科/硕士授课语言非英语，需提供语言证书证明达到我们要求的水平。最低要求为CEFR体系的B2级。详情见入学要求。

德语语言证书（仅适用于母语非德语者）

根据州级规定，所有母语非德语学生须在入学一年内至少达到A1级德语水平。若已有德语水平，请在申请材料中提供证明；若尚未具备，也可申请。大学提供免费德语课程，可在第一学年内学习。所有课程与考试均以英语进行。

申请提交地点

所有申请材料须通过Campo平台提交。（请不要邮寄任何文件至FAU，所有文件仅通过Campo上传。）

申请审核

DSS部门两位教师将依据以下标准评审申请：

运动科学、康复科学/治疗科学及公共卫生方面的先前知识；
相关领域（体育教育、心理学、社会学、政治学、人类学或医学）的知识背景；
研究方法（如统计学、质性研究）的知识；
在运动科学、康复科学/治疗科学及公共卫生领域的实践经验（如实习或工作经验）。由于通常申请数量众多，请预留至少4周等待评审结果。

补充信息

如有关于硕士项目内容或申请流程的疑问，请联系项目顾问Karim Abu-Omar。若你已通过Campo提交申请，请在联系时务必提供申请编号（application-ID），并在需要通过邮件发送的文件名称中注明该编号。]

“身体活动与健康”硕士项目的申请资格需通过以下条件证明：

具有以下学科之一的高等教育第一阶段学位（例如学士学位，或德国体系中的“Diplom”或“Staatsexamen”）：

运动科学（以健康为重点）
运动机能学/运动科学（以健康为重点）
康复科学/治疗科学
健康教育
健康科学/公共卫生

在特殊情况下，若申请人完成了以下相关领域的类似学位，也可被录取，例如体育教育、心理学、社会学、政治学、人类学或医学。申请人需提供证明，证明其已在运动科学/康复科学/治疗科学/公共卫生等领域修读了至少20个ECTS学分的课程，或在这些领域拥有至少1年的全职工作经验。

最低成绩要求：

对于采用百分制的评分体系：总成绩须达到75%或以上。
对于采用4分制GPA体系（如美国）：GPA须达到3.00或以上。
对于德国学生：成绩须达到2.5或以下。

目前仍在读本科的学生，在修完至少140个ECTS学分后即可申请。正式录取前，必须提交最终成绩单及学士学位证书；被录取的申请者若尚未提交最终文件，其录取为有条件录取。

语言要求：英语

本硕士项目的所有课程均以英语授课。所有母语非英语的申请者，须提供至少达到CEFR欧洲语言能力等级框架B2级的英语语言能力证明。

若你持有其他类型的语言证书，可参考以下证书等级对照表，以了解与CEFR B2等级约等的分数范围：语言证书对照表请注意：该对照表仅用于参考，不具法律效力。若提交的证书或成绩未标明CEFR等级，将由大学逐一评估是否符合要求。未能提供CEFR B2水平英语证明的申请人，可能需要在入学前于大学语言中心参加英语水平测试。

语言要求：德语

入学时无须提供德语能力证明，但学生须在赴FAU就读的第一学年内学习德语，至少达到A1级。建议申请者具备基本德语能力，特别是第二学年以项目研究为主的课程阶段。大学语言中心为所有语言水平的学生提供免费的德语课程。

学费

本硕士项目不收取学费。

Protected: Pachtvertrag

Enter your password to view comments.

Protected: 呼吸道合胞病毒（RSV）小分子入侵抑制剂的验证研究

Enter your password to view comments.

Top 32 list of microbiology journals with their Impact Factors from 2024, including publisher

Leave a reply

Top 32 list of microbiology journals with their Impact Factors from 2024, including publisher and other relevant information based on the latest available data from the source:

Rank	Journal Name	Impact Factor 2024	Publisher
1	Nature Reviews Microbiology	~103.3	Springer Nature
2	Nature Microbiology	~19.4	Springer Nature
3	Clinical Microbiology Reviews	~19.3	American Society for Microbiology (ASM)
4	Cell Host \& Microbe	~19.2	Cell Press
5	Annual Review of Microbiology	~12.5	Annual Reviews
6	Trends in Microbiology	~11.0	Cell Press
7	Gut Microbes	~12.0	Taylor \& Francis
8	Microbiome	~11.1	Springer Nature
9	Clinical Infectious Diseases	~9.1	Oxford University Press
10	Journal of Clinical Microbiology*	~6.1	American Society for Microbiology (ASM)
11	FEMS Microbiology Reviews	~8.9	Oxford University Press
12	The ISME Journal	~9.5	Springer Nature
13	Environmental Microbiology	~8.2	Wiley
14	Microbes and Infection	~7.5	Elsevier
15	Journal of Medical Microbiology	~4.4	Microbiology Society
16	Frontiers in Microbiology	~6.4	Frontiers Media
17	MicrobiologyOpen	~3.6	Wiley
18	Microbial Ecology	~4.9	Springer Nature
19	Journal of Bacteriology	~4.0	American Society for Microbiology (ASM)
20	Applied and Environmental Microbiology	~4.5	American Society for Microbiology (ASM)
21	Pathogens and Disease	~3.3	Oxford University Press
22	Microbial Biotechnology	~7.3	Wiley
23	Antonie van Leeuwenhoek	~3.8	Springer Nature
24	Journal of Antimicrobial Chemotherapy	~5.2	Oxford University Press
25	Virulence	~5.4	Taylor \& Francis
26	mBio	~6.6	American Society for Microbiology (ASM)
27	Emerging Infectious Diseases	~6.3	CDC
28	Microbial Cell Factories	~6.0	Springer Nature
29	Microbial Pathogenesis	~4.4	Elsevier
30	Journal of Virology	~5.8	American Society for Microbiology (ASM)
31	Microbiology Spectrum	~4.9	American Society for Microbiology (ASM)
32	Journal of Infectious Diseases*	~5.9	Oxford University Press

Use.ai vs Perplexity.ai

Leave a reply

Use.ai和Perplexity.ai两个网站都支持调用多个先进的AI模型以满足不同用户需求，但在模型种类和实力上存在差异。

Use.ai集成了多达10个知名模型，包括GroK4、Deepinfra Kimi K2、Llama 3.3、Qwen 3 Max、Google Gemini、Deepseek、Claude Opus 4.1、OpenAI GPT-5、GPT-4o和GPT-4o Mini。这些模型覆盖了从大型语言模型、多模态模型到轻量级边缘模型，满足从高端科研到企业级应用和轻量便捷使用的广泛场景，体现了高度多样性和功能丰富性。

而Perplexity.ai主要以OpenAI的GPT系列模型为基础，支持GPT-4、GPT-3.5等主流大语言模型，同时融合了实时网络搜索和信息检索功能，增强了回答的实时性和准确性。虽然模型数量较少，但其优势在于结合强大的搜索引擎技术，能够提供带有权威引用的智能问答，提升信息可信度。

综合比较，Use.ai在可调用模型数量和模型多样性上占优，更适合需要多模型灵活运用的复杂任务场景；而Perplexity.ai则在信息实时性和权威性方面表现突出，适合对搜索结果准确性有较高要求的用户。

结合这两个平台各自优势，用户可根据自身需求选择：若重视多模型丰富性和多场景支持，推荐Use.ai；若注重即时、准确、有来源保障的答案检索，Perplexity.ai是更优选择。

以上内容结合了两平台的模型资源和功能特点，帮助用户在AI应用中做出更明智的选择Use.ai和Perplexity.ai两平台均助力提升智能问答和信息获取体验，满足未来多样化的人工智能需求。

Workflow using PICRUSt2 for Data_Karoline_16S_2025 (v2)

Leave a reply

Environment Setup: It sets up a Conda environment named picrust2, using the conda create command and then activates this environment using conda activate picrust2.

#https://github.com/picrust/picrust2/wiki/PICRUSt2-Tutorial-(v2.2.0-beta)#minimum-requirements-to-run-full-tutorial
mamba create -n picrust2 -c bioconda -c conda-forge picrust2    #2.5.3  #=2.2.0_b
mamba activate /home/jhuang/miniconda3/envs/picrust2

Under docker-env (qiime2-amplicon-2023.9)

Export QIIME2 feature table and representative sequences

#docker pull quay.io/qiime2/core:2023.9
#docker run -it --rm \
#-v /mnt/md1/DATA/Data_Karoline_16S_2025:/data \
#-v /home/jhuang/REFs:/home/jhuang/REFs \
#quay.io/qiime2/core:2023.9 bash
#cd /data
# === SETTINGS ===
FEATURE_TABLE_QZA="dada2_tests2/test_7_f240_r240/table.qza"
REP_SEQS_QZA="dada2_tests2/test_7_f240_r240/rep-seqs.qza"

# === STEP 1: EXPORT QIIME2 ARTIFACTS ===
mkdir -p qiime2_export
qiime tools export --input-path $FEATURE_TABLE_QZA --output-path qiime2_export
qiime tools export --input-path $REP_SEQS_QZA --output-path qiime2_export

Convert BIOM to TSV for Picrust2 input

biom convert \
-i qiime2_export/feature-table.biom \
-o qiime2_export/feature-table.tsv \
--to-tsv

Under env (picrust2): mamba activate /home/jhuang/miniconda3/envs/picrust2

Run PICRUSt2 pipeline

tail -n +2 qiime2_export/feature-table.tsv > qiime2_export/feature-table-fixed.tsv
picrust2_pipeline.py \
-s qiime2_export/dna-sequences.fasta \
-i qiime2_export/feature-table-fixed.tsv \
-o picrust2_out \
-p 100

#This will:
#* Place sequences in the reference tree (using EPA-NG),
#* Predict gene family abundances (e.g., EC, KO, PFAM, TIGRFAM),
#* Predict pathway abundances.

#In current PICRUSt2 (with picrust2_pipeline.py), you do not run hsp.py separately.
#Instead, picrust2_pipeline.py internally runs the HSP step for all functional categories automatically. It outputs all the prediction files (16S_predicted_and_nsti.tsv.gz, COG_predicted.tsv.gz, PFAM_predicted.tsv.gz, KO_predicted.tsv.gz, EC_predicted.tsv.gz, TIGRFAM_predicted.tsv.gz, PHENO_predicted.tsv.gz) in the output directory.

mkdir picrust2_out_advanced; cd picrust2_out_advanced
#If you still want to run hsp.py manually (advanced use / debugging), the commands correspond directly:
hsp.py -i 16S -t ../picrust2_out/out.tre -o 16S_predicted_and_nsti.tsv.gz -p 100 -n
hsp.py -i COG -t ../picrust2_out/out.tre -o COG_predicted.tsv.gz -p 100
hsp.py -i PFAM -t ../picrust2_out/out.tre -o PFAM_predicted.tsv.gz -p 100
hsp.py -i KO -t ../picrust2_out/out.tre -o KO_predicted.tsv.gz -p 100
hsp.py -i EC -t ../picrust2_out/out.tre -o EC_predicted.tsv.gz -p 100
hsp.py -i TIGRFAM -t ../picrust2_out/out.tre -o TIGRFAM_predicted.tsv.gz -p 100
hsp.py -i PHENO -t ../picrust2_out/out.tre -o PHENO_predicted.tsv.gz -p 100

Metagenome prediction per functional category (if needed separately)

#cd picrust2_out_advanced
metagenome_pipeline.py -i ../qiime2_export/feature-table.biom -m 16S_predicted_and_nsti.tsv.gz -f COG_predicted.tsv.gz -o COG_metagenome_out --strat_out
metagenome_pipeline.py -i ../qiime2_export/feature-table.biom -m 16S_predicted_and_nsti.tsv.gz -f EC_predicted.tsv.gz -o EC_metagenome_out --strat_out
metagenome_pipeline.py -i ../qiime2_export/feature-table.biom -m 16S_predicted_and_nsti.tsv.gz -f KO_predicted.tsv.gz -o KO_metagenome_out --strat_out
metagenome_pipeline.py -i ../qiime2_export/feature-table.biom -m 16S_predicted_and_nsti.tsv.gz -f PFAM_predicted.tsv.gz -o PFAM_metagenome_out --strat_out
metagenome_pipeline.py -i ../qiime2_export/feature-table.biom -m 16S_predicted_and_nsti.tsv.gz -f TIGRFAM_predicted.tsv.gz -o TIGRFAM_metagenome_out --strat_out

# Add descriptions in gene family tables
add_descriptions.py -i COG_metagenome_out/pred_metagenome_unstrat.tsv.gz -m COG -o COG_metagenome_out/pred_metagenome_unstrat_descrip.tsv.gz
add_descriptions.py -i EC_metagenome_out/pred_metagenome_unstrat.tsv.gz -m EC -o EC_metagenome_out/pred_metagenome_unstrat_descrip.tsv.gz
add_descriptions.py -i KO_metagenome_out/pred_metagenome_unstrat.tsv.gz -m KO -o KO_metagenome_out/pred_metagenome_unstrat_descrip.tsv.gz   # EC and METACYC is a pair, EC for gene_annotation and METACYC for pathway_annotation
add_descriptions.py -i PFAM_metagenome_out/pred_metagenome_unstrat.tsv.gz -m PFAM -o PFAM_metagenome_out/pred_metagenome_unstrat_descrip.tsv.gz
add_descriptions.py -i TIGRFAM_metagenome_out/pred_metagenome_unstrat.tsv.gz -m TIGRFAM -o TIGRFAM_metagenome_out/pred_metagenome_unstrat_descrip.tsv.gz

Pathway inference (MetaCyc pathways from EC numbers)

#cd picrust2_out_advanced
pathway_pipeline.py -i EC_metagenome_out/pred_metagenome_contrib.tsv.gz -o EC_pathways_out -p 100
pathway_pipeline.py -i EC_metagenome_out/pred_metagenome_unstrat.tsv.gz -o EC_pathways_out_per_seq -p 100 --per_sequence_contrib --per_sequence_abun EC_metagenome_out/seqtab_norm.tsv.gz --per_sequence_function EC_predicted.tsv.gz
#ERROR due to missing .../pathway_mapfiles/KEGG_pathways_to_KO.tsv
pathway_pipeline.py -i COG_metagenome_out/pred_metagenome_contrib.tsv.gz -o KEGG_pathways_out -p 100 --no_regroup --map /home/jhuang/anaconda3/envs/picrust2/lib/python3.6/site-packages/picrust2/default_files/pathway_mapfiles/KEGG_pathways_to_KO.tsv
pathway_pipeline.py -i KO_metagenome_out/pred_metagenome_strat.tsv.gz -o KEGG_pathways_out -p 100 --no_regroup --map /home/jhuang/anaconda3/envs/picrust2/lib/python3.6/site-packages/picrust2/default_files/pathway_mapfiles/KEGG_pathways_to_KO.tsv

add_descriptions.py -i EC_pathways_out/path_abun_unstrat.tsv.gz -m METACYC -o EC_pathways_out/path_abun_unstrat_descrip.tsv.gz
gunzip EC_pathways_out/path_abun_unstrat_descrip.tsv.gz

#Error - no rows remain after regrouping input table. The default pathway and regroup mapfiles are meant for EC numbers. Note that KEGG pathways are not supported since KEGG is a closed-source database, but you can input custom pathway mapfiles if you have access. If you are using a custom function database did you mean to set the --no-regroup flag and/or change the default pathways mapfile used?
#If ERROR --> USE the METACYC for downstream analyses!!!

#ERROR due to missing .../description_mapfiles/KEGG_pathways_info.tsv.gz
#add_descriptions.py -i KO_pathways_out/path_abun_unstrat.tsv.gz -o KEGG_pathways_out/path_abun_unstrat_descrip.tsv.gz --custom_map_table /home/jhuang/anaconda3/envs/picrust2/lib/python3.6/site-packages/picrust2/default_files/description_mapfiles/KEGG_pathways_info.tsv.gz

#NOTE: Target-analysis for the pathway "mixed acid fermentation"

Visualization

#7.1 STAMP
#https://github.com/picrust/picrust2/wiki/STAMP-example
#Note that STAMP can only be opened under Windows

# It needs two files: path_abun_unstrat_descrip.tsv.gz as "Profile file" and metadata.tsv as "Group metadata file".
cp ~/DATA/Data_Karoline_16S_2025/picrust2_out_advanced/EC_pathways_out/path_abun_unstrat_descrip.tsv ~/DATA/Access_to_Win10/

cut -d$'\t' -f1 qiime2_metadata.tsv > 1
cut -d$'\t' -f3 qiime2_metadata.tsv > 3
cut -d$'\t' -f5-6 qiime2_metadata.tsv > 5_6
paste -d$'\t' 1 3 > 1_3
paste -d$'\t' 1_3 5_6 > metadata.tsv
#SampleID --> SampleID
SampleID        Group   pre_post        Sex_age
sample-A1       Group1  3d.post.stroke  male.aged
sample-A2       Group1  3d.post.stroke  male.aged
sample-A3       Group1  3d.post.stroke  male.aged
cp ~/DATA/Data_Karoline_16S_2025/metadata.tsv ~/DATA/Access_to_Win10/
# MANULLY_EDITING: keeping the only needed records in metadata.tsv: Group 9 (J1–J4, J6, J7, J10, J11) and Group 10 (K1–K6).

#7.2. ALDEx2
https://bioconductor.org/packages/release/bioc/html/ALDEx2.html

Under docker-env (qiime2-amplicon-2023.9)

(NOT_NEEDED) Convert pathway output to BIOM and re-import to QIIME2 gunzip picrust2_out/pathways_out/path_abun_unstrat.tsv.gz biom convert \ -i picrust2_out/pathways_out/path_abun_unstrat.tsv \ -o picrust2_out/path_abun_unstrat.biom \ –table-type=”Pathway table” \ –to-hdf5

qiime tools import \
--input-path picrust2_out/path_abun_unstrat.biom \
--type 'FeatureTable[Frequency]' \
--input-format BIOMV210Format \
--output-path path_abun.qza

#qiime tools export --input-path path_abun.qza --output-path exported_path_abun
#qiime tools peek path_abun.qza
echo "✅ PICRUSt2 pipeline complete. Output in: picrust2_out"

Short answer: unless you had a very clear, pre-specified directional hypothesis, you should use a two-sided test.

A bit more detail:

* Two-sided t-test

        * Tests: “Are the means different?” (could be higher or lower).
        * Standard default in most biological and clinical studies and usually what reviewers expect.
        * More conservative than a one-sided test.

* One-sided t-test

        * Tests: “Is Group A greater than Group B?” (or strictly less than).
        * You should only use it if before looking at the data you had a strong reason to expect a specific direction and you would ignore/consider uninterpretable a difference in the opposite direction.
        * Using one-sided just to gain significance is considered bad practice.

For your pathway analysis (exploratory, many pathways, q-value correction), the safest and most defensible choice is to:

* Use a two-sided t-test (equal variance or Welch’s, depending on variance assumptions).

So I’d recommend rerunning STAMP with Type: Two-sided and reporting those results.

#--> Using a two-sided Welch's t-test in STAMP, that is the unequal-variance version (does not assume equal variances and is more conservative than “t-test (equal variance)” referring to the classical unpaired Student’s t-test.

Statistics in STAMP

* For multiple groups:
    * Statistical test: ANOVA, Kruskal-Wallis H-test
    * Post-hoc test: Games-Howell, Scheffe, Tukey-Kramer, Welch's (uncorrected) (by default 0.95)
    * Effect size: Eta-squared
    * Multiple test correction: Benjamini-Hochberg FDR, Bonferroni, No correction
* For two groups
    * Statistical test: t-test (equal variance), Welch's t-test, White's non-parametric t-test
    * Type: One-sided, Two-sided
    * CI method: "DP: Welch's inverted" (by default 0.95)
    * Multiple test correction: Benjamini-Hochberg FDR, Bonferroni, No correction, Sidak, Storey FDR
* For two samples
    * Statistical test: Bootstrap, Chi-square test, Chi-square test (w/Yates'), Difference between proportions, Fisher's exact test, G-test, G-test (w/Yates'), G-test (w/Yates') + Fisher's, Hypergeometric, Permutation
    * Type: One-sided, Two-sided
    * CI method: "DP: Asymptotic", "DP: Asymptotic-CC", "DP: Newcomber-Wilson", "DR: Haldane adjustment", "RP: Asymptotic" (by default 0.95)
    * Multiple test correction: Benjamini-Hochberg FDR, Bonferroni, No correction, Sidak, Storey FDR

Reporting

Please find attached the results of the pathway analysis. The Excel file contains the full statistics for all pathways; those with adjusted p-values (Benjamini–Hochberg) ≤ 0.05 are highlighted in yellow and are the ones illustrated in the figure.

The analysis was performed using Welch’s t-test (two-sided) with Benjamini–Hochberg correction for multiple testing.

Automated β-Lactamase Gene Detection with NCBI AMRFinderPlus processing Data_Patricia_AMRFinderPlus_2025

Leave a reply

1. Installation and Database Setup

To install and prepare NCBI AMRFinderPlus in the bacto environment:

mamba activate bacto
mamba install ncbi-amrfinderplus
mamba update ncbi-amrfinderplus

mamba activate bacto
amrfinder -u

This will:
- Download and install the latest AMRFinderPlus version and its database.
- Create /home/jhuang/mambaforge/envs/bacto/share/amrfinderplus/data/.
- Symlink the latest database version for use.

Check available organism options for annotation:

amrfinder --list_organisms

Supported values include species such as Escherichia, Klebsiella_pneumoniae, Enterobacter_cloacae, Pseudomonas_aeruginosa and many others.

2. Batch Analysis: Bash Script for Genome Screening

Use the following script to screen multiple genomes using AMRFinderPlus and output only β-lactam/beta-lactamase hits from a metadata table.

Input: genome_metadata.tsv — tab-separated columns: filename_TAB_organism, with header.

Run:

cd ~/DATA/Data_Patricia_AMRFinderPlus_2025/genomes
./run_amrfinder_beta_lactam.sh genome_metadata.tsv

Script logic:

Validates metadata input and AMRFinder installation.
Loops through each genome in the metadata table:
- Maps text organism names to proper AMRFinder --organism codes when possible (“Escherichia coli” → --organism Escherichia).
- Executes AMRFinderPlus, saving output for each isolate.
- Collects all individual output tables.
After annotation, Python code merges all results, filters for β-lactam/beta-lactamase genes, and creates summary tables.

Script:

#!/usr/bin/env bash
set -euo pipefail

META_FILE="${1:-}"

if [[ -z "$META_FILE" || ! -f "$META_FILE" ]]; then
  echo "Usage: $0 genome_metadata.tsv" >&2
  exit 1
fi

OUTDIR="amrfinder_results"
mkdir -p "$OUTDIR"

echo ">>> Checking AMRFinder installation..."
amrfinder -V || { echo "ERROR: amrfinder not working"; exit 1; }
echo

echo ">>> Running AMRFinderPlus on all genomes listed in $META_FILE"

# --- loop over metadata file ---
# expected columns: filename
<TAB>organism
tail -n +2 "$META_FILE" | while IFS=$'\t' read -r fasta organism; do
  # skip empty lines
  [[ -z "$fasta" ]] && continue

  if [[ ! -f "$fasta" ]]; then
    echo "WARN: FASTA file '$fasta' not found, skipping."
    continue
  fi

  isolate_id="${fasta%.fasta}"

  # map free-text organism to AMRFinder --organism names (optional)
  org_opt=""
  case "$organism" in
    "Escherichia coli")        org_opt="--organism Escherichia" ;;
    "Klebsiella pneumoniae")   org_opt="--organism Klebsiella_pneumoniae" ;;
    "Enterobacter cloacae complex") org_opt="--organism Enterobacter_cloacae" ;;
    "Citrobacter freundii")    org_opt="--organism Citrobacter_freundii" ;;
    "Citrobacter braakii")    org_opt="--organism Citrobacter_freundii" ;;
    "Pseudomonas aeruginosa")  org_opt="--organism Pseudomonas_aeruginosa" ;;
    # others (Providencia stuartii, Klebsiella aerogenes)
    # currently have no organism-specific rules in AMRFinder, so we omit --organism
    *)                         org_opt="" ;;
  esac

  out_tsv="${OUTDIR}/${isolate_id}.amrfinder.tsv"

  echo "  - ${fasta} (${organism}) -> ${out_tsv} ${org_opt}"
  amrfinder -n "$fasta" -o "$out_tsv" --plus $org_opt
done

echo ">>> AMRFinderPlus runs finished. Filtering β-lactam hits..."

python3 - "$OUTDIR" << 'EOF'
import sys, os, glob

outdir = sys.argv[1]
files = sorted(glob.glob(os.path.join(outdir, "*.amrfinder.tsv")))
if not files:
    print("ERROR: No AMRFinder output files found in", outdir)
    sys.exit(1)

try:
    import pandas as pd
    use_pandas = True
except ImportError:
    use_pandas = False

def read_one(path):
    import pandas as _pd
    # AMRFinder TSV is tab-separated with a header line
    df = _pd.read_csv(path, sep='\t', dtype=str)
    df.columns = [c.strip() for c in df.columns]
    # add isolate_id from filename
    isolate_id = os.path.basename(path).replace(".amrfinder.tsv", "")
    df["isolate_id"] = isolate_id
    return df

if not use_pandas:
    print("WARNING: pandas not installed; only raw TSV merging will be done.")
    # very minimal merging: just concatenate files
    with open("beta_lactam_all.tsv", "w") as out:
        first = True
        for f in files:
            with open(f) as fh:
                header = fh.readline()
                if first:
                    out.write(header.strip() + "\tisolate_id\n")
                    first = False
                for line in fh:
                    if not line.strip():
                        continue
                    iso = os.path.basename(f).replace(".amrfinder.tsv", "")
                    out.write(line.rstrip("\n") + "\t" + iso + "\n")
    sys.exit(0)

# --- full pandas-based processing ---
dfs = [read_one(f) for f in files]
df = pd.concat(dfs, ignore_index=True)

# normalize column names (lowercase, no spaces) for internal use
norm_cols = {c: c.strip().lower().replace(" ", "_") for c in df.columns}
df.rename(columns=norm_cols, inplace=True)

# try to locate key columns with flexible names
def pick(*candidates):
    for c in candidates:
        if c in df.columns:
            return c
    return None

col_gene   = pick("gene_symbol", "genesymbol")
col_seq    = pick("sequence_name", "sequencename")
col_class  = pick("class")
col_subcls = pick("subclass")
col_ident  = pick("%identity_to_reference_sequence", "identity")
col_cov    = pick("%coverage_of_reference_sequence", "coverage_of_reference_sequence")
col_iso    = "isolate_id"

missing = [c for c in [col_gene, col_seq, col_class, col_subcls, col_iso] if c is None]
if missing:
    print("ERROR: Some required columns are missing in AMRFinder output:", missing)
    sys.exit(1)

# β-lactam filter: class==AMR and subclass contains "beta-lactam"
mask = (df[col_class].str.contains("AMR", case=False, na=False) &
        df[col_subcls].str.contains("beta-lactam", case=False, na=False))
df_beta = df.loc[mask].copy()

if df_beta.empty:
    print("WARNING: No β-lactam hits found.")
else:
    print(f"Found {len(df_beta)} β-lactam / β-lactamase hits.")

# write full β-lactam table
beta_all_tsv = "beta_lactam_all.tsv"
df_beta.to_csv(beta_all_tsv, sep='\t', index=False)
print(f">>> β-lactam / β-lactamase hits written to: {beta_all_tsv}")

# -------- summary by gene (with list of isolates) --------
group_cols = [col_gene, col_seq, col_subcls]

def join_isolates(vals):
    # unique, sorted isolates as comma-separated string
    uniq = sorted(set(vals))
    return ",".join(uniq)

summary_gene = (
    df_beta
    .groupby(group_cols, dropna=False)
    .agg(
        n_isolates=(col_iso, "nunique"),
        isolates=(col_iso, join_isolates),
        n_hits=("isolate_id", "size")
    )
    .reset_index()
)

# nicer column names for output
summary_gene.rename(columns={
    col_gene: "Gene_symbol",
    col_seq: "Sequence_name",
    col_subcls: "Subclass"
}, inplace=True)

sum_gene_tsv = "beta_lactam_summary_by_gene.tsv"
summary_gene.to_csv(sum_gene_tsv, sep='\t', index=False)
print(f">>> Gene-level summary written to: {sum_gene_tsv}")
print("    (now includes 'isolates' = comma-separated isolate_ids)")

# -------- summary by isolate & gene (with annotation) --------
agg_dict = {
    col_gene: ("Gene_symbol", "first"),
    col_seq: ("Sequence_name", "first"),
    col_subcls: ("Subclass", "first"),
}
if col_ident:
    agg_dict[col_ident] = ("%identity_min", "min")
    agg_dict[col_ident + "_max"] = ("%identity_max", "max")
if col_cov:
    agg_dict[col_cov] = ("%coverage_min", "min")
    agg_dict[col_cov + "_max"] = ("%coverage_max", "max")

# build aggregation manually (because we want nice column names)
gb = df_beta.groupby([col_iso, col_gene], dropna=False)
rows = []
for (iso, gene), sub in gb:
    row = {
        "isolate_id": iso,
        "Gene_symbol": sub[col_gene].iloc[0],
        "Sequence_name": sub[col_seq].iloc[0],
        "Subclass": sub[col_subcls].iloc[0],
        "n_hits": len(sub)
    }
    if col_ident:
        vals = pd.to_numeric(sub[col_ident], errors="coerce")
        row["%identity_min"] = vals.min()
        row["%identity_max"] = vals.max()
    if col_cov:
        vals = pd.to_numeric(sub[col_cov], errors="coerce")
        row["%coverage_min"] = vals.min()
        row["%coverage_max"] = vals.max()
    rows.append(row)

summary_iso_gene = pd.DataFrame(rows)

sum_iso_gene_tsv = "beta_lactam_summary_by_isolate_gene.tsv"
summary_iso_gene.to_csv(sum_iso_gene_tsv, sep='\t', index=False)
print(f">>> Isolate × gene summary written to: {sum_iso_gene_tsv}")
print("    (now includes 'Gene_symbol' and 'Sequence_name' annotation columns)")

# -------- optional Excel exports --------
try:
    with pd.ExcelWriter("beta_lactam_all.xlsx") as xw:
        df_beta.to_excel(xw, sheet_name="beta_lactam_all", index=False)
    with pd.ExcelWriter("beta_lactam_summary.xlsx") as xw:
        summary_gene.to_excel(xw, sheet_name="by_gene", index=False)
        summary_iso_gene.to_excel(xw, sheet_name="by_isolate_gene", index=False)
    print(">>> Excel workbooks written: beta_lactam_all.xlsx, beta_lactam_summary.xlsx")
except Exception as e:
    print("WARNING: could not write Excel files:", e)

EOF

echo ">>> All done."
echo "   - Individual reports: ${OUTDIR}/*.amrfinder.tsv"
echo "   - Merged β-lactam table: beta_lactam_all.tsv"
echo "   - Gene summary: beta_lactam_summary_by_gene.tsv (with isolate list)"
echo "   - Isolate × gene summary: beta_lactam_summary_by_isolate_gene.tsv (with annotation)"
echo "   - Excel (if pandas + openpyxl installed): beta_lactam_all.xlsx, beta_lactam_summary.xlsx"

3. Reporting and File Outputs

Files Generated:

beta_lactam_all.tsv: All β-lactam/beta-lactamase hits across genomes.
beta_lactam_summary_by_gene.tsv: Per-gene summary, including a column with all isolate IDs.
beta_lactam_summary_by_isolate_gene.tsv: Gene and isolate summary; includes “Gene symbol”, “Sequence name”, “Subclass”, annotation, and min/max identity/coverage.
If pandas is installed: beta_lactam_all.xlsx, beta_lactam_summary.xlsx.

Description of improvements:

Gene-level summary now lists isolates carrying each β-lactamase gene.
Isolate × gene summary includes full annotation and quantitative metrics: gene symbol, sequence name, subclass, plus minimum and maximum percent identity/coverage.

Hand these files directly to collaborators:

beta_lactam_summary_by_isolate_gene.tsv or beta_lactam_summary.xlsx have all necessary gene and annotation information in final form.

4. Excel Export (if pandas is installed after the fact)

If the bacto environment lacks pandas, simply perform Excel conversion outside it:

mamba deactivate

python3 - << 'PYCODE'
import pandas as pd
df = pd.read_csv("beta_lactam_all.tsv", sep="\t")
df.to_excel("beta_lactam_all.xlsx", index=False)
print("Saved: beta_lactam_all.xlsx")
PYCODE

#Replace "," with ", " in beta_lactam_summary_by_gene.tsv so that the number can be correctly formatted; save it to a Excel-format.
mv beta_lactam_all.xlsx AMR_summary.xlsx  # Then delete the first empty column.
mv beta_lactam_summary.xlsx BETA-LACTAM_summary.xlsx
mv beta_lactam_summary_by_gene.xlsx BETA-LACTAM_summary_by_gene.xlsx

Summary and Notes

The system is fully automated from installation to reporting.
All command lines are modular and suitable for direct inclusion in bioinformatics SOPs.
Output files have expanded annotation and isolate information for downstream analytics and sharing.
This approach ensures traceability, transparency, and rapid communication of β-lactamase annotation results for large datasets.

中文版本

基于NCBI AMRFinderPlus的自动化β-内酰胺酶注释流程

1. 安装与数据库设置

在bacto环境下安装AMRFinderPlus，并确保数据库更新：

mamba activate bacto
mamba install ncbi-amrfinderplus
mamba update ncbi-amrfinderplus
mamba activate bacto
amrfinder -u

这将自动下载最新数据库，并确保环境目录正确建立与软链接。

查询支持的物种选项：

amrfinder --list_organisms

2. 批量分析与脚本调用

使用如下脚本，高效批量筛查基因组β-内酰胺酶基因，并生成结果与汇总文件。输入表格式：filename_TAB_organism，首行为表头。

cd ~/DATA/Data_Patricia_AMRFinderPlus_2025/genomes
./run_amrfinder_beta_lactam.sh genome_metadata.tsv

脚本逻辑简明，自动映射物种名、循环注释所有基因组、收集所有结果，之后调用Python脚本合并并筛选β-内酰胺酶基因。

3. 结果文件说明

beta_lactam_all.tsv：所有β-内酰胺酶相关基因注释全表
beta_lactam_summary_by_gene.tsv：按基因注释，含所有分离株列表
beta_lactam_summary_by_isolate_gene.tsv：分离株×基因详细表，包含注释、同源信息等
若安装pandas：另有Excel版beta_lactam_all.xlsx与beta_lactam_summary.xlsx

改进之处：

汇总表显式展示每个基因对应的分离株ID
分离株×基因表包含完整功能注释，identity/coverage等量化指标

直接将上述TSV或Excel表交给协作方即可，无需额外整理。

4. 补充Excel导出

如环境未装pandas，可以离线导出Excel：

mamba deactivate

python3 - << 'PYCODE'
import pandas as pd
df = pd.read_csv("beta_lactam_all.tsv", sep="\t")
df.to_excel("beta_lactam_all.xlsx", index=False)
print("Saved: beta_lactam_all.xlsx")
PYCODE

总结

步骤明确，可拓展与自动化。
输出表格式完善，满足批量汇报与协作需要。
所有命令与脚本可直接嵌入项目标准操作流程，支持可追溯和数据复用。

]