Automated β-Lactamase Gene Detection with NCBI AMRFinderPlus processing Data_Patricia_AMRFinderPlus_2025

1. Installation and Database Setup

To install and prepare NCBI AMRFinderPlus in the bacto environment:

mamba activate bacto
mamba install ncbi-amrfinderplus
mamba update ncbi-amrfinderplus

mamba activate bacto
amrfinder -u
  • This will:
    • Download and install the latest AMRFinderPlus version and its database.
    • Create /home/jhuang/mambaforge/envs/bacto/share/amrfinderplus/data/.
    • Symlink the latest database version for use.

Check available organism options for annotation:

amrfinder --list_organisms
  • Supported values include species such as Escherichia, Klebsiella_pneumoniae, Enterobacter_cloacae, Pseudomonas_aeruginosa and many others.

2. Batch Analysis: Bash Script for Genome Screening

Use the following script to screen multiple genomes using AMRFinderPlus and output only β-lactam/beta-lactamase hits from a metadata table.

Input: genome_metadata.tsv — tab-separated columns: filename_TAB_organism, with header.

Run:

cd ~/DATA/Data_Patricia_AMRFinderPlus_2025/genomes
./run_amrfinder_beta_lactam.sh genome_metadata.tsv

Script logic:

  • Validates metadata input and AMRFinder installation.
  • Loops through each genome in the metadata table:
    • Maps text organism names to proper AMRFinder --organism codes when possible (“Escherichia coli” → --organism Escherichia).
    • Executes AMRFinderPlus, saving output for each isolate.
    • Collects all individual output tables.
  • After annotation, Python code merges all results, filters for β-lactam/beta-lactamase genes, and creates summary tables.

Script:

#!/usr/bin/env bash
set -euo pipefail

META_FILE="${1:-}"

if [[ -z "$META_FILE" || ! -f "$META_FILE" ]]; then
  echo "Usage: $0 genome_metadata.tsv" >&2
  exit 1
fi

OUTDIR="amrfinder_results"
mkdir -p "$OUTDIR"

echo ">>> Checking AMRFinder installation..."
amrfinder -V || { echo "ERROR: amrfinder not working"; exit 1; }
echo

echo ">>> Running AMRFinderPlus on all genomes listed in $META_FILE"

# --- loop over metadata file ---
# expected columns: filename
<TAB>organism
tail -n +2 "$META_FILE" | while IFS=$'\t' read -r fasta organism; do
  # skip empty lines
  [[ -z "$fasta" ]] && continue

  if [[ ! -f "$fasta" ]]; then
    echo "WARN: FASTA file '$fasta' not found, skipping."
    continue
  fi

  isolate_id="${fasta%.fasta}"

  # map free-text organism to AMRFinder --organism names (optional)
  org_opt=""
  case "$organism" in
    "Escherichia coli")        org_opt="--organism Escherichia" ;;
    "Klebsiella pneumoniae")   org_opt="--organism Klebsiella_pneumoniae" ;;
    "Enterobacter cloacae complex") org_opt="--organism Enterobacter_cloacae" ;;
    "Citrobacter freundii")    org_opt="--organism Citrobacter_freundii" ;;
    "Citrobacter braakii")    org_opt="--organism Citrobacter_freundii" ;;
    "Pseudomonas aeruginosa")  org_opt="--organism Pseudomonas_aeruginosa" ;;
    # others (Providencia stuartii, Klebsiella aerogenes)
    # currently have no organism-specific rules in AMRFinder, so we omit --organism
    *)                         org_opt="" ;;
  esac

  out_tsv="${OUTDIR}/${isolate_id}.amrfinder.tsv"

  echo "  - ${fasta} (${organism}) -> ${out_tsv} ${org_opt}"
  amrfinder -n "$fasta" -o "$out_tsv" --plus $org_opt
done

echo ">>> AMRFinderPlus runs finished. Filtering β-lactam hits..."

python3 - "$OUTDIR" << 'EOF'
import sys, os, glob

outdir = sys.argv[1]
files = sorted(glob.glob(os.path.join(outdir, "*.amrfinder.tsv")))
if not files:
    print("ERROR: No AMRFinder output files found in", outdir)
    sys.exit(1)

try:
    import pandas as pd
    use_pandas = True
except ImportError:
    use_pandas = False

def read_one(path):
    import pandas as _pd
    # AMRFinder TSV is tab-separated with a header line
    df = _pd.read_csv(path, sep='\t', dtype=str)
    df.columns = [c.strip() for c in df.columns]
    # add isolate_id from filename
    isolate_id = os.path.basename(path).replace(".amrfinder.tsv", "")
    df["isolate_id"] = isolate_id
    return df

if not use_pandas:
    print("WARNING: pandas not installed; only raw TSV merging will be done.")
    # very minimal merging: just concatenate files
    with open("beta_lactam_all.tsv", "w") as out:
        first = True
        for f in files:
            with open(f) as fh:
                header = fh.readline()
                if first:
                    out.write(header.strip() + "\tisolate_id\n")
                    first = False
                for line in fh:
                    if not line.strip():
                        continue
                    iso = os.path.basename(f).replace(".amrfinder.tsv", "")
                    out.write(line.rstrip("\n") + "\t" + iso + "\n")
    sys.exit(0)

# --- full pandas-based processing ---
dfs = [read_one(f) for f in files]
df = pd.concat(dfs, ignore_index=True)

# normalize column names (lowercase, no spaces) for internal use
norm_cols = {c: c.strip().lower().replace(" ", "_") for c in df.columns}
df.rename(columns=norm_cols, inplace=True)

# try to locate key columns with flexible names
def pick(*candidates):
    for c in candidates:
        if c in df.columns:
            return c
    return None

col_gene   = pick("gene_symbol", "genesymbol")
col_seq    = pick("sequence_name", "sequencename")
col_class  = pick("class")
col_subcls = pick("subclass")
col_ident  = pick("%identity_to_reference_sequence", "identity")
col_cov    = pick("%coverage_of_reference_sequence", "coverage_of_reference_sequence")
col_iso    = "isolate_id"

missing = [c for c in [col_gene, col_seq, col_class, col_subcls, col_iso] if c is None]
if missing:
    print("ERROR: Some required columns are missing in AMRFinder output:", missing)
    sys.exit(1)

# β-lactam filter: class==AMR and subclass contains "beta-lactam"
mask = (df[col_class].str.contains("AMR", case=False, na=False) &
        df[col_subcls].str.contains("beta-lactam", case=False, na=False))
df_beta = df.loc[mask].copy()

if df_beta.empty:
    print("WARNING: No β-lactam hits found.")
else:
    print(f"Found {len(df_beta)} β-lactam / β-lactamase hits.")

# write full β-lactam table
beta_all_tsv = "beta_lactam_all.tsv"
df_beta.to_csv(beta_all_tsv, sep='\t', index=False)
print(f">>> β-lactam / β-lactamase hits written to: {beta_all_tsv}")

# -------- summary by gene (with list of isolates) --------
group_cols = [col_gene, col_seq, col_subcls]

def join_isolates(vals):
    # unique, sorted isolates as comma-separated string
    uniq = sorted(set(vals))
    return ",".join(uniq)

summary_gene = (
    df_beta
    .groupby(group_cols, dropna=False)
    .agg(
        n_isolates=(col_iso, "nunique"),
        isolates=(col_iso, join_isolates),
        n_hits=("isolate_id", "size")
    )
    .reset_index()
)

# nicer column names for output
summary_gene.rename(columns={
    col_gene: "Gene_symbol",
    col_seq: "Sequence_name",
    col_subcls: "Subclass"
}, inplace=True)

sum_gene_tsv = "beta_lactam_summary_by_gene.tsv"
summary_gene.to_csv(sum_gene_tsv, sep='\t', index=False)
print(f">>> Gene-level summary written to: {sum_gene_tsv}")
print("    (now includes 'isolates' = comma-separated isolate_ids)")

# -------- summary by isolate & gene (with annotation) --------
agg_dict = {
    col_gene: ("Gene_symbol", "first"),
    col_seq: ("Sequence_name", "first"),
    col_subcls: ("Subclass", "first"),
}
if col_ident:
    agg_dict[col_ident] = ("%identity_min", "min")
    agg_dict[col_ident + "_max"] = ("%identity_max", "max")
if col_cov:
    agg_dict[col_cov] = ("%coverage_min", "min")
    agg_dict[col_cov + "_max"] = ("%coverage_max", "max")

# build aggregation manually (because we want nice column names)
gb = df_beta.groupby([col_iso, col_gene], dropna=False)
rows = []
for (iso, gene), sub in gb:
    row = {
        "isolate_id": iso,
        "Gene_symbol": sub[col_gene].iloc[0],
        "Sequence_name": sub[col_seq].iloc[0],
        "Subclass": sub[col_subcls].iloc[0],
        "n_hits": len(sub)
    }
    if col_ident:
        vals = pd.to_numeric(sub[col_ident], errors="coerce")
        row["%identity_min"] = vals.min()
        row["%identity_max"] = vals.max()
    if col_cov:
        vals = pd.to_numeric(sub[col_cov], errors="coerce")
        row["%coverage_min"] = vals.min()
        row["%coverage_max"] = vals.max()
    rows.append(row)

summary_iso_gene = pd.DataFrame(rows)

sum_iso_gene_tsv = "beta_lactam_summary_by_isolate_gene.tsv"
summary_iso_gene.to_csv(sum_iso_gene_tsv, sep='\t', index=False)
print(f">>> Isolate × gene summary written to: {sum_iso_gene_tsv}")
print("    (now includes 'Gene_symbol' and 'Sequence_name' annotation columns)")

# -------- optional Excel exports --------
try:
    with pd.ExcelWriter("beta_lactam_all.xlsx") as xw:
        df_beta.to_excel(xw, sheet_name="beta_lactam_all", index=False)
    with pd.ExcelWriter("beta_lactam_summary.xlsx") as xw:
        summary_gene.to_excel(xw, sheet_name="by_gene", index=False)
        summary_iso_gene.to_excel(xw, sheet_name="by_isolate_gene", index=False)
    print(">>> Excel workbooks written: beta_lactam_all.xlsx, beta_lactam_summary.xlsx")
except Exception as e:
    print("WARNING: could not write Excel files:", e)

EOF

echo ">>> All done."
echo "   - Individual reports: ${OUTDIR}/*.amrfinder.tsv"
echo "   - Merged β-lactam table: beta_lactam_all.tsv"
echo "   - Gene summary: beta_lactam_summary_by_gene.tsv (with isolate list)"
echo "   - Isolate × gene summary: beta_lactam_summary_by_isolate_gene.tsv (with annotation)"
echo "   - Excel (if pandas + openpyxl installed): beta_lactam_all.xlsx, beta_lactam_summary.xlsx"

3. Reporting and File Outputs

Files Generated:

  • beta_lactam_all.tsv: All β-lactam/beta-lactamase hits across genomes.
  • beta_lactam_summary_by_gene.tsv: Per-gene summary, including a column with all isolate IDs.
  • beta_lactam_summary_by_isolate_gene.tsv: Gene and isolate summary; includes “Gene symbol”, “Sequence name”, “Subclass”, annotation, and min/max identity/coverage.
  • If pandas is installed: beta_lactam_all.xlsx, beta_lactam_summary.xlsx.

Description of improvements:

  • Gene-level summary now lists isolates carrying each β-lactamase gene.
  • Isolate × gene summary includes full annotation and quantitative metrics: gene symbol, sequence name, subclass, plus minimum and maximum percent identity/coverage.

Hand these files directly to collaborators:

  • beta_lactam_summary_by_isolate_gene.tsv or beta_lactam_summary.xlsx have all necessary gene and annotation information in final form.

4. Excel Export (if pandas is installed after the fact)

If the bacto environment lacks pandas, simply perform Excel conversion outside it:

mamba deactivate

python3 - << 'PYCODE'
import pandas as pd
df = pd.read_csv("beta_lactam_all.tsv", sep="\t")
df.to_excel("beta_lactam_all.xlsx", index=False)
print("Saved: beta_lactam_all.xlsx")
PYCODE

#Replace "," with ", " in beta_lactam_summary_by_gene.tsv so that the number can be correctly formatted; save it to a Excel-format.
mv beta_lactam_all.xlsx AMR_summary.xlsx  # Then delete the first empty column.
mv beta_lactam_summary.xlsx BETA-LACTAM_summary.xlsx
mv beta_lactam_summary_by_gene.xlsx BETA-LACTAM_summary_by_gene.xlsx

Summary and Notes

  • The system is fully automated from installation to reporting.
  • All command lines are modular and suitable for direct inclusion in bioinformatics SOPs.
  • Output files have expanded annotation and isolate information for downstream analytics and sharing.
  • This approach ensures traceability, transparency, and rapid communication of β-lactamase annotation results for large datasets.

中文版本

基于NCBI AMRFinderPlus的自动化β-内酰胺酶注释流程

1. 安装与数据库设置

在bacto环境下安装AMRFinderPlus,并确保数据库更新:

mamba activate bacto
mamba install ncbi-amrfinderplus
mamba update ncbi-amrfinderplus
mamba activate bacto
amrfinder -u
  • 这将自动下载最新数据库,并确保环境目录正确建立与软链接。

查询支持的物种选项:

amrfinder --list_organisms

2. 批量分析与脚本调用

使用如下脚本,高效批量筛查基因组β-内酰胺酶基因,并生成结果与汇总文件。 输入表格式:filename_TAB_organism,首行为表头。

cd ~/DATA/Data_Patricia_AMRFinderPlus_2025/genomes
./run_amrfinder_beta_lactam.sh genome_metadata.tsv

脚本逻辑简明,自动映射物种名、循环注释所有基因组、收集所有结果,之后调用Python脚本合并并筛选β-内酰胺酶基因。

3. 结果文件说明

  • beta_lactam_all.tsv:所有β-内酰胺酶相关基因注释全表
  • beta_lactam_summary_by_gene.tsv:按基因注释,含所有分离株列表
  • beta_lactam_summary_by_isolate_gene.tsv:分离株×基因详细表,包含注释、同源信息等
  • 若安装pandas:另有Excel版beta_lactam_all.xlsxbeta_lactam_summary.xlsx

改进之处:

  • 汇总表显式展示每个基因对应的分离株ID
  • 分离株×基因表包含完整功能注释,identity/coverage等量化指标

直接将上述TSV或Excel表交给协作方即可,无需额外整理。

4. 补充Excel导出

如环境未装pandas,可以离线导出Excel:

mamba deactivate

python3 - << 'PYCODE'
import pandas as pd
df = pd.read_csv("beta_lactam_all.tsv", sep="\t")
df.to_excel("beta_lactam_all.xlsx", index=False)
print("Saved: beta_lactam_all.xlsx")
PYCODE

总结

  • 步骤明确,可拓展与自动化。
  • 输出表格式完善,满足批量汇报与协作需要。
  • 所有命令与脚本可直接嵌入项目标准操作流程,支持可追溯和数据复用。

]

Leave a Reply

Your email address will not be published. Required fields are marked *