Automated β-Lactamase Gene Detection with NCBI AMRFinderPlus (Data_Patricia_AMRFinderPlus_2025, v2)

1. Installation and Database Setup

To install and prepare NCBI AMRFinderPlus in the bacto environment:

mamba activate bacto
mamba install ncbi-amrfinderplus
mamba update ncbi-amrfinderplus

mamba activate bacto
amrfinder -u
  • This will:
    • Download and install the latest AMRFinderPlus version and its database.
    • Create /home/jhuang/mambaforge/envs/bacto/share/amrfinderplus/data/.
    • Symlink the latest database version for use.

Check available organism options for annotation:

amrfinder --list_organisms
#Available --organism options: Acinetobacter_baumannii, Burkholderia_cepacia, Burkholderia_mallei, Burkholderia_pseudomallei, Campylobacter, Citrobacter_freundii, Clostridioides_difficile, Corynebacterium_diphtheriae, Enterobacter_asburiae, Enterobacter_cloacae, Enterococcus_faecalis, Enterococcus_faecium, Escherichia, Klebsiella_oxytoca, Klebsiella_pneumoniae, Neisseria_gonorrhoeae, Neisseria_meningitidis, Pseudomonas_aeruginosa, Salmonella, Serratia_marcescens, Staphylococcus_aureus, Staphylococcus_pseudintermedius, Streptococcus_agalactiae, Streptococcus_pneumoniae, Streptococcus_pyogenes, Vibrio_cholerae, Vibrio_parahaemolyticus, Vibrio_vulnificus
  • Supported values include species such as Escherichia, Klebsiella_pneumoniae, Enterobacter_cloacae, Pseudomonas_aeruginosa and many others.

2. Batch Analysis: Bash Script for Genome Screening

Use the following script to screen multiple genomes using AMRFinderPlus and output only β-lactam/beta-lactamase hits from a metadata table.

Input: genome_metadata.tsv — tab-separated columns: filename_TAB_organism, with header.

filename    organism
58.fasta    Escherichia coli
92.fasta    Klebsiella pneumoniae
125.fasta   Enterobacter cloacae complex
128.fasta   Enterobacter cloacae complex
130.fasta   Enterobacter cloacae complex
147.fasta   Citrobacter freundii
149.fasta   Citrobacter freundii
160.fasta   Citrobacter braakii
161.fasta   Citrobacter braakii
168.fasta   Providencia stuartii
184.fasta   Klebsiella aerogenes
65.fasta    Pseudomonas aeruginosa
201.fasta   Pseudomonas aeruginosa
209.fasta   Pseudomonas aeruginosa
167.fasta   Serratia marcescens

Run:

cd ~/DATA/Data_Patricia_AMRFinderPlus_2025/genomes
./run_amrfinder_and_summarize.sh genome_metadata.tsv
#./run_amrfinder_and_summarize.sh genome_metadata_149.tsv
#OR_DETECT_RUN: amrfinder -n 92.fasta -o amrfinder_results/92.amrfinder.tsv --plus --organism Klebsiella_pneumoniae --threads 1

python summarize_from_amrfinder_results.py amrfinder_results
# or, since that's the default:
# python summarize_from_amrfinder_results.py

Produce

  • AMRFinder-wide outputs

    • amrfinder_all.tsv
    • amrfinder_summary_by_isolate_gene.tsv
    • amrfinder_summary_by_gene.tsv
    • amrfinder_summary_by_class.tsv (if a class column exists)
    • amrfinder_summary.xlsx (with multiple sheets)
  • β-lactam-only outputs (if Class and Subclass are present)

    • beta_lactam_all.tsv
    • beta_lactam_summary_by_gene.tsv
    • beta_lactam_summary_by_isolate_gene.tsv
    • beta_lactam_all.xlsx
    • beta_lactam_summary.xlsx

Report

Please find attached the updated AMRFinderPlus summary files, now including isolate 167. For β-lactam–specific results, please see beta_lactam_all.xlsx and beta_lactam_summary.xlsx. In particular, beta_lactam_summary.xlsx contains two sheets:

  • by_gene – aggregated counts and isolate lists for each β-lactam gene
  • by_isolate_gene – per-isolate overview of detected β-lactam genes

Script:

  • run_amrfinder_and_summarize.sh

        #!/usr/bin/env bash
        set -euo pipefail
    
        META_FILE="${1:-}"
    
        if [[ -z "$META_FILE" || ! -f "$META_FILE" ]]; then
        echo "Usage: $0 genome_metadata.tsv" >&2
        exit 1
        fi
    
        OUTDIR="amrfinder_results"
        mkdir -p "$OUTDIR"
    
        echo ">>> Checking AMRFinder installation..."
        amrfinder -V || { echo "ERROR: amrfinder not working"; exit 1; }
        echo
    
        echo ">>> Running AMRFinderPlus on all genomes listed in $META_FILE"
    
        # --- loop over metadata file ---
        # expected columns: filename
    <TAB>organism
        tail -n +2 "$META_FILE" | while IFS=$'\t' read -r fasta organism; do
        # skip empty lines
        [[ -z "$fasta" ]] && continue
    
        if [[ ! -f "$fasta" ]]; then
        echo "WARN: FASTA file '$fasta' not found, skipping."
        continue
        fi
    
        isolate_id="${fasta%.fasta}"
    
        # map free-text organism to AMRFinder --organism names (optional)
        org_opt=""
        case "$organism" in
        "Escherichia coli")              org_opt="--organism Escherichia" ;;
        "Klebsiella pneumoniae")         org_opt="--organism Klebsiella_pneumoniae" ;;
        "Enterobacter cloacae complex")  org_opt="--organism Enterobacter_cloacae" ;;
        "Citrobacter freundii")          org_opt="--organism Citrobacter_freundii" ;;
        "Citrobacter braakii")           org_opt="--organism Citrobacter_freundii" ;;
        "Pseudomonas aeruginosa")        org_opt="--organism Pseudomonas_aeruginosa" ;;
        "Serratia marcescens")           org_opt="--organism Serratia_marcescens" ;;
        # others (Providencia stuartii, Klebsiella aerogenes)
        # currently have no organism-specific rules in AMRFinder, so we omit --organism
        *)                               org_opt="" ;;
        esac
    
        out_tsv="${OUTDIR}/${isolate_id}.amrfinder.tsv"
    
        echo "  - ${fasta} (${organism}) -> ${out_tsv} ${org_opt}"
        amrfinder -n "$fasta" -o "$out_tsv" --plus $org_opt
        done
    
        echo ">>> AMRFinderPlus runs finished."
        echo ">>> All done."
        echo "   - Individual reports: ${OUTDIR}/*.amrfinder.tsv"
  • summarize_from_amrfinder_results.py

        #!/usr/bin/env python3
        """
        summarize_from_amrfinder_results.py
    
        Usage:
        python summarize_from_amrfinder_results.py [amrfinder_results_dir]
    
        Default directory is "amrfinder_results" (relative to current working dir).
    
        This script:
        1) Reads all *.amrfinder.tsv in the given directory
        2) Merges them into a combined table
        3) Generates AMRFinder-wide summaries (amrfinder_* files)
        4) Applies a β-lactam filter:
    
                Element type == "AMR" (case-insensitive)
        AND Class or Subclass contains "beta-lactam" (case-insensitive)
    
        and generates β-lactam-only summaries (beta_lactam_* files).
    
        It NEVER re-runs AMRFinder; it only uses existing TSV files.
        """
    
        import sys
        import os
        import glob
        import re
    
        try:
        import pandas as pd
        except ImportError:
        sys.stderr.write(
                "ERROR: pandas is not installed.\n"
                "Install with something like:\n"
                "  mamba install pandas openpyxl -c conda-forge -c bioconda\n"
        )
        sys.exit(1)
    
        # ---------------------------------------------------------------------
        # Helpers
        # ---------------------------------------------------------------------
    
        def read_one(path):
        """Read one *.amrfinder.tsv and add an 'isolate_id' column from the filename."""
        df = pd.read_csv(path, sep="\t", dtype=str)
        df.columns = [c.strip() for c in df.columns]
        isolate_id = os.path.basename(path).replace(".amrfinder.tsv", "")
        df["isolate_id"] = isolate_id
        return df
    
        def pick(df, *candidates):
        """Return the first existing column name among candidates (normalized names)."""
        for c in candidates:
                if c in df.columns:
                return c
        return None
    
        # ---------------------------------------------------------------------
        # AMRFinder-wide summaries (no β-lactam filter)
        # ---------------------------------------------------------------------
    
        def make_amrfinder_summaries(
        df_all,
        col_gene,
        col_seq,
        col_class,
        col_subcls,
        col_ident,
        col_cov,
        col_iso,
        ):
        """Summaries for ALL AMRFinder hits (no β-lactam filter)."""
        if df_all.empty:
                print("[amrfinder] No rows in merged table, skipping summaries.")
                return
    
        # full merged table
        df_all.to_csv("amrfinder_all.tsv", sep="\t", index=False)
        print(">>> Full AMRFinder table written to: amrfinder_all.tsv")
    
        # ---- summary by isolate × gene ----
        rows = []
        for (iso, gene), sub in df_all.groupby([col_iso, col_gene], dropna=False):
                row = {
                "isolate_id": iso,
                "Gene_symbol": sub[col_gene].iloc[0],
                "n_hits": len(sub),
                }
                if col_seq is not None:
                row["Sequence_name"] = sub[col_seq].iloc[0]
                if col_class is not None:
                row["Class"] = sub[col_class].iloc[0]
                if col_subcls is not None:
                row["Subclass"] = sub[col_subcls].iloc[0]
                if col_ident is not None:
                vals = pd.to_numeric(sub[col_ident], errors="coerce")
                row["%identity_min"] = vals.min()
                row["%identity_max"] = vals.max()
                if col_cov is not None:
                vals = pd.to_numeric(sub[col_cov], errors="coerce")
                row["%coverage_min"] = vals.min()
                row["%coverage_max"] = vals.max()
                rows.append(row)
    
        summary_iso_gene = pd.DataFrame(rows)
        summary_iso_gene.to_csv(
                "amrfinder_summary_by_isolate_gene.tsv", sep="\t", index=False
        )
        print(">>> Isolate × gene summary written to: amrfinder_summary_by_isolate_gene.tsv")
    
        # ---- summary by gene ----
        def join(vals):
                uniq = sorted(set(vals.dropna().astype(str)))
                return ",".join(uniq)
    
        rows = []
        for gene, sub in df_all.groupby(col_gene, dropna=False):
                row = {
                "Gene_symbol": sub[col_gene].iloc[0],
                "n_isolates": sub[col_iso].nunique(),
                "isolates": ",".join(sorted(set(sub[col_iso].dropna().astype(str)))),
                "n_hits": len(sub),
                }
                if col_seq is not None:
                row["Sequence_name"] = join(sub[col_seq])
                if col_class is not None:
                row["Class"] = join(sub[col_class])
                if col_subcls is not None:
                row["Subclass"] = join(sub[col_subcls])
                rows.append(row)
    
        summary_gene = pd.DataFrame(rows)
        summary_gene = summary_gene.sort_values("n_isolates", ascending=False)
        summary_gene.to_csv("amrfinder_summary_by_gene.tsv", sep="\t", index=False)
        print(">>> Gene-level summary written to: amrfinder_summary_by_gene.tsv")
    
        # ---- summary by class/subclass ----
        summary_class = None
        if col_class is not None:
                group_cols = [col_class]
                if col_subcls is not None:
                group_cols.append(col_subcls)
    
                summary_class = (
                df_all.groupby(group_cols, dropna=False)
                .agg(
                        n_isolates=(col_iso, "nunique"),
                        n_hits=(col_iso, "size"),
                )
                .reset_index()
                )
                summary_class.to_csv("amrfinder_summary_by_class.tsv", sep="\t", index=False)
                print(">>> Class-level summary written to: amrfinder_summary_by_class.tsv")
        else:
                print(">>> No 'class' column found; amrfinder_summary_by_class.tsv not created.")
    
        # ---- Excel workbook ----
        try:
                with pd.ExcelWriter("amrfinder_summary.xlsx") as xw:
                df_all.to_excel(xw, sheet_name="amrfinder_all", index=False)
                summary_iso_gene.to_excel(xw, sheet_name="by_isolate_gene", index=False)
                summary_gene.to_excel(xw, sheet_name="by_gene", index=False)
                if summary_class is not None:
                        summary_class.to_excel(xw, sheet_name="by_class", index=False)
                print(">>> Excel workbook written: amrfinder_summary.xlsx")
        except Exception as e:
                print("WARNING: could not write amrfinder_summary.xlsx:", e)
    
        # ---------------------------------------------------------------------
        # β-lactam summaries
        # ---------------------------------------------------------------------
    
        def make_beta_lactam_summaries(
        df_beta,
        col_gene,
        col_seq,
        col_subcls,
        col_ident,
        col_cov,
        col_iso,
        ):
        """Summaries for β-lactam subset (after mask)."""
        if df_beta.empty:
                print("[beta_lactam] No β-lactam hits in subset, skipping.")
                return
    
        # full β-lactam table
        beta_all_tsv = "beta_lactam_all.tsv"
        df_beta.to_csv(beta_all_tsv, sep="\t", index=False)
        print(">>> β-lactam / β-lactamase hits written to: %s" % beta_all_tsv)
    
        # -------- summary by gene (with list of isolates) --------
        group_cols = [col_gene]
        if col_seq is not None:
                group_cols.append(col_seq)
        if col_subcls is not None:
                group_cols.append(col_subcls)
    
        def join_isolates(vals):
                uniq = sorted(set(vals.dropna().astype(str)))
                return ",".join(uniq)
    
        summary_gene = (
                df_beta.groupby(group_cols, dropna=False)
                .agg(
                n_isolates=(col_iso, "nunique"),
                isolates=(col_iso, join_isolates),
                n_hits=(col_iso, "size"),
                )
                .reset_index()
        )
    
        rename_map = {}
        if col_gene is not None:
                rename_map[col_gene] = "Gene_symbol"
        if col_seq is not None:
                rename_map[col_seq] = "Sequence_name"
        if col_subcls is not None:
                rename_map[col_subcls] = "Subclass"
        summary_gene.rename(columns=rename_map, inplace=True)
    
        sum_gene_tsv = "beta_lactam_summary_by_gene.tsv"
        summary_gene.to_csv(sum_gene_tsv, sep="\t", index=False)
        print(">>> Gene-level β-lactam summary written to: %s" % sum_gene_tsv)
        print("    (includes 'isolates' = comma-separated isolate_ids)")
    
        # -------- summary by isolate & gene (with annotation) --------
        rows = []
        for (iso, gene), sub in df_beta.groupby([col_iso, col_gene], dropna=False):
                row = {
                "isolate_id": iso,
                "Gene_symbol": sub[col_gene].iloc[0],
                "n_hits": len(sub),
                }
                if col_seq is not None:
                row["Sequence_name"] = sub[col_seq].iloc[0]
                if col_subcls is not None:
                row["Subclass"] = sub[col_subcls].iloc[0]
    
                if col_ident is not None:
                vals = pd.to_numeric(sub[col_ident], errors="coerce")
                row["%identity_min"] = vals.min()
                row["%identity_max"] = vals.max()
                if col_cov is not None:
                vals = pd.to_numeric(sub[col_cov], errors="coerce")
                row["%coverage_min"] = vals.min()
                row["%coverage_max"] = vals.max()
    
                rows.append(row)
    
        summary_iso_gene = pd.DataFrame(rows)
        sum_iso_gene_tsv = "beta_lactam_summary_by_isolate_gene.tsv"
        summary_iso_gene.to_csv(sum_iso_gene_tsv, sep="\t", index=False)
        print(">>> Isolate × gene β-lactam summary written to: %s" % sum_iso_gene_tsv)
        print("    (includes 'Gene_symbol' and 'Sequence_name' annotation columns)")
    
        # -------- optional Excel exports --------
        try:
                with pd.ExcelWriter("beta_lactam_all.xlsx") as xw:
                df_beta.to_excel(xw, sheet_name="beta_lactam_all", index=False)
                with pd.ExcelWriter("beta_lactam_summary.xlsx") as xw:
                summary_gene.to_excel(xw, sheet_name="by_gene", index=False)
                summary_iso_gene.to_excel(xw, sheet_name="by_isolate_gene", index=False)
                print(">>> Excel workbooks written: beta_lactam_all.xlsx, beta_lactam_summary.xlsx")
        except Exception as e:
                print("WARNING: could not write β-lactam Excel files:", e)
    
        # ---------------------------------------------------------------------
        # Main
        # ---------------------------------------------------------------------
    
        def main():
        outdir = sys.argv[1] if len(sys.argv) > 1 else "amrfinder_results"
    
        if not os.path.isdir(outdir):
                sys.stderr.write("ERROR: directory '%s' not found.\n" % outdir)
                sys.exit(1)
    
        files = sorted(glob.glob(os.path.join(outdir, "*.amrfinder.tsv")))
        if not files:
                sys.stderr.write("ERROR: no *.amrfinder.tsv files found in '%s'.\n" % outdir)
                sys.exit(1)
    
        print(">>> Found %d AMRFinder TSV files in: %s" % (len(files), outdir))
        for f in files:
                print("   -", os.path.basename(f))
    
        dfs = [read_one(f) for f in files]
        df = pd.concat(dfs, ignore_index=True)
    
        # normalize column names for internal use
        norm_cols = {c: c.strip().lower().replace(" ", "_") for c in df.columns}
        df.rename(columns=norm_cols, inplace=True)
    
        # locate columns (handles your Element type / subtype + older formats)
        col_gene       = pick(df, "gene_symbol", "genesymbol")
        col_seq        = pick(df, "sequence_name", "sequencename")
        col_elemtype   = pick(df, "element_type")
        col_elemsub    = pick(df, "element_subtype")
        col_class      = pick(df, "class")
        col_subcls     = pick(df, "subclass")
        col_ident      = pick(df, "%identity_to_reference_sequence", "identity")
        col_cov        = pick(df, "%coverage_of_reference_sequence", "coverage_of_reference_sequence")
        col_iso        = "isolate_id"
    
        print("\nDetected columns:")
        for label, col in [
                ("gene", col_gene),
                ("sequence", col_seq),
                ("element_type", col_elemtype),
                ("element_subtype", col_elemsub),
                ("class", col_class),
                ("subclass", col_subcls),
                ("%identity", col_ident),
                ("%coverage", col_cov),
                ("isolate_id", col_iso),
        ]:
                print("  %-14s: %s" % (label, col))
    
        if col_gene is None:
                sys.stderr.write(
                "ERROR: could not find a gene symbol column "
                "(expected something like 'Gene symbol' in the original AMRFinder output).\n"
                )
                sys.exit(1)
    
        print("\n=== Generating AMRFinder-wide summaries (all hits) ===")
        make_amrfinder_summaries(
                df_all=df,
                col_gene=col_gene,
                col_seq=col_seq,
                col_class=col_class,
                col_subcls=col_subcls,
                col_ident=col_ident,
                col_cov=col_cov,
                col_iso=col_iso,
        )
    
        # -----------------------------------------------------------------
        # β-lactam subset
        #
        # New logic for your current data:
        #   Element type == "AMR"
        #   AND Class or Subclass contains "beta-lactam"
        #
        # Falls back to just Class/Subclass if Element type not present.
        # -----------------------------------------------------------------
        if (col_elemtype is not None) or (col_class is not None or col_subcls is not None):
    
                # element type AMR (if column exists, otherwise all True)
                if col_elemtype is not None:
                mask_amr = df[col_elemtype].str.contains("AMR", case=False, na=False)
                else:
                mask_amr = pd.Series(True, index=df.index)
    
                # beta-lactam pattern (handles BETA-LACTAM, beta lactam, etc.)
                beta_pattern = re.compile(r"beta[- ]?lactam", re.IGNORECASE)
    
                mask_beta = pd.Series(False, index=df.index)
                if col_class is not None:
                mask_beta |= df[col_class].fillna("").str.contains(beta_pattern)
                if col_subcls is not None:
                mask_beta |= df[col_subcls].fillna("").str.contains(beta_pattern)
    
                mask = mask_amr & mask_beta
                df_beta = df.loc[mask].copy()
    
                if df_beta.empty:
                print(
                        "\nWARNING: No β-lactam hits found "
                        "(Element type == 'AMR' AND Class/Subclass contains 'beta-lactam')."
                )
                else:
                print(
                        "\n=== β-lactam subset ===\n"
                        "  kept %d of %d rows where Element type is 'AMR' and "
                        "Class/Subclass contains 'beta-lactam'\n"
                        % (len(df_beta), len(df))
                )
                make_beta_lactam_summaries(
                        df_beta=df_beta,
                        col_gene=col_gene,
                        col_seq=col_seq,
                        col_subcls=col_subcls,
                        col_ident=col_ident,
                        col_cov=col_cov,
                        col_iso=col_iso,
                )
        else:
                print(
                "\nWARNING: Cannot apply β-lactam filter because Element type and/or "
                "class/subclass columns were not found. Only amrfinder_* "
                "outputs were generated."
                )
    
        if __name__ == "__main__":
        main()

Leave a Reply

Your email address will not be published. Required fields are marked *