Standalone Pipeline for Motif Conservation Analysis of AdeJ and AdeB in Acinetobacter baumannii for Data_Tam_DNAseq_2025_E.hormaechei-adeABadeIJ_adeIJK_CM1_CM2_on_ATCC19606
ade_motif_pipeline_standalone.zip
Overview
This post documents a standalone sequence-analysis pipeline that I used to test whether residues in eight candidate motifs are conserved across AdeJ and AdeB homologs from Acinetobacter baumannii. The goal of the workflow is to move from raw sequence collection to a residue-level conservation assessment that can be reproduced from the command line without any hidden manual steps.
The analysis is organized around eight motifs:
- AdeJ: DIKDY, DNYQFDSK, AIKIA, GNGQAS
- AdeB: DLSDY, QAYNFAIL, AIQLS, TSGTAE
The key idea of the pipeline is simple: first collect comparable AdeJ and AdeB proteins, then align them, calculate position-wise conservation scores, and finally map the predefined motifs onto the alignment-derived consensus sequence so that each motif can be interpreted residue by residue.
Complete workflow
1. Retrieval of AdeJ and AdeB protein sequences from NCBI
The first step retrieves AdeJ and AdeB protein sequences from the NCBI protein database using Biopython Entrez. AdeJ and AdeB are queried separately, because they represent distinct homolog groups and should not be mixed during downstream alignment.
A length restriction is applied during the NCBI search itself and then enforced again after download. This double filter is useful because it enriches for near full-length proteins and reduces the number of fragments, truncated annotations, or unusual entries that could distort the alignment.
In the standalone version of the pipeline, the following recommended length windows are used:
- AdeJ: 1000–1070 aa
- AdeB: 1000–1050 aa
The retrieval step produces:
- adej_protein_sequences.filtered.fasta
- adeb_protein_sequences.filtered.fasta
- adej_filtered_out.tsv
- adeb_filtered_out.tsv
The TSV files record which sequences were excluded locally after download.
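The local, post-download half of the double filter can be illustrated with a minimal dependency-free sketch. The actual script uses Biopython's SeqIO; the `parse_fasta` helper and the toy records below are simplified stand-ins for illustration only:

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line.strip():
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)

def length_filter(records, min_len, max_len):
    """Split records into kept and rejected lists by sequence length."""
    kept, rejected = [], []
    for header, seq in records:
        (kept if min_len <= len(seq) <= max_len else rejected).append((header, seq))
    return kept, rejected

# One near full-length record and one fragment, mimicking the AdeB window.
fasta = ">full\n" + "M" * 1010 + "\n>fragment\nMKT\n"
kept, rejected = length_filter(parse_fasta(fasta), 1000, 1050)
# kept retains the near full-length record; the fragment goes to the TSV report
```

In the real pipeline the rejected records end up as rows in the `*_filtered_out.tsv` files, together with their length and a reason string.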
2. Multiple-sequence alignment with MAFFT
The filtered AdeJ and AdeB sequence sets are aligned independently with MAFFT. The parameters used in this workflow are:
--adjustdirection --localpair --maxiterate 1000
--adjustdirection corrects reversed input sequences when necessary, while --localpair together with --maxiterate 1000 selects MAFFT's iterative local-pair strategy (L-INS-i), which is well suited to homologous protein families.
For each protein family, the pipeline writes two alignment formats:
- FASTA alignment for downstream parsing
- CLUSTAL alignment for visual inspection
This produces:
- adej_aligned.fasta
- adej_aligned.aln
- adeb_aligned.fasta
- adeb_aligned.aln
3. Per-position conservation analysis using Shannon entropy
After alignment, conservation is quantified across the full length of each aligned protein using Shannon entropy. The entropy is calculated independently for every alignment column.
Interpretation of the values is as follows:
- Entropy = 0 means the position is fully conserved
- Low entropy means the position is highly conserved with only limited variation
- High entropy means the position is more variable across the sequence set
In this implementation, the entropy calculation reproduces the original script behavior by counting all symbols that appear in the alignment column, including gaps. This keeps the outputs consistent with the original analysis.
The result is a position-by-position conservation profile, for example:
Position 293: -0.000
Here, -0.000 is a floating-point negative zero produced by the leading minus sign in the entropy formula; it should be interpreted numerically as exactly zero, i.e. a fully conserved position.
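The entropy definition can be checked with a small worked example. This is a standalone sketch that mirrors the `shannon_entropy` function used by the pipeline script, applied to three hand-made columns:

```python
import math
from collections import Counter

def shannon_entropy(column):
    """Shannon entropy (base 2) of one alignment column, gaps included."""
    counts = Counter(column)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

print(shannon_entropy("AAAAA"))  # -0.0: fully conserved (negative zero, equals 0.0)
print(shannon_entropy("AAGG"))   # 1.0: two residues at equal frequency
print(shannon_entropy("AAAA-"))  # ~0.722: a gap counts as a symbol here
```

The -0.0 printed for the fully conserved column is the same floating-point negative zero that appears as -0.000 in the scores files.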
4. Consensus-sequence generation and motif localization
To connect the motif definitions to the alignment positions, a consensus sequence is generated from each alignment by taking the most frequent residue at every aligned position. The pipeline then searches the consensus sequence for the predefined motifs and reports their coordinates using 1-based indexing.
This step is important because it links three levels of information:
- the motif sequence itself,
- the location of that motif in the consensus sequence,
- and the residue-wise entropy values at the same positions.
This allows each motif to be evaluated quantitatively and not only visually.
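The consensus-and-locate logic can be sketched on a toy three-sequence alignment (the sequences and motif occurrence below are made up for illustration; the real pipeline reads the MAFFT FASTA output):

```python
from collections import Counter

# Toy aligned sequences of equal length.
aligned = [
    "MKAIQLSV",
    "MKAIQLSV",
    "MRAIQLSI",
]

# Consensus: the most frequent symbol in each alignment column.
consensus = "".join(
    Counter(col).most_common(1)[0][0] for col in zip(*aligned)
)

def motif_positions(seq, motif):
    """Report every occurrence of motif in seq with 1-based coordinates."""
    positions, start = [], 0
    while (found := seq.find(motif, start)) != -1:
        positions.append((found + 1, found + len(motif)))
        start = found + 1
    return positions

print(consensus)                            # MKAIQLSV
print(motif_positions(consensus, "AIQLS"))  # [(3, 7)]
```

The 1-based (start, end) pairs are what the pipeline reports for each predefined motif against the consensus of the full alignment.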
5. Interpretation of the current result
Using this workflow, the motif that was fully conserved in the analyzed AdeB dataset was:
- AIQLS (positions 293–297 in the AdeB consensus sequence)
All five positions in this motif had entropy values of zero, indicating complete conservation at each residue.
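This residue-level check can be reproduced by cross-referencing the scores file with the motif coordinates. Below is a minimal sketch using a synthetic excerpt in the `Position N: value` format that the entropy script emits; the flanking score values are illustrative, not taken from the real dataset:

```python
import re

# Synthetic excerpt in the format written by the entropy script.
scores_text = """\
Position  292: 0.469
Position  293: -0.000
Position  294: -0.000
Position  295: -0.000
Position  296: -0.000
Position  297: -0.000
Position  298: 0.722
"""

# Parse "Position N: value" lines into {position: entropy}.
entropy = {
    int(m.group(1)): float(m.group(2))
    for m in re.finditer(r"Position\s+(\d+):\s+(-?\d+\.\d+)", scores_text)
}

start, end = 293, 297  # consensus coordinates of the AIQLS motif
fully_conserved = all(entropy[p] == 0.0 for p in range(start, end + 1))
print(fully_conserved)  # True: every motif position has zero entropy
```

Note that `float("-0.000")` compares equal to 0.0, so the negative zeros in the scores file do not need special handling.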
Why this pipeline is reproducible
The workflow is fully script-based. Each step produces explicit output files that are used as input by the next step, and the entire analysis can be run from the shell with a single wrapper script.
The standalone package therefore makes it possible to:
- reproduce the sequence retrieval,
- regenerate the alignments,
- recalculate all position-wise entropy scores,
- and recover the motif coordinates from the consensus sequence.
Source code
Below I include the full source code for each standalone component of the pipeline.
1. 1_fetch_AdeJ_and_AdeB.py
#!/usr/bin/env python3
"""Download length-filtered AdeJ and AdeB protein sequences from NCBI.
This script reproduces the first step of the AdeJ/AdeB motif-conservation
pipeline. It performs two NCBI Entrez protein searches, downloads the matching
FASTA records in batches, and applies a second length filter locally before
writing the final FASTA files.
"""
from __future__ import annotations
import argparse
from io import StringIO
from pathlib import Path
from time import sleep
from typing import Iterable
from Bio import Entrez, SeqIO
TARGETS = {
"AdeJ": {
"search_term": "Acinetobacter baumannii[organism] AND AdeJ[protein] AND 1000:1070[SLEN]",
"min_len": 1000,
"max_len": 1070,
"output_fasta": "adej_protein_sequences.filtered.fasta",
"report_tsv": "adej_filtered_out.tsv",
},
"AdeB": {
"search_term": "Acinetobacter baumannii[organism] AND AdeB[protein] AND 1000:1050[SLEN]",
"min_len": 1000,
"max_len": 1050,
"output_fasta": "adeb_protein_sequences.filtered.fasta",
"report_tsv": "adeb_filtered_out.tsv",
},
}
RETMAX = 10000
BATCH_SIZE = 200
PAUSE_SECONDS = 0.34
MAX_RETRIES = 3
def batched(items: list[str], batch_size: int) -> Iterable[list[str]]:
for start in range(0, len(items), batch_size):
yield items[start : start + batch_size]
def search_ids(search_term: str, retmax: int = RETMAX) -> list[str]:
with Entrez.esearch(db="protein", term=search_term, retmax=retmax) as handle:
record = Entrez.read(handle)
return record["IdList"]
def fetch_fasta_batch(id_batch: list[str]) -> str:
last_error = None
for attempt in range(1, MAX_RETRIES + 1):
try:
with Entrez.efetch(
db="protein",
id=",".join(id_batch),
rettype="fasta",
retmode="text",
) as handle:
return handle.read()
except Exception as exc: # pragma: no cover
last_error = exc
sleep(attempt)
raise RuntimeError(f"Failed to fetch batch after {MAX_RETRIES} attempts: {last_error}")
def fetch_and_filter_sequences(
protein_name: str,
ids: list[str],
min_len: int,
max_len: int,
output_fasta: str,
report_tsv: str,
) -> None:
kept = 0
rejected = 0
with open(output_fasta, "w") as fasta_out, open(report_tsv, "w") as report_out:
report_out.write("accession\tlength\treason\tdescription\n")
for id_batch in batched(ids, BATCH_SIZE):
fasta_text = fetch_fasta_batch(id_batch)
records = SeqIO.parse(StringIO(fasta_text), "fasta")
for record in records:
seq_len = len(record.seq)
if min_len <= seq_len <= max_len:
SeqIO.write(record, fasta_out, "fasta")
kept += 1
else:
                    reason = f"outside_{min_len}_{max_len}"
report_out.write(
f"{record.id}\t{seq_len}\t{reason}\t{record.description}\n"
)
rejected += 1
sleep(PAUSE_SECONDS)
print(
        f"{protein_name}: kept {kept} sequences in range {min_len}-{max_len} aa; "
f"filtered out {rejected}."
)
print(f" FASTA output : {Path(output_fasta).resolve()}")
print(f" Filter report: {Path(report_tsv).resolve()}")
def main() -> None:
parser = argparse.ArgumentParser(description="Download length-filtered AdeJ/AdeB protein sequences from NCBI.")
parser.add_argument("--email", required=True, help="Email address for NCBI Entrez.")
args = parser.parse_args()
Entrez.email = args.email
for protein_name, cfg in TARGETS.items():
ids = search_ids(cfg["search_term"])
print(f"{protein_name}: found {len(ids)} NCBI hits after length-restricted search")
fetch_and_filter_sequences(
protein_name=protein_name,
ids=ids,
min_len=cfg["min_len"],
max_len=cfg["max_len"],
output_fasta=cfg["output_fasta"],
report_tsv=cfg["report_tsv"],
)
if __name__ == "__main__":
main()
2. 2_run_mafft.sh
#!/usr/bin/env bash
set -euo pipefail
if ! command -v mafft >/dev/null 2>&1; then
echo "Error: mafft is not installed or not in PATH." >&2
exit 1
fi
ADEJ_FASTA="${1:-adej_protein_sequences.filtered.fasta}"
ADEB_FASTA="${2:-adeb_protein_sequences.filtered.fasta}"
mafft --adjustdirection --maxiterate 1000 --localpair "$ADEJ_FASTA" > adej_aligned.fasta
mafft --adjustdirection --clustalout --maxiterate 1000 --localpair "$ADEJ_FASTA" > adej_aligned.aln
mafft --adjustdirection --maxiterate 1000 --localpair "$ADEB_FASTA" > adeb_aligned.fasta
mafft --adjustdirection --clustalout --maxiterate 1000 --localpair "$ADEB_FASTA" > adeb_aligned.aln
3. 3_calculate_per_position_Shannon_entropy.py
#!/usr/bin/env python3
"""Calculate Shannon entropy for every position in a protein alignment.
By default this reproduces the original pipeline behavior and counts every
symbol in the alignment column, including gaps, when computing entropy.
"""
from __future__ import annotations
import argparse
import math
from collections import Counter
from typing import Iterable
from Bio import AlignIO
def shannon_entropy(column: Iterable[str]) -> float:
freqs = Counter(column)
total = float(sum(freqs.values()))
return -sum((count / total) * math.log2(count / total) for count in freqs.values())
def main() -> None:
parser = argparse.ArgumentParser(description="Calculate per-position Shannon entropy from an aligned FASTA file.")
parser.add_argument("alignment", help="Aligned FASTA file, e.g. adej_aligned.fasta")
parser.add_argument("-o", "--output", help="Optional output file. Defaults to stdout.")
args = parser.parse_args()
alignment = AlignIO.read(args.alignment, "fasta")
lines = []
for idx in range(alignment.get_alignment_length()):
column = [str(record.seq[idx]) for record in alignment]
score = shannon_entropy(column)
lines.append(f"Position {idx + 1:4d}: {score:.3f}")
text = "\n".join(lines) + "\n"
if args.output:
with open(args.output, "w") as handle:
handle.write(text)
else:
print(text, end="")
if __name__ == "__main__":
main()
4. 4_get_motif_positions.py
#!/usr/bin/env python3
"""Generate a consensus sequence and locate candidate motifs in that consensus."""
from __future__ import annotations
import argparse
from collections import Counter
from Bio import AlignIO
MOTIFS = {
"AdeJ": {
"DIKDY": "DIKDY",
"DNYQFDSK": "DNYQFDSK",
"AIKIA": "AIKIA",
"GNGQAS": "GNGQAS",
},
"AdeB": {
"DLSDY": "DLSDY",
"QAYNFAIL": "QAYNFAIL",
"AIQLS": "AIQLS",
"TSGTAE": "TSGTAE",
},
}
def generate_consensus(alignment) -> str:
consensus = []
for i in range(len(alignment[0])):
column = [str(record.seq[i]) for record in alignment]
most_common = Counter(column).most_common(1)[0][0]
consensus.append(most_common)
return "".join(consensus)
def find_motif_positions_in_consensus(seq: str, motif: str) -> list[tuple[int, int]]:
positions = []
    start = 0
while True:
found = seq.find(motif, start)
if found == -1:
break
positions.append((found + 1, found + len(motif)))
start = found + 1
return positions
def main() -> None:
parser = argparse.ArgumentParser(description="Find candidate motif positions in a consensus sequence derived from an alignment.")
parser.add_argument("alignment", help="Aligned FASTA file, e.g. adej_aligned.fasta")
parser.add_argument("protein", choices=["AdeJ", "AdeB"], help="Protein family for motif lookup")
parser.add_argument("-o", "--output", help="Optional output file. Defaults to stdout.")
args = parser.parse_args()
alignment = AlignIO.read(args.alignment, "fasta")
consensus_sequence = generate_consensus(alignment)
motif_positions_in_consensus = {}
for motif_name, motif_sequence in MOTIFS[args.protein].items():
positions = find_motif_positions_in_consensus(consensus_sequence, motif_sequence)
motif_positions_in_consensus[motif_name] = positions
text = (
f"Motif positions in the consensus sequence of {args.protein}:\n"
f"{motif_positions_in_consensus}\n\n"
f"Consensus sequence:\n{consensus_sequence}\n"
)
if args.output:
with open(args.output, "w") as handle:
handle.write(text)
else:
print(text, end="")
if __name__ == "__main__":
main()
5. run_pipeline.sh
#!/usr/bin/env bash
set -euo pipefail
if [[ $# -lt 1 ]]; then
    echo "Usage: bash run_pipeline.sh <your_email_for_ncbi>" >&2
exit 1
fi
EMAIL="$1"
python3 1_fetch_AdeJ_and_AdeB.py --email "$EMAIL"
bash 2_run_mafft.sh
python3 3_calculate_per_position_Shannon_entropy.py adej_aligned.fasta -o adej_aligned.scores
python3 3_calculate_per_position_Shannon_entropy.py adeb_aligned.fasta -o adeb_aligned.scores
python3 4_get_motif_positions.py adej_aligned.fasta AdeJ -o adej_motif_positions.txt
python3 4_get_motif_positions.py adeb_aligned.fasta AdeB -o adeb_motif_positions.txt
echo "Pipeline finished. Generated files:"
ls -1 adej_* adeb_*
6. requirements.txt
biopython>=1.80
How to reproduce the analysis
- Install Python dependencies:
pip install -r requirements.txt
- Install MAFFT and make sure it is available in your shell PATH.
- Run the complete workflow:
bash run_pipeline.sh your_email@example.com
Standalone package
For convenience, I also provide the standalone package as a zip archive so the full pipeline can be downloaded and executed locally.