Standalone Pipeline for Motif Conservation Analysis of AdeJ and AdeB in Acinetobacter baumannii for Data_Tam_DNAseq_2025_E.hormaechei-adeABadeIJ_adeIJK_CM1_CM2_on_ATCC19606
ade_motif_pipeline_standalone.zip
Overview
This post documents a standalone sequence-analysis pipeline that I used to test whether residues in eight candidate motifs are conserved across AdeJ and AdeB homologs from Acinetobacter baumannii. The goal of the workflow is to move from raw sequence collection to a residue-level conservation assessment that can be reproduced from the command line without any hidden manual steps.
The analysis is organized around eight motifs:
- AdeJ: DIKDY, DNYQFDSK, AIKIA, GNGQAS
- AdeB: DLSDY, QAYNFAIL, AIQLS, TSGTAE
The key idea of the pipeline is simple: first collect comparable AdeJ and AdeB proteins, then align them, calculate position-wise conservation scores, and finally map the predefined motifs onto the alignment-derived consensus sequence so that each motif can be interpreted residue by residue.
Complete workflow
1. Retrieval of AdeJ and AdeB protein sequences from NCBI
The first step retrieves AdeJ and AdeB protein sequences from the NCBI protein database using Biopython Entrez. AdeJ and AdeB are queried separately, because they represent distinct homolog groups and should not be mixed during downstream alignment.
A length restriction is applied during the NCBI search itself and then enforced again after download. This double filter is useful because it enriches for near full-length proteins and reduces the number of fragments, truncated annotations, or unusual entries that could distort the alignment.
In the standalone version of the pipeline, the following recommended length windows are used:
- AdeJ: 1000–1070 aa
- AdeB: 1000–1050 aa
The retrieval step produces:
- adej_protein_sequences.filtered.fasta
- adeb_protein_sequences.filtered.fasta
- adej_filtered_out.tsv
- adeb_filtered_out.tsv
The TSV files record which sequences were excluded locally after download.
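The local, post-download half of the double filter can be illustrated with a minimal dependency-free sketch. The actual script uses Biopython's SeqIO; the `parse_fasta` helper and the toy records below are simplified stand-ins for illustration only:

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line.strip():
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)

def length_filter(records, min_len, max_len):
    """Split records into kept and rejected lists by sequence length."""
    kept, rejected = [], []
    for header, seq in records:
        (kept if min_len <= len(seq) <= max_len else rejected).append((header, seq))
    return kept, rejected

# One near full-length record and one fragment, mimicking the AdeB window.
fasta = ">full\n" + "M" * 1010 + "\n>fragment\nMKT\n"
kept, rejected = length_filter(parse_fasta(fasta), 1000, 1050)
# kept retains the near full-length record; the fragment goes to the TSV report
```

In the real pipeline the rejected records end up as rows in the `*_filtered_out.tsv` files, together with their length and a reason string.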
2. Multiple-sequence alignment with MAFFT
The filtered AdeJ and AdeB sequence sets are aligned independently with MAFFT. The parameters used in this workflow are:
--adjustdirection --localpair --maxiterate 1000
--adjustdirection corrects reversed input sequences when necessary, while --localpair together with --maxiterate 1000 selects MAFFT's iterative local-pair strategy (L-INS-i), which is well suited to homologous protein families.
For each protein family, the pipeline writes two alignment formats:
- FASTA alignment for downstream parsing
- CLUSTAL alignment for visual inspection
This produces:
- adej_aligned.fasta
- adej_aligned.aln
- adeb_aligned.fasta
- adeb_aligned.aln
3. Per-position conservation analysis using Shannon entropy
After alignment, conservation is quantified across the full length of each aligned protein using Shannon entropy. The entropy is calculated independently for every alignment column.
Interpretation of the values is as follows:
- Entropy = 0 means the position is fully conserved
- Low entropy means the position is highly conserved with only limited variation
- High entropy means the position is more variable across the sequence set
In this implementation, the entropy calculation reproduces the original script behavior by counting all symbols that appear in the alignment column, including gaps. This keeps the outputs consistent with the original analysis.
The result is a position-by-position conservation profile, for example:
Position 293: -0.000
Here, -0.000 is a floating-point negative zero produced by the leading minus sign in the entropy formula; it should be interpreted numerically as exactly zero, i.e. a fully conserved position.
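The entropy definition can be checked with a small worked example. This is a standalone sketch that mirrors the `shannon_entropy` function used by the pipeline script, applied to three hand-made columns:

```python
import math
from collections import Counter

def shannon_entropy(column):
    """Shannon entropy (base 2) of one alignment column, gaps included."""
    counts = Counter(column)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

print(shannon_entropy("AAAAA"))  # -0.0: fully conserved (negative zero, equals 0.0)
print(shannon_entropy("AAGG"))   # 1.0: two residues at equal frequency
print(shannon_entropy("AAAA-"))  # ~0.722: a gap counts as a symbol here
```

The -0.0 printed for the fully conserved column is the same floating-point negative zero that appears as -0.000 in the scores files.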
4. Consensus-sequence generation and motif localization
To connect the motif definitions to the alignment positions, a consensus sequence is generated from each alignment by taking the most frequent residue at every aligned position. The pipeline then searches the consensus sequence for the predefined motifs and reports their coordinates using 1-based indexing.
This step is important because it links three levels of information:
- the motif sequence itself,
- the location of that motif in the consensus sequence,
- and the residue-wise entropy values at the same positions.
This allows each motif to be evaluated quantitatively and not only visually.
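The consensus-and-locate logic can be sketched on a toy three-sequence alignment (the sequences and motif occurrence below are made up for illustration; the real pipeline reads the MAFFT FASTA output):

```python
from collections import Counter

# Toy aligned sequences of equal length.
aligned = [
    "MKAIQLSV",
    "MKAIQLSV",
    "MRAIQLSI",
]

# Consensus: the most frequent symbol in each alignment column.
consensus = "".join(
    Counter(col).most_common(1)[0][0] for col in zip(*aligned)
)

def motif_positions(seq, motif):
    """Report every occurrence of motif in seq with 1-based coordinates."""
    positions, start = [], 0
    while (found := seq.find(motif, start)) != -1:
        positions.append((found + 1, found + len(motif)))
        start = found + 1
    return positions

print(consensus)                            # MKAIQLSV
print(motif_positions(consensus, "AIQLS"))  # [(3, 7)]
```

The 1-based (start, end) pairs are what the pipeline reports for each predefined motif against the consensus of the full alignment.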
5. Interpretation of the current result
Using this workflow, the motif that was fully conserved in the analyzed AdeB dataset was:
- AIQLS (positions 293–297 in the AdeB consensus sequence)
All five positions in this motif had entropy values of zero, indicating complete conservation at each residue.
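This residue-level check can be reproduced by cross-referencing the scores file with the motif coordinates. Below is a minimal sketch using a synthetic excerpt in the `Position N: value` format that the entropy script emits; the flanking score values are illustrative, not taken from the real dataset:

```python
import re

# Synthetic excerpt in the format written by the entropy script.
scores_text = """\
Position  292: 0.469
Position  293: -0.000
Position  294: -0.000
Position  295: -0.000
Position  296: -0.000
Position  297: -0.000
Position  298: 0.722
"""

# Parse "Position N: value" lines into {position: entropy}.
entropy = {
    int(m.group(1)): float(m.group(2))
    for m in re.finditer(r"Position\s+(\d+):\s+(-?\d+\.\d+)", scores_text)
}

start, end = 293, 297  # consensus coordinates of the AIQLS motif
fully_conserved = all(entropy[p] == 0.0 for p in range(start, end + 1))
print(fully_conserved)  # True: every motif position has zero entropy
```

Note that `float("-0.000")` compares equal to 0.0, so the negative zeros in the scores file do not need special handling.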
Why this pipeline is reproducible
The workflow is fully script-based. Each step produces explicit output files that are used as input by the next step, and the entire analysis can be run from the shell with a single wrapper script.
The standalone package therefore makes it possible to:
- reproduce the sequence retrieval,
- regenerate the alignments,
- recalculate all position-wise entropy scores,
- and recover the motif coordinates from the consensus sequence.
Source code
Below I include the full source code for each standalone component of the pipeline.
1. 1_fetch_AdeJ_and_AdeB.py
#!/usr/bin/env python3
"""Download length-filtered AdeJ and AdeB protein sequences from NCBI.
This script reproduces the first step of the AdeJ/AdeB motif-conservation
pipeline. It performs two NCBI Entrez protein searches, downloads the matching
FASTA records in batches, and applies a second length filter locally before
writing the final FASTA files.
"""
from __future__ import annotations
import argparse
from io import StringIO
from pathlib import Path
from time import sleep
from typing import Iterable
from Bio import Entrez, SeqIO
TARGETS = {
"AdeJ": {
"search_term": "Acinetobacter baumannii[organism] AND AdeJ[protein] AND 1000:1070[SLEN]",
"min_len": 1000,
"max_len": 1070,
"output_fasta": "adej_protein_sequences.filtered.fasta",
"report_tsv": "adej_filtered_out.tsv",
},
"AdeB": {
"search_term": "Acinetobacter baumannii[organism] AND AdeB[protein] AND 1000:1050[SLEN]",
"min_len": 1000,
"max_len": 1050,
"output_fasta": "adeb_protein_sequences.filtered.fasta",
"report_tsv": "adeb_filtered_out.tsv",
},
}
RETMAX = 10000
BATCH_SIZE = 200
PAUSE_SECONDS = 0.34
MAX_RETRIES = 3
def batched(items: list[str], batch_size: int) -> Iterable[list[str]]:
for start in range(0, len(items), batch_size):
yield items[start : start + batch_size]
def search_ids(search_term: str, retmax: int = RETMAX) -> list[str]:
with Entrez.esearch(db="protein", term=search_term, retmax=retmax) as handle:
record = Entrez.read(handle)
return record["IdList"]
def fetch_fasta_batch(id_batch: list[str]) -> str:
last_error = None
for attempt in range(1, MAX_RETRIES + 1):
try:
with Entrez.efetch(
db="protein",
id=",".join(id_batch),
rettype="fasta",
retmode="text",
) as handle:
return handle.read()
except Exception as exc: # pragma: no cover
last_error = exc
sleep(attempt)
raise RuntimeError(f"Failed to fetch batch after {MAX_RETRIES} attempts: {last_error}")
def fetch_and_filter_sequences(
protein_name: str,
ids: list[str],
min_len: int,
max_len: int,
output_fasta: str,
report_tsv: str,
) -> None:
kept = 0
rejected = 0
with open(output_fasta, "w") as fasta_out, open(report_tsv, "w") as report_out:
report_out.write("accession\tlength\treason\tdescription\n")
for id_batch in batched(ids, BATCH_SIZE):
fasta_text = fetch_fasta_batch(id_batch)
records = SeqIO.parse(StringIO(fasta_text), "fasta")
for record in records:
seq_len = len(record.seq)
if min_len <= seq_len <= max_len:
SeqIO.write(record, fasta_out, "fasta")
kept += 1
else:
                    reason = f"outside_{min_len}_{max_len}"
report_out.write(
f"{record.id}\t{seq_len}\t{reason}\t{record.description}\n"
)
rejected += 1
sleep(PAUSE_SECONDS)
print(
        f"{protein_name}: kept {kept} sequences in range {min_len}-{max_len} aa; "
f"filtered out {rejected}."
)
print(f" FASTA output : {Path(output_fasta).resolve()}")
print(f" Filter report: {Path(report_tsv).resolve()}")
def main() -> None:
parser = argparse.ArgumentParser(description="Download length-filtered AdeJ/AdeB protein sequences from NCBI.")
parser.add_argument("--email", required=True, help="Email address for NCBI Entrez.")
args = parser.parse_args()
Entrez.email = args.email
for protein_name, cfg in TARGETS.items():
ids = search_ids(cfg["search_term"])
print(f"{protein_name}: found {len(ids)} NCBI hits after length-restricted search")
fetch_and_filter_sequences(
protein_name=protein_name,
ids=ids,
min_len=cfg["min_len"],
max_len=cfg["max_len"],
output_fasta=cfg["output_fasta"],
report_tsv=cfg["report_tsv"],
)
if __name__ == "__main__":
main()
2. 2_run_mafft.sh
#!/usr/bin/env bash
set -euo pipefail
if ! command -v mafft >/dev/null 2>&1; then
echo "Error: mafft is not installed or not in PATH." >&2
exit 1
fi
ADEJ_FASTA="${1:-adej_protein_sequences.filtered.fasta}"
ADEB_FASTA="${2:-adeb_protein_sequences.filtered.fasta}"
mafft --adjustdirection --maxiterate 1000 --localpair "$ADEJ_FASTA" > adej_aligned.fasta
mafft --adjustdirection --clustalout --maxiterate 1000 --localpair "$ADEJ_FASTA" > adej_aligned.aln
mafft --adjustdirection --maxiterate 1000 --localpair "$ADEB_FASTA" > adeb_aligned.fasta
mafft --adjustdirection --clustalout --maxiterate 1000 --localpair "$ADEB_FASTA" > adeb_aligned.aln
3. 3_calculate_per_position_Shannon_entropy.py
#!/usr/bin/env python3
"""Calculate Shannon entropy for every position in a protein alignment.
By default this reproduces the original pipeline behavior and counts every
symbol in the alignment column, including gaps, when computing entropy.
"""
from __future__ import annotations
import argparse
import math
from collections import Counter
from typing import Iterable
from Bio import AlignIO
def shannon_entropy(column: Iterable[str]) -> float:
freqs = Counter(column)
total = float(sum(freqs.values()))
return -sum((count / total) * math.log2(count / total) for count in freqs.values())
def main() -> None:
parser = argparse.ArgumentParser(description="Calculate per-position Shannon entropy from an aligned FASTA file.")
parser.add_argument("alignment", help="Aligned FASTA file, e.g. adej_aligned.fasta")
parser.add_argument("-o", "--output", help="Optional output file. Defaults to stdout.")
args = parser.parse_args()
alignment = AlignIO.read(args.alignment, "fasta")
lines = []
for idx in range(alignment.get_alignment_length()):
column = [str(record.seq[idx]) for record in alignment]
score = shannon_entropy(column)
lines.append(f"Position {idx + 1:4d}: {score:.3f}")
text = "\n".join(lines) + "\n"
if args.output:
with open(args.output, "w") as handle:
handle.write(text)
else:
print(text, end="")
if __name__ == "__main__":
main()
4. 4_get_motif_positions.py
#!/usr/bin/env python3
"""Generate a consensus sequence and locate candidate motifs in that consensus."""
from __future__ import annotations
import argparse
from collections import Counter
from Bio import AlignIO
MOTIFS = {
"AdeJ": {
"DIKDY": "DIKDY",
"DNYQFDSK": "DNYQFDSK",
"AIKIA": "AIKIA",
"GNGQAS": "GNGQAS",
},
"AdeB": {
"DLSDY": "DLSDY",
"QAYNFAIL": "QAYNFAIL",
"AIQLS": "AIQLS",
"TSGTAE": "TSGTAE",
},
}
def generate_consensus(alignment) -> str:
consensus = []
for i in range(len(alignment[0])):
column = [str(record.seq[i]) for record in alignment]
most_common = Counter(column).most_common(1)[0][0]
consensus.append(most_common)
return "".join(consensus)
def find_motif_positions_in_consensus(seq: str, motif: str) -> list[tuple[int, int]]:
positions = []
    start = 0
while True:
found = seq.find(motif, start)
if found == -1:
break
positions.append((found + 1, found + len(motif)))
start = found + 1
return positions
def main() -> None:
parser = argparse.ArgumentParser(description="Find candidate motif positions in a consensus sequence derived from an alignment.")
parser.add_argument("alignment", help="Aligned FASTA file, e.g. adej_aligned.fasta")
parser.add_argument("protein", choices=["AdeJ", "AdeB"], help="Protein family for motif lookup")
parser.add_argument("-o", "--output", help="Optional output file. Defaults to stdout.")
args = parser.parse_args()
alignment = AlignIO.read(args.alignment, "fasta")
consensus_sequence = generate_consensus(alignment)
motif_positions_in_consensus = {}
for motif_name, motif_sequence in MOTIFS[args.protein].items():
positions = find_motif_positions_in_consensus(consensus_sequence, motif_sequence)
motif_positions_in_consensus[motif_name] = positions
text = (
f"Motif positions in the consensus sequence of {args.protein}:\n"
f"{motif_positions_in_consensus}\n\n"
f"Consensus sequence:\n{consensus_sequence}\n"
)
if args.output:
with open(args.output, "w") as handle:
handle.write(text)
else:
print(text, end="")
if __name__ == "__main__":
main()
5. run_pipeline.sh
#!/usr/bin/env bash
set -euo pipefail
if [[ $# -lt 1 ]]; then
    echo "Usage: bash run_pipeline.sh <your_email_for_ncbi>" >&2
exit 1
fi
EMAIL="$1"
python3 1_fetch_AdeJ_and_AdeB.py --email "$EMAIL"
bash 2_run_mafft.sh
python3 3_calculate_per_position_Shannon_entropy.py adej_aligned.fasta -o adej_aligned.scores
python3 3_calculate_per_position_Shannon_entropy.py adeb_aligned.fasta -o adeb_aligned.scores
python3 4_get_motif_positions.py adej_aligned.fasta AdeJ -o adej_motif_positions.txt
python3 4_get_motif_positions.py adeb_aligned.fasta AdeB -o adeb_motif_positions.txt
echo "Pipeline finished. Generated files:"
ls -1 adej_* adeb_*
6. requirements.txt
biopython>=1.80
How to reproduce the analysis
- Install Python dependencies:
pip install -r requirements.txt
- Install MAFFT and make sure it is available in your shell PATH.
- Run the complete workflow:
bash run_pipeline.sh your_email@example.com
Standalone package
For convenience, I also provide the standalone package as a zip archive so the full pipeline can be downloaded and executed locally.