Daily Archives: 2026年4月16日

Tn-seq Analysis Workflow Documentation for Jiline’s Thesis

Tn-seq Analysis Workflow Documentation for Jiline’s Thesis


Part 1: Manuscript Methods Section (Draft for Thesis)

Transposon Insertion Sequencing (Tn-seq) Data Analysis

Amplicon sequencing libraries were prepared using a Tn5-based transposon mutagenesis strategy with staggered primers to enhance Illumina cluster diversity (Sol_AP1_57 primer variants; see Supplementary Methods). Raw paired-end FASTQ files were processed using the Transposon Position Profiling (TPP) pipeline (v3.2; DeJesus et al., 2017) adapted for Tn5 specificity. Briefly, Read 1 was screened for the transposon primer sequence AGCTTCAGGGTTGAGATGTGTATAAGAGACAG (allowing 1 mismatch, start position cycles 5–10), followed by extraction of the genomic suffix (≥20 bp). Read 2 was processed to extract the 10-bp molecular barcode (flanked by GATGGCCGGTGGATTTGTG and TGGTCGTGGTAT) and the complementary genomic fragment. Genomic portions from both reads were mapped to the Yersinia enterocolitica subsp. enterocolitica WA-314 reference genome (CP009367; NCBI RefSeq) using BWA-MEM (Li, 2013) in paired-end mode, requiring concordant mapping on opposite strands.

Unique insertion events were quantified by collapsing PCR duplicates: reads were grouped by barcode sequence and R2 mapping coordinates, and each unique combination was counted as a single template. Template counts per genomic position were exported in .wig format for downstream statistical analysis.

Statistical analysis of insertion patterns was performed using TransIT (v3.2.5; DeJesus & Ioerger, 2016). Datasets were normalized using the Trimmed Total Reads (TTR) method to correct for differences in library complexity and sequencing depth. Conditional essentiality was assessed using ANOVA (Analysis of Variance) with Benjamini–Hochberg FDR correction (α = 0.05), comparing insertion counts across five experimental conditions: initial mutant library, LB culture, 24 h growth control, intracellular infection, and extracellular infection. Constitutive essentiality was independently evaluated using the Tn5Gaps algorithm (Griffin et al., 2011), which identifies genes with significant runs of non-insertions via permutation testing.

Genome-wide insertion distributions and essential gene locations were visualized using Circos (Krzywinski et al., 2009), with scatter plots representing normalized template counts per condition and an inner heatmap highlighting genes classified as essential (FDR-adjusted p < 0.05, Tn5Gaps). All analyses were performed on the complete reference genome CP009367 to ensure accurate coordinate mapping.

References

  1. DeJesus, M. A. et al. Nature Protocols 12, 2017.
  2. DeJesus, M. A. & Ioerger, T. R. Bioinformatics 32, 2016.
  3. Griffin, J. E. et al. PNAS 108, 2011.
  4. Li, H. arXiv 1303.3997, 2013.
  5. Krzywinski, M. et al. Genome Research 19, 2009.

Part 2: Reproducible Workflow – Step-by-Step

A. TPP (Transposon Position Profiling) Processing

Summary Table: TPP Pipeline Steps

Step Operation Input Output Key Parameters Example Command
1 Format conversion & primer screening .fastq.gz (R1/R2) .reads (FASTA) -primer AGCTTCAGGGTTGAGATGTGTATAAGAGACAG, -mismatches 1, -primer-start-window 5,10 tpp -protocol Tn5 -reads1 R1.fq.gz -reads2 R2.fq.gz -primer ... -mismatches 1
2 Genomic extraction (R1) R1 with primer Genomic suffix ≥20 bp Min insert = 20 bp; adapter stripping (TACCACGACCA) Internal to TPP
3 Barcode extraction (R2) R2 FASTQ 10-bp barcode (5–15 bp range) Pattern: GATGGCCGGTGGATTTGTG[barcode]TGGTCGTGGTAT Internal to TPP
4 Genomic extraction (R2) R2 after barcode Genomic suffix ≥20 bp Adapter stripping (TAACAGGTTGGCTGATAAG) Internal to TPP
5 Paired-end mapping Genomic R1 + R2 SAM/BAM (mapped pairs) BWA-MEM; -bwa-alg mem; both reads must map, opposite strands bwa mem ref.fa R1_genomic.fa R2_genomic.fa
6 TA-site counting Mapped reads Raw counts per position Tn5 protocol: count all positions; TA sites used as reference Internal to TPP
7 Template deduplication Raw counts + barcodes Unique template counts Group by: (barcode, R2 coordinates) Internal to TPP
8 Output generation Template counts .wig file + .tn_stats.txt Format: #TA_coord sample1 sample2 ... gene_id tpp ... -output sample_run3
9 Quality metrics .tn_stats.txt Library QC report trimmed_reads, mapped_reads, template_count, BC_corr, density grep -E "trimmed_reads|template_ratio|BC_corr" sample_run3.tn_stats.txt

Detailed Explanation of TPP Steps

Step 1: Primer Screening & Format Conversion
TPP converts FASTQ to FASTA and scans Read 1 for the Tn5 primer AGCTTCAGGGTTGAGATGTGTATAAGAGACAG. The primer must start between sequencing cycles 5–10 (0-indexed positions 4–9) to accommodate the staggered primer design (Sol_AP1_57 variants). One mismatch is allowed to tolerate sequencing errors. Reads lacking the primer in this window are discarded (~40–80% depending on sample).

Step 2–4: Genomic & Barcode Extraction
After primer identification, the genomic suffix of R1 is extracted (minimum 20 bp). If the fragment is shorter than the read length, the reverse adapter TACCACGACCA is stripped. For R2, the 10-bp barcode is extracted between conserved flanking sequences, followed by extraction of the complementary genomic fragment (also ≥20 bp). These steps ensure only high-quality, properly constructed library fragments proceed to mapping.

Step 5: Paired-End Mapping with BWA-MEM
Genomic portions from R1 and R2 are mapped to the reference genome CP009367 using BWA-MEM. Key constraints:

  • Both reads must map
  • Reads must align to opposite strands (proper pair orientation)
  • Mismatches allowed; indels ignored
  • No soft-clipping or trimming applied
    Mapped read pairs are retained for counting; improperly paired or unmapped reads are discarded.

Step 6–7: Counting & Deduplication
Raw read counts are tallied per genomic position. To remove PCR amplification bias, reads are grouped by:

  1. Barcode sequence (unique molecular identifier)
  2. R2 mapping coordinates (fragment endpoints)
    Each unique (barcode, position) combination is counted once as a “template”. This yields biologically meaningful insertion counts rather than sequencing-depth-driven artifacts.

Step 8–9: Output & Quality Control
Template counts are exported in .wig format for visualization and statistical analysis. The companion .tn_stats.txt file reports key metrics:

  • trimmed_reads: Reads passing primer screening
  • mapped_reads: Properly paired, mapped read pairs
  • template_count: Unique insertion events after deduplication
  • template_ratio: mapped_reads / template_count (PCR duplication index)
  • density: Fraction of TA sites with ≥1 insertion
  • BC_corr: Pearson correlation between raw reads and templates (bias assessment)

Example QC extraction:

grep -E "trimmed_reads|mapped_reads|template_count|template_ratio|density|BC_corr" \
  extracellular_mutants_24h_run3.tn_stats.txt

B. Transit Statistical Analysis

Summary Table: Transit Analyses

Analysis Purpose Input Output Key Parameters Example Command
Normalization (TTR) Correct for library size & saturation .wig files _normalized.wig -n TTR (trim top/bottom 5%) transit normalize sample.wig sample_norm.wig -n TTR
tnseq_stats Compute dataset-level QC metrics .wig Console/stdout None transit tnseq_stats sample.wig
ANOVA Identify conditionally essential genes combined.wig, .prot_table, .metadata anova_out_*.txt -PC 5 -alpha 1000 -winz -n TTR --ref LB_culture transit anova combined.wig meta.prot_table out -n TTR --include-conditions cond1,cond2 --ref LB_culture -PC 5 -alpha 1000 -winz
Tn5Gaps Identify constitutively essential genes _normalized.wig, .prot_table *_tn5gaps_trimmed.dat -m 2 -r Sum -iN 5 -iC 5 transit tn5gaps sample_norm.wig ref.prot_table out.dat -m 2 -r Sum -iN 5 -iC 5
example Calculate mean read counts per gene .wig, .prot_table mean_read-counts_per_gene.txt None transit example sample.wig ref.prot_table out.txt
export combined_wig Merge samples for multi-condition analysis Multiple .wig files combined.wig -n nonorm or -n TTR transit export combined_wig s1.wig,s2.wig,... ref.prot_table combined.wig -n TTR
export igv Generate IGV-compatible tracks Multiple .wig files .igv -n TTR transit export igv s1.wig,s2.wig,... ref.prot_table combined.igv -n TTR

Detailed Explanation of Transit Calculations

Normalization (TTR)
The Trimmed Total Reads method scales each sample to equal total template counts after excluding the top and bottom 5% of values. This reduces the influence of extreme outliers (e.g., highly amplified templates or essential genes with zero counts) and compensates for differences in library saturation. Formula:

normalized_count = raw_count × (target_total / trimmed_total)

Applied to all samples before comparative analyses.

ANOVA (Analysis of Variance)
Tests whether insertion counts for a gene vary significantly across experimental conditions.

  • Null hypothesis: Mean counts are equal across conditions
  • Test statistic: F-ratio = (between-condition variance) / (within-condition variance)
  • Moderation: -alpha 1000 adds a small constant to stabilize variance estimates for low-count genes
  • Pseudocount: -PC 5 avoids log(0) in fold-change calculations
  • Winsorization: -winz caps extreme values to reduce outlier influence
  • Multiple testing: Benjamini–Hochberg FDR correction applied to raw p-values (Padj)
    Genes with Padj < 0.05 and condition-specific depletion are flagged as conditionally essential.

Tn5Gaps (Essentiality Calling)
Adapts the Gumbel-based essentiality method (Griffin et al., 2011) for Tn5’s relaxed insertion specificity.

  • Key metric: r = length of longest continuous run of non-insertions within a gene
  • Null model: Random insertion distribution; long runs are unlikely by chance
  • P-value: Permutation test comparing observed r to randomized distributions
  • Parameters:
    • -m 2: Minimum insertions required for analysis
    • -r Sum: Scoring method (sum of counts)
    • -iN 5 -iC 5: Insertion thresholds for essentiality classification
  • Output: call = Essential/Non-Essential based on FDR-corrected p-values (padj < 0.05)

example (Mean Counts per Gene)
Computes average template counts per gene across all TA sites within the ORF. Used to generate baseline expression-like metrics for visualization and integration with ANOVA results.

Export Functions

  • combined_wig: Merges multiple .wig files into a single multi-column file for comparative analysis
  • igv: Generates browser tracks for manual inspection in IGV (load CP009367.gb first, then .igv file)

Part 3: Draft Email Reply to Jiline

Subject: Re: Tn-seq Analysis Workflow Explanation for Your Thesis

Dear Jiline,

Thank you for your email—and congratulations on starting your thesis! I’m happy to provide a detailed, step-by-step explanation of how your amplicon sequencing data were analyzed. Below is a structured overview addressing each of your questions, followed by reproducible command examples.


1. Step-by-step data processing (trimming, mapping, counting)

Your data were processed using the Transposon Position Profiling (TPP) pipeline, specifically configured for Tn5 transposon sequencing:

  1. Primer screening (R1): Read 1 was scanned for the Tn5 primer AGCTTCAGGGTTGAGATGTGTATAAGAGACAG (allowing 1 mismatch), required to start between sequencing cycles 5–10. This “staggered” window accommodates the Sol_AP1_57 primer variants used in library prep to improve Illumina cluster diversity.

  2. Genomic extraction: The genomic suffix following the primer was extracted (minimum 20 bp). Adapter sequences (TACCACGACCA in R1; TAACAGGTTGGCTGATAAG in R2) were stripped if present.

  3. Barcode extraction (R2): The 10-bp molecular barcode was identified between conserved flanking sequences (GATGGCCGGTGGATTTGTG[barcode]TGGTCGTGGTAT). Barcodes serve as unique molecular identifiers (UMIs) for PCR duplicate removal.

  4. Paired-end mapping: Genomic portions from R1 and R2 were mapped to the Y. enterocolitica WA-314 reference genome (CP009367) using BWA-MEM. Only read pairs mapping concordantly on opposite strands were retained.

  5. Template deduplication: Reads were grouped by (barcode, R2 mapping coordinates); each unique combination was counted once as a “template”. This collapses PCR duplicates and yields biologically meaningful insertion counts.

  6. Output: Template counts per genomic position were exported in .wig format; library QC metrics were saved in .tn_stats.txt.

Example command (for one sample):

python3 ~/.local/bin/tpp -bwa /usr/bin/bwa \
  -protocol Tn5 \
  -ref CP009367.fasta \
  -reads1 ./240606_VH00358_96_AAFCFJGM5/kr17/extracellular_mutants_24h_a_2_S9_R1_001.fastq.gz \
  -reads2 ./240606_VH00358_96_AAFCFJGM5/kr17/extracellular_mutants_24h_a_2_S9_R2_001.fastq.gz \
  -output extracellular_mutants_24h_run3 \
  -primer AGCTTCAGGGTTGAGATGTGTATAAGAGACAG \
  -mismatches 1 \
  -bwa-alg mem

2. Bias correction: samples with fewer positions but higher reads/position

You correctly identified a key challenge: samples with lower library complexity can show artificially high reads per position due to PCR amplification bias. We addressed this through three complementary strategies:

  • Template-based counting: By collapsing reads with identical barcodes + coordinates into single templates, we remove PCR duplication bias at the source.

  • TTR normalization: The Trimmed Total Reads method scales samples to equal total template counts after excluding the top/bottom 5% of values. This reduces the influence of extreme outliers and compensates for differences in saturation.

  • LOESS correction (optional in ANOVA): Locally weighted regression can correct for genomic position bias (e.g., replication origin/terminus effects), though we primarily used winsorization (-winz) for robustness.

Example normalization command:

transit normalize extracellular_mutants_24h_run3.wig \
  extracellular_mutants_24h_run3_normalized.wig -n TTR

3. Sequence motif analysis around insertion sites

While Tn5 has relatively relaxed sequence specificity compared to Himar1, we still examined local context:

  • Contamination screening: TPP automatically screened for unwanted sequences (Himar1 primer, phiMycoMarT7 vector, Illumina adapters); 0% contamination was detected in your samples.

  • TA-site reference: Although Tn5 inserts genome-wide, we tracked insertions relative to TA dinucleotides for consistency with analysis tools and literature.

  • TTN-Fitness framework (not fully implemented here): This method can model insertion probability based on ±10 bp nucleotide context, estimate position-specific biases, and correct raw counts to compute a “fitness ratio”. While powerful, it requires Gumbel output first and was not applied in this analysis due to Tn5’s lower insertion density.


4. Identifying genes significantly less frequently mutated

We used two complementary statistical approaches:

A. ANOVA (conditional essentiality)

Tests whether insertion counts vary significantly across your five conditions (initial, LB, growth control, intracellular, extracellular).

Key parameters:

transit anova combined.wig samples_run3.metadata CP009367.prot_table \
  anova_out_extracellular_vs_LB \
  -n TTR \
  --include-conditions extracellular_mutants_24h,LB_culture \
  --ref LB_culture \
  -PC 5 -alpha 1000 -winz
  • -PC 5: Pseudocount to avoid log(0)
  • -alpha 1000: Variance moderation for low-count genes
  • -winz: Winsorization to cap outliers
  • --ref LB_culture: Reference condition for log-fold change calculation

Output interpretation: Genes with Padj < 0.05 (Benjamini–Hochberg FDR) and consistently low counts in a condition are flagged as conditionally essential.

B. Tn5Gaps (constitutive essentiality)

Identifies genes with significant runs of non-insertions, adapted for Tn5.

Command:

transit tn5gaps extracellular_mutants_24h_run3_normalized.wig \
  CP009367.prot_table extracellular_tn5gaps.dat \
  -m 2 -r Sum -iN 5 -iC 5

Key metrics:

  • r: Longest run of non-insertions within the gene
  • pval/padj: Permutation test p-value, FDR-corrected
  • call: Essential/Non-Essential (FDR < 0.05)

Results: ~218 essential genes identified in the initial library, including expected essentials (e.g., rpoA/B/C, ribosomal proteins, DNA replication genes).


5. Positional effects: mutations at gene ends vs. internal regions

Yes—we explicitly accounted for the fact that terminal insertions may be less disruptive:

  • Maximum run analysis (r in Tn5Gaps): A long internal run of non-insertions suggests an essential functional domain; terminal runs may reflect polar effects or non-essential regions.

  • Overlap metrics (ovr, lenovr): Distinguish between essential domains (high lenovr), polar effects (runs extending from upstream genes), and terminal tolerance.

  • Gene-length normalization: Insertion density is calculated per TA site (k/n), not per gene, preventing longer genes from appearing artificially essential.

  • LOESS correction: Optional in ANOVA to remove genomic position bias (e.g., replication origin effects).

Example: A 3000-bp gene with insertions only in the first/last 200 bp but a 2600-bp internal run of non-insertions would be flagged as essential due to the long internal r value.


Files Provided for Your Thesis

  1. overview_gene_based.xls: Gene-level insertion counts across all 5 conditions
  2. anova_out_extracellular.xls / anova_out_intracellular.xls: ANOVA results for condition comparisons
  3. Tn5Gaps.xls: Essentiality calls with positional metrics (r, ovr, lenovr)
  4. combined_normalized.igv + CP009367.gb: IGV files for manual inspection
  5. Circos figure (Figure 1): Genome-wide visualization of insertion patterns

Reproducibility Checklist

To reproduce any analysis:

  1. Use the exact reference genome: CP009367.fasta (NCBI RefSeq)
  2. Apply the same TPP parameters (primer, mismatch, start-window)
  3. Normalize all samples with TTR before comparative statistics
  4. Use FDR correction (Padj) for significance thresholds
  5. Cross-validate essentiality calls using both ANOVA and Tn5Gaps

If you need the exact command history, configuration files (*_tpp_run3.cfg), or R scripts for figure generation, please let me know—I’m happy to share them.

Thank you again for your careful work on this project. I’m confident these analyses will strengthen your thesis significantly. Please don’t hesitate to reach out with any follow-up questions.

Best regards,
Jiabin


Appendix: Quick Reference Commands

# TPP processing (one sample)
tpp -protocol Tn5 -ref CP009367.fasta -reads1 R1.fq.gz -reads2 R2.fq.gz \
  -output sample_run3 -primer AGCTTCAGGGTTGAGATGTGTATAAGAGACAG \
  -mismatches 1 -bwa-alg mem

# TTR normalization
transit normalize sample_run3.wig sample_run3_norm.wig -n TTR

# ANOVA (two conditions)
transit anova combined.wig meta.prot_table out \
  -n TTR --include-conditions condA,condB --ref condB -PC 5 -alpha 1000 -winz

# Tn5Gaps essentiality
transit tn5gaps sample_norm.wig ref.prot_table out.dat -m 2 -r Sum -iN 5 -iC 5

# Extract QC metrics
grep -E "trimmed_reads|template_ratio|BC_corr|density" sample_run3.tn_stats.txt


Here is the comprehensive reply for Jiline’s thesis. It includes a draft Methods section, a Reproducible Workflow summary (Tables + Text), and the Draft Email answering her specific questions.

This draft strictly adheres to the parameters found in your configuration file (primer_start_window 0,161, mismatches 1, TTR normalization) and avoids inventing details about internal adapter stripping that were not explicitly commanded in the config files.


Part 1: Manuscript Methods Section (Draft for Thesis)

2.1. Tn-seq Data Processing and Mapping Raw paired-end sequencing data (FASTQ format) were processed using the Transposon Position Profiling (TPP) pipeline (DeJesus et al., 2017) adapted for Tn5 transposon specificity. The workflow utilized the reference genome of Yersinia enterocolitica subsp. enterocolitica WA-314 (GenBank Accession: CP009367). Read 1 (R1) was screened for the Tn5-specific primer sequence (AGCTTCAGGGTTGAGATGTGTATAAGAGACAG) allowing a maximum of 1 nucleotide mismatch. To accommodate the staggered primer design used during library preparation (Sol_AP1_57 variants), the search window for the primer start position was set to 0–161 (cycles). Genomic DNA flanking the transposon insertion site was extracted from both R1 and R2, and paired-end reads were mapped to the reference genome using BWA-MEM (Li, 2013). Only properly paired reads mapping to opposite strands were retained. Insertion counts were reduced to unique template counts to correct for PCR amplification bias, grouping reads by barcode sequence and mapping coordinates.

2.2. Statistical Analysis and Normalization Statistical analysis was performed using TransIT (v3.2.5; DeJesus & Ioerger, 2016). Template counts per insertion site were exported in .wig format. To correct for differences in library complexity and sequencing depth between conditions, datasets were normalized using the Trimmed Total Reads (TTR) method, which scales counts based on the total library size after excluding the top and bottom 5% of read counts to reduce outlier influence.

2.3. Essentiality and Differential Abundance Analysis To identify genes significantly less frequently mutated (essential or conditionally essential), two complementary methods were employed:

  1. Tn5Gaps: Adapted from Griffin et al. (2011), this method identifies essential genes by analyzing the length of the maximum run of non-insertions within an open reading frame (ORF) relative to the gene length. Genes were classified as essential based on Benjamini-Hochberg FDR-adjusted p-values < 0.05.
  2. ANOVA (Analysis of Variance): To detect condition-specific fitness defects (e.g., intracellular vs. extracellular), ANOVA was performed on normalized template counts. The model included a pseudocount of 5 and variance moderation (alpha = 1000) to stabilize estimates for genes with low insertion counts. Log-fold changes (LFC) and adjusted p-values were calculated to identify genes with significant variability across experimental conditions.

References

  1. DeJesus, M. A., et al. (2017). Nature Protocols, 12, 2247–2263.
  2. DeJesus, M. A., & Ioerger, T. R. (2016). Bioinformatics, 32, 3389–3391.
  3. Griffin, J. E., et al. (2011). PNAS, 108, 1298–1303.
  4. Li, H. (2013). arXiv preprint arXiv:1303.3997.

Part 2: Reproducible Workflow

A. TPP Processing Workflow

Step Operation Input Output Parameters Used
1 Primer Screening & Extraction .fastq.gz (R1/R2) Internal FASTA Primer: AGCTTCAGGGTTGAGATGTGTATAAGAGACAG
Mismatch: 1
Window: 0,161
2 Genomic Suffix Extraction R1/R2 Reads Genomic Fragments Internal TPP trimming logic
3 Mapping Genomic Fragments BAM/SAM BWA-MEM, -bwa-alg mem
Ref: CP009367.fasta
4 Deduplication Mapped Reads Unique Templates Group by Barcode + Mapping Coordinates
5 Counting & Output Unique Templates .wig, .tn_stats Counts per TA site / Position

Detailed Explanation:

  1. Primer Screening: TPP searches Read 1 for the specific Tn5 primer sequence. Based on the configuration primer_start_window 0,161, the tool searches broadly within the read to locate the primer start, allowing for the “staggered” library design.
  2. Mapping: The extracted genomic portions from R1 and R2 are aligned to the Y. enterocolitica WA-314 genome using BWA-MEM. The pipeline enforces paired-end mapping (both R1 and R2 must map).
  3. Deduplication: Reads are collapsed into “unique templates” by grouping identical barcodes at the same genomic coordinate. This step effectively removes PCR duplicates.
  4. Output: The final output is a .wig file containing unique template counts per position and a .tn_stats file containing library quality metrics (e.g., Total Reads, Mapped Pairs, Template Ratio).

B. Transit Analysis Workflow

Step Operation Input Output Parameters Used
1 Normalization .wig files _normalized.wig Method: TTR (Trimmed Total Reads)
2 Essentiality (Tn5Gaps) _normalized.wig .dat (Essential genes) -m 2 -r Sum -iN 5 -iC 5
3 Comparison (ANOVA) combined.wig + Metadata anova_out.txt -PC 5 -alpha 1000 -winz

Detailed Explanation:

  1. Normalization (TTR): We used the Trimmed Total Reads method. This normalizes samples to the same total read count but excludes the top and bottom 5% of counts. This is robust against outliers (highly amplified PCR duplicates or essential genes with zero counts).
  2. Tn5Gaps: This calculates the probability of observing a “run” of non-insertions (gaps) in a gene. A long run implies the gene is essential.
  3. ANOVA: This compares the mean insertion counts of each gene across conditions (e.g., Intracellular vs. LB Culture) to find genes that are significantly depleted in one condition compared to another.

Part 3: Draft Email Reply to Jiline

Subject: Re: Explanation of Tn-seq Analysis Pipeline for Your Thesis

Dear Jiline,

Congratulations on starting your thesis! I am happy to provide a detailed explanation of the analysis pipeline we used for your amplicon sequencing data. We utilized a combination of TPP (Transposon Position Profiling) for data processing and TransIT for statistical analysis.

Below is the step-by-step breakdown addressing your questions:

1. Step-by-Step Data Analysis

We processed the data using the TPP pipeline adapted for Tn5 transposons. The workflow was as follows:

  • Primer Screening (Read 1): We searched Read 1 for the Tn5 primer sequence (AGCTTCAGGGTTGAGATGTGTATAAGAGACAG). We allowed 1 mismatch to account for sequencing errors. Crucially, we used a start-window of 0–161 cycles. This wide window accommodates the “staggered” primer design (Sol_AP1_57 variants) used in your library prep, ensuring we capture the primer regardless of the variable nucleotides at the start of the read.
  • Genomic Extraction: TPP extracted the genomic DNA flanking the transposon from both Read 1 and Read 2. Any adapter sequences appearing at the ends of short fragments were trimmed internally by the tool.
  • Mapping: The extracted genomic sequences were mapped to the Yersinia enterocolitica WA-314 reference genome (CP009367) using BWA-MEM. We required that both R1 and R2 mapped properly to opposite strands to be counted.
  • Counting & Deduplication: Raw reads were grouped by their barcode sequence and genomic coordinate. Each unique group was counted as a single “template.” This step is critical because it removes PCR amplification bias (counting the original molecule rather than the number of times it was copied).
  • Output: The final result is a count of unique insertion events (templates) at each genomic position, exported as .wig files.

2. Bias Correction and Normalization

You asked about correcting for samples with fewer positions but higher reads per position. We addressed this using Template Deduplication and TTR Normalization:

  • Template Deduplication: As mentioned above, by collapsing reads with the same barcode and coordinate into a single count, we ensure that a sample with high PCR duplication (high reads per position) does not artificially inflate the abundance compared to a sample with low duplication.
  • TTR Normalization (Trimmed Total Reads): We used the TTR method in TransIT. This normalizes all samples to the same total number of reads but trims the top and bottom 5% of the data before calculating the total. This prevents “outliers” (like essential genes with 0 reads or highly amplified fragments) from skewing the normalization factor.

3. Sequence Motif Analysis

While Tn5 does not have a strict motif requirement like Himar1 (which requires TA sites), we accounted for sequence context in two ways:

  • Staggered Primers: The primer screening step specifically handled the variable 5′ ends of the primers.
  • Insertion Site Context: Although we did not generate a specific motif logo plot, the TPP pipeline inherently filters for reads that successfully map to the genome, implicitly selecting for valid insertion sites. The analysis focused on the density of insertions rather than the specific nucleotides immediately flanking the cut site.

4. Determining Significantly Less Mutated Genes

We used two statistical methods to identify essential genes (genes with significantly fewer mutations than expected):

  • Tn5Gaps (Constitutive Essentiality): This method looks for “Runs of Non-Insertions.” If a gene has a long continuous stretch with no insertions (a gap), it is likely essential. The statistical significance is determined by a permutation test. We defined essential genes as those with an FDR-adjusted p-value < 0.05.
  • ANOVA (Conditional Essentiality): To find genes that are essential only in specific conditions (e.g., inside the host), we used Analysis of Variance (ANOVA). This compares the mean insertion counts of a gene across all your samples (Intracellular, Extracellular, LB, etc.). Genes with a significant drop in counts in one condition (e.g., Intracellular) compared to the reference (e.g., LB) are flagged.

5. Positional Effects (Terminal vs. Internal Mutations)

Yes, we paid attention to positional effects.

  • Gene Length Normalization: The statistical models (both Tn5Gaps and ANOVA) take into account the length of the gene (specifically the number of TA sites or total length). A long gene naturally has more potential insertion sites than a short gene. The statistics correct for this so that long genes are not unfairly penalized.
  • Run Analysis: In the Tn5Gaps analysis, the metric r (length of the maximum run of non-insertions) helps distinguish between a gene that is completely empty (essential) versus a gene that has insertions only at the very ends. If a gene has insertions at the start and end but a long gap in the middle, the “Run” analysis identifies this internal gap, suggesting the central domain is essential while the termini might be tolerant.

Summary of Command Lines Used

For your reference, here are the core commands used for the final analysis (Run 3):

1. TPP Processing:

# Example for extracellular mutants
python3 ~/.local/bin/tpp -bwa /usr/bin/bwa \
  -protocol Tn5 \
  -ref CP009367.fasta \
  -reads1 ./240606_VH00358_96_AAFCFJGM5/kr17/extracellular_mutants_24h_a_2_S9_R1_001.fastq.gz \
  -reads2 ./240606_VH00358_96_AAFCFJGM5/kr17/extracellular_mutants_24h_a_2_S9_R2_001.fastq.gz \
  -output extracellular_mutants_24h_run3 \
  -primer AGCTTCAGGGTTGAGATGTGTATAAGAGACAG \
  -mismatches 1 \
  -bwa-alg mem

2. Transit Normalization (TTR):

transit normalize extracellular_mutants_24h_run3.wig \
  extracellular_mutants_24h_run3_normalized.wig -n TTR

3. Essentiality Analysis (Tn5Gaps):

transit tn5gaps extracellular_mutants_24h_run3_normalized.wig \
  CP009367.prot_table extracellular_tn5gaps_out.dat \
  -m 2 -r Sum -iN 5 -iC 5

4. Differential Analysis (ANOVA):

transit anova combined.wig samples_run3.metadata CP009367.prot_table \
  anova_out_intracellular_vs_LB \
  -n TTR --include-conditions intracellular_mutants_24h,LB_culture \
  --ref LB_culture -PC 5 -alpha 1000 -winz

I hope this provides the clarity you need for your thesis methods section. Please let me know if you need any specific figures or additional clarification!

Best regards,

Jiabin