Part 1: Manuscript Methods Section
Raw paired-end sequencing data in FASTQ format were processed using the TnSeq Pre-Processor (TPP) pipeline (DeJesus et al., 2017), adapted for the Tn5 transposon. The analysis was performed using the reference genome of Yersinia enterocolitica subsp. enterocolitica WA-314 (GenBank accession: CP009367).
Read 1 (R1) was screened for the Tn5-specific primer sequence (AGCTTCAGGGTTGAGATGTGTATAAGAGACAG), allowing up to one nucleotide mismatch. Genomic DNA flanking the transposon insertion site was extracted from R1 and R2 reads, and paired-end reads were aligned to the reference genome using BWA-MEM (Li, 2013). Only properly paired reads mapping to opposite strands were retained for further analysis.
Unique insertion events were quantified after collapsing PCR duplicates. Reads were grouped according to barcode sequence and mapping coordinates, and each unique combination was counted as a single template. Template counts at each genomic position were exported in .wig format for downstream statistical analysis.
Statistical analysis of insertion patterns was conducted using Transit (v3.2.5; DeJesus & Ioerger, 2016). Datasets were normalized using the Trimmed Total Reads (TTR) method to correct for differences in library complexity and sequencing depth. Conditional essentiality was assessed by analysis of variance (ANOVA), followed by Benjamini-Hochberg false discovery rate (FDR) correction (α = 0.05), comparing insertion counts across five experimental conditions: initial mutant library, LB culture, 24-hour growth control, intracellular infection, and extracellular infection. Constitutive essentiality was evaluated independently using the Tn5Gaps algorithm (Griffin et al., 2011), which identifies genes containing significant runs of non-insertions by permutation testing.
Genome-wide insertion distributions and essential gene locations were visualized using Circos (Krzywinski et al., 2009). Scatter plots represented normalized template counts for each condition, and an inner heatmap highlighted genes classified as essential (FDR-adjusted p-value < 0.05, Tn5Gaps). All analyses were performed on the complete reference genome CP009367 to ensure accurate coordinate mapping.
References
- DeJesus, M. A. et al. Nature Protocols 12, 2017.
- DeJesus, M. A. & Ioerger, T. R. Bioinformatics 32, 2016.
- Griffin, J. E. et al. PNAS 108, 2011.
- Li, H. arXiv 1303.3997, 2013.
- Krzywinski, M. et al. Genome Research 19, 2009.
Part 2: Summary Table – Key Quality Metrics (Run3 – Final Analysis)
| METRIC | INITIAL_MUTANTS | LB_CULTURE | GROWTHOUT_CONTROL_24H | INTRACELLULAR_MUTANTS_24H | EXTRACELLULAR_MUTANTS_24H |
|---|---|---|---|---|---|
| Total reads | 49,821,406 | 43,486,192 | 70,663,823 | 51,244,639 | 47,473,664 |
| Valid Tn prefix | 20,339,623 (40.8%) | 22,631,019 (52.0%) | 26,777,280 (37.9%) | 23,204,461 (45.3%) | 9,358,660 (19.7%) |
| Mapped read pairs | 16,445,755 (80.9%) | 19,994,409 (88.4%) | 24,141,881 (90.2%) | 20,909,755 (90.1%) | 6,588,961 (70.4%) |
| Unique templates | 2,559,561 | 3,393,325 | 3,642,183 | 1,476,522 | 248,080 |
| Template ratio | 6.43 | 5.89 | 6.63 | 14.16 | 26.56 |
| Density (TAs hit/total) | 0.026 | 0.026 | 0.026 | 0.022 | 0.012 |
| BC_corr | 0.921 | 0.930 | 0.918 | 0.911 | 0.824 |
Interpretation:
- BC_corr > 0.9 for four of the five samples indicates strong concordance between raw reads and deduplicated templates and is generally consistent with minimal PCR amplification bias.
- The extracellular_mutants_24h sample shows reduced library complexity (19.7% valid prefix, template ratio = 26.56, BC_corr = 0.824). This pattern likely reflects strong biological selection during extracellular growth.
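The percentages and template ratios in the table can be reproduced from the raw counts. The sketch below assumes (this is inferred from the numbers, not stated in the pipeline output) that the valid-prefix percentage is taken relative to total reads, the mapped percentage relative to valid-prefix reads, and the template ratio as mapped read pairs per unique template. The function name is illustrative.

```python
def qc_ratios(total_reads, valid_prefix, mapped_pairs, unique_templates):
    """Return (% valid prefix, % mapped of valid-prefix reads, reads per template).

    Denominators are an assumption inferred from the summary table.
    """
    pct_prefix = 100.0 * valid_prefix / total_reads
    pct_mapped = 100.0 * mapped_pairs / valid_prefix
    template_ratio = mapped_pairs / unique_templates
    return round(pct_prefix, 1), round(pct_mapped, 1), round(template_ratio, 2)


# Counts copied from the INITIAL_MUTANTS and EXTRACELLULAR_MUTANTS_24H columns.
initial = qc_ratios(49_821_406, 20_339_623, 16_445_755, 2_559_561)
extracellular = qc_ratios(47_473_664, 9_358_660, 6_588_961, 248_080)
print(initial)        # (40.8, 80.9, 6.43)
print(extracellular)  # (19.7, 70.4, 26.56)
```

A high template ratio means that many reads collapse onto few templates, which is why the extracellular sample stands out.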
Part 3: Step-by-Step Data Processing (TPP Pipeline)
Primer Screening & Genomic Extraction
- Primer: AGCTTCAGGGTTGAGATGTGTATAAGAGACAG (Tn5-specific)
- Parameters: One mismatch allowed
- Genomic extraction: Suffix ≥20 bp downstream of the primer; adapter stripping was applied for short fragments
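The screening step can be illustrated with a minimal sketch. It is not TPP's implementation: it assumes the primer may occur at any offset in R1, counts mismatches as Hamming distance (substitutions only), and uses hypothetical function names.

```python
PRIMER = "AGCTTCAGGGTTGAGATGTGTATAAGAGACAG"  # Tn5-specific primer from the text


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def extract_genomic_suffix(read, primer=PRIMER, max_mismatches=1, min_suffix=20):
    """Return the sequence downstream of the primer, or None if the read fails.

    A read fails when no window matches the primer within `max_mismatches`
    or when fewer than `min_suffix` bases follow the primer.
    """
    plen = len(primer)
    for start in range(len(read) - plen + 1):
        if hamming(read[start:start + plen], primer) <= max_mismatches:
            suffix = read[start + plen:]
            return suffix if len(suffix) >= min_suffix else None
    return None


genomic = "ACGT" * 6  # 24 bp of mock genomic sequence
one_mismatch = PRIMER[:5] + "T" + PRIMER[6:]  # C -> T at position 5
print(extract_genomic_suffix(one_mismatch + genomic))
```

Reads that pass return only the genomic portion, which is what would be handed to the aligner.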
Paired-End Mapping
- Tool: BWA-MEM (bwa-alg mem)
- Requirements: Both R1 and R2 were required to map to opposite strands on reference CP009367
Template Deduplication
- Reads were grouped by (barcode, mapping coordinates)
- Each unique combination was counted once as a “template” to remove PCR duplicates
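As an illustration of the deduplication logic, the sketch below collapses reads that share a barcode and mapping coordinates and then counts templates per insertion position. The record layout (barcode, insertion coordinate, fragment-end coordinate) is an assumption made for the example, not the pipeline's internal format.

```python
from collections import Counter


def read_counts(alignments):
    """Raw reads per insertion coordinate (no deduplication)."""
    return Counter(ins for _, ins, _ in alignments)


def template_counts(alignments):
    """Unique (barcode, insertion, fragment end) combinations per insertion coordinate."""
    unique = set(alignments)  # identical tuples are treated as PCR duplicates
    return Counter(ins for _, ins, _ in unique)


alignments = [
    ("AAAC", 100, 350),
    ("AAAC", 100, 350),  # PCR duplicate of the first read
    ("GGTA", 100, 350),  # same coordinates, different barcode
    ("AAAC", 100, 410),  # same barcode, different fragment end
    ("CCGT", 2500, 2700),
]
print(read_counts(alignments))      # Counter({100: 4, 2500: 1})
print(template_counts(alignments))  # Counter({100: 3, 2500: 1})
```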
Output
- .wig files: Template counts per genomic position
- .tn_stats.txt: Library QC metrics
Part 4: Statistical Analysis (Transit)
Normalization: TTR Method
- Trimmed Total Reads: Scales samples to equal total counts after excluding the top and bottom 5% of values.
- Purpose: Reduces the influence of outliers, including essential genes with zero counts or highly amplified templates.
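The idea behind TTR can be sketched as follows. This is a simplified illustration rather than Transit's exact code: it assumes each sample is summarized by its insertion density multiplied by the trimmed mean of its non-zero counts, and that samples are scaled to the average of that quantity. Function names are hypothetical.

```python
def density(counts):
    """Fraction of sites with at least one count."""
    return sum(c > 0 for c in counts) / len(counts)


def trimmed_nonzero_mean(counts, trim=0.05):
    """Mean of non-zero counts after dropping the lowest and highest `trim` fraction.

    Raises ZeroDivisionError if the sample has no non-zero counts.
    """
    nonzero = sorted(c for c in counts if c > 0)
    cut = int(len(nonzero) * trim)
    kept = nonzero[cut:len(nonzero) - cut]
    return sum(kept) / len(kept)


def ttr_like_factors(samples, trim=0.05):
    """One scale factor per sample, bringing density * trimmed mean to a common target."""
    summaries = [density(s) * trimmed_nonzero_mean(s, trim) for s in samples]
    target = sum(summaries) / len(summaries)
    return [target / s for s in summaries]


sample_a = [0, 0] + list(range(1, 21))
sample_b = [2 * c for c in sample_a]  # same library sequenced twice as deeply
print(ttr_like_factors([sample_a, sample_b]))  # approximately [1.5, 0.75]
```

Because the extremes are trimmed before the scale factor is computed, a handful of highly amplified sites does not dominate the normalization.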
Differential Essentiality Analysis: ANOVA
transit anova combined.wig samples_run3.metadata CP009367.prot_table \
anova_out_intracellular_vs_LB \
-n TTR --include-conditions intracellular_mutants_24h,LB_culture \
--ref LB_culture -PC 5 -alpha 1000 -winz
| PARAMETER | MEANING | RATIONALE |
|---|---|---|
| --include-conditions | Conditions to compare (comma-separated) | Enables pairwise or multi-condition comparisons |
| --ref | Reference condition for LFC calculation | Log-fold changes are computed relative to this baseline |
| -PC 5 | Pseudocount used when calculating LFCs | Avoids log(0) and stabilizes low-count estimates |
| -alpha 1000 | Variance moderation parameter (value added to the within-condition mean square in the F-test) | Makes genes with low counts less likely to reach significance |
| -winz | Winsorization flag | Replaces the highest count in each gene and condition with the second-highest value to reduce the influence of outliers |
| -n TTR | Normalization method | Applies Trimmed Total Reads normalization before analysis |
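The effect of the pseudocount on the log-fold change can be shown with a short sketch. It assumes the LFC is computed from per-condition mean counts with the pseudocount added to both numerator and denominator; the function name is illustrative.

```python
import math


def log2_fold_change(mean_condition, mean_reference, pseudocount=5.0):
    """Log2 ratio of condition to reference mean counts, stabilized by a pseudocount."""
    return math.log2((mean_condition + pseudocount) / (mean_reference + pseudocount))


print(log2_fold_change(35.0, 5.0))  # 2.0  (40 / 10)
print(log2_fold_change(0.0, 15.0))  # -2.0 (5 / 20), finite despite a zero mean
print(log2_fold_change(0.0, 0.0))   # 0.0  (no counts in either condition)
```

Without the pseudocount, the second and third cases would be undefined.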
KEY METRICS IN OUTPUT:
| COLUMN | DEFINITION |
|---|---|
| Orf/Rv | Gene identifier (e.g., CH47_1012) |
| Gene | Gene name (e.g., pncB, phoQ) |
| TAs | Number of TA dinucleotides within the ORF |
| Mean_[condition] | Mean normalized template count per condition |
| LFC_[condition] | Log₂ fold change relative to the reference condition |
| Fstat | F-test statistic (between-condition / within-condition variance) |
| Pval | Raw p-value from the ANOVA F-test |
| Padj | Benjamini-Hochberg FDR-adjusted p-value |
| status | Quality control flags (e.g., “No counts in all conditions”) |
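The Fstat column can be illustrated with a one-way ANOVA sketch in plain Python. The moderation term alpha is added to the within-condition mean square, which is an assumption about how the variance moderation enters the statistic; converting F to a p-value would additionally require the F distribution (for example from SciPy) and is omitted here.

```python
def f_statistic(groups, alpha=0.0):
    """One-way ANOVA F = MS_between / (MS_within + alpha).

    groups: list of lists, one list of counts per condition.
    With alpha = 0 and zero within-group variance this raises ZeroDivisionError.
    """
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / (ms_within + alpha)


counts = [[1, 2, 3], [4, 5, 6]]  # two conditions, three sites each
print(f_statistic(counts))              # 13.5
print(f_statistic(counts, alpha=1000))  # about 0.0135
```

The second call shows how a large alpha suppresses F for genes whose counts are small in absolute terms.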
Results saved in: anova_out_intracellular.xls, anova_out_extracellular.xls, heatmap_q0.05.png
Essentiality Analysis: Tn5Gaps
transit tn5gaps ${sample}_run3_normalized.wig CP009367.prot_table \
${sample}_tn5gaps_trimmed.dat -m 2 -r Sum -iN 5 -iC 5
| PARAMETER | MEANING | RATIONALE |
|---|---|---|
| -m 2 | Smallest read count to consider | Sites with a count below 2 are treated as having no insertion, which filters spurious low-count sites |
| -r Sum | Replicate handling: sum of counts | Combines counts across replicate .wig files before the gap analysis |
| -iN 5 | Ignore insertions within the first 5% of the ORF (N-terminus) | Insertions near the start of a gene may not disrupt its function |
| -iC 5 | Ignore insertions within the last 5% of the ORF (C-terminus) | Insertions near the end of a gene are often tolerated |
KEY METRICS IN OUTPUT:
| COLUMN | DEFINITION |
|---|---|
| k | Observed insertions within the ORF |
| n | Total TA dinucleotides within the ORF |
| r | Length of the maximum run of non-insertions |
| pval/padj | Permutation test p-value, FDR-corrected |
| call | Essential/Non-essential (FDR < 0.05) |
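The quantities k, n, and r can be computed from the per-site counts of a single ORF as in the sketch below. The input layout (a list of counts in genomic order) and the min_count threshold are assumptions for illustration.

```python
def gene_run_stats(counts, min_count=1):
    """Return (k, n, r) for one ORF.

    counts:    per-site counts within the ORF, in genomic order
    min_count: hypothetical threshold; sites below it are treated as empty
    k: sites with an insertion, n: total sites, r: longest run of empty sites
    """
    k = sum(c >= min_count for c in counts)
    n = len(counts)
    longest = current = 0
    for c in counts:
        current = current + 1 if c < min_count else 0
        longest = max(longest, current)
    return k, n, longest


print(gene_run_stats([3, 0, 0, 0, 5, 0, 2]))     # (3, 7, 3)
print(gene_run_stats([1, 0, 4], min_count=2))    # (1, 3, 2)
```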
Results Summary:
- ~218 essential genes were identified in the initial mutant library (~5.4% of annotated genes).
- Typical essential genes confirmed:
  - Ribosomal proteins: rpmJ, rpsM, rpsK, rpsD, rplQ, rpmI, rplT
  - RNA polymerase: rpoA (alpha subunit)
  - Translation factors: infC (IF-3), pheS/pheT (Phe-tRNA ligase)
  - Protein translocation: secY, secD, secF (Sec translocase)
  - DNA replication: dnaA, dnaN
  - Cell division: ftsH (FtsH protease)
  - tRNA processing: thrS (Thr-tRNA ligase)
  - Nucleoid organization: ihfA/ihfB (integration host factor)
  - Ribosome maturation: rimP, rbfA
  - Central metabolism: glmM (phosphoglucosamine mutase)
Most of these genes are essential in a wide range of bacterial species, which supports the validity of the analysis pipeline.
Results saved in: Tn5Gaps.xls
Part 5: Circos Visualization – Genome-Wide Insertion Patterns
To visualize transposon insertion distributions across the Y. enterocolitica WA-314 genome, a Circos plot was generated with the following structure:
Figure Layout
- Outermost ring: Genome ideogram with kilobase scale markers
- Five concentric scatter rings: Normalized template counts per insertion site for each experimental condition (extracellular, intracellular, growth control, LB culture, initial mutants), color-coded for distinction
- Innermost heatmap ring: Locations of genes classified as essential by Tn5Gaps analysis (FDR-adjusted p-value < 0.05)
Data Preparation Workflow
- Input processing: The normalized combined.wig file, containing template counts per TA site across all conditions, was parsed to extract coordinate-value pairs for each sample.
- Format conversion: Data were reshaped into Circos-compatible format (chromosome, start, end, value), with zero-count positions optionally removed to improve visual clarity.
- Essential gene extraction: Genes identified as essential in the initial mutant library were extracted from the Tn5Gaps output and formatted as genomic intervals for heatmap display.
- Configuration: A Circos configuration file specified ring radii, color schemes, glyph styles (circles for scatter plots), axis spacing, and label formatting.
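The format conversion step might look like the following sketch. It assumes a tab-separated combined wig layout (coordinate first, then one value per sample, with comment lines starting with "#") and a placeholder chromosome label chr1; both would need to match the actual files and the Circos karyotype.

```python
def wig_to_circos(lines, sample_index, chrom="chr1", drop_zeros=True):
    """Convert combined-wig lines into Circos data lines "chrom start end value".

    sample_index: 0-based column of the sample among the value columns
    chrom:        placeholder chromosome label (must match the karyotype file)
    """
    out = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.split("\t")
        pos = int(fields[0])
        value = float(fields[1 + sample_index])
        if drop_zeros and value == 0:
            continue
        out.append(f"{chrom} {pos} {pos} {value:g}")
    return out


mock = ["#header comment", "101\t0.0\t12.5\tgeneA", "205\t3.0\t0.0\tgeneA"]
print(wig_to_circos(mock, sample_index=0))  # ['chr1 205 205 3']
print(wig_to_circos(mock, sample_index=1))  # ['chr1 101 101 12.5']
```

Dropping zero-count positions keeps the scatter tracks small, at the cost of not showing empty sites explicitly.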
This visualization complements the statistical analyses by providing an intuitive spatial overview of insertion patterns across the complete genome.
Part 6: Addressing Specific Questions
1. Step-by-step analysis? See Part 3 above. The TPP pipeline integrates trimming, mapping, counting, and deduplication into a single workflow.
2. Bias correction for samples with fewer positions but higher reads per position?
- Template deduplication: Collapses PCR duplicates by (barcode + coordinate).
- TTR normalization: Trims extreme values before scaling, thereby reducing the influence of outliers.
- BC_corr monitoring: Values > 0.9 indicate minimal PCR bias in most samples.
- Gene-length normalization: Density = k/n (insertions per TA site), preventing longer genes from appearing artificially essential.
3. Sequence motif analysis around insertion sites? Although Tn5 displays relatively relaxed sequence specificity, unlike Himar1 with its strict TA requirement, explicit motif logo analysis was not performed. However, the pipeline inherently restricts analysis to valid insertion sites through precise mapping to the reference genome.
4. Determining significantly less frequently mutated genes? Two complementary approaches were applied:
- Tn5Gaps: Identifies constitutive essentiality through runs of non-insertions (permutation test, FDR correction).
- ANOVA: Identifies condition-specific essentiality by comparing insertion counts across conditions (F-test, Benjamini-Hochberg correction).
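Both approaches rely on the Benjamini-Hochberg adjustment, which can be written in a few lines. This is a generic sketch of the standard step-up procedure, not code taken from Transit.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


padj = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
print([round(p, 4) for p in padj])  # [0.04, 0.0533, 0.0533, 0.2]
```

In this toy example only the first gene passes a 0.05 threshold after adjustment, although three raw p-values are below 0.05.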
5. Positional effects (mutations at gene ends less lethal)? Yes, this issue is explicitly addressed within the analysis framework. The Tn5Gaps algorithm accounts for positional effects by distinguishing between internal and terminal gaps in insertion coverage:
- r metric: Represents the length of the longest continuous run of non-insertions. Long internal runs typically indicate essential protein domains.
- lenovr metric: Represents the full length of the non-insertion run with the greatest overlap with the gene body.
Decision Pipeline Summary:
- For each gene, calculate k (observed insertions), n (total TA sites), r, and lenovr from the insertion data.
- Perform a permutation test: p = P(r_perm ≥ r_obs | random insertion).
- Apply Benjamini-Hochberg correction to obtain the adjusted p-value (p_adj).
- Interpret lenovr to determine whether the significant gap is internal or terminal.
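The permutation step, p = P(r_perm ≥ r_obs | random insertion), can be sketched as below. The null model here is a simplifying assumption: the k observed insertions are shuffled uniformly among the n sites of the gene, which is not necessarily how Tn5Gaps derives its p-value. Function names and the default number of permutations are illustrative.

```python
import random


def longest_empty_run(hits):
    """Longest run of consecutive False values (sites without an insertion)."""
    longest = current = 0
    for h in hits:
        current = 0 if h else current + 1
        longest = max(longest, current)
    return longest


def run_pvalue(k, n, r_obs, n_perm=2000, seed=0):
    """Estimate P(r_perm >= r_obs) by shuffling k insertions among n sites.

    Uses the (exceed + 1) / (n_perm + 1) estimator so the p-value is never zero.
    """
    rng = random.Random(seed)
    hits = [True] * k + [False] * (n - k)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(hits)
        if longest_empty_run(hits) >= r_obs:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)


# A gene with 20 insertions over 200 sites and an observed gap of 120 sites.
print(run_pvalue(k=20, n=200, r_obs=120))
```

The resulting p-values, one per gene, are what the Benjamini-Hochberg step in the list above would adjust.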
Final Essentiality Call:
- If p_adj < 0.05 and lenovr ≈ r (internal gap) → Essential
- If p_adj < 0.05 and lenovr << r (terminal gap) → Review manually
- If p_adj ≥ 0.05 → Non-essential (insufficient evidence)
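The decision rules above can be expressed as a small function. The text does not quantify "≈", so the relative tolerance below is a hypothetical choice; any significant gene that does not meet it is sent to manual review.

```python
def essentiality_call(p_adj, r, lenovr, fdr=0.05, tolerance=0.1):
    """Apply the three decision rules to one gene.

    tolerance: hypothetical relative difference within which lenovr counts as "≈ r"
    """
    if p_adj >= fdr:
        return "Non-essential"
    if abs(lenovr - r) <= tolerance * r:
        return "Essential"
    return "Review manually"


print(essentiality_call(p_adj=0.01, r=100, lenovr=98))   # Essential
print(essentiality_call(p_adj=0.01, r=100, lenovr=20))   # Review manually
print(essentiality_call(p_adj=0.20, r=100, lenovr=100))  # Non-essential
```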
This two-layer approach, combining statistical significance (p_adj) with biological context (lenovr/k), ensures that essentiality calls are both statistically rigorous and biologically interpretable. It explicitly accounts for positional effects, particularly terminal tolerance, where gaps at gene ends may have limited functional consequences.