Journal-Polished Manuscript Methods and Analysis Text for TnSeq (Data_Jiline_Transposon)

Part 1: Manuscript Methods Section

Raw paired-end sequencing reads (FASTQ format) were processed with the TnSeq Pre-Processor (TPP) pipeline (DeJesus et al., 2017), adapted for the Tn5 transposon. Reads were analyzed against the reference genome of Yersinia enterocolitica subsp. enterocolitica WA-314 (GenBank accession: CP009367).

Read 1 (R1) was screened for the Tn5-specific primer sequence (AGCTTCAGGGTTGAGATGTGTATAAGAGACAG), allowing up to one nucleotide mismatch. Genomic DNA flanking the transposon insertion site was extracted from R1 and R2 reads, and paired-end reads were aligned to the reference genome using BWA-MEM (Li, 2013). Only properly paired reads mapping to opposite strands were retained for further analysis.

Unique insertion events were quantified after collapsing PCR duplicates. Reads were grouped according to barcode sequence and mapping coordinates, and each unique combination was counted as a single template. Template counts at each genomic position were exported in .wig format for downstream statistical analysis.

Statistical analysis of insertion patterns was conducted using Transit (v3.2.5; DeJesus & Ioerger, 2016). Datasets were normalized using the Trimmed Total Reads (TTR) method to correct for differences in library complexity and sequencing depth. Conditional essentiality was assessed by analysis of variance (ANOVA), followed by Benjamini-Hochberg false discovery rate (FDR) correction (adjusted p-value < 0.05), comparing insertion counts across five experimental conditions: initial mutant library, LB culture, 24-hour growth control, intracellular infection, and extracellular infection. Constitutive essentiality was evaluated independently using the Tn5Gaps algorithm, a Tn5 adaptation of the method of Griffin et al. (2011), which identifies genes overlapping significantly long runs of non-insertions. Significance is assessed against the maximum run length expected under random insertion, modeled with a Gumbel extreme-value distribution.

Genome-wide insertion distributions and essential gene locations were visualized using Circos (Krzywinski et al., 2009). Scatter plots represented normalized template counts for each condition, and an inner heatmap highlighted genes classified as essential (FDR-adjusted p-value < 0.05, Tn5Gaps). All analyses were performed on the complete reference genome CP009367 to ensure accurate coordinate mapping.

References

DeJesus, M. A. et al. Nature Protocols 12, 2017.
DeJesus, M. A. & Ioerger, T. R. Bioinformatics 32, 2016.
Griffin, J. E. et al. PNAS 108, 2011.
Krzywinski, M. et al. Genome Research 19, 2009.
Li, H. arXiv 1303.3997, 2013.

Part 2: Summary Table – Key Quality Metrics (Run3 – Final Analysis)

METRIC	INITIAL_MUTANTS	LB_CULTURE	GROWTHOUT_CONTROL_24H	INTRACELLULAR_MUTANTS_24H	EXTRACELLULAR_MUTANTS_24H
Total reads	49,821,406	43,486,192	70,663,823	51,244,639	47,473,664
Valid Tn prefix	20,339,623 (40.8%)	22,631,019 (52.0%)	26,777,280 (37.9%)	23,204,461 (45.3%)	9,358,660 (19.7%)
Mapped read pairs	16,445,755 (80.9%)	19,994,409 (88.4%)	24,141,881 (90.2%)	20,909,755 (90.1%)	6,588,961 (70.4%)
Unique templates	2,559,561	3,393,325	3,642,183	1,476,522	248,080
Template ratio	6.43	5.89	6.63	14.16	26.56
Density (TAs hit/total)	0.026	0.026	0.026	0.022	0.012
BC_corr	0.921	0.930	0.918	0.911	0.824

Interpretation:

BC_corr > 0.9 for four of the five samples indicates strong concordance between raw reads and deduplicated templates and is generally consistent with minimal PCR amplification bias.
The extracellular_mutants_24h sample shows reduced library complexity: only 248,080 unique templates (template ratio = 26.56), a lower insertion density (0.012), a low valid-prefix fraction (19.7%), and BC_corr = 0.824. This pattern likely reflects strong biological selection during extracellular growth.
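
The derived rows of the table can be reproduced from the raw counts. The sketch below assumes, based on the tabulated values, that the mapped percentage is taken relative to reads with a valid Tn prefix and that the template ratio is mapped read pairs divided by unique templates. The function name `qc_ratios` is illustrative and not part of TPP.

```python
def qc_ratios(total_reads, valid_prefix, mapped_pairs, unique_templates):
    """Recompute the derived QC columns from the raw counts in the table."""
    return {
        # Share of all reads that carry the transposon prefix
        "valid_prefix_pct": round(100 * valid_prefix / total_reads, 1),
        # Share of prefix-containing reads that map as pairs (assumed denominator)
        "mapped_pct": round(100 * mapped_pairs / valid_prefix, 1),
        # Mean number of mapped read pairs per unique template
        "template_ratio": round(mapped_pairs / unique_templates, 2),
    }


# Values copied from the EXTRACELLULAR_MUTANTS_24H column
extracellular = qc_ratios(47_473_664, 9_358_660, 6_588_961, 248_080)
print(extracellular)
```

A high template ratio means many reads collapse onto few templates, which is why it is read here as an indicator of reduced library complexity.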

Part 3: Step-by-Step Data Processing (TPP Pipeline)

Primer Screening & Genomic Extraction

Primer: AGCTTCAGGGTTGAGATGTGTATAAGAGACAG (Tn5-specific)
Parameters: One mismatch allowed
Genomic extraction: Suffix ≥20 bp downstream of the primer; adapter stripping was applied for short fragments
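
The screening step above can be illustrated with a short sketch. It is not TPP's implementation: it assumes mismatches are counted as substitutions only (Hamming distance), and the function names `find_primer` and `extract_genomic_suffix` are hypothetical. Adapter stripping is not shown.

```python
PRIMER = "AGCTTCAGGGTTGAGATGTGTATAAGAGACAG"  # Tn5-specific primer from the text


def find_primer(read, primer=PRIMER, max_mismatches=1):
    """Return the index just past the first primer match, or None if absent."""
    plen = len(primer)
    for start in range(len(read) - plen + 1):
        window = read[start:start + plen]
        # Count substitutions only; indels are not considered in this sketch
        mismatches = sum(1 for a, b in zip(window, primer) if a != b)
        if mismatches <= max_mismatches:
            return start + plen
    return None


def extract_genomic_suffix(read, min_len=20):
    """Return the genomic sequence downstream of the primer, or None."""
    end = find_primer(read)
    if end is None:
        return None
    suffix = read[end:]
    # Suffixes shorter than min_len are discarded (threshold from the text)
    return suffix if len(suffix) >= min_len else None


genomic = "ACGT" * 6
print(extract_genomic_suffix("GG" + PRIMER + genomic))
```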

Paired-End Mapping

Tool: BWA-MEM (TPP option -bwa-alg mem)
Requirements: Both R1 and R2 were required to map to opposite strands on reference CP009367

Template Deduplication

Reads were grouped by (barcode, mapping coordinates)
Each unique combination was counted once as a “template” to remove PCR duplicates
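
A minimal sketch of this deduplication logic follows. It assumes each mapped read pair is summarized as a (barcode, insertion position, mate end position) tuple; that representation and the name `count_templates` are illustrative and not TPP's internal data model.

```python
from collections import Counter


def count_templates(read_pairs):
    """Collapse PCR duplicates and count reads and templates per insertion site.

    read_pairs: list of (barcode, insertion_pos, mate_end_pos) tuples.
    """
    # Identical (barcode, coordinates) combinations are treated as one template
    unique_templates = set(read_pairs)
    reads_per_site = Counter(pos for _, pos, _ in read_pairs)
    templates_per_site = Counter(pos for _, pos, _ in unique_templates)
    return reads_per_site, templates_per_site


pairs = [
    ("AAAC", 100, 250), ("AAAC", 100, 250), ("AAAC", 100, 250),  # PCR duplicates
    ("GGTA", 100, 250),  # same coordinates, different barcode
    ("AAAC", 100, 310),  # same barcode, different fragment end
    ("TTGC", 500, 640),
]
reads, templates = count_templates(pairs)
print(reads[100], templates[100])
```

In this toy input, five reads at position 100 collapse to three templates, which is the kind of reduction summarized by the template ratio in Part 2.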

Output

.wig files: Template counts per genomic position
.tn_stats.txt: Library QC metrics

Part 4: Statistical Analysis (Transit)

Normalization: TTR Method

Trimmed Total Reads: Scales samples to equal total counts after excluding the top and bottom 5% of values.
Purpose: Reduces the influence of outliers, including essential genes with zero counts or highly amplified templates.
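
The idea behind a trimmed scaling factor can be sketched as follows. This is a simplified illustration rather than Transit's code: it assumes trimming is applied to the non-zero counts (so that sparse libraries are not trimmed down to zeros) and that the factor also accounts for insertion density. The target value of 100 and the function names are assumptions.

```python
def trimmed_mean_nonzero(counts, trim=0.05):
    """Mean of non-zero counts after dropping the top and bottom `trim` fraction."""
    nonzero = sorted(c for c in counts if c > 0)
    if not nonzero:
        raise ValueError("sample has no insertions")
    cut = int(len(nonzero) * trim)
    kept = nonzero[cut:len(nonzero) - cut]
    return sum(kept) / len(kept)


def ttr_factor(counts, target=100.0, trim=0.05):
    """Scaling factor that brings density x trimmed mean to a common target."""
    density = sum(1 for c in counts if c > 0) / len(counts)
    return target / (density * trimmed_mean_nonzero(counts, trim))


sample = [0] * 80 + list(range(1, 21))
with_outlier = [0] * 80 + list(range(1, 20)) + [10000]
print(ttr_factor(sample), ttr_factor(with_outlier))
```

The two factors are identical because the single extreme count falls in the trimmed tail, which is the robustness property the method is chosen for.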

Differential Essentiality Analysis: ANOVA

transit anova combined.wig samples_run3.metadata CP009367.prot_table \
  anova_out_intracellular_vs_LB \
  -n TTR --include-conditions intracellular_mutants_24h,LB_culture \
  --ref LB_culture -PC 5 -alpha 1000 -winz

PARAMETER	MEANING	RATIONALE
`--include-conditions`	Conditions to compare (comma-separated)	Enables pairwise or multi-condition comparisons
`--ref`	Reference condition for LFC calculation	Log-fold changes are computed relative to this baseline
`-PC 5`	Pseudocount used when calculating LFCs	Avoids `log(0)` and stabilizes low-count estimates
`-alpha 1000`	Variance moderation parameter	Added to the within-condition mean squared error in the F-test, so that genes with low counts are less likely to reach significance
`-winz`	Winsorization flag	Within each gene and condition, replaces the highest insertion count with the second highest to reduce the influence of outliers
`-n TTR`	Normalization method	Applies Trimmed Total Reads normalization before analysis
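
To illustrate the role of the pseudocount, the sketch below computes a log2 fold change with the pseudocount added to both the condition mean and the reference mean before taking the ratio. That placement of the pseudocount is an assumption about the calculation, and `log2_fold_change` is a hypothetical helper rather than a Transit function.

```python
import math


def log2_fold_change(mean_condition, mean_reference, pseudocount=5.0):
    """Log2 ratio of condition to reference, stabilized by a pseudocount."""
    return math.log2((mean_condition + pseudocount) / (mean_reference + pseudocount))


print(log2_fold_change(35, 5))   # 2.0: fourfold higher after adding the pseudocount
print(log2_fold_change(0, 15))   # -2.0: defined even though the condition mean is 0
print(log2_fold_change(0, 0))    # 0.0: no counts in either condition
```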

KEY METRICS IN OUTPUT:

COLUMN	DEFINITION
`Orf/Rv`	Gene identifier (e.g., `CH47_1012`)
`Gene`	Gene name (e.g., `pncB`, `phoQ`)
`TAs`	Number of TA dinucleotides within the ORF
`Mean_[condition]`	Mean normalized template count per condition
`LFC_[condition]`	Log₂ fold change relative to the reference condition
`Fstat`	F-test statistic (between-condition / within-condition variance)
`Pval`	Raw p-value from the ANOVA F-test
`Padj`	Benjamini-Hochberg FDR-adjusted p-value
`status`	Quality control flags (e.g., “No counts in all conditions”)
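
For reference, the Benjamini-Hochberg adjustment behind the `Padj` column can be sketched in a few lines. This is a generic implementation of the procedure, not code taken from Transit.

```python
def benjamini_hochberg(pvals):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


padj = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print(padj)
```

Genes are then called significant when their adjusted value falls below the chosen FDR threshold (0.05 in this analysis).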

Results saved in: anova_out_intracellular.xls, anova_out_extracellular.xls, heatmap_q0.05.png

Essentiality Analysis: Tn5Gaps

transit tn5gaps ${sample}_run3_normalized.wig CP009367.prot_table \
  ${sample}_tn5gaps_trimmed.dat -m 2 -r Sum -iN 5 -iC 5

PARAMETER	MEANING	RATIONALE
`-m 2`	Smallest read count to consider	Sites with fewer than 2 reads are treated as having no insertion, which filters spurious low-count sites
`-r Sum`	How replicates are combined: sum of counts	Pools counts across replicate .wig files before gap analysis
`-iN 5`	Ignore insertions within the first 5% of the ORF (N-terminus)	Insertions near the start of a gene may not disrupt its function
`-iC 5`	Ignore insertions within the last 5% of the ORF (C-terminus)	Insertions near the end of a gene often leave a functional product

KEY METRICS IN OUTPUT:

COLUMN	DEFINITION
`k`	Observed insertions within the ORF
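`n`	Number of candidate insertion sites within the ORF (TA dinucleotides for Himar1; Tn5 insertion is not restricted to TA sites)
`r`	Length of the maximum run of non-insertions
`pval/padj`	P-value of the non-insertion run (relative to the maximum run length expected under random insertion) and its Benjamini-Hochberg FDR-adjusted value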
`call`	Essential/Non-essential (FDR < 0.05)
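
The `k`, `n`, and `r` columns can be illustrated with a simplified sketch. It works on the per-site counts of a single ORF, whereas Tn5Gaps evaluates runs genome-wide (so a real run may extend beyond the gene boundaries). The function name `gene_run_stats` is hypothetical, and `min_count` mirrors the role of the `-m` option.

```python
def gene_run_stats(counts, min_count=1):
    """Return (k, n, r) for one ORF.

    counts: per-site counts within the ORF, in genomic order.
    k: sites with at least min_count reads; n: number of sites;
    r: longest consecutive run of sites without an insertion.
    """
    k = 0
    longest = current = 0
    for c in counts:
        if c >= min_count:
            k += 1
            current = 0  # an insertion ends the current gap
        else:
            current += 1
            longest = max(longest, current)
    return k, len(counts), longest


orf_counts = [0, 3, 0, 0, 0, 1, 7, 0, 0]
print(gene_run_stats(orf_counts, min_count=2))
```

With `min_count=2` the single-read site is ignored, so the gap it would otherwise interrupt grows from three sites to four.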

Results Summary:

~218 essential genes were identified in the initial mutant library (~5.4% of annotated genes).
Typical essential genes confirmed:
- Ribosomal proteins: rpmJ, rpsM, rpsK, rpsD, rplQ, rpmI, rplT
- RNA polymerase: rpoA (alpha subunit)
- Translation initiation: infC (IF-3)
- tRNA aminoacylation: pheS/pheT (Phe-tRNA ligase), thrS (Thr-tRNA ligase)
- Protein translocation: secY, secD, secF (Sec translocase)
- DNA replication: dnaA, dnaN
- Membrane protein quality control: ftsH (FtsH protease)
- Nucleoid organization: ihfA/ihfB (integration host factor)
- Ribosome maturation: rimP, rbfA
- Cell wall precursor synthesis: glmM (phosphoglucosamine mutase)

Most of these genes encode core components of translation, transcription, replication, and protein secretion that are essential in many bacterial species, supporting the validity of the analysis pipeline.

Results saved in: Tn5Gaps.xls

Part 5: Circos Visualization – Genome-Wide Insertion Patterns

To visualize transposon insertion distributions across the Y. enterocolitica WA-314 genome, a Circos plot was generated with the following structure:

Figure Layout

Outermost ring: Genome ideogram with kilobase scale markers
Five concentric scatter rings: Normalized template counts per insertion site for each experimental condition (extracellular, intracellular, growth control, LB culture, initial mutants), color-coded for distinction
Innermost heatmap ring: Locations of genes classified as essential by Tn5Gaps analysis (FDR-adjusted p-value < 0.05)

Data Preparation Workflow

Input processing: The normalized combined.wig file, containing template counts per insertion site across all conditions, was parsed to extract coordinate-value pairs for each sample.
Format conversion: Data were reshaped into Circos-compatible format (chromosome, start, end, value), with zero-count positions optionally removed to improve visual clarity.
Essential gene extraction: Genes identified as essential in the initial mutant library were extracted from the Tn5Gaps output and formatted as genomic intervals for heatmap display.
Configuration: A Circos configuration file specified ring radii, color schemes, glyph styles (circles for scatter plots), axis spacing, and label formatting.
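
The format-conversion step can be sketched as below. The sketch assumes the combined wig file is tab-delimited, with comment lines starting with `#` and data rows holding a position followed by one count per sample. The chromosome label `chr1` is a placeholder that would have to match the Circos karyotype file, and `wig_rows_to_circos` is an illustrative name.

```python
def wig_rows_to_circos(lines, sample_index, chrom="chr1", drop_zeros=True):
    """Convert combined-wig rows to Circos data lines: 'chr start end value'."""
    out = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.split("\t")
        pos = int(fields[0])
        value = float(fields[1 + sample_index])
        if drop_zeros and value == 0:
            continue  # optional: omit empty sites for visual clarity
        out.append(f"{chrom} {pos} {pos} {value:g}")
    return out


rows = ["#header comment", "100\t0\t12.5\tgeneA", "101\t3\t0\tgeneA"]
print(wig_rows_to_circos(rows, sample_index=1))
```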

This visualization complements the statistical analyses by providing an intuitive spatial overview of insertion patterns across the complete genome.

Part 6: Addressing Specific Questions

1. Step-by-step analysis? See Part 3 above. The TPP pipeline integrates trimming, mapping, counting, and deduplication into a single workflow.

2. Bias correction for samples with fewer positions but higher reads per position?

Template deduplication: Collapses PCR duplicates by (barcode + coordinate).
TTR normalization: Trims extreme values before scaling, thereby reducing the influence of outliers.
BC_corr monitoring: Values > 0.9 indicate minimal PCR bias in most samples.
Gene-length normalization: Density = k/n (insertions per candidate site), preventing gene length from biasing essentiality calls.

3. Sequence motif analysis around insertion sites? Explicit motif (sequence logo) analysis was not performed. Unlike Himar1, which inserts strictly at TA dinucleotides, Tn5 has relatively relaxed sequence specificity, so insertion sites are not restricted to a fixed motif. Insertion positions were instead defined by precise mapping of the transposon-genome junction to the reference genome.

4. Determining significantly less frequently mutated genes? Two complementary approaches were applied:

Tn5Gaps: Identifies constitutive essentiality through significantly long runs of non-insertions (extreme-value test on run length, FDR correction).
ANOVA: Identifies condition-specific essentiality by comparing insertion counts across conditions (F-test, Benjamini-Hochberg correction).

5. Positional effects (mutations at gene ends less lethal)? Yes, this issue is explicitly addressed within the analysis framework. The Tn5Gaps algorithm accounts for positional effects by distinguishing between internal and terminal gaps in insertion coverage:

r metric: Represents the length of the longest continuous run of non-insertions. Long internal runs typically indicate essential protein domains.
lenovr metric: Represents the full length of the non-insertion run with the greatest overlap with the gene body.

Decision Pipeline Summary:

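For each gene, calculate k (observed insertions), n (candidate insertion sites), r, and lenovr from the insertion data.
Compute the p-value of the observed run under random insertion: p = P(R_max ≥ r_obs | random insertion), where R_max is the maximum run length expected by chance.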
Apply Benjamini-Hochberg correction to obtain the adjusted p-value (p_adj).
Interpret lenovr to determine whether the significant gap is internal or terminal.

Final Essentiality Call:

If p_adj < 0.05 and lenovr ≈ r (internal gap) → Essential
If p_adj < 0.05 and lenovr << r (terminal gap) → Review manually
If p_adj ≥ 0.05 → Non-essential (insufficient evidence)
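
The decision rules above can be written as a small function. The cutoff used to decide whether lenovr ≈ r (`internal_fraction`, set here to 0.9) is a hypothetical value chosen for illustration, since the text does not give a numeric threshold. Any significant gene that does not meet it is routed to manual review.

```python
def essentiality_call(p_adj, r, lenovr, alpha=0.05, internal_fraction=0.9):
    """Apply the two-layer rule: significance first, then gap position."""
    if p_adj >= alpha:
        return "Non-essential"
    # lenovr close to r is interpreted as an internal gap (assumed cutoff)
    if r > 0 and lenovr / r >= internal_fraction:
        return "Essential"
    return "Review manually"


print(essentiality_call(0.01, r=500, lenovr=490))
print(essentiality_call(0.01, r=500, lenovr=120))
print(essentiality_call(0.20, r=500, lenovr=490))
```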

This two-layer approach, combining statistical significance (p_adj) with biological context (lenovr relative to r), ensures that essentiality calls are both statistically rigorous and biologically interpretable. It explicitly accounts for positional effects, particularly terminal tolerance, where gaps at gene ends may have limited functional consequences.
