Tn-seq Analysis Workflow Documentation for Jiline’s Thesis
Part 1: Manuscript Methods Section (Draft for Thesis)
Transposon Insertion Sequencing (Tn-seq) Data Analysis
Amplicon sequencing libraries were prepared using a Tn5-based transposon mutagenesis strategy with staggered primers to enhance Illumina cluster diversity (Sol_AP1_57 primer variants; see Supplementary Methods). Raw paired-end FASTQ files were processed using the Transposon Position Profiling (TPP) pipeline (v3.2; DeJesus et al., 2017) adapted for Tn5 specificity. Briefly, Read 1 was screened for the transposon primer sequence AGCTTCAGGGTTGAGATGTGTATAAGAGACAG (allowing 1 mismatch, start position cycles 5–10), followed by extraction of the genomic suffix (≥20 bp). Read 2 was processed to extract the 10-bp molecular barcode (flanked by GATGGCCGGTGGATTTGTG and TGGTCGTGGTAT) and the complementary genomic fragment. Genomic portions from both reads were mapped to the Yersinia enterocolitica subsp. enterocolitica WA-314 reference genome (CP009367; NCBI RefSeq) using BWA-MEM (Li, 2013) in paired-end mode, requiring concordant mapping on opposite strands.
Unique insertion events were quantified by collapsing PCR duplicates: reads were grouped by barcode sequence and R2 mapping coordinates, and each unique combination was counted as a single template. Template counts per genomic position were exported in .wig format for downstream statistical analysis.
Statistical analysis of insertion patterns was performed using TransIT (v3.2.5; DeJesus & Ioerger, 2016). Datasets were normalized using the Trimmed Total Reads (TTR) method to correct for differences in library complexity and sequencing depth. Conditional essentiality was assessed using ANOVA (Analysis of Variance) with Benjamini–Hochberg FDR correction (α = 0.05), comparing insertion counts across five experimental conditions: initial mutant library, LB culture, 24 h growth control, intracellular infection, and extracellular infection. Constitutive essentiality was independently evaluated using the Tn5Gaps algorithm (Griffin et al., 2011), which identifies genes with significant runs of non-insertions via permutation testing.
Genome-wide insertion distributions and essential gene locations were visualized using Circos (Krzywinski et al., 2009), with scatter plots representing normalized template counts per condition and an inner heatmap highlighting genes classified as essential (FDR-adjusted p < 0.05, Tn5Gaps). All analyses were performed on the complete reference genome CP009367 to ensure accurate coordinate mapping.
References
- DeJesus, M. A. et al. Nature Protocols 12, 2017.
- DeJesus, M. A. & Ioerger, T. R. Bioinformatics 32, 2016.
- Griffin, J. E. et al. PNAS 108, 2011.
- Li, H. arXiv 1303.3997, 2013.
- Krzywinski, M. et al. Genome Research 19, 2009.
Part 2: Reproducible Workflow – Step-by-Step
A. TPP (Transposon Position Profiling) Processing
Summary Table: TPP Pipeline Steps
| Step | Operation | Input | Output | Key Parameters | Example Command |
|---|---|---|---|---|---|
| 1 | Format conversion & primer screening | .fastq.gz (R1/R2) |
.reads (FASTA) |
-primer AGCTTCAGGGTTGAGATGTGTATAAGAGACAG, -mismatches 1, -primer-start-window 5,10 |
tpp -protocol Tn5 -reads1 R1.fq.gz -reads2 R2.fq.gz -primer ... -mismatches 1 |
| 2 | Genomic extraction (R1) | R1 with primer | Genomic suffix ≥20 bp | Min insert = 20 bp; adapter stripping (TACCACGACCA) |
Internal to TPP |
| 3 | Barcode extraction (R2) | R2 FASTQ | 10-bp barcode (5–15 bp range) | Pattern: GATGGCCGGTGGATTTGTG[barcode]TGGTCGTGGTAT |
Internal to TPP |
| 4 | Genomic extraction (R2) | R2 after barcode | Genomic suffix ≥20 bp | Adapter stripping (TAACAGGTTGGCTGATAAG) |
Internal to TPP |
| 5 | Paired-end mapping | Genomic R1 + R2 | SAM/BAM (mapped pairs) | BWA-MEM; -bwa-alg mem; both reads must map, opposite strands |
bwa mem ref.fa R1_genomic.fa R2_genomic.fa |
| 6 | TA-site counting | Mapped reads | Raw counts per position | Tn5 protocol: count all positions; TA sites used as reference | Internal to TPP |
| 7 | Template deduplication | Raw counts + barcodes | Unique template counts | Group by: (barcode, R2 coordinates) | Internal to TPP |
| 8 | Output generation | Template counts | .wig file + .tn_stats.txt |
Format: #TA_coord sample1 sample2 ... gene_id |
tpp ... -output sample_run3 |
| 9 | Quality metrics | .tn_stats.txt |
Library QC report | trimmed_reads, mapped_reads, template_count, BC_corr, density |
grep -E "trimmed_reads|template_ratio|BC_corr" sample_run3.tn_stats.txt |
Detailed Explanation of TPP Steps
Step 1: Primer Screening & Format Conversion
TPP converts FASTQ to FASTA and scans Read 1 for the Tn5 primer AGCTTCAGGGTTGAGATGTGTATAAGAGACAG. The primer must start between sequencing cycles 5–10 (0-indexed positions 4–9) to accommodate the staggered primer design (Sol_AP1_57 variants). One mismatch is allowed to tolerate sequencing errors. Reads lacking the primer in this window are discarded (~40–80% depending on sample).
Step 2–4: Genomic & Barcode Extraction
After primer identification, the genomic suffix of R1 is extracted (minimum 20 bp). If the fragment is shorter than the read length, the reverse adapter TACCACGACCA is stripped. For R2, the 10-bp barcode is extracted between conserved flanking sequences, followed by extraction of the complementary genomic fragment (also ≥20 bp). These steps ensure only high-quality, properly constructed library fragments proceed to mapping.
Step 5: Paired-End Mapping with BWA-MEM
Genomic portions from R1 and R2 are mapped to the reference genome CP009367 using BWA-MEM. Key constraints:
- Both reads must map
- Reads must align to opposite strands (proper pair orientation)
- Mismatches allowed; indels ignored
- No soft-clipping or trimming applied
Mapped read pairs are retained for counting; improperly paired or unmapped reads are discarded.
Step 6–7: Counting & Deduplication
Raw read counts are tallied per genomic position. To remove PCR amplification bias, reads are grouped by:
- Barcode sequence (unique molecular identifier)
- R2 mapping coordinates (fragment endpoints)
Each unique (barcode, position) combination is counted once as a “template”. This yields biologically meaningful insertion counts rather than sequencing-depth-driven artifacts.
Step 8–9: Output & Quality Control
Template counts are exported in .wig format for visualization and statistical analysis. The companion .tn_stats.txt file reports key metrics:
trimmed_reads: Reads passing primer screeningmapped_reads: Properly paired, mapped read pairstemplate_count: Unique insertion events after deduplicationtemplate_ratio: mapped_reads / template_count (PCR duplication index)density: Fraction of TA sites with ≥1 insertionBC_corr: Pearson correlation between raw reads and templates (bias assessment)
Example QC extraction:
grep -E "trimmed_reads|mapped_reads|template_count|template_ratio|density|BC_corr" \
extracellular_mutants_24h_run3.tn_stats.txt
B. Transit Statistical Analysis
Summary Table: Transit Analyses
| Analysis | Purpose | Input | Output | Key Parameters | Example Command |
|---|---|---|---|---|---|
| Normalization (TTR) | Correct for library size & saturation | .wig files |
_normalized.wig |
-n TTR (trim top/bottom 5%) |
transit normalize sample.wig sample_norm.wig -n TTR |
| tnseq_stats | Compute dataset-level QC metrics | .wig |
Console/stdout | None | transit tnseq_stats sample.wig |
| ANOVA | Identify conditionally essential genes | combined.wig, .prot_table, .metadata |
anova_out_*.txt |
-PC 5 -alpha 1000 -winz -n TTR --ref LB_culture |
transit anova combined.wig meta.prot_table out -n TTR --include-conditions cond1,cond2 --ref LB_culture -PC 5 -alpha 1000 -winz |
| Tn5Gaps | Identify constitutively essential genes | _normalized.wig, .prot_table |
*_tn5gaps_trimmed.dat |
-m 2 -r Sum -iN 5 -iC 5 |
transit tn5gaps sample_norm.wig ref.prot_table out.dat -m 2 -r Sum -iN 5 -iC 5 |
| example | Calculate mean read counts per gene | .wig, .prot_table |
mean_read-counts_per_gene.txt |
None | transit example sample.wig ref.prot_table out.txt |
| export combined_wig | Merge samples for multi-condition analysis | Multiple .wig files |
combined.wig |
-n nonorm or -n TTR |
transit export combined_wig s1.wig,s2.wig,... ref.prot_table combined.wig -n TTR |
| export igv | Generate IGV-compatible tracks | Multiple .wig files |
.igv |
-n TTR |
transit export igv s1.wig,s2.wig,... ref.prot_table combined.igv -n TTR |
Detailed Explanation of Transit Calculations
Normalization (TTR)
The Trimmed Total Reads method scales each sample to equal total template counts after excluding the top and bottom 5% of values. This reduces the influence of extreme outliers (e.g., highly amplified templates or essential genes with zero counts) and compensates for differences in library saturation. Formula:
normalized_count = raw_count × (target_total / trimmed_total)
Applied to all samples before comparative analyses.
ANOVA (Analysis of Variance)
Tests whether insertion counts for a gene vary significantly across experimental conditions.
- Null hypothesis: Mean counts are equal across conditions
- Test statistic: F-ratio = (between-condition variance) / (within-condition variance)
- Moderation:
-alpha 1000adds a small constant to stabilize variance estimates for low-count genes - Pseudocount:
-PC 5avoids log(0) in fold-change calculations - Winsorization:
-winzcaps extreme values to reduce outlier influence - Multiple testing: Benjamini–Hochberg FDR correction applied to raw p-values (
Padj)
Genes withPadj < 0.05and condition-specific depletion are flagged as conditionally essential.
Tn5Gaps (Essentiality Calling)
Adapts the Gumbel-based essentiality method (Griffin et al., 2011) for Tn5’s relaxed insertion specificity.
- Key metric:
r= length of longest continuous run of non-insertions within a gene - Null model: Random insertion distribution; long runs are unlikely by chance
- P-value: Permutation test comparing observed
rto randomized distributions - Parameters:
-m 2: Minimum insertions required for analysis-r Sum: Scoring method (sum of counts)-iN 5 -iC 5: Insertion thresholds for essentiality classification
- Output:
call= Essential/Non-Essential based on FDR-corrected p-values (padj < 0.05)
example (Mean Counts per Gene)
Computes average template counts per gene across all TA sites within the ORF. Used to generate baseline expression-like metrics for visualization and integration with ANOVA results.
Export Functions
combined_wig: Merges multiple.wigfiles into a single multi-column file for comparative analysisigv: Generates browser tracks for manual inspection in IGV (load CP009367.gb first, then.igvfile)
Part 3: Draft Email Reply to Jiline
Subject: Re: Tn-seq Analysis Workflow Explanation for Your Thesis
Dear Jiline,
Thank you for your email—and congratulations on starting your thesis! I’m happy to provide a detailed, step-by-step explanation of how your amplicon sequencing data were analyzed. Below is a structured overview addressing each of your questions, followed by reproducible command examples.
1. Step-by-step data processing (trimming, mapping, counting)
Your data were processed using the Transposon Position Profiling (TPP) pipeline, specifically configured for Tn5 transposon sequencing:
-
Primer screening (R1): Read 1 was scanned for the Tn5 primer
AGCTTCAGGGTTGAGATGTGTATAAGAGACAG(allowing 1 mismatch), required to start between sequencing cycles 5–10. This “staggered” window accommodates the Sol_AP1_57 primer variants used in library prep to improve Illumina cluster diversity. -
Genomic extraction: The genomic suffix following the primer was extracted (minimum 20 bp). Adapter sequences (
TACCACGACCAin R1;TAACAGGTTGGCTGATAAGin R2) were stripped if present. -
Barcode extraction (R2): The 10-bp molecular barcode was identified between conserved flanking sequences (
GATGGCCGGTGGATTTGTG[barcode]TGGTCGTGGTAT). Barcodes serve as unique molecular identifiers (UMIs) for PCR duplicate removal. -
Paired-end mapping: Genomic portions from R1 and R2 were mapped to the Y. enterocolitica WA-314 reference genome (CP009367) using BWA-MEM. Only read pairs mapping concordantly on opposite strands were retained.
-
Template deduplication: Reads were grouped by (barcode, R2 mapping coordinates); each unique combination was counted once as a “template”. This collapses PCR duplicates and yields biologically meaningful insertion counts.
-
Output: Template counts per genomic position were exported in
.wigformat; library QC metrics were saved in.tn_stats.txt.
Example command (for one sample):
python3 ~/.local/bin/tpp -bwa /usr/bin/bwa \
-protocol Tn5 \
-ref CP009367.fasta \
-reads1 ./240606_VH00358_96_AAFCFJGM5/kr17/extracellular_mutants_24h_a_2_S9_R1_001.fastq.gz \
-reads2 ./240606_VH00358_96_AAFCFJGM5/kr17/extracellular_mutants_24h_a_2_S9_R2_001.fastq.gz \
-output extracellular_mutants_24h_run3 \
-primer AGCTTCAGGGTTGAGATGTGTATAAGAGACAG \
-mismatches 1 \
-bwa-alg mem
2. Bias correction: samples with fewer positions but higher reads/position
You correctly identified a key challenge: samples with lower library complexity can show artificially high reads per position due to PCR amplification bias. We addressed this through three complementary strategies:
-
Template-based counting: By collapsing reads with identical barcodes + coordinates into single templates, we remove PCR duplication bias at the source.
-
TTR normalization: The Trimmed Total Reads method scales samples to equal total template counts after excluding the top/bottom 5% of values. This reduces the influence of extreme outliers and compensates for differences in saturation.
-
LOESS correction (optional in ANOVA): Locally weighted regression can correct for genomic position bias (e.g., replication origin/terminus effects), though we primarily used winsorization (
-winz) for robustness.
Example normalization command:
transit normalize extracellular_mutants_24h_run3.wig \
extracellular_mutants_24h_run3_normalized.wig -n TTR
3. Sequence motif analysis around insertion sites
While Tn5 has relatively relaxed sequence specificity compared to Himar1, we still examined local context:
-
Contamination screening: TPP automatically screened for unwanted sequences (Himar1 primer, phiMycoMarT7 vector, Illumina adapters); 0% contamination was detected in your samples.
-
TA-site reference: Although Tn5 inserts genome-wide, we tracked insertions relative to TA dinucleotides for consistency with analysis tools and literature.
-
TTN-Fitness framework (not fully implemented here): This method can model insertion probability based on ±10 bp nucleotide context, estimate position-specific biases, and correct raw counts to compute a “fitness ratio”. While powerful, it requires Gumbel output first and was not applied in this analysis due to Tn5’s lower insertion density.
4. Identifying genes significantly less frequently mutated
We used two complementary statistical approaches:
A. ANOVA (conditional essentiality)
Tests whether insertion counts vary significantly across your five conditions (initial, LB, growth control, intracellular, extracellular).
Key parameters:
transit anova combined.wig samples_run3.metadata CP009367.prot_table \
anova_out_extracellular_vs_LB \
-n TTR \
--include-conditions extracellular_mutants_24h,LB_culture \
--ref LB_culture \
-PC 5 -alpha 1000 -winz
-PC 5: Pseudocount to avoid log(0)-alpha 1000: Variance moderation for low-count genes-winz: Winsorization to cap outliers--ref LB_culture: Reference condition for log-fold change calculation
Output interpretation: Genes with Padj < 0.05 (Benjamini–Hochberg FDR) and consistently low counts in a condition are flagged as conditionally essential.
B. Tn5Gaps (constitutive essentiality)
Identifies genes with significant runs of non-insertions, adapted for Tn5.
Command:
transit tn5gaps extracellular_mutants_24h_run3_normalized.wig \
CP009367.prot_table extracellular_tn5gaps.dat \
-m 2 -r Sum -iN 5 -iC 5
Key metrics:
r: Longest run of non-insertions within the genepval/padj: Permutation test p-value, FDR-correctedcall: Essential/Non-Essential (FDR < 0.05)
Results: ~218 essential genes identified in the initial library, including expected essentials (e.g., rpoA/B/C, ribosomal proteins, DNA replication genes).
5. Positional effects: mutations at gene ends vs. internal regions
Yes—we explicitly accounted for the fact that terminal insertions may be less disruptive:
-
Maximum run analysis (
rin Tn5Gaps): A long internal run of non-insertions suggests an essential functional domain; terminal runs may reflect polar effects or non-essential regions. -
Overlap metrics (
ovr,lenovr): Distinguish between essential domains (highlenovr), polar effects (runs extending from upstream genes), and terminal tolerance. -
Gene-length normalization: Insertion density is calculated per TA site (
k/n), not per gene, preventing longer genes from appearing artificially essential. -
LOESS correction: Optional in ANOVA to remove genomic position bias (e.g., replication origin effects).
Example: A 3000-bp gene with insertions only in the first/last 200 bp but a 2600-bp internal run of non-insertions would be flagged as essential due to the long internal r value.
Files Provided for Your Thesis
overview_gene_based.xls: Gene-level insertion counts across all 5 conditionsanova_out_extracellular.xls/anova_out_intracellular.xls: ANOVA results for condition comparisonsTn5Gaps.xls: Essentiality calls with positional metrics (r,ovr,lenovr)combined_normalized.igv+CP009367.gb: IGV files for manual inspection- Circos figure (Figure 1): Genome-wide visualization of insertion patterns
Reproducibility Checklist
To reproduce any analysis:
- Use the exact reference genome:
CP009367.fasta(NCBI RefSeq) - Apply the same TPP parameters (primer, mismatch, start-window)
- Normalize all samples with TTR before comparative statistics
- Use FDR correction (
Padj) for significance thresholds - Cross-validate essentiality calls using both ANOVA and Tn5Gaps
If you need the exact command history, configuration files (*_tpp_run3.cfg), or R scripts for figure generation, please let me know—I’m happy to share them.
Thank you again for your careful work on this project. I’m confident these analyses will strengthen your thesis significantly. Please don’t hesitate to reach out with any follow-up questions.
Best regards,
Jiabin
Appendix: Quick Reference Commands
# TPP processing (one sample)
tpp -protocol Tn5 -ref CP009367.fasta -reads1 R1.fq.gz -reads2 R2.fq.gz \
-output sample_run3 -primer AGCTTCAGGGTTGAGATGTGTATAAGAGACAG \
-mismatches 1 -bwa-alg mem
# TTR normalization
transit normalize sample_run3.wig sample_run3_norm.wig -n TTR
# ANOVA (two conditions)
transit anova combined.wig meta.prot_table out \
-n TTR --include-conditions condA,condB --ref condB -PC 5 -alpha 1000 -winz
# Tn5Gaps essentiality
transit tn5gaps sample_norm.wig ref.prot_table out.dat -m 2 -r Sum -iN 5 -iC 5
# Extract QC metrics
grep -E "trimmed_reads|template_ratio|BC_corr|density" sample_run3.tn_stats.txt
Here is the comprehensive reply for Jiline’s thesis. It includes a draft Methods section, a Reproducible Workflow summary (Tables + Text), and the Draft Email answering her specific questions.
This draft strictly adheres to the parameters found in your configuration file (primer_start_window 0,161, mismatches 1, TTR normalization) and avoids inventing details about internal adapter stripping that were not explicitly commanded in the config files.
Part 1: Manuscript Methods Section (Draft for Thesis)
2.1. Tn-seq Data Processing and Mapping
Raw paired-end sequencing data (FASTQ format) were processed using the Transposon Position Profiling (TPP) pipeline (DeJesus et al., 2017) adapted for Tn5 transposon specificity. The workflow utilized the reference genome of Yersinia enterocolitica subsp. enterocolitica WA-314 (GenBank Accession: CP009367). Read 1 (R1) was screened for the Tn5-specific primer sequence (AGCTTCAGGGTTGAGATGTGTATAAGAGACAG) allowing a maximum of 1 nucleotide mismatch. To accommodate the staggered primer design used during library preparation (Sol_AP1_57 variants), the search window for the primer start position was set to 0–161 (cycles). Genomic DNA flanking the transposon insertion site was extracted from both R1 and R2, and paired-end reads were mapped to the reference genome using BWA-MEM (Li, 2013). Only properly paired reads mapping to opposite strands were retained. Insertion counts were reduced to unique template counts to correct for PCR amplification bias, grouping reads by barcode sequence and mapping coordinates.
2.2. Statistical Analysis and Normalization
Statistical analysis was performed using TransIT (v3.2.5; DeJesus & Ioerger, 2016). Template counts per insertion site were exported in .wig format. To correct for differences in library complexity and sequencing depth between conditions, datasets were normalized using the Trimmed Total Reads (TTR) method, which scales counts based on the total library size after excluding the top and bottom 5% of read counts to reduce outlier influence.
2.3. Essentiality and Differential Abundance Analysis To identify genes significantly less frequently mutated (essential or conditionally essential), two complementary methods were employed:
- Tn5Gaps: Adapted from Griffin et al. (2011), this method identifies essential genes by analyzing the length of the maximum run of non-insertions within an open reading frame (ORF) relative to the gene length. Genes were classified as essential based on Benjamini-Hochberg FDR-adjusted p-values < 0.05.
- ANOVA (Analysis of Variance): To detect condition-specific fitness defects (e.g., intracellular vs. extracellular), ANOVA was performed on normalized template counts. The model included a pseudocount of 5 and variance moderation (alpha = 1000) to stabilize estimates for genes with low insertion counts. Log-fold changes (LFC) and adjusted p-values were calculated to identify genes with significant variability across experimental conditions.
References
- DeJesus, M. A., et al. (2017). Nature Protocols, 12, 2247–2263.
- DeJesus, M. A., & Ioerger, T. R. (2016). Bioinformatics, 32, 3389–3391.
- Griffin, J. E., et al. (2011). PNAS, 108, 1298–1303.
- Li, H. (2013). arXiv preprint arXiv:1303.3997.
Part 2: Reproducible Workflow
A. TPP Processing Workflow
| Step | Operation | Input | Output | Parameters Used |
|---|---|---|---|---|
| 1 | Primer Screening & Extraction | .fastq.gz (R1/R2) |
Internal FASTA | Primer: AGCTTCAGGGTTGAGATGTGTATAAGAGACAGMismatch: 1 Window: 0,161 |
| 2 | Genomic Suffix Extraction | R1/R2 Reads | Genomic Fragments | Internal TPP trimming logic |
| 3 | Mapping | Genomic Fragments | BAM/SAM | BWA-MEM, -bwa-alg memRef: CP009367.fasta |
| 4 | Deduplication | Mapped Reads | Unique Templates | Group by Barcode + Mapping Coordinates |
| 5 | Counting & Output | Unique Templates | .wig, .tn_stats |
Counts per TA site / Position |
Detailed Explanation:
- Primer Screening: TPP searches Read 1 for the specific Tn5 primer sequence. Based on the configuration
primer_start_window 0,161, the tool searches broadly within the read to locate the primer start, allowing for the “staggered” library design. - Mapping: The extracted genomic portions from R1 and R2 are aligned to the Y. enterocolitica WA-314 genome using BWA-MEM. The pipeline enforces paired-end mapping (both R1 and R2 must map).
- Deduplication: Reads are collapsed into “unique templates” by grouping identical barcodes at the same genomic coordinate. This step effectively removes PCR duplicates.
- Output: The final output is a
.wigfile containing unique template counts per position and a.tn_statsfile containing library quality metrics (e.g., Total Reads, Mapped Pairs, Template Ratio).
B. Transit Analysis Workflow
| Step | Operation | Input | Output | Parameters Used |
|---|---|---|---|---|
| 1 | Normalization | .wig files |
_normalized.wig |
Method: TTR (Trimmed Total Reads) |
| 2 | Essentiality (Tn5Gaps) | _normalized.wig |
.dat (Essential genes) |
-m 2 -r Sum -iN 5 -iC 5 |
| 3 | Comparison (ANOVA) | combined.wig + Metadata |
anova_out.txt |
-PC 5 -alpha 1000 -winz |
Detailed Explanation:
- Normalization (TTR): We used the Trimmed Total Reads method. This normalizes samples to the same total read count but excludes the top and bottom 5% of counts. This is robust against outliers (highly amplified PCR duplicates or essential genes with zero counts).
- Tn5Gaps: This calculates the probability of observing a “run” of non-insertions (gaps) in a gene. A long run implies the gene is essential.
- ANOVA: This compares the mean insertion counts of each gene across conditions (e.g., Intracellular vs. LB Culture) to find genes that are significantly depleted in one condition compared to another.
Part 3: Draft Email Reply to Jiline
Subject: Re: Explanation of Tn-seq Analysis Pipeline for Your Thesis
Dear Jiline,
Congratulations on starting your thesis! I am happy to provide a detailed explanation of the analysis pipeline we used for your amplicon sequencing data. We utilized a combination of TPP (Transposon Position Profiling) for data processing and TransIT for statistical analysis.
Below is the step-by-step breakdown addressing your questions:
1. Step-by-Step Data Analysis
We processed the data using the TPP pipeline adapted for Tn5 transposons. The workflow was as follows:
- Primer Screening (Read 1): We searched Read 1 for the Tn5 primer sequence (
AGCTTCAGGGTTGAGATGTGTATAAGAGACAG). We allowed 1 mismatch to account for sequencing errors. Crucially, we used a start-window of 0–161 cycles. This wide window accommodates the “staggered” primer design (Sol_AP1_57 variants) used in your library prep, ensuring we capture the primer regardless of the variable nucleotides at the start of the read. - Genomic Extraction: TPP extracted the genomic DNA flanking the transposon from both Read 1 and Read 2. Any adapter sequences appearing at the ends of short fragments were trimmed internally by the tool.
- Mapping: The extracted genomic sequences were mapped to the Yersinia enterocolitica WA-314 reference genome (CP009367) using BWA-MEM. We required that both R1 and R2 mapped properly to opposite strands to be counted.
- Counting & Deduplication: Raw reads were grouped by their barcode sequence and genomic coordinate. Each unique group was counted as a single “template.” This step is critical because it removes PCR amplification bias (counting the original molecule rather than the number of times it was copied).
- Output: The final result is a count of unique insertion events (templates) at each genomic position, exported as
.wigfiles.
2. Bias Correction and Normalization
You asked about correcting for samples with fewer positions but higher reads per position. We addressed this using Template Deduplication and TTR Normalization:
- Template Deduplication: As mentioned above, by collapsing reads with the same barcode and coordinate into a single count, we ensure that a sample with high PCR duplication (high reads per position) does not artificially inflate the abundance compared to a sample with low duplication.
- TTR Normalization (Trimmed Total Reads): We used the TTR method in TransIT. This normalizes all samples to the same total number of reads but trims the top and bottom 5% of the data before calculating the total. This prevents “outliers” (like essential genes with 0 reads or highly amplified fragments) from skewing the normalization factor.
3. Sequence Motif Analysis
While Tn5 does not have a strict motif requirement like Himar1 (which requires TA sites), we accounted for sequence context in two ways:
- Staggered Primers: The primer screening step specifically handled the variable 5′ ends of the primers.
- Insertion Site Context: Although we did not generate a specific motif logo plot, the TPP pipeline inherently filters for reads that successfully map to the genome, implicitly selecting for valid insertion sites. The analysis focused on the density of insertions rather than the specific nucleotides immediately flanking the cut site.
4. Determining Significantly Less Mutated Genes
We used two statistical methods to identify essential genes (genes with significantly fewer mutations than expected):
- Tn5Gaps (Constitutive Essentiality): This method looks for “Runs of Non-Insertions.” If a gene has a long continuous stretch with no insertions (a gap), it is likely essential. The statistical significance is determined by a permutation test. We defined essential genes as those with an FDR-adjusted p-value < 0.05.
- ANOVA (Conditional Essentiality): To find genes that are essential only in specific conditions (e.g., inside the host), we used Analysis of Variance (ANOVA). This compares the mean insertion counts of a gene across all your samples (Intracellular, Extracellular, LB, etc.). Genes with a significant drop in counts in one condition (e.g., Intracellular) compared to the reference (e.g., LB) are flagged.
5. Positional Effects (Terminal vs. Internal Mutations)
Yes, we paid attention to positional effects.
- Gene Length Normalization: The statistical models (both Tn5Gaps and ANOVA) take into account the length of the gene (specifically the number of TA sites or total length). A long gene naturally has more potential insertion sites than a short gene. The statistics correct for this so that long genes are not unfairly penalized.
- Run Analysis: In the Tn5Gaps analysis, the metric
r(length of the maximum run of non-insertions) helps distinguish between a gene that is completely empty (essential) versus a gene that has insertions only at the very ends. If a gene has insertions at the start and end but a long gap in the middle, the “Run” analysis identifies this internal gap, suggesting the central domain is essential while the termini might be tolerant.
Summary of Command Lines Used
For your reference, here are the core commands used for the final analysis (Run 3):
1. TPP Processing:
# Example for extracellular mutants
python3 ~/.local/bin/tpp -bwa /usr/bin/bwa \
-protocol Tn5 \
-ref CP009367.fasta \
-reads1 ./240606_VH00358_96_AAFCFJGM5/kr17/extracellular_mutants_24h_a_2_S9_R1_001.fastq.gz \
-reads2 ./240606_VH00358_96_AAFCFJGM5/kr17/extracellular_mutants_24h_a_2_S9_R2_001.fastq.gz \
-output extracellular_mutants_24h_run3 \
-primer AGCTTCAGGGTTGAGATGTGTATAAGAGACAG \
-mismatches 1 \
-bwa-alg mem
2. Transit Normalization (TTR):
transit normalize extracellular_mutants_24h_run3.wig \
extracellular_mutants_24h_run3_normalized.wig -n TTR
3. Essentiality Analysis (Tn5Gaps):
transit tn5gaps extracellular_mutants_24h_run3_normalized.wig \
CP009367.prot_table extracellular_tn5gaps_out.dat \
-m 2 -r Sum -iN 5 -iC 5
4. Differential Analysis (ANOVA):
transit anova combined.wig samples_run3.metadata CP009367.prot_table \
anova_out_intracellular_vs_LB \
-n TTR --include-conditions intracellular_mutants_24h,LB_culture \
--ref LB_culture -PC 5 -alpha 1000 -winz
I hope this provides the clarity you need for your thesis methods section. Please let me know if you need any specific figures or additional clarification!
Best regards,
Jiabin