Explanation: Why the Primer is at the “End” of the Transposon Sequence

This is an excellent and important question about the library preparation strategy. Let me clarify the apparent contradiction.

Short Answer

The primer AGCTTCAGGGTTGAGATGTGTATAAGAGACAG sits at the transposon-genome junction; which end counts as the "end" depends on the orientation in which the sequence is written. Written 5'→3', the primer region appears at the 3' end of the transposon because that is the end that abuts the genomic DNA after insertion. Read 1 therefore runs through this transposon end and continues INTO the genome, not the other way around.

Detailed Explanation

1. Transposon Structure and Orientation

Each end of the inserted Tn5 transposon has the following layout:

[Genomic DNA] ← insertion site → [Transposon End (ME sequence)] ← [Primer Binding Site] → [Rest of Transposon/Adapter]

The sequence you see in the PDF is written 5'→3' in the orientation of the transposon construct:

5'-[Long transposon body]...[ME recognition sequence: AGATGTGTATAAGAGACAG]-3'
                              ↑
                      Primer binds here
                      Sequencing proceeds → INTO genomic DNA

2. The Mosaic End (ME) Sequence

From your PDF:

ME recognition sequence: 5' AGATGTGTATAAGAGACAG 3'
ME complementary sequence: 3' TCTACACATATTCTCTGTC 5'
ME reverse-complement sequence: 5' CTGTCTCTTATACACATCT 3'
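
The three ME sequences above can be checked against each other with a few lines of Python. This is a minimal sketch; the helper names `complement` and `reverse_complement` are illustrative and not part of any pipeline.

```python
# Watson-Crick complement table for unambiguous DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def complement(seq: str) -> str:
    """Complement base by base; read left to right, the result runs 3'->5'."""
    return seq.translate(COMPLEMENT)

def reverse_complement(seq: str) -> str:
    """Complement and reverse, so the result is again written 5'->3'."""
    return complement(seq)[::-1]

ME = "AGATGTGTATAAGAGACAG"
print(complement(ME))          # complementary strand, 3'->5'
print(reverse_complement(ME))  # reverse complement, 5'->3'
```

The reverse complement is the form in which the ME appears when a read crosses the junction from the genome side.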

The primer you use (`AGCTTCAGGGTTGAGATGTGTATAAGAGACAG`, 32 nt) contains:

Component	Sequence	Function
Additional 5′ extension (13 nt)	`AGCTTCAGGGTTG`	Transposon sequence directly upstream of the ME; lengthens the match so the prefix is specific to this construct
Core ME recognition (19 nt)	`AGATGTGTATAAGAGACAG`	Essential for Tn5 transposase binding and insertion
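
The split between extension and ME can be derived from the two sequences rather than by eye. The sketch below assumes only the primer and ME strings given above; `split_primer` is a hypothetical helper name.

```python
PRIMER = "AGCTTCAGGGTTGAGATGTGTATAAGAGACAG"
ME = "AGATGTGTATAAGAGACAG"

def split_primer(primer: str, me: str):
    """Return (extension, me) if the primer ends with the full ME sequence."""
    if not primer.endswith(me):
        raise ValueError("primer does not end with the ME sequence")
    return primer[: -len(me)], me

extension, core = split_primer(PRIMER, ME)
print(len(PRIMER), len(extension), extension)
print(len(core), core)
```

The primer ends with the complete 19-nt ME, so the last base of the primer is the last transposon base before the genome.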

3. Why the Primer Must Be at the “End”

A. Sequencing Directionality

Illumina sequencing reads 5'→3' from the primer:

Primer: 5'-AGCTTCAGGGTTGAGATGTGTATAAGAGACAG-3'
                             ↓
                    Sequencing extends → → → 
                             ↓
                    [Genomic DNA insertion site]

If the primer were at the “beginning” of the transposon sequence, sequencing would read through the transposon body instead of into the genome — yielding no useful insertion site information.

B. Library Preparation Workflow

1. Tn5 transposon inserts into genomic DNA at essentially random positions (unlike Himar1, Tn5 does not require a TA site)
2. Genomic DNA is fragmented around insertion sites
3. PCR amplification uses primers binding to transposon ends
4. Sequencing read 1 (R1) starts at transposon primer → reads INTO genome
5. Sequencing read 2 (R2) may capture barcode + opposite genomic flank

C. The “Staggered” Primer Start (Cycles 5-10)

Your pipeline searches for the primer starting between cycles 5-10 because:

Primer variants used (from PDF):
NEB_i5_Tn_01: 5'-ACACTCTTTCCCTACACGACGCTCTTCCGATCTACCTACAACAAAGCTCTCATCAACC-3'
NEB_i5_Tn_02: 5'-ACACTCTTTCCCTACACGACGCTCTTCCGATCTCACCTACAACAAAGCTCTCATCAAC-3'  
NEB_i5_Tn_03: 5'-ACACTCTTTCCCTACACGACGCTCTTCCGATCTCCTACAACAAAGCTCTCATCAACCG-3'

These primers share the Illumina adapter but anneal to the transposon at slightly shifted positions (a stagger, often called a frameshift) to:
- Create diversity at the start of Read 1 (improves cluster identification on Illumina flow cells)
- Ensure the core ME sequence (`AGATGTGTATAAGAGACAG`) is still recognized by the pipeline

4. Visual Summary: Read Structure After Insertion

Genomic DNA (unknown) ← [Insertion Site] ← [Transposon ME End] ← [Primer Binding Site]

Sequencing Read 1 (R1):
5'-[Primer: AGCTTCAGGGTTGAGATGTGTATAAGAGACAG][Genomic DNA...]-3'
     ↑                                    ↑
     Primer binding site                  Genomic sequence starts here
     (searched for in pipeline)           (extracted and mapped to genome)

Pipeline step:
1. Identify primer pattern in R1 (allowing 1 mismatch, start position 5-10)
2. Extract suffix AFTER primer = genomic DNA
3. Map genomic suffix to reference genome
4. Record insertion coordinate = genomic position at which the mapped suffix begins (Tn5 insertions are not tied to TA dinucleotides)
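
Steps 1 and 2 above can be sketched on FASTQ records as follows. This is an illustration under stated assumptions, not TPP's actual code: the input is assumed to be well-formed 4-line FASTQ, and `primer_end`, `trim_fastq`, and the `min_len` cutoff are hypothetical names and values.

```python
from typing import Iterable, Iterator, Optional, Tuple

PRIMER = "AGCTTCAGGGTTGAGATGTGTATAAGAGACAG"

def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def primer_end(seq: str, primer: str = PRIMER, max_mismatches: int = 1,
               window: Tuple[int, int] = (4, 9)) -> Optional[int]:
    """Return the 0-based index just past the primer, or None if absent."""
    for start in range(window[0], window[1] + 1):
        end = start + len(primer)
        if end <= len(seq) and hamming(seq[start:end], primer) <= max_mismatches:
            return end
    return None

def trim_fastq(lines: Iterable[str],
               min_len: int = 20) -> Iterator[Tuple[str, str, str]]:
    """Yield (header, genomic_seq, genomic_qual) for reads carrying the primer."""
    it = iter(lines)
    for header in it:
        seq = next(it).strip()
        next(it)  # '+' separator line
        qual = next(it).strip()
        end = primer_end(seq)
        # Keep the read only if enough genomic sequence remains for mapping.
        if end is not None and len(seq) - end >= min_len:
            yield header.strip(), seq[end:], qual[end:]

# Synthetic record: 5 leading bases, the primer, then 25 "genomic" bases.
demo = ["@read1", "ACGTA" + PRIMER + "G" * 25, "+", "I" * (5 + 32 + 25)]
for header, seq, qual in trim_fastq(demo):
    print(header, len(seq), len(qual))
```

For real data the same generator could be fed from `gzip.open("sample_R1.fastq.gz", "rt")`; the quality string is trimmed alongside the sequence so the record stays valid for the mapper.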

5. Why This Design Works for Tn-Seq

Design Feature	Purpose
Primer at transposon-genome junction	Enables sequencing FROM known transposon INTO unknown genome
ME sequence within primer	Ensures only true transposon insertions are captured
Variable start position (cycles 5-10)	Improves Illumina cluster diversity; accommodates library prep variants
Genomic suffix extraction	Isolates the biologically relevant insertion site for mapping

Key Takeaway

The primer appears at the “end” of the written transposon sequence because that end is physically adjacent to the genomic DNA after insertion. Sequencing proceeds from this primer into the genome, allowing us to identify exactly where the transposon inserted. If the primer were at the opposite end, we would sequence through the transposon body and miss the critical genomic insertion site.

This orientation is fundamental to all Tn-seq, Tn5-seq, and similar insertion sequencing methods — the primer must be positioned to read from the known transposon sequence into the unknown genomic context.

Related Command Line (from your pipeline):

# TPP searches for primer in R1, extracts genomic suffix, maps to genome.
# -primer is the prefix expected at the transposon-genome junction.
python3 ~/.local/bin/tpp -bwa /usr/bin/bwa -protocol Tn5 -ref CP009367.fasta \
  -reads1 sample_R1.fastq.gz -reads2 sample_R2.fastq.gz \
  -output sample_run3 \
  -primer AGCTTCAGGGTTGAGATGTGTATAAGAGACAG \
  -mismatches 1 -bwa-alg mem

TODO: add a diagram of the read structure and clarify any remaining aspects of the library preparation strategy.

Explanation of Start-Position 5-10

What Does “Start-Position 5-10” Mean?

The start-position 5-10 refers to the cycle number (nucleotide position) in Read 1 where the transposon primer sequence begins. This is intentional diversity created during library preparation.

Why Positions 5-10?

Looking at your PDF, three different forward primers are used:

NEB_i5_Tn_01: 5'-ACACTCTTTCCCTACACGACGCTCTTCCGATCT ACCTACAACAAAGCTCTCATCAACC-3'
NEB_i5_Tn_02: 5'-ACACTCTTTCCCTACACGACGCTCTTCCGATCT CACCTACAACAAAGCTCTCATCAAC-3'
NEB_i5_Tn_03: 5'-ACACTCTTTCCCTACACGACGCTCTTCCGATCT CCTACAACAAAGCTCTCATCAACCG-3'
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^
                 Constant Illumina adapter (33 nt)  Transposon-annealing region (25 nt, staggered)

Key insight: The primers have:

Constant region (first 33 nt): ACACTCTTTCCCTACACGACGCTCTTCCGATCT (Illumina adapter)
Staggered start (0-2 nt): the three primers carry 1, 2, or 0 extra nucleotides (A, CA, none) between the adapter and the shared sequence
Transposon amplification sequence: CCTACAACAAAGCTCTCATCAAC (shared by all three primers)

This stagger shifts the transposon sequence to different positions in the sequencing read (reported as cycles 5-10 in this pipeline), improving cluster diversity on Illumina flow cells.
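
The constant adapter and the stagger can be read directly off the three primer sequences. The sketch below uses the sequences as given under "Primer variants used (from PDF)"; `SHARED_CORE` is an assumption obtained by inspecting those sequences, not a value taken from the PDF.

```python
import os

PRIMERS = {
    "NEB_i5_Tn_01": "ACACTCTTTCCCTACACGACGCTCTTCCGATCTACCTACAACAAAGCTCTCATCAACC",
    "NEB_i5_Tn_02": "ACACTCTTTCCCTACACGACGCTCTTCCGATCTCACCTACAACAAAGCTCTCATCAAC",
    "NEB_i5_Tn_03": "ACACTCTTTCCCTACACGACGCTCTTCCGATCTCCTACAACAAAGCTCTCATCAACCG",
}
# Assumed: the stretch of transposon sequence present in all three primers.
SHARED_CORE = "CCTACAACAAAGCTCTCATCAAC"

# os.path.commonprefix works character by character on any strings.
adapter = os.path.commonprefix(list(PRIMERS.values()))

# Offset of the shared core within each primer's transposon-specific tail.
offsets = {name: seq[len(adapter):].find(SHARED_CORE)
           for name, seq in PRIMERS.items()}

print(len(adapter), adapter)
for name, offset in offsets.items():
    print(name, "stagger:", offset)
```

Under these assumptions the primers differ by at most two nucleotides in where the shared transposon sequence begins, which is the source of the shifted read starts.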

How to Implement Start-Position 5-10 in Code

In TPP (Tn-Seq Pre-Processor, part of TRANSIT)

The pipeline uses the -primer-start-window parameter:

# Example from your pipeline:
python3 ~/.local/bin/tpp -bwa /usr/bin/bwa \
  -protocol Tn5 \
  -ref CP009367.fasta \
  -reads1 sample_R1.fastq.gz \
  -reads2 sample_R2.fastq.gz \
  -output sample_run3 \
  -primer AGCTTCAGGGTTGAGATGTGTATAAGAGACAG \
  -mismatches 1 \
  -bwa-alg mem
  # -primer-start-window is not passed here; the notes below report a value of 0,159 in tpp_tools.py

Modifying the TPP Script

From your pipeline notes:

# Modify the TPP tools to set the correct window
vim ~/.local/lib/python3.10/site-packages/pytpp/tpp_tools.py

# Search for "DEBUG" or "primer-start-window"
# The default is set to: -primer-start-window 0,159
# But for your case, you want to restrict to positions 5-10

Implementation Logic

The code should:

Search for primer in Read 1
Check start position – must be between cycles 5-10 (0-indexed: positions 4-9)
Allow 1 mismatch in the primer sequence
Extract genomic DNA after the primer

Pseudocode example:

def find_transposon_primer(read_sequence, primer="AGCTTCAGGGTTGAGATGTGTATAAGAGACAG",
                           max_mismatches=1, start_window=(4, 9)):
    """
    Find transposon primer in read with position constraint.

    Args:
        read_sequence: The R1 read sequence
        primer: Transposon primer sequence
        max_mismatches: Maximum allowed mismatches (default: 1)
        start_window: Inclusive (min_pos, max_pos) for primer start (0-indexed)

    Returns:
        (genomic_sequence, start_pos): sequence after the primer and the
        0-indexed primer start if found, else (None, None)
    """
    primer_len = len(primer)

    # Cycles 5-10 (1-indexed) correspond to positions 4-9 (0-indexed)
    for start_pos in range(start_window[0], start_window[1] + 1):
        end_pos = start_pos + primer_len

        # Later start positions would overrun the read as well
        if end_pos > len(read_sequence):
            break

        # Extract candidate sequence
        candidate = read_sequence[start_pos:end_pos]

        # Count mismatches (substitutions only; indels are not considered)
        mismatches = sum(1 for a, b in zip(candidate, primer) if a != b)

        if mismatches <= max_mismatches:
            # Found valid primer: everything after it is genomic sequence
            genomic_sequence = read_sequence[end_pos:]
            return genomic_sequence, start_pos

    return None, None  # Primer not found in valid position

Actual TPP Parameter

Looking at your pipeline, the modification shows:

# In tpp_tools.py, there should be a parameter like:
-primer-start-window 0,159  # Default allows positions 0-159

# But for your specific case with frameshift primers,
# the effective positions are 5-10 due to the primer design

Why Window is 0-159 but Effective is 5-10?

The 0-159 window is a permissive search range that lets TPP find the primer almost anywhere in the read. In practice, the staggered primer design means the transposon prefix is expected to start only at positions 5-10, so the effective range is much narrower than the configured one.

PCR primer structure:

Position:  1-33        34 onwards
           [Illumina]  [Transposon-annealing region]
           Adapter     staggered between the three primers

Read 1 as searched by the pipeline (primer start at cycle 5-10, primer length 32 nt):
Position:  1 to 4-9         5-10 to 36-41         37-42 onwards
           [Leading bases]  [Transposon primer]   [Genomic DNA]

Verification in Your Data

From your pipeline output:

# Break-down of total reads (49821406):
#  29481783 reads (59.2%) lack the expected Tn prefix
# Break-down of trimmed reads with valid Tn prefix (20339623):

This shows that 40.8% of reads (20,339,623 of 49,821,406) carry the transposon primer inside the search window; the remaining 59.2% lack the prefix and are not mapped.
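
The percentages can be recomputed from the reported counts, and the claimed start positions can be checked on real reads with a simple histogram. This is a rough sanity check under stated assumptions: `start_histogram` is a hypothetical helper and uses exact matching only, so reads with a mismatch in the primer are counted as not found.

```python
from collections import Counter

TOTAL_READS = 49821406
LACK_PREFIX = 29481783
WITH_PREFIX = 20339623

# The two categories reported by the pipeline should add up to the total.
assert LACK_PREFIX + WITH_PREFIX == TOTAL_READS
print(f"with prefix:    {WITH_PREFIX / TOTAL_READS:.1%}")
print(f"without prefix: {LACK_PREFIX / TOTAL_READS:.1%}")

def start_histogram(seqs, primer):
    """Count 1-based start cycles of exact primer matches; 0 means not found."""
    return Counter(seq.find(primer) + 1 for seq in seqs)

PRIMER = "AGCTTCAGGGTTGAGATGTGTATAAGAGACAG"
demo_reads = ["ACGT" + PRIMER + "TTGACC", "ACGTACG" + PRIMER + "GGA", "TTTTTTTT"]
print(sorted(start_histogram(demo_reads, PRIMER).items()))
```

If the library behaves as described, a histogram over a few thousand real R1 sequences should concentrate on cycles 5 to 10.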

Summary

Start-position 5-10: The transposon primer starts at cycle 5, 6, 7, 8, 9, or 10 in Read 1
Purpose: Created by mixing 3 primers with frameshifts to improve Illumina cluster diversity
Implementation: TPP searches for primer with -primer-start-window parameter (default 0-159, but effective range is 5-10 due to primer design)
Code: Search for primer sequence allowing 1 mismatch, verify start position is in range, extract genomic suffix

TODO: modify the actual TPP code and document the specific parameters involved.
