Processing Data_Patricia_Transposon_2025 (Workflow for Structural Variant Calling in Nanopore Sequencing)

install mambaforge https://conda-forge.org/miniforge/ (recommended)

#download Mambaforge-24.9.2-0-Linux-x86_64.sh from website
chmod +x Mambaforge-24.9.2-0-Linux-x86_64.sh
./Mambaforge-24.9.2-0-Linux-x86_64.sh

To activate this environment, use:
    micromamba activate /home/jhuang/mambaforge
Or to execute a single command in this environment, use:
    micromamba run -p /home/jhuang/mambaforge mycommand
installation finished.

Do you wish to update your shell profile to automatically initialize conda?
This will activate conda on startup and change the command prompt when activated.
If you'd prefer that conda's base environment not be activated on startup,
  run the following command when conda is activated:

conda config --set auto_activate_base false

You can undo this by running `conda init --reverse $SHELL`? [yes|no]
[no] >>> yes
no change     /home/jhuang/mambaforge/condabin/conda
no change     /home/jhuang/mambaforge/bin/conda
no change     /home/jhuang/mambaforge/bin/conda-env
no change     /home/jhuang/mambaforge/bin/activate
no change     /home/jhuang/mambaforge/bin/deactivate
no change     /home/jhuang/mambaforge/etc/profile.d/conda.sh
no change     /home/jhuang/mambaforge/etc/fish/conf.d/conda.fish
no change     /home/jhuang/mambaforge/shell/condabin/Conda.psm1
no change     /home/jhuang/mambaforge/shell/condabin/conda-hook.ps1
no change     /home/jhuang/mambaforge/lib/python3.12/site-packages/xontrib/conda.xsh
no change     /home/jhuang/mambaforge/etc/profile.d/conda.csh
modified      /home/jhuang/.bashrc
==> For changes to take effect, close and re-open your current shell. <==
no change     /home/jhuang/mambaforge/condabin/conda
no change     /home/jhuang/mambaforge/bin/conda
no change     /home/jhuang/mambaforge/bin/conda-env
no change     /home/jhuang/mambaforge/bin/activate
no change     /home/jhuang/mambaforge/bin/deactivate
no change     /home/jhuang/mambaforge/etc/profile.d/conda.sh
no change     /home/jhuang/mambaforge/etc/fish/conf.d/conda.fish
no change     /home/jhuang/mambaforge/shell/condabin/Conda.psm1
no change     /home/jhuang/mambaforge/shell/condabin/conda-hook.ps1
no change     /home/jhuang/mambaforge/lib/python3.12/site-packages/xontrib/conda.xsh
no change     /home/jhuang/mambaforge/etc/profile.d/conda.csh
no change     /home/jhuang/.bashrc
No action taken.
WARNING conda.common.path.windows:_path_to(100): cygpath is not available, fallback to manual path conversion
WARNING conda.common.path.windows:_path_to(100): cygpath is not available, fallback to manual path conversion
Added mamba to /home/jhuang/.bashrc
==> For changes to take effect, close and re-open your current shell. <==
Thank you for installing Mambaforge!

Close your terminal window and open a new one, or run:
#source ~/mambaforge/bin/activate
conda --version
mamba --version

https://github.com/conda-forge/miniforge/releases
Note

    * After installation, please make sure that you do not have the Anaconda default channels configured.
        conda config --show channels
        conda config --remove channels defaults
        conda config --add channels conda-forge
        conda config --show channels
        conda config --set channel_priority strict
        #conda clean --all
        conda config --remove channels biobakery

    * !!!!Do not install anything into the base environment as this might break your installation. See here for details.!!!!

# --Deprecated method: mamba installing on conda--
#conda install -n base --override-channels -c conda-forge mamba 'python_abi=*=*cp*'
#    * Note that installing mamba into any other environment than base is not supported.
#
#conda activate base
#conda install conda
#conda uninstall mamba
#conda install mamba

2: install required Tools on the mamba env

    * Sniffles2: Detect structural variants, including transposons, from long-read alignments.
    * RepeatModeler2: Identify and classify transposons de novo.
    * RepeatMasker: Annotate known transposable elements using transposon libraries.
    * SVIM: An alternative structural variant caller optimized for long-read sequencing, if needed.
    * SURVIVOR: Consolidate structural variants across samples for comparative analysis.

    mamba deactivate
    # Create a new conda environment
    mamba create -n transposon_long python=3.6 -y

    # Activate the environment
    mamba activate transposon_long

    mamba install -c bioconda sniffles
    mamba install -c bioconda repeatmodeler repeatmasker

    # configure repeatmasker database
    mamba info --envs
    cd /home/jhuang/mambaforge/envs/transposon_long/share/RepeatMasker

    #mamba install python=3.6
    mamba install -c bioconda svim
    mamba install -c bioconda survivor

Test the installed tools

# Check versions
sniffles --version
RepeatModeler -h
RepeatMasker -h
svim --help
SURVIVOR --help
mamba install -c conda-forge perl r

Data Preparation

Raw Signal Data: Nanopore devices generate electrical signal data as DNA passes through the nanopore.
Basecalling: Tools like Guppy or Dorado are used to convert raw signals into nucleotide sequences (FASTQ files).

Preprocessing

Quality Filtering: Remove low-quality reads using tools like Filtlong or NanoFilt.
Adapter Trimming: Identify and remove sequencing adapters with tools like Porechop.

(Optional) Variant Calling for SNP and Indel Detection:

Tools like Medaka, Longshot, or Nanopolish analyze the aligned reads to identify SNPs and small indels.

(OFFICIAL STARTING POINT) Alignment and Structural Variant Calling: Tools such as Sniffles or SVIM detect large insertions, deletions, and other structural variants. 使用长读长测序工具如 SVIM 或 Sniffles 检测结构变异(e.g. 散在性重复序列)。

  #NOTE that the ./batch1_depth25/trycycler_WT/reads.fastq and F24A430001437_BACctmoD/BGI_result/Separate/${sample}/1.Cleandata/${sample}.filtered_reads.fq.gz are the same!

  # -- PREPARING the input fastq-data, merge the fastqz and move the top-directory

  # Under raw_data/no_sample_id/20250731_0943_MN45170_FBD12615_97f118c2/fastq_pass
  zcat ./barcode01/FBD12615_pass_barcode01_97f118c2_aa46ecf7_0.fastq.gz ./barcode01/FBD12615_pass_barcode01_97f118c2_aa46ecf7_1.fastq.gz ./barcode01/FBD12615_pass_barcode01_97f118c2_aa46ecf7_2.fastq.gz ./barcode01/FBD12615_pass_barcode01_97f118c2_aa46ecf7_3.fastq.gz ... | gzip > HD46_1.fastq.gz
  mv ./raw_data/no_sample_id/20250731_0943_MN45170_FBD12615_97f118c2/fastq_pass/HD46_1.fastq.gz ~/DATA/Data_Patricia_Transposon_2025

    #this are the corresponding sample names:
    #barcode 1: HD46-1
    #barcode 2: HD46-2
    #barcode 3: HD46-3
    #barcode 4: HD46-4
    mv barcode01.fastq.gz HD46_1.fastq.gz
    mv barcode02.fastq.gz HD46_2.fastq.gz
    mv barcode03.fastq.gz HD46_3.fastq.gz
    mv barcode04.fastq.gz HD46_4.fastq.gz

  # -- CALCULATE the coverages
    #!/bin/bash

    for bam in barcode*_minimap2.sorted.bam; do
        echo "Processing $bam ..."
        avg_cov=$(samtools depth -a "$bam" | awk '{sum+=$3; cnt++} END {if (cnt>0) print sum/cnt; else print 0}')
        echo -e "${bam}\t${avg_cov}" >> coverage_summary.txt
    done

  # ---- !!!! LOGIN the suitable environment !!!! ----

  mamba activate transposon_long

  # -- TODO: AFTERNOON_DEBUG_THIS: FAILED and not_USED: Alignment and Detect structural variants in each sample using SVIM which used aligner ngmlr or mimimap2
  #mamba install -c bioconda ngmlr
  mamba install -c bioconda svim

  # ---- Option_1: minimap2 (aligner) + SVIM (structural variant caller) --> SUCCESSFUL ----
  for sample in HD46_Ctrl HD46_1 HD46_2 HD46_3 HD46_4 HD46_5 HD46_6 HD46_7 HD46_8 HD46_13; do
      #INS,INV,DUP:TANDEM,DUP:INT,BND
      svim reads --aligner minimap2 --nanopore minimap2+svim_${sample}    ${sample}.fastq.gz CP020463.fasta  --cores 20 --types INS --min_sv_size 100 --sequence_allele --insertion_sequences --read_names;
  done

  #svim alignment svim_alignment_minmap2_1_re 1.sorted.bam CP020463_.fasta --types INS --sequence_alleles --insertion_sequences --read_names

  # ---- Option_2: minamap2 (aligner) + Sniffles2 (structural variant caller) --> SUCCESSFUL ----
  #Minimap2: A commonly used aligner for nanopore sequencing data.
  #    Align Long Reads to the WT Reference using Minimap2
  #sniffles -m WT.sorted.bam -v WT.vcf -s 10 -l 50 -t 60
  #  -s 20: Requires at least 20 reads to support an SV for reporting. --> 10
  #  -l 50: Reports SVs that are at least 50 base pairs long.
  #  -t 60: Uses 60 threads for faster processing.
  for sample in HD46_Ctrl HD46_1 HD46_2 HD46_3 HD46_4 HD46_5 HD46_6 HD46_7 HD46_8 HD46_13; do
      #minimap2 --MD -t 60 -ax map-ont CP020463.fasta ./batch1_depth25/trycycler_${sample}/reads.fastq | samtools sort -o ${sample}.sorted.bam
      minimap2 --MD -t 60 -ax map-ont CP020463.fasta ${sample}.fastq.gz | samtools sort -o ${sample}_minimap2.sorted.bam
      samtools index ${sample}_minimap2.sorted.bam
      sniffles -m ${sample}_minimap2.sorted.bam -v ${sample}_minimap2+sniffles.vcf -s 10 -l 50 -t 60
      #QUAL < 20 ||
      bcftools filter -e "INFO/SVTYPE != 'INS'" ${sample}_minimap2+sniffles.vcf > ${sample}_minimap2+sniffles_filtered.vcf
  done

    #Estimating parameter...
    #        Max dist between aln events: 44
    #        Max diff in window: 76
    #        Min score ratio: 2
    #        Avg DEL ratio: 0.0112045
    #        Avg INS ratio: 0.0364027
    #Start parsing... CP020463
    #                # Processed reads: 10000
    #                # Processed reads: 20000
    #        Finalizing  ..
    #Start genotype calling:
    #        Reopening Bam file for parsing coverage
    #        Finalizing  ..
    #Estimating parameter...
    #        Max dist between aln events: 28
    #        Max diff in window: 89
    #        Min score ratio: 2
    #        Avg DEL ratio: 0.013754
    #        Avg INS ratio: 0.17393
    #Start parsing... CP020463
    #                # Processed reads: 10000
    #                # Processed reads: 20000
    #                # Processed reads: 30000
    #                # Processed reads: 40000

    # Results:
    # * barcode01_minimap2+sniffles.vcf
    # * barcode01_minimap2+sniffles_filtered.vcf
    # * barcode02_minimap2+sniffles.vcf
    # * barcode02_minimap2+sniffles_filtered.vcf
    # * barcode03_minimap2+sniffles.vcf
    # * barcode03_minimap2+sniffles_filtered.vcf
    # * barcode04_minimap2+sniffles.vcf
    # * barcode04_minimap2+sniffles_filtered.vcf

  #ERROR: No MD string detected! Check bam file! Otherwise generate using e.g. samtools. --> No results!
  #for sample in barcode01 barcode02 barcode03 barcode04; do
  #    sniffles -m svim_reads_minimap2_${sample}/${sample}.fastq.minimap2.coordsorted.bam -v sniffles_minimap2_${sample}.vcf -s 10 -l 50 -t 60
  #    bcftools filter -e "INFO/SVTYPE != 'INS'" sniffles_minimap2_${sample}.vcf > sniffles_minimap2_${sample}_filtered.vcf
  #done

  # ---- Option_3: NGMLR (aligner) + SVIM (structural variant caller) --> SUCCESSFUL ----
  for sample in HD46_Ctrl HD46_1 HD46_2 HD46_3 HD46_4 HD46_5 HD46_6 HD46_7 HD46_8 HD46_13; do
      svim reads --aligner ngmlr --nanopore    ngmlr+svim_${sample}       ${sample}.fastq.gz CP020463.fasta  --cores 10;
  done

  # ---- Option_4: NGMLR (aligner) + sniffles (structural variant caller) --> SUCCESSFUL ----
  for sample in HD46_Ctrl HD46_1 HD46_2 HD46_3 HD46_4 HD46_5 HD46_6 HD46_7 HD46_8 HD46_13; do
      sniffles -m ngmlr+svim_${sample}/${sample}.fastq.ngmlr.coordsorted.bam -v ${sample}_ngmlr+sniffles.vcf -s 10 -l 50 -t 60
      bcftools filter -e "INFO/SVTYPE != 'INS'" ${sample}_ngmlr+sniffles.vcf > ${sample}_ngmlr+sniffles_filtered.vcf
  done

Compare and integrate all results produced by minimap2+sniffles and ngmlr+sniffles, and check them each position in IGV!

    mv HD46_Ctrl_minimap2+sniffles_filtered.vcf HD46-Ctrl_minimap2+sniffles_filtered.vcf
    mv HD46_Ctrl_ngmlr+sniffles_filtered.vcf    HD46-Ctrl_ngmlr+sniffles_filtered.vcf
    mv HD46_1_minimap2+sniffles_filtered.vcf    HD46-1_minimap2+sniffles_filtered.vcf
    mv HD46_1_ngmlr+sniffles_filtered.vcf       HD46-1_ngmlr+sniffles_filtered.vcf
    mv HD46_2_minimap2+sniffles_filtered.vcf    HD46-2_minimap2+sniffles_filtered.vcf
    mv HD46_2_ngmlr+sniffles_filtered.vcf       HD46-2_ngmlr+sniffles_filtered.vcf
    mv HD46_3_minimap2+sniffles_filtered.vcf    HD46-3_minimap2+sniffles_filtered.vcf
    mv HD46_3_ngmlr+sniffles_filtered.vcf       HD46-3_ngmlr+sniffles_filtered.vcf
    mv HD46_4_minimap2+sniffles_filtered.vcf    HD46-4_minimap2+sniffles_filtered.vcf
    mv HD46_4_ngmlr+sniffles_filtered.vcf       HD46-4_ngmlr+sniffles_filtered.vcf
    mv HD46_5_minimap2+sniffles_filtered.vcf    HD46-5_minimap2+sniffles_filtered.vcf
    mv HD46_5_ngmlr+sniffles_filtered.vcf       HD46-5_ngmlr+sniffles_filtered.vcf
    mv HD46_6_minimap2+sniffles_filtered.vcf    HD46-6_minimap2+sniffles_filtered.vcf
    mv HD46_6_ngmlr+sniffles_filtered.vcf       HD46-6_ngmlr+sniffles_filtered.vcf
    mv HD46_7_minimap2+sniffles_filtered.vcf    HD46-7_minimap2+sniffles_filtered.vcf
    mv HD46_7_ngmlr+sniffles_filtered.vcf       HD46-7_ngmlr+sniffles_filtered.vcf
    mv HD46_8_minimap2+sniffles_filtered.vcf    HD46-8_minimap2+sniffles_filtered.vcf
    mv HD46_8_ngmlr+sniffles_filtered.vcf       HD46-8_ngmlr+sniffles_filtered.vcf
    mv HD46_13_minimap2+sniffles_filtered.vcf   HD46-13_minimap2+sniffles_filtered.vcf
    mv HD46_13_ngmlr+sniffles_filtered.vcf      HD46-13_ngmlr+sniffles_filtered.vcf

  conda activate plot-numpy1
  #python generate_common_vcf.py
  #mv common_variants.xlsx putative_transposons.xlsx

  # * Reads each of your VCFs.
  # * Filters variants → only keep those with FILTER == PASS.
  # * Compares the two aligner methods (minimap2+sniffles2 vs ngmlr+sniffles2) per sample.
  # * Keeps only variants that appear in both methods for the same sample.
  # * Outputs: An Excel file with the common variants and a log text file listing which variants were filtered out, and why (not_PASS or not_COMMON_in_two_VCF).

  #python generate_fuzzy_common_vcf_v1.py
  #Sample   PASS_minimap2   PASS_ngmlr  COMMON
  #  HD46-Ctrl_Ctrl 39  29  28
  #  HD46-1 39  32  29
  #  HD46-2 40  32  28
  #  HD46-3 38  30  27
  #  HD46-4 46  35  32
  #  HD46-5 40  35  31
  #  HD46-6 43  35  30
  #  HD46-7 40  33  28
  #  HD46-8 37  20  11
  #  HD46-13    39  38  27

#Sample PASS_minimap2   PASS_ngmlr  COMMON_FINAL
#HD46-Ctrl_Ctrl 39  29  6
#HD46-1 39  32  8
#HD46-2 40  32  8
#HD46-3 38  30  6
#HD46-4 46  35  8
#HD46-5 40  35  9
#HD46-6 43  35  10
#HD46-7 40  33  8
#HD46-8 37  20  4
#HD46-13    39  38  5

#!!!! Summarize the results of ngmlr+sniffles !!!!
python merge_ngmlr+sniffles_filtered_results_and_summarize.py

#!!!! Post-Processing !!!!
#DELETE "2186168    N

. PASS” in Sheet HD46-13 and Summary #DELETE “2427785 N CGTCAGAATCGCTGTCTGCGTCCGAGTCACTGTCTGAGTCTGAATCACTATCTGCGTCTGAGTCACTGTCTG . PASS” due to “0/1:169:117” in HD46-13 and Summary #DELETE “2441640 N GCTCATTAAGAATCATTAAATTAC . PASS” due to 0/1:170:152 in HD46-13 and Summary

Source code of merge_ngmlr+sniffles_filtered_results_and_summarize.py

import os
import pandas as pd

# List all ngmlr VCF files
vcf_files = sorted([f for f in os.listdir('.') if f.endswith('_ngmlr+sniffles_filtered.vcf')])

log_lines = []
df_list = []

# Function to read VCF and filter PASS
def read_vcf(vcf_path, log_lines, sample):
    variants = []
    with open(vcf_path) as f:
        for line in f:
            if line.startswith('#'):
                continue
            parts = line.strip().split('\t')
            pos, ref, alt, qual, flt = parts[1], parts[3], parts[4], parts[5], parts[6]
            info = parts[7] if len(parts) > 7 else '.'
            fmt = parts[8] if len(parts) > 8 else '.'
            last_column = parts[9] if len(parts) > 9 else '.'  # Keep original column
            var_id = f"{pos}:{ref}>{alt}"
            if flt != 'PASS':
                log_lines.append(f"{sample}\t{var_id}\tnot_PASS")
                continue
            variants.append({
                'POS': int(pos),
                'REF': ref,
                'ALT': alt,
                'QUAL': qual,
                'FILTER': flt,
                'INFO': info,
                'FORMAT': fmt,
                sample: last_column,  # Use only genotype for individual sheets
                f"{sample}_with_alt": alt  # Use only ALT sequence for summary
            })
    return pd.DataFrame(variants)

# Read all VCFs
sample_names = []
sample_with_alt_columns = []
for f in vcf_files:
    sample = os.path.basename(f).replace('_ngmlr+sniffles_filtered.vcf', '')
    sample_names.append(sample)
    sample_with_alt_columns.append(f"{sample}_with_alt")
    df = read_vcf(f, log_lines, sample)
    df_list.append(df)

# Merge all variants into one DataFrame
all_variants = pd.concat(df_list, ignore_index=True)

# Fill missing sample columns with hyphen for summary
for sample in sample_names:
    if sample not in all_variants.columns:
        all_variants[sample] = ''
    if f"{sample}_with_alt" not in all_variants.columns:
        all_variants[f"{sample}_with_alt"] = '-'

# Pivot table for summary using exact POS, using columns with ALT
summary_df = all_variants.groupby(['POS', 'REF'])[sample_with_alt_columns].first().reset_index()

# Rename columns in summary to remove '_with_alt' suffix
summary_df.columns = ['POS', 'REF'] + sample_names

# Replace NaN or empty strings with hyphen in sample columns
summary_df[sample_names] = summary_df[sample_names].fillna('-').replace('', '-')

# Count how many samples have a non-hyphen value
summary_df['Count'] = summary_df[sample_names].apply(lambda row: sum(val != '-' for val in row), axis=1)

# Save Excel with individual sheets and summary
writer = pd.ExcelWriter('merged_ngmlr+sniffles_variants.xlsx', engine='openpyxl')

# Save individual sample sheets
for df in df_list:
    sheet_name = df.columns[-2]  # Use sample name (not the _with_alt column)
    # Select only relevant columns for individual sheets (exclude _with_alt)
    df = df[[col for col in df.columns if not col.endswith('_with_alt')]]
    df.to_excel(writer, sheet_name=sheet_name, index=False)

# Save summary sheet
summary_df.to_excel(writer, sheet_name='Summary', index=False)
writer.close()

# Write log
with open('ngmlr_filtering_log.txt', 'w') as logf:
    logf.write('Sample\tVariant\tReason\n')
    for line in log_lines:
        logf.write(line + '\n')

print('Done: merged_ngmlr+sniffles_variants.xlsx with ALT sequences or hyphens in summary and genotype-only in individual sheets.')

———————– END of the transposon calculation. The following steps (e.g. assembly based on the long-reads is not necessary for the transposon analysis) —————————-
(NOT_USED) Filtering low-complexity insertions using RepeatMasker (TODO: how to use RepeatModeler to generate own lib?)
```
  python vcf_to_fasta.py variants.vcf variants.fasta
  #python filter_low_complexity.py variants.fasta filtered_variants.fasta retained_variants.fasta
  #Using RepeatMasker to filter the low-complexity fasta, the used h5 lib is
  /home/jhuang/mambaforge/envs/transposon_long/share/RepeatMasker/Libraries/Dfam.h5    #1.9G
  python /home/jhuang/mambaforge/envs/transposon_long/share/RepeatMasker/famdb.py -i /home/jhuang/mambaforge/envs/transposon_long/share/RepeatMasker/Libraries/Dfam.h5 names 'bacteria' | head
  Exact Matches
  =============
  2 bacteria (blast name), Bacteria 
```
(scientific name), eubacteria (genbank common name), Monera (in-part), Procaryotae (in-part), Prokaryota (in-part), Prokaryotae (in-part), prokaryote (in-part), prokaryotes (in-part) Non-exact Matches ================= 1783272 Terrabacteria group (scientific name) 91061 Bacilli (scientific name), Bacilli Ludwig et al. 2010 (authority), Bacillus/Lactobacillus/Streptococcus group (synonym), Firmibacteria (synonym), Firmibacteria Murray 1988 (authority) 1239 Bacillaeota (synonym), Bacillaeota Oren et al. 2015 (authority), Bacillota (synonym), Bacillus/Clostridium group (synonym), clostridial firmicutes (synonym), Clostridium group firmicutes (synonym), Firmacutes (synonym), firmicutes (blast name), Firmicutes (scientific name), Firmicutes corrig. Gibbons and Murray 1978 (authority), Low G+C firmicutes (synonym), low G+C Gram-positive bacteria (common name), low GC Gram+ (common name) Summary of Classes within Firmicutes: * Bacilli (includes many common pathogenic and non-pathogenic Gram-positive bacteria, taxid=91061) * Bacillus (e.g., Bacillus subtilis, Bacillus anthracis) * Staphylococcus (e.g., Staphylococcus aureus, Staphylococcus epidermidis) * Streptococcus (e.g., Streptococcus pneumoniae, Streptococcus pyogenes) * Listeria (e.g., Listeria monocytogenes) * Clostridia (includes many anaerobic species like Clostridium and Clostridioides) * Erysipelotrichia (intestinal bacteria, some pathogenic) * Tissierellia (less-studied, veterinary relevance) * Mollicutes (cell wall-less, includes Mycoplasma species) * Negativicutes (includes some Gram-negative, anaerobic species) RepeatMasker -species Bacilli -pa 4 -xsmall variants.fasta python extract_unmasked_seq.py variants.fasta.masked unmasked_variants.fasta #bcftools filter -i ‘QUAL>30 && INFO/SVLEN>100’ variants.vcf -o filtered.vcf # #bcftools view -i ‘SVTYPE=”INS”‘ variants.vcf | bcftools query -f ‘%CHROM\t%POS\t%REF\t%ALT\t%INFO\n’ > insertions.txt #mamba install -c bioconda vcf2fasta #vcf2fasta variants.vcf -o insertions.fasta #grep “SEQS” variants.vcf | awk ‘{ print $1, $2, $4, $5, $8 }’ > insertions.txt #python3 filtering_low_complexity.py # #vcftools –vcf input.vcf –recode –out filtered_output –minSVLEN 100 #bcftools filter -e ‘INFO/SEQS ~ “^(G+|C+|T+|A+){4,}”‘ variants.vcf -o filtered.vcf # — calculate the percentage of reads To calculate the percentage of reads that contain the insertion from the VCF entry, use the INFO and FORMAT fields provided in the VCF record. Step 1: Extract Relevant Information In the provided VCF entry: RE (Reads Evidence): 733 – the total number of reads supporting the insertion. GT (Genotype): 1/1 – this indicates a homozygous insertion, meaning all reads covering this region are expected to have the insertion. AF (Allele Frequency): 1 – a 100% allele frequency, indicating that every read in this sample supports the insertion. DR (Depth Reference): 0 – the number of reads supporting the reference allele. DV (Depth Variant): 733 – the number of reads supporting the variant allele (insertion). Step 2: Calculate Percentage of Reads Supporting the Insertion Using the formula: Percentage of reads with insertion=(DVDR+DV)×100 Percentage of reads with insertion=(DR+DVDV)×100 Substitute the values: Percentage=(7330+733)×100=100% Percentage=(0+733733)×100=100% Conclusion Based on the VCF record, 100% of the reads support the insertion, indicating that the insertion is fully present in the sample (homozygous insertion). This is consistent with the AF=1 and GT=1/1 fields. * In your VCF file generated by Sniffles, the REF=N in the results has a specific meaning: * In a standard VCF, the REF field usually contains the reference base(s) at the variant position. * For structural variants (SVs), especially insertions, there is no reference sequence replaced; the insertion occurs between reference bases. * Therefore, Sniffles uses N as a placeholder in the REF field to indicate “no reference base replaced”. * The actual inserted sequence is then stored in the ALT field.

Why some records have UNRESOLVED in the FILTER field in the Excel output.

1. Understanding the format

    The data appears to be structural variant (SV) calls from Sniffles, probably in a VCF-like tabular format exported to Excel:

        * gi|1176884116|gb|CP020463.1| → reference sequence
        * Positions: 1855752 and 2422820
        * N → insertion event
        * SVLEN=999 → size of the insertion
        * AF → allele frequency
        * GT:DR:DV → genotype, depth reference, depth variant (1/1:0:678, example values for a PASS variant)
        * FILTER → whether the variant passed filters (UNRESOLVED means it didn’t pass)

2. What UNRESOLVED usually means

    In Sniffles:

    * UNRESOLVED is assigned to SVs when the tool cannot confidently resolve the exact sequence or breakpoint.
    * Reasons include:
        - Low read support (RE, DV) relative to the expected coverage
        - Ambiguous alignment at repetitive regions
        - Conflicting strand or orientation signals
        - Allele frequency inconsistent with expectations

3. Examine your two records

    First record

        POS: 1855752
        SVTYPE: INS
        SVLEN: 999
        RE: 68
        AF: 1
        GT: 1/1
        FILTER: UNRESOLVED

    Observations:

    * AF = 1 → allele frequency 100%, homozygous insertion
    * RE = 68 → 68 reads support the variant, decent coverage
    * Still UNRESOLVED → likely because Sniffles could not resolve the inserted sequence precisely; sometimes long insertions in repetitive regions are hard to reconstruct fully even with good read support.

    Second record

        POS: 2422820
        SVTYPE: INS
        SVLEN: 999
        RE: 22
        AF: 0.025522
        GT: 0/0
        FILTER: UNRESOLVED

    Observations:

    * AF = 0.0255 → very low allele frequency (~2.5%)
    * RE = 22, DR = 840 → very low variant reads vs reference
    * GT = 0/0 → homozygous reference
    * Sniffles marks it UNRESOLVED because the variant is essentially noise, not confidently detected.

4. Key difference between the two
    Feature First record    Second record
    Allele frequency (AF)   1 (high)    0.0255 (very low)
    Variant reads (RE)  68  22
    Genotype (GT)   1/1 0/0
    Reason for UNRESOLVED   Unresolvable inserted sequence

✅ 5. Conclusion

    * Sniffles marks a variant as UNRESOLVED when the SV cannot be confidently characterized.
    * Even if there is good read support (first record), complex insertions can’t always be reconstructed fully.
    * Very low allele frequency (second record) also triggers UNRESOLVED because the signal is too weak compared to background noise.
    * Essentially: “UNRESOLVED” ≠ bad data, it’s just unresolved uncertainty.

(NOT_SURE_HOW_TO_USE) Polishing of assembly: Use tools like Medaka to refine variant calls by leveraging consensus sequences derived from nanopore data.
```
  mamba install -c bioconda medaka
  medaka-consensus -i aligned_reads.bam -r reference.fasta -o polished_output -t 4
```

Compare Insertions Across Samples

Merge Variants Across Samples: Use SURVIVOR to merge and compare the detected insertions in all samples against the WT:

SURVIVOR merge input_vcfs.txt 1000 1 1 1 0 30 merged.vcf

    Input: List of VCF files from Sniffles2.
    Output: A consolidated VCF file with shared and unique variants.

Filter WT Insertions:

    Identify transposons present only in samples 1–9 by subtracting WT variants using bcftools:

        bcftools isec WT.vcf merged.vcf -p comparison_results

Validate and Visualize

Visualize with IGV: Use IGV to inspect insertion sites in the alignment and confirm quality.

igv.sh

Validate Findings:
    Perform PCR or additional sequencing for key transposon insertion sites to confirm results.

Alternatives to TEPID for Long-Read Data

If you’re looking for transposon-specific tools for long reads:

    REPET: A robust transposon annotation tool compatible with assembled genomes.
    EDTA (Extensive de novo TE Annotator):
        A pipeline to identify, classify, and annotate transposons.
        Works directly on your assembled genomes.

        perl EDTA.pl --genome WT.fasta --type all

The WT.vcf file in the pipeline is generated by detecting structural variants (SVs) in the wild-type (WT) genome aligned against itself or using it as a baseline reference. Here’s how you can generate the WT.vcf:

Steps to Generate WT.vcf
1. Align WT Reads to the WT Reference Genome

The goal here is to create an alignment of the WT sequencing data to the WT reference genome to detect any self-contained structural variations, such as native insertions, deletions, or duplications.

Command using Minimap2:

minimap2 -ax map-ont WT.fasta WT_reads.fastq | samtools sort -o WT.sorted.bam

Index the BAM file:

samtools index WT.sorted.bam

2. Detect Structural Variants with Sniffles2

Run Sniffles2 on the WT alignment to call structural variants:

sniffles --input WT.sorted.bam --vcf WT.vcf

This step identifies:

    Native transposons and insertions present in the WT genome.
    Other structural variants that are part of the reference genome or sequencing artifacts.

Key parameters to consider:

    --min_support: Adjust based on your WT sequencing coverage.
    --max_distance: Define proximity for merging variants.
    --min_length: Set a minimum SV size (e.g., >50 bp for transposons).

Clean and Filter the WT.vcf, Variant Filtering: Remove low-confidence variants based on read depth, quality scores, or allele frequency.

To ensure the WT.vcf only includes relevant transposons or SVs:

    Use bcftools or similar tools to filter out low-confidence variants:

    bcftools filter -e "QUAL < 20 || INFO/SVTYPE != 'INS'" WT.vcf > WT_filtered.vcf
    bcftools filter -e "QUAL < 1 || INFO/SVTYPE != 'INS'" 1_.vcf > 1_filtered_.vcf

NOTE that in this pipeline, the WT.fasta (reference genome) is typically a high-quality genome sequence from a database or a well-annotated version of your species’ genome. It is not assembled from the WT.fastq sequencing reads in this context. Here’s why:

Why Use a Reference Genome (WT.fasta) from a Database?

    Higher Quality and Completeness:
        Database references (e.g., NCBI, Ensembl) are typically well-assembled, highly polished, and annotated. They serve as a reliable baseline for variant detection.

    Consistency:
        Using a standard reference ensures consistent comparisons across your WT and samples (1–9). Variants detected will be relative to this reference, not influenced by possible assembly errors.

    Saves Time:
        Assembling a reference genome from WT reads requires significant computational effort. Using an existing reference streamlines the analysis.

Alternative: Assembling WT from FASTQ

If you don’t have a high-quality reference genome (WT.fasta) and must rely on your WT FASTQ reads:

    Assemble the genome from your WT.fastq:
        Use long-read assemblers like Flye, Canu, or Shasta to create a draft genome.

    flye --nano-raw WT.fastq --out-dir WT_assembly --genome-size

Polish the assembly using tools like Racon (with the same reads) or Medaka for higher accuracy. Use the assembled and polished genome as your WT.fasta reference for further steps. Key Takeaways: If you have access to a reliable, high-quality reference genome, use it as the WT.fasta. Only assemble WT.fasta from raw reads (WT.fastq) if no database reference is available for your organism.

Annotate Transposable Elements: Tools like ANNOVAR or SnpEff provide functional insights into the detected variants.

# -- (successful!) MANUALLY Search for all found insertion sequences at https://tncentral.ncc.unesp.br/blast/ !
# Or use the program available at https://github.com/danillo-alvarenga/tncomp_finder if there are numerous matches.
#https://tncentral.ncc.unesp.br/report/te/Tn551-Y13600.1

# -- (failed!) try TEPID for annotation
mamba install tepid=0.10 -c bioconda
#(tepid_env)
for sample in WT 1 2 3 4 5 7 8 9 10; do
    tepid-map-se -x CP020463 -p 10 -n ${sample}_tepid -q  ../batch1_depth25/trycycler_${sample}/reads.fastq;
    tepid-discover -k -i -p 10 -n ${sample}_tepid -c ${sample}_tepid --se;
done

tepid-discover -k -i -p 10 -n 1_tepid -c 1.sorted.bam --se;

tepid-refine [-h] [--version] [-k] [-i INSERTIONS] [-d DELETIONS]
            [-p PROC] -t TE -n NAME -c CONC -s SPLIT -a AL

# -- (failed!) try EDTA for annotation
https://github.com/oushujun/EDTA
(transposon_long) mamba install -c conda-forge -c bioconda edta
mamba install bioconda::rmblast  # cd RepeatMasker; ./configure
EDTA.pl --genome CP020463.fasta --species others --threads 40

For general-purpose TE annotation: EDTA, RepeatMasker, or RepeatScout are your best options.
For de novo repeat identification: RepeatScout is highly effective.
For LTR retrotransposons: Use LTR_retriever.
For bacterial-specific annotations: Transposome, TEfinder, and ISfinder can be useful.

Validation: Cross-validate with short-read sequencing data if available.

(Optional) Assembly the nanopore-sequencing using

1. merge and prepare fastq.gz

2. calcuclate the precentage of coding DNA: https://www.ncbi.nlm.nih.gov/nuccore/CP020463 with additional plasmid.

    * Average gene length: 870.4 bp
    * Total coding region length: 2056168 bp
    * Percentage of coding DNA = (Total Coding Region Length / Total Genome Length) × 100 = 2056168 bp / 2454929 bp = 83.8%

3. Prepare the fastq.gz

    conda activate /home/jhuang/miniconda3/envs/trycycler  # under jhuang@hamm (10.169.63.113).

    rsync -a -P *.fastq.gz jhuang@10.169.63.113:/home/jhuang/DATA/Data_Patricia_Transposon_2025/
    for sample in barcode01 barcode02 barcode03 barcode04  HD46_Ctrl HD46_5 HD46_6 HD46_7 HD46_8 HD46_13; do
        mkdir trycycler_${sample};
        mv ${sample}.fastq.gz trycycler_${sample}/reads.fastq.gz;
        gunzip trycycler_${sample}/reads.fastq.gz;
    done
    #mkdir batch3;
    #cd batch3;
    #for sample in WT 1 2 3 4 5 7 8 9 10; do
    #    mkdir trycycler_${sample};
    #    cp ${sample}_nanopore/${sample}.fastq.gz trycycler_${sample}/reads.fastq.gz;
    #    gunzip trycycler_${sample}/reads.fastq.gz;
    #done

4. Running separate assemblies (6x4 times: canu, flye, miniasm_and_minipolish, necat, nextdenovo, raven) using trycycler_assembly_extra-thorough.sh (under the trycycler environment running the following steps)
        # batch1: min_read_cov=25; batch2: min_read_cov=50; batch3: min_read_cov=100 (if necessary on the monday!)
        cp ../Data_PaulBongarts_S.epidermidis_HDRNA/trycycler_*.sh ./
        #for sample in trycycler_HDRNA_10 trycycler_HDRNA_13; do
        for sample in barcode01 barcode02 barcode03 barcode04  HD46_Ctrl HD46_5 HD46_6 HD46_7 HD46_8 HD46_13; do
            cd trycycler_${sample};
            ../trycycler_assembly_extra-thorough.sh
            cd ..;
        done

# END!
#TODO: upload the nanopore-sequencing data to NCBI for the HDRNA_10 and HDRNA_13 to the correspoinding project.

(bengal3_ac3) jhuang@WS-2290C:~/DATA/Data_PaulBongarts_S.epidermidis_HDRNA/trycycler_HDRNA_10/assemblies$ mlst assembly_02.fasta
assembly_02.fasta       sepidermidis    87      arcC(7) aroE(1) gtr(1)  mutS(2) pyrR(2) tpiA(1) yqiL(1)
[15:32:46] If you like MLST, you're absolutely going to love wgMLST!
[15:32:46] Done.
(bengal3_ac3) jhuang@WS-2290C:~/DATA/Data_PaulBongarts_S.epidermidis_HDRNA/trycycler_HDRNA_10/assemblies$ mlst assembly_20.fasta

assembly_20.fasta       sepidermidis    87      arcC(7) aroE(1) gtr(1)  mutS(2) pyrR(2) tpiA(1) yqiL(1)
[15:33:01] You can use --label XXX to replace an ugly filename in the output.
[15:33:01] Done.

(bengal3_ac3) jhuang@WS-2290C:/mnt/md1/DATA/Data_PaulBongarts_S.epidermidis_HDRNA/trycycler_HDRNA_13/assemblies$ mlst assembly_02.fasta
assembly_02.fasta       sepidermidis    5       arcC(1) aroE(1) gtr(1)  mutS(2) pyrR(2) tpiA(1) yqiL(1)

(bengal3_ac3) jhuang@WS-2290C:/mnt/md1/DATA/Data_PaulBongarts_S.epidermidis_HDRNA/trycycler_HDRNA_13/assemblies$ mlst assembly_08.fasta

assembly_08.fasta       sepidermidis    5       arcC(1) aroE(1) gtr(1)  mutS(2) pyrR(2) tpiA(1) yqiL(1)
[15:38:22] Remember that --minscore is only used when using automatic scheme detection.
[15:38:22] Done.
(bengal3_ac3) jhuang@WS-2290C:/mnt/md1/DATA/Data_PaulBongarts_S.epidermidis_HDRNA/trycycler_HDRNA_13/assemblies$ mlst assembly_14.fasta
assembly_14.fasta       sepidermidis    5       arcC(1) aroE(1) gtr(1)  mutS(2) pyrR(2) tpiA(1) yqiL(1)
[15:38:50] Thanks for using mlst, I hope you found it useful.
[15:38:50] Done.
(bengal3_ac3) jhuang@WS-2290C:/mnt/md1/DATA/Data_PaulBongarts_S.epidermidis_HDRNA/trycycler_HDRNA_13/assemblies$ mlst assembly_20.fasta
assembly_20.fasta       sepidermidis    5       arcC(1) aroE(1) gtr(1)  mutS(2) pyrR(2) tpiA(1) yqiL(1)
[15:38:54] If you like MLST, you're going to absolutely love cgMLST!
[15:38:54] Done.

        #- under the directory batch3
        #for sample in trycycler_WT trycycler_1 trycycler_2 trycycler_3 trycycler_4 trycycler_5 trycycler_7 trycycler_8 trycycler_9 trycycler_10; do
        #    cd ${sample};
        #    ../trycycler_assembly_extra-thorough_threads5_cov100.sh;
        #    cd ..;
        #done

        #if ERROR, running separate assembly-methods raven and canu
        #for sample in trycycler_WT trycycler_1 trycycler_2 trycycler_3 trycycler_4 trycycler_5 trycycler_7 trycycler_8 trycycler_9 trycycler_10; do
        #    cd ${sample};
        #    ../trycycler_assembly_extra-thorough_raven.sh;
        #    cd ..;
        #done
        #
        #for sample in trycycler_WT trycycler_1 trycycler_2 trycycler_3 trycycler_4 trycycler_5 trycycler_7 trycycler_8 trycycler_9 trycycler_10; do
        #    cd ${sample};
        #    ../trycycler_assembly_extra-thorough_canu.sh;
        #    cd ..;
        #done

5. trycycler cluster

        for sample in trycycler_5 trycycler_7 trycycler_8 trycycler_9 trycycler_10; do
        for sample in trycycler_WT trycycler_1 trycycler_2 trycycler_3 trycycler_4; do
            cd ${sample};
            rm -rf trycycler
            trycycler cluster --threads 10 --assemblies assemblies/*.fasta --reads reads.fastq --out_dir trycycler;
            cd ..;
        done

6. trycycler reconcile

        cd trycycler_WT
        trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_001
        mkdir trycycler/cluster_001/1_contigs_removed
        mv trycycler/cluster_001/1_contigs/C_utg000001c.fasta trycycler/cluster_001/1_contigs_removed
        mv trycycler/cluster_001/1_contigs/K_ctg000000.fasta trycycler/cluster_001/1_contigs_removed
        mv trycycler/cluster_001/1_contigs/S_tig00000001.fasta trycycler/cluster_001/1_contigs_removed

        trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_002
            # -- under batch3 --
            176K Nov  5 12:38 P_bctg00000001.fasta
            122K Nov  5 12:38 D_bctg00000001.fasta
            95K Nov  5 12:38 J_bctg00000001.fasta
            78K Nov  5 12:38 J_bctg00000002.fasta
            #--> Error: unable to find a suitable common sequence

            69K Nov  5 17:24 G_tig00000004.fasta
            66K Nov  5 17:24 S_tig00000004.fasta
            66K Nov  5 17:24 M_tig00000004.fasta
            65K Nov  5 17:24 A_tig00000005.fasta
            #--> Error: unable to find a suitable common sequence
            #If repeat M_tig00000004.fasta twice, resulting in the following error: Circularising M_tig00000004:
            #  using S_tig00000004:
            #    unable to circularise: M_tig00000004's start and end were found in multiple places in S_tig00000004
            #Error: failed to circularise sequence M_tig00000004 because its start/end sequences were found in multiple ambiguous places in other sequences. #This is likely because M_tig00000004 starts/ends in a repetitive region. You can either manually repair its circularisation (and ensure it does #not start/end in a repetitive region) or exclude the sequence altogether and try again.

            47K Nov  5 17:24 V_bctg00000001.fasta
            45K Nov  5 17:24 C_utg000002c.fasta
            #Error: unable to find a suitable common sequence

            34K Nov  5 17:24 U_utg000004c.fasta
            34K Nov  5 17:24 N_contig_2.fasta
            34K Nov  5 17:24 T_contig_2.fasta
            #Error: unable to find a suitable common sequence

            23K Nov  5 17:24 I_utg000003c.fasta
            22K Nov  5 17:24 B_contig_2.fasta
            22K Nov  5 17:24 H_contig_2.fasta
            #Error: unable to find a suitable common sequence

            13K Nov  5 17:24 G_tig00000006.fasta
            12K Nov  5 17:24 M_tig00000003.fasta
            12K Nov  5 17:24 O_utg000002c.fasta
            12K Nov  5 17:24 S_tig00000003.fasta
            11K Nov  5 17:24 A_tig00000004.fasta
            #trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_002 --max_length_diff 1.2 --max_trim_seq_percent 20.0
            #--> SUCCESSFULLY saving sequences to file: trycycler/cluster_002/2_all_seqs.fasta

        trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_003
        trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_004 --max_length_diff 1.3 --max_trim_seq_percent 30.0
        trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_005 #Error: two or more input contigs are required
        trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_006 #Error: two or more input contigs are required
        trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_007 #Error: two or more input contigs are required

        cd ../trycycler_1
        trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_001
        mkdir trycycler/cluster_001/1_contigs_removed
        mv trycycler/cluster_001/1_contigs/I_utg000001l.fasta trycycler/cluster_001/1_contigs_removed
        mv trycycler/cluster_001/1_contigs/K_ctg000000.fasta trycycler/cluster_001/1_contigs_removed
        mv trycycler/cluster_001/1_contigs/S_tig00000001.fasta trycycler/cluster_001/1_contigs_removed
        mv trycycler/cluster_001/1_contigs/W_ctg000000.fasta trycycler/cluster_001/1_contigs_removed
        mv trycycler/cluster_001/1_contigs/A_tig00000001.fasta trycycler/cluster_001/1_contigs_removed
        mv trycycler/cluster_001/1_contigs/U_utg000001c.fasta trycycler/cluster_001/1_contigs_removed

        trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_002
        #TODO_TOMORROW_HERE!
        trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_003 --max_length_diff 1.7 --max_trim_seq_percent 70.0
        #Error: failed to circularise sequence I_utg000002l because its end could not be found in other sequences.
        (NOTE) using the plasmid from other isolate to help the cut of the sequence.
            #cp ~/DATA/Data_Patricia_Transposon/batch3/trycycler_WT/trycycler/cluster_004/7_final_consensus.fasta .
            #cat *.fasta > all.fasta
            #mafft --clustalout --adjustdirection all.fasta > all.aln
        cluster_004_con -------------------------------------------------------tatga
        I_utg000002l    agaatcagaattaggcgcataatttacaggaggtttgattatggctaagaaaaaatatga
        O_utg000002l    agaatcagaattaggcgcataatttacaggaggtttgattatggctaagaaaaaatatga
                                                                            *****
        cluster_004_con gcagaatcagaattaggcgcataatttacaggaggtttgattatggctaagaaaaaa---
        I_utg000002l    gcagaatcagaattaggcgcataatttacaggaggtttgattatggctaagaaaaatatg
        O_utg000002l    gcagaatcagaattaggcgcataatttacaggaggtttgattatggctaagaaaaatatg
                        ********************************************************

        grep "I_utg000002l" all.a_ > I_utg000002l.fasta
        grep "O_utg000002l" all.a_ > O_utg000002l.fasta
        cut -d' ' -f5 I_utg000002l.fasta > I_utg000002l_.fasta
        cut -d' ' -f5 O_utg000002l.fasta > O_utg000002l_.fasta
        sed 's/-//g' I_utg000002l_.fasta > I_utg000002l__.fasta
        sed 's/-//g' O_utg000002l_.fasta > O_utg000002l__.fasta
        seqtk seq I_utg000002l__.fasta -l 80 > I_utg000002l___.fasta
        seqtk seq O_utg000002l__.fasta -l 80 > O_utg000002l___.fasta

        #Cut two files Manually with the sequence: AGAATCAGAATTAGGCGCATAATTTACAGGAGGTTTGATTATGGCTAAGAAAAA
        cat I_utg000002l___.fasta O_utg000002l___.fasta > 2_all_seqs.fasta
        trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_004 #Error: two or more input contigs are required

        cd ../trycycler_2
        trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_001
        mkdir trycycler/cluster_001/1_contigs_removed
        mv trycycler/cluster_001/1_contigs/O_bctg00000000.fasta trycycler/cluster_001/1_contigs_removed
        trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_002
        trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_003
        trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_004

        cd ../trycycler_3
        trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_001
        mkdir trycycler/cluster_001/1_contigs_removed
        mv trycycler/cluster_001/1_contigs/E_ctg000000.fasta trycycler/cluster_001/1_contigs_removed
        trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_002
        trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_003
        trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_004

        cd ../trycycler_4
        trycycler reconcile --threads 5 --reads reads.fastq --cluster_dir trycycler/cluster_001
        mkdir trycycler/cluster_001/1_contigs_removed
        mv trycycler/cluster_001/1_contigs/C_utg000001l.fasta trycycler/cluster_001/1_contigs_removed
        mv trycycler/cluster_001/1_contigs/E_ctg000000.fasta trycycler/cluster_001/1_contigs_removed
        mv trycycler/cluster_001/1_contigs/F_Utg670.fasta trycycler/cluster_001/1_contigs_removed
        mv trycycler/cluster_001/1_contigs/V_bctg00000000.fasta trycycler/cluster_001/1_contigs_removed
        trycycler reconcile --threads 5 --reads reads.fastq --cluster_dir trycycler/cluster_002
        trycycler reconcile --threads 5 --reads reads.fastq --cluster_dir trycycler/cluster_003
        trycycler reconcile --threads 5 --reads reads.fastq --cluster_dir trycycler/cluster_004

        cd ../trycycler_5
        trycycler reconcile --threads 5 --reads reads.fastq --cluster_dir trycycler/cluster_001
        trycycler reconcile --threads 5 --reads reads.fastq --cluster_dir trycycler/cluster_002
        trycycler reconcile --threads 5 --reads reads.fastq --cluster_dir trycycler/cluster_003
        trycycler reconcile --threads 5 --reads reads.fastq --cluster_dir trycycler/cluster_004

        cd ../trycycler_7
        trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_001
        mkdir trycycler/cluster_001/1_contigs_removed
        mv trycycler/cluster_001/1_contigs/C_utg000001l.fasta trycycler/cluster_001/1_contigs_removed
        mv trycycler/cluster_001/1_contigs/D_bctg00000000.fasta trycycler/cluster_001/1_contigs_removed
        trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_002
        trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_003
        trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_004
        trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_005

        cd ../trycycler_8
        trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_001
        mkdir trycycler/cluster_001/1_contigs_removed
        mv trycycler/cluster_001/1_contigs/M_utg000001l.fasta trycycler/cluster_001/1_contigs_removed
        trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_002
        trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_003
        trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_004
        trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_005

        cd ../trycycler_9
        trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_001
        trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_002
        trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_003
        trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_004
        trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_005
        trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_006

        cd ../trycycler_10
        trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_001
        trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_002
        trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_003
        trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_004
        trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_005

        #Nachmachen: for sample in trycycler_9 trycycler_10; do
        cd ${sample};
        ../trycycler_assembly_extra-thorough_canu.sh;
        cd ..;
        done

7. (optional) map the circular assemblies to plasmid databases
        #bzip2 -d plsdb.fna.bz2
        #makeblastdb -in plsdb.fna -dbtype nucl
        #blastn -db plsdb.fna -query all.fasta -evalue 1e-50 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 > all_on_plsdb.blastn

8. trycycler msa

        trycycler msa --threads 55 --cluster_dir trycycler/cluster_001
        trycycler msa --threads 55 --cluster_dir trycycler/cluster_002
        trycycler msa --threads 55 --cluster_dir trycycler/cluster_003
        trycycler msa --threads 55 --cluster_dir trycycler/cluster_004
        trycycler msa --threads 55 --cluster_dir trycycler/cluster_005
        #--> When finished, Trycycler reconcile will make a 3_msa.fasta file in the cluster directory

9. trycycler partition

        #generate 4_reads.fastq for each contig!
        trycycler partition --threads 55 --reads reads.fastq --cluster_dirs trycycler/cluster_*
        #trycycler partition --threads 55 --reads reads.fastq --cluster_dirs trycycler/cluster_001 trycycler/cluster_002 trycycler/cluster_003
        trycycler partition --threads 55 --reads reads.fastq --cluster_dirs trycycler/cluster_003

10. trycycler consensus

        trycycler consensus --threads 55 --cluster_dir trycycler/cluster_001
        trycycler consensus --threads 55 --cluster_dir trycycler/cluster_002
        trycycler consensus --threads 55 --cluster_dir trycycler/cluster_003
        trycycler consensus --threads 55 --cluster_dir trycycler/cluster_004
        trycycler consensus --threads 55 --cluster_dir trycycler/cluster_005
        #!!NOTE that we take the isolates of HD05_2_K5 and HD05_2_K6 assembled by Unicycler instead of Trycycler!!
        # TODO (TODAY), generate the 3 datasets below!
        # TODO (IMPORTANT): write a Email to Holger, say the short sequencing of HD5_2 is not correct, since the 3 datasets! However, the MTxxxxxxx is confirmed not in K5 and K6!
        TODO: variant calling needs the short-sequencing, they are not dorable without the correct short-reads! resequencing? It is difficult to call variants only from long-reads since too much errors in long-reads!
        #TODO: check the MT sequence if in the isolates, more deteiled annotations come late!
        #II. Comparing the results of Trycycler with Unicycler.
        #III. Eventually add the plasmids assembled from unicycler to the final results. E.g. add the 4 plasmids to K5 and K6

11. Polishing after Trycycler

        #1. Oxford Nanopore sequencer (Ignored due to the samtools version incompatibility!)
        # for c in trycycler/cluster_*; do
        #     medaka_consensus -i "$c"/4_reads.fastq -d "$c"/7_final_consensus.fasta -o "$c"/medaka -m r941_min_sup_g507 -t 12
        #     mv "$c"/medaka/consensus.fasta "$c"/8_medaka.fasta
        #     rm -r "$c"/medaka "$c"/*.fai "$c"/*.mmi  # clean up
        # done
        # cat trycycler/cluster_*/8_medaka.fasta > trycycler/consensus.fasta
        #2. Short-read polishing
        #---- 5179_R1 (2) ----
        #  mean read depth: 205.8x
        #  188 bp have a depth of zero (99.9924% coverage)
        #  355 positions changed (0.0144% of total positions)
        #  estimated pre-polishing sequence accuracy: 99.9856% (Q38.42)
        #Step 1: read QC
        fastp --in1 ../../s-epidermidis-5179-r1_R1.fastq.gz --in2 ../../s-epidermidis-5179-r1_R2.fastq.gz --out1 1.fastq.gz --out2 2.fastq.gz --unpaired1 u.fastq.gz --unpaired2 u.fastq.gz
        #Step 2: Polypolish
        for cluster in cluster_001 cluster_002; do
        bwa index ${cluster}/7_final_consensus.fasta
        bwa mem -t 55 -a ${cluster}/7_final_consensus.fasta 1.fastq.gz > ${cluster}/alignments_1.sam
        bwa mem -t 55 -a ${cluster}/7_final_consensus.fasta 2.fastq.gz > ${cluster}/alignments_2.sam
        polypolish polish ${cluster}/7_final_consensus.fasta ${cluster}/alignments_1.sam ${cluster}/alignments_2.sam > ${cluster}/polypolish.fasta
        done
        #Step 3: POLCA
        for cluster in cluster_001 cluster_002; do
        cd ${cluster}
        polca.sh -a polypolish.fasta -r "../../../s-epidermidis-5179-r1_R1.fastq.gz ../../../s-epidermidis-5179-r1_R2.fastq.gz" -t 55 -m 120G
        cd ..
        done
        #Substitution Errors: 37
        #Insertion/Deletion Errors: 2
        #Assembly Size: 2470001
        #Consensus Quality: 99.9984
        #Substitution Errors: 4
        #Insertion/Deletion Errors: 0
        #Assembly Size: 17748
        #Consensus Quality: 99.9775
        #Step 4: (optional) more rounds and/or other polishers
        #After one round of Polypolish and one round of POLCA, your assembly should be in very good shape!
        #However, there may still be a few lingering errors. You can try running additional rounds of Polypolish or POLCA to see if they make any more changes.
        for cluster in cluster_001 cluster_002; do
        cd ${cluster}
        polca.sh -a polypolish.fasta.PolcaCorrected.fa -r "../../../s-epidermidis-5179-r1_R1.fastq.gz ../../../s-epidermidis-5179-r1_R2.fastq.gz" -t 55 -m 120G
        cd ..
        done
        Substitution Errors: 13
        Insertion/Deletion Errors: 0
        Assembly Size: 2470004
        Consensus Quality: 99.9995
        Substitution Errors: 0
        Insertion/Deletion Errors: 0
        Assembly Size: 17748
        Consensus Quality: 100
        for cluster in cluster_001; do
        cd ${cluster}
        polca.sh -a polypolish.fasta.PolcaCorrected.fa.PolcaCorrected.fa -r "../../../s-epidermidis-5179-r1_R1.fastq.gz ../../../s-epidermidis-5179-r1_R2.fastq.gz" -t 55 -m 120G
        cd ..
        done
        #Substitution Errors: 0
        #Insertion/Deletion Errors: 0
        #Assembly Size: 2470004
        #Consensus Quality: 100
        #---- 1585 (4) ----
        #  mean read depth: 174.7x
        #  8,297 bp have a depth of zero (99.6604% coverage)
        #  271 positions changed (0.0111% of total positions)
        #  estimated pre-polishing sequence accuracy: 99.9889% (Q39.55)
        #Step 1: read QC
        fastp --in1 ../../s-epidermidis-1585_R1.fastq.gz --in2 ../../s-epidermidis-1585_R2.fastq.gz --out1 1.fastq.gz --out2 2.fastq.gz --unpaired1 u.fastq.gz --unpaired2 u.fastq.gz
        #Step 2: Polypolish
        for cluster in cluster_001 cluster_002 cluster_003 cluster_004; do
        bwa index ${cluster}/7_final_consensus.fasta
        bwa mem -t 55 -a ${cluster}/7_final_consensus.fasta 1.fastq.gz > ${cluster}/alignments_1.sam
        bwa mem -t 55 -a ${cluster}/7_final_consensus.fasta 2.fastq.gz > ${cluster}/alignments_2.sam
        polypolish polish ${cluster}/7_final_consensus.fasta ${cluster}/alignments_1.sam ${cluster}/alignments_2.sam > ${cluster}/polypolish.fasta
        done
        #Step 3: POLCA
        for cluster in cluster_001 cluster_002 cluster_003 cluster_004; do
        cd ${cluster}
        polca.sh -a polypolish.fasta -r "../../../s-epidermidis-1585_R1.fastq.gz ../../../s-epidermidis-1585_R2.fastq.gz" -t 55 -m 120G
        cd ..
        done
        #Substitution Errors: 7
        #Insertion/Deletion Errors: 4
        #Assembly Size: 2443174
        #Consensus Quality: 99.9995
        #Substitution Errors: 0
        #Insertion/Deletion Errors: 0
        #Assembly Size: 9014
        #Consensus Quality: 100
        #Substitution Errors: 0
        #Insertion/Deletion Errors: 0
        #Assembly Size: 9014
        #Consensus Quality: 100
        #Substitution Errors: 0
        #Insertion/Deletion Errors: 0
        #Assembly Size: 2344
        #Consensus Quality: 100
        #Step 4: (optional) more rounds and/or other polishers
        #After one round of Polypolish and one round of POLCA, your assembly should be in very good shape!
        #However, there may still be a few lingering errors. You can try running additional rounds of Polypolish or POLCA to see if they make any more changes.
        for cluster in cluster_001; do
        cd ${cluster}
        polca.sh -a polypolish.fasta.PolcaCorrected.fa -r "../../../s-epidermidis-1585_R1.fastq.gz ../../../s-epidermidis-1585_R2.fastq.gz" -t 55 -m 120G
        cd ..
        done
        #Substitution Errors: 0
        #Insertion/Deletion Errors: 0
        #Assembly Size: 2443176
        #Consensus Quality: 100
        #---- 1585 derived from unicycler, under 1585_normal/unicycler (4) ----
        #Step 0: copy chrom and plasmid1, plasmid2, plasmid3 to cluster_001/7_final_consensus.fasta, ...
        #Step 1: read QC
        fastp --in1 ../../s-epidermidis-1585_R1.fastq.gz --in2 ../../s-epidermidis-1585_R2.fastq.gz --out1 1.fastq.gz --out2 2.fastq.gz --unpaired1 u.fastq.gz --unpaired2 u.fastq.gz
        #Step 2: Polypolish
        for cluster in cluster_001 cluster_002 cluster_003 cluster_004; do
        bwa index ${cluster}/7_final_consensus.fasta
        bwa mem -t 55 -a ${cluster}/7_final_consensus.fasta 1.fastq.gz > ${cluster}/alignments_1.sam
        bwa mem -t 55 -a ${cluster}/7_final_consensus.fasta 2.fastq.gz > ${cluster}/alignments_2.sam
        polypolish polish ${cluster}/7_final_consensus.fasta ${cluster}/alignments_1.sam ${cluster}/alignments_2.sam > ${cluster}/polypolish.fasta
        done
        #Polishing 1 (2,443,574 bp):
        #mean read depth: 174.7x
        #8,298 bp have a depth of zero (99.6604% coverage)
        #52 positions changed (0.0021% of total positions)
        #estimated pre-polishing sequence accuracy: 99.9979% (Q46.72)
        #Polishing 2 (9,014 bp):
        #mean read depth: 766.5x
        #3 bp have a depth of zero (99.9667% coverage)
        #0 positions changed (0.0000% of total positions)
        #estimated pre-polishing sequence accuracy: 100.0000% (Q∞)
        #Polishing 7 (2,344 bp):
        #mean read depth: 2893.0x
        #4 bp have a depth of zero (99.8294% coverage)
        #0 positions changed (0.0000% of total positions)
        #estimated pre-polishing sequence accuracy: 100.0000% (Q∞)
        #Polishing 8 (2,255 bp):
        #mean read depth: 2719.6x
        #4 bp have a depth of zero (99.8226% coverage)
        #0 positions changed (0.0000% of total positions)
        #estimated pre-polishing sequence accuracy: 100.0000% (Q∞)
        #Step 3: POLCA
        for cluster in cluster_001 cluster_002 cluster_003 cluster_004; do
        cd ${cluster}
        polca.sh -a polypolish.fasta -r "../../../s-epidermidis-1585_R1.fastq.gz ../../../s-epidermidis-1585_R2.fastq.gz" -t 55 -m 120G
        cd ..
        done
        #Substitution Errors: 7
        #Insertion/Deletion Errors: 4
        #Assembly Size: 2443598
        #Consensus Quality: 99.9995
        #Substitution Errors: 0
        #Insertion/Deletion Errors: 0
        #Assembly Size: 9014
        #Consensus Quality: 100
        #Substitution Errors: 0
        #Insertion/Deletion Errors: 0
        #Assembly Size: 2344
        #Consensus Quality: 100
        #Substitution Errors: 0
        #Insertion/Deletion Errors: 0
        #Assembly Size: 2255
        #Consensus Quality: 100
        #Step 4: (optional) more rounds and/or other polishers
        #After one round of Polypolish and one round of POLCA, your assembly should be in very good shape!
        #However, there may still be a few lingering errors. You can try running additional rounds of Polypolish or POLCA to see if they make any more changes.
        for cluster in cluster_001; do
        cd ${cluster}
        polca.sh -a polypolish.fasta.PolcaCorrected.fa -r "../../../s-epidermidis-1585_R1.fastq.gz ../../../s-epidermidis-1585_R2.fastq.gz" -t 55 -m 120G
        cd ..
        done
        #Substitution Errors: 0
        #Insertion/Deletion Errors: 0
        #Assembly Size: 2443600
        #Consensus Quality: 100
        #-- 1585v (1, no short reads, waiting) --
        # TODO!
        #-- 5179 (2) --
        #mean read depth: 120.7x
        #7,547 bp have a depth of zero (99.6946% coverage)
        #356 positions changed (0.0144% of total positions)
        #estimated pre-polishing sequence accuracy: 99.9856% (Q38.41)
        #Step 1: read QC
        fastp --in1 ../../s-epidermidis-5179_R1.fastq.gz --in2 ../../s-epidermidis-5179_R2.fastq.gz --out1 1.fastq.gz --out2 2.fastq.gz --unpaired1 u.fastq.gz --unpaired2 u.fastq.gz
        #Step 2: Polypolish
        for cluster in cluster_001 cluster_002; do
        bwa index ${cluster}/7_final_consensus.fasta
        bwa mem -t 55 -a ${cluster}/7_final_consensus.fasta 1.fastq.gz > ${cluster}/alignments_1.sam
        bwa mem -t 55 -a ${cluster}/7_final_consensus.fasta 2.fastq.gz > ${cluster}/alignments_2.sam
        polypolish polish ${cluster}/7_final_consensus.fasta ${cluster}/alignments_1.sam ${cluster}/alignments_2.sam > ${cluster}/polypolish.fasta
        done
        #Step 3: POLCA
        for cluster in cluster_001 cluster_002; do
        cd ${cluster}
        polca.sh -a polypolish.fasta -r "../../../s-epidermidis-5179_R1.fastq.gz ../../../s-epidermidis-5179_R2.fastq.gz" -t 55 -m 120G
        cd ..
        done
        #Substitution Errors: 49
        #Insertion/Deletion Errors: 23
        #Assembly Size: 2471418
        #Consensus Quality: 99.9971
        #Substitution Errors: 0
        #Insertion/Deletion Errors: 0
        #Assembly Size: 17748
        #Consensus Quality: 100
        #Step 4: (optional) more rounds POLCA
        for cluster in cluster_001; do
        cd ${cluster}
        polca.sh -a polypolish.fasta.PolcaCorrected.fa -r "../../../s-epidermidis-5179_R1.fastq.gz ../../../s-epidermidis-5179_R2.fastq.gz" -t 55 -m 120G
        cd ..
        done
        #Substitution Errors: 10
        #Insertion/Deletion Errors: 5
        #Assembly Size: 2471442
        #Consensus Quality: 99.9994
        for cluster in cluster_001; do
        cd ${cluster}
        polca.sh -a polypolish.fasta.PolcaCorrected.fa.PolcaCorrected.fa -r "../../../s-epidermidis-5179_R1.fastq.gz ../../../s-epidermidis-5179_R2.fastq.gz" -t 55 -m 120G
        cd ..
        done
        Substitution Errors: 6
        Insertion/Deletion Errors: 0
        Assembly Size: 2471445
        Consensus Quality: 99.9998
        for cluster in cluster_001; do
        cd ${cluster}
        polca.sh -a polypolish.fasta.PolcaCorrected.fa.PolcaCorrected.fa.PolcaCorrected.fa -r "../../../s-epidermidis-5179_R1.fastq.gz ../../../s-epidermidis-5179_R2.fastq.gz" -t 55 -m 120G
        cd ..
        done
        Substitution Errors: 0
        Insertion/Deletion Errors: 0
        Assembly Size: 2471445
        Consensus Quality: 100
        #-- HD5_2 (2): without the short-sequencing we cannot correct the base-calling! --
        # !ERROR to be REPORTED, the
        #Polishing cluster_001_consensus (2,504,140 bp):
        #mean read depth: 94.4x
        #240,420 bp have a depth of zero (90.3991% coverage)
        #56,894 positions changed (2.2720% of total positions)
        #estimated pre-polishing sequence accuracy: 97.7280% (Q16.44)
        /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_1_S37_R1_001.fastq
        /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_1_S37_R2_001.fastq
        /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_2_S38_R1_001.fastq
        /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_2_S38_R2_001.fastq
        /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_3_S39_R1_001.fastq
        /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_3_S39_R2_001.fastq
        /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_4_S40_R1_001.fastq
        /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_4_S40_R2_001.fastq
        /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_5_S41_R1_001.fastq
        /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_5_S41_R2_001.fastq
        /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_6_S42_R1_001.fastq
        /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_6_S42_R2_001.fastq
        /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_7_S43_R1_001.fastq
        /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_7_S43_R2_001.fastq
        /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_8_S44_R1_001.fastq
        /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_8_S44_R2_001.fastq
        /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_9_S45_R1_001.fastq
        /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_9_S45_R2_001.fastq
        /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_10_S46_R1_001.fastq
        /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_10_S46_R2_001.fastq
        #Step 1: read QC
        fastp --in1 ../../HD5_2_S38_R1_001.fastq.gz --in2 ../../HD5_2_S38_R2_001.fastq.gz --out1 1.fastq.gz --out2 2.fastq.gz --unpaired1 u.fastq.gz --unpaired2 u.fastq.gz
        # NOTE that the following steps are not run since the short-reads are not correct!
        # #Step 2: Polypolish
        # for cluster in cluster_001 cluster_005; do
        #   bwa index ${cluster}/7_final_consensus.fasta
        #   bwa mem -t 55 -a ${cluster}/7_final_consensus.fasta 1.fastq.gz > ${cluster}/alignments_1.sam
        #   bwa mem -t 55 -a ${cluster}/7_final_consensus.fasta 2.fastq.gz > ${cluster}/alignments_2.sam
        #   polypolish polish ${cluster}/7_final_consensus.fasta ${cluster}/alignments_1.sam ${cluster}/alignments_2.sam > ${cluster}/polypolish.fasta
        # done
        # #Step 3: POLCA
        # for cluster in cluster_001 cluster_005; do
        #   cd ${cluster}
        #   polca.sh -a polypolish.fasta -r "../../../HD5_2_S38_R1_001.fastq.gz ../../../HD5_2_S38_R2_001.fastq.gz" -t 55 -m 120G
        #   cd ..
        # done
        # #Step 4: (optional) more rounds POLCA
        # for cluster in cluster_001; do
        #   cd ${cluster}
        #   polca.sh -a polypolish.fasta.PolcaCorrected.fa -r "../../../HD5_2_S38_R1_001.fastq.gz ../../../HD5_2_S38_R2_001.fastq.gz" -t 55 -m 120G
        #   cd ..
        # done
        # NOTE that the plasmids of HD5_2_K5 and HD5_2_K6 were copied from Unicycler!
        #-- HD5_2_K5 (4) --
        mean read depth: 87.1x
        25 bp have a depth of zero (99.9990% coverage)
        1,085 positions changed (0.0433% of total positions)
        estimated pre-polishing sequence accuracy: 99.9567% (Q33.63)
        #Step 1: read QC
        fastp --in1 ../../275_K5_Holger_S92_R1_001.fastq.gz --in2 ../../275_K5_Holger_S92_R2_001.fastq.gz --out1 1.fastq.gz --out2 2.fastq.gz --unpaired1 u.fastq.gz --unpaired2 u.fastq.gz
        #Step 2: Polypolish
        for cluster in cluster_001 cluster_002 cluster_003 cluster_004; do
        bwa index ${cluster}/7_final_consensus.fasta
        bwa mem -t 55 -a ${cluster}/7_final_consensus.fasta 1.fastq.gz > ${cluster}/alignments_1.sam
        bwa mem -t 55 -a ${cluster}/7_final_consensus.fasta 2.fastq.gz > ${cluster}/alignments_2.sam
        polypolish polish ${cluster}/7_final_consensus.fasta ${cluster}/alignments_1.sam ${cluster}/alignments_2.sam > ${cluster}/polypolish.fasta
        done
        #Step 3: POLCA
        for cluster in cluster_001 cluster_002 cluster_003 cluster_004; do
        cd ${cluster}
        polca.sh -a polypolish.fasta -r "../../../275_K5_Holger_S92_R1_001.fastq.gz ../../../275_K5_Holger_S92_R2_001.fastq.gz" -t 55 -m 120G
        cd ..
        done
        #Substitution Errors: 146
        #Insertion/Deletion Errors: 2
        #Assembly Size: 2504401
        #Consensus Quality: 99.9941
        #Substitution Errors: 41
        #Insertion/Deletion Errors: 0
        #Assembly Size: 41288
        #Consensus Quality: 99.9007
        #Substitution Errors: 0
        #Insertion/Deletion Errors: 0
        #Assembly Size: 9191
        #Consensus Quality: 100
        #Substitution Errors: 0
        #Insertion/Deletion Errors: 0
        #Assembly Size: 2767
        #Consensus Quality: 100
        #Step 4: (optional) more rounds POLCA
        for cluster in cluster_001 cluster_002; do
        cd ${cluster}
        polca.sh -a polypolish.fasta.PolcaCorrected.fa -r "../../../275_K5_Holger_S92_R1_001.fastq.gz ../../../275_K5_Holger_S92_R2_001.fastq.gz" -t 55 -m 120G
        cd ..
        done
        #Substitution Errors: 41
        #Insertion/Deletion Errors: 0
        #Assembly Size: 2504401
        #Consensus Quality: 99.9984
        #Substitution Errors: 8
        #Insertion/Deletion Errors: 0
        #Assembly Size: 41288
        #Consensus Quality: 99.9806
        for cluster in cluster_001 cluster_002; do
        cd ${cluster}
        polca.sh -a polypolish.fasta.PolcaCorrected.fa.PolcaCorrected.fa -r "../../../275_K5_Holger_S92_R1_001.fastq.gz ../../../275_K5_Holger_S92_R2_001.fastq.gz" -t 55 -m 120G
        cd ..
        done
        #Substitution Errors: 8
        #Insertion/Deletion Errors: 0
        #Assembly Size: 2504401
        #Consensus Quality: 99.9997
        #Substitution Errors: 4
        #Insertion/Deletion Errors: 0
        #Assembly Size: 41288
        #Consensus Quality: 99.9903
        for cluster in cluster_001 cluster_002; do
        cd ${cluster}
        polca.sh -a polypolish.fasta.PolcaCorrected.fa.PolcaCorrected.fa.PolcaCorrected.fa -r "../../../275_K5_Holger_S92_R1_001.fastq.gz ../../../275_K5_Holger_S92_R2_001.fastq.gz" -t 55 -m 120G
        cd ..
        done
        #Substitution Errors: 8
        #Insertion/Deletion Errors: 0
        #Assembly Size: 2504401
        #Consensus Quality: 99.9997
        #Substitution Errors: 4
        #Insertion/Deletion Errors: 0
        #Assembly Size: 41288
        #Consensus Quality: 99.9903
        #-- HD5_2_K6 (4) --
        #mean read depth: 116.7x
        #4 bp have a depth of zero (99.9998% coverage)
        #1,022 positions changed (0.0408% of total positions)
        #estimated pre-polishing sequence accuracy: 99.9592% (Q33.89)
        #Step 1: read QC
        fastp --in1 ../../276_K6_Holger_S95_R1_001.fastq.gz --in2 ../../276_K6_Holger_S95_R2_001.fastq.gz --out1 1.fastq.gz --out2 2.fastq.gz --unpaired1 u.fastq.gz --unpaired2 u.fastq.gz
        #Step 2: Polypolish
        for cluster in cluster_001 cluster_002 cluster_003 cluster_004; do
        bwa index ${cluster}/7_final_consensus.fasta
        bwa mem -t 55 -a ${cluster}/7_final_consensus.fasta 1.fastq.gz > ${cluster}/alignments_1.sam
        bwa mem -t 55 -a ${cluster}/7_final_consensus.fasta 2.fastq.gz > ${cluster}/alignments_2.sam
        polypolish polish ${cluster}/7_final_consensus.fasta ${cluster}/alignments_1.sam ${cluster}/alignments_2.sam > ${cluster}/polypolish.fasta
        done
        #Step 3: POLCA
        for cluster in cluster_001 cluster_002 cluster_003 cluster_004; do
        cd ${cluster}
        polca.sh -a polypolish.fasta -r "../../../276_K6_Holger_S95_R1_001.fastq.gz ../../../276_K6_Holger_S95_R2_001.fastq.gz" -t 55 -m 120G
        cd ..
        done
        #Substitution Errors: 164
        #Insertion/Deletion Errors: 2
        #Assembly Size: 2504398
        #Consensus Quality: 99.9934
        #Substitution Errors: 22
        #Insertion/Deletion Errors: 0
        #Assembly Size: 41288
        #Consensus Quality: 99.9467
        #Substitution Errors: 0
        #Insertion/Deletion Errors: 0
        #Assembly Size: 9191
        #Consensus Quality: 100
        #Substitution Errors: 0
        #Insertion/Deletion Errors: 0
        #Assembly Size: 2767
        #Consensus Quality: 100
        #Step 4: (optional) more rounds POLCA
        for cluster in cluster_001 cluster_002; do
        cd ${cluster}
        polca.sh -a polypolish.fasta.PolcaCorrected.fa -r "../../../276_K6_Holger_S95_R1_001.fastq.gz ../../../276_K6_Holger_S95_R2_001.fastq.gz" -t 55 -m 120G
        cd ..
        done
        #Substitution Errors: 32
        #Insertion/Deletion Errors: 0
        #Assembly Size: 2504400
        #Consensus Quality: 99.9987
        #Substitution Errors: 0
        #Insertion/Deletion Errors: 0
        #Assembly Size: 41288
        #Consensus Quality: 100
        for cluster in cluster_001; do
        cd ${cluster}
        polca.sh -a polypolish.fasta.PolcaCorrected.fa.PolcaCorrected.fa -r "../../../276_K6_Holger_S95_R1_001.fastq.gz ../../../276_K6_Holger_S95_R2_001.fastq.gz" -t 55 -m 120G
        cd ..
        done
        #Substitution Errors: 4
        #Insertion/Deletion Errors: 0
        #Assembly Size: 2504400
        #Consensus Quality: 99.9998
        for cluster in cluster_001; do
        cd ${cluster}
        polca.sh -a polypolish.fasta.PolcaCorrected.fa.PolcaCorrected.fa.PolcaCorrected.fa -r "../../../276_K6_Holger_S95_R1_001.fastq.gz ../../../276_K6_Holger_S95_R2_001.fastq.gz" -t 55 -m 120G
        cd ..
        done
        #Substitution Errors: 2
        #Insertion/Deletion Errors: 0
        #Assembly Size: 2504400
        #Consensus Quality: 99.9999

(Optional) use platon to confirm the plasmid contigs: https://github.com/oschwengers/platon

Microbial bioinformatics

Microbial bioinformatics uses computational tools to analyze genomes, track evolution, and study functions in microorganisms, including bacteria and viruses.

Processing Data_Patricia_Transposon_2025 (Workflow for Structural Variant Calling in Nanopore Sequencing)

Leave a Reply Cancel reply