Processing Data_Patricia_Transposon_2025 (Workflow for Structural Variant Calling in Nanopore Sequencing)

  1. install mambaforge https://conda-forge.org/miniforge/ (recommended)

     #download Mambaforge-24.9.2-0-Linux-x86_64.sh from website
     chmod +x Mambaforge-24.9.2-0-Linux-x86_64.sh
     ./Mambaforge-24.9.2-0-Linux-x86_64.sh
    
     To activate this environment, use:
         micromamba activate /home/jhuang/mambaforge
     Or to execute a single command in this environment, use:
         micromamba run -p /home/jhuang/mambaforge mycommand
     installation finished.
    
     Do you wish to update your shell profile to automatically initialize conda?
     This will activate conda on startup and change the command prompt when activated.
     If you'd prefer that conda's base environment not be activated on startup,
       run the following command when conda is activated:
    
     conda config --set auto_activate_base false
    
     You can undo this by running `conda init --reverse $SHELL`? [yes|no]
     [no] >>> yes
     no change     /home/jhuang/mambaforge/condabin/conda
     no change     /home/jhuang/mambaforge/bin/conda
     no change     /home/jhuang/mambaforge/bin/conda-env
     no change     /home/jhuang/mambaforge/bin/activate
     no change     /home/jhuang/mambaforge/bin/deactivate
     no change     /home/jhuang/mambaforge/etc/profile.d/conda.sh
     no change     /home/jhuang/mambaforge/etc/fish/conf.d/conda.fish
     no change     /home/jhuang/mambaforge/shell/condabin/Conda.psm1
     no change     /home/jhuang/mambaforge/shell/condabin/conda-hook.ps1
     no change     /home/jhuang/mambaforge/lib/python3.12/site-packages/xontrib/conda.xsh
     no change     /home/jhuang/mambaforge/etc/profile.d/conda.csh
     modified      /home/jhuang/.bashrc
     ==> For changes to take effect, close and re-open your current shell. <==
     no change     /home/jhuang/mambaforge/condabin/conda
     no change     /home/jhuang/mambaforge/bin/conda
     no change     /home/jhuang/mambaforge/bin/conda-env
     no change     /home/jhuang/mambaforge/bin/activate
     no change     /home/jhuang/mambaforge/bin/deactivate
     no change     /home/jhuang/mambaforge/etc/profile.d/conda.sh
     no change     /home/jhuang/mambaforge/etc/fish/conf.d/conda.fish
     no change     /home/jhuang/mambaforge/shell/condabin/Conda.psm1
     no change     /home/jhuang/mambaforge/shell/condabin/conda-hook.ps1
     no change     /home/jhuang/mambaforge/lib/python3.12/site-packages/xontrib/conda.xsh
     no change     /home/jhuang/mambaforge/etc/profile.d/conda.csh
     no change     /home/jhuang/.bashrc
     No action taken.
     WARNING conda.common.path.windows:_path_to(100): cygpath is not available, fallback to manual path conversion
     WARNING conda.common.path.windows:_path_to(100): cygpath is not available, fallback to manual path conversion
     Added mamba to /home/jhuang/.bashrc
     ==> For changes to take effect, close and re-open your current shell. <==
     Thank you for installing Mambaforge!
    
     Close your terminal window and open a new one, or run:
     #source ~/mambaforge/bin/activate
     conda --version
     mamba --version
    
     https://github.com/conda-forge/miniforge/releases
     Note
    
         * After installation, please make sure that you do not have the Anaconda default channels configured.
             conda config --show channels
             conda config --remove channels defaults
             conda config --add channels conda-forge
             conda config --show channels
             conda config --set channel_priority strict
             #conda clean --all
             conda config --remove channels biobakery
    
         * !!!!Do not install anything into the base environment as this might break your installation. See here for details.!!!!
    
     # --Deprecated method: mamba installing on conda--
     #conda install -n base --override-channels -c conda-forge mamba 'python_abi=*=*cp*'
     #    * Note that installing mamba into any other environment than base is not supported.
     #
     #conda activate base
     #conda install conda
     #conda uninstall mamba
     #conda install mamba

2: install required Tools on the mamba env

    * Sniffles2: Detect structural variants, including transposons, from long-read alignments.
    * RepeatModeler2: Identify and classify transposons de novo.
    * RepeatMasker: Annotate known transposable elements using transposon libraries.
    * SVIM: An alternative structural variant caller optimized for long-read sequencing, if needed.
    * SURVIVOR: Consolidate structural variants across samples for comparative analysis.

    mamba deactivate
    # Create a new conda environment
    mamba create -n transposon_long python=3.6 -y

    # Activate the environment
    mamba activate transposon_long

    mamba install -c bioconda sniffles
    mamba install -c bioconda repeatmodeler repeatmasker

    # configure repeatmasker database
    mamba info --envs
    cd /home/jhuang/mambaforge/envs/transposon_long/share/RepeatMasker

    #mamba install python=3.6
    mamba install -c bioconda svim
    mamba install -c bioconda survivor
  1. Test the installed tools

     # Check versions
     sniffles --version
     RepeatModeler -h
     RepeatMasker -h
     svim --help
     SURVIVOR --help
     mamba install -c conda-forge perl r
  2. Data Preparation

     Raw Signal Data: Nanopore devices generate electrical signal data as DNA passes through the nanopore.
     Basecalling: Tools like Guppy or Dorado are used to convert raw signals into nucleotide sequences (FASTQ files).
  3. Preprocessing

     Quality Filtering: Remove low-quality reads using tools like Filtlong or NanoFilt.
     Adapter Trimming: Identify and remove sequencing adapters with tools like Porechop.
  4. (Optional) Variant Calling for SNP and Indel Detection:

     Tools like Medaka, Longshot, or Nanopolish analyze the aligned reads to identify SNPs and small indels.
  5. (OFFICIAL STARTING POINT) Alignment and Structural Variant Calling: Tools such as Sniffles or SVIM detect large insertions, deletions, and other structural variants. 使用长读长测序工具如 SVIM 或 Sniffles 检测结构变异(e.g. 散在性重复序列)。

       #NOTE that the ./batch1_depth25/trycycler_WT/reads.fastq and F24A430001437_BACctmoD/BGI_result/Separate/${sample}/1.Cleandata/${sample}.filtered_reads.fq.gz are the same!
    
       # -- PREPARING the input fastq-data, merge the fastqz and move the top-directory
    
       # Under raw_data/no_sample_id/20250731_0943_MN45170_FBD12615_97f118c2/fastq_pass
       zcat ./barcode01/FBD12615_pass_barcode01_97f118c2_aa46ecf7_0.fastq.gz ./barcode01/FBD12615_pass_barcode01_97f118c2_aa46ecf7_1.fastq.gz ./barcode01/FBD12615_pass_barcode01_97f118c2_aa46ecf7_2.fastq.gz ./barcode01/FBD12615_pass_barcode01_97f118c2_aa46ecf7_3.fastq.gz ... | gzip > HD46_1.fastq.gz
       mv ./raw_data/no_sample_id/20250731_0943_MN45170_FBD12615_97f118c2/fastq_pass/HD46_1.fastq.gz ~/DATA/Data_Patricia_Transposon_2025
    
         #this are the corresponding sample names:
         #barcode 1: HD46-1
         #barcode 2: HD46-2
         #barcode 3: HD46-3
         #barcode 4: HD46-4
         mv barcode01.fastq.gz HD46_1.fastq.gz
         mv barcode02.fastq.gz HD46_2.fastq.gz
         mv barcode03.fastq.gz HD46_3.fastq.gz
         mv barcode04.fastq.gz HD46_4.fastq.gz
    
       # -- CALCULATE the coverages
         #!/bin/bash
    
         for bam in barcode*_minimap2.sorted.bam; do
             echo "Processing $bam ..."
             avg_cov=$(samtools depth -a "$bam" | awk '{sum+=$3; cnt++} END {if (cnt>0) print sum/cnt; else print 0}')
             echo -e "${bam}\t${avg_cov}" >> coverage_summary.txt
         done
    
       # ---- !!!! LOGIN the suitable environment !!!! ----
    
       mamba activate transposon_long
    
       # -- TODO: AFTERNOON_DEBUG_THIS: FAILED and not_USED: Alignment and Detect structural variants in each sample using SVIM which used aligner ngmlr or mimimap2
       #mamba install -c bioconda ngmlr
       mamba install -c bioconda svim
    
       # ---- Option_1: minimap2 (aligner) + SVIM (structural variant caller) --> SUCCESSFUL ----
       for sample in HD46_Ctrl HD46_1 HD46_2 HD46_3 HD46_4 HD46_5 HD46_6 HD46_7 HD46_8 HD46_13; do
           #INS,INV,DUP:TANDEM,DUP:INT,BND
           svim reads --aligner minimap2 --nanopore minimap2+svim_${sample}    ${sample}.fastq.gz CP020463.fasta  --cores 20 --types INS --min_sv_size 100 --sequence_allele --insertion_sequences --read_names;
       done
    
       #svim alignment svim_alignment_minmap2_1_re 1.sorted.bam CP020463_.fasta --types INS --sequence_alleles --insertion_sequences --read_names
    
       # ---- Option_2: minamap2 (aligner) + Sniffles2 (structural variant caller) --> SUCCESSFUL ----
       #Minimap2: A commonly used aligner for nanopore sequencing data.
       #    Align Long Reads to the WT Reference using Minimap2
       #sniffles -m WT.sorted.bam -v WT.vcf -s 10 -l 50 -t 60
       #  -s 20: Requires at least 20 reads to support an SV for reporting. --> 10
       #  -l 50: Reports SVs that are at least 50 base pairs long.
       #  -t 60: Uses 60 threads for faster processing.
       for sample in HD46_Ctrl HD46_1 HD46_2 HD46_3 HD46_4 HD46_5 HD46_6 HD46_7 HD46_8 HD46_13; do
           #minimap2 --MD -t 60 -ax map-ont CP020463.fasta ./batch1_depth25/trycycler_${sample}/reads.fastq | samtools sort -o ${sample}.sorted.bam
           minimap2 --MD -t 60 -ax map-ont CP020463.fasta ${sample}.fastq.gz | samtools sort -o ${sample}_minimap2.sorted.bam
           samtools index ${sample}_minimap2.sorted.bam
           sniffles -m ${sample}_minimap2.sorted.bam -v ${sample}_minimap2+sniffles.vcf -s 10 -l 50 -t 60
           #QUAL < 20 ||
           bcftools filter -e "INFO/SVTYPE != 'INS'" ${sample}_minimap2+sniffles.vcf > ${sample}_minimap2+sniffles_filtered.vcf
       done
    
         #Estimating parameter...
         #        Max dist between aln events: 44
         #        Max diff in window: 76
         #        Min score ratio: 2
         #        Avg DEL ratio: 0.0112045
         #        Avg INS ratio: 0.0364027
         #Start parsing... CP020463
         #                # Processed reads: 10000
         #                # Processed reads: 20000
         #        Finalizing  ..
         #Start genotype calling:
         #        Reopening Bam file for parsing coverage
         #        Finalizing  ..
         #Estimating parameter...
         #        Max dist between aln events: 28
         #        Max diff in window: 89
         #        Min score ratio: 2
         #        Avg DEL ratio: 0.013754
         #        Avg INS ratio: 0.17393
         #Start parsing... CP020463
         #                # Processed reads: 10000
         #                # Processed reads: 20000
         #                # Processed reads: 30000
         #                # Processed reads: 40000
    
         # Results:
         # * barcode01_minimap2+sniffles.vcf
         # * barcode01_minimap2+sniffles_filtered.vcf
         # * barcode02_minimap2+sniffles.vcf
         # * barcode02_minimap2+sniffles_filtered.vcf
         # * barcode03_minimap2+sniffles.vcf
         # * barcode03_minimap2+sniffles_filtered.vcf
         # * barcode04_minimap2+sniffles.vcf
         # * barcode04_minimap2+sniffles_filtered.vcf
    
       #ERROR: No MD string detected! Check bam file! Otherwise generate using e.g. samtools. --> No results!
       #for sample in barcode01 barcode02 barcode03 barcode04; do
       #    sniffles -m svim_reads_minimap2_${sample}/${sample}.fastq.minimap2.coordsorted.bam -v sniffles_minimap2_${sample}.vcf -s 10 -l 50 -t 60
       #    bcftools filter -e "INFO/SVTYPE != 'INS'" sniffles_minimap2_${sample}.vcf > sniffles_minimap2_${sample}_filtered.vcf
       #done
    
       # ---- Option_3: NGMLR (aligner) + SVIM (structural variant caller) --> SUCCESSFUL ----
       for sample in HD46_Ctrl HD46_1 HD46_2 HD46_3 HD46_4 HD46_5 HD46_6 HD46_7 HD46_8 HD46_13; do
           svim reads --aligner ngmlr --nanopore    ngmlr+svim_${sample}       ${sample}.fastq.gz CP020463.fasta  --cores 10;
       done
    
       # ---- Option_4: NGMLR (aligner) + sniffles (structural variant caller) --> SUCCESSFUL ----
       for sample in HD46_Ctrl HD46_1 HD46_2 HD46_3 HD46_4 HD46_5 HD46_6 HD46_7 HD46_8 HD46_13; do
           sniffles -m ngmlr+svim_${sample}/${sample}.fastq.ngmlr.coordsorted.bam -v ${sample}_ngmlr+sniffles.vcf -s 10 -l 50 -t 60
           bcftools filter -e "INFO/SVTYPE != 'INS'" ${sample}_ngmlr+sniffles.vcf > ${sample}_ngmlr+sniffles_filtered.vcf
       done
  6. Compare and integrate all results produced by minimap2+sniffles and ngmlr+sniffles, and check them each position in IGV!

         mv HD46_Ctrl_minimap2+sniffles_filtered.vcf HD46-Ctrl_minimap2+sniffles_filtered.vcf
         mv HD46_Ctrl_ngmlr+sniffles_filtered.vcf    HD46-Ctrl_ngmlr+sniffles_filtered.vcf
         mv HD46_1_minimap2+sniffles_filtered.vcf    HD46-1_minimap2+sniffles_filtered.vcf
         mv HD46_1_ngmlr+sniffles_filtered.vcf       HD46-1_ngmlr+sniffles_filtered.vcf
         mv HD46_2_minimap2+sniffles_filtered.vcf    HD46-2_minimap2+sniffles_filtered.vcf
         mv HD46_2_ngmlr+sniffles_filtered.vcf       HD46-2_ngmlr+sniffles_filtered.vcf
         mv HD46_3_minimap2+sniffles_filtered.vcf    HD46-3_minimap2+sniffles_filtered.vcf
         mv HD46_3_ngmlr+sniffles_filtered.vcf       HD46-3_ngmlr+sniffles_filtered.vcf
         mv HD46_4_minimap2+sniffles_filtered.vcf    HD46-4_minimap2+sniffles_filtered.vcf
         mv HD46_4_ngmlr+sniffles_filtered.vcf       HD46-4_ngmlr+sniffles_filtered.vcf
         mv HD46_5_minimap2+sniffles_filtered.vcf    HD46-5_minimap2+sniffles_filtered.vcf
         mv HD46_5_ngmlr+sniffles_filtered.vcf       HD46-5_ngmlr+sniffles_filtered.vcf
         mv HD46_6_minimap2+sniffles_filtered.vcf    HD46-6_minimap2+sniffles_filtered.vcf
         mv HD46_6_ngmlr+sniffles_filtered.vcf       HD46-6_ngmlr+sniffles_filtered.vcf
         mv HD46_7_minimap2+sniffles_filtered.vcf    HD46-7_minimap2+sniffles_filtered.vcf
         mv HD46_7_ngmlr+sniffles_filtered.vcf       HD46-7_ngmlr+sniffles_filtered.vcf
         mv HD46_8_minimap2+sniffles_filtered.vcf    HD46-8_minimap2+sniffles_filtered.vcf
         mv HD46_8_ngmlr+sniffles_filtered.vcf       HD46-8_ngmlr+sniffles_filtered.vcf
         mv HD46_13_minimap2+sniffles_filtered.vcf   HD46-13_minimap2+sniffles_filtered.vcf
         mv HD46_13_ngmlr+sniffles_filtered.vcf      HD46-13_ngmlr+sniffles_filtered.vcf
    
       conda activate plot-numpy1
       #python generate_common_vcf.py
       #mv common_variants.xlsx putative_transposons.xlsx
    
       # * Reads each of your VCFs.
       # * Filters variants → only keep those with FILTER == PASS.
       # * Compares the two aligner methods (minimap2+sniffles2 vs ngmlr+sniffles2) per sample.
       # * Keeps only variants that appear in both methods for the same sample.
       # * Outputs: An Excel file with the common variants and a log text file listing which variants were filtered out, and why (not_PASS or not_COMMON_in_two_VCF).
    
       #python generate_fuzzy_common_vcf_v1.py
       #Sample   PASS_minimap2   PASS_ngmlr  COMMON
       #  HD46-Ctrl_Ctrl 39  29  28
       #  HD46-1 39  32  29
       #  HD46-2 40  32  28
       #  HD46-3 38  30  27
       #  HD46-4 46  35  32
       #  HD46-5 40  35  31
       #  HD46-6 43  35  30
       #  HD46-7 40  33  28
       #  HD46-8 37  20  11
       #  HD46-13    39  38  27
    
     #Sample PASS_minimap2   PASS_ngmlr  COMMON_FINAL
     #HD46-Ctrl_Ctrl 39  29  6
     #HD46-1 39  32  8
     #HD46-2 40  32  8
     #HD46-3 38  30  6
     #HD46-4 46  35  8
     #HD46-5 40  35  9
     #HD46-6 43  35  10
     #HD46-7 40  33  8
     #HD46-8 37  20  4
     #HD46-13    39  38  5
    
     #!!!! Summarize the results of ngmlr+sniffles !!!!
     python merge_ngmlr+sniffles_filtered_results_and_summarize.py
    
     #!!!! Post-Processing !!!!
     #DELETE "2186168    N   

    . PASS” in Sheet HD46-13 and Summary #DELETE “2427785 N CGTCAGAATCGCTGTCTGCGTCCGAGTCACTGTCTGAGTCTGAATCACTATCTGCGTCTGAGTCACTGTCTG . PASS” due to “0/1:169:117” in HD46-13 and Summary #DELETE “2441640 N GCTCATTAAGAATCATTAAATTAC . PASS” due to 0/1:170:152 in HD46-13 and Summary

  7. Source code of merge_ngmlr+sniffles_filtered_results_and_summarize.py

     import os
     import pandas as pd
    
     # List all ngmlr VCF files
     vcf_files = sorted([f for f in os.listdir('.') if f.endswith('_ngmlr+sniffles_filtered.vcf')])
    
     log_lines = []
     df_list = []
    
     # Function to read VCF and filter PASS
     def read_vcf(vcf_path, log_lines, sample):
         variants = []
         with open(vcf_path) as f:
             for line in f:
                 if line.startswith('#'):
                     continue
                 parts = line.strip().split('\t')
                 pos, ref, alt, qual, flt = parts[1], parts[3], parts[4], parts[5], parts[6]
                 info = parts[7] if len(parts) > 7 else '.'
                 fmt = parts[8] if len(parts) > 8 else '.'
                 last_column = parts[9] if len(parts) > 9 else '.'  # Keep original column
                 var_id = f"{pos}:{ref}>{alt}"
                 if flt != 'PASS':
                     log_lines.append(f"{sample}\t{var_id}\tnot_PASS")
                     continue
                 variants.append({
                     'POS': int(pos),
                     'REF': ref,
                     'ALT': alt,
                     'QUAL': qual,
                     'FILTER': flt,
                     'INFO': info,
                     'FORMAT': fmt,
                     sample: last_column,  # Use only genotype for individual sheets
                     f"{sample}_with_alt": alt  # Use only ALT sequence for summary
                 })
         return pd.DataFrame(variants)
    
     # Read all VCFs
     sample_names = []
     sample_with_alt_columns = []
     for f in vcf_files:
         sample = os.path.basename(f).replace('_ngmlr+sniffles_filtered.vcf', '')
         sample_names.append(sample)
         sample_with_alt_columns.append(f"{sample}_with_alt")
         df = read_vcf(f, log_lines, sample)
         df_list.append(df)
    
     # Merge all variants into one DataFrame
     all_variants = pd.concat(df_list, ignore_index=True)
    
     # Fill missing sample columns with hyphen for summary
     for sample in sample_names:
         if sample not in all_variants.columns:
             all_variants[sample] = ''
         if f"{sample}_with_alt" not in all_variants.columns:
             all_variants[f"{sample}_with_alt"] = '-'
    
     # Pivot table for summary using exact POS, using columns with ALT
     summary_df = all_variants.groupby(['POS', 'REF'])[sample_with_alt_columns].first().reset_index()
    
     # Rename columns in summary to remove '_with_alt' suffix
     summary_df.columns = ['POS', 'REF'] + sample_names
    
     # Replace NaN or empty strings with hyphen in sample columns
     summary_df[sample_names] = summary_df[sample_names].fillna('-').replace('', '-')
    
     # Count how many samples have a non-hyphen value
     summary_df['Count'] = summary_df[sample_names].apply(lambda row: sum(val != '-' for val in row), axis=1)
    
     # Save Excel with individual sheets and summary
     writer = pd.ExcelWriter('merged_ngmlr+sniffles_variants.xlsx', engine='openpyxl')
    
     # Save individual sample sheets
     for df in df_list:
         sheet_name = df.columns[-2]  # Use sample name (not the _with_alt column)
         # Select only relevant columns for individual sheets (exclude _with_alt)
         df = df[[col for col in df.columns if not col.endswith('_with_alt')]]
         df.to_excel(writer, sheet_name=sheet_name, index=False)
    
     # Save summary sheet
     summary_df.to_excel(writer, sheet_name='Summary', index=False)
     writer.close()
    
     # Write log
     with open('ngmlr_filtering_log.txt', 'w') as logf:
         logf.write('Sample\tVariant\tReason\n')
         for line in log_lines:
             logf.write(line + '\n')
    
     print('Done: merged_ngmlr+sniffles_variants.xlsx with ALT sequences or hyphens in summary and genotype-only in individual sheets.')
  8. ———————– END of the transposon calculation. The following steps (e.g. assembly based on the long-reads is not necessary for the transposon analysis) —————————-

  9. (NOT_USED) Filtering low-complexity insertions using RepeatMasker (TODO: how to use RepeatModeler to generate own lib?)

       python vcf_to_fasta.py variants.vcf variants.fasta
       #python filter_low_complexity.py variants.fasta filtered_variants.fasta retained_variants.fasta
       #Using RepeatMasker to filter the low-complexity fasta, the used h5 lib is
       /home/jhuang/mambaforge/envs/transposon_long/share/RepeatMasker/Libraries/Dfam.h5    #1.9G
       python /home/jhuang/mambaforge/envs/transposon_long/share/RepeatMasker/famdb.py -i /home/jhuang/mambaforge/envs/transposon_long/share/RepeatMasker/Libraries/Dfam.h5 names 'bacteria' | head
       Exact Matches
       =============
       2 bacteria (blast name), Bacteria 
    (scientific name), eubacteria (genbank common name), Monera (in-part), Procaryotae (in-part), Prokaryota (in-part), Prokaryotae (in-part), prokaryote (in-part), prokaryotes (in-part) Non-exact Matches ================= 1783272 Terrabacteria group (scientific name) 91061 Bacilli (scientific name), Bacilli Ludwig et al. 2010 (authority), Bacillus/Lactobacillus/Streptococcus group (synonym), Firmibacteria (synonym), Firmibacteria Murray 1988 (authority) 1239 Bacillaeota (synonym), Bacillaeota Oren et al. 2015 (authority), Bacillota (synonym), Bacillus/Clostridium group (synonym), clostridial firmicutes (synonym), Clostridium group firmicutes (synonym), Firmacutes (synonym), firmicutes (blast name), Firmicutes (scientific name), Firmicutes corrig. Gibbons and Murray 1978 (authority), Low G+C firmicutes (synonym), low G+C Gram-positive bacteria (common name), low GC Gram+ (common name) Summary of Classes within Firmicutes: * Bacilli (includes many common pathogenic and non-pathogenic Gram-positive bacteria, taxid=91061) * Bacillus (e.g., Bacillus subtilis, Bacillus anthracis) * Staphylococcus (e.g., Staphylococcus aureus, Staphylococcus epidermidis) * Streptococcus (e.g., Streptococcus pneumoniae, Streptococcus pyogenes) * Listeria (e.g., Listeria monocytogenes) * Clostridia (includes many anaerobic species like Clostridium and Clostridioides) * Erysipelotrichia (intestinal bacteria, some pathogenic) * Tissierellia (less-studied, veterinary relevance) * Mollicutes (cell wall-less, includes Mycoplasma species) * Negativicutes (includes some Gram-negative, anaerobic species) RepeatMasker -species Bacilli -pa 4 -xsmall variants.fasta python extract_unmasked_seq.py variants.fasta.masked unmasked_variants.fasta #bcftools filter -i ‘QUAL>30 && INFO/SVLEN>100’ variants.vcf -o filtered.vcf # #bcftools view -i ‘SVTYPE=”INS”‘ variants.vcf | bcftools query -f ‘%CHROM\t%POS\t%REF\t%ALT\t%INFO\n’ > insertions.txt #mamba install -c bioconda vcf2fasta #vcf2fasta variants.vcf -o insertions.fasta #grep “SEQS” variants.vcf | awk ‘{ print $1, $2, $4, $5, $8 }’ > insertions.txt #python3 filtering_low_complexity.py # #vcftools –vcf input.vcf –recode –out filtered_output –minSVLEN 100 #bcftools filter -e ‘INFO/SEQS ~ “^(G+|C+|T+|A+){4,}”‘ variants.vcf -o filtered.vcf # — calculate the percentage of reads To calculate the percentage of reads that contain the insertion from the VCF entry, use the INFO and FORMAT fields provided in the VCF record. Step 1: Extract Relevant Information In the provided VCF entry: RE (Reads Evidence): 733 – the total number of reads supporting the insertion. GT (Genotype): 1/1 – this indicates a homozygous insertion, meaning all reads covering this region are expected to have the insertion. AF (Allele Frequency): 1 – a 100% allele frequency, indicating that every read in this sample supports the insertion. DR (Depth Reference): 0 – the number of reads supporting the reference allele. DV (Depth Variant): 733 – the number of reads supporting the variant allele (insertion). Step 2: Calculate Percentage of Reads Supporting the Insertion Using the formula: Percentage of reads with insertion=(DVDR+DV)×100 Percentage of reads with insertion=(DR+DVDV​)×100 Substitute the values: Percentage=(7330+733)×100=100% Percentage=(0+733733​)×100=100% Conclusion Based on the VCF record, 100% of the reads support the insertion, indicating that the insertion is fully present in the sample (homozygous insertion). This is consistent with the AF=1 and GT=1/1 fields. * In your VCF file generated by Sniffles, the REF=N in the results has a specific meaning: * In a standard VCF, the REF field usually contains the reference base(s) at the variant position. * For structural variants (SVs), especially insertions, there is no reference sequence replaced; the insertion occurs between reference bases. * Therefore, Sniffles uses N as a placeholder in the REF field to indicate “no reference base replaced”. * The actual inserted sequence is then stored in the ALT field.
  10. Why some records have UNRESOLVED in the FILTER field in the Excel output.

     1. Understanding the format
    
         The data appears to be structural variant (SV) calls from Sniffles, probably in a VCF-like tabular format exported to Excel:
    
             * gi|1176884116|gb|CP020463.1| → reference sequence
             * Positions: 1855752 and 2422820
             * N → insertion event
             * SVLEN=999 → size of the insertion
             * AF → allele frequency
             * GT:DR:DV → genotype, depth reference, depth variant (1/1:0:678, example values for a PASS variant)
             * FILTER → whether the variant passed filters (UNRESOLVED means it didn’t pass)
    
     2. What UNRESOLVED usually means
    
         In Sniffles:
    
         * UNRESOLVED is assigned to SVs when the tool cannot confidently resolve the exact sequence or breakpoint.
         * Reasons include:
             - Low read support (RE, DV) relative to the expected coverage
             - Ambiguous alignment at repetitive regions
             - Conflicting strand or orientation signals
             - Allele frequency inconsistent with expectations
    
     3. Examine your two records
    
         First record
    
             POS: 1855752
             SVTYPE: INS
             SVLEN: 999
             RE: 68
             AF: 1
             GT: 1/1
             FILTER: UNRESOLVED
    
         Observations:
    
         * AF = 1 → allele frequency 100%, homozygous insertion
         * RE = 68 → 68 reads support the variant, decent coverage
         * Still UNRESOLVED → likely because Sniffles could not resolve the inserted sequence precisely; sometimes long insertions in repetitive regions are hard to reconstruct fully even with good read support.
    
         Second record
    
             POS: 2422820
             SVTYPE: INS
             SVLEN: 999
             RE: 22
             AF: 0.025522
             GT: 0/0
             FILTER: UNRESOLVED
    
         Observations:
    
         * AF = 0.0255 → very low allele frequency (~2.5%)
         * RE = 22, DR = 840 → very low variant reads vs reference
         * GT = 0/0 → homozygous reference
         * Sniffles marks it UNRESOLVED because the variant is essentially noise, not confidently detected.
    
     4. Key difference between the two
         Feature First record    Second record
         Allele frequency (AF)   1 (high)    0.0255 (very low)
         Variant reads (RE)  68  22
         Genotype (GT)   1/1 0/0
         Reason for UNRESOLVED   Unresolvable inserted sequence
    
     ✅ 5. Conclusion
    
         * Sniffles marks a variant as UNRESOLVED when the SV cannot be confidently characterized.
         * Even if there is good read support (first record), complex insertions can’t always be reconstructed fully.
         * Very low allele frequency (second record) also triggers UNRESOLVED because the signal is too weak compared to background noise.
         * Essentially: “UNRESOLVED” ≠ bad data, it’s just unresolved uncertainty.
  11. (NOT_SURE_HOW_TO_USE) Polishing of assembly: Use tools like Medaka to refine variant calls by leveraging consensus sequences derived from nanopore data.

       mamba install -c bioconda medaka
       medaka-consensus -i aligned_reads.bam -r reference.fasta -o polished_output -t 4
  12. Compare Insertions Across Samples

     Merge Variants Across Samples: Use SURVIVOR to merge and compare the detected insertions in all samples against the WT:
    
     SURVIVOR merge input_vcfs.txt 1000 1 1 1 0 30 merged.vcf
    
         Input: List of VCF files from Sniffles2.
         Output: A consolidated VCF file with shared and unique variants.
    
     Filter WT Insertions:
    
         Identify transposons present only in samples 1–9 by subtracting WT variants using bcftools:
    
             bcftools isec WT.vcf merged.vcf -p comparison_results
  13. Validate and Visualize

     Visualize with IGV: Use IGV to inspect insertion sites in the alignment and confirm quality.
    
     igv.sh
    
     Validate Findings:
         Perform PCR or additional sequencing for key transposon insertion sites to confirm results.
  14. Alternatives to TEPID for Long-Read Data

     If you’re looking for transposon-specific tools for long reads:
    
         REPET: A robust transposon annotation tool compatible with assembled genomes.
         EDTA (Extensive de novo TE Annotator):
             A pipeline to identify, classify, and annotate transposons.
             Works directly on your assembled genomes.
    
             perl EDTA.pl --genome WT.fasta --type all
  15. The WT.vcf file in the pipeline is generated by detecting structural variants (SVs) in the wild-type (WT) genome aligned against itself or using it as a baseline reference. Here’s how you can generate the WT.vcf:

     Steps to Generate WT.vcf
     1. Align WT Reads to the WT Reference Genome
    
     The goal here is to create an alignment of the WT sequencing data to the WT reference genome to detect any self-contained structural variations, such as native insertions, deletions, or duplications.
    
     Command using Minimap2:
    
     minimap2 -ax map-ont WT.fasta WT_reads.fastq | samtools sort -o WT.sorted.bam
    
     Index the BAM file:
    
     samtools index WT.sorted.bam
    
     2. Detect Structural Variants with Sniffles2
    
     Run Sniffles2 on the WT alignment to call structural variants:
    
     sniffles --input WT.sorted.bam --vcf WT.vcf
    
     This step identifies:
    
         Native transposons and insertions present in the WT genome.
         Other structural variants that are part of the reference genome or sequencing artifacts.
    
     Key parameters to consider:
    
         --min_support: Adjust based on your WT sequencing coverage.
         --max_distance: Define proximity for merging variants.
         --min_length: Set a minimum SV size (e.g., >50 bp for transposons).
  16. Clean and Filter the WT.vcf, Variant Filtering: Remove low-confidence variants based on read depth, quality scores, or allele frequency.

     To ensure the WT.vcf only includes relevant transposons or SVs:
    
         Use bcftools or similar tools to filter out low-confidence variants:
    
         bcftools filter -e "QUAL < 20 || INFO/SVTYPE != 'INS'" WT.vcf > WT_filtered.vcf
         bcftools filter -e "QUAL < 1 || INFO/SVTYPE != 'INS'" 1_.vcf > 1_filtered_.vcf
  17. NOTE that in this pipeline, the WT.fasta (reference genome) is typically a high-quality genome sequence from a database or a well-annotated version of your species’ genome. It is not assembled from the WT.fastq sequencing reads in this context. Here’s why:

     Why Use a Reference Genome (WT.fasta) from a Database?
    
         Higher Quality and Completeness:
             Database references (e.g., NCBI, Ensembl) are typically well-assembled, highly polished, and annotated. They serve as a reliable baseline for variant detection.
    
         Consistency:
             Using a standard reference ensures consistent comparisons across your WT and samples (1–9). Variants detected will be relative to this reference, not influenced by possible assembly errors.
    
         Saves Time:
             Assembling a reference genome from WT reads requires significant computational effort. Using an existing reference streamlines the analysis.
    
     Alternative: Assembling WT from FASTQ
    
     If you don’t have a high-quality reference genome (WT.fasta) and must rely on your WT FASTQ reads:
    
         Assemble the genome from your WT.fastq:
             Use long-read assemblers like Flye, Canu, or Shasta to create a draft genome.
    
         flye --nano-raw WT.fastq --out-dir WT_assembly --genome-size 
    Polish the assembly using tools like Racon (with the same reads) or Medaka for higher accuracy. Use the assembled and polished genome as your WT.fasta reference for further steps. Key Takeaways: If you have access to a reliable, high-quality reference genome, use it as the WT.fasta. Only assemble WT.fasta from raw reads (WT.fastq) if no database reference is available for your organism.
  18. Annotate Transposable Elements: Tools like ANNOVAR or SnpEff provide functional insights into the detected variants.

     # -- (successful!) MANUALLY Search for all found insertion sequences at https://tncentral.ncc.unesp.br/blast/ !
     # Or use the program available at https://github.com/danillo-alvarenga/tncomp_finder if there are numerous matches.
     #https://tncentral.ncc.unesp.br/report/te/Tn551-Y13600.1
    
     # -- (failed!) try TEPID for annotation
     mamba install tepid=0.10 -c bioconda
     #(tepid_env)
     for sample in WT 1 2 3 4 5 7 8 9 10; do
         tepid-map-se -x CP020463 -p 10 -n ${sample}_tepid -q  ../batch1_depth25/trycycler_${sample}/reads.fastq;
         tepid-discover -k -i -p 10 -n ${sample}_tepid -c ${sample}_tepid --se;
     done
    
     tepid-discover -k -i -p 10 -n 1_tepid -c 1.sorted.bam --se;
    
     tepid-refine [-h] [--version] [-k] [-i INSERTIONS] [-d DELETIONS]
                 [-p PROC] -t TE -n NAME -c CONC -s SPLIT -a AL
    
     # -- (failed!) try EDTA for annotation
     https://github.com/oushujun/EDTA
     (transposon_long) mamba install -c conda-forge -c bioconda edta
     mamba install bioconda::rmblast  # cd RepeatMasker; ./configure
     EDTA.pl --genome CP020463.fasta --species others --threads 40
    
     For general-purpose TE annotation: EDTA, RepeatMasker, or RepeatScout are your best options.
     For de novo repeat identification: RepeatScout is highly effective.
     For LTR retrotransposons: Use LTR_retriever.
     For bacterial-specific annotations: Transposome, TEfinder, and ISfinder can be useful.
  19. Validation: Cross-validate with short-read sequencing data if available.

  20. (Optional) Assembly the nanopore-sequencing using

     1. merge and prepare fastq.gz
    
     2. calcuclate the precentage of coding DNA: https://www.ncbi.nlm.nih.gov/nuccore/CP020463 with additional plasmid.
    
         * Average gene length: 870.4 bp
         * Total coding region length: 2056168 bp
         * Percentage of coding DNA = (Total Coding Region Length / Total Genome Length) × 100 = 2056168 bp / 2454929 bp = 83.8%
    
     3. Prepare the fastq.gz
    
         conda activate /home/jhuang/miniconda3/envs/trycycler  # under jhuang@hamm (10.169.63.113).
    
         rsync -a -P *.fastq.gz jhuang@10.169.63.113:/home/jhuang/DATA/Data_Patricia_Transposon_2025/
         for sample in barcode01 barcode02 barcode03 barcode04  HD46_Ctrl HD46_5 HD46_6 HD46_7 HD46_8 HD46_13; do
             mkdir trycycler_${sample};
             mv ${sample}.fastq.gz trycycler_${sample}/reads.fastq.gz;
             gunzip trycycler_${sample}/reads.fastq.gz;
         done
         #mkdir batch3;
         #cd batch3;
         #for sample in WT 1 2 3 4 5 7 8 9 10; do
         #    mkdir trycycler_${sample};
         #    cp ${sample}_nanopore/${sample}.fastq.gz trycycler_${sample}/reads.fastq.gz;
         #    gunzip trycycler_${sample}/reads.fastq.gz;
         #done
    
     4. Running separate assemblies (6x4 times: canu, flye, miniasm_and_minipolish, necat, nextdenovo, raven) using trycycler_assembly_extra-thorough.sh (under the trycycler environment running the following steps)
             # batch1: min_read_cov=25; batch2: min_read_cov=50; batch3: min_read_cov=100 (if necessary on the monday!)
             cp ../Data_PaulBongarts_S.epidermidis_HDRNA/trycycler_*.sh ./
             #for sample in trycycler_HDRNA_10 trycycler_HDRNA_13; do
             for sample in barcode01 barcode02 barcode03 barcode04  HD46_Ctrl HD46_5 HD46_6 HD46_7 HD46_8 HD46_13; do
                 cd trycycler_${sample};
                 ../trycycler_assembly_extra-thorough.sh
                 cd ..;
             done
    
     # END!
     #TODO: upload the nanopore-sequencing data to NCBI for the HDRNA_10 and HDRNA_13 to the correspoinding project.
    
     (bengal3_ac3) jhuang@WS-2290C:~/DATA/Data_PaulBongarts_S.epidermidis_HDRNA/trycycler_HDRNA_10/assemblies$ mlst assembly_02.fasta
     assembly_02.fasta       sepidermidis    87      arcC(7) aroE(1) gtr(1)  mutS(2) pyrR(2) tpiA(1) yqiL(1)
     [15:32:46] If you like MLST, you're absolutely going to love wgMLST!
     [15:32:46] Done.
     (bengal3_ac3) jhuang@WS-2290C:~/DATA/Data_PaulBongarts_S.epidermidis_HDRNA/trycycler_HDRNA_10/assemblies$ mlst assembly_20.fasta
    
     assembly_20.fasta       sepidermidis    87      arcC(7) aroE(1) gtr(1)  mutS(2) pyrR(2) tpiA(1) yqiL(1)
     [15:33:01] You can use --label XXX to replace an ugly filename in the output.
     [15:33:01] Done.
    
     (bengal3_ac3) jhuang@WS-2290C:/mnt/md1/DATA/Data_PaulBongarts_S.epidermidis_HDRNA/trycycler_HDRNA_13/assemblies$ mlst assembly_02.fasta
     assembly_02.fasta       sepidermidis    5       arcC(1) aroE(1) gtr(1)  mutS(2) pyrR(2) tpiA(1) yqiL(1)
    
     (bengal3_ac3) jhuang@WS-2290C:/mnt/md1/DATA/Data_PaulBongarts_S.epidermidis_HDRNA/trycycler_HDRNA_13/assemblies$ mlst assembly_08.fasta
    
     assembly_08.fasta       sepidermidis    5       arcC(1) aroE(1) gtr(1)  mutS(2) pyrR(2) tpiA(1) yqiL(1)
     [15:38:22] Remember that --minscore is only used when using automatic scheme detection.
     [15:38:22] Done.
     (bengal3_ac3) jhuang@WS-2290C:/mnt/md1/DATA/Data_PaulBongarts_S.epidermidis_HDRNA/trycycler_HDRNA_13/assemblies$ mlst assembly_14.fasta
     assembly_14.fasta       sepidermidis    5       arcC(1) aroE(1) gtr(1)  mutS(2) pyrR(2) tpiA(1) yqiL(1)
     [15:38:50] Thanks for using mlst, I hope you found it useful.
     [15:38:50] Done.
     (bengal3_ac3) jhuang@WS-2290C:/mnt/md1/DATA/Data_PaulBongarts_S.epidermidis_HDRNA/trycycler_HDRNA_13/assemblies$ mlst assembly_20.fasta
     assembly_20.fasta       sepidermidis    5       arcC(1) aroE(1) gtr(1)  mutS(2) pyrR(2) tpiA(1) yqiL(1)
     [15:38:54] If you like MLST, you're going to absolutely love cgMLST!
     [15:38:54] Done.
    
             #- under the directory batch3
             #for sample in trycycler_WT trycycler_1 trycycler_2 trycycler_3 trycycler_4 trycycler_5 trycycler_7 trycycler_8 trycycler_9 trycycler_10; do
             #    cd ${sample};
             #    ../trycycler_assembly_extra-thorough_threads5_cov100.sh;
             #    cd ..;
             #done
    
             #if ERROR, running separate assembly-methods raven and canu
             #for sample in trycycler_WT trycycler_1 trycycler_2 trycycler_3 trycycler_4 trycycler_5 trycycler_7 trycycler_8 trycycler_9 trycycler_10; do
             #    cd ${sample};
             #    ../trycycler_assembly_extra-thorough_raven.sh;
             #    cd ..;
             #done
             #
             #for sample in trycycler_WT trycycler_1 trycycler_2 trycycler_3 trycycler_4 trycycler_5 trycycler_7 trycycler_8 trycycler_9 trycycler_10; do
             #    cd ${sample};
             #    ../trycycler_assembly_extra-thorough_canu.sh;
             #    cd ..;
             #done
    
     5. trycycler cluster
    
             for sample in trycycler_5 trycycler_7 trycycler_8 trycycler_9 trycycler_10; do
             for sample in trycycler_WT trycycler_1 trycycler_2 trycycler_3 trycycler_4; do
                 cd ${sample};
                 rm -rf trycycler
                 trycycler cluster --threads 10 --assemblies assemblies/*.fasta --reads reads.fastq --out_dir trycycler;
                 cd ..;
             done
    
     6. trycycler reconcile
    
             cd trycycler_WT
             trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_001
             mkdir trycycler/cluster_001/1_contigs_removed
             mv trycycler/cluster_001/1_contigs/C_utg000001c.fasta trycycler/cluster_001/1_contigs_removed
             mv trycycler/cluster_001/1_contigs/K_ctg000000.fasta trycycler/cluster_001/1_contigs_removed
             mv trycycler/cluster_001/1_contigs/S_tig00000001.fasta trycycler/cluster_001/1_contigs_removed
    
             trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_002
                 # -- under batch3 --
                 176K Nov  5 12:38 P_bctg00000001.fasta
                 122K Nov  5 12:38 D_bctg00000001.fasta
                 95K Nov  5 12:38 J_bctg00000001.fasta
                 78K Nov  5 12:38 J_bctg00000002.fasta
                 #--> Error: unable to find a suitable common sequence
    
                 69K Nov  5 17:24 G_tig00000004.fasta
                 66K Nov  5 17:24 S_tig00000004.fasta
                 66K Nov  5 17:24 M_tig00000004.fasta
                 65K Nov  5 17:24 A_tig00000005.fasta
                 #--> Error: unable to find a suitable common sequence
                 #If repeat M_tig00000004.fasta twice, resulting in the following error: Circularising M_tig00000004:
                 #  using S_tig00000004:
                 #    unable to circularise: M_tig00000004's start and end were found in multiple places in S_tig00000004
                 #Error: failed to circularise sequence M_tig00000004 because its start/end sequences were found in multiple ambiguous places in other sequences. #This is likely because M_tig00000004 starts/ends in a repetitive region. You can either manually repair its circularisation (and ensure it does #not start/end in a repetitive region) or exclude the sequence altogether and try again.
    
                 47K Nov  5 17:24 V_bctg00000001.fasta
                 45K Nov  5 17:24 C_utg000002c.fasta
                 #Error: unable to find a suitable common sequence
    
                 34K Nov  5 17:24 U_utg000004c.fasta
                 34K Nov  5 17:24 N_contig_2.fasta
                 34K Nov  5 17:24 T_contig_2.fasta
                 #Error: unable to find a suitable common sequence
    
                 23K Nov  5 17:24 I_utg000003c.fasta
                 22K Nov  5 17:24 B_contig_2.fasta
                 22K Nov  5 17:24 H_contig_2.fasta
                 #Error: unable to find a suitable common sequence
    
                 13K Nov  5 17:24 G_tig00000006.fasta
                 12K Nov  5 17:24 M_tig00000003.fasta
                 12K Nov  5 17:24 O_utg000002c.fasta
                 12K Nov  5 17:24 S_tig00000003.fasta
                 11K Nov  5 17:24 A_tig00000004.fasta
                 #trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_002 --max_length_diff 1.2 --max_trim_seq_percent 20.0
                 #--> SUCCESSFULLY saving sequences to file: trycycler/cluster_002/2_all_seqs.fasta
    
             trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_003
             trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_004 --max_length_diff 1.3 --max_trim_seq_percent 30.0
             trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_005 #Error: two or more input contigs are required
             trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_006 #Error: two or more input contigs are required
             trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_007 #Error: two or more input contigs are required
    
             cd ../trycycler_1
             trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_001
             mkdir trycycler/cluster_001/1_contigs_removed
             mv trycycler/cluster_001/1_contigs/I_utg000001l.fasta trycycler/cluster_001/1_contigs_removed
             mv trycycler/cluster_001/1_contigs/K_ctg000000.fasta trycycler/cluster_001/1_contigs_removed
             mv trycycler/cluster_001/1_contigs/S_tig00000001.fasta trycycler/cluster_001/1_contigs_removed
             mv trycycler/cluster_001/1_contigs/W_ctg000000.fasta trycycler/cluster_001/1_contigs_removed
             mv trycycler/cluster_001/1_contigs/A_tig00000001.fasta trycycler/cluster_001/1_contigs_removed
             mv trycycler/cluster_001/1_contigs/U_utg000001c.fasta trycycler/cluster_001/1_contigs_removed
    
             trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_002
             #TODO_TOMORROW_HERE!
             trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_003 --max_length_diff 1.7 --max_trim_seq_percent 70.0
             #Error: failed to circularise sequence I_utg000002l because its end could not be found in other sequences.
             (NOTE) using the plasmid from other isolate to help the cut of the sequence.
                 #cp ~/DATA/Data_Patricia_Transposon/batch3/trycycler_WT/trycycler/cluster_004/7_final_consensus.fasta .
                 #cat *.fasta > all.fasta
                 #mafft --clustalout --adjustdirection all.fasta > all.aln
             cluster_004_con -------------------------------------------------------tatga
             I_utg000002l    agaatcagaattaggcgcataatttacaggaggtttgattatggctaagaaaaaatatga
             O_utg000002l    agaatcagaattaggcgcataatttacaggaggtttgattatggctaagaaaaaatatga
                                                                                 *****
             cluster_004_con gcagaatcagaattaggcgcataatttacaggaggtttgattatggctaagaaaaaa---
             I_utg000002l    gcagaatcagaattaggcgcataatttacaggaggtttgattatggctaagaaaaatatg
             O_utg000002l    gcagaatcagaattaggcgcataatttacaggaggtttgattatggctaagaaaaatatg
                             ********************************************************
    
             grep "I_utg000002l" all.a_ > I_utg000002l.fasta
             grep "O_utg000002l" all.a_ > O_utg000002l.fasta
             cut -d' ' -f5 I_utg000002l.fasta > I_utg000002l_.fasta
             cut -d' ' -f5 O_utg000002l.fasta > O_utg000002l_.fasta
             sed 's/-//g' I_utg000002l_.fasta > I_utg000002l__.fasta
             sed 's/-//g' O_utg000002l_.fasta > O_utg000002l__.fasta
             seqtk seq I_utg000002l__.fasta -l 80 > I_utg000002l___.fasta
             seqtk seq O_utg000002l__.fasta -l 80 > O_utg000002l___.fasta
    
             #Cut two files Manually with the sequence: AGAATCAGAATTAGGCGCATAATTTACAGGAGGTTTGATTATGGCTAAGAAAAA
             cat I_utg000002l___.fasta O_utg000002l___.fasta > 2_all_seqs.fasta
             trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_004 #Error: two or more input contigs are required
    
             cd ../trycycler_2
             trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_001
             mkdir trycycler/cluster_001/1_contigs_removed
             mv trycycler/cluster_001/1_contigs/O_bctg00000000.fasta trycycler/cluster_001/1_contigs_removed
             trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_002
             trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_003
             trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_004
    
             cd ../trycycler_3
             trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_001
             mkdir trycycler/cluster_001/1_contigs_removed
             mv trycycler/cluster_001/1_contigs/E_ctg000000.fasta trycycler/cluster_001/1_contigs_removed
             trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_002
             trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_003
             trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_004
    
             cd ../trycycler_4
             trycycler reconcile --threads 5 --reads reads.fastq --cluster_dir trycycler/cluster_001
             mkdir trycycler/cluster_001/1_contigs_removed
             mv trycycler/cluster_001/1_contigs/C_utg000001l.fasta trycycler/cluster_001/1_contigs_removed
             mv trycycler/cluster_001/1_contigs/E_ctg000000.fasta trycycler/cluster_001/1_contigs_removed
             mv trycycler/cluster_001/1_contigs/F_Utg670.fasta trycycler/cluster_001/1_contigs_removed
             mv trycycler/cluster_001/1_contigs/V_bctg00000000.fasta trycycler/cluster_001/1_contigs_removed
             trycycler reconcile --threads 5 --reads reads.fastq --cluster_dir trycycler/cluster_002
             trycycler reconcile --threads 5 --reads reads.fastq --cluster_dir trycycler/cluster_003
             trycycler reconcile --threads 5 --reads reads.fastq --cluster_dir trycycler/cluster_004
    
             cd ../trycycler_5
             trycycler reconcile --threads 5 --reads reads.fastq --cluster_dir trycycler/cluster_001
             trycycler reconcile --threads 5 --reads reads.fastq --cluster_dir trycycler/cluster_002
             trycycler reconcile --threads 5 --reads reads.fastq --cluster_dir trycycler/cluster_003
             trycycler reconcile --threads 5 --reads reads.fastq --cluster_dir trycycler/cluster_004
    
             cd ../trycycler_7
             trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_001
             mkdir trycycler/cluster_001/1_contigs_removed
             mv trycycler/cluster_001/1_contigs/C_utg000001l.fasta trycycler/cluster_001/1_contigs_removed
             mv trycycler/cluster_001/1_contigs/D_bctg00000000.fasta trycycler/cluster_001/1_contigs_removed
             trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_002
             trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_003
             trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_004
             trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_005
    
             cd ../trycycler_8
             trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_001
             mkdir trycycler/cluster_001/1_contigs_removed
             mv trycycler/cluster_001/1_contigs/M_utg000001l.fasta trycycler/cluster_001/1_contigs_removed
             trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_002
             trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_003
             trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_004
             trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_005
    
             cd ../trycycler_9
             trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_001
             trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_002
             trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_003
             trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_004
             trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_005
             trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_006
    
             cd ../trycycler_10
             trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_001
             trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_002
             trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_003
             trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_004
             trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_005
    
             #Nachmachen: for sample in trycycler_9 trycycler_10; do
             cd ${sample};
             ../trycycler_assembly_extra-thorough_canu.sh;
             cd ..;
             done
    
     7. (optional) map the circular assemblies to plasmid databases
             #bzip2 -d plsdb.fna.bz2
             #makeblastdb -in plsdb.fna -dbtype nucl
             #blastn -db plsdb.fna -query all.fasta -evalue 1e-50 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 > all_on_plsdb.blastn
    
     8. trycycler msa
    
             trycycler msa --threads 55 --cluster_dir trycycler/cluster_001
             trycycler msa --threads 55 --cluster_dir trycycler/cluster_002
             trycycler msa --threads 55 --cluster_dir trycycler/cluster_003
             trycycler msa --threads 55 --cluster_dir trycycler/cluster_004
             trycycler msa --threads 55 --cluster_dir trycycler/cluster_005
             #--> When finished, Trycycler reconcile will make a 3_msa.fasta file in the cluster directory
    
     9. trycycler partition
    
             #generate 4_reads.fastq for each contig!
             trycycler partition --threads 55 --reads reads.fastq --cluster_dirs trycycler/cluster_*
             #trycycler partition --threads 55 --reads reads.fastq --cluster_dirs trycycler/cluster_001 trycycler/cluster_002 trycycler/cluster_003
             trycycler partition --threads 55 --reads reads.fastq --cluster_dirs trycycler/cluster_003
    
     10. trycycler consensus
    
             trycycler consensus --threads 55 --cluster_dir trycycler/cluster_001
             trycycler consensus --threads 55 --cluster_dir trycycler/cluster_002
             trycycler consensus --threads 55 --cluster_dir trycycler/cluster_003
             trycycler consensus --threads 55 --cluster_dir trycycler/cluster_004
             trycycler consensus --threads 55 --cluster_dir trycycler/cluster_005
             #!!NOTE that we take the isolates of HD05_2_K5 and HD05_2_K6 assembled by Unicycler instead of Trycycler!!
             # TODO (TODAY), generate the 3 datasets below!
             # TODO (IMPORTANT): write a Email to Holger, say the short sequencing of HD5_2 is not correct, since the 3 datasets! However, the MTxxxxxxx is confirmed not in K5 and K6!
             TODO: variant calling needs the short-sequencing, they are not dorable without the correct short-reads! resequencing? It is difficult to call variants only from long-reads since too much errors in long-reads!
             #TODO: check the MT sequence if in the isolates, more deteiled annotations come late!
             #II. Comparing the results of Trycycler with Unicycler.
             #III. Eventually add the plasmids assembled from unicycler to the final results. E.g. add the 4 plasmids to K5 and K6
    
     11. Polishing after Trycycler
    
             #1. Oxford Nanopore sequencer (Ignored due to the samtools version incompatibility!)
             # for c in trycycler/cluster_*; do
             #     medaka_consensus -i "$c"/4_reads.fastq -d "$c"/7_final_consensus.fasta -o "$c"/medaka -m r941_min_sup_g507 -t 12
             #     mv "$c"/medaka/consensus.fasta "$c"/8_medaka.fasta
             #     rm -r "$c"/medaka "$c"/*.fai "$c"/*.mmi  # clean up
             # done
             # cat trycycler/cluster_*/8_medaka.fasta > trycycler/consensus.fasta
             #2. Short-read polishing
             #---- 5179_R1 (2) ----
             #  mean read depth: 205.8x
             #  188 bp have a depth of zero (99.9924% coverage)
             #  355 positions changed (0.0144% of total positions)
             #  estimated pre-polishing sequence accuracy: 99.9856% (Q38.42)
             #Step 1: read QC
             fastp --in1 ../../s-epidermidis-5179-r1_R1.fastq.gz --in2 ../../s-epidermidis-5179-r1_R2.fastq.gz --out1 1.fastq.gz --out2 2.fastq.gz --unpaired1 u.fastq.gz --unpaired2 u.fastq.gz
             #Step 2: Polypolish
             for cluster in cluster_001 cluster_002; do
             bwa index ${cluster}/7_final_consensus.fasta
             bwa mem -t 55 -a ${cluster}/7_final_consensus.fasta 1.fastq.gz > ${cluster}/alignments_1.sam
             bwa mem -t 55 -a ${cluster}/7_final_consensus.fasta 2.fastq.gz > ${cluster}/alignments_2.sam
             polypolish polish ${cluster}/7_final_consensus.fasta ${cluster}/alignments_1.sam ${cluster}/alignments_2.sam > ${cluster}/polypolish.fasta
             done
             #Step 3: POLCA
             for cluster in cluster_001 cluster_002; do
             cd ${cluster}
             polca.sh -a polypolish.fasta -r "../../../s-epidermidis-5179-r1_R1.fastq.gz ../../../s-epidermidis-5179-r1_R2.fastq.gz" -t 55 -m 120G
             cd ..
             done
             #Substitution Errors: 37
             #Insertion/Deletion Errors: 2
             #Assembly Size: 2470001
             #Consensus Quality: 99.9984
             #Substitution Errors: 4
             #Insertion/Deletion Errors: 0
             #Assembly Size: 17748
             #Consensus Quality: 99.9775
             #Step 4: (optional) more rounds and/or other polishers
             #After one round of Polypolish and one round of POLCA, your assembly should be in very good shape!
             #However, there may still be a few lingering errors. You can try running additional rounds of Polypolish or POLCA to see if they make any more changes.
             for cluster in cluster_001 cluster_002; do
             cd ${cluster}
             polca.sh -a polypolish.fasta.PolcaCorrected.fa -r "../../../s-epidermidis-5179-r1_R1.fastq.gz ../../../s-epidermidis-5179-r1_R2.fastq.gz" -t 55 -m 120G
             cd ..
             done
             Substitution Errors: 13
             Insertion/Deletion Errors: 0
             Assembly Size: 2470004
             Consensus Quality: 99.9995
             Substitution Errors: 0
             Insertion/Deletion Errors: 0
             Assembly Size: 17748
             Consensus Quality: 100
             for cluster in cluster_001; do
             cd ${cluster}
             polca.sh -a polypolish.fasta.PolcaCorrected.fa.PolcaCorrected.fa -r "../../../s-epidermidis-5179-r1_R1.fastq.gz ../../../s-epidermidis-5179-r1_R2.fastq.gz" -t 55 -m 120G
             cd ..
             done
             #Substitution Errors: 0
             #Insertion/Deletion Errors: 0
             #Assembly Size: 2470004
             #Consensus Quality: 100
             #---- 1585 (4) ----
             #  mean read depth: 174.7x
             #  8,297 bp have a depth of zero (99.6604% coverage)
             #  271 positions changed (0.0111% of total positions)
             #  estimated pre-polishing sequence accuracy: 99.9889% (Q39.55)
             #Step 1: read QC
             fastp --in1 ../../s-epidermidis-1585_R1.fastq.gz --in2 ../../s-epidermidis-1585_R2.fastq.gz --out1 1.fastq.gz --out2 2.fastq.gz --unpaired1 u.fastq.gz --unpaired2 u.fastq.gz
             #Step 2: Polypolish
             for cluster in cluster_001 cluster_002 cluster_003 cluster_004; do
             bwa index ${cluster}/7_final_consensus.fasta
             bwa mem -t 55 -a ${cluster}/7_final_consensus.fasta 1.fastq.gz > ${cluster}/alignments_1.sam
             bwa mem -t 55 -a ${cluster}/7_final_consensus.fasta 2.fastq.gz > ${cluster}/alignments_2.sam
             polypolish polish ${cluster}/7_final_consensus.fasta ${cluster}/alignments_1.sam ${cluster}/alignments_2.sam > ${cluster}/polypolish.fasta
             done
             #Step 3: POLCA
             for cluster in cluster_001 cluster_002 cluster_003 cluster_004; do
             cd ${cluster}
             polca.sh -a polypolish.fasta -r "../../../s-epidermidis-1585_R1.fastq.gz ../../../s-epidermidis-1585_R2.fastq.gz" -t 55 -m 120G
             cd ..
             done
             #Substitution Errors: 7
             #Insertion/Deletion Errors: 4
             #Assembly Size: 2443174
             #Consensus Quality: 99.9995
             #Substitution Errors: 0
             #Insertion/Deletion Errors: 0
             #Assembly Size: 9014
             #Consensus Quality: 100
             #Substitution Errors: 0
             #Insertion/Deletion Errors: 0
             #Assembly Size: 9014
             #Consensus Quality: 100
             #Substitution Errors: 0
             #Insertion/Deletion Errors: 0
             #Assembly Size: 2344
             #Consensus Quality: 100
             #Step 4: (optional) more rounds and/or other polishers
             #After one round of Polypolish and one round of POLCA, your assembly should be in very good shape!
             #However, there may still be a few lingering errors. You can try running additional rounds of Polypolish or POLCA to see if they make any more changes.
             for cluster in cluster_001; do
             cd ${cluster}
             polca.sh -a polypolish.fasta.PolcaCorrected.fa -r "../../../s-epidermidis-1585_R1.fastq.gz ../../../s-epidermidis-1585_R2.fastq.gz" -t 55 -m 120G
             cd ..
             done
             #Substitution Errors: 0
             #Insertion/Deletion Errors: 0
             #Assembly Size: 2443176
             #Consensus Quality: 100
             #---- 1585 derived from unicycler, under 1585_normal/unicycler (4) ----
             #Step 0: copy chrom and plasmid1, plasmid2, plasmid3 to cluster_001/7_final_consensus.fasta, ...
             #Step 1: read QC
             fastp --in1 ../../s-epidermidis-1585_R1.fastq.gz --in2 ../../s-epidermidis-1585_R2.fastq.gz --out1 1.fastq.gz --out2 2.fastq.gz --unpaired1 u.fastq.gz --unpaired2 u.fastq.gz
             #Step 2: Polypolish
             for cluster in cluster_001 cluster_002 cluster_003 cluster_004; do
             bwa index ${cluster}/7_final_consensus.fasta
             bwa mem -t 55 -a ${cluster}/7_final_consensus.fasta 1.fastq.gz > ${cluster}/alignments_1.sam
             bwa mem -t 55 -a ${cluster}/7_final_consensus.fasta 2.fastq.gz > ${cluster}/alignments_2.sam
             polypolish polish ${cluster}/7_final_consensus.fasta ${cluster}/alignments_1.sam ${cluster}/alignments_2.sam > ${cluster}/polypolish.fasta
             done
             #Polishing 1 (2,443,574 bp):
             #mean read depth: 174.7x
             #8,298 bp have a depth of zero (99.6604% coverage)
             #52 positions changed (0.0021% of total positions)
             #estimated pre-polishing sequence accuracy: 99.9979% (Q46.72)
             #Polishing 2 (9,014 bp):
             #mean read depth: 766.5x
             #3 bp have a depth of zero (99.9667% coverage)
             #0 positions changed (0.0000% of total positions)
             #estimated pre-polishing sequence accuracy: 100.0000% (Q∞)
             #Polishing 7 (2,344 bp):
             #mean read depth: 2893.0x
             #4 bp have a depth of zero (99.8294% coverage)
             #0 positions changed (0.0000% of total positions)
             #estimated pre-polishing sequence accuracy: 100.0000% (Q∞)
             #Polishing 8 (2,255 bp):
             #mean read depth: 2719.6x
             #4 bp have a depth of zero (99.8226% coverage)
             #0 positions changed (0.0000% of total positions)
             #estimated pre-polishing sequence accuracy: 100.0000% (Q∞)
             #Step 3: POLCA
             for cluster in cluster_001 cluster_002 cluster_003 cluster_004; do
             cd ${cluster}
             polca.sh -a polypolish.fasta -r "../../../s-epidermidis-1585_R1.fastq.gz ../../../s-epidermidis-1585_R2.fastq.gz" -t 55 -m 120G
             cd ..
             done
             #Substitution Errors: 7
             #Insertion/Deletion Errors: 4
             #Assembly Size: 2443598
             #Consensus Quality: 99.9995
             #Substitution Errors: 0
             #Insertion/Deletion Errors: 0
             #Assembly Size: 9014
             #Consensus Quality: 100
             #Substitution Errors: 0
             #Insertion/Deletion Errors: 0
             #Assembly Size: 2344
             #Consensus Quality: 100
             #Substitution Errors: 0
             #Insertion/Deletion Errors: 0
             #Assembly Size: 2255
             #Consensus Quality: 100
             #Step 4: (optional) more rounds and/or other polishers
             #After one round of Polypolish and one round of POLCA, your assembly should be in very good shape!
             #However, there may still be a few lingering errors. You can try running additional rounds of Polypolish or POLCA to see if they make any more changes.
             for cluster in cluster_001; do
             cd ${cluster}
             polca.sh -a polypolish.fasta.PolcaCorrected.fa -r "../../../s-epidermidis-1585_R1.fastq.gz ../../../s-epidermidis-1585_R2.fastq.gz" -t 55 -m 120G
             cd ..
             done
             #Substitution Errors: 0
             #Insertion/Deletion Errors: 0
             #Assembly Size: 2443600
             #Consensus Quality: 100
             #-- 1585v (1, no short reads, waiting) --
             # TODO!
             #-- 5179 (2) --
             #mean read depth: 120.7x
             #7,547 bp have a depth of zero (99.6946% coverage)
             #356 positions changed (0.0144% of total positions)
             #estimated pre-polishing sequence accuracy: 99.9856% (Q38.41)
             #Step 1: read QC
             fastp --in1 ../../s-epidermidis-5179_R1.fastq.gz --in2 ../../s-epidermidis-5179_R2.fastq.gz --out1 1.fastq.gz --out2 2.fastq.gz --unpaired1 u.fastq.gz --unpaired2 u.fastq.gz
             #Step 2: Polypolish
             for cluster in cluster_001 cluster_002; do
             bwa index ${cluster}/7_final_consensus.fasta
             bwa mem -t 55 -a ${cluster}/7_final_consensus.fasta 1.fastq.gz > ${cluster}/alignments_1.sam
             bwa mem -t 55 -a ${cluster}/7_final_consensus.fasta 2.fastq.gz > ${cluster}/alignments_2.sam
             polypolish polish ${cluster}/7_final_consensus.fasta ${cluster}/alignments_1.sam ${cluster}/alignments_2.sam > ${cluster}/polypolish.fasta
             done
             #Step 3: POLCA
             for cluster in cluster_001 cluster_002; do
             cd ${cluster}
             polca.sh -a polypolish.fasta -r "../../../s-epidermidis-5179_R1.fastq.gz ../../../s-epidermidis-5179_R2.fastq.gz" -t 55 -m 120G
             cd ..
             done
             #Substitution Errors: 49
             #Insertion/Deletion Errors: 23
             #Assembly Size: 2471418
             #Consensus Quality: 99.9971
             #Substitution Errors: 0
             #Insertion/Deletion Errors: 0
             #Assembly Size: 17748
             #Consensus Quality: 100
             #Step 4: (optional) more rounds POLCA
             for cluster in cluster_001; do
             cd ${cluster}
             polca.sh -a polypolish.fasta.PolcaCorrected.fa -r "../../../s-epidermidis-5179_R1.fastq.gz ../../../s-epidermidis-5179_R2.fastq.gz" -t 55 -m 120G
             cd ..
             done
             #Substitution Errors: 10
             #Insertion/Deletion Errors: 5
             #Assembly Size: 2471442
             #Consensus Quality: 99.9994
             for cluster in cluster_001; do
             cd ${cluster}
             polca.sh -a polypolish.fasta.PolcaCorrected.fa.PolcaCorrected.fa -r "../../../s-epidermidis-5179_R1.fastq.gz ../../../s-epidermidis-5179_R2.fastq.gz" -t 55 -m 120G
             cd ..
             done
             Substitution Errors: 6
             Insertion/Deletion Errors: 0
             Assembly Size: 2471445
             Consensus Quality: 99.9998
             for cluster in cluster_001; do
             cd ${cluster}
             polca.sh -a polypolish.fasta.PolcaCorrected.fa.PolcaCorrected.fa.PolcaCorrected.fa -r "../../../s-epidermidis-5179_R1.fastq.gz ../../../s-epidermidis-5179_R2.fastq.gz" -t 55 -m 120G
             cd ..
             done
             Substitution Errors: 0
             Insertion/Deletion Errors: 0
             Assembly Size: 2471445
             Consensus Quality: 100
             #-- HD5_2 (2): without the short-sequencing we cannot correct the base-calling! --
             # !ERROR to be REPORTED, the
             #Polishing cluster_001_consensus (2,504,140 bp):
             #mean read depth: 94.4x
             #240,420 bp have a depth of zero (90.3991% coverage)
             #56,894 positions changed (2.2720% of total positions)
             #estimated pre-polishing sequence accuracy: 97.7280% (Q16.44)
             /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_1_S37_R1_001.fastq
             /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_1_S37_R2_001.fastq
             /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_2_S38_R1_001.fastq
             /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_2_S38_R2_001.fastq
             /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_3_S39_R1_001.fastq
             /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_3_S39_R2_001.fastq
             /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_4_S40_R1_001.fastq
             /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_4_S40_R2_001.fastq
             /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_5_S41_R1_001.fastq
             /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_5_S41_R2_001.fastq
             /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_6_S42_R1_001.fastq
             /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_6_S42_R2_001.fastq
             /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_7_S43_R1_001.fastq
             /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_7_S43_R2_001.fastq
             /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_8_S44_R1_001.fastq
             /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_8_S44_R2_001.fastq
             /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_9_S45_R1_001.fastq
             /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_9_S45_R2_001.fastq
             /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_10_S46_R1_001.fastq
             /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_10_S46_R2_001.fastq
             #Step 1: read QC
             fastp --in1 ../../HD5_2_S38_R1_001.fastq.gz --in2 ../../HD5_2_S38_R2_001.fastq.gz --out1 1.fastq.gz --out2 2.fastq.gz --unpaired1 u.fastq.gz --unpaired2 u.fastq.gz
             # NOTE that the following steps are not run since the short-reads are not correct!
             # #Step 2: Polypolish
             # for cluster in cluster_001 cluster_005; do
             #   bwa index ${cluster}/7_final_consensus.fasta
             #   bwa mem -t 55 -a ${cluster}/7_final_consensus.fasta 1.fastq.gz > ${cluster}/alignments_1.sam
             #   bwa mem -t 55 -a ${cluster}/7_final_consensus.fasta 2.fastq.gz > ${cluster}/alignments_2.sam
             #   polypolish polish ${cluster}/7_final_consensus.fasta ${cluster}/alignments_1.sam ${cluster}/alignments_2.sam > ${cluster}/polypolish.fasta
             # done
             # #Step 3: POLCA
             # for cluster in cluster_001 cluster_005; do
             #   cd ${cluster}
             #   polca.sh -a polypolish.fasta -r "../../../HD5_2_S38_R1_001.fastq.gz ../../../HD5_2_S38_R2_001.fastq.gz" -t 55 -m 120G
             #   cd ..
             # done
             # #Step 4: (optional) more rounds POLCA
             # for cluster in cluster_001; do
             #   cd ${cluster}
             #   polca.sh -a polypolish.fasta.PolcaCorrected.fa -r "../../../HD5_2_S38_R1_001.fastq.gz ../../../HD5_2_S38_R2_001.fastq.gz" -t 55 -m 120G
             #   cd ..
             # done
             # NOTE that the plasmids of HD5_2_K5 and HD5_2_K6 were copied from Unicycler!
             #-- HD5_2_K5 (4) --
             mean read depth: 87.1x
             25 bp have a depth of zero (99.9990% coverage)
             1,085 positions changed (0.0433% of total positions)
             estimated pre-polishing sequence accuracy: 99.9567% (Q33.63)
             #Step 1: read QC
             fastp --in1 ../../275_K5_Holger_S92_R1_001.fastq.gz --in2 ../../275_K5_Holger_S92_R2_001.fastq.gz --out1 1.fastq.gz --out2 2.fastq.gz --unpaired1 u.fastq.gz --unpaired2 u.fastq.gz
             #Step 2: Polypolish
             for cluster in cluster_001 cluster_002 cluster_003 cluster_004; do
             bwa index ${cluster}/7_final_consensus.fasta
             bwa mem -t 55 -a ${cluster}/7_final_consensus.fasta 1.fastq.gz > ${cluster}/alignments_1.sam
             bwa mem -t 55 -a ${cluster}/7_final_consensus.fasta 2.fastq.gz > ${cluster}/alignments_2.sam
             polypolish polish ${cluster}/7_final_consensus.fasta ${cluster}/alignments_1.sam ${cluster}/alignments_2.sam > ${cluster}/polypolish.fasta
             done
             #Step 3: POLCA
             for cluster in cluster_001 cluster_002 cluster_003 cluster_004; do
             cd ${cluster}
             polca.sh -a polypolish.fasta -r "../../../275_K5_Holger_S92_R1_001.fastq.gz ../../../275_K5_Holger_S92_R2_001.fastq.gz" -t 55 -m 120G
             cd ..
             done
             #Substitution Errors: 146
             #Insertion/Deletion Errors: 2
             #Assembly Size: 2504401
             #Consensus Quality: 99.9941
             #Substitution Errors: 41
             #Insertion/Deletion Errors: 0
             #Assembly Size: 41288
             #Consensus Quality: 99.9007
             #Substitution Errors: 0
             #Insertion/Deletion Errors: 0
             #Assembly Size: 9191
             #Consensus Quality: 100
             #Substitution Errors: 0
             #Insertion/Deletion Errors: 0
             #Assembly Size: 2767
             #Consensus Quality: 100
             #Step 4: (optional) more rounds POLCA
             for cluster in cluster_001 cluster_002; do
             cd ${cluster}
             polca.sh -a polypolish.fasta.PolcaCorrected.fa -r "../../../275_K5_Holger_S92_R1_001.fastq.gz ../../../275_K5_Holger_S92_R2_001.fastq.gz" -t 55 -m 120G
             cd ..
             done
             #Substitution Errors: 41
             #Insertion/Deletion Errors: 0
             #Assembly Size: 2504401
             #Consensus Quality: 99.9984
             #Substitution Errors: 8
             #Insertion/Deletion Errors: 0
             #Assembly Size: 41288
             #Consensus Quality: 99.9806
             for cluster in cluster_001 cluster_002; do
             cd ${cluster}
             polca.sh -a polypolish.fasta.PolcaCorrected.fa.PolcaCorrected.fa -r "../../../275_K5_Holger_S92_R1_001.fastq.gz ../../../275_K5_Holger_S92_R2_001.fastq.gz" -t 55 -m 120G
             cd ..
             done
             #Substitution Errors: 8
             #Insertion/Deletion Errors: 0
             #Assembly Size: 2504401
             #Consensus Quality: 99.9997
             #Substitution Errors: 4
             #Insertion/Deletion Errors: 0
             #Assembly Size: 41288
             #Consensus Quality: 99.9903
             for cluster in cluster_001 cluster_002; do
             cd ${cluster}
             polca.sh -a polypolish.fasta.PolcaCorrected.fa.PolcaCorrected.fa.PolcaCorrected.fa -r "../../../275_K5_Holger_S92_R1_001.fastq.gz ../../../275_K5_Holger_S92_R2_001.fastq.gz" -t 55 -m 120G
             cd ..
             done
             #Substitution Errors: 8
             #Insertion/Deletion Errors: 0
             #Assembly Size: 2504401
             #Consensus Quality: 99.9997
             #Substitution Errors: 4
             #Insertion/Deletion Errors: 0
             #Assembly Size: 41288
             #Consensus Quality: 99.9903
             #-- HD5_2_K6 (4) --
             #mean read depth: 116.7x
             #4 bp have a depth of zero (99.9998% coverage)
             #1,022 positions changed (0.0408% of total positions)
             #estimated pre-polishing sequence accuracy: 99.9592% (Q33.89)
             #Step 1: read QC
             fastp --in1 ../../276_K6_Holger_S95_R1_001.fastq.gz --in2 ../../276_K6_Holger_S95_R2_001.fastq.gz --out1 1.fastq.gz --out2 2.fastq.gz --unpaired1 u.fastq.gz --unpaired2 u.fastq.gz
             #Step 2: Polypolish
             for cluster in cluster_001 cluster_002 cluster_003 cluster_004; do
             bwa index ${cluster}/7_final_consensus.fasta
             bwa mem -t 55 -a ${cluster}/7_final_consensus.fasta 1.fastq.gz > ${cluster}/alignments_1.sam
             bwa mem -t 55 -a ${cluster}/7_final_consensus.fasta 2.fastq.gz > ${cluster}/alignments_2.sam
             polypolish polish ${cluster}/7_final_consensus.fasta ${cluster}/alignments_1.sam ${cluster}/alignments_2.sam > ${cluster}/polypolish.fasta
             done
             #Step 3: POLCA
             for cluster in cluster_001 cluster_002 cluster_003 cluster_004; do
             cd ${cluster}
             polca.sh -a polypolish.fasta -r "../../../276_K6_Holger_S95_R1_001.fastq.gz ../../../276_K6_Holger_S95_R2_001.fastq.gz" -t 55 -m 120G
             cd ..
             done
             #Substitution Errors: 164
             #Insertion/Deletion Errors: 2
             #Assembly Size: 2504398
             #Consensus Quality: 99.9934
             #Substitution Errors: 22
             #Insertion/Deletion Errors: 0
             #Assembly Size: 41288
             #Consensus Quality: 99.9467
             #Substitution Errors: 0
             #Insertion/Deletion Errors: 0
             #Assembly Size: 9191
             #Consensus Quality: 100
             #Substitution Errors: 0
             #Insertion/Deletion Errors: 0
             #Assembly Size: 2767
             #Consensus Quality: 100
             #Step 4: (optional) more rounds POLCA
             for cluster in cluster_001 cluster_002; do
             cd ${cluster}
             polca.sh -a polypolish.fasta.PolcaCorrected.fa -r "../../../276_K6_Holger_S95_R1_001.fastq.gz ../../../276_K6_Holger_S95_R2_001.fastq.gz" -t 55 -m 120G
             cd ..
             done
             #Substitution Errors: 32
             #Insertion/Deletion Errors: 0
             #Assembly Size: 2504400
             #Consensus Quality: 99.9987
             #Substitution Errors: 0
             #Insertion/Deletion Errors: 0
             #Assembly Size: 41288
             #Consensus Quality: 100
             for cluster in cluster_001; do
             cd ${cluster}
             polca.sh -a polypolish.fasta.PolcaCorrected.fa.PolcaCorrected.fa -r "../../../276_K6_Holger_S95_R1_001.fastq.gz ../../../276_K6_Holger_S95_R2_001.fastq.gz" -t 55 -m 120G
             cd ..
             done
             #Substitution Errors: 4
             #Insertion/Deletion Errors: 0
             #Assembly Size: 2504400
             #Consensus Quality: 99.9998
             for cluster in cluster_001; do
             cd ${cluster}
             polca.sh -a polypolish.fasta.PolcaCorrected.fa.PolcaCorrected.fa.PolcaCorrected.fa -r "../../../276_K6_Holger_S95_R1_001.fastq.gz ../../../276_K6_Holger_S95_R2_001.fastq.gz" -t 55 -m 120G
             cd ..
             done
             #Substitution Errors: 2
             #Insertion/Deletion Errors: 0
             #Assembly Size: 2504400
             #Consensus Quality: 99.9999
  21. (Optional) use platon to confirm the plasmid contigs: https://github.com/oschwengers/platon

Leave a Reply

Your email address will not be published. Required fields are marked *