Processing Data_Patricia_Transposon_2025 (Workflow for Structural Variant Calling in Nanopore Sequencing)

  1. install mambaforge https://conda-forge.org/miniforge/ (recommended)

    #download Mambaforge-24.9.2-0-Linux-x86_64.sh from website
    chmod +x Mambaforge-24.9.2-0-Linux-x86_64.sh
    ./Mambaforge-24.9.2-0-Linux-x86_64.sh
    
    To activate this environment, use:
        micromamba activate /home/jhuang/mambaforge
    Or to execute a single command in this environment, use:
        micromamba run -p /home/jhuang/mambaforge mycommand
    installation finished.
    
    Do you wish to update your shell profile to automatically initialize conda?
    This will activate conda on startup and change the command prompt when activated.
    If you'd prefer that conda's base environment not be activated on startup,
      run the following command when conda is activated:
    
    conda config --set auto_activate_base false
    
    You can undo this by running `conda init --reverse $SHELL`? [yes|no]
    [no] >>> yes
    no change     /home/jhuang/mambaforge/condabin/conda
    no change     /home/jhuang/mambaforge/bin/conda
    no change     /home/jhuang/mambaforge/bin/conda-env
    no change     /home/jhuang/mambaforge/bin/activate
    no change     /home/jhuang/mambaforge/bin/deactivate
    no change     /home/jhuang/mambaforge/etc/profile.d/conda.sh
    no change     /home/jhuang/mambaforge/etc/fish/conf.d/conda.fish
    no change     /home/jhuang/mambaforge/shell/condabin/Conda.psm1
    no change     /home/jhuang/mambaforge/shell/condabin/conda-hook.ps1
    no change     /home/jhuang/mambaforge/lib/python3.12/site-packages/xontrib/conda.xsh
    no change     /home/jhuang/mambaforge/etc/profile.d/conda.csh
    modified      /home/jhuang/.bashrc
    ==> For changes to take effect, close and re-open your current shell. <==
    no change     /home/jhuang/mambaforge/condabin/conda
    no change     /home/jhuang/mambaforge/bin/conda
    no change     /home/jhuang/mambaforge/bin/conda-env
    no change     /home/jhuang/mambaforge/bin/activate
    no change     /home/jhuang/mambaforge/bin/deactivate
    no change     /home/jhuang/mambaforge/etc/profile.d/conda.sh
    no change     /home/jhuang/mambaforge/etc/fish/conf.d/conda.fish
    no change     /home/jhuang/mambaforge/shell/condabin/Conda.psm1
    no change     /home/jhuang/mambaforge/shell/condabin/conda-hook.ps1
    no change     /home/jhuang/mambaforge/lib/python3.12/site-packages/xontrib/conda.xsh
    no change     /home/jhuang/mambaforge/etc/profile.d/conda.csh
    no change     /home/jhuang/.bashrc
    No action taken.
    WARNING conda.common.path.windows:_path_to(100): cygpath is not available, fallback to manual path conversion
    WARNING conda.common.path.windows:_path_to(100): cygpath is not available, fallback to manual path conversion
    Added mamba to /home/jhuang/.bashrc
    ==> For changes to take effect, close and re-open your current shell. <==
    Thank you for installing Mambaforge!
    
    Close your terminal window and open a new one, or run:
    #source ~/mambaforge/bin/activate
    conda --version
    mamba --version
    
    https://github.com/conda-forge/miniforge/releases
    Note
    
        * After installation, please make sure that you do not have the Anaconda default channels configured.
            conda config --show channels
            conda config --remove channels defaults
            conda config --add channels conda-forge
            conda config --show channels
            conda config --set channel_priority strict
            #conda clean --all
            conda config --remove channels biobakery
    
        * !!!!Do not install anything into the base environment as this might break your installation. See here for details.!!!!
    
    # --Deprecated method: mamba installing on conda--
    #conda install -n base --override-channels -c conda-forge mamba 'python_abi=*=*cp*'
    #    * Note that installing mamba into any other environment than base is not supported.
    #
    #conda activate base
    #conda install conda
    #conda uninstall mamba
    #conda install mamba

2: install required Tools on the mamba env

    * Sniffles2: Detect structural variants, including transposons, from long-read alignments.
    * RepeatModeler2: Identify and classify transposons de novo.
    * RepeatMasker: Annotate known transposable elements using transposon libraries.
    * SVIM: An alternative structural variant caller optimized for long-read sequencing, if needed.
    * SURVIVOR: Consolidate structural variants across samples for comparative analysis.

    mamba deactivate
    # Create a new conda environment
    mamba create -n transposon_long python=3.6 -y

    # Activate the environment
    mamba activate transposon_long

    mamba install -c bioconda sniffles
    mamba install -c bioconda repeatmodeler repeatmasker

    # configure repeatmasker database
    mamba info --envs
    cd /home/jhuang/mambaforge/envs/transposon_long/share/RepeatMasker

    #mamba install python=3.6
    mamba install -c bioconda svim
    mamba install -c bioconda survivor
  1. Test the installed tools

    # Check versions
    sniffles --version
    RepeatModeler -h
    RepeatMasker -h
    svim --help
    SURVIVOR --help
    mamba install -c conda-forge perl r
  2. Data Preparation

    Raw Signal Data: Nanopore devices generate electrical signal data as DNA passes through the nanopore.
    Basecalling: Tools like Guppy or Dorado are used to convert raw signals into nucleotide sequences (FASTQ files).
  3. Preprocessing

    Quality Filtering: Remove low-quality reads using tools like Filtlong or NanoFilt.
    Adapter Trimming: Identify and remove sequencing adapters with tools like Porechop.
  4. (Optional) Variant Calling for SNP and Indel Detection:

    Tools like Medaka, Longshot, or Nanopolish analyze the aligned reads to identify SNPs and small indels.
  5. (OFFICIAL STARTING POINT) Alignment and Structural Variant Calling: Tools such as Sniffles or SVIM detect large insertions, deletions, and other structural variants. 使用长读长测序工具如 SVIM 或 Sniffles 检测结构变异(e.g. 散在性重复序列)。

      #NOTE that the ./batch1_depth25/trycycler_WT/reads.fastq and F24A430001437_BACctmoD/BGI_result/Separate/${sample}/1.Cleandata/${sample}.filtered_reads.fq.gz are the same!
    
      # -- PREPARING the input fastq-data, merge the fastqz and move the top-directory
    
      # Under raw_data/no_sample_id/20250731_0943_MN45170_FBD12615_97f118c2/fastq_pass
      zcat ./barcode01/FBD12615_pass_barcode01_97f118c2_aa46ecf7_0.fastq.gz ./barcode01/FBD12615_pass_barcode01_97f118c2_aa46ecf7_1.fastq.gz ./barcode01/FBD12615_pass_barcode01_97f118c2_aa46ecf7_2.fastq.gz ./barcode01/FBD12615_pass_barcode01_97f118c2_aa46ecf7_3.fastq.gz ... | gzip > HD46_1.fastq.gz
      mv ./raw_data/no_sample_id/20250731_0943_MN45170_FBD12615_97f118c2/fastq_pass/HD46_1.fastq.gz ~/DATA/Data_Patricia_Transposon_2025
    
        #this are the corresponding sample names:
        #barcode 1: HD46-1
        #barcode 2: HD46-2
        #barcode 3: HD46-3
        #barcode 4: HD46-4
        mv barcode01.fastq.gz HD46_1.fastq.gz
        mv barcode02.fastq.gz HD46_2.fastq.gz
        mv barcode03.fastq.gz HD46_3.fastq.gz
        mv barcode04.fastq.gz HD46_4.fastq.gz
    
      # -- CALCULATE the coverages
        #!/bin/bash
    
        for bam in barcode*_minimap2.sorted.bam; do
            echo "Processing $bam ..."
            avg_cov=$(samtools depth -a "$bam" | awk '{sum+=$3; cnt++} END {if (cnt>0) print sum/cnt; else print 0}')
            echo -e "${bam}\t${avg_cov}" >> coverage_summary.txt
        done
    
      # ---- !!!! LOGIN the suitable environment !!!! ----
    
      mamba activate transposon_long
    
      # -- TODO: AFTERNOON_DEBUG_THIS: FAILED and not_USED: Alignment and Detect structural variants in each sample using SVIM which used aligner ngmlr or mimimap2
      #mamba install -c bioconda ngmlr
      mamba install -c bioconda svim
    
      # ---- Option_1: minimap2 (aligner) + SVIM (structural variant caller) --> SUCCESSFUL ----
      for sample in HD46_Ctrl HD46_1 HD46_2 HD46_3 HD46_4 HD46_5 HD46_6 HD46_7 HD46_8 HD46_13; do
          #INS,INV,DUP:TANDEM,DUP:INT,BND
          svim reads --aligner minimap2 --nanopore minimap2+svim_${sample}    ${sample}.fastq.gz CP020463.fasta  --cores 20 --types INS --min_sv_size 100 --sequence_allele --insertion_sequences --read_names;
      done
    
      #svim alignment svim_alignment_minmap2_1_re 1.sorted.bam CP020463_.fasta --types INS --sequence_alleles --insertion_sequences --read_names
    
      # ---- Option_2: minamap2 (aligner) + Sniffles2 (structural variant caller) --> SUCCESSFUL ----
      #Minimap2: A commonly used aligner for nanopore sequencing data.
      #    Align Long Reads to the WT Reference using Minimap2
      #sniffles -m WT.sorted.bam -v WT.vcf -s 10 -l 50 -t 60
      #  -s 20: Requires at least 20 reads to support an SV for reporting. --> 10
      #  -l 50: Reports SVs that are at least 50 base pairs long.
      #  -t 60: Uses 60 threads for faster processing.
      for sample in HD46_Ctrl HD46_1 HD46_2 HD46_3 HD46_4 HD46_5 HD46_6 HD46_7 HD46_8 HD46_13; do
          #minimap2 --MD -t 60 -ax map-ont CP020463.fasta ./batch1_depth25/trycycler_${sample}/reads.fastq | samtools sort -o ${sample}.sorted.bam
          minimap2 --MD -t 60 -ax map-ont CP020463.fasta ${sample}.fastq.gz | samtools sort -o ${sample}_minimap2.sorted.bam
          samtools index ${sample}_minimap2.sorted.bam
          sniffles -m ${sample}_minimap2.sorted.bam -v ${sample}_minimap2+sniffles.vcf -s 10 -l 50 -t 60
          #QUAL < 20 ||
          bcftools filter -e "INFO/SVTYPE != 'INS'" ${sample}_minimap2+sniffles.vcf > ${sample}_minimap2+sniffles_filtered.vcf
      done
    
        #Estimating parameter...
        #        Max dist between aln events: 44
        #        Max diff in window: 76
        #        Min score ratio: 2
        #        Avg DEL ratio: 0.0112045
        #        Avg INS ratio: 0.0364027
        #Start parsing... CP020463
        #                # Processed reads: 10000
        #                # Processed reads: 20000
        #        Finalizing  ..
        #Start genotype calling:
        #        Reopening Bam file for parsing coverage
        #        Finalizing  ..
        #Estimating parameter...
        #        Max dist between aln events: 28
        #        Max diff in window: 89
        #        Min score ratio: 2
        #        Avg DEL ratio: 0.013754
        #        Avg INS ratio: 0.17393
        #Start parsing... CP020463
        #                # Processed reads: 10000
        #                # Processed reads: 20000
        #                # Processed reads: 30000
        #                # Processed reads: 40000
    
        # Results:
        # * barcode01_minimap2+sniffles.vcf
        # * barcode01_minimap2+sniffles_filtered.vcf
        # * barcode02_minimap2+sniffles.vcf
        # * barcode02_minimap2+sniffles_filtered.vcf
        # * barcode03_minimap2+sniffles.vcf
        # * barcode03_minimap2+sniffles_filtered.vcf
        # * barcode04_minimap2+sniffles.vcf
        # * barcode04_minimap2+sniffles_filtered.vcf
    
      #ERROR: No MD string detected! Check bam file! Otherwise generate using e.g. samtools. --> No results!
      #for sample in barcode01 barcode02 barcode03 barcode04; do
      #    sniffles -m svim_reads_minimap2_${sample}/${sample}.fastq.minimap2.coordsorted.bam -v sniffles_minimap2_${sample}.vcf -s 10 -l 50 -t 60
      #    bcftools filter -e "INFO/SVTYPE != 'INS'" sniffles_minimap2_${sample}.vcf > sniffles_minimap2_${sample}_filtered.vcf
      #done
    
      # ---- Option_3: NGMLR (aligner) + SVIM (structural variant caller) --> SUCCESSFUL ----
      for sample in HD46_Ctrl HD46_1 HD46_2 HD46_3 HD46_4 HD46_5 HD46_6 HD46_7 HD46_8 HD46_13; do
          svim reads --aligner ngmlr --nanopore    ngmlr+svim_${sample}       ${sample}.fastq.gz CP020463.fasta  --cores 10;
      done
    
      # ---- Option_4: NGMLR (aligner) + sniffles (structural variant caller) --> SUCCESSFUL ----
      for sample in HD46_Ctrl HD46_1 HD46_2 HD46_3 HD46_4 HD46_5 HD46_6 HD46_7 HD46_8 HD46_13; do
          sniffles -m ngmlr+svim_${sample}/${sample}.fastq.ngmlr.coordsorted.bam -v ${sample}_ngmlr+sniffles.vcf -s 10 -l 50 -t 60
          bcftools filter -e "INFO/SVTYPE != 'INS'" ${sample}_ngmlr+sniffles.vcf > ${sample}_ngmlr+sniffles_filtered.vcf
      done
  6. Compare and integrate all results produced by minimap2+sniffles and ngmlr+sniffles, and check them each position in IGV!

        mv HD46_Ctrl_minimap2+sniffles_filtered.vcf HD46-Ctrl_minimap2+sniffles_filtered.vcf
        mv HD46_Ctrl_ngmlr+sniffles_filtered.vcf    HD46-Ctrl_ngmlr+sniffles_filtered.vcf
        mv HD46_1_minimap2+sniffles_filtered.vcf    HD46-1_minimap2+sniffles_filtered.vcf
        mv HD46_1_ngmlr+sniffles_filtered.vcf       HD46-1_ngmlr+sniffles_filtered.vcf
        mv HD46_2_minimap2+sniffles_filtered.vcf    HD46-2_minimap2+sniffles_filtered.vcf
        mv HD46_2_ngmlr+sniffles_filtered.vcf       HD46-2_ngmlr+sniffles_filtered.vcf
        mv HD46_3_minimap2+sniffles_filtered.vcf    HD46-3_minimap2+sniffles_filtered.vcf
        mv HD46_3_ngmlr+sniffles_filtered.vcf       HD46-3_ngmlr+sniffles_filtered.vcf
        mv HD46_4_minimap2+sniffles_filtered.vcf    HD46-4_minimap2+sniffles_filtered.vcf
        mv HD46_4_ngmlr+sniffles_filtered.vcf       HD46-4_ngmlr+sniffles_filtered.vcf
        mv HD46_5_minimap2+sniffles_filtered.vcf    HD46-5_minimap2+sniffles_filtered.vcf
        mv HD46_5_ngmlr+sniffles_filtered.vcf       HD46-5_ngmlr+sniffles_filtered.vcf
        mv HD46_6_minimap2+sniffles_filtered.vcf    HD46-6_minimap2+sniffles_filtered.vcf
        mv HD46_6_ngmlr+sniffles_filtered.vcf       HD46-6_ngmlr+sniffles_filtered.vcf
        mv HD46_7_minimap2+sniffles_filtered.vcf    HD46-7_minimap2+sniffles_filtered.vcf
        mv HD46_7_ngmlr+sniffles_filtered.vcf       HD46-7_ngmlr+sniffles_filtered.vcf
        mv HD46_8_minimap2+sniffles_filtered.vcf    HD46-8_minimap2+sniffles_filtered.vcf
        mv HD46_8_ngmlr+sniffles_filtered.vcf       HD46-8_ngmlr+sniffles_filtered.vcf
        mv HD46_13_minimap2+sniffles_filtered.vcf   HD46-13_minimap2+sniffles_filtered.vcf
        mv HD46_13_ngmlr+sniffles_filtered.vcf      HD46-13_ngmlr+sniffles_filtered.vcf
    
      conda activate plot-numpy1
      #python generate_common_vcf.py
      #mv common_variants.xlsx putative_transposons.xlsx
    
      # * Reads each of your VCFs.
      # * Filters variants → only keep those with FILTER == PASS.
      # * Compares the two aligner methods (minimap2+sniffles2 vs ngmlr+sniffles2) per sample.
      # * Keeps only variants that appear in both methods for the same sample.
      # * Outputs: An Excel file with the common variants and a log text file listing which variants were filtered out, and why (not_PASS or not_COMMON_in_two_VCF).
    
      #python generate_fuzzy_common_vcf_v1.py
      #Sample   PASS_minimap2   PASS_ngmlr  COMMON
      #  HD46-Ctrl_Ctrl 39  29  28
      #  HD46-1 39  32  29
      #  HD46-2 40  32  28
      #  HD46-3 38  30  27
      #  HD46-4 46  35  32
      #  HD46-5 40  35  31
      #  HD46-6 43  35  30
      #  HD46-7 40  33  28
      #  HD46-8 37  20  11
      #  HD46-13    39  38  27
    
    #Sample PASS_minimap2   PASS_ngmlr  COMMON_FINAL
    #HD46-Ctrl_Ctrl 39  29  6
    #HD46-1 39  32  8
    #HD46-2 40  32  8
    #HD46-3 38  30  6
    #HD46-4 46  35  8
    #HD46-5 40  35  9
    #HD46-6 43  35  10
    #HD46-7 40  33  8
    #HD46-8 37  20  4
    #HD46-13    39  38  5
    
    #!!!! Summarize the results of ngmlr+sniffles !!!!
    python merge_ngmlr+sniffles_filtered_results_and_summarize.py
    
    #!!!! Post-Processing !!!!
    #DELETE "2186168    N   

    . PASS” in Sheet HD46-13 and Summary #DELETE “2427785 N CGTCAGAATCGCTGTCTGCGTCCGAGTCACTGTCTGAGTCTGAATCACTATCTGCGTCTGAGTCACTGTCTG . PASS” due to “0/1:169:117” in HD46-13 and Summary #DELETE “2441640 N GCTCATTAAGAATCATTAAATTAC . PASS” due to 0/1:170:152 in HD46-13 and Summary

  7. Source code of merge_ngmlr+sniffles_filtered_results_and_summarize.py

    import os
    import pandas as pd
    
    # List all ngmlr VCF files
    vcf_files = sorted([f for f in os.listdir('.') if f.endswith('_ngmlr+sniffles_filtered.vcf')])
    
    log_lines = []
    df_list = []
    
    # Function to read VCF and filter PASS
    def read_vcf(vcf_path, log_lines, sample):
        variants = []
        with open(vcf_path) as f:
            for line in f:
                if line.startswith('#'):
                    continue
                parts = line.strip().split('\t')
                pos, ref, alt, qual, flt = parts[1], parts[3], parts[4], parts[5], parts[6]
                info = parts[7] if len(parts) > 7 else '.'
                fmt = parts[8] if len(parts) > 8 else '.'
                last_column = parts[9] if len(parts) > 9 else '.'  # Keep original column
                var_id = f"{pos}:{ref}>{alt}"
                if flt != 'PASS':
                    log_lines.append(f"{sample}\t{var_id}\tnot_PASS")
                    continue
                variants.append({
                    'POS': int(pos),
                    'REF': ref,
                    'ALT': alt,
                    'QUAL': qual,
                    'FILTER': flt,
                    'INFO': info,
                    'FORMAT': fmt,
                    sample: last_column,  # Use only genotype for individual sheets
                    f"{sample}_with_alt": alt  # Use only ALT sequence for summary
                })
        return pd.DataFrame(variants)
    
    # Read all VCFs
    sample_names = []
    sample_with_alt_columns = []
    for f in vcf_files:
        sample = os.path.basename(f).replace('_ngmlr+sniffles_filtered.vcf', '')
        sample_names.append(sample)
        sample_with_alt_columns.append(f"{sample}_with_alt")
        df = read_vcf(f, log_lines, sample)
        df_list.append(df)
    
    # Merge all variants into one DataFrame
    all_variants = pd.concat(df_list, ignore_index=True)
    
    # Fill missing sample columns with hyphen for summary
    for sample in sample_names:
        if sample not in all_variants.columns:
            all_variants[sample] = ''
        if f"{sample}_with_alt" not in all_variants.columns:
            all_variants[f"{sample}_with_alt"] = '-'
    
    # Pivot table for summary using exact POS, using columns with ALT
    summary_df = all_variants.groupby(['POS', 'REF'])[sample_with_alt_columns].first().reset_index()
    
    # Rename columns in summary to remove '_with_alt' suffix
    summary_df.columns = ['POS', 'REF'] + sample_names
    
    # Replace NaN or empty strings with hyphen in sample columns
    summary_df[sample_names] = summary_df[sample_names].fillna('-').replace('', '-')
    
    # Count how many samples have a non-hyphen value
    summary_df['Count'] = summary_df[sample_names].apply(lambda row: sum(val != '-' for val in row), axis=1)
    
    # Save Excel with individual sheets and summary
    writer = pd.ExcelWriter('merged_ngmlr+sniffles_variants.xlsx', engine='openpyxl')
    
    # Save individual sample sheets
    for df in df_list:
        sheet_name = df.columns[-2]  # Use sample name (not the _with_alt column)
        # Select only relevant columns for individual sheets (exclude _with_alt)
        df = df[[col for col in df.columns if not col.endswith('_with_alt')]]
        df.to_excel(writer, sheet_name=sheet_name, index=False)
    
    # Save summary sheet
    summary_df.to_excel(writer, sheet_name='Summary', index=False)
    writer.close()
    
    # Write log
    with open('ngmlr_filtering_log.txt', 'w') as logf:
        logf.write('Sample\tVariant\tReason\n')
        for line in log_lines:
            logf.write(line + '\n')
    
    print('Done: merged_ngmlr+sniffles_variants.xlsx with ALT sequences or hyphens in summary and genotype-only in individual sheets.')
  8. ———————– END of the transposon calculation. The following steps (e.g. assembly based on the long-reads is not necessary for the transposon analysis) —————————-

  9. (NOT_USED) Filtering low-complexity insertions using RepeatMasker (TODO: how to use RepeatModeler to generate own lib?)

      python vcf_to_fasta.py variants.vcf variants.fasta
      #python filter_low_complexity.py variants.fasta filtered_variants.fasta retained_variants.fasta
      #Using RepeatMasker to filter the low-complexity fasta, the used h5 lib is
      /home/jhuang/mambaforge/envs/transposon_long/share/RepeatMasker/Libraries/Dfam.h5    #1.9G
      python /home/jhuang/mambaforge/envs/transposon_long/share/RepeatMasker/famdb.py -i /home/jhuang/mambaforge/envs/transposon_long/share/RepeatMasker/Libraries/Dfam.h5 names 'bacteria' | head
      Exact Matches
      =============
      2 bacteria (blast name), Bacteria 
    (scientific name), eubacteria (genbank common name), Monera (in-part), Procaryotae (in-part), Prokaryota (in-part), Prokaryotae (in-part), prokaryote (in-part), prokaryotes (in-part) Non-exact Matches ================= 1783272 Terrabacteria group (scientific name) 91061 Bacilli (scientific name), Bacilli Ludwig et al. 2010 (authority), Bacillus/Lactobacillus/Streptococcus group (synonym), Firmibacteria (synonym), Firmibacteria Murray 1988 (authority) 1239 Bacillaeota (synonym), Bacillaeota Oren et al. 2015 (authority), Bacillota (synonym), Bacillus/Clostridium group (synonym), clostridial firmicutes (synonym), Clostridium group firmicutes (synonym), Firmacutes (synonym), firmicutes (blast name), Firmicutes (scientific name), Firmicutes corrig. Gibbons and Murray 1978 (authority), Low G+C firmicutes (synonym), low G+C Gram-positive bacteria (common name), low GC Gram+ (common name) Summary of Classes within Firmicutes: * Bacilli (includes many common pathogenic and non-pathogenic Gram-positive bacteria, taxid=91061) * Bacillus (e.g., Bacillus subtilis, Bacillus anthracis) * Staphylococcus (e.g., Staphylococcus aureus, Staphylococcus epidermidis) * Streptococcus (e.g., Streptococcus pneumoniae, Streptococcus pyogenes) * Listeria (e.g., Listeria monocytogenes) * Clostridia (includes many anaerobic species like Clostridium and Clostridioides) * Erysipelotrichia (intestinal bacteria, some pathogenic) * Tissierellia (less-studied, veterinary relevance) * Mollicutes (cell wall-less, includes Mycoplasma species) * Negativicutes (includes some Gram-negative, anaerobic species) RepeatMasker -species Bacilli -pa 4 -xsmall variants.fasta python extract_unmasked_seq.py variants.fasta.masked unmasked_variants.fasta #bcftools filter -i ‘QUAL>30 && INFO/SVLEN>100’ variants.vcf -o filtered.vcf # #bcftools view -i ‘SVTYPE=”INS”‘ variants.vcf | bcftools query -f ‘%CHROM\t%POS\t%REF\t%ALT\t%INFO\n’ > insertions.txt #mamba install -c bioconda vcf2fasta #vcf2fasta variants.vcf -o insertions.fasta #grep “SEQS” variants.vcf | awk ‘{ print $1, $2, $4, $5, $8 }’ > insertions.txt #python3 filtering_low_complexity.py # #vcftools –vcf input.vcf –recode –out filtered_output –minSVLEN 100 #bcftools filter -e ‘INFO/SEQS ~ “^(G+|C+|T+|A+){4,}”‘ variants.vcf -o filtered.vcf # — calculate the percentage of reads To calculate the percentage of reads that contain the insertion from the VCF entry, use the INFO and FORMAT fields provided in the VCF record. Step 1: Extract Relevant Information In the provided VCF entry: RE (Reads Evidence): 733 – the total number of reads supporting the insertion. GT (Genotype): 1/1 – this indicates a homozygous insertion, meaning all reads covering this region are expected to have the insertion. AF (Allele Frequency): 1 – a 100% allele frequency, indicating that every read in this sample supports the insertion. DR (Depth Reference): 0 – the number of reads supporting the reference allele. DV (Depth Variant): 733 – the number of reads supporting the variant allele (insertion). Step 2: Calculate Percentage of Reads Supporting the Insertion Using the formula: Percentage of reads with insertion=(DVDR+DV)×100 Percentage of reads with insertion=(DR+DVDV​)×100 Substitute the values: Percentage=(7330+733)×100=100% Percentage=(0+733733​)×100=100% Conclusion Based on the VCF record, 100% of the reads support the insertion, indicating that the insertion is fully present in the sample (homozygous insertion). This is consistent with the AF=1 and GT=1/1 fields. * In your VCF file generated by Sniffles, the REF=N in the results has a specific meaning: * In a standard VCF, the REF field usually contains the reference base(s) at the variant position. * For structural variants (SVs), especially insertions, there is no reference sequence replaced; the insertion occurs between reference bases. * Therefore, Sniffles uses N as a placeholder in the REF field to indicate “no reference base replaced”. * The actual inserted sequence is then stored in the ALT field.
  10. Why some records have UNRESOLVED in the FILTER field in the Excel output.

    1. Understanding the format
    
        The data appears to be structural variant (SV) calls from Sniffles, probably in a VCF-like tabular format exported to Excel:
    
            * gi|1176884116|gb|CP020463.1| → reference sequence
            * Positions: 1855752 and 2422820
            * N → insertion event
            * SVLEN=999 → size of the insertion
            * AF → allele frequency
            * GT:DR:DV → genotype, depth reference, depth variant (1/1:0:678, example values for a PASS variant)
            * FILTER → whether the variant passed filters (UNRESOLVED means it didn’t pass)
    
    2. What UNRESOLVED usually means
    
        In Sniffles:
    
        * UNRESOLVED is assigned to SVs when the tool cannot confidently resolve the exact sequence or breakpoint.
        * Reasons include:
            - Low read support (RE, DV) relative to the expected coverage
            - Ambiguous alignment at repetitive regions
            - Conflicting strand or orientation signals
            - Allele frequency inconsistent with expectations
    
    3. Examine your two records
    
        First record
    
            POS: 1855752
            SVTYPE: INS
            SVLEN: 999
            RE: 68
            AF: 1
            GT: 1/1
            FILTER: UNRESOLVED
    
        Observations:
    
        * AF = 1 → allele frequency 100%, homozygous insertion
        * RE = 68 → 68 reads support the variant, decent coverage
        * Still UNRESOLVED → likely because Sniffles could not resolve the inserted sequence precisely; sometimes long insertions in repetitive regions are hard to reconstruct fully even with good read support.
    
        Second record
    
            POS: 2422820
            SVTYPE: INS
            SVLEN: 999
            RE: 22
            AF: 0.025522
            GT: 0/0
            FILTER: UNRESOLVED
    
        Observations:
    
        * AF = 0.0255 → very low allele frequency (~2.5%)
        * RE = 22, DR = 840 → very low variant reads vs reference
        * GT = 0/0 → homozygous reference
        * Sniffles marks it UNRESOLVED because the variant is essentially noise, not confidently detected.
    
    4. Key difference between the two
        Feature First record    Second record
        Allele frequency (AF)   1 (high)    0.0255 (very low)
        Variant reads (RE)  68  22
        Genotype (GT)   1/1 0/0
        Reason for UNRESOLVED   Unresolvable inserted sequence
    
    ✅ 5. Conclusion
    
        * Sniffles marks a variant as UNRESOLVED when the SV cannot be confidently characterized.
        * Even if there is good read support (first record), complex insertions can’t always be reconstructed fully.
        * Very low allele frequency (second record) also triggers UNRESOLVED because the signal is too weak compared to background noise.
        * Essentially: “UNRESOLVED” ≠ bad data, it’s just unresolved uncertainty.
  11. (NOT_SURE_HOW_TO_USE) Polishing of assembly: Use tools like Medaka to refine variant calls by leveraging consensus sequences derived from nanopore data.

      mamba install -c bioconda medaka
      medaka-consensus -i aligned_reads.bam -r reference.fasta -o polished_output -t 4
  12. Compare Insertions Across Samples

    Merge Variants Across Samples: Use SURVIVOR to merge and compare the detected insertions in all samples against the WT:
    
    SURVIVOR merge input_vcfs.txt 1000 1 1 1 0 30 merged.vcf
    
        Input: List of VCF files from Sniffles2.
        Output: A consolidated VCF file with shared and unique variants.
    
    Filter WT Insertions:
    
        Identify transposons present only in samples 1–9 by subtracting WT variants using bcftools:
    
            bcftools isec WT.vcf merged.vcf -p comparison_results
  13. Validate and Visualize

    Visualize with IGV: Use IGV to inspect insertion sites in the alignment and confirm quality.
    
    igv.sh
    
    Validate Findings:
        Perform PCR or additional sequencing for key transposon insertion sites to confirm results.
  14. Alternatives to TEPID for Long-Read Data

    If you’re looking for transposon-specific tools for long reads:
    
        REPET: A robust transposon annotation tool compatible with assembled genomes.
        EDTA (Extensive de novo TE Annotator):
            A pipeline to identify, classify, and annotate transposons.
            Works directly on your assembled genomes.
    
            perl EDTA.pl --genome WT.fasta --type all
  15. The WT.vcf file in the pipeline is generated by detecting structural variants (SVs) in the wild-type (WT) genome aligned against itself or using it as a baseline reference. Here’s how you can generate the WT.vcf:

    Steps to Generate WT.vcf
    1. Align WT Reads to the WT Reference Genome
    
    The goal here is to create an alignment of the WT sequencing data to the WT reference genome to detect any self-contained structural variations, such as native insertions, deletions, or duplications.
    
    Command using Minimap2:
    
    minimap2 -ax map-ont WT.fasta WT_reads.fastq | samtools sort -o WT.sorted.bam
    
    Index the BAM file:
    
    samtools index WT.sorted.bam
    
    2. Detect Structural Variants with Sniffles2
    
    Run Sniffles2 on the WT alignment to call structural variants:
    
    sniffles --input WT.sorted.bam --vcf WT.vcf
    
    This step identifies:
    
        Native transposons and insertions present in the WT genome.
        Other structural variants that are part of the reference genome or sequencing artifacts.
    
    Key parameters to consider:
    
        --min_support: Adjust based on your WT sequencing coverage.
        --max_distance: Define proximity for merging variants.
        --min_length: Set a minimum SV size (e.g., >50 bp for transposons).
  16. Clean and Filter the WT.vcf, Variant Filtering: Remove low-confidence variants based on read depth, quality scores, or allele frequency.

    To ensure the WT.vcf only includes relevant transposons or SVs:
    
        Use bcftools or similar tools to filter out low-confidence variants:
    
        bcftools filter -e "QUAL < 20 || INFO/SVTYPE != 'INS'" WT.vcf > WT_filtered.vcf
        bcftools filter -e "QUAL < 1 || INFO/SVTYPE != 'INS'" 1_.vcf > 1_filtered_.vcf
  17. NOTE that in this pipeline, the WT.fasta (reference genome) is typically a high-quality genome sequence from a database or a well-annotated version of your species’ genome. It is not assembled from the WT.fastq sequencing reads in this context. Here’s why:

    Why Use a Reference Genome (WT.fasta) from a Database?
    
        Higher Quality and Completeness:
            Database references (e.g., NCBI, Ensembl) are typically well-assembled, highly polished, and annotated. They serve as a reliable baseline for variant detection.
    
        Consistency:
            Using a standard reference ensures consistent comparisons across your WT and samples (1–9). Variants detected will be relative to this reference, not influenced by possible assembly errors.
    
        Saves Time:
            Assembling a reference genome from WT reads requires significant computational effort. Using an existing reference streamlines the analysis.
    
    Alternative: Assembling WT from FASTQ
    
    If you don’t have a high-quality reference genome (WT.fasta) and must rely on your WT FASTQ reads:
    
        Assemble the genome from your WT.fastq:
            Use long-read assemblers like Flye, Canu, or Shasta to create a draft genome.
    
        flye --nano-raw WT.fastq --out-dir WT_assembly --genome-size 
    Polish the assembly using tools like Racon (with the same reads) or Medaka for higher accuracy. Use the assembled and polished genome as your WT.fasta reference for further steps. Key Takeaways: If you have access to a reliable, high-quality reference genome, use it as the WT.fasta. Only assemble WT.fasta from raw reads (WT.fastq) if no database reference is available for your organism.
  18. Annotate Transposable Elements: Tools like ANNOVAR or SnpEff provide functional insights into the detected variants.

    # -- (successful!) MANUALLY Search for all found insertion sequences at https://tncentral.ncc.unesp.br/blast/ !
    # Or use the program available at https://github.com/danillo-alvarenga/tncomp_finder if there are numerous matches.
    #https://tncentral.ncc.unesp.br/report/te/Tn551-Y13600.1
    
    # -- (failed!) try TEPID for annotation
    mamba install tepid=0.10 -c bioconda
    #(tepid_env)
    for sample in WT 1 2 3 4 5 7 8 9 10; do
        tepid-map-se -x CP020463 -p 10 -n ${sample}_tepid -q  ../batch1_depth25/trycycler_${sample}/reads.fastq;
        tepid-discover -k -i -p 10 -n ${sample}_tepid -c ${sample}_tepid --se;
    done
    
    tepid-discover -k -i -p 10 -n 1_tepid -c 1.sorted.bam --se;
    
    tepid-refine [-h] [--version] [-k] [-i INSERTIONS] [-d DELETIONS]
                [-p PROC] -t TE -n NAME -c CONC -s SPLIT -a AL
    
    # -- (failed!) try EDTA for annotation
    https://github.com/oushujun/EDTA
    (transposon_long) mamba install -c conda-forge -c bioconda edta
    mamba install bioconda::rmblast  # cd RepeatMasker; ./configure
    EDTA.pl --genome CP020463.fasta --species others --threads 40
    
    For general-purpose TE annotation: EDTA, RepeatMasker, or RepeatScout are your best options.
    For de novo repeat identification: RepeatScout is highly effective.
    For LTR retrotransposons: Use LTR_retriever.
    For bacterial-specific annotations: Transposome, TEfinder, and ISfinder can be useful.
  19. Validation: Cross-validate with short-read sequencing data if available.

  20. (Optional) Assembly the nanopore-sequencing using

    1. merge and prepare fastq.gz
    
    2. calcuclate the precentage of coding DNA: https://www.ncbi.nlm.nih.gov/nuccore/CP020463 with additional plasmid.
    
        * Average gene length: 870.4 bp
        * Total coding region length: 2056168 bp
        * Percentage of coding DNA = (Total Coding Region Length / Total Genome Length) × 100 = 2056168 bp / 2454929 bp = 83.8%
    
    3. Prepare the fastq.gz
    
        conda activate /home/jhuang/miniconda3/envs/trycycler  # under jhuang@hamm (10.169.63.113).
    
        rsync -a -P *.fastq.gz jhuang@10.169.63.113:/home/jhuang/DATA/Data_Patricia_Transposon_2025/
        for sample in barcode01 barcode02 barcode03 barcode04  HD46_Ctrl HD46_5 HD46_6 HD46_7 HD46_8 HD46_13; do
            mkdir trycycler_${sample};
            mv ${sample}.fastq.gz trycycler_${sample}/reads.fastq.gz;
            gunzip trycycler_${sample}/reads.fastq.gz;
        done
        #mkdir batch3;
        #cd batch3;
        #for sample in WT 1 2 3 4 5 7 8 9 10; do
        #    mkdir trycycler_${sample};
        #    cp ${sample}_nanopore/${sample}.fastq.gz trycycler_${sample}/reads.fastq.gz;
        #    gunzip trycycler_${sample}/reads.fastq.gz;
        #done
    
    4. Running separate assemblies (6x4 times: canu, flye, miniasm_and_minipolish, necat, nextdenovo, raven) using trycycler_assembly_extra-thorough.sh (under the trycycler environment running the following steps)
            # batch1: min_read_cov=25; batch2: min_read_cov=50; batch3: min_read_cov=100 (if necessary on the monday!)
            cp ../Data_PaulBongarts_S.epidermidis_HDRNA/trycycler_*.sh ./
            #for sample in trycycler_HDRNA_10 trycycler_HDRNA_13; do
            for sample in barcode01 barcode02 barcode03 barcode04  HD46_Ctrl HD46_5 HD46_6 HD46_7 HD46_8 HD46_13; do
                cd trycycler_${sample};
                ../trycycler_assembly_extra-thorough.sh
                cd ..;
            done
    
    # END!
    #TODO: upload the nanopore-sequencing data to NCBI for the HDRNA_10 and HDRNA_13 to the correspoinding project.
    
    (bengal3_ac3) jhuang@WS-2290C:~/DATA/Data_PaulBongarts_S.epidermidis_HDRNA/trycycler_HDRNA_10/assemblies$ mlst assembly_02.fasta
    assembly_02.fasta       sepidermidis    87      arcC(7) aroE(1) gtr(1)  mutS(2) pyrR(2) tpiA(1) yqiL(1)
    [15:32:46] If you like MLST, you're absolutely going to love wgMLST!
    [15:32:46] Done.
    (bengal3_ac3) jhuang@WS-2290C:~/DATA/Data_PaulBongarts_S.epidermidis_HDRNA/trycycler_HDRNA_10/assemblies$ mlst assembly_20.fasta
    
    assembly_20.fasta       sepidermidis    87      arcC(7) aroE(1) gtr(1)  mutS(2) pyrR(2) tpiA(1) yqiL(1)
    [15:33:01] You can use --label XXX to replace an ugly filename in the output.
    [15:33:01] Done.
    
    (bengal3_ac3) jhuang@WS-2290C:/mnt/md1/DATA/Data_PaulBongarts_S.epidermidis_HDRNA/trycycler_HDRNA_13/assemblies$ mlst assembly_02.fasta
    assembly_02.fasta       sepidermidis    5       arcC(1) aroE(1) gtr(1)  mutS(2) pyrR(2) tpiA(1) yqiL(1)
    
    (bengal3_ac3) jhuang@WS-2290C:/mnt/md1/DATA/Data_PaulBongarts_S.epidermidis_HDRNA/trycycler_HDRNA_13/assemblies$ mlst assembly_08.fasta
    
    assembly_08.fasta       sepidermidis    5       arcC(1) aroE(1) gtr(1)  mutS(2) pyrR(2) tpiA(1) yqiL(1)
    [15:38:22] Remember that --minscore is only used when using automatic scheme detection.
    [15:38:22] Done.
    (bengal3_ac3) jhuang@WS-2290C:/mnt/md1/DATA/Data_PaulBongarts_S.epidermidis_HDRNA/trycycler_HDRNA_13/assemblies$ mlst assembly_14.fasta
    assembly_14.fasta       sepidermidis    5       arcC(1) aroE(1) gtr(1)  mutS(2) pyrR(2) tpiA(1) yqiL(1)
    [15:38:50] Thanks for using mlst, I hope you found it useful.
    [15:38:50] Done.
    (bengal3_ac3) jhuang@WS-2290C:/mnt/md1/DATA/Data_PaulBongarts_S.epidermidis_HDRNA/trycycler_HDRNA_13/assemblies$ mlst assembly_20.fasta
    assembly_20.fasta       sepidermidis    5       arcC(1) aroE(1) gtr(1)  mutS(2) pyrR(2) tpiA(1) yqiL(1)
    [15:38:54] If you like MLST, you're going to absolutely love cgMLST!
    [15:38:54] Done.
    
            #- under the directory batch3
            #for sample in trycycler_WT trycycler_1 trycycler_2 trycycler_3 trycycler_4 trycycler_5 trycycler_7 trycycler_8 trycycler_9 trycycler_10; do
            #    cd ${sample};
            #    ../trycycler_assembly_extra-thorough_threads5_cov100.sh;
            #    cd ..;
            #done
    
            #if ERROR, running separate assembly-methods raven and canu
            #for sample in trycycler_WT trycycler_1 trycycler_2 trycycler_3 trycycler_4 trycycler_5 trycycler_7 trycycler_8 trycycler_9 trycycler_10; do
            #    cd ${sample};
            #    ../trycycler_assembly_extra-thorough_raven.sh;
            #    cd ..;
            #done
            #
            #for sample in trycycler_WT trycycler_1 trycycler_2 trycycler_3 trycycler_4 trycycler_5 trycycler_7 trycycler_8 trycycler_9 trycycler_10; do
            #    cd ${sample};
            #    ../trycycler_assembly_extra-thorough_canu.sh;
            #    cd ..;
            #done
    
    5. trycycler cluster
    
            for sample in trycycler_5 trycycler_7 trycycler_8 trycycler_9 trycycler_10; do
            for sample in trycycler_WT trycycler_1 trycycler_2 trycycler_3 trycycler_4; do
                cd ${sample};
                rm -rf trycycler
                trycycler cluster --threads 10 --assemblies assemblies/*.fasta --reads reads.fastq --out_dir trycycler;
                cd ..;
            done
    
    6. trycycler reconcile
    
            cd trycycler_WT
            trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_001
            mkdir trycycler/cluster_001/1_contigs_removed
            mv trycycler/cluster_001/1_contigs/C_utg000001c.fasta trycycler/cluster_001/1_contigs_removed
            mv trycycler/cluster_001/1_contigs/K_ctg000000.fasta trycycler/cluster_001/1_contigs_removed
            mv trycycler/cluster_001/1_contigs/S_tig00000001.fasta trycycler/cluster_001/1_contigs_removed
    
            trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_002
                # -- under batch3 --
                176K Nov  5 12:38 P_bctg00000001.fasta
                122K Nov  5 12:38 D_bctg00000001.fasta
                95K Nov  5 12:38 J_bctg00000001.fasta
                78K Nov  5 12:38 J_bctg00000002.fasta
                #--> Error: unable to find a suitable common sequence
    
                69K Nov  5 17:24 G_tig00000004.fasta
                66K Nov  5 17:24 S_tig00000004.fasta
                66K Nov  5 17:24 M_tig00000004.fasta
                65K Nov  5 17:24 A_tig00000005.fasta
                #--> Error: unable to find a suitable common sequence
                #If repeat M_tig00000004.fasta twice, resulting in the following error: Circularising M_tig00000004:
                #  using S_tig00000004:
                #    unable to circularise: M_tig00000004's start and end were found in multiple places in S_tig00000004
                #Error: failed to circularise sequence M_tig00000004 because its start/end sequences were found in multiple ambiguous places in other sequences. #This is likely because M_tig00000004 starts/ends in a repetitive region. You can either manually repair its circularisation (and ensure it does #not start/end in a repetitive region) or exclude the sequence altogether and try again.
    
                47K Nov  5 17:24 V_bctg00000001.fasta
                45K Nov  5 17:24 C_utg000002c.fasta
                #Error: unable to find a suitable common sequence
    
                34K Nov  5 17:24 U_utg000004c.fasta
                34K Nov  5 17:24 N_contig_2.fasta
                34K Nov  5 17:24 T_contig_2.fasta
                #Error: unable to find a suitable common sequence
    
                23K Nov  5 17:24 I_utg000003c.fasta
                22K Nov  5 17:24 B_contig_2.fasta
                22K Nov  5 17:24 H_contig_2.fasta
                #Error: unable to find a suitable common sequence
    
                13K Nov  5 17:24 G_tig00000006.fasta
                12K Nov  5 17:24 M_tig00000003.fasta
                12K Nov  5 17:24 O_utg000002c.fasta
                12K Nov  5 17:24 S_tig00000003.fasta
                11K Nov  5 17:24 A_tig00000004.fasta
                #trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_002 --max_length_diff 1.2 --max_trim_seq_percent 20.0
                #--> SUCCESSFULLY saving sequences to file: trycycler/cluster_002/2_all_seqs.fasta
    
            trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_003
            trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_004 --max_length_diff 1.3 --max_trim_seq_percent 30.0
            trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_005 #Error: two or more input contigs are required
            trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_006 #Error: two or more input contigs are required
            trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_007 #Error: two or more input contigs are required
    
            cd ../trycycler_1
            trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_001
            mkdir trycycler/cluster_001/1_contigs_removed
            mv trycycler/cluster_001/1_contigs/I_utg000001l.fasta trycycler/cluster_001/1_contigs_removed
            mv trycycler/cluster_001/1_contigs/K_ctg000000.fasta trycycler/cluster_001/1_contigs_removed
            mv trycycler/cluster_001/1_contigs/S_tig00000001.fasta trycycler/cluster_001/1_contigs_removed
            mv trycycler/cluster_001/1_contigs/W_ctg000000.fasta trycycler/cluster_001/1_contigs_removed
            mv trycycler/cluster_001/1_contigs/A_tig00000001.fasta trycycler/cluster_001/1_contigs_removed
            mv trycycler/cluster_001/1_contigs/U_utg000001c.fasta trycycler/cluster_001/1_contigs_removed
    
            trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_002
            #TODO_TOMORROW_HERE!
            trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_003 --max_length_diff 1.7 --max_trim_seq_percent 70.0
            #Error: failed to circularise sequence I_utg000002l because its end could not be found in other sequences.
            (NOTE) using the plasmid from other isolate to help the cut of the sequence.
                #cp ~/DATA/Data_Patricia_Transposon/batch3/trycycler_WT/trycycler/cluster_004/7_final_consensus.fasta .
                #cat *.fasta > all.fasta
                #mafft --clustalout --adjustdirection all.fasta > all.aln
            cluster_004_con -------------------------------------------------------tatga
            I_utg000002l    agaatcagaattaggcgcataatttacaggaggtttgattatggctaagaaaaaatatga
            O_utg000002l    agaatcagaattaggcgcataatttacaggaggtttgattatggctaagaaaaaatatga
                                                                                *****
            cluster_004_con gcagaatcagaattaggcgcataatttacaggaggtttgattatggctaagaaaaaa---
            I_utg000002l    gcagaatcagaattaggcgcataatttacaggaggtttgattatggctaagaaaaatatg
            O_utg000002l    gcagaatcagaattaggcgcataatttacaggaggtttgattatggctaagaaaaatatg
                            ********************************************************
    
            grep "I_utg000002l" all.a_ > I_utg000002l.fasta
            grep "O_utg000002l" all.a_ > O_utg000002l.fasta
            cut -d' ' -f5 I_utg000002l.fasta > I_utg000002l_.fasta
            cut -d' ' -f5 O_utg000002l.fasta > O_utg000002l_.fasta
            sed 's/-//g' I_utg000002l_.fasta > I_utg000002l__.fasta
            sed 's/-//g' O_utg000002l_.fasta > O_utg000002l__.fasta
            seqtk seq I_utg000002l__.fasta -l 80 > I_utg000002l___.fasta
            seqtk seq O_utg000002l__.fasta -l 80 > O_utg000002l___.fasta
    
            #Cut two files Manually with the sequence: AGAATCAGAATTAGGCGCATAATTTACAGGAGGTTTGATTATGGCTAAGAAAAA
            cat I_utg000002l___.fasta O_utg000002l___.fasta > 2_all_seqs.fasta
            trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_004 #Error: two or more input contigs are required
    
            cd ../trycycler_2
            trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_001
            mkdir trycycler/cluster_001/1_contigs_removed
            mv trycycler/cluster_001/1_contigs/O_bctg00000000.fasta trycycler/cluster_001/1_contigs_removed
            trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_002
            trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_003
            trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_004
    
            cd ../trycycler_3
            trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_001
            mkdir trycycler/cluster_001/1_contigs_removed
            mv trycycler/cluster_001/1_contigs/E_ctg000000.fasta trycycler/cluster_001/1_contigs_removed
            trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_002
            trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_003
            trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_004
    
            cd ../trycycler_4
            trycycler reconcile --threads 5 --reads reads.fastq --cluster_dir trycycler/cluster_001
            mkdir trycycler/cluster_001/1_contigs_removed
            mv trycycler/cluster_001/1_contigs/C_utg000001l.fasta trycycler/cluster_001/1_contigs_removed
            mv trycycler/cluster_001/1_contigs/E_ctg000000.fasta trycycler/cluster_001/1_contigs_removed
            mv trycycler/cluster_001/1_contigs/F_Utg670.fasta trycycler/cluster_001/1_contigs_removed
            mv trycycler/cluster_001/1_contigs/V_bctg00000000.fasta trycycler/cluster_001/1_contigs_removed
            trycycler reconcile --threads 5 --reads reads.fastq --cluster_dir trycycler/cluster_002
            trycycler reconcile --threads 5 --reads reads.fastq --cluster_dir trycycler/cluster_003
            trycycler reconcile --threads 5 --reads reads.fastq --cluster_dir trycycler/cluster_004
    
            cd ../trycycler_5
            trycycler reconcile --threads 5 --reads reads.fastq --cluster_dir trycycler/cluster_001
            trycycler reconcile --threads 5 --reads reads.fastq --cluster_dir trycycler/cluster_002
            trycycler reconcile --threads 5 --reads reads.fastq --cluster_dir trycycler/cluster_003
            trycycler reconcile --threads 5 --reads reads.fastq --cluster_dir trycycler/cluster_004
    
            cd ../trycycler_7
            trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_001
            mkdir trycycler/cluster_001/1_contigs_removed
            mv trycycler/cluster_001/1_contigs/C_utg000001l.fasta trycycler/cluster_001/1_contigs_removed
            mv trycycler/cluster_001/1_contigs/D_bctg00000000.fasta trycycler/cluster_001/1_contigs_removed
            trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_002
            trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_003
            trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_004
            trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_005
    
            cd ../trycycler_8
            trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_001
            mkdir trycycler/cluster_001/1_contigs_removed
            mv trycycler/cluster_001/1_contigs/M_utg000001l.fasta trycycler/cluster_001/1_contigs_removed
            trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_002
            trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_003
            trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_004
            trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_005
    
            cd ../trycycler_9
            trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_001
            trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_002
            trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_003
            trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_004
            trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_005
            trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_006
    
            cd ../trycycler_10
            trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_001
            trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_002
            trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_003
            trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_004
            trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_005
    
            #Nachmachen: for sample in trycycler_9 trycycler_10; do
            cd ${sample};
            ../trycycler_assembly_extra-thorough_canu.sh;
            cd ..;
            done
    
    7. (optional) map the circular assemblies to plasmid databases
            #bzip2 -d plsdb.fna.bz2
            #makeblastdb -in plsdb.fna -dbtype nucl
            #blastn -db plsdb.fna -query all.fasta -evalue 1e-50 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 > all_on_plsdb.blastn
    
    8. trycycler msa
    
            trycycler msa --threads 55 --cluster_dir trycycler/cluster_001
            trycycler msa --threads 55 --cluster_dir trycycler/cluster_002
            trycycler msa --threads 55 --cluster_dir trycycler/cluster_003
            trycycler msa --threads 55 --cluster_dir trycycler/cluster_004
            trycycler msa --threads 55 --cluster_dir trycycler/cluster_005
            #--> When finished, Trycycler reconcile will make a 3_msa.fasta file in the cluster directory
    
    9. trycycler partition
    
            #generate 4_reads.fastq for each contig!
            trycycler partition --threads 55 --reads reads.fastq --cluster_dirs trycycler/cluster_*
            #trycycler partition --threads 55 --reads reads.fastq --cluster_dirs trycycler/cluster_001 trycycler/cluster_002 trycycler/cluster_003
            trycycler partition --threads 55 --reads reads.fastq --cluster_dirs trycycler/cluster_003
    
    10. trycycler consensus
    
            trycycler consensus --threads 55 --cluster_dir trycycler/cluster_001
            trycycler consensus --threads 55 --cluster_dir trycycler/cluster_002
            trycycler consensus --threads 55 --cluster_dir trycycler/cluster_003
            trycycler consensus --threads 55 --cluster_dir trycycler/cluster_004
            trycycler consensus --threads 55 --cluster_dir trycycler/cluster_005
            #!!NOTE that we take the isolates of HD05_2_K5 and HD05_2_K6 assembled by Unicycler instead of Trycycler!!
            # TODO (TODAY), generate the 3 datasets below!
            # TODO (IMPORTANT): write a Email to Holger, say the short sequencing of HD5_2 is not correct, since the 3 datasets! However, the MTxxxxxxx is confirmed not in K5 and K6!
            TODO: variant calling needs the short-sequencing, they are not dorable without the correct short-reads! resequencing? It is difficult to call variants only from long-reads since too much errors in long-reads!
            #TODO: check the MT sequence if in the isolates, more deteiled annotations come late!
            #II. Comparing the results of Trycycler with Unicycler.
            #III. Eventually add the plasmids assembled from unicycler to the final results. E.g. add the 4 plasmids to K5 and K6
    
    11. Polishing after Trycycler
    
            #1. Oxford Nanopore sequencer (Ignored due to the samtools version incompatibility!)
            # for c in trycycler/cluster_*; do
            #     medaka_consensus -i "$c"/4_reads.fastq -d "$c"/7_final_consensus.fasta -o "$c"/medaka -m r941_min_sup_g507 -t 12
            #     mv "$c"/medaka/consensus.fasta "$c"/8_medaka.fasta
            #     rm -r "$c"/medaka "$c"/*.fai "$c"/*.mmi  # clean up
            # done
            # cat trycycler/cluster_*/8_medaka.fasta > trycycler/consensus.fasta
            #2. Short-read polishing
            #---- 5179_R1 (2) ----
            #  mean read depth: 205.8x
            #  188 bp have a depth of zero (99.9924% coverage)
            #  355 positions changed (0.0144% of total positions)
            #  estimated pre-polishing sequence accuracy: 99.9856% (Q38.42)
            #Step 1: read QC
            fastp --in1 ../../s-epidermidis-5179-r1_R1.fastq.gz --in2 ../../s-epidermidis-5179-r1_R2.fastq.gz --out1 1.fastq.gz --out2 2.fastq.gz --unpaired1 u.fastq.gz --unpaired2 u.fastq.gz
            #Step 2: Polypolish
            for cluster in cluster_001 cluster_002; do
            bwa index ${cluster}/7_final_consensus.fasta
            bwa mem -t 55 -a ${cluster}/7_final_consensus.fasta 1.fastq.gz > ${cluster}/alignments_1.sam
            bwa mem -t 55 -a ${cluster}/7_final_consensus.fasta 2.fastq.gz > ${cluster}/alignments_2.sam
            polypolish polish ${cluster}/7_final_consensus.fasta ${cluster}/alignments_1.sam ${cluster}/alignments_2.sam > ${cluster}/polypolish.fasta
            done
            #Step 3: POLCA
            for cluster in cluster_001 cluster_002; do
            cd ${cluster}
            polca.sh -a polypolish.fasta -r "../../../s-epidermidis-5179-r1_R1.fastq.gz ../../../s-epidermidis-5179-r1_R2.fastq.gz" -t 55 -m 120G
            cd ..
            done
            #Substitution Errors: 37
            #Insertion/Deletion Errors: 2
            #Assembly Size: 2470001
            #Consensus Quality: 99.9984
            #Substitution Errors: 4
            #Insertion/Deletion Errors: 0
            #Assembly Size: 17748
            #Consensus Quality: 99.9775
            #Step 4: (optional) more rounds and/or other polishers
            #After one round of Polypolish and one round of POLCA, your assembly should be in very good shape!
            #However, there may still be a few lingering errors. You can try running additional rounds of Polypolish or POLCA to see if they make any more changes.
            for cluster in cluster_001 cluster_002; do
            cd ${cluster}
            polca.sh -a polypolish.fasta.PolcaCorrected.fa -r "../../../s-epidermidis-5179-r1_R1.fastq.gz ../../../s-epidermidis-5179-r1_R2.fastq.gz" -t 55 -m 120G
            cd ..
            done
            Substitution Errors: 13
            Insertion/Deletion Errors: 0
            Assembly Size: 2470004
            Consensus Quality: 99.9995
            Substitution Errors: 0
            Insertion/Deletion Errors: 0
            Assembly Size: 17748
            Consensus Quality: 100
            for cluster in cluster_001; do
            cd ${cluster}
            polca.sh -a polypolish.fasta.PolcaCorrected.fa.PolcaCorrected.fa -r "../../../s-epidermidis-5179-r1_R1.fastq.gz ../../../s-epidermidis-5179-r1_R2.fastq.gz" -t 55 -m 120G
            cd ..
            done
            #Substitution Errors: 0
            #Insertion/Deletion Errors: 0
            #Assembly Size: 2470004
            #Consensus Quality: 100
            #---- 1585 (4) ----
            #  mean read depth: 174.7x
            #  8,297 bp have a depth of zero (99.6604% coverage)
            #  271 positions changed (0.0111% of total positions)
            #  estimated pre-polishing sequence accuracy: 99.9889% (Q39.55)
            #Step 1: read QC
            fastp --in1 ../../s-epidermidis-1585_R1.fastq.gz --in2 ../../s-epidermidis-1585_R2.fastq.gz --out1 1.fastq.gz --out2 2.fastq.gz --unpaired1 u.fastq.gz --unpaired2 u.fastq.gz
            #Step 2: Polypolish
            for cluster in cluster_001 cluster_002 cluster_003 cluster_004; do
            bwa index ${cluster}/7_final_consensus.fasta
            bwa mem -t 55 -a ${cluster}/7_final_consensus.fasta 1.fastq.gz > ${cluster}/alignments_1.sam
            bwa mem -t 55 -a ${cluster}/7_final_consensus.fasta 2.fastq.gz > ${cluster}/alignments_2.sam
            polypolish polish ${cluster}/7_final_consensus.fasta ${cluster}/alignments_1.sam ${cluster}/alignments_2.sam > ${cluster}/polypolish.fasta
            done
            #Step 3: POLCA
            for cluster in cluster_001 cluster_002 cluster_003 cluster_004; do
            cd ${cluster}
            polca.sh -a polypolish.fasta -r "../../../s-epidermidis-1585_R1.fastq.gz ../../../s-epidermidis-1585_R2.fastq.gz" -t 55 -m 120G
            cd ..
            done
            #Substitution Errors: 7
            #Insertion/Deletion Errors: 4
            #Assembly Size: 2443174
            #Consensus Quality: 99.9995
            #Substitution Errors: 0
            #Insertion/Deletion Errors: 0
            #Assembly Size: 9014
            #Consensus Quality: 100
            #Substitution Errors: 0
            #Insertion/Deletion Errors: 0
            #Assembly Size: 9014
            #Consensus Quality: 100
            #Substitution Errors: 0
            #Insertion/Deletion Errors: 0
            #Assembly Size: 2344
            #Consensus Quality: 100
            #Step 4: (optional) more rounds and/or other polishers
            #After one round of Polypolish and one round of POLCA, your assembly should be in very good shape!
            #However, there may still be a few lingering errors. You can try running additional rounds of Polypolish or POLCA to see if they make any more changes.
            for cluster in cluster_001; do
            cd ${cluster}
            polca.sh -a polypolish.fasta.PolcaCorrected.fa -r "../../../s-epidermidis-1585_R1.fastq.gz ../../../s-epidermidis-1585_R2.fastq.gz" -t 55 -m 120G
            cd ..
            done
            #Substitution Errors: 0
            #Insertion/Deletion Errors: 0
            #Assembly Size: 2443176
            #Consensus Quality: 100
            #---- 1585 derived from unicycler, under 1585_normal/unicycler (4) ----
            #Step 0: copy chrom and plasmid1, plasmid2, plasmid3 to cluster_001/7_final_consensus.fasta, ...
            #Step 1: read QC
            fastp --in1 ../../s-epidermidis-1585_R1.fastq.gz --in2 ../../s-epidermidis-1585_R2.fastq.gz --out1 1.fastq.gz --out2 2.fastq.gz --unpaired1 u.fastq.gz --unpaired2 u.fastq.gz
            #Step 2: Polypolish
            for cluster in cluster_001 cluster_002 cluster_003 cluster_004; do
            bwa index ${cluster}/7_final_consensus.fasta
            bwa mem -t 55 -a ${cluster}/7_final_consensus.fasta 1.fastq.gz > ${cluster}/alignments_1.sam
            bwa mem -t 55 -a ${cluster}/7_final_consensus.fasta 2.fastq.gz > ${cluster}/alignments_2.sam
            polypolish polish ${cluster}/7_final_consensus.fasta ${cluster}/alignments_1.sam ${cluster}/alignments_2.sam > ${cluster}/polypolish.fasta
            done
            #Polishing 1 (2,443,574 bp):
            #mean read depth: 174.7x
            #8,298 bp have a depth of zero (99.6604% coverage)
            #52 positions changed (0.0021% of total positions)
            #estimated pre-polishing sequence accuracy: 99.9979% (Q46.72)
            #Polishing 2 (9,014 bp):
            #mean read depth: 766.5x
            #3 bp have a depth of zero (99.9667% coverage)
            #0 positions changed (0.0000% of total positions)
            #estimated pre-polishing sequence accuracy: 100.0000% (Q∞)
            #Polishing 7 (2,344 bp):
            #mean read depth: 2893.0x
            #4 bp have a depth of zero (99.8294% coverage)
            #0 positions changed (0.0000% of total positions)
            #estimated pre-polishing sequence accuracy: 100.0000% (Q∞)
            #Polishing 8 (2,255 bp):
            #mean read depth: 2719.6x
            #4 bp have a depth of zero (99.8226% coverage)
            #0 positions changed (0.0000% of total positions)
            #estimated pre-polishing sequence accuracy: 100.0000% (Q∞)
            #Step 3: POLCA
            for cluster in cluster_001 cluster_002 cluster_003 cluster_004; do
            cd ${cluster}
            polca.sh -a polypolish.fasta -r "../../../s-epidermidis-1585_R1.fastq.gz ../../../s-epidermidis-1585_R2.fastq.gz" -t 55 -m 120G
            cd ..
            done
            #Substitution Errors: 7
            #Insertion/Deletion Errors: 4
            #Assembly Size: 2443598
            #Consensus Quality: 99.9995
            #Substitution Errors: 0
            #Insertion/Deletion Errors: 0
            #Assembly Size: 9014
            #Consensus Quality: 100
            #Substitution Errors: 0
            #Insertion/Deletion Errors: 0
            #Assembly Size: 2344
            #Consensus Quality: 100
            #Substitution Errors: 0
            #Insertion/Deletion Errors: 0
            #Assembly Size: 2255
            #Consensus Quality: 100
            #Step 4: (optional) more rounds and/or other polishers
            #After one round of Polypolish and one round of POLCA, your assembly should be in very good shape!
            #However, there may still be a few lingering errors. You can try running additional rounds of Polypolish or POLCA to see if they make any more changes.
            for cluster in cluster_001; do
            cd ${cluster}
            polca.sh -a polypolish.fasta.PolcaCorrected.fa -r "../../../s-epidermidis-1585_R1.fastq.gz ../../../s-epidermidis-1585_R2.fastq.gz" -t 55 -m 120G
            cd ..
            done
            #Substitution Errors: 0
            #Insertion/Deletion Errors: 0
            #Assembly Size: 2443600
            #Consensus Quality: 100
            #-- 1585v (1, no short reads, waiting) --
            # TODO!
            #-- 5179 (2) --
            #mean read depth: 120.7x
            #7,547 bp have a depth of zero (99.6946% coverage)
            #356 positions changed (0.0144% of total positions)
            #estimated pre-polishing sequence accuracy: 99.9856% (Q38.41)
            #Step 1: read QC
            fastp --in1 ../../s-epidermidis-5179_R1.fastq.gz --in2 ../../s-epidermidis-5179_R2.fastq.gz --out1 1.fastq.gz --out2 2.fastq.gz --unpaired1 u.fastq.gz --unpaired2 u.fastq.gz
            #Step 2: Polypolish
            for cluster in cluster_001 cluster_002; do
            bwa index ${cluster}/7_final_consensus.fasta
            bwa mem -t 55 -a ${cluster}/7_final_consensus.fasta 1.fastq.gz > ${cluster}/alignments_1.sam
            bwa mem -t 55 -a ${cluster}/7_final_consensus.fasta 2.fastq.gz > ${cluster}/alignments_2.sam
            polypolish polish ${cluster}/7_final_consensus.fasta ${cluster}/alignments_1.sam ${cluster}/alignments_2.sam > ${cluster}/polypolish.fasta
            done
            #Step 3: POLCA
            for cluster in cluster_001 cluster_002; do
            cd ${cluster}
            polca.sh -a polypolish.fasta -r "../../../s-epidermidis-5179_R1.fastq.gz ../../../s-epidermidis-5179_R2.fastq.gz" -t 55 -m 120G
            cd ..
            done
            #Substitution Errors: 49
            #Insertion/Deletion Errors: 23
            #Assembly Size: 2471418
            #Consensus Quality: 99.9971
            #Substitution Errors: 0
            #Insertion/Deletion Errors: 0
            #Assembly Size: 17748
            #Consensus Quality: 100
            #Step 4: (optional) more rounds POLCA
            for cluster in cluster_001; do
            cd ${cluster}
            polca.sh -a polypolish.fasta.PolcaCorrected.fa -r "../../../s-epidermidis-5179_R1.fastq.gz ../../../s-epidermidis-5179_R2.fastq.gz" -t 55 -m 120G
            cd ..
            done
            #Substitution Errors: 10
            #Insertion/Deletion Errors: 5
            #Assembly Size: 2471442
            #Consensus Quality: 99.9994
            for cluster in cluster_001; do
            cd ${cluster}
            polca.sh -a polypolish.fasta.PolcaCorrected.fa.PolcaCorrected.fa -r "../../../s-epidermidis-5179_R1.fastq.gz ../../../s-epidermidis-5179_R2.fastq.gz" -t 55 -m 120G
            cd ..
            done
            Substitution Errors: 6
            Insertion/Deletion Errors: 0
            Assembly Size: 2471445
            Consensus Quality: 99.9998
            for cluster in cluster_001; do
            cd ${cluster}
            polca.sh -a polypolish.fasta.PolcaCorrected.fa.PolcaCorrected.fa.PolcaCorrected.fa -r "../../../s-epidermidis-5179_R1.fastq.gz ../../../s-epidermidis-5179_R2.fastq.gz" -t 55 -m 120G
            cd ..
            done
            Substitution Errors: 0
            Insertion/Deletion Errors: 0
            Assembly Size: 2471445
            Consensus Quality: 100
            #-- HD5_2 (2): without the short-sequencing we cannot correct the base-calling! --
            # !ERROR to be REPORTED, the
            #Polishing cluster_001_consensus (2,504,140 bp):
            #mean read depth: 94.4x
            #240,420 bp have a depth of zero (90.3991% coverage)
            #56,894 positions changed (2.2720% of total positions)
            #estimated pre-polishing sequence accuracy: 97.7280% (Q16.44)
            /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_1_S37_R1_001.fastq
            /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_1_S37_R2_001.fastq
            /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_2_S38_R1_001.fastq
            /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_2_S38_R2_001.fastq
            /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_3_S39_R1_001.fastq
            /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_3_S39_R2_001.fastq
            /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_4_S40_R1_001.fastq
            /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_4_S40_R2_001.fastq
            /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_5_S41_R1_001.fastq
            /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_5_S41_R2_001.fastq
            /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_6_S42_R1_001.fastq
            /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_6_S42_R2_001.fastq
            /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_7_S43_R1_001.fastq
            /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_7_S43_R2_001.fastq
            /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_8_S44_R1_001.fastq
            /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_8_S44_R2_001.fastq
            /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_9_S45_R1_001.fastq
            /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_9_S45_R2_001.fastq
            /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_10_S46_R1_001.fastq
            /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_10_S46_R2_001.fastq
            #Step 1: read QC
            fastp --in1 ../../HD5_2_S38_R1_001.fastq.gz --in2 ../../HD5_2_S38_R2_001.fastq.gz --out1 1.fastq.gz --out2 2.fastq.gz --unpaired1 u.fastq.gz --unpaired2 u.fastq.gz
            # NOTE that the following steps are not run since the short-reads are not correct!
            # #Step 2: Polypolish
            # for cluster in cluster_001 cluster_005; do
            #   bwa index ${cluster}/7_final_consensus.fasta
            #   bwa mem -t 55 -a ${cluster}/7_final_consensus.fasta 1.fastq.gz > ${cluster}/alignments_1.sam
            #   bwa mem -t 55 -a ${cluster}/7_final_consensus.fasta 2.fastq.gz > ${cluster}/alignments_2.sam
            #   polypolish polish ${cluster}/7_final_consensus.fasta ${cluster}/alignments_1.sam ${cluster}/alignments_2.sam > ${cluster}/polypolish.fasta
            # done
            # #Step 3: POLCA
            # for cluster in cluster_001 cluster_005; do
            #   cd ${cluster}
            #   polca.sh -a polypolish.fasta -r "../../../HD5_2_S38_R1_001.fastq.gz ../../../HD5_2_S38_R2_001.fastq.gz" -t 55 -m 120G
            #   cd ..
            # done
            # #Step 4: (optional) more rounds POLCA
            # for cluster in cluster_001; do
            #   cd ${cluster}
            #   polca.sh -a polypolish.fasta.PolcaCorrected.fa -r "../../../HD5_2_S38_R1_001.fastq.gz ../../../HD5_2_S38_R2_001.fastq.gz" -t 55 -m 120G
            #   cd ..
            # done
            # NOTE that the plasmids of HD5_2_K5 and HD5_2_K6 were copied from Unicycler!
            #-- HD5_2_K5 (4) --
            mean read depth: 87.1x
            25 bp have a depth of zero (99.9990% coverage)
            1,085 positions changed (0.0433% of total positions)
            estimated pre-polishing sequence accuracy: 99.9567% (Q33.63)
            #Step 1: read QC
            fastp --in1 ../../275_K5_Holger_S92_R1_001.fastq.gz --in2 ../../275_K5_Holger_S92_R2_001.fastq.gz --out1 1.fastq.gz --out2 2.fastq.gz --unpaired1 u.fastq.gz --unpaired2 u.fastq.gz
            #Step 2: Polypolish
            for cluster in cluster_001 cluster_002 cluster_003 cluster_004; do
            bwa index ${cluster}/7_final_consensus.fasta
            bwa mem -t 55 -a ${cluster}/7_final_consensus.fasta 1.fastq.gz > ${cluster}/alignments_1.sam
            bwa mem -t 55 -a ${cluster}/7_final_consensus.fasta 2.fastq.gz > ${cluster}/alignments_2.sam
            polypolish polish ${cluster}/7_final_consensus.fasta ${cluster}/alignments_1.sam ${cluster}/alignments_2.sam > ${cluster}/polypolish.fasta
            done
            #Step 3: POLCA
            for cluster in cluster_001 cluster_002 cluster_003 cluster_004; do
            cd ${cluster}
            polca.sh -a polypolish.fasta -r "../../../275_K5_Holger_S92_R1_001.fastq.gz ../../../275_K5_Holger_S92_R2_001.fastq.gz" -t 55 -m 120G
            cd ..
            done
            #Substitution Errors: 146
            #Insertion/Deletion Errors: 2
            #Assembly Size: 2504401
            #Consensus Quality: 99.9941
            #Substitution Errors: 41
            #Insertion/Deletion Errors: 0
            #Assembly Size: 41288
            #Consensus Quality: 99.9007
            #Substitution Errors: 0
            #Insertion/Deletion Errors: 0
            #Assembly Size: 9191
            #Consensus Quality: 100
            #Substitution Errors: 0
            #Insertion/Deletion Errors: 0
            #Assembly Size: 2767
            #Consensus Quality: 100
            #Step 4: (optional) more rounds POLCA
            for cluster in cluster_001 cluster_002; do
            cd ${cluster}
            polca.sh -a polypolish.fasta.PolcaCorrected.fa -r "../../../275_K5_Holger_S92_R1_001.fastq.gz ../../../275_K5_Holger_S92_R2_001.fastq.gz" -t 55 -m 120G
            cd ..
            done
            #Substitution Errors: 41
            #Insertion/Deletion Errors: 0
            #Assembly Size: 2504401
            #Consensus Quality: 99.9984
            #Substitution Errors: 8
            #Insertion/Deletion Errors: 0
            #Assembly Size: 41288
            #Consensus Quality: 99.9806
            for cluster in cluster_001 cluster_002; do
            cd ${cluster}
            polca.sh -a polypolish.fasta.PolcaCorrected.fa.PolcaCorrected.fa -r "../../../275_K5_Holger_S92_R1_001.fastq.gz ../../../275_K5_Holger_S92_R2_001.fastq.gz" -t 55 -m 120G
            cd ..
            done
            #Substitution Errors: 8
            #Insertion/Deletion Errors: 0
            #Assembly Size: 2504401
            #Consensus Quality: 99.9997
            #Substitution Errors: 4
            #Insertion/Deletion Errors: 0
            #Assembly Size: 41288
            #Consensus Quality: 99.9903
            for cluster in cluster_001 cluster_002; do
            cd ${cluster}
            polca.sh -a polypolish.fasta.PolcaCorrected.fa.PolcaCorrected.fa.PolcaCorrected.fa -r "../../../275_K5_Holger_S92_R1_001.fastq.gz ../../../275_K5_Holger_S92_R2_001.fastq.gz" -t 55 -m 120G
            cd ..
            done
            #Substitution Errors: 8
            #Insertion/Deletion Errors: 0
            #Assembly Size: 2504401
            #Consensus Quality: 99.9997
            #Substitution Errors: 4
            #Insertion/Deletion Errors: 0
            #Assembly Size: 41288
            #Consensus Quality: 99.9903
            #-- HD5_2_K6 (4) --
            #mean read depth: 116.7x
            #4 bp have a depth of zero (99.9998% coverage)
            #1,022 positions changed (0.0408% of total positions)
            #estimated pre-polishing sequence accuracy: 99.9592% (Q33.89)
            #Step 1: read QC
            fastp --in1 ../../276_K6_Holger_S95_R1_001.fastq.gz --in2 ../../276_K6_Holger_S95_R2_001.fastq.gz --out1 1.fastq.gz --out2 2.fastq.gz --unpaired1 u.fastq.gz --unpaired2 u.fastq.gz
            #Step 2: Polypolish
            for cluster in cluster_001 cluster_002 cluster_003 cluster_004; do
            bwa index ${cluster}/7_final_consensus.fasta
            bwa mem -t 55 -a ${cluster}/7_final_consensus.fasta 1.fastq.gz > ${cluster}/alignments_1.sam
            bwa mem -t 55 -a ${cluster}/7_final_consensus.fasta 2.fastq.gz > ${cluster}/alignments_2.sam
            polypolish polish ${cluster}/7_final_consensus.fasta ${cluster}/alignments_1.sam ${cluster}/alignments_2.sam > ${cluster}/polypolish.fasta
            done
            #Step 3: POLCA
            for cluster in cluster_001 cluster_002 cluster_003 cluster_004; do
            cd ${cluster}
            polca.sh -a polypolish.fasta -r "../../../276_K6_Holger_S95_R1_001.fastq.gz ../../../276_K6_Holger_S95_R2_001.fastq.gz" -t 55 -m 120G
            cd ..
            done
            #Substitution Errors: 164
            #Insertion/Deletion Errors: 2
            #Assembly Size: 2504398
            #Consensus Quality: 99.9934
            #Substitution Errors: 22
            #Insertion/Deletion Errors: 0
            #Assembly Size: 41288
            #Consensus Quality: 99.9467
            #Substitution Errors: 0
            #Insertion/Deletion Errors: 0
            #Assembly Size: 9191
            #Consensus Quality: 100
            #Substitution Errors: 0
            #Insertion/Deletion Errors: 0
            #Assembly Size: 2767
            #Consensus Quality: 100
            #Step 4: (optional) more rounds POLCA
            for cluster in cluster_001 cluster_002; do
            cd ${cluster}
            polca.sh -a polypolish.fasta.PolcaCorrected.fa -r "../../../276_K6_Holger_S95_R1_001.fastq.gz ../../../276_K6_Holger_S95_R2_001.fastq.gz" -t 55 -m 120G
            cd ..
            done
            #Substitution Errors: 32
            #Insertion/Deletion Errors: 0
            #Assembly Size: 2504400
            #Consensus Quality: 99.9987
            #Substitution Errors: 0
            #Insertion/Deletion Errors: 0
            #Assembly Size: 41288
            #Consensus Quality: 100
            for cluster in cluster_001; do
            cd ${cluster}
            polca.sh -a polypolish.fasta.PolcaCorrected.fa.PolcaCorrected.fa -r "../../../276_K6_Holger_S95_R1_001.fastq.gz ../../../276_K6_Holger_S95_R2_001.fastq.gz" -t 55 -m 120G
            cd ..
            done
            #Substitution Errors: 4
            #Insertion/Deletion Errors: 0
            #Assembly Size: 2504400
            #Consensus Quality: 99.9998
            for cluster in cluster_001; do
            cd ${cluster}
            polca.sh -a polypolish.fasta.PolcaCorrected.fa.PolcaCorrected.fa.PolcaCorrected.fa -r "../../../276_K6_Holger_S95_R1_001.fastq.gz ../../../276_K6_Holger_S95_R2_001.fastq.gz" -t 55 -m 120G
            cd ..
            done
            #Substitution Errors: 2
            #Insertion/Deletion Errors: 0
            #Assembly Size: 2504400
            #Consensus Quality: 99.9999
  21. (Optional) use platon to confirm the plasmid contigs: https://github.com/oschwengers/platon

Leave a Reply

Your email address will not be published. Required fields are marked *