-
install mambaforge https://conda-forge.org/miniforge/ (recommended)
#download Mambaforge-24.9.2-0-Linux-x86_64.sh from website chmod +x Mambaforge-24.9.2-0-Linux-x86_64.sh ./Mambaforge-24.9.2-0-Linux-x86_64.sh To activate this environment, use: micromamba activate /home/jhuang/mambaforge Or to execute a single command in this environment, use: micromamba run -p /home/jhuang/mambaforge mycommand installation finished. Do you wish to update your shell profile to automatically initialize conda? This will activate conda on startup and change the command prompt when activated. If you'd prefer that conda's base environment not be activated on startup, run the following command when conda is activated: conda config --set auto_activate_base false You can undo this by running `conda init --reverse $SHELL`? [yes|no] [no] >>> yes no change /home/jhuang/mambaforge/condabin/conda no change /home/jhuang/mambaforge/bin/conda no change /home/jhuang/mambaforge/bin/conda-env no change /home/jhuang/mambaforge/bin/activate no change /home/jhuang/mambaforge/bin/deactivate no change /home/jhuang/mambaforge/etc/profile.d/conda.sh no change /home/jhuang/mambaforge/etc/fish/conf.d/conda.fish no change /home/jhuang/mambaforge/shell/condabin/Conda.psm1 no change /home/jhuang/mambaforge/shell/condabin/conda-hook.ps1 no change /home/jhuang/mambaforge/lib/python3.12/site-packages/xontrib/conda.xsh no change /home/jhuang/mambaforge/etc/profile.d/conda.csh modified /home/jhuang/.bashrc ==> For changes to take effect, close and re-open your current shell. <== no change /home/jhuang/mambaforge/condabin/conda no change /home/jhuang/mambaforge/bin/conda no change /home/jhuang/mambaforge/bin/conda-env no change /home/jhuang/mambaforge/bin/activate no change /home/jhuang/mambaforge/bin/deactivate no change /home/jhuang/mambaforge/etc/profile.d/conda.sh no change /home/jhuang/mambaforge/etc/fish/conf.d/conda.fish no change /home/jhuang/mambaforge/shell/condabin/Conda.psm1 no change /home/jhuang/mambaforge/shell/condabin/conda-hook.ps1 no change /home/jhuang/mambaforge/lib/python3.12/site-packages/xontrib/conda.xsh no change /home/jhuang/mambaforge/etc/profile.d/conda.csh no change /home/jhuang/.bashrc No action taken. WARNING conda.common.path.windows:_path_to(100): cygpath is not available, fallback to manual path conversion WARNING conda.common.path.windows:_path_to(100): cygpath is not available, fallback to manual path conversion Added mamba to /home/jhuang/.bashrc ==> For changes to take effect, close and re-open your current shell. <== Thank you for installing Mambaforge! Close your terminal window and open a new one, or run: #source ~/mambaforge/bin/activate conda --version mamba --version https://github.com/conda-forge/miniforge/releases Note * After installation, please make sure that you do not have the Anaconda default channels configured. conda config --show channels conda config --remove channels defaults conda config --add channels conda-forge conda config --show channels conda config --set channel_priority strict #conda clean --all conda config --remove channels biobakery * !!!!Do not install anything into the base environment as this might break your installation. See here for details.!!!! # --Deprecated method: mamba installing on conda-- #conda install -n base --override-channels -c conda-forge mamba 'python_abi=*=*cp*' # * Note that installing mamba into any other environment than base is not supported. # #conda activate base #conda install conda #conda uninstall mamba #conda install mamba
2: install required Tools on the mamba env
* Sniffles2: Detect structural variants, including transposons, from long-read alignments.
* RepeatModeler2: Identify and classify transposons de novo.
* RepeatMasker: Annotate known transposable elements using transposon libraries.
* SVIM: An alternative structural variant caller optimized for long-read sequencing, if needed.
* SURVIVOR: Consolidate structural variants across samples for comparative analysis.
mamba deactivate
# Create a new conda environment
mamba create -n transposon_long python=3.6 -y
# Activate the environment
mamba activate transposon_long
mamba install -c bioconda sniffles
mamba install -c bioconda repeatmodeler repeatmasker
# configure repeatmasker database
mamba info --envs
cd /home/jhuang/mambaforge/envs/transposon_long/share/RepeatMasker
#mamba install python=3.6
mamba install -c bioconda svim
mamba install -c bioconda survivor
-
Test the installed tools
# Check versions sniffles --version RepeatModeler -h RepeatMasker -h svim --help SURVIVOR --help mamba install -c conda-forge perl r
-
Data Preparation
Raw Signal Data: Nanopore devices generate electrical signal data as DNA passes through the nanopore. Basecalling: Tools like Guppy or Dorado are used to convert raw signals into nucleotide sequences (FASTQ files).
-
Preprocessing
Quality Filtering: Remove low-quality reads using tools like Filtlong or NanoFilt. Adapter Trimming: Identify and remove sequencing adapters with tools like Porechop.
-
(Optional) Variant Calling for SNP and Indel Detection:
Tools like Medaka, Longshot, or Nanopolish analyze the aligned reads to identify SNPs and small indels.
-
(OFFICIAL STARTING POINT) Alignment and Structural Variant Calling: Tools such as Sniffles or SVIM detect large insertions, deletions, and other structural variants. 使用长读长测序工具如 SVIM 或 Sniffles 检测结构变异(e.g. 散在性重复序列)。
#NOTE that the ./batch1_depth25/trycycler_WT/reads.fastq and F24A430001437_BACctmoD/BGI_result/Separate/${sample}/1.Cleandata/${sample}.filtered_reads.fq.gz are the same! # -- PREPARING the input fastq-data, merge the fastqz and move the top-directory # Under raw_data/no_sample_id/20250731_0943_MN45170_FBD12615_97f118c2/fastq_pass zcat ./barcode01/FBD12615_pass_barcode01_97f118c2_aa46ecf7_0.fastq.gz ./barcode01/FBD12615_pass_barcode01_97f118c2_aa46ecf7_1.fastq.gz ./barcode01/FBD12615_pass_barcode01_97f118c2_aa46ecf7_2.fastq.gz ./barcode01/FBD12615_pass_barcode01_97f118c2_aa46ecf7_3.fastq.gz ... | gzip > HD46_1.fastq.gz mv ./raw_data/no_sample_id/20250731_0943_MN45170_FBD12615_97f118c2/fastq_pass/HD46_1.fastq.gz ~/DATA/Data_Patricia_Transposon_2025 #this are the corresponding sample names: #barcode 1: HD46-1 #barcode 2: HD46-2 #barcode 3: HD46-3 #barcode 4: HD46-4 mv barcode01.fastq.gz HD46_1.fastq.gz mv barcode02.fastq.gz HD46_2.fastq.gz mv barcode03.fastq.gz HD46_3.fastq.gz mv barcode04.fastq.gz HD46_4.fastq.gz # -- CALCULATE the coverages #!/bin/bash for bam in barcode*_minimap2.sorted.bam; do echo "Processing $bam ..." avg_cov=$(samtools depth -a "$bam" | awk '{sum+=$3; cnt++} END {if (cnt>0) print sum/cnt; else print 0}') echo -e "${bam}\t${avg_cov}" >> coverage_summary.txt done # ---- !!!! LOGIN the suitable environment !!!! ---- mamba activate transposon_long # -- TODO: AFTERNOON_DEBUG_THIS: FAILED and not_USED: Alignment and Detect structural variants in each sample using SVIM which used aligner ngmlr or mimimap2 #mamba install -c bioconda ngmlr mamba install -c bioconda svim # ---- Option_1: minimap2 (aligner) + SVIM (structural variant caller) --> SUCCESSFUL ---- for sample in HD46_Ctrl HD46_1 HD46_2 HD46_3 HD46_4 HD46_5 HD46_6 HD46_7 HD46_8 HD46_13; do #INS,INV,DUP:TANDEM,DUP:INT,BND svim reads --aligner minimap2 --nanopore minimap2+svim_${sample} ${sample}.fastq.gz CP020463.fasta --cores 20 --types INS --min_sv_size 100 --sequence_allele --insertion_sequences --read_names; done #svim alignment svim_alignment_minmap2_1_re 1.sorted.bam CP020463_.fasta --types INS --sequence_alleles --insertion_sequences --read_names # ---- Option_2: minamap2 (aligner) + Sniffles2 (structural variant caller) --> SUCCESSFUL ---- #Minimap2: A commonly used aligner for nanopore sequencing data. # Align Long Reads to the WT Reference using Minimap2 #sniffles -m WT.sorted.bam -v WT.vcf -s 10 -l 50 -t 60 # -s 20: Requires at least 20 reads to support an SV for reporting. --> 10 # -l 50: Reports SVs that are at least 50 base pairs long. # -t 60: Uses 60 threads for faster processing. for sample in HD46_Ctrl HD46_1 HD46_2 HD46_3 HD46_4 HD46_5 HD46_6 HD46_7 HD46_8 HD46_13; do #minimap2 --MD -t 60 -ax map-ont CP020463.fasta ./batch1_depth25/trycycler_${sample}/reads.fastq | samtools sort -o ${sample}.sorted.bam minimap2 --MD -t 60 -ax map-ont CP020463.fasta ${sample}.fastq.gz | samtools sort -o ${sample}_minimap2.sorted.bam samtools index ${sample}_minimap2.sorted.bam sniffles -m ${sample}_minimap2.sorted.bam -v ${sample}_minimap2+sniffles.vcf -s 10 -l 50 -t 60 #QUAL < 20 || bcftools filter -e "INFO/SVTYPE != 'INS'" ${sample}_minimap2+sniffles.vcf > ${sample}_minimap2+sniffles_filtered.vcf done #Estimating parameter... # Max dist between aln events: 44 # Max diff in window: 76 # Min score ratio: 2 # Avg DEL ratio: 0.0112045 # Avg INS ratio: 0.0364027 #Start parsing... CP020463 # # Processed reads: 10000 # # Processed reads: 20000 # Finalizing .. #Start genotype calling: # Reopening Bam file for parsing coverage # Finalizing .. #Estimating parameter... # Max dist between aln events: 28 # Max diff in window: 89 # Min score ratio: 2 # Avg DEL ratio: 0.013754 # Avg INS ratio: 0.17393 #Start parsing... CP020463 # # Processed reads: 10000 # # Processed reads: 20000 # # Processed reads: 30000 # # Processed reads: 40000 # Results: # * barcode01_minimap2+sniffles.vcf # * barcode01_minimap2+sniffles_filtered.vcf # * barcode02_minimap2+sniffles.vcf # * barcode02_minimap2+sniffles_filtered.vcf # * barcode03_minimap2+sniffles.vcf # * barcode03_minimap2+sniffles_filtered.vcf # * barcode04_minimap2+sniffles.vcf # * barcode04_minimap2+sniffles_filtered.vcf #ERROR: No MD string detected! Check bam file! Otherwise generate using e.g. samtools. --> No results! #for sample in barcode01 barcode02 barcode03 barcode04; do # sniffles -m svim_reads_minimap2_${sample}/${sample}.fastq.minimap2.coordsorted.bam -v sniffles_minimap2_${sample}.vcf -s 10 -l 50 -t 60 # bcftools filter -e "INFO/SVTYPE != 'INS'" sniffles_minimap2_${sample}.vcf > sniffles_minimap2_${sample}_filtered.vcf #done # ---- Option_3: NGMLR (aligner) + SVIM (structural variant caller) --> SUCCESSFUL ---- for sample in HD46_Ctrl HD46_1 HD46_2 HD46_3 HD46_4 HD46_5 HD46_6 HD46_7 HD46_8 HD46_13; do svim reads --aligner ngmlr --nanopore ngmlr+svim_${sample} ${sample}.fastq.gz CP020463.fasta --cores 10; done # ---- Option_4: NGMLR (aligner) + sniffles (structural variant caller) --> SUCCESSFUL ---- for sample in HD46_Ctrl HD46_1 HD46_2 HD46_3 HD46_4 HD46_5 HD46_6 HD46_7 HD46_8 HD46_13; do sniffles -m ngmlr+svim_${sample}/${sample}.fastq.ngmlr.coordsorted.bam -v ${sample}_ngmlr+sniffles.vcf -s 10 -l 50 -t 60 bcftools filter -e "INFO/SVTYPE != 'INS'" ${sample}_ngmlr+sniffles.vcf > ${sample}_ngmlr+sniffles_filtered.vcf done
-
Compare and integrate all results produced by minimap2+sniffles and ngmlr+sniffles, and check them each position in IGV!
mv HD46_Ctrl_minimap2+sniffles_filtered.vcf HD46-Ctrl_minimap2+sniffles_filtered.vcf mv HD46_Ctrl_ngmlr+sniffles_filtered.vcf HD46-Ctrl_ngmlr+sniffles_filtered.vcf mv HD46_1_minimap2+sniffles_filtered.vcf HD46-1_minimap2+sniffles_filtered.vcf mv HD46_1_ngmlr+sniffles_filtered.vcf HD46-1_ngmlr+sniffles_filtered.vcf mv HD46_2_minimap2+sniffles_filtered.vcf HD46-2_minimap2+sniffles_filtered.vcf mv HD46_2_ngmlr+sniffles_filtered.vcf HD46-2_ngmlr+sniffles_filtered.vcf mv HD46_3_minimap2+sniffles_filtered.vcf HD46-3_minimap2+sniffles_filtered.vcf mv HD46_3_ngmlr+sniffles_filtered.vcf HD46-3_ngmlr+sniffles_filtered.vcf mv HD46_4_minimap2+sniffles_filtered.vcf HD46-4_minimap2+sniffles_filtered.vcf mv HD46_4_ngmlr+sniffles_filtered.vcf HD46-4_ngmlr+sniffles_filtered.vcf mv HD46_5_minimap2+sniffles_filtered.vcf HD46-5_minimap2+sniffles_filtered.vcf mv HD46_5_ngmlr+sniffles_filtered.vcf HD46-5_ngmlr+sniffles_filtered.vcf mv HD46_6_minimap2+sniffles_filtered.vcf HD46-6_minimap2+sniffles_filtered.vcf mv HD46_6_ngmlr+sniffles_filtered.vcf HD46-6_ngmlr+sniffles_filtered.vcf mv HD46_7_minimap2+sniffles_filtered.vcf HD46-7_minimap2+sniffles_filtered.vcf mv HD46_7_ngmlr+sniffles_filtered.vcf HD46-7_ngmlr+sniffles_filtered.vcf mv HD46_8_minimap2+sniffles_filtered.vcf HD46-8_minimap2+sniffles_filtered.vcf mv HD46_8_ngmlr+sniffles_filtered.vcf HD46-8_ngmlr+sniffles_filtered.vcf mv HD46_13_minimap2+sniffles_filtered.vcf HD46-13_minimap2+sniffles_filtered.vcf mv HD46_13_ngmlr+sniffles_filtered.vcf HD46-13_ngmlr+sniffles_filtered.vcf conda activate plot-numpy1 #python generate_common_vcf.py #mv common_variants.xlsx putative_transposons.xlsx # * Reads each of your VCFs. # * Filters variants → only keep those with FILTER == PASS. # * Compares the two aligner methods (minimap2+sniffles2 vs ngmlr+sniffles2) per sample. # * Keeps only variants that appear in both methods for the same sample. # * Outputs: An Excel file with the common variants and a log text file listing which variants were filtered out, and why (not_PASS or not_COMMON_in_two_VCF). #python generate_fuzzy_common_vcf_v1.py #Sample PASS_minimap2 PASS_ngmlr COMMON # HD46-Ctrl_Ctrl 39 29 28 # HD46-1 39 32 29 # HD46-2 40 32 28 # HD46-3 38 30 27 # HD46-4 46 35 32 # HD46-5 40 35 31 # HD46-6 43 35 30 # HD46-7 40 33 28 # HD46-8 37 20 11 # HD46-13 39 38 27 #Sample PASS_minimap2 PASS_ngmlr COMMON_FINAL #HD46-Ctrl_Ctrl 39 29 6 #HD46-1 39 32 8 #HD46-2 40 32 8 #HD46-3 38 30 6 #HD46-4 46 35 8 #HD46-5 40 35 9 #HD46-6 43 35 10 #HD46-7 40 33 8 #HD46-8 37 20 4 #HD46-13 39 38 5 #!!!! Summarize the results of ngmlr+sniffles !!!! python merge_ngmlr+sniffles_filtered_results_and_summarize.py #!!!! Post-Processing !!!! #DELETE "2186168 N
. PASS” in Sheet HD46-13 and Summary #DELETE “2427785 N CGTCAGAATCGCTGTCTGCGTCCGAGTCACTGTCTGAGTCTGAATCACTATCTGCGTCTGAGTCACTGTCTG . PASS” due to “0/1:169:117” in HD46-13 and Summary #DELETE “2441640 N GCTCATTAAGAATCATTAAATTAC . PASS” due to 0/1:170:152 in HD46-13 and Summary
-
Source code of merge_ngmlr+sniffles_filtered_results_and_summarize.py
import os import pandas as pd # List all ngmlr VCF files vcf_files = sorted([f for f in os.listdir('.') if f.endswith('_ngmlr+sniffles_filtered.vcf')]) log_lines = [] df_list = [] # Function to read VCF and filter PASS def read_vcf(vcf_path, log_lines, sample): variants = [] with open(vcf_path) as f: for line in f: if line.startswith('#'): continue parts = line.strip().split('\t') pos, ref, alt, qual, flt = parts[1], parts[3], parts[4], parts[5], parts[6] info = parts[7] if len(parts) > 7 else '.' fmt = parts[8] if len(parts) > 8 else '.' last_column = parts[9] if len(parts) > 9 else '.' # Keep original column var_id = f"{pos}:{ref}>{alt}" if flt != 'PASS': log_lines.append(f"{sample}\t{var_id}\tnot_PASS") continue variants.append({ 'POS': int(pos), 'REF': ref, 'ALT': alt, 'QUAL': qual, 'FILTER': flt, 'INFO': info, 'FORMAT': fmt, sample: last_column, # Use only genotype for individual sheets f"{sample}_with_alt": alt # Use only ALT sequence for summary }) return pd.DataFrame(variants) # Read all VCFs sample_names = [] sample_with_alt_columns = [] for f in vcf_files: sample = os.path.basename(f).replace('_ngmlr+sniffles_filtered.vcf', '') sample_names.append(sample) sample_with_alt_columns.append(f"{sample}_with_alt") df = read_vcf(f, log_lines, sample) df_list.append(df) # Merge all variants into one DataFrame all_variants = pd.concat(df_list, ignore_index=True) # Fill missing sample columns with hyphen for summary for sample in sample_names: if sample not in all_variants.columns: all_variants[sample] = '' if f"{sample}_with_alt" not in all_variants.columns: all_variants[f"{sample}_with_alt"] = '-' # Pivot table for summary using exact POS, using columns with ALT summary_df = all_variants.groupby(['POS', 'REF'])[sample_with_alt_columns].first().reset_index() # Rename columns in summary to remove '_with_alt' suffix summary_df.columns = ['POS', 'REF'] + sample_names # Replace NaN or empty strings with hyphen in sample columns summary_df[sample_names] = summary_df[sample_names].fillna('-').replace('', '-') # Count how many samples have a non-hyphen value summary_df['Count'] = summary_df[sample_names].apply(lambda row: sum(val != '-' for val in row), axis=1) # Save Excel with individual sheets and summary writer = pd.ExcelWriter('merged_ngmlr+sniffles_variants.xlsx', engine='openpyxl') # Save individual sample sheets for df in df_list: sheet_name = df.columns[-2] # Use sample name (not the _with_alt column) # Select only relevant columns for individual sheets (exclude _with_alt) df = df[[col for col in df.columns if not col.endswith('_with_alt')]] df.to_excel(writer, sheet_name=sheet_name, index=False) # Save summary sheet summary_df.to_excel(writer, sheet_name='Summary', index=False) writer.close() # Write log with open('ngmlr_filtering_log.txt', 'w') as logf: logf.write('Sample\tVariant\tReason\n') for line in log_lines: logf.write(line + '\n') print('Done: merged_ngmlr+sniffles_variants.xlsx with ALT sequences or hyphens in summary and genotype-only in individual sheets.')
-
———————– END of the transposon calculation. The following steps (e.g. assembly based on the long-reads is not necessary for the transposon analysis) —————————-
-
(NOT_USED) Filtering low-complexity insertions using RepeatMasker (TODO: how to use RepeatModeler to generate own lib?)
python vcf_to_fasta.py variants.vcf variants.fasta #python filter_low_complexity.py variants.fasta filtered_variants.fasta retained_variants.fasta #Using RepeatMasker to filter the low-complexity fasta, the used h5 lib is /home/jhuang/mambaforge/envs/transposon_long/share/RepeatMasker/Libraries/Dfam.h5 #1.9G python /home/jhuang/mambaforge/envs/transposon_long/share/RepeatMasker/famdb.py -i /home/jhuang/mambaforge/envs/transposon_long/share/RepeatMasker/Libraries/Dfam.h5 names 'bacteria' | head Exact Matches ============= 2 bacteria (blast name), Bacteria
(scientific name), eubacteria (genbank common name), Monera (in-part), Procaryotae (in-part), Prokaryota (in-part), Prokaryotae (in-part), prokaryote (in-part), prokaryotes (in-part) Non-exact Matches ================= 1783272 Terrabacteria group (scientific name) 91061 Bacilli (scientific name), Bacilli Ludwig et al. 2010 (authority), Bacillus/Lactobacillus/Streptococcus group (synonym), Firmibacteria (synonym), Firmibacteria Murray 1988 (authority) 1239 Bacillaeota (synonym), Bacillaeota Oren et al. 2015 (authority), Bacillota (synonym), Bacillus/Clostridium group (synonym), clostridial firmicutes (synonym), Clostridium group firmicutes (synonym), Firmacutes (synonym), firmicutes (blast name), Firmicutes (scientific name), Firmicutes corrig. Gibbons and Murray 1978 (authority), Low G+C firmicutes (synonym), low G+C Gram-positive bacteria (common name), low GC Gram+ (common name) Summary of Classes within Firmicutes: * Bacilli (includes many common pathogenic and non-pathogenic Gram-positive bacteria, taxid=91061) * Bacillus (e.g., Bacillus subtilis, Bacillus anthracis) * Staphylococcus (e.g., Staphylococcus aureus, Staphylococcus epidermidis) * Streptococcus (e.g., Streptococcus pneumoniae, Streptococcus pyogenes) * Listeria (e.g., Listeria monocytogenes) * Clostridia (includes many anaerobic species like Clostridium and Clostridioides) * Erysipelotrichia (intestinal bacteria, some pathogenic) * Tissierellia (less-studied, veterinary relevance) * Mollicutes (cell wall-less, includes Mycoplasma species) * Negativicutes (includes some Gram-negative, anaerobic species) RepeatMasker -species Bacilli -pa 4 -xsmall variants.fasta python extract_unmasked_seq.py variants.fasta.masked unmasked_variants.fasta #bcftools filter -i ‘QUAL>30 && INFO/SVLEN>100’ variants.vcf -o filtered.vcf # #bcftools view -i ‘SVTYPE=”INS”‘ variants.vcf | bcftools query -f ‘%CHROM\t%POS\t%REF\t%ALT\t%INFO\n’ > insertions.txt #mamba install -c bioconda vcf2fasta #vcf2fasta variants.vcf -o insertions.fasta #grep “SEQS” variants.vcf | awk ‘{ print $1, $2, $4, $5, $8 }’ > insertions.txt #python3 filtering_low_complexity.py # #vcftools –vcf input.vcf –recode –out filtered_output –minSVLEN 100 #bcftools filter -e ‘INFO/SEQS ~ “^(G+|C+|T+|A+){4,}”‘ variants.vcf -o filtered.vcf # — calculate the percentage of reads To calculate the percentage of reads that contain the insertion from the VCF entry, use the INFO and FORMAT fields provided in the VCF record. Step 1: Extract Relevant Information In the provided VCF entry: RE (Reads Evidence): 733 – the total number of reads supporting the insertion. GT (Genotype): 1/1 – this indicates a homozygous insertion, meaning all reads covering this region are expected to have the insertion. AF (Allele Frequency): 1 – a 100% allele frequency, indicating that every read in this sample supports the insertion. DR (Depth Reference): 0 – the number of reads supporting the reference allele. DV (Depth Variant): 733 – the number of reads supporting the variant allele (insertion). Step 2: Calculate Percentage of Reads Supporting the Insertion Using the formula: Percentage of reads with insertion=(DVDR+DV)×100 Percentage of reads with insertion=(DR+DVDV)×100 Substitute the values: Percentage=(7330+733)×100=100% Percentage=(0+733733)×100=100% Conclusion Based on the VCF record, 100% of the reads support the insertion, indicating that the insertion is fully present in the sample (homozygous insertion). This is consistent with the AF=1 and GT=1/1 fields. * In your VCF file generated by Sniffles, the REF=N in the results has a specific meaning: * In a standard VCF, the REF field usually contains the reference base(s) at the variant position. * For structural variants (SVs), especially insertions, there is no reference sequence replaced; the insertion occurs between reference bases. * Therefore, Sniffles uses N as a placeholder in the REF field to indicate “no reference base replaced”. * The actual inserted sequence is then stored in the ALT field. -
Why some records have UNRESOLVED in the FILTER field in the Excel output.
1. Understanding the format The data appears to be structural variant (SV) calls from Sniffles, probably in a VCF-like tabular format exported to Excel: * gi|1176884116|gb|CP020463.1| → reference sequence * Positions: 1855752 and 2422820 * N → insertion event * SVLEN=999 → size of the insertion * AF → allele frequency * GT:DR:DV → genotype, depth reference, depth variant (1/1:0:678, example values for a PASS variant) * FILTER → whether the variant passed filters (UNRESOLVED means it didn’t pass) 2. What UNRESOLVED usually means In Sniffles: * UNRESOLVED is assigned to SVs when the tool cannot confidently resolve the exact sequence or breakpoint. * Reasons include: - Low read support (RE, DV) relative to the expected coverage - Ambiguous alignment at repetitive regions - Conflicting strand or orientation signals - Allele frequency inconsistent with expectations 3. Examine your two records First record POS: 1855752 SVTYPE: INS SVLEN: 999 RE: 68 AF: 1 GT: 1/1 FILTER: UNRESOLVED Observations: * AF = 1 → allele frequency 100%, homozygous insertion * RE = 68 → 68 reads support the variant, decent coverage * Still UNRESOLVED → likely because Sniffles could not resolve the inserted sequence precisely; sometimes long insertions in repetitive regions are hard to reconstruct fully even with good read support. Second record POS: 2422820 SVTYPE: INS SVLEN: 999 RE: 22 AF: 0.025522 GT: 0/0 FILTER: UNRESOLVED Observations: * AF = 0.0255 → very low allele frequency (~2.5%) * RE = 22, DR = 840 → very low variant reads vs reference * GT = 0/0 → homozygous reference * Sniffles marks it UNRESOLVED because the variant is essentially noise, not confidently detected. 4. Key difference between the two Feature First record Second record Allele frequency (AF) 1 (high) 0.0255 (very low) Variant reads (RE) 68 22 Genotype (GT) 1/1 0/0 Reason for UNRESOLVED Unresolvable inserted sequence ✅ 5. Conclusion * Sniffles marks a variant as UNRESOLVED when the SV cannot be confidently characterized. * Even if there is good read support (first record), complex insertions can’t always be reconstructed fully. * Very low allele frequency (second record) also triggers UNRESOLVED because the signal is too weak compared to background noise. * Essentially: “UNRESOLVED” ≠ bad data, it’s just unresolved uncertainty.
-
(NOT_SURE_HOW_TO_USE) Polishing of assembly: Use tools like Medaka to refine variant calls by leveraging consensus sequences derived from nanopore data.
mamba install -c bioconda medaka medaka-consensus -i aligned_reads.bam -r reference.fasta -o polished_output -t 4
-
Compare Insertions Across Samples
Merge Variants Across Samples: Use SURVIVOR to merge and compare the detected insertions in all samples against the WT: SURVIVOR merge input_vcfs.txt 1000 1 1 1 0 30 merged.vcf Input: List of VCF files from Sniffles2. Output: A consolidated VCF file with shared and unique variants. Filter WT Insertions: Identify transposons present only in samples 1–9 by subtracting WT variants using bcftools: bcftools isec WT.vcf merged.vcf -p comparison_results
-
Validate and Visualize
Visualize with IGV: Use IGV to inspect insertion sites in the alignment and confirm quality. igv.sh Validate Findings: Perform PCR or additional sequencing for key transposon insertion sites to confirm results.
-
Alternatives to TEPID for Long-Read Data
If you’re looking for transposon-specific tools for long reads: REPET: A robust transposon annotation tool compatible with assembled genomes. EDTA (Extensive de novo TE Annotator): A pipeline to identify, classify, and annotate transposons. Works directly on your assembled genomes. perl EDTA.pl --genome WT.fasta --type all
-
The WT.vcf file in the pipeline is generated by detecting structural variants (SVs) in the wild-type (WT) genome aligned against itself or using it as a baseline reference. Here’s how you can generate the WT.vcf:
Steps to Generate WT.vcf 1. Align WT Reads to the WT Reference Genome The goal here is to create an alignment of the WT sequencing data to the WT reference genome to detect any self-contained structural variations, such as native insertions, deletions, or duplications. Command using Minimap2: minimap2 -ax map-ont WT.fasta WT_reads.fastq | samtools sort -o WT.sorted.bam Index the BAM file: samtools index WT.sorted.bam 2. Detect Structural Variants with Sniffles2 Run Sniffles2 on the WT alignment to call structural variants: sniffles --input WT.sorted.bam --vcf WT.vcf This step identifies: Native transposons and insertions present in the WT genome. Other structural variants that are part of the reference genome or sequencing artifacts. Key parameters to consider: --min_support: Adjust based on your WT sequencing coverage. --max_distance: Define proximity for merging variants. --min_length: Set a minimum SV size (e.g., >50 bp for transposons).
-
Clean and Filter the WT.vcf, Variant Filtering: Remove low-confidence variants based on read depth, quality scores, or allele frequency.
To ensure the WT.vcf only includes relevant transposons or SVs: Use bcftools or similar tools to filter out low-confidence variants: bcftools filter -e "QUAL < 20 || INFO/SVTYPE != 'INS'" WT.vcf > WT_filtered.vcf bcftools filter -e "QUAL < 1 || INFO/SVTYPE != 'INS'" 1_.vcf > 1_filtered_.vcf
-
NOTE that in this pipeline, the WT.fasta (reference genome) is typically a high-quality genome sequence from a database or a well-annotated version of your species’ genome. It is not assembled from the WT.fastq sequencing reads in this context. Here’s why:
Why Use a Reference Genome (WT.fasta) from a Database? Higher Quality and Completeness: Database references (e.g., NCBI, Ensembl) are typically well-assembled, highly polished, and annotated. They serve as a reliable baseline for variant detection. Consistency: Using a standard reference ensures consistent comparisons across your WT and samples (1–9). Variants detected will be relative to this reference, not influenced by possible assembly errors. Saves Time: Assembling a reference genome from WT reads requires significant computational effort. Using an existing reference streamlines the analysis. Alternative: Assembling WT from FASTQ If you don’t have a high-quality reference genome (WT.fasta) and must rely on your WT FASTQ reads: Assemble the genome from your WT.fastq: Use long-read assemblers like Flye, Canu, or Shasta to create a draft genome. flye --nano-raw WT.fastq --out-dir WT_assembly --genome-size
Polish the assembly using tools like Racon (with the same reads) or Medaka for higher accuracy. Use the assembled and polished genome as your WT.fasta reference for further steps. Key Takeaways: If you have access to a reliable, high-quality reference genome, use it as the WT.fasta. Only assemble WT.fasta from raw reads (WT.fastq) if no database reference is available for your organism. -
Annotate Transposable Elements: Tools like ANNOVAR or SnpEff provide functional insights into the detected variants.
# -- (successful!) MANUALLY Search for all found insertion sequences at https://tncentral.ncc.unesp.br/blast/ ! # Or use the program available at https://github.com/danillo-alvarenga/tncomp_finder if there are numerous matches. #https://tncentral.ncc.unesp.br/report/te/Tn551-Y13600.1 # -- (failed!) try TEPID for annotation mamba install tepid=0.10 -c bioconda #(tepid_env) for sample in WT 1 2 3 4 5 7 8 9 10; do tepid-map-se -x CP020463 -p 10 -n ${sample}_tepid -q ../batch1_depth25/trycycler_${sample}/reads.fastq; tepid-discover -k -i -p 10 -n ${sample}_tepid -c ${sample}_tepid --se; done tepid-discover -k -i -p 10 -n 1_tepid -c 1.sorted.bam --se; tepid-refine [-h] [--version] [-k] [-i INSERTIONS] [-d DELETIONS] [-p PROC] -t TE -n NAME -c CONC -s SPLIT -a AL # -- (failed!) try EDTA for annotation https://github.com/oushujun/EDTA (transposon_long) mamba install -c conda-forge -c bioconda edta mamba install bioconda::rmblast # cd RepeatMasker; ./configure EDTA.pl --genome CP020463.fasta --species others --threads 40 For general-purpose TE annotation: EDTA, RepeatMasker, or RepeatScout are your best options. For de novo repeat identification: RepeatScout is highly effective. For LTR retrotransposons: Use LTR_retriever. For bacterial-specific annotations: Transposome, TEfinder, and ISfinder can be useful.
-
Validation: Cross-validate with short-read sequencing data if available.
-
(Optional) Assembly the nanopore-sequencing using
1. merge and prepare fastq.gz 2. calcuclate the precentage of coding DNA: https://www.ncbi.nlm.nih.gov/nuccore/CP020463 with additional plasmid. * Average gene length: 870.4 bp * Total coding region length: 2056168 bp * Percentage of coding DNA = (Total Coding Region Length / Total Genome Length) × 100 = 2056168 bp / 2454929 bp = 83.8% 3. Prepare the fastq.gz conda activate /home/jhuang/miniconda3/envs/trycycler # under jhuang@hamm (10.169.63.113). rsync -a -P *.fastq.gz jhuang@10.169.63.113:/home/jhuang/DATA/Data_Patricia_Transposon_2025/ for sample in barcode01 barcode02 barcode03 barcode04 HD46_Ctrl HD46_5 HD46_6 HD46_7 HD46_8 HD46_13; do mkdir trycycler_${sample}; mv ${sample}.fastq.gz trycycler_${sample}/reads.fastq.gz; gunzip trycycler_${sample}/reads.fastq.gz; done #mkdir batch3; #cd batch3; #for sample in WT 1 2 3 4 5 7 8 9 10; do # mkdir trycycler_${sample}; # cp ${sample}_nanopore/${sample}.fastq.gz trycycler_${sample}/reads.fastq.gz; # gunzip trycycler_${sample}/reads.fastq.gz; #done 4. Running separate assemblies (6x4 times: canu, flye, miniasm_and_minipolish, necat, nextdenovo, raven) using trycycler_assembly_extra-thorough.sh (under the trycycler environment running the following steps) # batch1: min_read_cov=25; batch2: min_read_cov=50; batch3: min_read_cov=100 (if necessary on the monday!) cp ../Data_PaulBongarts_S.epidermidis_HDRNA/trycycler_*.sh ./ #for sample in trycycler_HDRNA_10 trycycler_HDRNA_13; do for sample in barcode01 barcode02 barcode03 barcode04 HD46_Ctrl HD46_5 HD46_6 HD46_7 HD46_8 HD46_13; do cd trycycler_${sample}; ../trycycler_assembly_extra-thorough.sh cd ..; done # END! #TODO: upload the nanopore-sequencing data to NCBI for the HDRNA_10 and HDRNA_13 to the correspoinding project. (bengal3_ac3) jhuang@WS-2290C:~/DATA/Data_PaulBongarts_S.epidermidis_HDRNA/trycycler_HDRNA_10/assemblies$ mlst assembly_02.fasta assembly_02.fasta sepidermidis 87 arcC(7) aroE(1) gtr(1) mutS(2) pyrR(2) tpiA(1) yqiL(1) [15:32:46] If you like MLST, you're absolutely going to love wgMLST! [15:32:46] Done. (bengal3_ac3) jhuang@WS-2290C:~/DATA/Data_PaulBongarts_S.epidermidis_HDRNA/trycycler_HDRNA_10/assemblies$ mlst assembly_20.fasta assembly_20.fasta sepidermidis 87 arcC(7) aroE(1) gtr(1) mutS(2) pyrR(2) tpiA(1) yqiL(1) [15:33:01] You can use --label XXX to replace an ugly filename in the output. [15:33:01] Done. (bengal3_ac3) jhuang@WS-2290C:/mnt/md1/DATA/Data_PaulBongarts_S.epidermidis_HDRNA/trycycler_HDRNA_13/assemblies$ mlst assembly_02.fasta assembly_02.fasta sepidermidis 5 arcC(1) aroE(1) gtr(1) mutS(2) pyrR(2) tpiA(1) yqiL(1) (bengal3_ac3) jhuang@WS-2290C:/mnt/md1/DATA/Data_PaulBongarts_S.epidermidis_HDRNA/trycycler_HDRNA_13/assemblies$ mlst assembly_08.fasta assembly_08.fasta sepidermidis 5 arcC(1) aroE(1) gtr(1) mutS(2) pyrR(2) tpiA(1) yqiL(1) [15:38:22] Remember that --minscore is only used when using automatic scheme detection. [15:38:22] Done. (bengal3_ac3) jhuang@WS-2290C:/mnt/md1/DATA/Data_PaulBongarts_S.epidermidis_HDRNA/trycycler_HDRNA_13/assemblies$ mlst assembly_14.fasta assembly_14.fasta sepidermidis 5 arcC(1) aroE(1) gtr(1) mutS(2) pyrR(2) tpiA(1) yqiL(1) [15:38:50] Thanks for using mlst, I hope you found it useful. [15:38:50] Done. (bengal3_ac3) jhuang@WS-2290C:/mnt/md1/DATA/Data_PaulBongarts_S.epidermidis_HDRNA/trycycler_HDRNA_13/assemblies$ mlst assembly_20.fasta assembly_20.fasta sepidermidis 5 arcC(1) aroE(1) gtr(1) mutS(2) pyrR(2) tpiA(1) yqiL(1) [15:38:54] If you like MLST, you're going to absolutely love cgMLST! [15:38:54] Done. #- under the directory batch3 #for sample in trycycler_WT trycycler_1 trycycler_2 trycycler_3 trycycler_4 trycycler_5 trycycler_7 trycycler_8 trycycler_9 trycycler_10; do # cd ${sample}; # ../trycycler_assembly_extra-thorough_threads5_cov100.sh; # cd ..; #done #if ERROR, running separate assembly-methods raven and canu #for sample in trycycler_WT trycycler_1 trycycler_2 trycycler_3 trycycler_4 trycycler_5 trycycler_7 trycycler_8 trycycler_9 trycycler_10; do # cd ${sample}; # ../trycycler_assembly_extra-thorough_raven.sh; # cd ..; #done # #for sample in trycycler_WT trycycler_1 trycycler_2 trycycler_3 trycycler_4 trycycler_5 trycycler_7 trycycler_8 trycycler_9 trycycler_10; do # cd ${sample}; # ../trycycler_assembly_extra-thorough_canu.sh; # cd ..; #done 5. trycycler cluster for sample in trycycler_5 trycycler_7 trycycler_8 trycycler_9 trycycler_10; do for sample in trycycler_WT trycycler_1 trycycler_2 trycycler_3 trycycler_4; do cd ${sample}; rm -rf trycycler trycycler cluster --threads 10 --assemblies assemblies/*.fasta --reads reads.fastq --out_dir trycycler; cd ..; done 6. trycycler reconcile cd trycycler_WT trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_001 mkdir trycycler/cluster_001/1_contigs_removed mv trycycler/cluster_001/1_contigs/C_utg000001c.fasta trycycler/cluster_001/1_contigs_removed mv trycycler/cluster_001/1_contigs/K_ctg000000.fasta trycycler/cluster_001/1_contigs_removed mv trycycler/cluster_001/1_contigs/S_tig00000001.fasta trycycler/cluster_001/1_contigs_removed trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_002 # -- under batch3 -- 176K Nov 5 12:38 P_bctg00000001.fasta 122K Nov 5 12:38 D_bctg00000001.fasta 95K Nov 5 12:38 J_bctg00000001.fasta 78K Nov 5 12:38 J_bctg00000002.fasta #--> Error: unable to find a suitable common sequence 69K Nov 5 17:24 G_tig00000004.fasta 66K Nov 5 17:24 S_tig00000004.fasta 66K Nov 5 17:24 M_tig00000004.fasta 65K Nov 5 17:24 A_tig00000005.fasta #--> Error: unable to find a suitable common sequence #If repeat M_tig00000004.fasta twice, resulting in the following error: Circularising M_tig00000004: # using S_tig00000004: # unable to circularise: M_tig00000004's start and end were found in multiple places in S_tig00000004 #Error: failed to circularise sequence M_tig00000004 because its start/end sequences were found in multiple ambiguous places in other sequences. #This is likely because M_tig00000004 starts/ends in a repetitive region. You can either manually repair its circularisation (and ensure it does #not start/end in a repetitive region) or exclude the sequence altogether and try again. 47K Nov 5 17:24 V_bctg00000001.fasta 45K Nov 5 17:24 C_utg000002c.fasta #Error: unable to find a suitable common sequence 34K Nov 5 17:24 U_utg000004c.fasta 34K Nov 5 17:24 N_contig_2.fasta 34K Nov 5 17:24 T_contig_2.fasta #Error: unable to find a suitable common sequence 23K Nov 5 17:24 I_utg000003c.fasta 22K Nov 5 17:24 B_contig_2.fasta 22K Nov 5 17:24 H_contig_2.fasta #Error: unable to find a suitable common sequence 13K Nov 5 17:24 G_tig00000006.fasta 12K Nov 5 17:24 M_tig00000003.fasta 12K Nov 5 17:24 O_utg000002c.fasta 12K Nov 5 17:24 S_tig00000003.fasta 11K Nov 5 17:24 A_tig00000004.fasta #trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_002 --max_length_diff 1.2 --max_trim_seq_percent 20.0 #--> SUCCESSFULLY saving sequences to file: trycycler/cluster_002/2_all_seqs.fasta trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_003 trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_004 --max_length_diff 1.3 --max_trim_seq_percent 30.0 trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_005 #Error: two or more input contigs are required trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_006 #Error: two or more input contigs are required trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_007 #Error: two or more input contigs are required cd ../trycycler_1 trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_001 mkdir trycycler/cluster_001/1_contigs_removed mv trycycler/cluster_001/1_contigs/I_utg000001l.fasta trycycler/cluster_001/1_contigs_removed mv trycycler/cluster_001/1_contigs/K_ctg000000.fasta trycycler/cluster_001/1_contigs_removed mv trycycler/cluster_001/1_contigs/S_tig00000001.fasta trycycler/cluster_001/1_contigs_removed mv trycycler/cluster_001/1_contigs/W_ctg000000.fasta trycycler/cluster_001/1_contigs_removed mv trycycler/cluster_001/1_contigs/A_tig00000001.fasta trycycler/cluster_001/1_contigs_removed mv trycycler/cluster_001/1_contigs/U_utg000001c.fasta trycycler/cluster_001/1_contigs_removed trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_002 #TODO_TOMORROW_HERE! trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_003 --max_length_diff 1.7 --max_trim_seq_percent 70.0 #Error: failed to circularise sequence I_utg000002l because its end could not be found in other sequences. (NOTE) using the plasmid from other isolate to help the cut of the sequence. #cp ~/DATA/Data_Patricia_Transposon/batch3/trycycler_WT/trycycler/cluster_004/7_final_consensus.fasta . #cat *.fasta > all.fasta #mafft --clustalout --adjustdirection all.fasta > all.aln cluster_004_con -------------------------------------------------------tatga I_utg000002l agaatcagaattaggcgcataatttacaggaggtttgattatggctaagaaaaaatatga O_utg000002l agaatcagaattaggcgcataatttacaggaggtttgattatggctaagaaaaaatatga ***** cluster_004_con gcagaatcagaattaggcgcataatttacaggaggtttgattatggctaagaaaaaa--- I_utg000002l gcagaatcagaattaggcgcataatttacaggaggtttgattatggctaagaaaaatatg O_utg000002l gcagaatcagaattaggcgcataatttacaggaggtttgattatggctaagaaaaatatg ******************************************************** grep "I_utg000002l" all.a_ > I_utg000002l.fasta grep "O_utg000002l" all.a_ > O_utg000002l.fasta cut -d' ' -f5 I_utg000002l.fasta > I_utg000002l_.fasta cut -d' ' -f5 O_utg000002l.fasta > O_utg000002l_.fasta sed 's/-//g' I_utg000002l_.fasta > I_utg000002l__.fasta sed 's/-//g' O_utg000002l_.fasta > O_utg000002l__.fasta seqtk seq I_utg000002l__.fasta -l 80 > I_utg000002l___.fasta seqtk seq O_utg000002l__.fasta -l 80 > O_utg000002l___.fasta #Cut two files Manually with the sequence: AGAATCAGAATTAGGCGCATAATTTACAGGAGGTTTGATTATGGCTAAGAAAAA cat I_utg000002l___.fasta O_utg000002l___.fasta > 2_all_seqs.fasta trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_004 #Error: two or more input contigs are required cd ../trycycler_2 trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_001 mkdir trycycler/cluster_001/1_contigs_removed mv trycycler/cluster_001/1_contigs/O_bctg00000000.fasta trycycler/cluster_001/1_contigs_removed trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_002 trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_003 trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_004 cd ../trycycler_3 trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_001 mkdir trycycler/cluster_001/1_contigs_removed mv trycycler/cluster_001/1_contigs/E_ctg000000.fasta trycycler/cluster_001/1_contigs_removed trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_002 trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_003 trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_004 cd ../trycycler_4 trycycler reconcile --threads 5 --reads reads.fastq --cluster_dir trycycler/cluster_001 mkdir trycycler/cluster_001/1_contigs_removed mv trycycler/cluster_001/1_contigs/C_utg000001l.fasta trycycler/cluster_001/1_contigs_removed mv trycycler/cluster_001/1_contigs/E_ctg000000.fasta trycycler/cluster_001/1_contigs_removed mv trycycler/cluster_001/1_contigs/F_Utg670.fasta trycycler/cluster_001/1_contigs_removed mv trycycler/cluster_001/1_contigs/V_bctg00000000.fasta trycycler/cluster_001/1_contigs_removed trycycler reconcile --threads 5 --reads reads.fastq --cluster_dir trycycler/cluster_002 trycycler reconcile --threads 5 --reads reads.fastq --cluster_dir trycycler/cluster_003 trycycler reconcile --threads 5 --reads reads.fastq --cluster_dir trycycler/cluster_004 cd ../trycycler_5 trycycler reconcile --threads 5 --reads reads.fastq --cluster_dir trycycler/cluster_001 trycycler reconcile --threads 5 --reads reads.fastq --cluster_dir trycycler/cluster_002 trycycler reconcile --threads 5 --reads reads.fastq --cluster_dir trycycler/cluster_003 trycycler reconcile --threads 5 --reads reads.fastq --cluster_dir trycycler/cluster_004 cd ../trycycler_7 trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_001 mkdir trycycler/cluster_001/1_contigs_removed mv trycycler/cluster_001/1_contigs/C_utg000001l.fasta trycycler/cluster_001/1_contigs_removed mv trycycler/cluster_001/1_contigs/D_bctg00000000.fasta trycycler/cluster_001/1_contigs_removed trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_002 trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_003 trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_004 trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_005 cd ../trycycler_8 trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_001 mkdir trycycler/cluster_001/1_contigs_removed mv trycycler/cluster_001/1_contigs/M_utg000001l.fasta trycycler/cluster_001/1_contigs_removed trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_002 trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_003 trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_004 trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_005 cd ../trycycler_9 trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_001 trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_002 trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_003 trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_004 trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_005 trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_006 cd ../trycycler_10 trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_001 trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_002 trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_003 trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_004 trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_005 #Nachmachen: for sample in trycycler_9 trycycler_10; do cd ${sample}; ../trycycler_assembly_extra-thorough_canu.sh; cd ..; done 7. (optional) map the circular assemblies to plasmid databases #bzip2 -d plsdb.fna.bz2 #makeblastdb -in plsdb.fna -dbtype nucl #blastn -db plsdb.fna -query all.fasta -evalue 1e-50 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 > all_on_plsdb.blastn 8. trycycler msa trycycler msa --threads 55 --cluster_dir trycycler/cluster_001 trycycler msa --threads 55 --cluster_dir trycycler/cluster_002 trycycler msa --threads 55 --cluster_dir trycycler/cluster_003 trycycler msa --threads 55 --cluster_dir trycycler/cluster_004 trycycler msa --threads 55 --cluster_dir trycycler/cluster_005 #--> When finished, Trycycler reconcile will make a 3_msa.fasta file in the cluster directory 9. trycycler partition #generate 4_reads.fastq for each contig! trycycler partition --threads 55 --reads reads.fastq --cluster_dirs trycycler/cluster_* #trycycler partition --threads 55 --reads reads.fastq --cluster_dirs trycycler/cluster_001 trycycler/cluster_002 trycycler/cluster_003 trycycler partition --threads 55 --reads reads.fastq --cluster_dirs trycycler/cluster_003 10. trycycler consensus trycycler consensus --threads 55 --cluster_dir trycycler/cluster_001 trycycler consensus --threads 55 --cluster_dir trycycler/cluster_002 trycycler consensus --threads 55 --cluster_dir trycycler/cluster_003 trycycler consensus --threads 55 --cluster_dir trycycler/cluster_004 trycycler consensus --threads 55 --cluster_dir trycycler/cluster_005 #!!NOTE that we take the isolates of HD05_2_K5 and HD05_2_K6 assembled by Unicycler instead of Trycycler!! # TODO (TODAY), generate the 3 datasets below! # TODO (IMPORTANT): write a Email to Holger, say the short sequencing of HD5_2 is not correct, since the 3 datasets! However, the MTxxxxxxx is confirmed not in K5 and K6! TODO: variant calling needs the short-sequencing, they are not dorable without the correct short-reads! resequencing? It is difficult to call variants only from long-reads since too much errors in long-reads! #TODO: check the MT sequence if in the isolates, more deteiled annotations come late! #II. Comparing the results of Trycycler with Unicycler. #III. Eventually add the plasmids assembled from unicycler to the final results. E.g. add the 4 plasmids to K5 and K6 11. Polishing after Trycycler #1. Oxford Nanopore sequencer (Ignored due to the samtools version incompatibility!) # for c in trycycler/cluster_*; do # medaka_consensus -i "$c"/4_reads.fastq -d "$c"/7_final_consensus.fasta -o "$c"/medaka -m r941_min_sup_g507 -t 12 # mv "$c"/medaka/consensus.fasta "$c"/8_medaka.fasta # rm -r "$c"/medaka "$c"/*.fai "$c"/*.mmi # clean up # done # cat trycycler/cluster_*/8_medaka.fasta > trycycler/consensus.fasta #2. Short-read polishing #---- 5179_R1 (2) ---- # mean read depth: 205.8x # 188 bp have a depth of zero (99.9924% coverage) # 355 positions changed (0.0144% of total positions) # estimated pre-polishing sequence accuracy: 99.9856% (Q38.42) #Step 1: read QC fastp --in1 ../../s-epidermidis-5179-r1_R1.fastq.gz --in2 ../../s-epidermidis-5179-r1_R2.fastq.gz --out1 1.fastq.gz --out2 2.fastq.gz --unpaired1 u.fastq.gz --unpaired2 u.fastq.gz #Step 2: Polypolish for cluster in cluster_001 cluster_002; do bwa index ${cluster}/7_final_consensus.fasta bwa mem -t 55 -a ${cluster}/7_final_consensus.fasta 1.fastq.gz > ${cluster}/alignments_1.sam bwa mem -t 55 -a ${cluster}/7_final_consensus.fasta 2.fastq.gz > ${cluster}/alignments_2.sam polypolish polish ${cluster}/7_final_consensus.fasta ${cluster}/alignments_1.sam ${cluster}/alignments_2.sam > ${cluster}/polypolish.fasta done #Step 3: POLCA for cluster in cluster_001 cluster_002; do cd ${cluster} polca.sh -a polypolish.fasta -r "../../../s-epidermidis-5179-r1_R1.fastq.gz ../../../s-epidermidis-5179-r1_R2.fastq.gz" -t 55 -m 120G cd .. done #Substitution Errors: 37 #Insertion/Deletion Errors: 2 #Assembly Size: 2470001 #Consensus Quality: 99.9984 #Substitution Errors: 4 #Insertion/Deletion Errors: 0 #Assembly Size: 17748 #Consensus Quality: 99.9775 #Step 4: (optional) more rounds and/or other polishers #After one round of Polypolish and one round of POLCA, your assembly should be in very good shape! #However, there may still be a few lingering errors. You can try running additional rounds of Polypolish or POLCA to see if they make any more changes. for cluster in cluster_001 cluster_002; do cd ${cluster} polca.sh -a polypolish.fasta.PolcaCorrected.fa -r "../../../s-epidermidis-5179-r1_R1.fastq.gz ../../../s-epidermidis-5179-r1_R2.fastq.gz" -t 55 -m 120G cd .. done Substitution Errors: 13 Insertion/Deletion Errors: 0 Assembly Size: 2470004 Consensus Quality: 99.9995 Substitution Errors: 0 Insertion/Deletion Errors: 0 Assembly Size: 17748 Consensus Quality: 100 for cluster in cluster_001; do cd ${cluster} polca.sh -a polypolish.fasta.PolcaCorrected.fa.PolcaCorrected.fa -r "../../../s-epidermidis-5179-r1_R1.fastq.gz ../../../s-epidermidis-5179-r1_R2.fastq.gz" -t 55 -m 120G cd .. done #Substitution Errors: 0 #Insertion/Deletion Errors: 0 #Assembly Size: 2470004 #Consensus Quality: 100 #---- 1585 (4) ---- # mean read depth: 174.7x # 8,297 bp have a depth of zero (99.6604% coverage) # 271 positions changed (0.0111% of total positions) # estimated pre-polishing sequence accuracy: 99.9889% (Q39.55) #Step 1: read QC fastp --in1 ../../s-epidermidis-1585_R1.fastq.gz --in2 ../../s-epidermidis-1585_R2.fastq.gz --out1 1.fastq.gz --out2 2.fastq.gz --unpaired1 u.fastq.gz --unpaired2 u.fastq.gz #Step 2: Polypolish for cluster in cluster_001 cluster_002 cluster_003 cluster_004; do bwa index ${cluster}/7_final_consensus.fasta bwa mem -t 55 -a ${cluster}/7_final_consensus.fasta 1.fastq.gz > ${cluster}/alignments_1.sam bwa mem -t 55 -a ${cluster}/7_final_consensus.fasta 2.fastq.gz > ${cluster}/alignments_2.sam polypolish polish ${cluster}/7_final_consensus.fasta ${cluster}/alignments_1.sam ${cluster}/alignments_2.sam > ${cluster}/polypolish.fasta done #Step 3: POLCA for cluster in cluster_001 cluster_002 cluster_003 cluster_004; do cd ${cluster} polca.sh -a polypolish.fasta -r "../../../s-epidermidis-1585_R1.fastq.gz ../../../s-epidermidis-1585_R2.fastq.gz" -t 55 -m 120G cd .. done #Substitution Errors: 7 #Insertion/Deletion Errors: 4 #Assembly Size: 2443174 #Consensus Quality: 99.9995 #Substitution Errors: 0 #Insertion/Deletion Errors: 0 #Assembly Size: 9014 #Consensus Quality: 100 #Substitution Errors: 0 #Insertion/Deletion Errors: 0 #Assembly Size: 9014 #Consensus Quality: 100 #Substitution Errors: 0 #Insertion/Deletion Errors: 0 #Assembly Size: 2344 #Consensus Quality: 100 #Step 4: (optional) more rounds and/or other polishers #After one round of Polypolish and one round of POLCA, your assembly should be in very good shape! #However, there may still be a few lingering errors. You can try running additional rounds of Polypolish or POLCA to see if they make any more changes. for cluster in cluster_001; do cd ${cluster} polca.sh -a polypolish.fasta.PolcaCorrected.fa -r "../../../s-epidermidis-1585_R1.fastq.gz ../../../s-epidermidis-1585_R2.fastq.gz" -t 55 -m 120G cd .. done #Substitution Errors: 0 #Insertion/Deletion Errors: 0 #Assembly Size: 2443176 #Consensus Quality: 100 #---- 1585 derived from unicycler, under 1585_normal/unicycler (4) ---- #Step 0: copy chrom and plasmid1, plasmid2, plasmid3 to cluster_001/7_final_consensus.fasta, ... #Step 1: read QC fastp --in1 ../../s-epidermidis-1585_R1.fastq.gz --in2 ../../s-epidermidis-1585_R2.fastq.gz --out1 1.fastq.gz --out2 2.fastq.gz --unpaired1 u.fastq.gz --unpaired2 u.fastq.gz #Step 2: Polypolish for cluster in cluster_001 cluster_002 cluster_003 cluster_004; do bwa index ${cluster}/7_final_consensus.fasta bwa mem -t 55 -a ${cluster}/7_final_consensus.fasta 1.fastq.gz > ${cluster}/alignments_1.sam bwa mem -t 55 -a ${cluster}/7_final_consensus.fasta 2.fastq.gz > ${cluster}/alignments_2.sam polypolish polish ${cluster}/7_final_consensus.fasta ${cluster}/alignments_1.sam ${cluster}/alignments_2.sam > ${cluster}/polypolish.fasta done #Polishing 1 (2,443,574 bp): #mean read depth: 174.7x #8,298 bp have a depth of zero (99.6604% coverage) #52 positions changed (0.0021% of total positions) #estimated pre-polishing sequence accuracy: 99.9979% (Q46.72) #Polishing 2 (9,014 bp): #mean read depth: 766.5x #3 bp have a depth of zero (99.9667% coverage) #0 positions changed (0.0000% of total positions) #estimated pre-polishing sequence accuracy: 100.0000% (Q∞) #Polishing 7 (2,344 bp): #mean read depth: 2893.0x #4 bp have a depth of zero (99.8294% coverage) #0 positions changed (0.0000% of total positions) #estimated pre-polishing sequence accuracy: 100.0000% (Q∞) #Polishing 8 (2,255 bp): #mean read depth: 2719.6x #4 bp have a depth of zero (99.8226% coverage) #0 positions changed (0.0000% of total positions) #estimated pre-polishing sequence accuracy: 100.0000% (Q∞) #Step 3: POLCA for cluster in cluster_001 cluster_002 cluster_003 cluster_004; do cd ${cluster} polca.sh -a polypolish.fasta -r "../../../s-epidermidis-1585_R1.fastq.gz ../../../s-epidermidis-1585_R2.fastq.gz" -t 55 -m 120G cd .. done #Substitution Errors: 7 #Insertion/Deletion Errors: 4 #Assembly Size: 2443598 #Consensus Quality: 99.9995 #Substitution Errors: 0 #Insertion/Deletion Errors: 0 #Assembly Size: 9014 #Consensus Quality: 100 #Substitution Errors: 0 #Insertion/Deletion Errors: 0 #Assembly Size: 2344 #Consensus Quality: 100 #Substitution Errors: 0 #Insertion/Deletion Errors: 0 #Assembly Size: 2255 #Consensus Quality: 100 #Step 4: (optional) more rounds and/or other polishers #After one round of Polypolish and one round of POLCA, your assembly should be in very good shape! #However, there may still be a few lingering errors. You can try running additional rounds of Polypolish or POLCA to see if they make any more changes. for cluster in cluster_001; do cd ${cluster} polca.sh -a polypolish.fasta.PolcaCorrected.fa -r "../../../s-epidermidis-1585_R1.fastq.gz ../../../s-epidermidis-1585_R2.fastq.gz" -t 55 -m 120G cd .. done #Substitution Errors: 0 #Insertion/Deletion Errors: 0 #Assembly Size: 2443600 #Consensus Quality: 100 #-- 1585v (1, no short reads, waiting) -- # TODO! #-- 5179 (2) -- #mean read depth: 120.7x #7,547 bp have a depth of zero (99.6946% coverage) #356 positions changed (0.0144% of total positions) #estimated pre-polishing sequence accuracy: 99.9856% (Q38.41) #Step 1: read QC fastp --in1 ../../s-epidermidis-5179_R1.fastq.gz --in2 ../../s-epidermidis-5179_R2.fastq.gz --out1 1.fastq.gz --out2 2.fastq.gz --unpaired1 u.fastq.gz --unpaired2 u.fastq.gz #Step 2: Polypolish for cluster in cluster_001 cluster_002; do bwa index ${cluster}/7_final_consensus.fasta bwa mem -t 55 -a ${cluster}/7_final_consensus.fasta 1.fastq.gz > ${cluster}/alignments_1.sam bwa mem -t 55 -a ${cluster}/7_final_consensus.fasta 2.fastq.gz > ${cluster}/alignments_2.sam polypolish polish ${cluster}/7_final_consensus.fasta ${cluster}/alignments_1.sam ${cluster}/alignments_2.sam > ${cluster}/polypolish.fasta done #Step 3: POLCA for cluster in cluster_001 cluster_002; do cd ${cluster} polca.sh -a polypolish.fasta -r "../../../s-epidermidis-5179_R1.fastq.gz ../../../s-epidermidis-5179_R2.fastq.gz" -t 55 -m 120G cd .. done #Substitution Errors: 49 #Insertion/Deletion Errors: 23 #Assembly Size: 2471418 #Consensus Quality: 99.9971 #Substitution Errors: 0 #Insertion/Deletion Errors: 0 #Assembly Size: 17748 #Consensus Quality: 100 #Step 4: (optional) more rounds POLCA for cluster in cluster_001; do cd ${cluster} polca.sh -a polypolish.fasta.PolcaCorrected.fa -r "../../../s-epidermidis-5179_R1.fastq.gz ../../../s-epidermidis-5179_R2.fastq.gz" -t 55 -m 120G cd .. done #Substitution Errors: 10 #Insertion/Deletion Errors: 5 #Assembly Size: 2471442 #Consensus Quality: 99.9994 for cluster in cluster_001; do cd ${cluster} polca.sh -a polypolish.fasta.PolcaCorrected.fa.PolcaCorrected.fa -r "../../../s-epidermidis-5179_R1.fastq.gz ../../../s-epidermidis-5179_R2.fastq.gz" -t 55 -m 120G cd .. done Substitution Errors: 6 Insertion/Deletion Errors: 0 Assembly Size: 2471445 Consensus Quality: 99.9998 for cluster in cluster_001; do cd ${cluster} polca.sh -a polypolish.fasta.PolcaCorrected.fa.PolcaCorrected.fa.PolcaCorrected.fa -r "../../../s-epidermidis-5179_R1.fastq.gz ../../../s-epidermidis-5179_R2.fastq.gz" -t 55 -m 120G cd .. done Substitution Errors: 0 Insertion/Deletion Errors: 0 Assembly Size: 2471445 Consensus Quality: 100 #-- HD5_2 (2): without the short-sequencing we cannot correct the base-calling! -- # !ERROR to be REPORTED, the #Polishing cluster_001_consensus (2,504,140 bp): #mean read depth: 94.4x #240,420 bp have a depth of zero (90.3991% coverage) #56,894 positions changed (2.2720% of total positions) #estimated pre-polishing sequence accuracy: 97.7280% (Q16.44) /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_1_S37_R1_001.fastq /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_1_S37_R2_001.fastq /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_2_S38_R1_001.fastq /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_2_S38_R2_001.fastq /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_3_S39_R1_001.fastq /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_3_S39_R2_001.fastq /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_4_S40_R1_001.fastq /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_4_S40_R2_001.fastq /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_5_S41_R1_001.fastq /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_5_S41_R2_001.fastq /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_6_S42_R1_001.fastq /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_6_S42_R2_001.fastq /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_7_S43_R1_001.fastq /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_7_S43_R2_001.fastq /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_8_S44_R1_001.fastq /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_8_S44_R2_001.fastq /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_9_S45_R1_001.fastq /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_9_S45_R2_001.fastq /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_10_S46_R1_001.fastq /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_10_S46_R2_001.fastq #Step 1: read QC fastp --in1 ../../HD5_2_S38_R1_001.fastq.gz --in2 ../../HD5_2_S38_R2_001.fastq.gz --out1 1.fastq.gz --out2 2.fastq.gz --unpaired1 u.fastq.gz --unpaired2 u.fastq.gz # NOTE that the following steps are not run since the short-reads are not correct! # #Step 2: Polypolish # for cluster in cluster_001 cluster_005; do # bwa index ${cluster}/7_final_consensus.fasta # bwa mem -t 55 -a ${cluster}/7_final_consensus.fasta 1.fastq.gz > ${cluster}/alignments_1.sam # bwa mem -t 55 -a ${cluster}/7_final_consensus.fasta 2.fastq.gz > ${cluster}/alignments_2.sam # polypolish polish ${cluster}/7_final_consensus.fasta ${cluster}/alignments_1.sam ${cluster}/alignments_2.sam > ${cluster}/polypolish.fasta # done # #Step 3: POLCA # for cluster in cluster_001 cluster_005; do # cd ${cluster} # polca.sh -a polypolish.fasta -r "../../../HD5_2_S38_R1_001.fastq.gz ../../../HD5_2_S38_R2_001.fastq.gz" -t 55 -m 120G # cd .. # done # #Step 4: (optional) more rounds POLCA # for cluster in cluster_001; do # cd ${cluster} # polca.sh -a polypolish.fasta.PolcaCorrected.fa -r "../../../HD5_2_S38_R1_001.fastq.gz ../../../HD5_2_S38_R2_001.fastq.gz" -t 55 -m 120G # cd .. # done # NOTE that the plasmids of HD5_2_K5 and HD5_2_K6 were copied from Unicycler! #-- HD5_2_K5 (4) -- mean read depth: 87.1x 25 bp have a depth of zero (99.9990% coverage) 1,085 positions changed (0.0433% of total positions) estimated pre-polishing sequence accuracy: 99.9567% (Q33.63) #Step 1: read QC fastp --in1 ../../275_K5_Holger_S92_R1_001.fastq.gz --in2 ../../275_K5_Holger_S92_R2_001.fastq.gz --out1 1.fastq.gz --out2 2.fastq.gz --unpaired1 u.fastq.gz --unpaired2 u.fastq.gz #Step 2: Polypolish for cluster in cluster_001 cluster_002 cluster_003 cluster_004; do bwa index ${cluster}/7_final_consensus.fasta bwa mem -t 55 -a ${cluster}/7_final_consensus.fasta 1.fastq.gz > ${cluster}/alignments_1.sam bwa mem -t 55 -a ${cluster}/7_final_consensus.fasta 2.fastq.gz > ${cluster}/alignments_2.sam polypolish polish ${cluster}/7_final_consensus.fasta ${cluster}/alignments_1.sam ${cluster}/alignments_2.sam > ${cluster}/polypolish.fasta done #Step 3: POLCA for cluster in cluster_001 cluster_002 cluster_003 cluster_004; do cd ${cluster} polca.sh -a polypolish.fasta -r "../../../275_K5_Holger_S92_R1_001.fastq.gz ../../../275_K5_Holger_S92_R2_001.fastq.gz" -t 55 -m 120G cd .. done #Substitution Errors: 146 #Insertion/Deletion Errors: 2 #Assembly Size: 2504401 #Consensus Quality: 99.9941 #Substitution Errors: 41 #Insertion/Deletion Errors: 0 #Assembly Size: 41288 #Consensus Quality: 99.9007 #Substitution Errors: 0 #Insertion/Deletion Errors: 0 #Assembly Size: 9191 #Consensus Quality: 100 #Substitution Errors: 0 #Insertion/Deletion Errors: 0 #Assembly Size: 2767 #Consensus Quality: 100 #Step 4: (optional) more rounds POLCA for cluster in cluster_001 cluster_002; do cd ${cluster} polca.sh -a polypolish.fasta.PolcaCorrected.fa -r "../../../275_K5_Holger_S92_R1_001.fastq.gz ../../../275_K5_Holger_S92_R2_001.fastq.gz" -t 55 -m 120G cd .. done #Substitution Errors: 41 #Insertion/Deletion Errors: 0 #Assembly Size: 2504401 #Consensus Quality: 99.9984 #Substitution Errors: 8 #Insertion/Deletion Errors: 0 #Assembly Size: 41288 #Consensus Quality: 99.9806 for cluster in cluster_001 cluster_002; do cd ${cluster} polca.sh -a polypolish.fasta.PolcaCorrected.fa.PolcaCorrected.fa -r "../../../275_K5_Holger_S92_R1_001.fastq.gz ../../../275_K5_Holger_S92_R2_001.fastq.gz" -t 55 -m 120G cd .. done #Substitution Errors: 8 #Insertion/Deletion Errors: 0 #Assembly Size: 2504401 #Consensus Quality: 99.9997 #Substitution Errors: 4 #Insertion/Deletion Errors: 0 #Assembly Size: 41288 #Consensus Quality: 99.9903 for cluster in cluster_001 cluster_002; do cd ${cluster} polca.sh -a polypolish.fasta.PolcaCorrected.fa.PolcaCorrected.fa.PolcaCorrected.fa -r "../../../275_K5_Holger_S92_R1_001.fastq.gz ../../../275_K5_Holger_S92_R2_001.fastq.gz" -t 55 -m 120G cd .. done #Substitution Errors: 8 #Insertion/Deletion Errors: 0 #Assembly Size: 2504401 #Consensus Quality: 99.9997 #Substitution Errors: 4 #Insertion/Deletion Errors: 0 #Assembly Size: 41288 #Consensus Quality: 99.9903 #-- HD5_2_K6 (4) -- #mean read depth: 116.7x #4 bp have a depth of zero (99.9998% coverage) #1,022 positions changed (0.0408% of total positions) #estimated pre-polishing sequence accuracy: 99.9592% (Q33.89) #Step 1: read QC fastp --in1 ../../276_K6_Holger_S95_R1_001.fastq.gz --in2 ../../276_K6_Holger_S95_R2_001.fastq.gz --out1 1.fastq.gz --out2 2.fastq.gz --unpaired1 u.fastq.gz --unpaired2 u.fastq.gz #Step 2: Polypolish for cluster in cluster_001 cluster_002 cluster_003 cluster_004; do bwa index ${cluster}/7_final_consensus.fasta bwa mem -t 55 -a ${cluster}/7_final_consensus.fasta 1.fastq.gz > ${cluster}/alignments_1.sam bwa mem -t 55 -a ${cluster}/7_final_consensus.fasta 2.fastq.gz > ${cluster}/alignments_2.sam polypolish polish ${cluster}/7_final_consensus.fasta ${cluster}/alignments_1.sam ${cluster}/alignments_2.sam > ${cluster}/polypolish.fasta done #Step 3: POLCA for cluster in cluster_001 cluster_002 cluster_003 cluster_004; do cd ${cluster} polca.sh -a polypolish.fasta -r "../../../276_K6_Holger_S95_R1_001.fastq.gz ../../../276_K6_Holger_S95_R2_001.fastq.gz" -t 55 -m 120G cd .. done #Substitution Errors: 164 #Insertion/Deletion Errors: 2 #Assembly Size: 2504398 #Consensus Quality: 99.9934 #Substitution Errors: 22 #Insertion/Deletion Errors: 0 #Assembly Size: 41288 #Consensus Quality: 99.9467 #Substitution Errors: 0 #Insertion/Deletion Errors: 0 #Assembly Size: 9191 #Consensus Quality: 100 #Substitution Errors: 0 #Insertion/Deletion Errors: 0 #Assembly Size: 2767 #Consensus Quality: 100 #Step 4: (optional) more rounds POLCA for cluster in cluster_001 cluster_002; do cd ${cluster} polca.sh -a polypolish.fasta.PolcaCorrected.fa -r "../../../276_K6_Holger_S95_R1_001.fastq.gz ../../../276_K6_Holger_S95_R2_001.fastq.gz" -t 55 -m 120G cd .. done #Substitution Errors: 32 #Insertion/Deletion Errors: 0 #Assembly Size: 2504400 #Consensus Quality: 99.9987 #Substitution Errors: 0 #Insertion/Deletion Errors: 0 #Assembly Size: 41288 #Consensus Quality: 100 for cluster in cluster_001; do cd ${cluster} polca.sh -a polypolish.fasta.PolcaCorrected.fa.PolcaCorrected.fa -r "../../../276_K6_Holger_S95_R1_001.fastq.gz ../../../276_K6_Holger_S95_R2_001.fastq.gz" -t 55 -m 120G cd .. done #Substitution Errors: 4 #Insertion/Deletion Errors: 0 #Assembly Size: 2504400 #Consensus Quality: 99.9998 for cluster in cluster_001; do cd ${cluster} polca.sh -a polypolish.fasta.PolcaCorrected.fa.PolcaCorrected.fa.PolcaCorrected.fa -r "../../../276_K6_Holger_S95_R1_001.fastq.gz ../../../276_K6_Holger_S95_R2_001.fastq.gz" -t 55 -m 120G cd .. done #Substitution Errors: 2 #Insertion/Deletion Errors: 0 #Assembly Size: 2504400 #Consensus Quality: 99.9999
-
(Optional) use platon to confirm the plasmid contigs: https://github.com/oschwengers/platon