1️⃣ Polca – A lightweight polishing tool from MaSuRCA that corrects small sequencing errors using short-read data. It efficiently fixes substitutions and small INDELs but is not ideal for large structural variations.
# Under the env (nextclade)
mamba install -c bioconda -c conda-forge masurca
#-- VZV_20S.assembly3-modify.fasta --
(nextclade) polca.sh -a ../viralngs/tmp/02_assembly/VZV_20S.assembly3-modify.fasta -r "VZV_20S_trimmed_P_1.fastq VZV_20S_trimmed_P_2.fastq" -t 40 -m 10G
# 3 errors found in the first round
(nextclade) polca.sh -a VZV_20S.assembly3-modify.fasta.PolcaCorrected.fa -r "VZV_20S_trimmed_P_1.fastq VZV_20S_trimmed_P_2.fastq" -t 40 -m 10G
# 0 errors found in the second round
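The number noted after each run above is the error count POLCA reported for that round. A minimal sketch of pulling that total out of POLCA's stats text, assuming the line labels match the output quoted further down in these notes. The function name `polca_total_errors` is hypothetical, and the sample text is the first-round VZV_20c stats:

```shell
# Sum substitution and indel errors from POLCA stats text read on stdin.
# Assumes the labels "Substitution Errors Found:" and
# "Insertion/Deletion Errors Found:" as quoted in these notes.
polca_total_errors() {
    awk -F': *' '
        /Substitution Errors Found/        { n += $2 }
        /Insertion\/Deletion Errors Found/ { n += $2 }
        END { print n + 0 }
    '
}

# Example input: first-round stats for VZV_20c
sample_stats='Stats BEFORE polishing:
Substitution Errors Found: 2
Insertion/Deletion Errors Found: 2
Assembly Size: 124872'
printf '%s\n' "$sample_stats" | polca_total_errors
```

A total of 0 is the signal to stop; otherwise polca.sh is run again on the `.PolcaCorrected.fa` file from the previous round.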
2️⃣ Pilon – A more comprehensive short-read polishing tool that corrects SNPs, small INDELs, and some structural misassemblies. It works best with high-coverage Illumina reads and can iteratively improve assembly accuracy.
bwa index PCC1_VZV_20_2.assembly3-modify.fasta.PolcaCorrected.fa
bwa mem -t 40 PCC1_VZV_20_2.assembly3-modify.fasta.PolcaCorrected.fa \
PCC1_VZV_20_2_trimmed_P_1.fastq \
PCC1_VZV_20_2_trimmed_P_2.fastq \
> aln.sam
samtools view -bS aln.sam > aln.bam
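# samtools >= 1.0 takes the output file via -o (the old "sort in.bam out_prefix" form is no longer accepted)
samtools sort -o aln.sorted.bam aln.bam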
samtools index aln.sorted.bam
(nextclade) pilon --genome PCC1_VZV_20_2.assembly3-modify.fasta.PolcaCorrected.fa \
--bam aln.sorted.bam --output polished --threads 80 --changes --fix indels
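Because Pilon is run here with `--changes`, it should write a file listing each edit it made, one per line, which gives a simple stopping rule when iterating: re-map the reads and re-run Pilon until no changes are reported. A minimal sketch, assuming the file is named `<output>.changes` (so `polished.changes` for the command above). The helper name `pilon_change_count` is hypothetical:

```shell
# Print the number of edits listed in a Pilon .changes file
# (0 if the file is missing or empty).
pilon_change_count() {
    changes_file=$1
    if [ -s "$changes_file" ]; then
        # Count non-blank lines; each line describes one edit.
        grep -c . "$changes_file"
    else
        echo 0
    fi
}

# Usage after the pilon command above:
#   n=$(pilon_change_count polished.changes)
#   [ "$n" -eq 0 ] && echo "Pilon made no further changes"
```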
3️⃣ Medaka – A polishing tool specifically designed for Nanopore sequencing data. It uses a neural network to refine base calls and correct systematic errors in long-read assemblies.
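No Nanopore reads were used in these notes, so the following is only an illustrative sketch of how a Medaka call is usually put together with `medaka_consensus`. The wrapper name `medaka_polish`, the file names, and the thread count are placeholders. The model given with `-m` has to match the basecaller that produced the reads, so it is left as an argument:

```shell
# Build (and optionally run) a medaka_consensus call.
# Arguments: reads, draft assembly, output directory, model name.
# Set DRY_RUN=1 to only print the command instead of running it.
medaka_polish() {
    reads=$1; draft=$2; outdir=$3; model=$4
    set -- medaka_consensus -i "$reads" -d "$draft" -o "$outdir" -t 8 -m "$model"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Usage (all names hypothetical):
#   DRY_RUN=1 medaka_polish nanopore_reads.fastq draft.fasta medaka_out <model>
```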
-
Quality check using QUAST
#mamba install -c bioconda quast
#quast polished.fasta -r reference.fasta -o quast_output
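QUAST writes its summary metrics to a tab-separated `report.tsv` inside the output directory, so single values can be pulled out to compare assemblies before and after polishing. A minimal sketch; the helper name `quast_metric` is hypothetical, and the metric label has to match the first column of the report exactly:

```shell
# Print the value of one metric from a QUAST report.tsv
# (the column of the first assembly).
quast_metric() {
    report=$1
    metric=$2
    awk -F'\t' -v m="$metric" '$1 == m { print $2; exit }' "$report"
}

# Usage, assuming quast was run with -o quast_output:
#   quast_metric quast_output/report.tsv "Total length"
#   quast_metric quast_output/report.tsv "# mismatches per 100 kbp"
```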
-
Correcting assembly for Huang_Human_herpesvirus_3
./VZV_20S.fasta ./VZV_20c.fasta ./PCC1_VZV_20_1.fasta ./PCC1_VZV_20_2.fasta ./PCC1_VZV_20_5.fasta
./VZV_60S.fasta ./VZV_60c.fasta ./PCC1_VZV_60_1.fasta ./PCC1_VZV_60_4.fasta ./PCC1_VZV_60_6.fasta
#find . -name "*.assembly1-spades.fasta" | wc -l
#find . -name "*.assembly2-gapfilled.fasta" | wc -l
#find . -name "*.assembly3-modify.fasta" | wc -l
#find . -name "*.assembly4-refined.fasta" | wc -l
# Under the env (nextclade) and directory ~/DATA/Data_Huang_Human_herpesvirus_3/trimmed
mamba install -c bioconda -c conda-forge masurca
# Each sample is polished repeatedly, feeding the previous .PolcaCorrected.fa back in, until POLCA finds 0 errors.
#-- VZV_20S.assembly3-modify.fasta --
(nextclade) polca.sh -a ../viralngs/tmp/02_assembly/VZV_20S.assembly3-modify.fasta -r "VZV_20S_trimmed_P_1.fastq VZV_20S_trimmed_P_2.fastq" -t 40 -m 10G
#Stats BEFORE polishing:
#Substitution Errors Found: 3
#Insertion/Deletion Errors Found: 0
#Assembly Size: 124870
#Consensus Quality Before Polishing: 99.9976
#Consensus QV Before Polishing: 46.19
(nextclade) polca.sh -a VZV_20S.assembly3-modify.fasta.PolcaCorrected.fa -r "VZV_20S_trimmed_P_1.fastq VZV_20S_trimmed_P_2.fastq" -t 40 -m 10G
#Stats BEFORE polishing:
#Substitution Errors Found: 0
#Insertion/Deletion Errors Found: 0
#Assembly Size: 124869
#Consensus Quality Before Polishing: 100
#Consensus QV Before Polishing: 100.00
#-- VZV_20c.assembly3-modify.fasta --
polca.sh -a ../viralngs/tmp/02_assembly/VZV_20c.assembly3-modify.fasta -r "VZV_20c_trimmed_P_1.fastq VZV_20c_trimmed_P_2.fastq" -t 40 -m 10G
#Stats BEFORE polishing:
#Substitution Errors Found: 2
#Insertion/Deletion Errors Found: 2
#Assembly Size: 124872
#Consensus Quality Before Polishing: 99.9968
#Consensus QV Before Polishing: 44.94
polca.sh -a VZV_20c.assembly3-modify.fasta.PolcaCorrected.fa -r "VZV_20c_trimmed_P_1.fastq VZV_20c_trimmed_P_2.fastq" -t 40 -m 10G
#Stats BEFORE polishing:
#Substitution Errors Found: 0
#Insertion/Deletion Errors Found: 0
#Assembly Size: 124873
#Consensus Quality Before Polishing: 100
#Consensus QV Before Polishing: 100.00
#-- PCC1_VZV_20_1.assembly3-modify.fasta --
polca.sh -a ../viralngs/tmp/02_assembly/PCC1_VZV_20_1.assembly3-modify.fasta -r "PCC1_VZV_20_1_trimmed_P_1.fastq PCC1_VZV_20_1_trimmed_P_2.fastq" -t 40 -m 10G
#Stats BEFORE polishing:
#Substitution Errors Found: 1
#Insertion/Deletion Errors Found: 1
#Assembly Size: 124873
#Consensus Quality Before Polishing: 99.9984
#Consensus QV Before Polishing: 47.95
polca.sh -a PCC1_VZV_20_1.assembly3-modify.fasta.PolcaCorrected.fa -r "PCC1_VZV_20_1_trimmed_P_1.fastq PCC1_VZV_20_1_trimmed_P_2.fastq" -t 40 -m 10G
#Stats BEFORE polishing:
#Substitution Errors Found: 1
#Insertion/Deletion Errors Found: 0
#Assembly Size: 124873
#Consensus Quality Before Polishing: 99.9992
#Consensus QV Before Polishing: 50.96
polca.sh -a PCC1_VZV_20_1.assembly3-modify.fasta.PolcaCorrected.fa.PolcaCorrected.fa -r "PCC1_VZV_20_1_trimmed_P_1.fastq PCC1_VZV_20_1_trimmed_P_2.fastq" -t 40 -m 10G
#Stats BEFORE polishing:
#Substitution Errors Found: 0
#Insertion/Deletion Errors Found: 0
#Assembly Size: 124873
#Consensus Quality Before Polishing: 100
#Consensus QV Before Polishing: 100.00
#-- PCC1_VZV_20_2.assembly2-gapfilled.fasta --
polca.sh -a ../viralngs/tmp/02_assembly/PCC1_VZV_20_2.assembly2-gapfilled.fasta -r "PCC1_VZV_20_2_trimmed_P_1.fastq PCC1_VZV_20_2_trimmed_P_2.fastq" -t 40 -m 10G
#Stats BEFORE polishing:
#Substitution Errors Found: 1
#Insertion/Deletion Errors Found: 1
#Assembly Size: 124866
#Consensus Quality Before Polishing: 99.9984
#Consensus QV Before Polishing: 47.95
polca.sh -a PCC1_VZV_20_2.assembly2-gapfilled.fasta.PolcaCorrected.fa -r "PCC1_VZV_20_2_trimmed_P_1.fastq PCC1_VZV_20_2_trimmed_P_2.fastq" -t 40 -m 10G
#Stats BEFORE polishing:
#Substitution Errors Found: 0
#Insertion/Deletion Errors Found: 0
#Assembly Size: 124867
#Consensus Quality Before Polishing: 100
#Consensus QV Before Polishing: 100.00
#-- PCC1_VZV_20_5.assembly3-modify.fasta --
polca.sh -a ../viralngs/tmp/02_assembly/PCC1_VZV_20_5.assembly3-modify.fasta -r "PCC1_VZV_20_5_trimmed_P_1.fastq PCC1_VZV_20_5_trimmed_P_2.fastq" -t 40 -m 10G
#Stats BEFORE polishing:
#Substitution Errors Found: 1
#Insertion/Deletion Errors Found: 0
#Assembly Size: 124873
#Consensus Quality Before Polishing: 99.9992
#Consensus QV Before Polishing: 50.96
polca.sh -a PCC1_VZV_20_5.assembly3-modify.fasta.PolcaCorrected.fa -r "PCC1_VZV_20_5_trimmed_P_1.fastq PCC1_VZV_20_5_trimmed_P_2.fastq" -t 40 -m 10G
#Stats BEFORE polishing:
#Substitution Errors Found: 0
#Insertion/Deletion Errors Found: 0
#Assembly Size: 124874
#Consensus Quality Before Polishing: 100
#Consensus QV Before Polishing: 100.00
#-- VZV_60S.assembly3-modify.fasta --
polca.sh -a ../viralngs/tmp/02_assembly/VZV_60S.assembly3-modify.fasta -r "VZV_60S_trimmed_P_1.fastq VZV_60S_trimmed_P_2.fastq" -t 40 -m 10G
#Stats BEFORE polishing:
#Substitution Errors Found: 2
#Insertion/Deletion Errors Found: 0
#Assembly Size: 124873
#Consensus Quality Before Polishing: 99.9984
#Consensus QV Before Polishing: 47.95
polca.sh -a VZV_60S.assembly3-modify.fasta.PolcaCorrected.fa -r "VZV_60S_trimmed_P_1.fastq VZV_60S_trimmed_P_2.fastq" -t 40 -m 10G
#Stats BEFORE polishing:
#Substitution Errors Found: 1
#Insertion/Deletion Errors Found: 0
#Assembly Size: 124870
#Consensus Quality Before Polishing: 99.9992
#Consensus QV Before Polishing: 50.96
polca.sh -a VZV_60S.assembly3-modify.fasta.PolcaCorrected.fa.PolcaCorrected.fa -r "VZV_60S_trimmed_P_1.fastq VZV_60S_trimmed_P_2.fastq" -t 40 -m 10G
#Stats BEFORE polishing:
#Substitution Errors Found: 0
#Insertion/Deletion Errors Found: 0
#Assembly Size: 124870
#Consensus Quality Before Polishing: 100
#Consensus QV Before Polishing: 100.00
#-- VZV_60c.assembly2-gapfilled.fasta --
polca.sh -a ../viralngs/tmp/02_assembly/VZV_60c.assembly2-gapfilled.fasta -r "VZV_60c_trimmed_P_1.fastq VZV_60c_trimmed_P_2.fastq" -t 40 -m 10G
#Stats BEFORE polishing:
#Substitution Errors Found: 1
#Insertion/Deletion Errors Found: 0
#Assembly Size: 119660
#Consensus Quality Before Polishing: 99.9992
#Consensus QV Before Polishing: 50.78
polca.sh -a VZV_60c.assembly2-gapfilled.fasta.PolcaCorrected.fa -r "VZV_60c_trimmed_P_1.fastq VZV_60c_trimmed_P_2.fastq" -t 40 -m 10G
#Stats BEFORE polishing:
#Substitution Errors Found: 0
#Insertion/Deletion Errors Found: 0
#Assembly Size: 119660
#Consensus Quality Before Polishing: 100
#Consensus QV Before Polishing: 100.00
#-- PCC1_VZV_60_1.assembly3-modify.fasta --
polca.sh -a ../viralngs/tmp/02_assembly/PCC1_VZV_60_1.assembly3-modify.fasta -r "PCC1_VZV_60_1_trimmed_P_1.fastq PCC1_VZV_60_1_trimmed_P_2.fastq" -t 40 -m 10G
#Stats BEFORE polishing:
#Substitution Errors Found: 0
#Insertion/Deletion Errors Found: 0
#Assembly Size: 124843
#Consensus Quality Before Polishing: 100
#Consensus QV Before Polishing: 100.00
polca.sh -a PCC1_VZV_60_1.assembly3-modify.fasta.PolcaCorrected.fa -r "PCC1_VZV_60_1_trimmed_P_1.fastq PCC1_VZV_60_1_trimmed_P_2.fastq" -t 40 -m 10G
#Stats BEFORE polishing:
#Substitution Errors Found: 1
#Insertion/Deletion Errors Found: 0
#Assembly Size: 124839
#Consensus Quality Before Polishing: 99.9992
#Consensus QV Before Polishing: 50.96
polca.sh -a PCC1_VZV_60_1.assembly3-modify.fasta.PolcaCorrected.fa.PolcaCorrected.fa -r "PCC1_VZV_60_1_trimmed_P_1.fastq PCC1_VZV_60_1_trimmed_P_2.fastq" -t 40 -m 10G
#Stats BEFORE polishing:
#Substitution Errors Found: 0
#Insertion/Deletion Errors Found: 0
#Assembly Size: 124839
#Consensus Quality Before Polishing: 100
#Consensus QV Before Polishing: 100.00
#-- PCC1_VZV_60_4.assembly3-modify.fasta --
polca.sh -a ../viralngs/tmp/02_assembly/PCC1_VZV_60_4.assembly3-modify.fasta -r "PCC1_VZV_60_4_trimmed_P_1.fastq PCC1_VZV_60_4_trimmed_P_2.fastq" -t 40 -m 10G
#Stats BEFORE polishing:
#Substitution Errors Found: 1
#Insertion/Deletion Errors Found: 0
#Assembly Size: 124851
#Consensus Quality Before Polishing: 99.9992
#Consensus QV Before Polishing: 50.96
polca.sh -a PCC1_VZV_60_4.assembly3-modify.fasta.PolcaCorrected.fa -r "PCC1_VZV_60_4_trimmed_P_1.fastq PCC1_VZV_60_4_trimmed_P_2.fastq" -t 40 -m 10G
#Stats BEFORE polishing:
#Substitution Errors Found: 0
#Insertion/Deletion Errors Found: 0
#Assembly Size: 124851
#Consensus Quality Before Polishing: 100
#Consensus QV Before Polishing: 100.00
#-- PCC1_VZV_60_6.assembly3-modify.fasta --
polca.sh -a ../viralngs/tmp/02_assembly/PCC1_VZV_60_6.assembly3-modify.fasta -r "PCC1_VZV_60_6_trimmed_P_1.fastq PCC1_VZV_60_6_trimmed_P_2.fastq" -t 40 -m 10G
#Stats BEFORE polishing:
#Substitution Errors Found: 3
#Insertion/Deletion Errors Found: 0
#Assembly Size: 124873
#Consensus Quality Before Polishing: 99.9976
#Consensus QV Before Polishing: 46.19
polca.sh -a PCC1_VZV_60_6.assembly3-modify.fasta.PolcaCorrected.fa -r "PCC1_VZV_60_6_trimmed_P_1.fastq PCC1_VZV_60_6_trimmed_P_2.fastq" -t 40 -m 10G
#Stats BEFORE polishing:
#Substitution Errors Found: 0
#Insertion/Deletion Errors Found: 0
#Assembly Size: 124871
#Consensus Quality Before Polishing: 100
#Consensus QV Before Polishing: 100.00
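The QV values in the runs above are consistent with a Phred-scaled error rate, QV = -10 * log10(errors / assembly size), with 100.00 shown when no errors are found. A small sketch that recomputes QV from the error count and assembly size; the function name `polca_qv` is hypothetical, and the formula is inferred from the reported numbers rather than taken from POLCA's source:

```shell
# Phred-scaled consensus quality from an error count and an assembly size.
# Prints 100.00 for zero errors, matching what POLCA shows in that case.
polca_qv() {
    awk -v e="$1" -v n="$2" 'BEGIN {
        if (e == 0) { printf "%.2f\n", 100; exit }
        printf "%.2f\n", -10 * log(e / n) / log(10)
    }'
}

polca_qv 3 124870   # VZV_20S, first round (3 substitutions)
polca_qv 4 124872   # VZV_20c, first round (2 substitutions + 2 indels)
polca_qv 0 124869   # VZV_20S, second round
```

This makes it easy to sanity-check a reported QV, or to compare samples of different length on the same scale.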
-
Multiple alignment of all corrected assemblies
cat ./20S_polished/VZV_20S.assembly3-modify.fasta.PolcaCorrected.fa.PolcaCorrected.fa \
./20c_polished/VZV_20c.assembly3-modify.fasta.PolcaCorrected.fa.PolcaCorrected.fa \
./20_1_polished/PCC1_VZV_20_1.assembly3-modify.fasta.PolcaCorrected.fa.PolcaCorrected.fa.PolcaCorrected.fa \
./20_2_polished/PCC1_VZV_20_2.assembly2-gapfilled.fasta.PolcaCorrected.fa.PolcaCorrected.fa \
./20_5_polished/PCC1_VZV_20_5.assembly3-modify.fasta.PolcaCorrected.fa.PolcaCorrected.fa \
> 20.fasta
cat ./60S_polished/VZV_60S.assembly3-modify.fasta.PolcaCorrected.fa.PolcaCorrected.fa.PolcaCorrected.fa \
./60c_polished/VZV_60c.assembly2-gapfilled.fasta.PolcaCorrected.fa.PolcaCorrected.fa \
./60_1_polished/PCC1_VZV_60_1.assembly3-modify.fasta.PolcaCorrected.fa.PolcaCorrected.fa.PolcaCorrected.fa \
./60_4_polished/PCC1_VZV_60_4.assembly3-modify.fasta.PolcaCorrected.fa.PolcaCorrected.fa \
./60_6_polished/PCC1_VZV_60_6.assembly3-modify.fasta.PolcaCorrected.fa.PolcaCorrected.fa \
> 60.fasta
mafft --clustalout 20.fasta > 20.aln
mafft --clustalout 60.fasta > 60.aln
#-- Round 2 for the 20 group: re-polish PCC1_VZV_20_2 --
grep "NC_001348.1_con" 20.aln > PCC1_VZV_20_2-1.fasta
seqtk seq PCC1_VZV_20_2-1.fasta -l 60 > PCC1_VZV_20_2.fasta
polca.sh -a PCC1_VZV_20_2.fasta -r "PCC1_VZV_20_2_trimmed_P_1.fastq PCC1_VZV_20_2_trimmed_P_2.fastq" -t 40 -m 10G
#Stats BEFORE polishing:
#Substitution Errors Found: 0
#Insertion/Deletion Errors Found: 0
#Assembly Size: 124866
#Consensus Quality Before Polishing: 100
#Consensus QV Before Polishing: 100.00
cat ./20S_polished/VZV_20S.assembly3-modify.fasta.PolcaCorrected.fa.PolcaCorrected.fa \
./20c_polished/VZV_20c.assembly3-modify.fasta.PolcaCorrected.fa.PolcaCorrected.fa \
./20_1_polished/PCC1_VZV_20_1.assembly3-modify.fasta.PolcaCorrected.fa.PolcaCorrected.fa.PolcaCorrected.fa \
./PCC1_VZV_20_2.fasta.PolcaCorrected.fa \
./20_5_polished/PCC1_VZV_20_5.assembly3-modify.fasta.PolcaCorrected.fa.PolcaCorrected.fa \
> 20_round2.fasta
mafft --clustalout 20_round2.fasta > 20_round2.aln #--leavegappyregion
#(Optional) Delete "-1", set 2 positions of the lines "************"
python check_SNP_positions.py
#-- Round 2 for the 60 group: re-polish VZV_60c --
grep "NC_001348.1_con" 60.aln > VZV_60c-1.fasta
seqtk seq VZV_60c-1.fasta -l 60 > VZV_60c.fasta
polca.sh -a VZV_60c.fasta -r "VZV_60c_trimmed_P_1.fastq VZV_60c_trimmed_P_2.fastq" -t 40 -m 10G
#Stats BEFORE polishing:
#Substitution Errors Found: 0
#Insertion/Deletion Errors Found: 0
#Assembly Size: 119660
#Consensus Quality Before Polishing: 100
#Consensus QV Before Polishing: 100.00
cat ./60S_polished/VZV_60S.assembly3-modify.fasta.PolcaCorrected.fa.PolcaCorrected.fa.PolcaCorrected.fa \
VZV_60c.fasta.PolcaCorrected.fa \
./60_1_polished/PCC1_VZV_60_1.assembly3-modify.fasta.PolcaCorrected.fa.PolcaCorrected.fa.PolcaCorrected.fa \
./60_4_polished/PCC1_VZV_60_4.assembly3-modify.fasta.PolcaCorrected.fa.PolcaCorrected.fa \
./60_6_polished/PCC1_VZV_60_6.assembly3-modify.fasta.PolcaCorrected.fa.PolcaCorrected.fa \
> 60_round2.fasta
mafft --clustalout 60_round2.fasta > 60_round2.aln
muscle -in 20_round2.fasta -out 20_round2.aln -clw
#(Optional) Delete "-1", set 2 positions of the lines "************"
python check_SNP_positions.py
#-- Round 3 for the 60 group --
grep "VZV_60c-1" 60_round2.aln > VZV_60c-2.fasta
seqtk seq VZV_60c-2.fasta -l 60 > VZV_60c-3.fasta
polca.sh -a VZV_60c-3.fasta -r "VZV_60c_trimmed_P_1.fastq VZV_60c_trimmed_P_2.fastq" -t 40 -m 10G
#Stats BEFORE polishing:
#Substitution Errors Found: 0
#Insertion/Deletion Errors Found: 0
#Assembly Size: 119660
#Consensus Quality Before Polishing: 100
#Consensus QV Before Polishing: 100.00
polca.sh -a VZV_60c-3.fasta.PolcaCorrected.fa -r "VZV_60c_trimmed_P_1.fastq VZV_60c_trimmed_P_2.fastq" -t 40 -m 10G
#Stats BEFORE polishing:
#Substitution Errors Found: 0
#Insertion/Deletion Errors Found: 0
#Assembly Size: 124896
#Consensus Quality Before Polishing: 100
#Consensus QV Before Polishing: 100.00
cat ./60S_polished/VZV_60S.assembly3-modify.fasta.PolcaCorrected.fa.PolcaCorrected.fa.PolcaCorrected.fa \
VZV_60c-3.fasta.PolcaCorrected.fa.PolcaCorrected.fa \
./60_1_polished/PCC1_VZV_60_1.assembly3-modify.fasta.PolcaCorrected.fa.PolcaCorrected.fa.PolcaCorrected.fa \
./60_4_polished/PCC1_VZV_60_4.assembly3-modify.fasta.PolcaCorrected.fa.PolcaCorrected.fa \
./60_6_polished/PCC1_VZV_60_6.assembly3-modify.fasta.PolcaCorrected.fa.PolcaCorrected.fa \
> 60_round3.fasta
mafft --auto --op 3 --ep 0.1 --clustalout 60_round3.fasta > 60_round3.aln #--leavegappyregion
# Alternative aligners tried on the same input
ulimit -s unlimited
muscle -in 60_round3.fasta -out 60_round3.aln -clw -maxiters 2
mamba install -c bioconda clustalo
clustalo -i 60_round3.fasta -o output.fasta --auto
mamba install -c bioconda t-coffee
t_coffee -seq input.fasta -outfile output.fasta
#(Optional) Delete "-1", set 2 positions of the lines "************"
python check_SNP_positions.py
Options for reducing excessive gaps in the MAFFT alignment:
1. Try a Different MAFFT Mode
The --auto mode may introduce too many gaps. Use --globalpair or --localpair instead.
mafft --globalpair --maxiterate 1000 input.fasta > output.fasta
or
mafft --localpair --maxiterate 1000 input.fasta > output.fasta
--globalpair: For sequences that align over their full length, such as closely related genomes.
--localpair: Better if your sequences have recombination or only partial homology.
2. Increase Gap Open Penalty (--op)
The default gap opening penalty (1.53) can lead to excessive gaps. Higher values reduce the number of gaps opened:
mafft --auto --op 3 input.fasta > output.fasta
3. Reduce Gap Extension Penalty (--ep)
The default is 0.123. A lower --ep makes extending an existing gap cheaper, so combined with a higher --op it favors a few long gaps over many short, scattered ones.
mafft --auto --op 3 --ep 0.1 input.fasta > output.fasta
# Collect the final polished assemblies in toSend
cp ./20S_polished/VZV_20S.assembly3-modify.fasta.PolcaCorrected.fa.PolcaCorrected.fa toSend
cp ./20c_polished/VZV_20c.assembly3-modify.fasta.PolcaCorrected.fa.PolcaCorrected.fa toSend
cp ./20_1_polished/PCC1_VZV_20_1.assembly3-modify.fasta.PolcaCorrected.fa.PolcaCorrected.fa.PolcaCorrected.fa toSend
cp ./PCC1_VZV_20_2.fasta.PolcaCorrected.fa toSend
cp ./20_5_polished/PCC1_VZV_20_5.assembly3-modify.fasta.PolcaCorrected.fa.PolcaCorrected.fa toSend
cp ./60S_polished/VZV_60S.assembly3-modify.fasta.PolcaCorrected.fa.PolcaCorrected.fa.PolcaCorrected.fa toSend
cp VZV_60c-3.fasta.PolcaCorrected.fa.PolcaCorrected.fa toSend
cp ./60_1_polished/PCC1_VZV_60_1.assembly3-modify.fasta.PolcaCorrected.fa.PolcaCorrected.fa.PolcaCorrected.fa toSend
cp ./60_4_polished/PCC1_VZV_60_4.assembly3-modify.fasta.PolcaCorrected.fa.PolcaCorrected.fa toSend
cp ./60_6_polished/PCC1_VZV_60_6.assembly3-modify.fasta.PolcaCorrected.fa.PolcaCorrected.fa toSend
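To judge whether a change to `--op` or `--ep` in the MAFFT runs above actually reduced gaps, it helps to count gap characters per sequence in the resulting alignment. A minimal sketch for FASTA-format alignments (MAFFT's output when `--clustalout` is not given); the function name `count_gaps` and the output file name in the usage lines are hypothetical:

```shell
# Print "<sequence id><TAB><number of '-' characters>" for each record
# of an aligned FASTA file.
count_gaps() {
    awk '
        /^>/ {
            if (id != "") print id "\t" gaps
            id = substr($1, 2); gaps = 0
            next
        }
        { gaps += gsub(/-/, "-") }   # gsub returns the number of matches
        END { if (id != "") print id "\t" gaps }
    ' "$1"
}

# Usage:
#   mafft --auto --op 3 --ep 0.1 60_round3.fasta > 60_round3.op3.fasta
#   count_gaps 60_round3.op3.fasta
```

Running it on alignments made with different penalties gives a quick per-sample comparison before inspecting the SNP positions by eye.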