1️⃣ Polca – A lightweight polishing tool from MaSuRCA that corrects small sequencing errors using short-read data. It efficiently fixes substitutions and small INDELs but is not ideal for large structural variations.
# Under the env (nextclade)
mamba install -c bioconda -c conda-forge masurca
#-- VZV_20S.assembly3-modify.fasta --
(nextclade) polca.sh -a ../viralngs/tmp/02_assembly/VZV_20S.assembly3-modify.fasta -r "VZV_20S_trimmed_P_1.fastq VZV_20S_trimmed_P_2.fastq" -t 40 -m 10G
#Round 1: 3 errors found
(nextclade) polca.sh -a VZV_20S.assembly3-modify.fasta.PolcaCorrected.fa -r "VZV_20S_trimmed_P_1.fastq VZV_20S_trimmed_P_2.fastq" -t 40 -m 10G
#Round 2: 0 errors found
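Polishing is repeated on the corrected output until POLCA reports no remaining errors. A minimal sketch of reading that count from POLCA's printed statistics, assuming the lines look like the "Substitution Errors Found:" and "Insertion/Deletion Errors Found:" lines logged later in this post. The function name polca_error_count is hypothetical.

```shell
# Sum the substitution and indel counts from POLCA statistics read on stdin.
# Assumes lines of the form "Substitution Errors Found: N" and
# "Insertion/Deletion Errors Found: N" (an optional leading "#" is tolerated).
polca_error_count() {
    awk -F': *' '
        /^#?Substitution Errors Found:/        { total += $2 }
        /^#?Insertion\/Deletion Errors Found:/ { total += $2 }
        END { print total + 0 }
    '
}
```

Feeding it a saved copy of the statistics gives one number per round; a result of 0 suggests that another round is unnecessary.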
2️⃣ Pilon – A more comprehensive short-read polishing tool that corrects SNPs, small INDELs, and some structural misassemblies. It works best with high-coverage Illumina reads and can iteratively improve assembly accuracy.
bwa index PCC1_VZV_20_2.assembly3-modify.fasta.PolcaCorrected.fa
bwa mem -t 40 PCC1_VZV_20_2.assembly3-modify.fasta.PolcaCorrected.fa \
PCC1_VZV_20_2_trimmed_P_1.fastq \
PCC1_VZV_20_2_trimmed_P_2.fastq \
> aln.sam
samtools view -bS aln.sam > aln.bam
samtools sort -o aln.sorted.bam aln.bam
samtools index aln.sorted.bam
(nextclade) pilon --genome PCC1_VZV_20_2.assembly3-modify.fasta.PolcaCorrected.fa \
--bam aln.sorted.bam --output polished --threads 80 --changes --fix indels
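Because the command passes --changes, Pilon should write a list of the edits it made next to the polished FASTA (with --output polished that would be polished.changes, one edit per line). A small sketch for counting those edits, so that Pilon rounds can be compared with the POLCA error counts; count_changes is a hypothetical helper, not part of Pilon.

```shell
count_changes() {
    # $1: path to a Pilon .changes file
    # Prints the number of non-empty lines, or 0 if the file is missing.
    if [ -f "$1" ]; then
        # grep -c exits non-zero when the count is 0, so mask the status.
        grep -c . "$1" || true
    else
        echo 0
    fi
}
```

Running count_changes polished.changes after each round shows whether Pilon is still editing the assembly.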
3️⃣ Medaka – A polishing tool specifically designed for Nanopore sequencing data. It uses a neural network to refine base calls and correct systematic errors in long-read assemblies.
Quality check using QUAST
#mamba install -c bioconda quast
#quast polished.fasta -r reference.fasta -o quast_output
Correcting assembly for Huang_Human_herpesvirus_3
./VZV_20S.fasta
./VZV_20c.fasta
./PCC1_VZV_20_1.fasta
./PCC1_VZV_20_2.fasta
./PCC1_VZV_20_5.fasta
./VZV_60S.fasta
./VZV_60c.fasta
./PCC1_VZV_60_1.fasta
./PCC1_VZV_60_4.fasta
./PCC1_VZV_60_6.fasta
#find . -name "*.assembly1-spades.fasta" | wc -l
#find . -name "*.assembly2-gapfilled.fasta" | wc -l
#find . -name "*.assembly3-modify.fasta" | wc -l
#find . -name "*.assembly4-refined.fasta" | wc -l
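The per-stage counts can also be collected in one loop. A sketch, assuming the four stage suffixes used in the find commands; count_stage_files is a hypothetical helper name.

```shell
count_stage_files() {
    # $1: directory to search (defaults to the current directory)
    local dir=${1:-.} stage n
    for stage in assembly1-spades assembly2-gapfilled assembly3-modify assembly4-refined; do
        # tr strips the padding that some wc implementations add
        n=$(find "$dir" -name "*.${stage}.fasta" | wc -l | tr -d ' ')
        echo "${stage} ${n}"
    done
}
```

Called without arguments it prints one "stage count" line per assembly stage for the current directory, which makes it easy to spot samples that stopped at assembly2-gapfilled.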
# Under the env (nextclade) and directory ~/DATA/Data_Huang_Human_herpesvirus_3/trimmed
mamba install -c bioconda -c conda-forge masurca
#-- VZV_20S.assembly3-modify.fasta --
(nextclade) polca.sh -a ../viralngs/tmp/02_assembly/VZV_20S.assembly3-modify.fasta -r "VZV_20S_trimmed_P_1.fastq VZV_20S_trimmed_P_2.fastq" -t 40 -m 10G
#Stats BEFORE polishing:
#Substitution Errors Found: 3
#Insertion/Deletion Errors Found: 0
#Assembly Size: 124870
#Consensus Quality Before Polishing: 99.9976
#Consensus QV Before Polishing: 46.19
(nextclade) polca.sh -a VZV_20S.assembly3-modify.fasta.PolcaCorrected.fa -r "VZV_20S_trimmed_P_1.fastq VZV_20S_trimmed_P_2.fastq" -t 40 -m 10G
#Stats BEFORE polishing:
#Substitution Errors Found: 0
#Insertion/Deletion Errors Found: 0
#Assembly Size: 124869
#Consensus Quality Before Polishing: 100
#Consensus QV Before Polishing: 100.00
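The reported figures are consistent with the usual Phred-style definitions: consensus quality = 100 * (1 - errors / assembly size) and QV = -10 * log10(errors / assembly size), with 100.00 printed when no errors are found. A small sketch that reproduces the QV from the error count and the assembly size; polca_qv is a hypothetical helper, and the zero-error branch simply mirrors the logged output rather than any documented POLCA behavior.

```shell
polca_qv() {
    # $1: total errors (substitutions + indels), $2: assembly size in bp
    awk -v e="$1" -v n="$2" 'BEGIN {
        if (e == 0) { printf "%.2f\n", 100; exit }   # mirrors the logged "100.00"
        printf "%.2f\n", -10 * log(e / n) / log(10)  # Phred-scaled error rate
    }'
}
```

For the first VZV_20S round, polca_qv 3 124870 prints 46.19, which matches the logged value.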
#-- VZV_20c.assembly3-modify.fasta --
polca.sh -a ../viralngs/tmp/02_assembly/VZV_20c.assembly3-modify.fasta -r "VZV_20c_trimmed_P_1.fastq VZV_20c_trimmed_P_2.fastq" -t 40 -m 10G
#Stats BEFORE polishing:
#Substitution Errors Found: 2
#Insertion/Deletion Errors Found: 2
#Assembly Size: 124872
#Consensus Quality Before Polishing: 99.9968
#Consensus QV Before Polishing: 44.94
polca.sh -a VZV_20c.assembly3-modify.fasta.PolcaCorrected.fa -r "VZV_20c_trimmed_P_1.fastq VZV_20c_trimmed_P_2.fastq" -t 40 -m 10G
#Stats BEFORE polishing:
#Substitution Errors Found: 0
#Insertion/Deletion Errors Found: 0
#Assembly Size: 124873
#Consensus Quality Before Polishing: 100
#Consensus QV Before Polishing: 100.00
#-- PCC1_VZV_20_1.assembly3-modify.fasta --
polca.sh -a ../viralngs/tmp/02_assembly/PCC1_VZV_20_1.assembly3-modify.fasta -r "PCC1_VZV_20_1_trimmed_P_1.fastq PCC1_VZV_20_1_trimmed_P_2.fastq" -t 40 -m 10G
#Stats BEFORE polishing:
#Substitution Errors Found: 1
#Insertion/Deletion Errors Found: 1
#Assembly Size: 124873
#Consensus Quality Before Polishing: 99.9984
#Consensus QV Before Polishing: 47.95
polca.sh -a PCC1_VZV_20_1.assembly3-modify.fasta.PolcaCorrected.fa -r "PCC1_VZV_20_1_trimmed_P_1.fastq PCC1_VZV_20_1_trimmed_P_2.fastq" -t 40 -m 10G
#Stats BEFORE polishing:
#Substitution Errors Found: 1
#Insertion/Deletion Errors Found: 0
#Assembly Size: 124873
#Consensus Quality Before Polishing: 99.9992
#Consensus QV Before Polishing: 50.96
polca.sh -a PCC1_VZV_20_1.assembly3-modify.fasta.PolcaCorrected.fa.PolcaCorrected.fa -r "PCC1_VZV_20_1_trimmed_P_1.fastq PCC1_VZV_20_1_trimmed_P_2.fastq" -t 40 -m 10G
#Stats BEFORE polishing:
#Substitution Errors Found: 0
#Insertion/Deletion Errors Found: 0
#Assembly Size: 124873
#Consensus Quality Before Polishing: 100
#Consensus QV Before Polishing: 100.00
#-- PCC1_VZV_20_2.assembly2-gapfilled.fasta --
polca.sh -a ../viralngs/tmp/02_assembly/PCC1_VZV_20_2.assembly2-gapfilled.fasta -r "PCC1_VZV_20_2_trimmed_P_1.fastq PCC1_VZV_20_2_trimmed_P_2.fastq" -t 40 -m 10G
#Stats BEFORE polishing:
#Substitution Errors Found: 1
#Insertion/Deletion Errors Found: 1
#Assembly Size: 124866
#Consensus Quality Before Polishing: 99.9984
#Consensus QV Before Polishing: 47.95
polca.sh -a PCC1_VZV_20_2.assembly2-gapfilled.fasta.PolcaCorrected.fa -r "PCC1_VZV_20_2_trimmed_P_1.fastq PCC1_VZV_20_2_trimmed_P_2.fastq" -t 40 -m 10G
#Stats BEFORE polishing:
#Substitution Errors Found: 0
#Insertion/Deletion Errors Found: 0
#Assembly Size: 124867
#Consensus Quality Before Polishing: 100
#Consensus QV Before Polishing: 100.00
#-- PCC1_VZV_20_5.assembly3-modify.fasta --
polca.sh -a ../viralngs/tmp/02_assembly/PCC1_VZV_20_5.assembly3-modify.fasta -r "PCC1_VZV_20_5_trimmed_P_1.fastq PCC1_VZV_20_5_trimmed_P_2.fastq" -t 40 -m 10G
#Stats BEFORE polishing:
#Substitution Errors Found: 1
#Insertion/Deletion Errors Found: 0
#Assembly Size: 124873
#Consensus Quality Before Polishing: 99.9992
#Consensus QV Before Polishing: 50.96
polca.sh -a PCC1_VZV_20_5.assembly3-modify.fasta.PolcaCorrected.fa -r "PCC1_VZV_20_5_trimmed_P_1.fastq PCC1_VZV_20_5_trimmed_P_2.fastq" -t 40 -m 10G
#Stats BEFORE polishing:
#Substitution Errors Found: 0
#Insertion/Deletion Errors Found: 0
#Assembly Size: 124874
#Consensus Quality Before Polishing: 100
#Consensus QV Before Polishing: 100.00
#-- VZV_60S.assembly3-modify.fasta --
polca.sh -a ../viralngs/tmp/02_assembly/VZV_60S.assembly3-modify.fasta -r "VZV_60S_trimmed_P_1.fastq VZV_60S_trimmed_P_2.fastq" -t 40 -m 10G
#Stats BEFORE polishing:
#Substitution Errors Found: 2
#Insertion/Deletion Errors Found: 0
#Assembly Size: 124873
#Consensus Quality Before Polishing: 99.9984
#Consensus QV Before Polishing: 47.95
polca.sh -a VZV_60S.assembly3-modify.fasta.PolcaCorrected.fa -r "VZV_60S_trimmed_P_1.fastq VZV_60S_trimmed_P_2.fastq" -t 40 -m 10G
#Stats BEFORE polishing:
#Substitution Errors Found: 1
#Insertion/Deletion Errors Found: 0
#Assembly Size: 124870
#Consensus Quality Before Polishing: 99.9992
#Consensus QV Before Polishing: 50.96
polca.sh -a VZV_60S.assembly3-modify.fasta.PolcaCorrected.fa.PolcaCorrected.fa -r "VZV_60S_trimmed_P_1.fastq VZV_60S_trimmed_P_2.fastq" -t 40 -m 10G
#Stats BEFORE polishing:
#Substitution Errors Found: 0
#Insertion/Deletion Errors Found: 0
#Assembly Size: 124870
#Consensus Quality Before Polishing: 100
#Consensus QV Before Polishing: 100.00
#-- VZV_60c.assembly2-gapfilled.fasta --
polca.sh -a ../viralngs/tmp/02_assembly/VZV_60c.assembly2-gapfilled.fasta -r "VZV_60c_trimmed_P_1.fastq VZV_60c_trimmed_P_2.fastq" -t 40 -m 10G
#Stats BEFORE polishing:
#Substitution Errors Found: 1
#Insertion/Deletion Errors Found: 0
#Assembly Size: 119660
#Consensus Quality Before Polishing: 99.9992
#Consensus QV Before Polishing: 50.78
polca.sh -a VZV_60c.assembly2-gapfilled.fasta.PolcaCorrected.fa -r "VZV_60c_trimmed_P_1.fastq VZV_60c_trimmed_P_2.fastq" -t 40 -m 10G
#Stats BEFORE polishing:
#Substitution Errors Found: 0
#Insertion/Deletion Errors Found: 0
#Assembly Size: 119660
#Consensus Quality Before Polishing: 100
#Consensus QV Before Polishing: 100.00
#-- PCC1_VZV_60_1.assembly3-modify.fasta --
polca.sh -a ../viralngs/tmp/02_assembly/PCC1_VZV_60_1.assembly3-modify.fasta -r "PCC1_VZV_60_1_trimmed_P_1.fastq PCC1_VZV_60_1_trimmed_P_2.fastq" -t 40 -m 10G
#Stats BEFORE polishing:
#Substitution Errors Found: 0
#Insertion/Deletion Errors Found: 0
#Assembly Size: 124843
#Consensus Quality Before Polishing: 100
#Consensus QV Before Polishing: 100.00
polca.sh -a PCC1_VZV_60_1.assembly3-modify.fasta.PolcaCorrected.fa -r "PCC1_VZV_60_1_trimmed_P_1.fastq PCC1_VZV_60_1_trimmed_P_2.fastq" -t 40 -m 10G
#Stats BEFORE polishing:
#Substitution Errors Found: 1
#Insertion/Deletion Errors Found: 0
#Assembly Size: 124839
#Consensus Quality Before Polishing: 99.9992
#Consensus QV Before Polishing: 50.96
polca.sh -a PCC1_VZV_60_1.assembly3-modify.fasta.PolcaCorrected.fa.PolcaCorrected.fa -r "PCC1_VZV_60_1_trimmed_P_1.fastq PCC1_VZV_60_1_trimmed_P_2.fastq" -t 40 -m 10G
#Stats BEFORE polishing:
#Substitution Errors Found: 0
#Insertion/Deletion Errors Found: 0
#Assembly Size: 124839
#Consensus Quality Before Polishing: 100
#Consensus QV Before Polishing: 100.00
#-- PCC1_VZV_60_4.assembly3-modify.fasta --
polca.sh -a ../viralngs/tmp/02_assembly/PCC1_VZV_60_4.assembly3-modify.fasta -r "PCC1_VZV_60_4_trimmed_P_1.fastq PCC1_VZV_60_4_trimmed_P_2.fastq" -t 40 -m 10G
#Stats BEFORE polishing:
#Substitution Errors Found: 1
#Insertion/Deletion Errors Found: 0
#Assembly Size: 124851
#Consensus Quality Before Polishing: 99.9992
#Consensus QV Before Polishing: 50.96
polca.sh -a PCC1_VZV_60_4.assembly3-modify.fasta.PolcaCorrected.fa -r "PCC1_VZV_60_4_trimmed_P_1.fastq PCC1_VZV_60_4_trimmed_P_2.fastq" -t 40 -m 10G
#Stats BEFORE polishing:
#Substitution Errors Found: 0
#Insertion/Deletion Errors Found: 0
#Assembly Size: 124851
#Consensus Quality Before Polishing: 100
#Consensus QV Before Polishing: 100.00
#-- PCC1_VZV_60_6.assembly3-modify.fasta --
polca.sh -a ../viralngs/tmp/02_assembly/PCC1_VZV_60_6.assembly3-modify.fasta -r "PCC1_VZV_60_6_trimmed_P_1.fastq PCC1_VZV_60_6_trimmed_P_2.fastq" -t 40 -m 10G
#Stats BEFORE polishing:
#Substitution Errors Found: 3
#Insertion/Deletion Errors Found: 0
#Assembly Size: 124873
#Consensus Quality Before Polishing: 99.9976
#Consensus QV Before Polishing: 46.19
polca.sh -a PCC1_VZV_60_6.assembly3-modify.fasta.PolcaCorrected.fa -r "PCC1_VZV_60_6_trimmed_P_1.fastq PCC1_VZV_60_6_trimmed_P_2.fastq" -t 40 -m 10G
#Stats BEFORE polishing:
#Substitution Errors Found: 0
#Insertion/Deletion Errors Found: 0
#Assembly Size: 124871
#Consensus Quality Before Polishing: 100
#Consensus QV Before Polishing: 100.00
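Each POLCA round appends .PolcaCorrected.fa to the base name of its input, which is why the final files end in two or three repeated suffixes. A sketch of a helper that builds the expected file name after a given number of rounds; polished_name is a hypothetical name.

```shell
polished_name() {
    # $1: base name of the assembly FASTA, $2: number of POLCA rounds run
    local name=$1 i=0
    while [ "$i" -lt "$2" ]; do
        name="${name}.PolcaCorrected.fa"
        i=$((i + 1))
    done
    printf '%s\n' "$name"
}
```

For example, polished_name PCC1_VZV_20_1.assembly3-modify.fasta 3 gives the name of the file produced by the third round for that sample.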
Multiple alignment of all corrected assemblies
cat ./20S_polished/VZV_20S.assembly3-modify.fasta.PolcaCorrected.fa.PolcaCorrected.fa ./20c_polished/VZV_20c.assembly3-modify.fasta.PolcaCorrected.fa.PolcaCorrected.fa ./20_1_polished/PCC1_VZV_20_1.assembly3-modify.fasta.PolcaCorrected.fa.PolcaCorrected.fa.PolcaCorrected.fa ./20_2_polished/PCC1_VZV_20_2.assembly2-gapfilled.fasta.PolcaCorrected.fa.PolcaCorrected.fa ./20_5_polished/PCC1_VZV_20_5.assembly3-modify.fasta.PolcaCorrected.fa.PolcaCorrected.fa > 20.fasta
cat ./60S_polished/VZV_60S.assembly3-modify.fasta.PolcaCorrected.fa.PolcaCorrected.fa.PolcaCorrected.fa ./60c_polished/VZV_60c.assembly2-gapfilled.fasta.PolcaCorrected.fa.PolcaCorrected.fa ./60_1_polished/PCC1_VZV_60_1.assembly3-modify.fasta.PolcaCorrected.fa.PolcaCorrected.fa.PolcaCorrected.fa ./60_4_polished/PCC1_VZV_60_4.assembly3-modify.fasta.PolcaCorrected.fa.PolcaCorrected.fa ./60_6_polished/PCC1_VZV_60_6.assembly3-modify.fasta.PolcaCorrected.fa.PolcaCorrected.fa > 60.fasta
mafft --clustalout 20.fasta > 20.aln
mafft --clustalout 60.fasta > 60.aln
grep "NC_001348.1_con" 20.aln > PCC1_VZV_20_2-1.fasta
seqtk seq PCC1_VZV_20_2-1.fasta -l 60 > PCC1_VZV_20_2.fasta
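Rows pulled from a CLUSTAL-format alignment with grep still carry the name column and any gap characters, so they need editing before they form a valid FASTA record. A sketch of doing the extraction in one step, assuming the aim is the ungapped sequence of a single alignment row; clustal_row_to_fasta is a hypothetical helper, and this is an alternative to, not a record of, the manual editing used here. The row name must match exactly as it appears in the alignment file.

```shell
clustal_row_to_fasta() {
    # $1: row name as written in the alignment, $2: CLUSTAL-format file
    awk -v id="$1" '
        $1 == id {
            gsub(/-/, "", $2)   # drop gap characters from this block
            seq = seq $2        # append the block to the growing sequence
        }
        END {
            print ">" id
            print seq
        }
    ' "$2"
}
```

The sequence is written on a single line; seqtk seq -l 60 can rewrap it as in the command above.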
polca.sh -a PCC1_VZV_20_2.fasta -r "PCC1_VZV_20_2_trimmed_P_1.fastq PCC1_VZV_20_2_trimmed_P_2.fastq" -t 40 -m 10G
#Stats BEFORE polishing:
#Substitution Errors Found: 0
#Insertion/Deletion Errors Found: 0
#Assembly Size: 124866
#Consensus Quality Before Polishing: 100
#Consensus QV Before Polishing: 100.00
cat ./20S_polished/VZV_20S.assembly3-modify.fasta.PolcaCorrected.fa.PolcaCorrected.fa ./20c_polished/VZV_20c.assembly3-modify.fasta.PolcaCorrected.fa.PolcaCorrected.fa ./20_1_polished/PCC1_VZV_20_1.assembly3-modify.fasta.PolcaCorrected.fa.PolcaCorrected.fa.PolcaCorrected.fa ./PCC1_VZV_20_2.fasta.PolcaCorrected.fa ./20_5_polished/PCC1_VZV_20_5.assembly3-modify.fasta.PolcaCorrected.fa.PolcaCorrected.fa > 20_round2.fasta
mafft --clustalout 20_round2.fasta > 20_round2.aln #--leavegappyregion
#(Optional) Delete "-1", set 2 positions of the lines "************"
python check_SNP_positions.py
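The check_SNP_positions.py script is not shown in this post. As a rough stand-in, the following sketch lists the 1-based alignment columns at which the sequences differ. It assumes an aligned FASTA file (MAFFT's default output, without --clustalout) rather than the CLUSTAL file written above, and variable_columns is a hypothetical helper that may differ from what the Python script does.

```shell
variable_columns() {
    # $1: aligned FASTA in which every record has the same aligned length
    awk '
        /^>/ { n++; next }                     # start of a new record
        { seq[n] = seq[n] toupper($0) }        # join wrapped sequence lines
        END {
            len = length(seq[1])
            for (i = 1; i <= len; i++) {
                c = substr(seq[1], i, 1)
                for (j = 2; j <= n; j++) {
                    # report the column once if any record differs from the first
                    if (substr(seq[j], i, 1) != c) { print i; break }
                }
            }
        }
    ' "$1"
}
```

Gap characters count as differences, so columns where only some sequences have an indel are reported as well.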
grep "NC_001348.1_con" 60.aln > VZV_60c-1.fasta
seqtk seq VZV_60c-1.fasta -l 60 > VZV_60c.fasta
polca.sh -a VZV_60c.fasta -r "VZV_60c_trimmed_P_1.fastq VZV_60c_trimmed_P_2.fastq" -t 40 -m 10G
#Stats BEFORE polishing:
#Substitution Errors Found: 0
#Insertion/Deletion Errors Found: 0
#Assembly Size: 119660
#Consensus Quality Before Polishing: 100
#Consensus QV Before Polishing: 100.00
cat ./60S_polished/VZV_60S.assembly3-modify.fasta.PolcaCorrected.fa.PolcaCorrected.fa.PolcaCorrected.fa VZV_60c.fasta.PolcaCorrected.fa ./60_1_polished/PCC1_VZV_60_1.assembly3-modify.fasta.PolcaCorrected.fa.PolcaCorrected.fa.PolcaCorrected.fa ./60_4_polished/PCC1_VZV_60_4.assembly3-modify.fasta.PolcaCorrected.fa.PolcaCorrected.fa ./60_6_polished/PCC1_VZV_60_6.assembly3-modify.fasta.PolcaCorrected.fa.PolcaCorrected.fa > 60_round2.fasta
mafft --clustalout 60_round2.fasta > 60_round2.aln
muscle -in 20_round2.fasta -out 20_round2.aln -clw
#(Optional) Delete "-1", set 2 positions of the lines "************"
python check_SNP_positions.py
grep "VZV_60c-1" 60_round2.aln > VZV_60c-2.fasta
seqtk seq VZV_60c-2.fasta -l 60 > VZV_60c-3.fasta
polca.sh -a VZV_60c-3.fasta -r "VZV_60c_trimmed_P_1.fastq VZV_60c_trimmed_P_2.fastq" -t 40 -m 10G
#Stats BEFORE polishing:
#Substitution Errors Found: 0
#Insertion/Deletion Errors Found: 0
#Assembly Size: 119660
#Consensus Quality Before Polishing: 100
#Consensus QV Before Polishing: 100.00
polca.sh -a VZV_60c-3.fasta.PolcaCorrected.fa -r "VZV_60c_trimmed_P_1.fastq VZV_60c_trimmed_P_2.fastq" -t 40 -m 10G
#Stats BEFORE polishing:
#Substitution Errors Found: 0
#Insertion/Deletion Errors Found: 0
#Assembly Size: 124896
#Consensus Quality Before Polishing: 100
#Consensus QV Before Polishing: 100.00
cat ./60S_polished/VZV_60S.assembly3-modify.fasta.PolcaCorrected.fa.PolcaCorrected.fa.PolcaCorrected.fa VZV_60c-3.fasta.PolcaCorrected.fa.PolcaCorrected.fa ./60_1_polished/PCC1_VZV_60_1.assembly3-modify.fasta.PolcaCorrected.fa.PolcaCorrected.fa.PolcaCorrected.fa ./60_4_polished/PCC1_VZV_60_4.assembly3-modify.fasta.PolcaCorrected.fa.PolcaCorrected.fa ./60_6_polished/PCC1_VZV_60_6.assembly3-modify.fasta.PolcaCorrected.fa.PolcaCorrected.fa > 60_round3.fasta
mafft --auto --op 3 --ep 0.1 --clustalout 60_round3.fasta > 60_round3.aln #--leavegappyregion
ulimit -s unlimited
muscle -in 60_round3.fasta -out 60_round3.aln -clw -maxiters 2
mamba install -c bioconda clustalo
clustalo -i 60_round3.fasta -o output.fasta --auto
mamba install -c bioconda t-coffee
t_coffee -seq input.fasta -outfile output.fasta
#(Optional) Delete "-1", set 2 positions of the lines "************"
python check_SNP_positions.py
1. Try a Different MAFFT Mode
The strategy that --auto selects may introduce too many gaps. Try --globalpair (G-INS-i) or --localpair (L-INS-i) with iterative refinement instead.
Command:
mafft --globalpair --maxiterate 1000 input.fasta > output.fasta
or
mafft --localpair --maxiterate 1000 input.fasta > output.fasta
--globalpair: Suited to sequences of similar length that align over their full length, such as closely related genomes.
--localpair: Better when the sequences share only partial homology, for example after recombination.
2. Increase Gap Open Penalty (--op)
The default gap opening penalty is low, leading to excessive gaps.
Try increasing it:
mafft --auto --op 3 input.fasta > output.fasta
Default is 1.53; higher values reduce gaps.
3. Reduce Gap Extension Penalty (--ep)
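Lowering --ep (default 0.123) makes it cheaper to extend an existing gap than to open a new one, so together with a higher --op MAFFT favors a few long gaps over many short ones.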
mafft --auto --op 3 --ep 0.1 input.fasta > output.fasta
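One way to compare these settings is to count the gap characters each run produces. A sketch, assuming the alignments are written as aligned FASTA (MAFFT's output when --clustalout is not given); count_gaps is a hypothetical helper.

```shell
count_gaps() {
    # $1: aligned FASTA file
    # gsub returns the number of "-" characters it replaced on each sequence line.
    awk '!/^>/ { total += gsub(/-/, "-") } END { print total + 0 }' "$1"
}
```

Running it on the output of two parameter sets shows which one yields the less gappy alignment, although a lower count alone does not prove that the alignment is better.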
cp ./20S_polished/VZV_20S.assembly3-modify.fasta.PolcaCorrected.fa.PolcaCorrected.fa toSend
cp ./20c_polished/VZV_20c.assembly3-modify.fasta.PolcaCorrected.fa.PolcaCorrected.fa toSend
cp ./20_1_polished/PCC1_VZV_20_1.assembly3-modify.fasta.PolcaCorrected.fa.PolcaCorrected.fa.PolcaCorrected.fa toSend
cp ./PCC1_VZV_20_2.fasta.PolcaCorrected.fa toSend
cp ./20_5_polished/PCC1_VZV_20_5.assembly3-modify.fasta.PolcaCorrected.fa.PolcaCorrected.fa toSend
cp ./60S_polished/VZV_60S.assembly3-modify.fasta.PolcaCorrected.fa.PolcaCorrected.fa.PolcaCorrected.fa toSend
cp VZV_60c-3.fasta.PolcaCorrected.fa.PolcaCorrected.fa toSend
cp ./60_1_polished/PCC1_VZV_60_1.assembly3-modify.fasta.PolcaCorrected.fa.PolcaCorrected.fa.PolcaCorrected.fa toSend
cp ./60_4_polished/PCC1_VZV_60_4.assembly3-modify.fasta.PolcaCorrected.fa.PolcaCorrected.fa toSend
cp ./60_6_polished/PCC1_VZV_60_6.assembly3-modify.fasta.PolcaCorrected.fa.PolcaCorrected.fa toSend
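Note that cp FILE toSend only collects the files if toSend already exists as a directory; otherwise each command overwrites a regular file of that name. A sketch that creates the directory first and then copies a list of files into it; collect_files is a hypothetical helper.

```shell
collect_files() {
    # $1: destination directory; remaining arguments: files to copy
    local dest=$1 f
    shift
    mkdir -p "$dest" || return 1
    for f in "$@"; do
        if [ -f "$f" ]; then
            cp -- "$f" "$dest"/
        else
            # report missing inputs instead of stopping half-way
            echo "missing: $f" >&2
        fi
    done
}
```

It would be called as collect_files toSend followed by the ten polished FASTA paths listed above.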
© 2023 XGenes.com Impressum