gene_x 0 like s 166 view s
Tags: processing, pipeline
The following data-cleaning strategies were applied before variant calling:
* Note that the intrahost results does not contain interhost variants.
* Using the two methods (freebayes and spandx for reporting interhost variant), using viral-ngs reporting intrahost variants.
* The interhost results is released in the point 4 and the intrahost results released in the step 13.
* Merge the two results, delete the items from intrahost variants if it occurs in the interhost tables.
* A records in intrahost table in which the frequency in an isolate >= 0.5 while in an other isolate < 0.5 should be in interhost table. If both are >= 0.5, or < 0.5 should be not in the interhost table.
* We can roughly check if the correctness of the intrahost variant results with the table from point 18 generated by "~/Scripts/check_sequence_differences.py aligned_1_.aln".
* At the end, we should have a interhost variant calling table + a intrahost varicant calling table in which the frequency varies between 0.05 and 0.5.
* Another control method: merge the two tables and sort according to the coordinate. Then compare the coordinate with results ~/Scripts/check_sequence_differences.py aligned_1_.aln. They should be similar.
* If a record occurs in the table from point 18, not in intrahost+interhost table, meaning the base is wrongly called during the assembly.
* The correction of the assembly in the step data/02_assembly and data/03_multialign_to_ref/aligned_1.fasta is actually not very critical, since if a wrongly called base, the intrahost will be a record >= 0.5 frequency.
* The error earlier: only report the intrahost variant calling, few interhost variant calling. The interhost variant calling will be found if they the bases in the assembly is wrongly called.
* !!!! IMPORTANT: Delete the last position of the alleles if there are three in the alleles in the intrahost Excel-table before releasing !!!!
#Report the interhost+intrahost results to Nicole.
Read Counts for BAM Files:
File Read Count
HSV1_S1.raw.bam 1,816,139 × 2
HSV1_S1.bmtagger_depleted.bam 1,750,387 × 2
HSV1_S1.rmdup.bam 1,278,873 × 2
HSV1_S1.cleaned.bam 664,544 × 2
HSV1_S1.taxfilt.bam 22,841 × 2
HSV1_S1.mapped.bam 131 × 2
HSV-Klinik_S2.raw.bam 2,709,058 × 2
HSV-Klinik_S2.bmtagger_depleted.bam 1,582,923 × 2
HSV-Klinik_S2.rmdup.bam 595,066 × 2
HSV-Klinik_S2.cleaned.bam 442,841 × 2
HSV-Klinik_S2.taxfilt.bam 400,301 × 2
HSV-Klinik_S2.mapped.bam 80,915 × 2
bin/taxon_filter.py deplete \
inBam=data/00_raw/HSV-Klinik_S2.bam \
revertBam=tmp/01_cleaned/HSV-Klinik_S2.raw.bam \
bmtaggerBam=tmp/01_cleaned/HSV-Klinik_S2.bmtagger_depleted.bam \
rmdupBam=tmp/01_cleaned/HSV-Klinik_S2.rmdup.bam \
blastnBam=data/01_cleaned/HSV-Klinik_S2.cleaned.bam \
bmtaggerDbs=['/home/jhuang/REFs/viral_ngs_dbs/bmtagger_dbs_remove/metagenomics_contaminants_v3', \
'/home/jhuang/REFs/viral_ngs_dbs/bmtagger_dbs_remove/GRCh37.68_ncRNA-GRCh37.68_transcripts-HS_rRNA_mitRNA', \
'/home/jhuang/REFs/viral_ngs_dbs/bmtagger_dbs_remove/hg19'] \
blastDbs=['/home/jhuang/REFs/viral_ngs_dbs/blast_dbs_remove/hybsel_probe_adapters', \
'/home/jhuang/REFs/viral_ngs_dbs/blast_dbs_remove/metag_v3.ncRNA.mRNA.mitRNA.consensus'] \
srprism_memory=14250 \
chunkSize=1000000 \
clear_tags=False \
tags_to_clear=['XT', 'X0', 'X1', 'XA', 'AM', 'SM', 'BQ', 'CT', 'XN', 'OC', 'OP'] \
JVMmemory=50g \
threads=120 \
loglevel=INFO \
tmp_dir=/tmp \
tmp_dirKeep=False
inBam Input BAM file.
revertBam Output BAM: read markup reverted with Picard.
bwaBam Output BAM: depleted of reads with BWA.
bmtaggerBam Output BAM: depleted of reads with BMTagger.
rmdupBam Output BAM: bmtaggerBam run through M-Vicuna duplicate removal.
blastnBam Output BAM: rmdupBam run through another depletion of reads with BLASTN.
last876
Using bengal3_ac3 pipeline to get trimmed reads and snippy interhost variants (for virus, it does not work!)
#using the env bengal3_ac3
conda activate bengal3_ac3
mkdir interhost_variants; cd interhost_variants
#prepare scritps
cp /home/jhuang/Tools/bacto/bacto-0.1.json .
cp /home/jhuang/Tools/bacto/cluster.json .
cp /home/jhuang/Tools/bacto/Snakefile .
ln -s /home/jhuang/Tools/bacto/local .
ln -s /home/jhuang/Tools/bacto/db .
ln -s /home/jhuang/Tools/bacto/envs .
#preparing raw_data
mkdir raw_data; cd raw_data
ln -s ~/DATA/Data_Nicole_CaptureProbeSequencing/20241028_FS10003086_74_BTR67801-2217/Alignment_Imported_1/20241029_175539/Fastq/HSV1_S1_L001_R1_001.fastq.gz HSV1_S1_R1.fastq.gz
ln -s ~/DATA/Data_Nicole_CaptureProbeSequencing/20241028_FS10003086_74_BTR67801-2217/Alignment_Imported_1/20241029_175539/Fastq/HSV1_S1_L001_R2_001.fastq.gz HSV1_S1_R2.fastq.gz
ln -s ~/DATA/Data_Nicole_CaptureProbeSequencing/20241028_FS10003086_74_BTR67801-2217/Alignment_Imported_1/20241029_175539/Fastq/HSV-Klinik_S2_L001_R1_001.fastq.gz HSV-Klinik_S2_R1.fastq.gz
ln -s ~/DATA/Data_Nicole_CaptureProbeSequencing/20241028_FS10003086_74_BTR67801-2217/Alignment_Imported_1/20241029_175539/Fastq/HSV-Klinik_S2_L001_R2_001.fastq.gz HSV-Klinik_S2_R2.fastq.gz
#ln -s ~/DATA/Data_Nicole_CaptureProbeSequencing/20241028_FS10003086_74_BTR67801-2217/Alignment_Imported_1/20241029_175539/Fastq/NTC_S3_L001_R1_001.fastq.gz NTC_S3_R1.fastq.gz
#ln -s ~/DATA/Data_Nicole_CaptureProbeSequencing/20241028_FS10003086_74_BTR67801-2217/Alignment_Imported_1/20241029_175539/Fastq/NTC_S3_L001_R2_001.fastq.gz NTC_S3_R2.fastq.gz
#preparing bacto-0.1.json.
"fastqc": false,
"taxonomic_classifier": false,
"assembly": false,
"typing_ariba": false,
"typing_mlst": false,
"pangenome": false,
"variants_calling": true,
"phylogeny_fasttree": true,
"phylogeny_raxml": true,
"recombination": true,
"genus": "Herpesvirus",
"kingdom": "Viruses",
"species": "human herpesvirus 1",
"species": "herpes"
"reference": "db/OP297860.gb",
(bengal3_ac3) /home/jhuang/miniconda3/envs/snakemake_4_3_1/bin/snakemake --printshellcmds
# --DEBUG_1 (Don't need to run the step by changing the configuration!)--
prokka --force --outdir prokka/HSV-Klinik_S2 --cpus 2 --usegenus --genus Herpesvirus --kingdom Viruses --species human herpesvirus 1 --addgenes --addmrna --prefix HSV-Klinik_S2 --locustag HSV-Klinik_S2 shovill/HSV-Klinik_S2/contigs.fa -hmm /media/jhuang/Titisee/GAMOLA2/TIGRfam_db/TIGRFAMs_15.0_HMM.LIB
# - using bakta instead due to the error during the prokka-running (bakta doesn't work due to too huge fasta-file)
bakta --db /mnt/nvme0n1p1/bakta_db shovill/HSV-Klinik_S2/contigs.fa --prefix HSV-Klinik_S2 --output prokka/HSV-Klinik_S2 --force
# ---- running directly freebayes as follows ----
cd data
mkdir 02_align_to_OP297860
../bin/read_utils.py align_and_fix 01_per_sample/HSV1_S1.cleaned.bam ../refsel_db/refsel.fasta --outBamAll 02_align_to_OP297860/HSV1_S1.bam --outBamFiltered 02_align_to_OP297860/HSV1_S1.mapped.bam --aligner novoalign --aligner_options '-r Random -l 20 -g 40 -x 20 -t 100 -k' --threads 120
../bin/read_utils.py align_and_fix 01_per_sample/HSV-Klinik_S2.cleaned.bam ../refsel_db/refsel.fasta --outBamAll 02_align_to_OP297860/HSV-Klinik_S2.bam --outBamFiltered 02_align_to_OP297860/HSV-Klinik_S2.mapped.bam --aligner novoalign --aligner_options '-r Random -l 20 -g 40 -x 20 -t 100 -k' --threads 120
b
samtools sort 02_align_to_OP297860/HSV-Klinik_S2.mapped.bam -o HSV-Klinik_S2_reads_aligned_sorted.bam
samtools index HSV-Klinik_S2_reads_aligned_sorted.bam
freebayes -f ../ref_genome/reference.fasta -i HSV-Klinik_S2_reads_aligned_sorted.bam --min-coverage 10 --min-alternate-count 3 --vcf freebayes_interhost_out.vcf
#CHROM POS ID REF ALT QUAL
OP297860.1 8885 . A G 7.37349e-13
OP297860.1 8895 . A G 6.00837e-05
OP297860.1 8956 . A G 339.579
OP297860.1 8991 . ATTGT CCTGC 3188.1
OP297860.1 9616 . C A 4.44801e-14
OP297860.1 12748 . C A 63475.5
#HSV-Klinik_S2-1 13203 A C 0.8466 snp 1.34479 C:1581:1060:1581:1060:1 A:21:15:21:15:1
* OP297860.1 13203 . T C 86820.7
OP297860.1 13755 . G A 107298
OP297860.1 14114 . C A 1.21987e-13
OP297860.1 46861 . T C 710.176
* OP297860.1 47109 . T G 9375.53
OP297860.1 47170 . G T 5942.86
OP297860.1 47182 . G A 6108.66
OP297860.1 47320 . A G 10275.4
OP297860.1 47377 . G T 972.379
OP297860.1 47516 . T C 257.388
OP297860.1 47563 . G A 372.177
OP297860.1 47660 . G A 438.692
OP297860.1 47707 . T C 3252.11
OP297860.1 47722 . A G 5343.39
OP297860.1 48064 . G A 21575.7
OP297860.1 48113 . C T 4284.1
OP297860.1 48129 . T C 1778.66
OP297860.1 48167 . T C 3316.44
OP297860.1 48219 . A C 6892.21
OP297860.1 48398 . C A 5.72805e-16
OP297860.1 53216 . G T 2031
OP297860.1 53298 . A G 465.154
OP297860.1 53423 . C T 5586.37
OP297860.1 54025 . A G 385.75
OP297860.1 54073 . G A 8463.94
OP297860.1 54408 . T G 2923.39
OP297860.1 54568 . G T 1391.08
OP297860.1 54708 . TG GA,TA 840.319
OP297860.1 54769 . G T 1.72979e-14
* OP297860.1 55501 . T C 33158.1
* OP297860.1 55807 . C A 0
OP297860.1 56493 . A G 39336.9
OP297860.1 56867 . C A 7.83521e-14
OP297860.1 57513 . C A 0
OP297860.1 58047 . A T 4.21917e-15
OP297860.1 58054 . C A 0
OP297860.1 58056 . ACCA TCCT 0
OP297860.1 58075 . ACTC GCTT 2947.03
OP297860.1 63377 . C A 0
OP297860.1 63393 . G T 1.39225e-14
OP297860.1 65179 . T C 7903.32
* OP297860.1 65225 . G A 13223.5
* OP297860.1 65402 . C A 1.53811e-13
OP297860.1 65992 . T C 25982.5
OP297860.1 66677 . G A 5.27367e-15
OP297860.1 67131 . C A 225.935
OP297860.1 67336 . G A 8.13698e-15
OP297860.1 94706 . C A 0
OP297860.1 94709 . G T 0
* OP297860.1 94750 . G T 0
OP297860.1 95750 . C A 2.89975e-08
OP297860.1 95990 . C A 0
OP297860.1 96070 . G T 0
OP297860.1 137360 . G T 0
OP297860.1 137373 . C A 0
OP297860.1 137527 . A T 4880.59
OP297860.1 137569 . C T 10142.1
OP297860.1 137602 . C A 19065.3
OP297860.1 137986 . A G 0
OP297860.1 138170 . T C 53588.3
OP297860.1 138343 . C T 7310.38
spandx varicant calling (see http://xgenes.com/article/article-content/314/call-and-merge-snp-and-indel-results-from-using-two-variant-calling-methods/)
mkdir ~/miniconda3/envs/spandx/share/snpeff-5.1-2/data/OP297860
#cp OP297860.gb ~/miniconda3/envs/spandx/share/snpeff-5.1-2/data/OP297860/genes.gbk
vim ~/miniconda3/envs/spandx/share/snpeff-5.1-2/snpEff.config
/home/jhuang/miniconda3/envs/spandx/bin/snpEff build OP297860 -d
#Protein check: OP297860 OK: 73 Not found: 0 Errors: 0 Error percentage: 0.0%
## -- try using gffs+fa to install the database, but failed --
#cp OP297860.gff3 ~/miniconda3/envs/spandx/share/snpeff-5.1-2/data/OP297860/genes.gff
#cp OP297860.fasta ~/miniconda3/envs/spandx/share/snpeff-5.1-2/data/OP297860//sequences.fa
#vim ~/miniconda3/envs/spandx/share/snpeff-5.1-2/snpEff.config
##OP297860.genome : Herpes_Simplex_Virus_1
#snpEff build -v OP297860 #-gff3
#cd /path/to/snpEff/data
#mkdir NC_001806
#cp NC_001806.gff3 NC_001806/genes.gff
#cp NC_001806.fa NC_001806/sequences.fa
##NC_001806.genome : Herpes_Simplex_Virus_1
##bcftools reheader -h new_header.vcf HSV-Klinik_S2.PASS.snps.vcf -o updated_vcf.vcf
##table_annovar <input.vcf> <humandb> -buildver <genome_version> -out <output_prefix> -protocol <protocol_list> -operation <operation_list>
cd trimmed
mv HSV1_S1_trimmed_P_1.fastq HSV1_S1_R1.fastq
mv HSV1_S1_trimmed_P_2.fastq HSV1_S1_R2.fastq
mv HSV-Klinik_S2_trimmed_P_1.fastq HSV-Klinik_S2_R1.fastq
mv HSV-Klinik_S2_trimmed_P_2.fastq HSV-Klinik_S2_R2.fastq
gzip *_R1.fastq *_R2.fastq
cp ref_genome/reference.fasta OP297860.fasta
#Clean the header to only retain the accession-id "OP297860.1"
ln -s /home/jhuang/Tools/spandx/ spandx
(spandx) nextflow run spandx/main.nf --fastq "trimmed/*_R{1,2}.fastq.gz" --ref OP297860.fasta --annotation --database OP297860 -resume
# -- DEBUG: All_SNPs_indels_annotated.txt is not correctly annotated, manually rerun snpeff-4.1l-8 and related steps --
## OPTION_1: copy the viral-ngs4 database to the spandx database, failed during the version difference --
#cp /home/jhuang/miniconda3/envs/viral-ngs4/share/snpeff-4.1l-8/data/1158c840951524dbd03a1a055a837d3828f6f29af1ec2771219e77c/genes.gbk .
##/home/jhuang/miniconda3/envs/spandx/bin/snpEff build OP297860 -d
# OPTION_2: run via interhost.py (SUCCESSFUL!)
#repeat the processing in spandx/bin/SNP_matrix.sh to generate All_SNPs_indels_annotated.txt, the snpEff step using we with 'bin/interhost.py snpEff' in the env viral-ngs4
cd work/f8/93141f3ef382d7ac9dd40def9c50ce (last directory sorted by timestamp)
#gatk VariantsToTable -V out.vcf -F CHROM -F POS -F REF -F ALT -F TYPE -GF GT -O out.vcf.table.all
#
##clean-up the out.vcf.table.all because GATK outputs A/A
#sed -i 's#|#/#g' out.vcf.table.all
#awk ' { for (i=6; i<=NF; i++) {
# if ($i == "A/A") $i="A";
# if ($i == "G/G") $i="G";
# if ($i == "C/C") $i="C";
# if ($i == "T/T") $i="T";
# if ($i == "*/*") $i="*";
# if ($i == "./.") $i=".";
# }};
# {print $0} ' out.vcf.table.all > out.vcf.table.all.tmp
#awk ' { for (i=6; i<=NF; i++) {
# if ($i ~ /\//) {
# split($i, a, "/");
# if (a[1] == a[2]) $i=a[1];
# }
# }
# };
# {print $0} ' out.vcf.table.all.tmp > out.vcf.table.all
# Switch the env to viral-ngs4 and manully run snpEff
#(viral-ngs4) jhuang@WS-2290C:~/DATA/Data_Nicole_CaptureProbeSequencing/work/ea/6f30cd5eed0efbbf3e3fe1ddfac0df$ snpEff eff -no-downstream -no-intergenic -ud 100 -formatEff -v 1158c840951524dbd03a1a055a837d3828f6f29af1ec2771219e77c out.vcf > out.annotated.vcf ##-debug
# Number of chromosomes : 1
# Chromosomes : Format 'chromo_name size codon_table'
# 'OP297860' 152526 Standard
#vim /home/jhuang/miniconda3/envs/viral-ngs4/share/snpeff-4.1l-8/snpEff.config
#1158c840951524dbd03a1a055a837d3828f6f29af1ec2771219e77c.chromosomes : OP297860
# Alternative snpEff calling with "bin/interhost.py snpEff"
#(viral-ngs4) jhuang@WS-2290C:~/DATA/Data_Nicole_CaptureProbeSequencing/work/ea/6f30cd5eed0efbbf3e3fe1ddfac0df$ ../../../bin/interhost.py snpEff out.filtered.vcf OP297860.1 out.annotated.vcf j.huang@uke.de --loglevel DEBUG
#remove headers from annotated vcf and out.vcf
grep -v '#' out.annotated.vcf > out.annotated.vcf.headerless
#grep -v '#' out.vcf > out.vcf.headerless
awk '{
if (match($0,"EFF=")){print substr($0,RSTART)}
else
print ""
}' out.annotated.vcf.headerless > effects
sed -i 's/EFF=//' effects
sed -i 's/(/ /g' effects
sed -i 's/|/ /g' effects
sed -i 's/UPSTREAM MODIFIER /UPSTREAM MODIFIER - /g' effects
cut -d " " -f -8 effects > effects.mrg
sed -i 's/ /\t/g' effects.mrg
rm effects
tail -n+2 out.vcf.table.all > out.vcf.table.all.headerless
sed -i 's/ /\t/g' out.vcf.table.all.headerless
paste out.vcf.table.all.headerless effects.mrg > out.vcf.headerless.plus.effects
head -n1 out.vcf.table.all | sed 's/.GT//g' > header.left
echo -e "Effect\tImpact\tFunctional_Class\tCodon_change\tProtein_and_nucleotide_change\tAmino_Acid_Length\tGene_name\tBiotype" > header.right
paste header.left header.right > header
cat header out.vcf.headerless.plus.effects > All_SNPs_indels_annotated.txt
echo "SPANDx has finished"
cp All_SNPs_indels_annotated.txt ../../../Outputs/Phylogeny_and_annotation/
merge the two variant calling
#Output: interhost_variants/snippy/summary_snps_indels.csv
python3 ~/Scripts/summarize_snippy_res.py interhost_variants/snippy #Note that although the ALT bases are wrong, but we only need the positions. We can use the results for downstream processing!
#Sort summary_snps_indels.csv according to the coordinate positions.
#merge the following two files summary_snps_indels.csv (70) and All_SNPs_indels_annotated.txt (819) --> merged_variants.csv (69)
python3 ~/Scripts/merge_snps_indels.py interhost_variants/snippy/summary_snps_indels.csv Outputs/Phylogeny_and_annotation/All_SNPs_indels_annotated.txt merged_variants.csv
#check if the number of the output file is correct?
comm -12 <(cut -d, -f2 interhost_variants/snippy/summary_snps_indels.csv | sort | uniq) <(cut -f2 Outputs/Phylogeny_and_annotation/All_SNPs_indels_annotated.txt | sort | uniq) | wc -l
comm -12 <(cut -d, -f2 interhost_variants/snippy/summary_snps_indels.csv | sort | uniq) <(cut -f2 Outputs/Phylogeny_and_annotation/All_SNPs_indels_annotated.txt | sort | uniq)
#The only difference is 58615
#Manually check the final results and delete some strange results and save merged_variants.csv as variants.xlsx
#sort interhost_index -u > interhost_index_sorted
#sort intrahost_index -u > intrahost_index_sorted
#comm interhost_index_sorted intrahost_index_sorted
# !!!! Manually checking intrahost records, if one record in a sample-group > 0.5, it should be a record interhost, look for if the records in the spandx-result. If the record is there, copy it to the interhost variant sheet!
The records in all records of intrahost variants should be always < 0.5, if a record is > 0.5, if should be in interhost variants. Delete all records from intrahost variants when a record > 0.5 and it is not occuring in All_SNPs_indels_annotated.txt !!!! Ausnahme ist the record such as 65225:
#OP297860 65225 G A SNP G G/A intragenic_variant MODIFIER n.65225G>A UL30
#OP297860 65225 HSV1_S1 HSV1_S1 G,A 0 intragenic_variant n.65225G>A UL30 Gene_63070_67475
#OP297860 65225 HSV-Klinik_S2 HSV-Klinik_S2 G,A 0.891530460624071 intragenic_variant n.65225G>A UL30 Gene_63070_67475
##improve the header
#sed -i '1s/_trimmed_P//g' merged_variants.csv
##check the REF and K1 have the same base and delete those records with difference.
#cut -f3 -d',' merged_variants.csv > f3
#cut -f6 -d',' merged_variants.csv > f6
#diff f3 f6
#awk -F, '$3 == $6 || NR==1' merged_variants.csv > filtered_merged_variants.csv #(93)
#cut -f3 -d',' filtered_merged_variants.csv > f3
#cut -f6 -d',' filtered_merged_variants.csv > f6
#diff f3 f6
##MANUALLY REMOVE the column f6 in filtered_merged_variants.csv, and rename CHROM to HDRNA_01_K01 in the header, summarize chr and plasmids SNPs of a sample together to a single list, save as an Excel-file.
(Optional, the step is currently only for intrahost variant calling) Filtering low complexity
fastp -i HSV1_S1_trimmed_P_1.fastq -I HSV1_S1_trimmed_P_2.fastq -o HSV1_S1_trimmed_R1.fastq -O HSV1_S1_trimmed_R2.fastq --low_complexity_filter --complexity_threshold 30
fastp -i HSV-Klinik_S2_trimmed_P_1.fastq -I HSV-Klinik_S2_trimmed_P_2.fastq -o HSV-Klinik_S2_trimmed_R1.fastq -O HSV-Klinik_S2_trimmed_R2.fastq --low_complexity_filter --complexity_threshold 30
Read1 before filtering:
total reads: 1755209
total bases: 163663141
Q20 bases: 162306612(99.1711%)
Q30 bases: 159234526(97.2941%)
Read2 before filtering:
total reads: 1755209
total bases: 163045950
Q20 bases: 161178082(98.8544%)
Q30 bases: 157052184(96.3239%)
Read1 after filtering:
total reads: 1733241
total bases: 161547828
Q20 bases: 160217907(99.1768%)
Q30 bases: 157196236(97.3063%)
Read2 aftering filtering:
total reads: 1733241
total bases: 160825521
Q20 bases: 159057902(98.9009%)
Q30 bases: 155354052(96.5979%)
Filtering result:
reads passed filter: 3466482
reads failed due to low quality: 550
reads failed due to too many N: 0
reads failed due to too short: 0
reads failed due to low complexity: 43386
reads with adapter trimmed: 21424
bases trimmed due to adapters: 159261
Duplication rate: 14.2379%
Insert size peak (evaluated by paired-end reads): 41
JSON report: fastp.json
HTML report: fastp.html
fastp -i HSV1_S1_trimmed_P_1.fastq -I HSV1_S1_trimmed_P_2.fastq -o HSV1_S1_trimmed_R1.fastq -O HSV1_S1_trimmed_R2.fastq --low_complexity_filter --complexity_threshold 30
fastp v0.20.1, time used: 7 seconds
Read1 before filtering:
total reads: 2688264
total bases: 330035144
Q20 bases: 326999269(99.0801%)
Q30 bases: 320136918(97.0009%)
Read2 before filtering:
total reads: 2688264
total bases: 327364405
Q20 bases: 323331005(98.7679%)
Q30 bases: 314500076(96.0703%)
Read1 after filtering:
total reads: 2660598
total bases: 326564634
Q20 bases: 323572956(99.0839%)
Q30 bases: 316783667(97.0049%)
Read2 aftering filtering:
total reads: 2660598
total bases: 324709841
Q20 bases: 320840657(98.8084%)
Q30 bases: 312570288(96.2614%)
Filtering result:
reads passed filter: 5321196
reads failed due to low quality: 1110
reads failed due to too many N: 0
reads failed due to too short: 0
reads failed due to low complexity: 54222
reads with adapter trimmed: 39080
bases trimmed due to adapters: 357915
Duplication rate: 9.91821%
Insert size peak (evaluated by paired-end reads): 96
JSON report: fastp.json
HTML report: fastp.html
fastp -i HSV-Klinik_S2_trimmed_P_1.fastq -I HSV-Klinik_S2_trimmed_P_2.fastq -o HSV-Klinik_S2_trimmed_R1.fastq -O HSV-Klinik_S2_trimmed_R2.fastq --low_complexity_filter --complexity_threshold 30
fastp v0.20.1, time used: 15 seconds
Using vrap to assembly and annotate the contigs, the spades-step was replaced with idba of DAMIAN; DAMIAN's host-removal steps can also as the confirmation steps for viral-ngs.
# Starting data: ln -s interhost_variants/trimmed .
ln -s ~/Tools/vrap/ .
#CHANGE the txid10298 in download_db.py: txid10298[Organism] AND complete genome[Title]
gzip trimmed/*_R1.fastq trimmed/*_R2.fastq
mv trimmed/*.gz ./
#--host /home/jhuang/REFs/genome.fa --nt=/mnt/nvme0n1p1/blast/nt --nr=/mnt/nvme0n1p1/blast/nr
vrap/vrap.py -1 trimmed/HSV1_S1_R1.fastq.gz -2 trimmed/HSV1_S1_R2.fastq.gz -o HSV1_S1_vrap_out_v3 --bt2idx=/home/jhuang/REFs/genome -t 100 -l 200 -g
vrap/vrap.py -1 trimmed/HSV-Klinik_S2_R1.fastq.gz -2 trimmed/HSV-Klinik_S2_R2.fastq.gz -o HSV-Klinik_S2_vrap_out_v3 --bt2idx=/home/jhuang/REFs/genome -t 100 -l 200 -g
#--> If ERROR in spades-assembly, we usding idba from DAMIAN assembly, copy the assembly to spades. Then rerun vrap.py above!
# * 4 nt_dbs (--virus, --host, download_db.py(nucleotide), nt), 2 prot_db (download_db.py(protein), nr) for blast, save under ./blast/db/virus, ./blast/db/host, vrap/database/viral_db/viral_nucleotide, vrap/database/viral_db/viral_protein
# * 1 bowtie_database for host removal (--host), save under ./bowtie/host.
# * bowtie run before assembly
# * blast run after assembly for the contigs, therefore it does not exist the taxfilt step in vrap.
# * checking the order of the databases for annotation step, namely which database will be taken firstly for annotionn after setting --virus?
# * If --host is for both bowtie and blastn, if only --bt2idx define, only bowtie, no blastn! --> commented --host=/home/jhuang/REFs/genome.fa still has the host-removal step!
# * "--virus=vrap/database/viral_db/nucleotide.fa" don't need give, since it is already defined in ./blast/db/virus
# * the process: lighter (fast, memory-efficient tool for correcting sequencing errors) --> flash (tool to find the correct overlap between paired-end reads and extend the reads by stitching them together) --> bowtie (delete the host reads) --> spades --> cap3 (CAP3: A DNA sequence assembly program and it has a capability to clip 5' and 3' low-quality regions of reads) --> calculating orf density --> hmmer --> blast
# Download all virus genomes
mv datasets /usr/local/bin/
chmod +x /usr/local/bin/datasets
#datasets download virus genome --complete-only --assembly-source refseq
datasets download virus genome taxon "Viruses" --complete-only --refseq
#To check for RefSeq data only, look for NC_, NM_, or similar prefixes in sequence headers and identifiers.
wget -r -np -nH --cut-dirs=3 ftp://ftp.ncbi.nlm.nih.gov/genomes/Viruses/
# The commends for more comprehensive blast annotation
vrap/vrap.py -1 trimmed/HSV1_S1_R1.fastq.gz -2 trimmed/HSV1_S1_R2.fastq.gz -o HSV1_S1_vrap_out_v4 --bt2idx=/home/jhuang/REFs/genome --host=/home/jhuang/REFs/genome.fa --virus=/home/jhuang/DATA/Data_Nicole_CaptureProbeSequencing/vrap/database/ncbi_dataset/data/genomic.fna --nt=/mnt/nvme0n1p1/blast/nt --nr=/mnt/nvme0n1p1/blast/nr -t 100 -l 200 -g
vrap/vrap.py -1 trimmed/HSV-Klinik_S2_R1.fastq.gz -2 trimmed/HSV-Klinik_S2_R2.fastq.gz -o HSV-Klinik_S2_vrap_out_v4 --bt2idx=/home/jhuang/REFs/genome --host=/home/jhuang/REFs/genome.fa --virus=/home/jhuang/DATA/Data_Nicole_CaptureProbeSequencing/vrap/database/ncbi_dataset/data/genomic.fna --nt=/mnt/nvme0n1p1/blast/nt --nr=/mnt/nvme0n1p1/blast/nr -t 100 -l 200 -g
#END
#using the bowtie of vrap to map the reads on ref_genome/reference.fasta
vrap/vrap.py -1 trimmed/HSV1_S1_R1.fastq.gz -2 trimmed/HSV1_S1_R2.fastq.gz -o HSV1_S1_vrap_out_v5 --host ref_genome/reference.fasta -t 100 -l 200 -g
vrap/vrap.py -1 trimmed/HSV-Klinik_S2_R1.fastq.gz -2 trimmed/HSV-Klinik_S2_R2.fastq.gz -o HSV-Klinik_S2_vrap_out_v5 --host ref_genome/reference.fasta -t 100 -l 200 -g
cd bowtie
mv mapped mapped.sam
samtools view -S -b mapped.sam > mapped.bam
samtools sort mapped.bam -o mapped_sorted.bam
samtools index mapped_sorted.bam
samtools view -H mapped_sorted.bam
samtools flagstat mapped_sorted.bam
#106435 + 0 mapped (3.11% : N/A)
#106435 + 0 primary mapped (3.11% : N/A)
#8204 + 0 properly paired (0.26% : N/A)
#63742 + 0 with itself and mate mapped
8204+63742
#1144448 + 0 mapped (26.25% : N/A)
#1144448 + 0 primary mapped (26.25% : N/A)
#124068 + 0 properly paired (3.76% : N/A)
#581256 + 0 with itself and mate mapped
124068+581256
bamCoverage -b mapped_sorted.bam -o ../../HSV1_S1_reads_coverage2.bw
bamCoverage -b mapped_sorted.bam -o ../../HSV-Klinik_S2_reads_coverage2.bw
#Command line spades:
/home/jhuang/miniconda3/envs/vrap/bin/spades.py -1 /mnt/md1/DATA_md1/Data_Nicole_CaptureProbeSequencing/HSV-Klinik_S2_vrap_out_v3/bowtie/bowtie.un.1.fastq -2 /mnt/md1/DATA_md1/Data_Nicole_CaptureProbeSequencing/HSV-Klinik_S2_vrap_out_v3/bowtie/bowtie.un.2.fastq --s1 /mnt/md1/DATA_md1/Data_Nicole_CaptureProbeSequencing/HSV-Klinik_S2_vrap_out_v3/bowtie/bowtie.un.fastq -k 33,55,77,99,127 --cov-cutoff off --only-assembler --careful -t 100 -o /mnt/md1/DATA_md1/Data_Nicole_CaptureProbeSequencing/HSV-Klinik_S2_vrap_out_v3/spades
#Command line cap3:
/home/jhuang/Tools/vrap/external_tools/cap3 /mnt/md1/DATA_md1/Data_Nicole_CaptureProbeSequencing/HSV-Klinik_S2_vrap_out_v3/spades/contigs.fasta -y 100
damian.rb --host human3 --type dna -1 /home/jhuang/DATA/Data_Nicole_CaptureProbeSequencing/HSV1_S1_trimmed_R1.fastq.gz -2 /home/jhuang/DATA/Data_Nicole_CaptureProbeSequencing/HSV1_S1_trimmed_R2.fastq.gz --sample HSV1_S1_megablast --blastn never --blastp never --min_contiglength 100 --threads 56 --force
damian.rb --host human3 --type dna -1 /home/jhuang/DATA/Data_Nicole_CaptureProbeSequencing/HSV-Klinik_S2_trimmed_R1.fastq.gz -2 /home/jhuang/DATA/Data_Nicole_CaptureProbeSequencing/HSV-Klinik_S2_trimmed_R2.fastq.gz --sample HSV-Klinik_S2_megablast --blastn never --blastp never --min_contiglength 100 --threads 56 --force
[16:42:55 2024-11-12] Removing adapter and host sequences
Trimming readpair 1: /home/jhuang/DATA/Data_Nicole_CaptureProbeSequencing/trimmed/HSV1_S1_R1.fastq.gz and /home/jhuang/DATA/Data_Nicole_CaptureProbeSequencing/trimmed/HSV1_S1_R2.fastq.gz
Host reads: 11.71%
Fragment size: 212 (sd:64)
Subtracting host: human3 (Homo_sapiens_UCSC_hg38 (dna))
Alignment rate: 0.52%
Subtracting host: human3 (Homo sapiens (cdna))
Alignment rate: 0.02%
Subtracting host: human3 (Homo sapiens (ncrna))
Alignment rate: 0.01%
[17:20:31 2024-11-12] Removing adapter and host sequences
Trimming readpair 1: /home/jhuang/DATA/Data_Nicole_CaptureProbeSequencing/trimmed/HSV-Klinik_S2_R1.fastq.gz and /home/jhuang/DATA/Data_Nicole_CaptureProbeSequencing/trimmed/HSV-Klinik_S2_R2.fastq.gz
Host reads: 44.47%
Fragment size: 236 (sd:77)
Subtracting host: human3 (Homo_sapiens_UCSC_hg38 (dna))
Alignment rate: 29.34%
Subtracting host: human3 (Homo sapiens (cdna))
Alignment rate: 0.66%
Subtracting host: human3 (Homo sapiens (ncrna))
Alignment rate: 0.64%
[17:25:27 2024-11-12] Assembling
[17:38:39 2024-11-12] Parsing assembly
Large contigs (500bp and longer): 259
Large orfs (75bp and longer): 843
[17:38:58 2024-11-12] Seeking protein domains
Contigs with domains: 162
[17:40:36 2024-11-12] Annotating contigs
cp ~/rtpd_files/HSV1_S1_megablast/idba_ud_assembly/contig.fa contigs.fasta
cp ~/rtpd_files/HSV-Klinik_S2_megablast/idba_ud_assembly/contig.fa contigs.fasta
#RERUN vrap/vrap.py again with the replaced contigs.fasta!
#vrap/vrap.py -1 Affe30_trimmed_R1.fastq.gz -2 Affe30_trimmed_R2.fastq.gz -o Affe30_trimmed_vrap_out -t 40 -l 100
#vrap/vrap.py -1 Affe31_trimmed_R1.fastq.gz -2 Affe31_trimmed_R2.fastq.gz -o Affe31_trimmed_vrap_out -t 40 -l 100
# -- DEBUG_1 --
#DO NOT use '-l 100' in command line
#name 'generic_dna' is not defined
mamba install biopython=1.77 python=3.9 #for supporting "generic_dna"
# SET all records from vrap/database/viral_db/nucleotide.fa as lastal.acids, choose the most occurred in vrap_out as refsel.acids and the record for accessions_for_ref_genome_build in config.yaml.
# Query coverage
Query sequence name Query length ORF density Percentage identity Subject sequence length Subject accession Subject name E-value
grep "Human alphaherpesvirus 1" HSV-Klinik_S2_contigs_summary.csv > HSV-Klinik_S2_contigs_summary_.csv
#--> ON960057.1
name: The name or identifier of the query sequence. This is typically the header from the input sequence file, such as a FASTA file.
qleng: Query length, or the total length of the input sequence (in nucleotides or amino acids, depending on the input type).
orf_d: ORF (Open Reading Frame) direction. This indicates the strand or frame in which the ORF was found, often shown as + for the forward direction or - for the reverse direction.
hmmer_eval: The E-value from the HMMER (Hidden Markov Model) search. This represents the statistical significance of the match between the identified ORF and the reference HMM model. Lower values indicate more significant matches.
hmm_model: The name of the HMM (Hidden Markov Model) profile matched by the ORF. This typically corresponds to a specific viral or protein family model from an HMM database, such as Pfam or custom models used by VRAP.
ident: Percentage identity between the query sequence and the target model or database entry. This measures the similarity of the ORF to the matched model.
qcov: Query coverage, or the percentage of the query sequence that aligns to the target model. This indicates how much of the ORF sequence aligns with the HMM profile.
tcov: Target coverage, or the percentage of the target HMM profile that aligns with the query. This helps assess how well the ORF represents the entire HMM model.
tlength: Target length, or the length of the HMM model sequence in the database. This value can be used to understand how much of the target model was covered by the ORF alignment.
tid: Target identifier, often an accession or ID number for the matched HMM model. This is used to uniquely identify the model within the HMM database.
tname: Target name or description, which provides more information about the HMM model or protein family that the ORF matches.
mean_eval: Mean E-value for the HMMER match, averaged over multiple potential alignments (if any). Lower values imply higher significance, with the mean providing an aggregate metric if there were multiple HMM matches.
#reads_and_contigs_on_JX878414.png
#using the assembly for the calling!
#TODO_TOMORROW: In the final results only mark the SNPs in the contigs > 500 nt (shown as in the figure), otherwise we have too much results! then merge snps (now there is an ERROR during merging!)
Analyses using viral-ngs
conda activate viral3
#conda install -c anaconda openjdk=8
ln -s ~/Tools/viral-ngs/Snakefile Snakefile
ln -s ~/Tools/viral-ngs/bin bin
cp ~/Tools/viral-ngs/refsel.acids refsel.acids
cp ~/Tools/viral-ngs/lastal.acids lastal.acids
cp ~/Tools/viral-ngs/config.yaml config.yaml
cp ~/Tools/viral-ngs/samples-runs.txt samples-runs.txt
cp ~/Tools/viral-ngs/samples-depletion.txt samples-depletion.txt
cp ~/Tools/viral-ngs/samples-metagenomics.txt samples-metagenomics.txt
cp ~/Tools/viral-ngs/samples-assembly.txt samples-assembly.txt
cp ~/Tools/viral-ngs/samples-assembly-failures.txt samples-assembly-failures.txt
mkdir data
cd data
mkdir 00_raw
cd ../..
Prepare lastal.acids, refsel.acids and accessions_for_ref_genome_build in config.yaml
#Herpes simplex virus 1 (HSV-1) and Human alphaherpesvirus 1 (also known as Simplexvirus humanalpha1) are indeed the same virus.
#The different names result from varied naming conventions:
# * Herpes simplex virus 1 (HSV-1) is the common name, often used in clinical and general contexts.
# * Human alphaherpesvirus 1 is the official taxonomic name, as defined by the International Committee on Taxonomy of Viruses (ICTV). This name is used in scientific classifications and databases like NCBI to specify its place in the Herpesviridae family under the Alphaherpesvirinae subfamily.
#In some databases or references, it might also appear under Simplexvirus humanalpha1, which refers to its taxonomic classification at the genus level (Simplexvirus) and species level (Human alphaherpesvirus 1). However, all these terms refer to the same virus, commonly known as HSV-1.
#https://www.uniprot.org/taxonomy?query=Human+herpesvirus
#https://www.uniprot.org/taxonomy/3050292
esearch -db nuccore -query "txid3050292[Organism]" | efetch -format fasta > taxon_3050292_sequences.fasta
esearch -db nuccore -query "txid3050292[Organism]" | efetch -format acc > taxon_3050292_accessions.txt
esearch -db nuccore -query "txid10298[Organism] AND complete genome[Title]" | efetch -format fasta > taxon_3050292_complete_genomes.fasta
esearch -db nuccore -query "txid10298[Organism] AND complete genome[Title]" | efetch -format acc > taxon_10298_complete_genomes.acc # 161 genomes
mv taxon_10298_complete_genomes.acc lastal.acids
https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=10298
Human alphaherpesvirus 1 (Herpes simplex virus type 1) Click on organism name to get more information.
Human alphaherpesvirus 1 strain 17
Human alphaherpesvirus 1 strain A44
Human alphaherpesvirus 1 strain Angelotti
Human alphaherpesvirus 1 strain CL101
Human alphaherpesvirus 1 strain CVG-2
Human alphaherpesvirus 1 strain F
Human alphaherpesvirus 1 strain H129
Human alphaherpesvirus 1 strain HFEM
Human alphaherpesvirus 1 strain HZT
Human alphaherpesvirus 1 strain KOS
Human alphaherpesvirus 1 strain MGH-10
Human alphaherpesvirus 1 strain MP
Human alphaherpesvirus 1 strain Patton
Human alphaherpesvirus 1 strain R-15
Human alphaherpesvirus 1 strain R19
Human alphaherpesvirus 1 strain RH2
Human alphaherpesvirus 1 strain SC16
Trimming using trimmomatic
# Starting data: ln -s interhost_variants/raw_data .
mkdir bams
for sample in HSV1_S1 HSV-Klinik_S2 NTC_S3; do
for sample in HSV1_S1; do
java -jar /home/jhuang/Tools/Trimmomatic-0.36/trimmomatic-0.36.jar PE -threads 16 ./raw_data/${sample}_R1.fastq.gz ./raw_data/${sample}_R2.fastq.gz trimmed/${sample}_R1.fastq.gz trimmed/${sample}_unpaired_R1.fastq.gz trimmed/${sample}_R2.fastq.gz trimmed/${sample}_unpaired_R2.fastq.gz ILLUMINACLIP:/home/jhuang/Tools/Trimmomatic-0.36/adapters/TruSeq3-PE-2.fa:2:30:10:8:TRUE LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36 AVGQUAL:20; \
done
Mapping
cd trimmed
seqtk sample -s100 HSV1_S1_R1.fastq.gz 0.1 > HSV1_S1_sampled_R1.fastq
seqtk sample -s100 HSV1_S1_R2.fastq.gz 0.1 > HSV1_S1_sampled_R2.fastq
gzip HSV1_S1_sampled_R1.fastq HSV1_S1_sampled_R2.fastq
ref_fa="NC_001806.fasta";
for sample in HSV1_S1 HSV-Klinik_S2 NTC_S3; do
for sample in HSV1_S1; do
for sample in HSV1_S1_sampled; do
bwa index ${ref_fa}; \
bwa mem -M -t 16 ${ref_fa} trimmed/${sample}_R1.fastq.gz trimmed/${sample}_R2.fastq.gz | samtools view -bS - > bams/${sample}_genome_alignment.bam; \
#for table filling using the following commands! -->3000000 \
#bwa mem -M -t 14 ${ref_fa} ${sample}_R1.fastq.gz ${sample}_R2.fastq.gz | samtools view -bS -F 256 - > bams/${sample}_uniqmap.bam; \
done
AddOrReplaceReadGroup is IMPORTANT step, otherwise the step viral_ngs cannot run correctly
for sample in HSV1_S1 HSV-Klinik_S2 NTC_S3; do
for sample in HSV1_S1; do
for sample in HSV1_S1_sampled; do
picard AddOrReplaceReadGroups I=bams/${sample}_genome_alignment.bam O=data/00_raw/${sample}.bam SORT_ORDER=coordinate CREATE_INDEX=true RGPL=illumina RGID=$sample RGSM=$sample RGLB=standard RGPU=$sample VALIDATION_STRINGENCY=LENIENT; \
done
Configure the viral-ngs conda environment
conda config --add channels r
conda config --add channels defaults
conda config --add channels conda-forge
conda config --add channels bioconda
conda config --add channels broad-viral
# -- Works not correctly --
#conda list --export > environment2.yml
#mamba create --name viral-ngs4 --file environment2.yml
mamba env remove -n viral-ngs4
mamba create -n viral-ngs4 python=3.6 blast=2.6.0 bmtagger biopython pysam pyyaml picard mvicuna pybedtools fastqc matplotlib spades last=876 -c conda-forge -c bioconda
conda activate viral-ngs4
mamba create -n viral-ngs4 python=3.6
conda activate viral-ngs4
#vim requirements-conda.txt
mamba install blast=2.6.0 bmtagger biopython pysam pyyaml picard mvicuna pybedtools fastqc matplotlib spades last=876 -c conda-forge -c bioconda
# -- Eventually DEBUG --
#mamba remove picard
#mamba clean --packages
#mamba install -c bioconda picard
##mamba install libgfortran=5 sqlite=3.46.0
##mamba install picard --clobber
##mamba create -n viral-ngs-fresh -c bioconda -c conda-forge picard python=3.6 sqlite=3.46.0 libgfortran=5
mamba install cd-hit cd-hit-auxtools diamond gap2seq=2.1 mafft=7.221 mummer4 muscle=3.8 parallel pigz prinseq samtools=1.6 tbl2asn trimmomatic trinity unzip vphaser2 bedtools -c r -c defaults -c conda-forge -c bioconda #-c broad-viral
mamba install snpeff=4.1l
mamba install gatk=3.6
mamba install bwa
#IMPORTANT_REPLACE "sudo cp /home/jhuang/miniconda3/envs/viral-ngs4/bin/gatk3 /usr/local/bin/gatk"
#IMPORTANT_UPDATE jar_file in the file with "/home/jhuang/Tools/GenomeAnalysisTK-3.6/GenomeAnalysisTK.jar"
#IMPORTANT_SET /home/jhuang/Tools/GenomeAnalysisTK-3.6 as GATK_PATH in config.yaml
#IMPORTANT_CHECK if it works
# java -jar /home/jhuang/Tools/GenomeAnalysisTK-3.6/GenomeAnalysisTK.jar -T RealignerTargetCreator --help
# /usr/local/bin/gatk -T RealignerTargetCreator --help
#IMPORTANT_NOTE that the env viral-ngs4 cannot logined from the base env due to the python3-conflict!
mamba install vphaser2=2.0
# -- NO ERROR --> INSTALL END HERE --
# -- DEBUG: ClobberError: This transaction has incompatible packages due to a shared path. --
# SafetyError: The package for snpeff located at /home/jhuang/miniconda3/pkgs/snpeff-4.1l-hdfd78af_8
# appears to be corrupted. The path 'share/snpeff-4.1l-8/snpEff.config'
# has an incorrect size.
# reported size: 9460047 bytes
# actual size: 9460357 bytes
#
# ClobberError: This transaction has incompatible packages due to a shared path.
# packages: bioconda/linux-64::bowtie2-2.5.4-h7071971_4, bioconda/linux-64::bowtie-1.3.1-py36h769816f_3
# path: 'bin/scripts/convert_quals.pl'
# sovle confilict between bowtie, bowtie2 and snpeff
mamba remove bowtie
mamba install bowtie2
mamba remove snpeff
mamba install snpeff=4.1l
# -- WITH ERROR caused by bowtie and snpeff --> INSTALL END HERE --
#mamba install -c bioconda viral-ngs #so that gatk3-register and novoalign-license-register available --> ERROR
#Due to license restrictions, the viral-ngs conda package cannot distribute and install GATK directly. To fully install GATK, you must download a licensed copy of GATK v3.8 from the Broad Institute, and call “gatk3-register,” which will copy GATK into your viral-ngs conda environment:
mkdir -p /path/to/gatk_dir
wget -O - 'https://software.broadinstitute.org/gatk/download/auth?package=GATK-archive&version=3.6-0-g89b7209' | tar -xjvC /path/to/gatk_dir
gatk3-register /path/to/gatk_dir/GenomeAnalysisTK.jar
#The single-threaded version of Novoalign is installed by default. If you have a license for Novoalign to enable multi-threaded operation, viral-ngs will copy it to the viral-ngs conda environment if the NOVOALIGN_LICENSE_PATH environment variable is set. Alternatively, the conda version of Novoalign can be overridden if the NOVOALIGN_PATH environment variable is set. If you obtain a Novoalign license after viral-ngs has already been installed, it can be added to the conda environment by calling:
# obtain a Novoalign license file: novoalign.lic
novoalign-license-register /path/to/novoalign.lic
# # --We don't have registers, so we have to manually install novoalign and gatk--
# #At first install novoalign, then samtools
# mamba remove samtools
# mamba install -c bioconda novoalign # Eventually not necessary, since the path is defined in config.yaml NOVOALIGN_PATH: "/home/jhuang/Tools/novocraft_v3", and novoalign.lic is also in the same path.
# mamba install -c bioconda samtools
#
# mamba install -c bioconda gatk #(3.8) #IN /usr/local/bin/gatk FROM /home/jhuang/Tools/SPANDx_v3.2/GenomeAnalysisTK.jar
# #UPDATED TO: '/home/jhuang/Tools/GenomeAnalysisTK-3.6/GenomeAnalysisTK.jar'
# # If necessary, clean up the conda cache. This will remove any partially installed or corrupted packages.
# conda clean --all
## reinstall samtools 1.6 --> NOT RELEVANT
#mamba install samtools=1.6
Run snakemake
#Set values in samples-*.txt before running viral-ngs
rm -rf ref_genome refsel_db lastal_db
mv data data_v1;
mv tmp tmp_v1;
mkdir data tmp
mv data_v1/00_raw data
snakemake --printshellcmds --cores 10
#Manully remove the records in the intrahost-results when it occurs in the interhost-tables as save the final intrahost-results as a Excel-Sheet in the variants.xlsx.
/usr/local/bin/gatk
https://software.broadinstitute.org/gatk/documentation/tooldocs/org_broadinstitute_gatk_tools_walkers_indels_RealignerTargetCreator.php
java -jar ~/Tools/GenomeAnalysisTK-3.8/GenomeAnalysisTK.jar -T RealignerTargetCreator --help #--> CORRECT
java -jar ~/Tools/GenomeAnalysisTK-3.6/GenomeAnalysisTK.jar -T RealignerTargetCreator --help #--> CORRECT
/usr/local/bin/gatk -T RealignerTargetCreator --help
https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_indels_RealignerTargetCreator.php
Djava.io.tmpdir=/tmp/tmp-assembly-refine_assembly-2d9z3pcr
java -jar ~/Tools/GenomeAnalysisTK-2.8-1/GenomeAnalysisTK.jar -T RealignerTargetCreator -I /tmp/tmp-assembly-refine_assembly-2d9z3pcr/tmp0_vh27ji.rmdup.bam -R /tmp/tmp-assembly-refine_assembly-2d9z3pcr/tmppwwyriob.deambig.fasta -o /tmp/tmp-assembly-refine_assembly-2d9z3pcr/tmp_o2f2e0o.intervals --num_threads 120
java -jar ~/Tools/GenomeAnalysisTK-3.8/GenomeAnalysisTK.jar -T RealignerTargetCreator -I /tmp/tmp-assembly-refine_assembly-2d9z3pcr/tmp0_vh27ji.rmdup.bam -R /tmp/tmp-assembly-refine_assembly-2d9z3pcr/tmppwwyriob.deambig.fasta -o /tmp/tmp-assembly-refine_assembly-2d9z3pcr/tmp_o2f2e0o.intervals --num_threads 120
~/Tools/GenomeAnalysisTK-4.1.2.0/gatk -T RealignerTargetCreator -I /tmp/tmp-assembly-refine_assembly-2d9z3pcr/tmp0_vh27ji.rmdup.bam -R /tmp/tmp-assembly-refine_assembly-2d9z3pcr/tmppwwyriob.deambig.fasta -o /tmp/tmp-assembly-refine_assembly-2d9z3pcr/tmp_o2f2e0o.intervals --num_threads 120
# -- DEBUG_1: Configure the Conda Environment to Use Host's Java (version 17) while keeping BLAST 2.6.0+ --
bin/taxon_filter.py deplete data/00_raw/HSV1_S1.bam tmp/01_cleaned/HSV1_S1.raw.bam tmp/01_cleaned/HSV1_S1.bmtagger_depleted.bam tmp/01_cleaned/HSV1_S1.rmdup.bam data/01_cleaned/HSV1_S1.cleaned.bam --bmtaggerDbs /home/jhuang/REFs/viral_ngs_dbs/bmtagger_dbs_remove/metagenomics_contaminants_v3 /home/jhuang/REFs/viral_ngs_dbs/bmtagger_dbs_remove/GRCh37.68_ncRNA-GRCh37.68_transcripts-HS_rRNA_mitRNA /home/jhuang/REFs/viral_ngs_dbs/bmtagger_dbs_remove/hg19 --blastDbs /home/jhuang/REFs/viral_ngs_dbs/blast_dbs_remove/metag_v3.ncRNA.mRNA.mitRNA.consensus /home/jhuang/REFs/viral_ngs_dbs/blast_dbs_remove/hybsel_probe_adapters --threads 120 --srprismMemory 142500000 --JVMmemory 256g --loglevel DEBUG
#2024-11-06 15:55:01,162 - __init__:444:_attempt_install - DEBUG - Currently installed version of blast: 2.16.0-hc155240_2
#2024-11-06 15:55:01,162 - __init__:448:_attempt_install - DEBUG - Expected version of blast: 2.6.0
#2024-11-06 15:55:01,162 - __init__:449:_attempt_install - DEBUG - Incorrect version of blast installed. Removing it...
# + (blast 2.6.0 needs java 17, therefore java="/usr/lib/jvm/java-17-openjdk-amd64/bin/java" in /home/jhuang/miniconda3/envs/viral-ngs2/bin/picard) blast 2.6.0 boost1.64_2 bioconda Cached
# + (bmtagger 3.101 needs blast 2.6.0) blast=2.6.0 + bmtagger 3.101 h470a237_4 bioconda Cached
# + pango 1.50.7 hbd2fdc8_0 conda-forge Cached
# + openjdk 11.0.15 hc6918da_0 conda-forge Cached
# + r-base 4.2.0 h1ae530e_0 pkgs/r Cached
# + picard 3.0.0 hdfd78af_0 bioconda Cached
# + java -version openjdk version "11.0.15-internal" 2022-04-19
Then, edit in the following file so that it can use the host java (version 17) for the viral-ngs2 picard 3.0.0! --
vim /home/jhuang/miniconda3/envs/viral-ngs2/bin/picard
# ---------------------------------------------------------
# Use Java installed with Anaconda to ensure correct version
java="$ENV_PREFIX/bin/java"
# if JAVA_HOME is set (non-empty), use it. Otherwise keep "java"
if [ -n "${JAVA_HOME:=}" ]; then
if [ -e "$JAVA_HOME/bin/java" ]; then
java="$JAVA_HOME/bin/java"
fi
fi
# -------------------------------------------------------->
#COMMENTED
# Use Java installed with Anaconda to ensure correct version
#java="$ENV_PREFIX/bin/java"
#MODIFIED
## if JAVA_HOME is set (non-empty), use it. Otherwise keep "java"
#if [ -n "${JAVA_HOME:=}" ]; then
# if [ -e "$JAVA_HOME/bin/java" ]; then
# java="$JAVA_HOME/bin/java"
# fi
#fi
java="/usr/lib/jvm/java-17-openjdk-amd64/bin/java"
# ---------------------------------------------------------
# -- DEBUG_2: lastal version not compatible --
bin/ncbi.py fetch_fastas j.huang@uke.de lastal_db NC_001806.2 --combinedFilePrefix lastal --removeSeparateFiles --forceOverwrite --chunkSize 300
bin/taxon_filter.py filter_lastal_bam data/01_cleaned/HSV1_S1.cleaned.bam lastal_db/lastal.fasta data/01_cleaned/HSV1_S1.taxfilt.bam --threads 120 --JVMmemory 256g --loglevel DEBUG
mamba remove last
mamba install -c bioconda last=876
lastal -V
bin/taxon_filter.py filter_lastal_bam data/01_cleaned/HSV1_S1.cleaned.bam lastal_db/lastal.fasta data/01_cleaned/HSV1_S1.taxfilt.bam --threads 120 --JVMmemory 256g --loglevel DEBUG
# -- DEBUG_3: lastal version not compatible --
bin/assembly.py gapfill_gap2seq tmp/02_assembly/HSV1_S1_sampled.assembly2-scaffolded.fasta data/01_per_sample/HSV1_S1_sampled.cleaned.bam tmp/02_assembly/HSV1_S1_sampled.assembly2-gapfilled.fasta --memLimitGb 12 --maskErrors --randomSeed 0 --loglevel DEBUG
#2024-11-07 12:34:14,732 - __init__:460:_attempt_install - DEBUG - Attempting install...
#2024-11-07 12:34:14,733 - __init__:545:install_package - DEBUG - Creating conda environment and installing package gap2seq=2.1
mamba install gap2seq=2.1
# -- DEBUG_4 --
bin/assembly.py impute_from_reference tmp/02_assembly/HSV1_S1_sampled.assembly2-gapfilled.fasta tmp/02_assembly/HSV1_S1_sampled.assembly2-scaffold_ref.fasta tmp/02_assembly/HSV1_S1_sampled.assembly3-modify.fasta --newName HSV1_S1_sampled --replaceLength 55 --minLengthFraction 0.05 --minUnambig 0.05 --index --loglevel DEBUG
2024-11-07 14:05:20,438 - __init__:445:_attempt_install - DEBUG - Currently installed version of muscle: 5.2-h4ac6f70_0
2024-11-07 14:05:20,438 - __init__:448:_attempt_install - DEBUG - Expected version of muscle: 3.8.1551
2024-11-07 14:05:20,438 - __init__:449:_attempt_install - DEBUG - Incorrect version of muscle installed. Removing it...
mamba install muscle=3.8
#- muscle 5.2 h4ac6f70_0 bioconda Cached
#+ muscle 3.8.1551 h7d875b9_6 bioconda Cached
#/home/jhuang/Tools/novocraft_v3/novoalign -f data/01_per_sample/HSV1_S1.cleaned.bam -r Random -l 20 -g 40 -x 20 -t 100 -F BAM -d tmp/02_assembly/HSV1_S1.assembly4-refined.nix -o SAM
# -- DEBUG_5 --
bin/assembly.py refine_assembly tmp/02_assembly/HSV1_S1_sampled.assembly3-modify.fasta data/01_per_sample/HSV1_S1_sampled.cleaned.bam tmp/02_assembly/HSV1_S1_sampled.assembly4-refined.fasta --outVcf tmp/02_assembly/HSV1_S1_sampled.assembly3.vcf.gz --min_coverage 2 --novo_params '-r Random -l 20 -g 40 -x 20 -t 502' --threads 120 --loglevel DEBUG
#Shebang in /usr/local/bin/gatk is corrupt.
# -- DEBUG_6 --
bin/interhost.py multichr_mafft ref_genome/reference.fasta data/02_assembly/HSV1_S1_sampled.fasta data/03_multialign_to_ref --ep 0.123 --maxiters 1000 --preservecase --localpair --outFilePrefix aligned --sampleNameListFile data/03_multialign_to_ref/sampleNameList.txt --threads 120 --loglevel DEBUG
2024-11-07 14:47:34,163 - __init__:445:_attempt_install - DEBUG - Currently installed version of mafft: 7.526-h4bc722e_0
2024-11-07 14:47:34,163 - __init__:448:_attempt_install - DEBUG - Expected version of mafft: 7.221
2024-11-07 14:47:34,164 - __init__:449:_attempt_install - DEBUG - Incorrect version of mafft installed. Removing it...
mamba install mafft=7.221
# -- DEBUG_7 --
bin/intrahost.py vphaser_one_sample data/02_align_to_self/HSV1_S1_sampled.mapped.bam data/02_assembly/HSV1_S1_sampled.fasta data/04_intrahost/vphaser2.HSV1_S1_sampled.txt.gz --vphaserNumThreads 120 --removeDoublyMappedReads --minReadsEach 5 --maxBias 10 --loglevel DEBUG
export TMPDIR=/home/jhuang/tmp
(viral-ngs) jhuang@WS-2290C:~/DATA/Data_Nicole_CaptureProbeSequencing$ /home/jhuang/miniconda3/envs/viral-ngs/bin/vphaser2 -i /home/jhuang/tmp/tmp_bq17yoq.mapped-withdoublymappedremoved.bam -o /home/jhuang/tmp/tmpyg8mlj5qvphaser2
samtools depth /home/jhuang/tmp/tmp_bq17yoq.mapped-withdoublymappedremoved.bam > coverage.txt
# -- DEBUG_8 --
snakemake --printshellcmds --cores 100
bin/intrahost.py merge_to_vcf ref_genome/reference.fasta data/04_intrahost/isnvs.vcf.gz --samples HSV1_S1 --isnvs data/04_intrahost/vphaser2.HSV1_S1.txt.gz --alignments data/03_multialign_to_ref/aligned_1.fasta --strip_chr_version --parse_accession
snakemake --printshellcmds --cores 100
# -- DEBUG_9 --
bin/assembly.py refine_assembly tmp/02_assembly/HSV-Klinik_S2.assembly3-modify.fasta data/01_per_sample/HSV-Klinik_S2.cleaned.bam tmp/02_assembly/HSV-Klinik_S2.assembly4-refined.fasta --outVcf tmp/02_assembly/HSV-Klinik_S2.assembly3.vcf.gz --min_coverage 2 --novo_params '-r Random -l 20 -g 40 -x 20 -t 502' --threads 120 --loglevel DEBUG
/usr/local/bin/gatk -Xmx20g -Djava.io.tmpdir=/home/jhuang/tmp/tmp-assembly-refine_assembly-dx3dr73p -T RealignerTargetCreator -I /home/jhuang/tmp/tmp-assembly-refine_assembly-dx3dr73p/tmpwbzvjo9j.rmdup.bam -R /home/jhuang/tmp/tmp-assembly-refine_assembly-dx3dr73p/tmpxq4obe29.deambig.fasta -o /home/jhuang/tmp/tmp-assembly-refine_assembly-dx3dr73p/tmptkw8zcf3.intervals --num_threads 120
mamba install gatk=3.6
#IMPORTANT_REPLACE "sudo cp /home/jhuang/miniconda3/envs/viral-ngs4/bin/gatk3 /usr/local/bin/gatk"
#IMPORTANT_UPDATE jar_file in the file with "/home/jhuang/Tools/GenomeAnalysisTK-3.6/GenomeAnalysisTK.jar"
#IMPORTANT_SET /home/jhuang/Tools/GenomeAnalysisTK-3.6 as GATK_PATH in config.yaml
#IMPORTANT_CHECK if it works
# java -jar /home/jhuang/Tools/GenomeAnalysisTK-3.6/GenomeAnalysisTK.jar -T RealignerTargetCreator --help
# /usr/local/bin/gatk -T RealignerTargetCreator --help
#IMPORTANT_NOTE that the env viral-ngs4 cannot logined from the base env due to the python3-conflict!
# -- DEBUG_10 (if the sequencing is too shawlow, then seperate running) --
/home/jhuang/miniconda3/envs/viral-ngs4/bin/vphaser2 -i /tmp/tmp2jl4plhy.mapped-withdoublymappedremoved.bam -o /tmp/tmp1x6jsiu_vphaser2
[EXIT]: gather_alignments: Failed to set region for reference HSV-Klinik_S2-1 in file /tmp/tmp2jl4plhy.mapped-withdoublymappedremoved.bam
# Run seperate intrahost.py --> no error:
#342 reads
2024-11-08 14:27:33,575 - intrahost:223:compute_library_bias - DEBUG - LB:standard has 161068 reads in 1 read group(s) (HSV-Klinik_S2)
2024-11-08 14:27:34,875 - __init__:445:_attempt_install - DEBUG - Currently installed version of vphaser2: 2.0-h7a259b3_14
samtools index HSV1_S1.mapped.bam
samtools index HSV-Klinik_S2.mapped.bam
bin/intrahost.py vphaser_one_sample data/02_align_to_self/HSV1_S1.mapped.bam data/02_assembly/HSV1_S1.fasta data/04_intrahost/vphaser2.HSV1_S1.txt.gz --vphaserNumThreads 120 --removeDoublyMappedReads --minReadsEach 5 --maxBias 10 --loglevel DEBUG
bin/intrahost.py vphaser_one_sample data/02_align_to_self/HSV-Klinik_S2.mapped.bam data/02_assembly/HSV-Klinik_S2.fasta data/04_intrahost/vphaser2.HSV-Klinik_S2.txt.gz --minReadsEach 1 --maxBias 2 --loglevel DEBUG # --vphaserNumThreads 120 --removeDoublyMappedReads
/home/jhuang/miniconda3/envs/viral-ngs4/bin/vphaser2 -i data/02_align_to_self/HSV-Klinik_S2.mapped.bam -o /tmp/tmpgacpc6eqvphaser2
samtools idxstats data/02_align_to_self/HSV-Klinik_S2.mapped.bam
samtools index data/02_align_to_self/HSV-Klinik_S2.mapped.bam
samtools view -H data/02_align_to_self/HSV-Klinik_S2.mapped.bam
/home/jhuang/miniconda3/envs/viral-ngs4/bin/vphaser2 -i data/02_align_to_self/HSV-Klinik_S2.mapped.bam -o /tmp/output_dir
/home/jhuang/miniconda3/envs/viral-ngs4/bin/vphaser2 -i data/02_align_to_self/HSV-Klinik_S2.mapped.bam -o /tmp/tmpgacpc6eqvphaser2
samtools view -b data/02_align_to_self/HSV-Klinik_S2.mapped.bam "HSV-Klinik_S2-1" > subset.bam
samtools index subset.bam
@SQ SN:HSV-Klinik_S2-1 LN:141125 AS:tmp35_s3ghx.ref_copy.nix
samtools view -b subset.bam "HSV-Klinik_S2-1:1-10000" > small_subset.bam
samtools index small_subset.bam
/home/jhuang/miniconda3/envs/viral-ngs4/bin/vphaser2 -i small_subset.bam -o /tmp/output_dir
/home/jhuang/miniconda3/envs/viral-ngs4/bin/vphaser2 -i subset.bam -o vphaser2_out
# -- DEBUG_11 in step multi_align_mafft: aligned_1.fasta is always empty, we need generate it manually with mafft and mark it as complete --
#[Fri Nov 8 14:51:45 2024]
#rule multi_align_mafft:
# input: data/02_assembly/HSV1_S1.fasta, data/02_assembly/HSV-Klinik_S2.fasta, ref_genome/reference.fasta
# output: data/03_multialign_to_ref/sampleNameList.txt, data/03_multialign_to_ref/aligned_1.fasta, data/03_multialign_to_ref/aligned_2.fasta, ... data/03_multialign_to_ref/aligned_161.fasta
# jobid: 24
# resources: tmpdir=/tmp, mem=8, threads=120
bin/interhost.py multichr_mafft ref_genome/reference.fasta data/02_assembly/HSV1_S1.fasta data/02_assembly/HSV-Klinik_S2.fasta data/03_multialign_to_ref --ep 0.123 --maxiters 1000 --preservecase --localpair --outFilePrefix aligned --sampleNameListFile data/03_multialign_to_ref/sampleNameList.txt --threads 120 --loglevel DEBUG
#b'/home/jhuang/miniconda3/envs/viral-ngs4/bin/python\n'
#-------
#2024-11-08 14:51:46,324 - cmd:193:main_argparse - INFO - software version: 1522433800, python version: 3.6.7 | packaged by conda-forge | (default, #Feb 28 2019, 09:07:38)
#[GCC 7.3.0]
#2024-11-08 15:00:26,375 - cmd:195:main_argparse - INFO - command: bin/interhost.py multichr_mafft inFastas=['ref_genome/reference.fasta', 'data/02_assembly/HSV1_S1.fasta', 'data/02_assembly/HSV-Klinik_S2.fasta'] localpair=True globalpair=None preservecase=True reorder=None gapOpeningPenalty=1.53 ep=0.123 verbose=False outputAsClustal=None maxiters=1000 outDirectory=data/03_multialign_to_ref outFilePrefix=aligned sampleRelationFile=None sampleNameListFile=data/03_multialign_to_ref/sampleNameList.txt threads=120 loglevel=DEBUG tmp_dir=/tmp tmp_dirKeep=False
#2024-11-08 15:00:26,375 - cmd:209:main_argparse - DEBUG - using tempDir: /tmp/tmp-interhost-multichr_mafft-sw91_svl
#2024-11-08 15:00:27,718 - __init__:445:_attempt_install - DEBUG - Currently installed version of mafft: 7.221-0
#2024-11-08 15:00:27,719 - mafft:141:execute - DEBUG - /home/jhuang/miniconda3/envs/viral-ngs4/bin/mafft --thread 120 --localpair --preservecase --op 1.53 --ep 0.123 --quiet --maxiterate 1000 /tmp/tmp-interhost-multichr_mafft-sw91_svl/tmp68_ln_ha.fasta
snakemake --cleanup-metadata 03_multialign_to_ref --cores 4
# -- DEBUG_12 --
#[EXIT]: gather_alignments: Failed to set region for reference HSV-Klinik_S2-1 in file data/02_align_to_self/HSV-Klinik_S2.mapped.bam
#DEBUG_PROCESS1: rm temp/*.region
/home/jhuang/miniconda3/envs/viral-ngs4/bin/vphaser2 -w 5000 -i data/02_align_to_self/HSV-Klinik_S2.mapped.bam -o /home/jhuang/DATA/Data_Nicole_CaptureProbeSequencing/temp
# 5209 snp
# 21 lv
#SOLUTION: MODFIED AS 'cmd = [self.install_and_get_path(), '-w 5000', '-i', inBam, '-o', outDir]' in bin/tools/vphaser2.py
#ADDED
cmd.append('-w')
cmd.append('25000')
bin/intrahost.py vphaser_one_sample data/02_align_to_self/HSV-Klinik_S2.mapped.bam data/02_assembly/HSV-Klinik_S2.fasta data/04_intrahost/vphaser2.HSV-Klinik_S2.txt.gz --vphaserNumThreads 120 --removeDoublyMappedReads --minReadsEach 5 --maxBias 10 --loglevel DEBUG
#BEFORE_CHANGE:
b'\n--------------------------------------------------------\nProgram runs with the following Parameter setting:\n\n\tinput BAM file\t=\t/tmp/tmpt6fgovqk.mapped-withdoublymappedremoved.bam\n\toutput Directory\t=\t/tmp/tmp53_oxecyvphaser2\n\terrModel\t\t=\tpileup + phase\n\talpha\t\t=\t0.05\n\tignoreBases \t=\t0\n\t(var_matepair, var_cycle, var_dt, var_qt)\t=\t1,1,1,20\n\tpSample\t\t=\t30%\n\twindowSz\t=\t500\n\tdelta\t=\t2\n\n
#AFTER_CHANGE:
windowSz=5000
#mkdir 02_align_to_ref
bin/read_utils.py align_and_fix data/01_per_sample/HSV1_S1.cleaned.bam refsel_db/refsel.fasta --outBamAll data/02_align_to_ref/HSV1_S1.bam --outBamFiltered data/02_align_to_ref/HSV1_S1.mapped.bam --aligner novoalign --aligner_options '-r Random -l 20 -g 40 -x 20 -t 100 -k' --threads 120
bin/read_utils.py align_and_fix data/01_per_sample/HSV-Klinik_S2.cleaned.bam refsel_db/refsel.fasta --outBamAll data/02_align_to_ref/HSV-Klinik_S2.bam --outBamFiltered data/02_align_to_ref/HSV-Klinik_S2.mapped.bam --aligner novoalign --aligner_options '-r Random -l 20 -g 40 -x 20 -t 100 -k' --threads 120
bin/intrahost.py vphaser_one_sample data/02_align_to_ref/HSV1_S1.mapped.bam refsel_db/refsel.fasta data/04_intrahost/vphaser2.HSV-Klinik_S2_on_ref.txt.gz --vphaserNumThreads 120 --removeDoublyMappedReads --minReadsEach 5 --maxBias 10
bin/intrahost.py vphaser_one_sample data/02_align_to_ref/HSV-Klinik_S2.mapped.bam refsel_db/refsel.fasta data/04_intrahost/vphaser2.HSV-Klinik_S2_on_ref.txt.gz --vphaserNumThreads 120 --removeDoublyMappedReads --minReadsEach 5 --maxBias 10
/home/jhuang/miniconda3/envs/viral-ngs4/bin/vphaser2 -w 10000 -i data/02_align_to_ref/HSV-Klinik_S2.mapped.bam -o /home/jhuang/DATA/Data_Nicole_CaptureProbeSequencing/temp
mkdir 02_align_to_NC_001806
bin/read_utils.py align_and_fix data/01_per_sample/HSV1_S1.cleaned.bam refsel_db/NC_001806.2.fasta --outBamAll data/02_align_to_NC_001806/HSV1_S1.bam --outBamFiltered data/02_align_to_NC_001806/HSV1_S1.mapped.bam --aligner novoalign --aligner_options '-r Random -l 20 -g 40 -x 20 -t 100 -k' --threads 120
bin/read_utils.py align_and_fix data/01_per_sample/HSV-Klinik_S2.cleaned.bam refsel_db/NC_001806.2.fasta --outBamAll data/02_align_to_NC_001806/HSV-Klinik_S2.bam --outBamFiltered data/02_align_to_NC_001806/HSV-Klinik_S2.mapped.bam --aligner novoalign --aligner_options '-r Random -l 20 -g 40 -x 20 -t 100 -k' --threads 120
bin/intrahost.py vphaser_one_sample data/02_align_to_NC_001806/HSV1_S1.mapped.bam refsel_db/NC_001806.2.fasta data/04_intrahost/vphaser2.HSV-Klinik_S2_on_NC_001806.txt.gz --vphaserNumThreads 120 --removeDoublyMappedReads --minReadsEach 5 --maxBias 10
bin/intrahost.py vphaser_one_sample data/02_align_to_NC_001806/HSV-Klinik_S2.mapped.bam refsel_db/NC_001806.2.fasta data/04_intrahost/vphaser2.HSV-Klinik_S2_on_NC_001806.txt.gz --vphaserNumThreads 120 --removeDoublyMappedReads --minReadsEach 5 --maxBias 10
#align to self
#-rw-rw-r-- 1 jhuang jhuang 47M Nov 8 18:24 HSV-Klinik_S2.bam
#-rw-rw-r-- 1 jhuang jhuang 6,3M Nov 8 18:24 HSV-Klinik_S2.mapped.bam
#-rw-rw-r-- 1 jhuang jhuang 74M Nov 8 17:25 HSV1_S1.bam
#-rw-rw-r-- 1 jhuang jhuang 25K Nov 8 17:25 HSV1_S1.mapped.bam
#align to NC_001806
#-rw-rw-r-- 1 jhuang jhuang 48M Nov 11 13:26 HSV-Klinik_S2.bam
#-rw-rw-r-- 1 jhuang jhuang 4,9M Nov 11 13:26 HSV-Klinik_S2.mapped.bam
#-rw-rw-r-- 1 jhuang jhuang 74M Nov 11 13:31 HSV1_S1.bam
#-rw-rw-r-- 1 jhuang jhuang 34K Nov 11 13:31 HSV1_S1.mapped.bam
#align to OP297860
#-rw-rw-r-- 1 jhuang jhuang 47M Nov 12 12:35 HSV-Klinik_S2.bam
#-rw-rw-r-- 1 jhuang jhuang 5,3M Nov 12 12:35 HSV-Klinik_S2.mapped.bam
#-rw-rw-r-- 1 jhuang jhuang 74M Nov 12 12:31 HSV1_S1.bam
#-rw-rw-r-- 1 jhuang jhuang 34K Nov 12 12:31 HSV1_S1.mapped.bam
#align to self
#-rw-rw-r-- 1 jhuang jhuang 47M Nov 11 21:44 HSV-Klinik_S2.bam
#-rw-rw-r-- 1 jhuang jhuang 6,3M Nov 11 21:44 HSV-Klinik_S2.mapped.bam
#-rw-rw-r-- 1 jhuang jhuang 74M Nov 11 21:09 HSV1_S1.bam
#-rw-rw-r-- 1 jhuang jhuang 25K Nov 11 21:09 HSV1_S1.mapped.bam
# -- DEBUG_13 --
[Mon Nov 11 15:36:54 2024]
rule isnvs_vcf:
input: data/04_intrahost/vphaser2.HSV1_S1.txt.gz, data/04_intrahost/vphaser2.HSV-Klinik_S2.txt.gz, data/03_multialign_to_ref/aligned_1.fasta, ref_genome/reference.fasta
output: data/04_intrahost/isnvs.vcf.gz, data/04_intrahost/isnvs.vcf.gz.tbi, data/04_intrahost/isnvs.annot.vcf.gz, data/04_intrahost/isnvs.annot.txt.gz, data/04_intrahost/isnvs.annot.vcf.gz.tbi
jobid: 21
resources: tmpdir=/tmp, mem=4
b'/home/jhuang/miniconda3/envs/viral-ngs4/bin/python\n'
-------
bin/intrahost.py merge_to_vcf ref_genome/reference.fasta data/04_intrahost/isnvs.vcf.gz --samples HSV1_S1 HSV-Klinik_S2 --isnvs data/04_intrahost/vphaser2.HSV1_S1.txt.gz data/04_intrahost/vphaser2.HSV-Klinik_S2.txt.gz --alignments data/03_multialign_to_ref/aligned_1.fasta --strip_chr_version --parse_accession
b'/home/jhuang/miniconda3/envs/viral-ngs4/bin/python\n'
-------
2024-11-11 15:36:55,581 - cmd:193:main_argparse - INFO - software version: 1522433800, python version: 3.6.7 | packaged by conda-forge | (default, Feb 28 2019, 09:07:38)
[GCC 7.3.0]
2024-11-11 15:36:55,581 - cmd:195:main_argparse - INFO - command: bin/intrahost.py merge_to_vcf refFasta=ref_genome/reference.fasta outVcf=data/04_intrahost/isnvs.vcf.gz samples=['HSV1_S1', 'HSV-Klinik_S2'] isnvs=['data/04_intrahost/vphaser2.HSV1_S1.txt.gz', 'data/04_intrahost/vphaser2.HSV-Klinik_S2.txt.gz'] alignments=['data/03_multialign_to_ref/aligned_1.fasta'] strip_chr_version=True naive_filter=False parse_accession=True loglevel=INFO
2024-11-11 15:36:55,581 - intrahost:476:merge_to_vcf - INFO - loaded CoordMapper for all genomes, starting VCF merge...
Traceback (most recent call last):
File "bin/intrahost.py", line 1152, in <module>
util.cmd.main_argparse(__commands__, __doc__)
File "/home/jhuang/Tools/viral-ngs/bin/util/cmd.py", line 221, in main_argparse
ret = args.func_main(args)
File "/home/jhuang/Tools/viral-ngs/bin/util/cmd.py", line 102, in _main
mainfunc(**args2)
File "bin/intrahost.py", line 677, in merge_to_vcf
(sample, (s_pos, samp_offsets[sample]), ref_sequence.id, pos))
NotImplementedError: Sample HSV-Klinik_S2-1 has variants at 2 positions (8704, 8703) mapped to same reference position (AB291960.1:63)
[Mon Nov 11 15:36:56 2024]
Error in rule isnvs_vcf:
jobid: 0
output: data/04_intrahost/isnvs.vcf.gz, data/04_intrahost/isnvs.vcf.gz.tbi, data/04_intrahost/isnvs.annot.vcf.gz, data/04_intrahost/isnvs.annot.txt.gz, data/04_intrahost/isnvs.annot.vcf.gz.tbi
RuleException:
CalledProcessError in line 61 of /mnt/md1/DATA_md1/Data_Nicole_CaptureProbeSequencing/bin/pipes/rules/intrahost.rules:
Command 'set -euo pipefail; bin/intrahost.py merge_to_vcf ref_genome/reference.fasta data/04_intrahost/isnvs.vcf.gz --samples HSV1_S1 HSV-Klinik_S2 --isnvs data/04_intrahost/vphaser2.HSV1_S1.txt.gz data/04_intrahost/vphaser2.HSV-Klinik_S2.txt.gz --alignments data/03_multialign_to_ref/aligned_1.fasta --strip_chr_version --parse_accession' returned non-zero exit status 1.
File "/mnt/md1/DATA_md1/Data_Nicole_CaptureProbeSequencing/bin/pipes/rules/intrahost.rules", line 61, in __rule_isnvs_vcf
File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
Exiting because a job execution failed. Look above for error message
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /mnt/md1/DATA_md1/Data_Nicole_CaptureProbeSequencing/.snakemake/log/2024-11-11T151925.063825.snakemake.log
# --DEBUG_14 --
bin/interhost.py multichr_mafft ref_genome/reference.fasta data/02_assembly/HSV-Klinik_S2.fasta data/03_multialign_to_ref --ep 0.123 --maxiters 1000 --preservecase --localpair --outFilePrefix aligned --sampleNameListFile data/03_multialign_to_ref/sampleNameList.txt --threads 120
bin/read_utils.py bwamem_idxstats inBam=data/01_cleaned/HSV-Klinik_S2.cleaned.bam refFasta=/home/jhuang/REFs/viral_ngs_dbs/spikeins/ercc_spike-ins.fasta outBam=None outStats=reports/spike_count/HSV-Klinik_S2.spike_count.txt min_score_to_filter=60 aligner_options=None loglevel=INFO tmp_dir=/tmp tmp_dirKeep=False loglevel=DEBUG
Assembly results (look what are difference of the four versions 15K vs 73K in ~/DATA/Data_Nicole_CaptureProbeSequencing/tmp/02_assembly)
HSV1_S1.assembly2-gapfilled.fasta vs HSV-Klinik_S2.assembly2-gapfilled.fasta
-rw-rw-r-- 1 jhuang jhuang 15K Nov 8 17:12 HSV1_S1.assembly1-spades.fasta
-rw-rw-r-- 1 jhuang jhuang 155K Nov 8 17:12 HSV1_S1.assembly2-scaffold_ref.fasta
-rw-rw-r-- 1 jhuang jhuang 130K Nov 8 17:12 HSV1_S1.assembly2-scaffolded.fasta
-rw-rw-r-- 1 jhuang jhuang 176 Nov 8 17:12 HSV1_S1.assembly2-alternate_sequences.fasta
-rw-rw-r-- 1 jhuang jhuang 130K Nov 8 17:14 HSV1_S1.assembly2-gapfilled.fasta
-rw-rw-r-- 1 jhuang jhuang 26 Nov 8 17:18 HSV1_S1.assembly3-modify.fasta.fai
-rw-rw-r-- 1 jhuang jhuang 182 Nov 8 17:18 HSV1_S1.assembly3-modify.dict
-rw-r--r-- 1 jhuang jhuang 1,7M Nov 8 17:18 HSV1_S1.assembly3-modify.nix
-rw-rw-r-- 1 jhuang jhuang 155K Nov 8 17:18 HSV1_S1.assembly3-modify.fasta
-rw-rw-r-- 1 jhuang jhuang 212 Nov 8 17:21 HSV1_S1.assembly3.vcf.gz.tbi
-rw-rw-r-- 1 jhuang jhuang 183 Nov 8 17:21 HSV1_S1.assembly4-refined.dict
-rw-rw-r-- 1 jhuang jhuang 26 Nov 8 17:21 HSV1_S1.assembly4-refined.fasta.fai
-rw-r--r-- 1 jhuang jhuang 1,2M Nov 8 17:21 HSV1_S1.assembly4-refined.nix
-rw-rw-r-- 1 jhuang jhuang 137K Nov 8 17:21 HSV1_S1.assembly4-refined.fasta
-rw-rw-r-- 1 jhuang jhuang 494K Nov 8 17:21 HSV1_S1.assembly3.vcf.gz
-rw-rw-r-- 1 jhuang jhuang 203 Nov 8 17:22 HSV1_S1.assembly4.vcf.gz.tbi
-rw-rw-r-- 1 jhuang jhuang 428K Nov 8 17:22 HSV1_S1.assembly4.vcf.gz
-rw-rw-r-- 1 jhuang jhuang 73K Nov 8 18:03 HSV-Klinik_S2.assembly1-spades.fasta
-rw-rw-r-- 1 jhuang jhuang 144K Nov 8 18:03 HSV-Klinik_S2.assembly2-scaffolded.fasta
-rw-rw-r-- 1 jhuang jhuang 0 Nov 8 18:03 HSV-Klinik_S2.assembly2-alternate_sequences.fasta
-rw-rw-r-- 1 jhuang jhuang 155K Nov 8 18:03 HSV-Klinik_S2.assembly2-scaffold_ref.fasta
-rw-rw-r-- 1 jhuang jhuang 144K Nov 8 18:07 HSV-Klinik_S2.assembly2-gapfilled.fasta
-rw-rw-r-- 1 jhuang jhuang 32 Nov 8 18:12 HSV-Klinik_S2.assembly3-modify.fasta.fai
-rw-rw-r-- 1 jhuang jhuang 194 Nov 8 18:12 HSV-Klinik_S2.assembly3-modify.dict
-rw-r--r-- 1 jhuang jhuang 1,7M Nov 8 18:12 HSV-Klinik_S2.assembly3-modify.nix
-rw-rw-r-- 1 jhuang jhuang 155K Nov 8 18:12 HSV-Klinik_S2.assembly3-modify.fasta
draw coverages
* Mapping the contig on the reference JX878414
bowtie2-build refsel_db/refsel.fasta refsel_index
#spades/contigs.fasta
#bowtie2 -f -x refsel_index -U HSV1_S1_vrap_out/HSV1_S1_contigs.fasta -N 1 --score-min L,0,-1 --rdg 5,3 --rfg 5,3 -S HSV1_S1_contigs_aligned.sam
bowtie2 -f -x refsel_index -U HSV1_S1_vrap_out/HSV1_S1_contigs.fasta -S HSV1_S1_contigs_aligned.sam
samtools view -bS -F 4 HSV1_S1_contigs_aligned.sam > HSV1_S1_contigs_aligned.bam
#samtools view -S -b HSV1_S1_contigs_aligned.sam > HSV1_S1_contigs_aligned.bam
samtools sort HSV1_S1_contigs_aligned.bam -o HSV1_S1_contigs_aligned_sorted.bam
samtools index HSV1_S1_contigs_aligned_sorted.bam
samtools view HSV1_S1_contigs_aligned_sorted.bam > HSV1_S1_contigs_aligned_sorted.sam
Query sequence name Query length ORF density Percentage identity Subject sequence length Subject accession Subject name E-value
#TODO: Analyis in next time consider keep the column of query_coverage for quality control?
#2486 reads; of these:
# 2486 (100.00%) were unpaired; of these:
# 2407 (96.82%) aligned 0 times
# 79 (3.18%) aligned exactly 1 time
# 0 (0.00%) aligned >1 times
#3.18% overall alignment rate
11 reads; of these:
11 (100.00%) were unpaired; of these:
8 (72.73%) aligned 0 times
3 (27.27%) aligned exactly 1 time
0 (0.00%) aligned >1 times
27.27% overall alignment rate
NODE_14_length_862_cov_192.742857
NODE_19_length_621_cov_61.380567
CAP_16_length_559
gi|946552631|gb|KT425109.1| Human alphaherpesvirus 1 strain KOS79
gi|2549839763|gb|OQ724891.1| Human alphaherpesvirus 1 strain BP-K5
gi|2228071600|gb|ON007132.1| Human alphaherpesvirus 1 strain v40_unk_gen
samtools faidx HSV1_S1_contigs.fasta 'NODE_14_length_862_cov_192.742857' > HSV1_S1_contigs_.fasta
samtools faidx HSV1_S1_contigs.fasta 'NODE_19_length_621_cov_61.380567' >> HSV1_S1_contigs_.fasta
samtools faidx HSV1_S1_contigs.fasta 'CAP_16_length_559' >> HSV1_S1_contigs_.fasta
bowtie2 -f -x refsel_index -U HSV-Klinik_S2_vrap_out/HSV-Klinik_S2_contigs.fasta -S HSV-Klinik_S2_contigs_aligned.sam
samtools view -bS -F 4 HSV-Klinik_S2_contigs_aligned.sam > HSV-Klinik_S2_contigs_aligned.bam
#samtools view -S -b HSV-Klinik_S2_contigs_aligned.sam > HSV-Klinik_S2_contigs_aligned.bam
samtools sort HSV-Klinik_S2_contigs_aligned.bam -o HSV-Klinik_S2_contigs_aligned_sorted.bam
samtools index HSV-Klinik_S2_contigs_aligned_sorted.bam
samtools view HSV-Klinik_S2_contigs_aligned_sorted.bam > HSV-Klinik_S2_contigs_aligned_sorted.sam
31 reads; of these:
31 (100.00%) were unpaired; of these:
8 (25.81%) aligned 0 times
21 (67.74%) aligned exactly 1 time
2 (6.45%) aligned >1 times
74.19% overall alignment rate
samtools faidx HSV-Klinik_S2_contigs.fasta 'NODE_14_length_2544_cov_467.428217' > HSV-Klinik_S2_contigs_.fasta
samtools faidx HSV-Klinik_S2_contigs.fasta 'NODE_81_length_1225_cov_1080.820583' >> HSV-Klinik_S2_contigs_.fasta
samtools faidx HSV-Klinik_S2_contigs.fasta 'NODE_114_length_1046_cov_1018.474429' >> HSV-Klinik_S2_contigs_.fasta
samtools faidx HSV-Klinik_S2_contigs.fasta 'NODE_117_length_1033_cov_1618.421858' >> HSV-Klinik_S2_contigs_.fasta
samtools faidx HSV-Klinik_S2_contigs.fasta 'NODE_152_length_927_cov_105.347500' >> HSV-Klinik_S2_contigs_.fasta
samtools faidx HSV-Klinik_S2_contigs.fasta 'NODE_161_length_900_cov_3.283312' >> HSV-Klinik_S2_contigs_.fasta
samtools faidx HSV-Klinik_S2_contigs.fasta 'NODE_220_length_795_cov_0.748879' >> HSV-Klinik_S2_contigs_.fasta
samtools faidx HSV-Klinik_S2_contigs.fasta 'NODE_245_length_763_cov_900.518868' >> HSV-Klinik_S2_contigs_.fasta
samtools faidx HSV-Klinik_S2_contigs.fasta 'NODE_352_length_664_cov_61.363128' >> HSV-Klinik_S2_contigs_.fasta
samtools faidx HSV-Klinik_S2_contigs.fasta 'NODE_368_length_644_cov_489.846591' >> HSV-Klinik_S2_contigs_.fasta
samtools faidx HSV-Klinik_S2_contigs.fasta 'NODE_373_length_653_cov_0.340304' >> HSV-Klinik_S2_contigs_.fasta
samtools faidx HSV-Klinik_S2_contigs.fasta 'NODE_414_length_634_cov_2501.944773' >> HSV-Klinik_S2_contigs_.fasta
samtools faidx HSV-Klinik_S2_contigs.fasta 'NODE_626_length_568_cov_1.630385' >> HSV-Klinik_S2_contigs_.fasta
samtools faidx HSV-Klinik_S2_contigs.fasta 'NODE_1026_length_506_cov_2.593668' >> HSV-Klinik_S2_contigs_.fasta
samtools faidx HSV-Klinik_S2_contigs.fasta 'CAP_7_length_1389' >> HSV-Klinik_S2_contigs_.fasta
samtools faidx HSV-Klinik_S2_contigs.fasta 'CAP_8_length_1267' >> HSV-Klinik_S2_contigs_.fasta
samtools faidx HSV-Klinik_S2_contigs.fasta 'CAP_9_length_1581' >> HSV-Klinik_S2_contigs_.fasta
samtools faidx HSV-Klinik_S2_contigs.fasta 'CAP_18_length_896' >> HSV-Klinik_S2_contigs_.fasta
samtools faidx HSV-Klinik_S2_contigs.fasta 'CAP_25_length_841' >> HSV-Klinik_S2_contigs_.fasta
samtools faidx HSV-Klinik_S2_contigs.fasta 'CAP_52_length_1849' >> HSV-Klinik_S2_contigs_.fasta
samtools faidx HSV-Klinik_S2_contigs.fasta 'CAP_53_length_665' >> HSV-Klinik_S2_contigs_.fasta
samtools faidx HSV-Klinik_S2_contigs.fasta 'CAP_54_length_820' >> HSV-Klinik_S2_contigs_.fasta
samtools faidx HSV-Klinik_S2_contigs.fasta 'CAP_56_length_1189' >> HSV-Klinik_S2_contigs_.fasta
gi|1059802459|gb|LT594105.1|
gi|2315197778|gb|OP297860.1|
gi|2549840487|gb|OQ724911.1|
gi|1059802767|gb|LT594109.1|
gi|2620238293|gb|OR771685.1|
gi|2620238293|gb|OR771685.1|
gi|2315199769|gb|OP297886.1|
gi|2549841171|gb|OQ724933.1|
gi|2620238293|gb|OR771685.1|
gi|1809626902|gb|MN925871.1|
gi|2618798953|gb|OR723971.1|
gi|2315197778|gb|OP297860.1|
gi|2277963097|gb|ON960059.1|
gi|2620238293|gb|OR771685.1|
gi|2549599151|gb|OQ724836.1|
gi|1717903527|gb|MN136523.1|
gi|1059802459|gb|LT594105.1|
gi|2549841171|gb|OQ724933.1|
gi|2315197778|gb|OP297860.1|
gi|2620238293|gb|OR771685.1|
gi|2315197778|gb|OP297860.1|
gi|1809626902|gb|MN925871.1|
Human herpesvirus 1 isolate 172_2010 genome assembly
Human alphaherpesvirus 1 strain HSV1-v60_d3_cu_gen_les
Human alphaherpesvirus 1 strain BP-K12
Human herpesvirus 1 isolate 270_2007 genome assembly
Human alphaherpesvirus 1 isolate HSV1/USA/WA-UW-2L9/2020
Human alphaherpesvirus 1 isolate HSV1/USA/WA-UW-2L9/2020
Human alphaherpesvirus 1 strain HSV1-v72_d53_cu_gen_les
Human alphaherpesvirus 1 strain BP-L2
Human alphaherpesvirus 1 isolate HSV1/USA/WA-UW-2L9/2020
UNVERIFIED: Human alphaherpesvirus 1 strain Sample4_DOCK8
Mutant Human alphaherpesvirus 1 isolate dsncRNA12
Human alphaherpesvirus 1 strain HSV1-v60_d3_cu_gen_les
Human alphaherpesvirus 1 strain HSV1-San-Francisco-USA-1974-HTZ
Human alphaherpesvirus 1 isolate HSV1/USA/WA-UW-2L9/2020
UNVERIFIED: Human alphaherpesvirus 1 strain BP-C8
Human alphaherpesvirus 1 strain MacIntyre
Human herpesvirus 1 isolate 172_2010 genome assembly
Human alphaherpesvirus 1 strain BP-L2
Human alphaherpesvirus 1 strain HSV1-v60_d3_cu_gen_les
Human alphaherpesvirus 1 isolate HSV1/USA/WA-UW-2L9/2020
Human alphaherpesvirus 1 strain HSV1-v60_d3_cu_gen_les
UNVERIFIED: Human alphaherpesvirus 1 strain Sample4_DOCK8
Human alphaherpesvirus 1 strain HSV1-v67_d346_cu_gen_les
#-->OR771685.1
#8278 reads; of these:
# 8278 (100.00%) were unpaired; of these:
# 3775 (45.60%) aligned 0 times
# 4500 (54.36%) aligned exactly 1 time
# 3 (0.04%) aligned >1 times
#54.40% overall alignment rate
* Generate Coverage Profile for Reads (from Fastq): Align the trimmed fastq reads to the reference genome using a mapper like BWA or Bowtie2 (WRONG), we should use novoalign
#bwa index refsel_db/refsel.fasta
#bwa mem refsel_db/refsel.fasta trimmed/HSV1_S1_R1.fastq.gz trimmed/HSV1_S1_R2.fastq.gz > HSV1_S1_reads_aligned.sam
#samtools view -Sb HSV1_S1_reads_aligned.sam | samtools sort -o HSV1_S1_reads_aligned_sorted.bam
#samtools index HSV1_S1_reads_aligned_sorted.bam
#bwa mem refsel_db/refsel.fasta trimmed/HSV-Klinik_S2_R1.fastq.gz trimmed/HSV-Klinik_S2_R2.fastq.gz > HSV-Klinik_S2_reads_aligned.sam
#samtools view -Sb HSV-Klinik_S2_reads_aligned.sam | samtools sort -o HSV-Klinik_S2_reads_aligned_sorted.bam
#samtools index HSV-Klinik_S2_reads_aligned_sorted.bam
cd data
mkdir 02_align_to_OP297860
../bin/read_utils.py align_and_fix 01_per_sample/HSV1_S1.cleaned.bam ../refsel_db/refsel.fasta --outBamAll 02_align_to_OP297860/HSV1_S1.bam --outBamFiltered 02_align_to_OP297860/HSV1_S1.mapped.bam --aligner novoalign --aligner_options '-r Random -l 20 -g 40 -x 20 -t 100 -k' --threads 120
../bin/read_utils.py align_and_fix 01_per_sample/HSV-Klinik_S2.cleaned.bam ../refsel_db/refsel.fasta --outBamAll 02_align_to_OP297860/HSV-Klinik_S2.bam --outBamFiltered 02_align_to_OP297860/HSV-Klinik_S2.mapped.bam --aligner novoalign --aligner_options '-r Random -l 20 -g 40 -x 20 -t 100 -k' --threads 120
samtools sort 02_align_to_OP297860/HSV1_S1.mapped.bam -o HSV1_S1_reads_aligned_sorted.bam
samtools index HSV1_S1_reads_aligned_sorted.bam
samtools sort 02_align_to_OP297860/HSV-Klinik_S2.mapped.bam -o HSV-Klinik_S2_reads_aligned_sorted.bam
samtools index HSV-Klinik_S2_reads_aligned_sorted.bam
mv 02_align_to_OP297860/*.bam ..
rmdir 02_align_to_OP297860
* Generate Coverage Tracks: Use BamCoverage to generate coverage files (in bigWig format) for both the reads and contigs.
#find . -name "*_aligned_sorted.bam"
bamCoverage -b ./HSV1_S1_reads_aligned_sorted.bam -o HSV1_S1_reads_coverage.bw
bamCoverage -b ./HSV1_S1_contigs_aligned_sorted.bam -o HSV1_S1_contigs_coverage.bw
bamCoverage -b ./HSV-Klinik_S2_reads_aligned_sorted.bam -o HSV-Klinik_S2_reads_coverage.bw
bamCoverage -b ./HSV-Klinik_S2_contigs_aligned_sorted.bam -o HSV-Klinik_S2_contigs_coverage.bw
* Visualize Alignments: Use tools like IGV (Integrative Genomics Viewer)
Reproduce 03_multialign_to_ref by generating consensus fasta
#bedtools bamtobed -i HSV-Klinik_S2_contigs_aligned_sorted.bam > contigs.bed
bedtools bamtobed -i HSV1_S1_vrap_out_v5/bowtie/mapped_sorted.bam > contigs.bed
bedtools merge -i contigs.bed > merged_contigs_coverage.bed
awk '{sum += $3 - $2} END {print sum}' merged_contigs_coverage.bed
#20916
#generate alignment form contigs.bam and refsel.fasta
bcftools mpileup -f refsel_db/refsel.fasta -d 1000000 HSV-Klinik_S2_contigs_aligned_sorted.bam | bcftools call -mv --ploidy 1 -Ov -o contigs_variants.vcf
bgzip contigs_variants.vcf
tabix -p vcf contigs_variants.vcf.gz
cat refsel_db/refsel.fasta | bcftools consensus contigs_variants.vcf.gz > aligned_contigs_to_reference.fasta
# tabix -p vcf contigs_variants.vcf.gz
# cat refsel_db/refsel.fasta | bcftools consensus contigs_variants.vcf.gz > aligned_contigs_to_reference.fasta
#Note: the --sample option not given, applying all records regardless of the genotype
#Applied 30 variants
cat refsel_db/refsel.fasta aligned_contigs_to_reference.fasta > aligned_1.fasta
#Header of the 2nd record is >HSV-Klinik_S2-1
mafft aligned_1.fasta | sed '/^>/! s/[a-z]/\U&/g' > data/03_multialign_to_ref/aligned_1.fasta
Reproduce 04_intrahost, #DEBUG_IMPORTANT_NOT_SAME_BETWEEN_VPHASER2_AND_FREEBAYES: why not intrahost variant calling not having the frequencies between 0.2 and 0.8. The list is also total different to the results from freebayes. try different combination of ""--removeDoublyMappedReads --minReadsEach 5 --maxBias 0"
awk '$6 >= 0.05' isnvs.annot.txt > isnvs.annot_.txt
chr pos sample patient time alleles iSNV_freq Hw Hs eff_type eff_codon_dna eff_aa eff_aa_pos eff_prot_len eff_gene eff_protein
* OP297860 13203 HSV-Klinik_S2 HSV-Klinik_S2 T,C,A 1 0.0165025249227804 1 synonymous_variant,intragenic_variant 1614A>G,1614A>T,n.13203T>C,n.13203T>A Val538Val 538 882 UL5 UXY89136.1,Gene_11440_14815
* OP297860 47109 HSV-Klinik_S2 HSV-Klinik_S2 T,G 0.992975413948821 0.0139504824839776 1 missense_variant 1126A>C Asn376His 376 376 UL23 UXY89153.1
OP297860 47989 HSV-Klinik_S2 HSV-Klinik_S2 G,T 0.0537303216974675 0.101686748455508 1 synonymous_variant 246C>A Pro82Pro 82 376 UL23 UXY89153.1
OP297860 55501 HSV-Klinik_S2 HSV-Klinik_S2 T,C,A 1 0.0204843614284831 1 synonymous_variant,intragenic_variant 720A>G,720A>T,n.55501T>C,n.55501T>A Ala240Ala 240 904 UL27,UL28 UXY89158.1,Gene_53483_58584
OP297860 55807 HSV-Klinik_S2 HSV-Klinik_S2 C,A 0.0622837370242215 0.116808946253038 1 missense_variant,intragenic_variant 414G>T,n.55807C>A Glu138Asp 138 904 UL27,UL28 UXY89158.1,Gene_53483_58584
* OP297860 65225 HSV-Klinik_S2 HSV-Klinik_S2 G,A 0.891530460624071 0.193407796807005 1 intragenic_variant n.65225G>A UL30 Gene_63070_67475
* OP297860 65402 HSV-Klinik_S2 HSV-Klinik_S2 C,A 0.102222222222222 0.183545679012346 1 intragenic_variant n.65402C>A UL30 Gene_63070_67475
OP297860 66570 HSV-Klinik_S2 HSV-Klinik_S2 G,T 0.0518433179723502 0.0983111767079359 1 intragenic_variant n.66570G>T UL30 Gene_63070_67475
OP297860 94750 HSV-Klinik_S2 HSV-Klinik_S2 G,T 0.0528511821974965 0.100115869475647 1 missense_variant 108G>T Gln36His 36 488 UL42 UXY89171.1
samtools faidx aligned_1.fasta "OP297860.1":13203-13203 #T
samtools faidx aligned_1.fasta HSV-Klinik_S2-1:13203-13203 #T
samtools faidx aligned_1.fasta "OP297860.1":47109-47109 #T
samtools faidx aligned_1.fasta HSV-Klinik_S2-1:47109-47109 #T
samtools faidx aligned_1.fasta "OP297860.1":47989-47989 #G
samtools faidx aligned_1.fasta HSV-Klinik_S2-1:47989-47989 #G
samtools faidx aligned_1.fasta "OP297860.1":65225-65225 #G
samtools faidx aligned_1.fasta HSV-Klinik_S2-1:65225-65225 #A
#DEBUG_IMPORTANT_NOT_SAME_BETWEEN_VPHASER2_AND_FREEBAYES: why not intrahost variant calling not located in 0.6, 0.4
vim bin/tools/vphaser2.py # set w=25000
rm -rf data/04_intrahost
snakemake --printshellcmds --cores 10
samtools index data/02_align_to_self/HSV1_S1.mapped.bam
samtools index data/02_align_to_self/HSV-Klinik_S2.mapped.bam
bin/interhost.py multichr_mafft ref_genome/reference.fasta data/02_assembly/HSV-Klinik_S2.fasta data/03_multialign_to_ref --ep 0.123 --maxiters 1000 --preservecase --localpair --outFilePrefix aligned --sampleNameListFile data/03_multialign_to_ref/sampleNameList.txt --threads 120 -loglevel DEBUG
#interhost variant calling, the number below should be not the same to the intrahost variant calling (the varaints from the isolate to its consensus assemby, this is why the frequency theoretically under 0.5. In intrahost variant calling, the REF refers to the base OP297860.1. It is possible that a ALT has 90% in the clinical samples --> All positions with > 0.5 means the consensus sequences are different to the CHROM. The frequences varies 0.00000001 to 1.0, since if the frequences with 0.0 will be not reported.)
#The contigs contains a lot of positions wrongly assembled, so it is actually only much fewer following positions are interhost variants.
samtools index HSV1_S1.mapped.bam
samtools index HSV-Klinik_S2.mapped.bam
bin/intrahost.py vphaser_one_sample data/02_align_to_self/HSV1_S1.mapped.bam data/02_assembly/HSV1_S1.fasta data/04_intrahost/vphaser2.HSV1_S1.txt.gz --vphaserNumThreads 120 --minReadsEach 5 --maxBias 0 --loglevel DEBUG
awk '$7 >= 5' vphaser2.HSV-Klinik_S2_v2.txt > vphaser2.HSV-Klinik_S2_v2_.txt
bin/intrahost.py vphaser_one_sample data/02_align_to_self/HSV-Klinik_S2.mapped.bam data/02_assembly/HSV-Klinik_S2.fasta data/04_intrahost/vphaser2.HSV1_S1.txt.gz --vphaserNumThreads 120 --removeDoublyMappedReads --minReadsEach 5 --maxBias 0 --loglevel DEBUG
Manully complete the assemblies with the reference genome and recreated 02_assembly, then rerun the pipelines for the steps after 02_align_to_self
~/Scripts/convert_fasta_to_clustal.py aligned_1.fasta_orig aligned_1.aln
~/Scripts/convert_clustal_to_clustal.py aligned_1.aln aligned_1_.aln
#manully delete the postion with all or '-' in aligned_1_.aln
~/Scripts/check_sequence_differences.py aligned_1_.aln
#Differences found at the following positions (150):
Position 8956: OP297860.1 = A, HSV1_S1-1 = A, HSV-Klinik_S2-1 = G
Position 8991: OP297860.1 = A, HSV1_S1-1 = A, HSV-Klinik_S2-1 = C
Position 8992: OP297860.1 = T, HSV1_S1-1 = C, HSV-Klinik_S2-1 = C
Position 8995: OP297860.1 = T, HSV1_S1-1 = T, HSV-Klinik_S2-1 = C
Position 9190: OP297860.1 = T, HSV1_S1-1 = A, HSV-Klinik_S2-1 = T
Position 9294: OP297860.1 = A, HSV1_S1-1 = G, HSV-Klinik_S2-1 = A
Position 9298: OP297860.1 = A, HSV1_S1-1 = G, HSV-Klinik_S2-1 = A
Position 9319: OP297860.1 = A, HSV1_S1-1 = G, HSV-Klinik_S2-1 = A
Position 9324: OP297860.1 = T, HSV1_S1-1 = G, HSV-Klinik_S2-1 = T
Position 9352: OP297860.1 = C, HSV1_S1-1 = T, HSV-Klinik_S2-1 = C
Position 9368: OP297860.1 = A, HSV1_S1-1 = G, HSV-Klinik_S2-1 = A
Position 10036: OP297860.1 = G, HSV1_S1-1 = G, HSV-Klinik_S2-1 = T
Position 12006: OP297860.1 = C, HSV1_S1-1 = T, HSV-Klinik_S2-1 = C
Position 12131: OP297860.1 = C, HSV1_S1-1 = M, HSV-Klinik_S2-1 = C
Position 12748: OP297860.1 = C, HSV1_S1-1 = C, HSV-Klinik_S2-1 = A
Position 12753: OP297860.1 = A, HSV1_S1-1 = G, HSV-Klinik_S2-1 = A
* Position 13203: OP297860.1 = T, HSV1_S1-1 = C, HSV-Klinik_S2-1 = C
* Position 13522: OP297860.1 = G, HSV1_S1-1 = T, HSV-Klinik_S2-1 = G
Position 13557: OP297860.1 = A, HSV1_S1-1 = G, HSV-Klinik_S2-1 = A
Position 13637: OP297860.1 = A, HSV1_S1-1 = G, HSV-Klinik_S2-1 = A
* Position 13659: OP297860.1 = G, HSV1_S1-1 = T, HSV-Klinik_S2-1 = G
Position 13731: OP297860.1 = T, HSV1_S1-1 = C, HSV-Klinik_S2-1 = T
Position 13755: OP297860.1 = G, HSV1_S1-1 = G, HSV-Klinik_S2-1 = A
Position 13778: OP297860.1 = T, HSV1_S1-1 = C, HSV-Klinik_S2-1 = T
Position 14835: OP297860.1 = A, HSV1_S1-1 = A, HSV-Klinik_S2-1 = G
Position 34549: OP297860.1 = C, HSV1_S1-1 = C, HSV-Klinik_S2-1 = T
Position 34705: OP297860.1 = G, HSV1_S1-1 = G, HSV-Klinik_S2-1 = A
Position 41118: OP297860.1 = G, HSV1_S1-1 = G, HSV-Klinik_S2-1 = A
Position 41422: OP297860.1 = G, HSV1_S1-1 = G, HSV-Klinik_S2-1 = A
Position 44110: OP297860.1 = A, HSV1_S1-1 = A, HSV-Klinik_S2-1 = C
Position 44137: OP297860.1 = T, HSV1_S1-1 = T, HSV-Klinik_S2-1 = C
Position 44190: OP297860.1 = G, HSV1_S1-1 = G, HSV-Klinik_S2-1 = A
Position 44227: OP297860.1 = C, HSV1_S1-1 = C, HSV-Klinik_S2-1 = G
Position 44295: OP297860.1 = A, HSV1_S1-1 = A, HSV-Klinik_S2-1 = C
Position 46861: OP297860.1 = T, HSV1_S1-1 = T, HSV-Klinik_S2-1 = C
# Position 47109: OP297860.1 = T, HSV1_S1-1 = T, HSV-Klinik_S2-1 = G
Position 47170: OP297860.1 = G, HSV1_S1-1 = T, HSV-Klinik_S2-1 = T
Position 47182: OP297860.1 = G, HSV1_S1-1 = G, HSV-Klinik_S2-1 = A
Position 47320: OP297860.1 = A, HSV1_S1-1 = A, HSV-Klinik_S2-1 = G
Position 47375: OP297860.1 = G, HSV1_S1-1 = A, HSV-Klinik_S2-1 = G
Position 47377: OP297860.1 = G, HSV1_S1-1 = G, HSV-Klinik_S2-1 = T
Position 47393: OP297860.1 = T, HSV1_S1-1 = C, HSV-Klinik_S2-1 = C
Position 47433: OP297860.1 = G, HSV1_S1-1 = G, HSV-Klinik_S2-1 = T
Position 47436: OP297860.1 = C, HSV1_S1-1 = C, HSV-Klinik_S2-1 = A
Position 47484: OP297860.1 = C, HSV1_S1-1 = C, HSV-Klinik_S2-1 = A
Position 47516: OP297860.1 = T, HSV1_S1-1 = C, HSV-Klinik_S2-1 = C
Position 47563: OP297860.1 = G, HSV1_S1-1 = G, HSV-Klinik_S2-1 = A
Position 47660: OP297860.1 = G, HSV1_S1-1 = G, HSV-Klinik_S2-1 = A
Position 47707: OP297860.1 = T, HSV1_S1-1 = T, HSV-Klinik_S2-1 = C
Position 47722: OP297860.1 = A, HSV1_S1-1 = G, HSV-Klinik_S2-1 = G
* Position 47969: OP297860.1 = C, HSV1_S1-1 = T, HSV-Klinik_S2-1 = C
Position 48064: OP297860.1 = G, HSV1_S1-1 = A, HSV-Klinik_S2-1 = A
Position 48113: OP297860.1 = C, HSV1_S1-1 = C, HSV-Klinik_S2-1 = T
Position 48129: OP297860.1 = T, HSV1_S1-1 = C, HSV-Klinik_S2-1 = C
Position 48167: OP297860.1 = T, HSV1_S1-1 = C, HSV-Klinik_S2-1 = C
Position 48219: OP297860.1 = A, HSV1_S1-1 = A, HSV-Klinik_S2-1 = C
Position 48255: OP297860.1 = G, HSV1_S1-1 = T, HSV-Klinik_S2-1 = G
Position 48384: OP297860.1 = C, HSV1_S1-1 = G, HSV-Klinik_S2-1 = C
Position 53216: OP297860.1 = G, HSV1_S1-1 = G, HSV-Klinik_S2-1 = T
Position 53254: OP297860.1 = C, HSV1_S1-1 = G, HSV-Klinik_S2-1 = C
Position 53265: OP297860.1 = G, HSV1_S1-1 = T, HSV-Klinik_S2-1 = G
Position 53291: OP297860.1 = C, HSV1_S1-1 = T, HSV-Klinik_S2-1 = C
Position 53298: OP297860.1 = A, HSV1_S1-1 = G, HSV-Klinik_S2-1 = G
Position 53403: OP297860.1 = C, HSV1_S1-1 = T, HSV-Klinik_S2-1 = C
Position 53423: OP297860.1 = C, HSV1_S1-1 = C, HSV-Klinik_S2-1 = T
Position 53445: OP297860.1 = C, HSV1_S1-1 = A, HSV-Klinik_S2-1 = C
Position 53450: OP297860.1 = C, HSV1_S1-1 = A, HSV-Klinik_S2-1 = C
Position 53460: OP297860.1 = A, HSV1_S1-1 = G, HSV-Klinik_S2-1 = A
Position 53659: OP297860.1 = A, HSV1_S1-1 = C, HSV-Klinik_S2-1 = A
* Position 53691: OP297860.1 = G, HSV1_S1-1 = T, HSV-Klinik_S2-1 = G
Position 54007: OP297860.1 = G, HSV1_S1-1 = G, HSV-Klinik_S2-1 = A
Position 54013: OP297860.1 = G, HSV1_S1-1 = G, HSV-Klinik_S2-1 = A
Position 54025: OP297860.1 = A, HSV1_S1-1 = A, HSV-Klinik_S2-1 = G
Position 54073: OP297860.1 = G, HSV1_S1-1 = G, HSV-Klinik_S2-1 = A
Position 54408: OP297860.1 = T, HSV1_S1-1 = T, HSV-Klinik_S2-1 = G
Position 54568: OP297860.1 = G, HSV1_S1-1 = G, HSV-Klinik_S2-1 = T
Position 54708: OP297860.1 = T, HSV1_S1-1 = T, HSV-Klinik_S2-1 = G
Position 54709: OP297860.1 = G, HSV1_S1-1 = G, HSV-Klinik_S2-1 = A
* Position 55501: OP297860.1 = T, HSV1_S1-1 = C, HSV-Klinik_S2-1 = C
Position 55507: OP297860.1 = G, HSV1_S1-1 = A, HSV-Klinik_S2-1 = G
Position 55543: OP297860.1 = G, HSV1_S1-1 = A, HSV-Klinik_S2-1 = G
Position 56493: OP297860.1 = A, HSV1_S1-1 = A, HSV-Klinik_S2-1 = G
Position 56753: OP297860.1 = G, HSV1_S1-1 = T, HSV-Klinik_S2-1 = G
Position 56981: OP297860.1 = A, HSV1_S1-1 = A, HSV-Klinik_S2-1 = G
Position 58075: OP297860.1 = A, HSV1_S1-1 = A, HSV-Klinik_S2-1 = G
Position 58078: OP297860.1 = C, HSV1_S1-1 = C, HSV-Klinik_S2-1 = T
Position 58526: OP297860.1 = T, HSV1_S1-1 = C, HSV-Klinik_S2-1 = C
Position 58550: OP297860.1 = C, HSV1_S1-1 = C, HSV-Klinik_S2-1 = A
Position 58604: OP297860.1 = C, HSV1_S1-1 = C, HSV-Klinik_S2-1 = A
Position 58615: OP297860.1 = C, HSV1_S1-1 = C, HSV-Klinik_S2-1 = A
Position 58789: OP297860.1 = C, HSV1_S1-1 = C, HSV-Klinik_S2-1 = A
* Position 63248: OP297860.1 = G, HSV1_S1-1 = T, HSV-Klinik_S2-1 = G
Position 63799: OP297860.1 = T, HSV1_S1-1 = C, HSV-Klinik_S2-1 = T
* Position 64328: OP297860.1 = C, HSV1_S1-1 = A, HSV-Klinik_S2-1 = C
Position 65179: OP297860.1 = T, HSV1_S1-1 = T, HSV-Klinik_S2-1 = C
* Position 65225: OP297860.1 = G, HSV1_S1-1 = G, HSV-Klinik_S2-1 = A
Position 65992: OP297860.1 = T, HSV1_S1-1 = C, HSV-Klinik_S2-1 = C
Position 66677: OP297860.1 = G, HSV1_S1-1 = G, HSV-Klinik_S2-1 = A
Position 67336: OP297860.1 = G, HSV1_S1-1 = G, HSV-Klinik_S2-1 = A
Position 87848: OP297860.1 = T, HSV1_S1-1 = T, HSV-Klinik_S2-1 = C
Position 87866: OP297860.1 = C, HSV1_S1-1 = C, HSV-Klinik_S2-1 = G
Position 87942: OP297860.1 = C, HSV1_S1-1 = C, HSV-Klinik_S2-1 = G
Position 87949: OP297860.1 = A, HSV1_S1-1 = A, HSV-Klinik_S2-1 = G
* Position 95302: OP297860.1 = C, HSV1_S1-1 = A, HSV-Klinik_S2-1 = C
Position 95320: OP297860.1 = G, HSV1_S1-1 = K, HSV-Klinik_S2-1 = G
Position 95992: OP297860.1 = G, HSV1_S1-1 = A, HSV-Klinik_S2-1 = G
Position 96124: OP297860.1 = G, HSV1_S1-1 = A, HSV-Klinik_S2-1 = G
Position 96138: OP297860.1 = T, HSV1_S1-1 = C, HSV-Klinik_S2-1 = T
Position 96145: OP297860.1 = C, HSV1_S1-1 = A, HSV-Klinik_S2-1 = C
Position 100159: OP297860.1 = T, HSV1_S1-1 = T, HSV-Klinik_S2-1 = C
Position 107885: OP297860.1 = C, HSV1_S1-1 = C, HSV-Klinik_S2-1 = G
Position 114972: OP297860.1 = G, HSV1_S1-1 = G, HSV-Klinik_S2-1 = T
Position 117663: OP297860.1 = T, HSV1_S1-1 = T, HSV-Klinik_S2-1 = C
Position 117802: OP297860.1 = T, HSV1_S1-1 = T, HSV-Klinik_S2-1 = A
Position 117834: OP297860.1 = C, HSV1_S1-1 = C, HSV-Klinik_S2-1 = A
Position 117841: OP297860.1 = A, HSV1_S1-1 = A, HSV-Klinik_S2-1 = G
Position 118616: OP297860.1 = T, HSV1_S1-1 = T, HSV-Klinik_S2-1 = G
Position 119486: OP297860.1 = A, HSV1_S1-1 = A, HSV-Klinik_S2-1 = G
Position 119519: OP297860.1 = A, HSV1_S1-1 = A, HSV-Klinik_S2-1 = G
Position 120688: OP297860.1 = T, HSV1_S1-1 = T, HSV-Klinik_S2-1 = C
Position 120690: OP297860.1 = G, HSV1_S1-1 = G, HSV-Klinik_S2-1 = T
Position 120711: OP297860.1 = T, HSV1_S1-1 = T, HSV-Klinik_S2-1 = C
Position 120714: OP297860.1 = T, HSV1_S1-1 = T, HSV-Klinik_S2-1 = C
Position 133842: OP297860.1 = A, HSV1_S1-1 = A, HSV-Klinik_S2-1 = G
Position 133894: OP297860.1 = G, HSV1_S1-1 = G, HSV-Klinik_S2-1 = T
Position 134778: OP297860.1 = C, HSV1_S1-1 = C, HSV-Klinik_S2-1 = T
Position 134788: OP297860.1 = C, HSV1_S1-1 = C, HSV-Klinik_S2-1 = A
Position 134867: OP297860.1 = T, HSV1_S1-1 = T, HSV-Klinik_S2-1 = G
Position 134895: OP297860.1 = A, HSV1_S1-1 = A, HSV-Klinik_S2-1 = G
Position 134898: OP297860.1 = T, HSV1_S1-1 = T, HSV-Klinik_S2-1 = C
Position 134942: OP297860.1 = T, HSV1_S1-1 = T, HSV-Klinik_S2-1 = G
Position 136436: OP297860.1 = C, HSV1_S1-1 = C, HSV-Klinik_S2-1 = A
Position 136900: OP297860.1 = G, HSV1_S1-1 = G, HSV-Klinik_S2-1 = A
Position 137047: OP297860.1 = T, HSV1_S1-1 = T, HSV-Klinik_S2-1 = C
Position 137155: OP297860.1 = C, HSV1_S1-1 = C, HSV-Klinik_S2-1 = G
Position 137527: OP297860.1 = A, HSV1_S1-1 = A, HSV-Klinik_S2-1 = T
Position 137569: OP297860.1 = C, HSV1_S1-1 = C, HSV-Klinik_S2-1 = T
Position 137602: OP297860.1 = C, HSV1_S1-1 = C, HSV-Klinik_S2-1 = A
Position 137944: OP297860.1 = T, HSV1_S1-1 = A, HSV-Klinik_S2-1 = T
Position 138170: OP297860.1 = T, HSV1_S1-1 = C, HSV-Klinik_S2-1 = C
Position 138343: OP297860.1 = C, HSV1_S1-1 = C, HSV-Klinik_S2-1 = T
Position 138880: OP297860.1 = A, HSV1_S1-1 = A, HSV-Klinik_S2-1 = G
Position 139104: OP297860.1 = T, HSV1_S1-1 = T, HSV-Klinik_S2-1 = C
Position 140457: OP297860.1 = C, HSV1_S1-1 = C, HSV-Klinik_S2-1 = M
Position 141865: OP297860.1 = A, HSV1_S1-1 = A, HSV-Klinik_S2-1 = G
Position 141889: OP297860.1 = C, HSV1_S1-1 = C, HSV-Klinik_S2-1 = A
Position 141937: OP297860.1 = G, HSV1_S1-1 = G, HSV-Klinik_S2-1 = C
Position 142056: OP297860.1 = C, HSV1_S1-1 = C, HSV-Klinik_S2-1 = G
Position 144444: OP297860.1 = C, HSV1_S1-1 = C, HSV-Klinik_S2-1 = T
~/Scripts/convert_clustal_to_fasta.py aligned_1_.aln aligned_1.fasta
samtools faidx aligned_1.fasta
samtools faidx aligned_1.fasta OP297860.1 > OP297860.1.fasta
samtools faidx aligned_1.fasta HSV1_S1-1 > HSV1_S1-1.fasta
samtools faidx aligned_1.fasta HSV-Klinik_S2-1 > HSV-Klinik_S2-1.fasta
seqkit seq OP297860.1.fasta -w 70 > OP297860.1_w70.fasta
diff OP297860.1_w70.fasta ../../refsel_db/refsel.fasta
#< >OP297860.1
#---
#> >OP297860.1 Human alphaherpesvirus 1 strain HSV1-v60_d3_cu_gen_les, complete genome
#2180c2180,2181
#< ACGGGCCCCCCCCCGAAACACACCCCCCGGGGGTCGCGCGCGGCCCTT
#---
#> ACGGGCCCCCCCCCGAAACACACCCCCCGGGGGTCGCGCGCGGCCCTTTAAAAAGGCGGGGCGGGT
mv 02_assembly 02_assembly_v1
mv 02_align_to_self 02_align_to_self_v1
mv 03_multialign_to_ref/ 03_multialign_to_ref_v1
mv 04_intrahost 04_intrahost_v1
mkdir 02_assembly
cp 03_multialign_to_ref_v1/HSV1_S1-1.fasta 02_assembly/HSV1_S1.fasta
cp 03_multialign_to_ref_v1/HSV-Klinik_S2-1.fasta 02_assembly/HSV-Klinik_S2.fasta
samtools faidx HSV1_S1.fasta
picard CreateSequenceDictionary R=HSV1_S1.fasta O=HSV1_S1.dict
~/Tools/novocraft_v3/novoindex HSV1_S1.nix HSV1_S1.fasta
samtools faidx HSV-Klinik_S2.fasta
picard CreateSequenceDictionary R=HSV-Klinik_S2.fasta O=HSV-Klinik_S2.dict
~/Tools/novocraft_v3/novoindex HSV-Klinik_S2.nix HSV-Klinik_S2.fasta
If the reads in mapped.bam too few, we can manully rerun the following steps with custom defined bam, for example cleaned.bam or taxfilt.bam files (see the point 1).
# -- Adjust Novoalign parameter to increase the mapped reads in 02_align_to_self --
If you are working with NovoAlign for virus variant calling and find that very few reads are retained, you can adjust certain parameters to increase the read count while still maintaining high mapping quality. Here are some suggestions for tuning the parameters in NovoAlign:
Reduce the Minimum Alignment Score Threshold (-t):
Current Setting: -t 100
Suggestion: Try reducing this threshold to around -t 90 or -t 80.
Explanation: The -t parameter in NovoAlign sets the minimum alignment score, which is the threshold for accepting an alignment. Lowering this score allows more alignments to pass through, increasing read retention. Reducing it slightly can retain quality while increasing the number of mapped reads.
Adjust the Gap Penalty (-g):
Current Setting: -g 40
Suggestion: Try using a slightly lower gap penalty, such as -g 20 or -g 30.
Explanation: Lowering the gap penalty allows reads with minor gaps to align more easily, which may be beneficial for viral genomes with regions that might induce small indels. This adjustment should increase read retention without sacrificing too much mapping quality.
Lower the Mismatch Penalty (-x):
Current Setting: -x 20
Suggestion: Try reducing this to -x 15 or -x 10.
Explanation: A lower mismatch penalty allows more reads with minor mismatches to map, increasing the number of mapped reads. For viral genomes, this can be helpful because some variability is expected, especially in variant-calling workflows.
Experiment with the Random Alignment Option (-r):
Current Setting: -r Random
Suggestion: If applicable, you might try other random alignment settings in NovoAlign or disable it temporarily to see if deterministic behavior (i.e., -r All) provides more reads without sacrificing quality.
Explanation: This option controls how NovoAlign treats random alignments. Testing with or without it may affect read retention, especially if many reads align equally well to multiple sites.
Increase the Soft-Clipping Parameter (-l):
Current Setting: -l 20
Suggestion: Try increasing to -l 30 or -l 40.
Explanation: Higher soft-clipping allows NovoAlign to discard low-quality or mismatched bases at the read ends, which can improve alignment quality and retention by allowing reads that otherwise would be discarded due to terminal mismatches.
Consider Using Paired-End Data:
Suggestion: If you have paired-end data available, align the reads as paired rather than single-ended.
Explanation: Paired-end alignment can improve mapping quality and retention since the additional read information helps resolve ambiguous mappings. It also increases the reliability of alignments by adding context from both read ends.
Sample Adjusted Command
Here’s an example of a modified NovoAlign command incorporating the above suggestions:
novoalign -d reference.nix -f reads.fq -r Random -l 40 -g 30 -x 15 -t 80 -k > output.sam
Additional Steps for High-Quality Variant Calling:
Use a Quality Filter Post-Alignment: After alignment, apply a quality filter on the mapped reads using a tool like Samtools to ensure that only high-confidence alignments are used for variant calling.
Optimize Read Processing: Trim low-quality bases and remove adapters prior to alignment to ensure that only high-quality reads are used. This will increase both the retention rate and the quality of the alignments.
Review Variant Calling Parameters: After alignment, check the variant-calling parameters to ensure they are suitable for low-complexity viral genomes and that high-quality mapping is prioritized.
These adjustments should help you retain more reads while still maintaining alignment quality suitable for variant calling. Adjust these parameters iteratively, evaluating the resulting alignments and variant calls to find the best balance between read count and quality.
# (TODO) look which configuration can reach the similar results as the freebayes?
vim bin/tools/vphaser2.py # edit '-w 22000'
mkdir 04_intrahost
#[E::idx_find_and_load] Could not retrieve index file for 'data/02_align_to_self/HSV-Klinik_S2.mapped.bam'
#[E::idx_find_and_load] Could not retrieve index file for 'data/02_align_to_self/HSV-Klinik_S2.mapped.bam'
samtools index data/02_align_to_self/HSV1_S1.mapped.bam
samtools index data/02_align_to_self/HSV-Klinik_S2.mapped.bam
bin/intrahost.py vphaser_one_sample data/02_align_to_self/HSV-Klinik_S2.mapped.bam data/02_assembly/HSV-Klinik_S2.fasta data/04_intrahost/vphaser2.HSV-Klinik_S2_removeDoubly_min5_max1000000_w22000.txt.gz --vphaserNumThreads 120 --removeDoublyMappedReads --minReadsEach 5 --maxBias 1000000 --loglevel DEBUG
bin/intrahost.py merge_to_vcf ref_genome/reference.fasta data/04_intrahost/isnvs.vcf.gz --samples HSV-Klinik_S2 --isnvs data/04_intrahost/vphaser2.HSV-Klinik_S2_removeDoubly_min5_max1000000_w22000.txt --alignments data/03_multialign_to_ref/aligned_1.fasta --strip_chr_version --parse_accession
#---- (Maybe next time, this time, it is not necessary): running once for l20_g40_x20_t100, once for l40_g30_x15_t80, which is option for novoalign in config.yaml, Note that we need rerun rerun 02_align_to_self.
# -- 04_intrahost_--removeDoublyMappedReads_--minReadsEach5_--maxBias10 --
mkdir data/04_intrahost
bin/intrahost.py vphaser_one_sample data/02_align_to_self/HSV1_S1.mapped.bam data/02_assembly/HSV1_S1.fasta data/04_intrahost/vphaser2.HSV1_S1.txt.gz --vphaserNumThreads 120 --removeDoublyMappedReads --minReadsEach 5 --maxBias 10
bin/intrahost.py vphaser_one_sample data/02_align_to_self/HSV-Klinik_S2.mapped.bam data/02_assembly/HSV-Klinik_S2.fasta data/04_intrahost/vphaser2.HSV-Klinik_S2.txt.gz --vphaserNumThreads 120 --removeDoublyMappedReads --minReadsEach 5 --maxBias 10
#bin/read_utils.py bwamem_idxstats data/01_cleaned/HSV-Klinik_S2.cleaned.bam /home/jhuang/REFs/viral_ngs_dbs/spikeins/ercc_spike-ins.fasta --outStats reports/spike_count/HSV-Klinik_S2.spike_count.txt --minScoreToFilter 60
bin/intrahost.py merge_to_vcf ref_genome/reference.fasta data/04_intrahost/isnvs.vcf.gz --samples HSV1_S1 HSV-Klinik_S2 --isnvs data/04_intrahost/vphaser2.HSV1_S1.txt.gz data/04_intrahost/vphaser2.HSV-Klinik_S2.txt.gz --alignments data/03_multialign_to_ref/aligned_1.fasta --strip_chr_version --parse_accession --loglevel=DEBUG
bin/interhost.py snpEff data/04_intrahost/isnvs.vcf.gz OP297860.1 data/04_intrahost/isnvs.annot.vcf.gz j.huang@uke.de
bin/intrahost.py iSNV_table data/04_intrahost/isnvs.annot.vcf.gz data/04_intrahost/isnvs.annot.txt.gz
mv data/04_intrahost data/04_intrahost_l20_g40_x20_t100_removeDoublyMappedReads_minReadsEach5_maxBias10
cd data/04_intrahost_l20_g40_x20_t100_removeDoublyMappedReads_minReadsEach5_maxBias10
gunzip isnvs.annot.txt.gz
~/Scripts/filter_isnv.py isnvs.annot.txt 0.05
cut -d$'\t' filtered_isnvs.annot.txt -f1-7
chr pos sample patient time alleles iSNV_freq
OP297860 13203 HSV1_S1 HSV1_S1 T,C,A 1.0
OP297860 13203 HSV-Klinik_S2 HSV-Klinik_S2 T,C,A 1.0
OP297860 13522 HSV1_S1 HSV1_S1 G,T 1.0
OP297860 13522 HSV-Klinik_S2 HSV-Klinik_S2 G,T 0.008905554253573941
OP297860 13659 HSV1_S1 HSV1_S1 G,T 1.0
OP297860 13659 HSV-Klinik_S2 HSV-Klinik_S2 G,T 0.008383233532934131
OP297860 47109 HSV1_S1 HSV1_S1 T,G 0.0
OP297860 47109 HSV-Klinik_S2 HSV-Klinik_S2 T,G 0.9929754139488208
OP297860 47969 HSV1_S1 HSV1_S1 C,T,A 1.0
OP297860 47969 HSV-Klinik_S2 HSV-Klinik_S2 C,T,A 0.017707985299031073
OP297860 47989 HSV1_S1 HSV1_S1 G,T 0.0
OP297860 47989 HSV-Klinik_S2 HSV-Klinik_S2 G,T 0.053730321697467484
OP297860 53691 HSV1_S1 HSV1_S1 G,T 1.0
OP297860 53691 HSV-Klinik_S2 HSV-Klinik_S2 G,T 0.023529411764705882
OP297860 55501 HSV1_S1 HSV1_S1 T,C,A 1.0
OP297860 55501 HSV-Klinik_S2 HSV-Klinik_S2 T,C,A 1.0
OP297860 55807 HSV1_S1 HSV1_S1 C,A 0.0
OP297860 55807 HSV-Klinik_S2 HSV-Klinik_S2 C,A 0.062176165803108814
OP297860 63248 HSV1_S1 HSV1_S1 G,T 1.0
OP297860 63248 HSV-Klinik_S2 HSV-Klinik_S2 G,T 0.016983016983016984
OP297860 64328 HSV1_S1 HSV1_S1 C,A 1.0
OP297860 64328 HSV-Klinik_S2 HSV-Klinik_S2 C,A 0.008469449485783423
OP297860 65225 HSV1_S1 HSV1_S1 G,A 0.0
OP297860 65225 HSV-Klinik_S2 HSV-Klinik_S2 G,A 0.8915304606240714
OP297860 65402 HSV1_S1 HSV1_S1 C,A 0.0
OP297860 65402 HSV-Klinik_S2 HSV-Klinik_S2 C,A 0.10222222222222224
OP297860 66570 HSV1_S1 HSV1_S1 G,T 0.0
OP297860 66570 HSV-Klinik_S2 HSV-Klinik_S2 G,T 0.05144291091593475
OP297860 94750 HSV1_S1 HSV1_S1 G,T 0.0
OP297860 94750 HSV-Klinik_S2 HSV-Klinik_S2 G,T 0.052851182197496516
OP297860 95302 HSV1_S1 HSV1_S1 C,A 1.0
OP297860 95302 HSV-Klinik_S2 HSV-Klinik_S2 C,A 0.01276595744680851
#mv data/04_intrahost data/04_intrahost_l40_g30_x15_t80_removeDoublyMappedReads_minReadsEach5_maxBias10
#cd data/04_intrahost_l40_g30_x15_t80_removeDoublyMappedReads_minReadsEach5_maxBias10
#gunzip isnvs.annot.txt.gz
#Keep groups where at least one record has iSNV_freq >= 0.05
#~/Scripts/filter_isnv.py isnvs.annot.txt 0.05
#cut -d$'\t' filtered_isnvs.annot.txt -f1-7
# -- 04_intrahost_--minReadsEach5_--maxBias10 --
mkdir data/04_intrahost
bin/intrahost.py vphaser_one_sample data/02_align_to_self/HSV1_S1.mapped.bam data/02_assembly/HSV1_S1.fasta data/04_intrahost/vphaser2.HSV1_S1.txt.gz --vphaserNumThreads 120 --minReadsEach 5 --maxBias 10
bin/intrahost.py vphaser_one_sample data/02_align_to_self/HSV-Klinik_S2.mapped.bam data/02_assembly/HSV-Klinik_S2.fasta data/04_intrahost/vphaser2.HSV-Klinik_S2.txt.gz --vphaserNumThreads 120 --minReadsEach 5 --maxBias 10
#bin/read_utils.py bwamem_idxstats data/01_cleaned/HSV-Klinik_S2.cleaned.bam /home/jhuang/REFs/viral_ngs_dbs/spikeins/ercc_spike-ins.fasta --outStats reports/spike_count/HSV-Klinik_S2.spike_count.txt --minScoreToFilter 60
bin/intrahost.py merge_to_vcf ref_genome/reference.fasta data/04_intrahost/isnvs.vcf.gz --samples HSV1_S1 HSV-Klinik_S2 --isnvs data/04_intrahost/vphaser2.HSV1_S1.txt.gz data/04_intrahost/vphaser2.HSV-Klinik_S2.txt.gz --alignments data/03_multialign_to_ref/aligned_1.fasta --strip_chr_version --parse_accession --loglevel=DEBUG
bin/interhost.py snpEff data/04_intrahost/isnvs.vcf.gz OP297860.1 data/04_intrahost/isnvs.annot.vcf.gz j.huang@uke.de
bin/intrahost.py iSNV_table data/04_intrahost/isnvs.annot.vcf.gz data/04_intrahost/isnvs.annot.txt.gz
mv data/04_intrahost data/04_intrahost_l20_g40_x20_t100_minReadsEach5_maxBias10
cd data/04_intrahost_l20_g40_x20_t100_minReadsEach5_maxBias10
gunzip isnvs.annot.txt.gz
~/Scripts/filter_isnv.py isnvs.annot.txt 0.05
cut -d$'\t' filtered_isnvs.annot.txt -f1-7
chr pos sample patient time alleles iSNV_freq
OP297860 13203 HSV1_S1 HSV1_S1 T,C,A 1.0
OP297860 13203 HSV-Klinik_S2 HSV-Klinik_S2 T,C,A 1.0
OP297860 13522 HSV1_S1 HSV1_S1 G,T 1.0
OP297860 13522 HSV-Klinik_S2 HSV-Klinik_S2 G,T 0.008888888888888889
OP297860 13659 HSV1_S1 HSV1_S1 G,T 1.0
OP297860 13659 HSV-Klinik_S2 HSV-Klinik_S2 G,T 0.008359207069500836
OP297860 47109 HSV1_S1 HSV1_S1 T,G 0.0
OP297860 47109 HSV-Klinik_S2 HSV-Klinik_S2 T,G 0.9930174563591022
OP297860 47969 HSV1_S1 HSV1_S1 C,T,A 1.0
OP297860 47969 HSV-Klinik_S2 HSV-Klinik_S2 C,T,A 0.01828457446808511
OP297860 47989 HSV1_S1 HSV1_S1 G,T 0.0
OP297860 47989 HSV-Klinik_S2 HSV-Klinik_S2 G,T 0.053474114441416885
OP297860 53691 HSV1_S1 HSV1_S1 G,T 1.0
OP297860 53691 HSV-Klinik_S2 HSV-Klinik_S2 G,T 0.02342786683107275
OP297860 55501 HSV1_S1 HSV1_S1 T,C,A 1.0
OP297860 55501 HSV-Klinik_S2 HSV-Klinik_S2 T,C,A 1.0
OP297860 55807 HSV1_S1 HSV1_S1 C,A 0.0
OP297860 55807 HSV-Klinik_S2 HSV-Klinik_S2 C,A 0.061538461538461535
OP297860 63248 HSV1_S1 HSV1_S1 G,T 1.0
OP297860 63248 HSV-Klinik_S2 HSV-Klinik_S2 G,T 0.016815034619188922
OP297860 64328 HSV1_S1 HSV1_S1 C,A 1.0
OP297860 64328 HSV-Klinik_S2 HSV-Klinik_S2 C,A 0.008433734939759036
OP297860 65225 HSV1_S1 HSV1_S1 G,A 0.0
OP297860 65225 HSV-Klinik_S2 HSV-Klinik_S2 G,A 0.8916728076639646
OP297860 65402 HSV1_S1 HSV1_S1 C,A 0.0
OP297860 65402 HSV-Klinik_S2 HSV-Klinik_S2 C,A 0.1018149623727313
OP297860 66570 HSV1_S1 HSV1_S1 G,T 0.0
OP297860 66570 HSV-Klinik_S2 HSV-Klinik_S2 G,T 0.05112219451371571
OP297860 94750 HSV1_S1 HSV1_S1 G,T 0.0
OP297860 94750 HSV-Klinik_S2 HSV-Klinik_S2 G,T 0.052851182197496516
OP297860 95302 HSV1_S1 HSV1_S1 C,A 1.0
OP297860 95302 HSV-Klinik_S2 HSV-Klinik_S2 C,A 0.012725344644750796
# -- 04_intrahost_--minReadsEach5_--maxBias1000000 --
mkdir data/04_intrahost
bin/intrahost.py vphaser_one_sample data/02_align_to_self/HSV1_S1.mapped.bam data/02_assembly/HSV1_S1.fasta data/04_intrahost/vphaser2.HSV1_S1.txt.gz --vphaserNumThreads 120 --minReadsEach 5 --maxBias 1000000
bin/intrahost.py vphaser_one_sample data/02_align_to_self/HSV-Klinik_S2.mapped.bam data/02_assembly/HSV-Klinik_S2.fasta data/04_intrahost/vphaser2.HSV-Klinik_S2.txt.gz --vphaserNumThreads 120 --minReadsEach 5 --maxBias 1000000
#bin/read_utils.py bwamem_idxstats data/01_cleaned/HSV-Klinik_S2.cleaned.bam /home/jhuang/REFs/viral_ngs_dbs/spikeins/ercc_spike-ins.fasta --outStats reports/spike_count/HSV-Klinik_S2.spike_count.txt --minScoreToFilter 60
bin/intrahost.py merge_to_vcf ref_genome/reference.fasta data/04_intrahost/isnvs.vcf.gz --samples HSV1_S1 HSV-Klinik_S2 --isnvs data/04_intrahost/vphaser2.HSV1_S1.txt.gz data/04_intrahost/vphaser2.HSV-Klinik_S2.txt.gz --alignments data/03_multialign_to_ref/aligned_1.fasta --strip_chr_version --parse_accession --loglevel=DEBUG
bin/interhost.py snpEff data/04_intrahost/isnvs.vcf.gz OP297860.1 data/04_intrahost/isnvs.annot.vcf.gz j.huang@uke.de
bin/intrahost.py iSNV_table data/04_intrahost/isnvs.annot.vcf.gz data/04_intrahost/isnvs.annot.txt.gz
mv data/04_intrahost data/04_intrahost_l20_g40_x20_t100_minReadsEach5_maxBias1000000
cd data/04_intrahost_l20_g40_x20_t100_minReadsEach5_maxBias1000000
gunzip isnvs.annot.txt.gz
~/Scripts/filter_isnv.py isnvs.annot.txt 0.05
cut -d$'\t' filtered_isnvs.annot.txt -f1-7
chr pos sample patient time alleles iSNV_freq
OP297860 13203 HSV1_S1 HSV1_S1 T,C,A 1.0
OP297860 13203 HSV-Klinik_S2 HSV-Klinik_S2 T,C,A 1.0
OP297860 13522 HSV1_S1 HSV1_S1 G,T 1.0
OP297860 13522 HSV-Klinik_S2 HSV-Klinik_S2 G,T 0.008888888888888889
OP297860 13659 HSV1_S1 HSV1_S1 G,T 1.0
OP297860 13659 HSV-Klinik_S2 HSV-Klinik_S2 G,T 0.008359207069500836
OP297860 47109 HSV1_S1 HSV1_S1 T,G 0.0
OP297860 47109 HSV-Klinik_S2 HSV-Klinik_S2 T,G 0.9930174563591022
OP297860 47778 HSV1_S1 HSV1_S1 G,T 0.0
OP297860 47778 HSV-Klinik_S2 HSV-Klinik_S2 G,T 0.05263157894736842
OP297860 47969 HSV1_S1 HSV1_S1 C,T,A 1.0
OP297860 47969 HSV-Klinik_S2 HSV-Klinik_S2 C,T,A 0.01828457446808511
OP297860 47989 HSV1_S1 HSV1_S1 G,T 0.0
OP297860 47989 HSV-Klinik_S2 HSV-Klinik_S2 G,T 0.053474114441416885
OP297860 53691 HSV1_S1 HSV1_S1 G,T 1.0
OP297860 53691 HSV-Klinik_S2 HSV-Klinik_S2 G,T 0.02342786683107275
OP297860 55501 HSV1_S1 HSV1_S1 T,C,A 1.0
OP297860 55501 HSV-Klinik_S2 HSV-Klinik_S2 T,C,A 1.0
OP297860 55807 HSV1_S1 HSV1_S1 C,A 0.0
OP297860 55807 HSV-Klinik_S2 HSV-Klinik_S2 C,A 0.061538461538461535
OP297860 63248 HSV1_S1 HSV1_S1 G,T 1.0
OP297860 63248 HSV-Klinik_S2 HSV-Klinik_S2 G,T 0.016815034619188922
OP297860 64328 HSV1_S1 HSV1_S1 C,A 1.0
OP297860 64328 HSV-Klinik_S2 HSV-Klinik_S2 C,A 0.008433734939759036
OP297860 65225 HSV1_S1 HSV1_S1 G,A 0.0
OP297860 65225 HSV-Klinik_S2 HSV-Klinik_S2 G,A 0.8916728076639646
OP297860 65402 HSV1_S1 HSV1_S1 C,A 0.0
OP297860 65402 HSV-Klinik_S2 HSV-Klinik_S2 C,A 0.1018149623727313
OP297860 66570 HSV1_S1 HSV1_S1 G,T 0.0
OP297860 66570 HSV-Klinik_S2 HSV-Klinik_S2 G,T 0.05112219451371571
OP297860 94750 HSV1_S1 HSV1_S1 G,T 0.0
OP297860 94750 HSV-Klinik_S2 HSV-Klinik_S2 G,T 0.052851182197496516
OP297860 95302 HSV1_S1 HSV1_S1 C,A 1.0
OP297860 95302 HSV-Klinik_S2 HSV-Klinik_S2 C,A 0.012725344644750796
Install a new viral-ngs including interhost and annotation steps (Failed!)
#https://viral-ngs.readthedocs.io/en/latest/install.html
wget https://raw.githubusercontent.com/broadinstitute/viral-ngs/master/easy-deploy-script/easy-deploy-viral-ngs.sh && chmod a+x ./easy-deploy-viral-ngs.sh && reuse UGER && qrsh -l h_vmem=10G -cwd -N "viral-ngs_deploy" -q interactive ./easy-deploy-viral-ngs.sh setup
source ./easy-deploy-viral-ngs.sh load
./easy-deploy-viral-ngs.sh create-project HSV1_Capture
#docker installation
sudo usermod -aG docker jhuang
#newgrp docker
groups jhuang
docker pull quay.io/broadinstitute/viral-ngs
docker run -it quay.io/broadinstitute/viral-ngs /bin/bash
Note that the intrahost results does not include the interhost results. Checking process.
#Under data/02_assembly
cp ../../ref_genome/reference.fasta HSV1_S1.fasta #>HSV1_S1-1
cp ../../ref_genome/reference.fasta HSV-Klinik_S2.fasta #>HSV-Klinik_S2-1
samtools faidx HSV1_S1.fasta
picard CreateSequenceDictionary R=HSV1_S1.fasta O=HSV1_S1.dict
~/Tools/novocraft_v3/novoindex HSV1_S1.nix HSV1_S1.fasta
samtools faidx HSV-Klinik_S2.fasta
picard CreateSequenceDictionary R=HSV-Klinik_S2.fasta O=HSV-Klinik_S2.dict
~/Tools/novocraft_v3/novoindex HSV-Klinik_S2.nix HSV-Klinik_S2.fasta
#total 128140
#-rw-rw-r-- 1 jhuang jhuang 76693037 Nov 13 09:59 HSV1_S1.bam
#-rw-rw-r-- 1 jhuang jhuang 34590 Nov 13 09:59 HSV1_S1.mapped.bam
#-rw-rw-r-- 1 jhuang jhuang 48946378 Nov 13 10:03 HSV-Klinik_S2.bam
#-rw-rw-r-- 1 jhuang jhuang 5537247 Nov 13 10:03 HSV-Klinik_S2.mapped.bam
# vs
#total 128140
#-rw-rw-r-- 1 jhuang jhuang 76693095 Nov 15 12:47 HSV1_S1.bam
#-rw-rw-r-- 1 jhuang jhuang 34587 Nov 15 12:47 HSV1_S1.mapped.bam
#-rw-rw-r-- 1 jhuang jhuang 48946337 Nov 15 12:48 HSV-Klinik_S2.bam
#-rw-rw-r-- 1 jhuang jhuang 5537246 Nov 15 12:48 HSV-Klinik_S2.mapped.bam
#Manually generate the aligned_1.fasta due to too long runtime.
cat ../../ref_genome/reference.fasta ../02_assembly/HSV1_S1.fasta ../02_assembly/HSV-Klinik_S2.fasta > aligned_1.fasta
#>OP297860.1 Human alphaherpesvirus 1 strain HSV1-v60_d3_cu_gen_les, complete genome
#>HSV1_S1-1
#>HSV-Klinik_S2-1
#If this results is similar to freebayes, means the results successfully include interhost-results.
#TODO: In next step, we should feed another bam-files, e.g. the cleaned bam-file into the pipelines!
#DOESN'T WORK: snakemake --cleanup-metadata data/03_multialign_to_ref/sampleNameList.txt data/03_multialign_to_ref/aligned_1.fasta --cores 1
snakemake --printshellcmds --cores all
mkdir data/04_intrahost
bin/intrahost.py vphaser_one_sample data/02_align_to_self/HSV1_S1.mapped.bam data/02_assembly/HSV1_S1.fasta data/04_intrahost/vphaser2.HSV1_S1.txt.gz --vphaserNumThreads 120 --removeDoublyMappedReads --minReadsEach 5 --maxBias 10
bin/intrahost.py vphaser_one_sample data/02_align_to_self/HSV-Klinik_S2.mapped.bam data/02_assembly/HSV-Klinik_S2.fasta data/04_intrahost/vphaser2.HSV-Klinik_S2.txt.gz --vphaserNumThreads 120 --removeDoublyMappedReads --minReadsEach 5 --maxBias 10
bin/intrahost.py merge_to_vcf ref_genome/reference.fasta data/04_intrahost/isnvs.vcf.gz --samples HSV1_S1 HSV-Klinik_S2 --isnvs data/04_intrahost/vphaser2.HSV1_S1.txt.gz data/04_intrahost/vphaser2.HSV-Klinik_S2.txt.gz --alignments data/03_multialign_to_ref/aligned_1.fasta --strip_chr_version --parse_accession --loglevel=DEBUG
bin/interhost.py snpEff data/04_intrahost/isnvs.vcf.gz OP297860.1 data/04_intrahost/isnvs.annot.vcf.gz j.huang@uke.de
bin/intrahost.py iSNV_table data/04_intrahost/isnvs.annot.vcf.gz data/04_intrahost/isnvs.annot.txt.gz
mv data/04_intrahost data/04_intrahost_including_interhost
cd data/04_intrahost_including_interhost
gunzip isnvs.annot.txt.gz
~/Scripts/filter_isnv.py isnvs.annot.txt 0.05
cut -d$'\t' filtered_isnvs.annot.txt -f1-7
bin/intrahost.py merge_to_vcf ref_genome/reference.fasta data/04_intrahost/isnvs.vcf.gz --samples HSV1_S1 HSV-Klinik_S2 --isnvs data/04_intrahost/vphaser2.HSV1_S1.txt.gz data/04_intrahost/vphaser2.HSV-Klinik_S2.txt.gz --alignments data/03_multialign_to_ref/aligned_1.fasta --strip_chr_version --parse_accession
awk '$7 >= 5' vphaser2.HSV-Klinik_S2_removeDoubly_min5_max1000000_w25000.txt > vphaser2.HSV-Klinik_S2_removeDoubly_min5_max1000000_w25000_0.05.txt
awk '$7 >= 50' vphaser2.HSV-Klinik_S2_removeDoubly_min5_max1000000_w25000.txt > vphaser2.HSV-Klinik_S2_removeDoubly_min5_max1000000_w25000_0.5.txt
# How many SNPs?
#bin/intrahost.py vphaser_one_sample data_v2/02_align_to_self/HSV-Klinik_S2.mapped.bam data_v2/02_assembly/HSV-Klinik_S2.fasta data_v2/04_intrahost/vphaser2.HSV-Klinik_S2_v2.txt.gz --vphaserNumThreads 120 --minReadsEach 5 --maxBias 1000000 --loglevel DEBUG
#mv vphaser2.HSV-Klinik_S2.txt.gz
# How many SNPs?
awk '$7 >= 5' vphaser2.HSV-Klinik_S2_v2.txt > vphaser2.HSV-Klinik_S2_v2_.txt
bin/intrahost.py vphaser_one_sample data_v2/02_align_to_self/HSV-Klinik_S2.mapped.bam data_v2/02_assembly/HSV-Klinik_S2.fasta data_v2/04_intrahost/vphaser2.HSV-Klinik_S2_v3.txt.gz --vphaserNumThreads 120 --minReadsEach 5 --maxBias 10 --loglevel DEBUG
# How many SNPs?
awk '$6 >= 0.05' isnvs.annot.txt > isnvs.annot_.txt
-------
#I used the viral-ngs get a table as follows:
chr pos sample patient time alleles iSNV_freq Hw Hs eff_type eff_codon_dna eff_aa eff_aa_pos eff_prot_len eff_gene eff_protein
OP297860 9012 HSV-Klinik_S2 HSV-Klinik_S2 C,A 0.0155954631379962 0.0307044893350152 1 intergenic_region n.9012C>A RL2-UL1 Gene_1996_5580-Gene_9025_10636
OP297860 9017 HSV-Klinik_S2 HSV-Klinik_S2 C,A 0.0408905043162199 0.0784369419459701 1 intergenic_region n.9017C>A RL2-UL1 Gene_1996_5580-Gene_9025_10636
In the process, the intrahost.py was used.
intrahost.py - within-host genetic variation (iSNVs) The output has only contains
I can so understand, the intrahost variants only reported. The chr OP297860 is only for the annotation. If a position in my clinical sample HSV-Klinik_S2 is different to OP297860, it will be not reported and not exists in the table.
Column Descriptions in the Output Table
The output table generated by this script will contain the following columns:
chr: Chromosome or contig where the variant is located.
pos: Position on the chromosome/contig of the variant.
sample: The sample identifier for this variant.
patient: Patient ID extracted from the sample name (assumes the format sample.patient).
time: Time point of sample collection, extracted from the sample name (if present).
alleles: The alleles involved in the variant. For example, C,A means Cytosine (C) and Adenine (A).
iSNV_freq: Frequency of the variant in the sample. This is the sum of the frequencies of the variant alleles.
Hw: Hardy-Weinberg equilibrium p-value for the variant. This is calculated from the genotype frequencies in the sample and indicates how well they conform to random mating expectations.
Hs: Heterozygosity in the population based on consensus genotypes. It measures genetic diversity based on observed genotypes.
eff_type: The type of effect the variant has on the gene, such as intergenic_region, start_lost, etc.
eff_codon_dna: The effect of the variant at the DNA level (e.g., n.9012C>A).
eff_aa: The amino acid effect of the variant (e.g., a change from one amino acid to another or a frameshift).
eff_aa_pos: The position of the amino acid affected by the variant.
eff_prot_len: The length of the protein after the variant is applied, which may be truncated if the variant causes a frameshift or a stop codon.
eff_gene: The gene affected by the variant.
eff_protein: The protein affected by the variant (e.g., a protein identifier like UXY89132.1).
b'/home/jhuang/miniconda3/envs/viral-ngs4/bin/python\n'
-------
2024-11-12 13:22:47,892 - cmd:193:main_argparse - INFO - software version: 1522433800, python version: 3.6.7 | packaged by conda-forge | (default, Feb 28 2019, 09:07:38)
[GCC 7.3.0]
2024-11-12 13:22:47,893 - cmd:195:main_argparse - INFO - command: bin/intrahost.py merge_to_vcf refFasta=ref_genome/reference.fasta outVcf=data/04_intrahost/isnvs.vcf.gz samples=['HSV-Klinik_S2'] isnvs=['data/04_intrahost/vphaser2.HSV-Klinik_S2.txt.gz'] alignments=['data/03_multialign_to_ref/aligned_1.fasta'] strip_chr_version=True naive_filter=False parse_accession=True loglevel=INFO
2024-11-12 13:22:47,893 - intrahost:476:merge_to_vcf - INFO - loaded CoordMapper for all genomes, starting VCF merge...
Traceback (most recent call last):
File "bin/intrahost.py", line 1152, in <module>
util.cmd.main_argparse(__commands__, __doc__)
File "/home/jhuang/Tools/viral-ngs/bin/util/cmd.py", line 221, in main_argparse
ret = args.func_main(args)
File "/home/jhuang/Tools/viral-ngs/bin/util/cmd.py", line 102, in _main
mainfunc(**args2)
File "bin/intrahost.py", line 530, in merge_to_vcf
raise LookupError("Not all reference sequences found in alignments.")
LookupError: Not all reference sequences found in alignments.
[Tue Nov 12 13:22:47 2024]
Error in rule isnvs_vcf:
jobid: 0
output: data/04_intrahost/isnvs.vcf.gz, data/04_intrahost/isnvs.vcf.gz.tbi, data/04_intrahost/isnvs.annot.vcf.gz, data/04_intrahost/isnvs.annot.txt.gz, data/04_intrahost/isnvs.annot.vcf.gz.tbi
RuleException:
CalledProcessError in line 61 of /mnt/md1/DATA_md1/Data_Nicole_CaptureProbeSequencing/bin/pipes/rules/intrahost.rules:
Command 'set -euo pipefail; bin/intrahost.py merge_to_vcf ref_genome/reference.fasta data/04_intrahost/isnvs.vcf.gz --samples HSV-Klinik_S2 --isnvs data/04_intrahost/vphaser2.HSV-Klinik_S2.txt.gz --alignments data/03_multialign_to_ref/aligned_1.fasta --strip_chr_version --parse_accession' returned non-zero exit status 1.
File "/mnt/md1/DATA_md1/Data_Nicole_CaptureProbeSequencing/bin/pipes/rules/intrahost.rules", line 61, in __rule_isnvs_vcf
File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
Exiting because a job execution failed. Look above for error message
Columns Breakdown:
Ref_Pos (e.g., 55, 104, 210):
This refers to the position in the genome where the variant occurs. In this example, the variants occur at positions 55, 104, and 210.
Var (e.g., T C, G A, T C):
This is the variant observed at that position. It shows the reference base (before the variant) and the alternate base (after the variant). For example:
At position 55, the reference base is T and the alternate base is C.
At position 104, the reference base is G and the alternate base is A.
At position 210, the reference base is T and the alternate base is C.
Cons (e.g., 0.8156, 0.1674, 0.1065):
This represents the variant frequency (or proportion) in the sample, expressed as a decimal. It shows the fraction of reads supporting the alternate base (C, A, etc.). For example:
At position 55, 81.56% of the reads support the alternate base C.
At position 104, 16.74% of the reads support the alternate base A.
At position 210, 10.65% of the reads support the alternate base C.
Strd_bias_pval (e.g., 0.8156, 0.1674, 0.1065):
This represents the strand bias p-value for the variant. It tests if there is an uneven distribution of reads between the forward and reverse strands for the variant. A higher p-value suggests no significant strand bias. A lower p-value suggests a possible strand bias, meaning the variant might be incorrectly called due to a bias in sequencing reads from one strand.
Type (e.g., snp):
This indicates the type of variant. In this case, it's a SNP (single nucleotide polymorphism). It means that a single nucleotide in the genome has been altered.
Var_perc (e.g., 16.1, 14.07, 10.58):
This represents the percentage of variants for each alternate base, which is very similar to the Cons column but expressed as a percentage. For example:
At position 55, the alternate base C is observed in 16.1% of the reads.
At position 104, the alternate base A is observed in 14.07% of the reads.
At position 210, the alternate base C is observed in 10.58% of the reads.
SNP_or_LP_Profile (e.g., C:65:34 T:13:6):
This contains information on the read counts for the reference base (T, G, etc.) and the alternate base (C, A, etc.). The format is:
Reference base count (forward strand : reverse strand)
Alternate base count (forward strand : reverse strand)
For example, at position 55:
C (alternate base) has 65 reads on the forward strand and 34 on the reverse strand.
T (reference base) has 13 reads on the forward strand and 6 on the reverse strand.
Summary: SNPV and LPV
The last line of the output gives a summary of the total number of SNPs and LPs (likely Low-Quality Polymorphisms or Low Probability Variants):
# Summary: SNPV: 132; LPV: 0
SNPV: 132:
This indicates the total number of SNP variants detected in the data. In this case, there are 132 SNPs identified.
LPV: 0:
This indicates the number of Low Probability Variants (LPVs). A value of 0 means no low-quality variant calls were detected, indicating that the analysis did not identify any variants with low confidence.
# Minimum number of reads on each strand
vphaser_min_reads_each: 5
# Maximum allowable ratio of number of reads on the two
# strands. Ignored if vphaser_max_bins=0.
vphaser_max_bins: 10
# A simple filter for the VCF merge step.
# If set to true, keep only the alleles that have at least two
# independent libraries of support and
# allele freq > 0.005. If false, no filtering is performed.
vcf_merge_naive_filter: false
(Optional)
152526
GapFiller.pl -l libraries_p2564.txt -s data/02_assembly/p2564.fasta
#parainfluenza bwa /home/jhuang/DATA/Data_parainfluenza/trimmed/p2564_R1.fastq.gz /home/jhuang/DATA/Data_parainfluenza/trimmed/p2564_R2.fastq.gz 300 1.0 FR
#since HSV1 and HSV-Klinik_S2 has different regions covered --> multialign_to_ref is none!
bin/intrahost.py vphaser_one_sample data/02_align_to_self/HSV-Klinik_S2.mapped.bam data/02_assembly/HSV-Klinik_S2.fasta data/04_intrahost/vphaser2.HSV-Klinik_S2.txt.gz --vphaserNumThreads 120 --removeDoublyMappedReads --minReadsEach 5 --maxBias 10
(viral-ngs4) jhuang@WS-2290C:~/DATA/Data_Nicole_CaptureProbeSequencing/data/02_align_to_self$ samtools flagstat HSV-Klinik_S2.mapped.bam
162156 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
162156 + 0 mapped (100.00% : N/A)
162156 + 0 paired in sequencing
81048 + 0 read1
81108 + 0 read2
161068 + 0 properly paired (99.33% : N/A)
161630 + 0 with itself and mate mapped
526 + 0 singletons (0.32% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
(viral-ngs4) jhuang@WS-2290C:~/DATA/Data_Nicole_CaptureProbeSequencing/data/01_per_sample$ samtools flagstat HSV-Klinik_S2.taxfilt.bam
800454 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
0 + 0 mapped (0.00% : N/A)
800454 + 0 paired in sequencing
400227 + 0 read1
400227 + 0 read2
0 + 0 properly paired (0.00% : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
(viral-ngs4) jhuang@WS-2290C:~/DATA/Data_Nicole_CaptureProbeSequencing/data/02_align_to_self$ samtools flagstat HSV-Klinik_S2.bam
885528 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
191932 + 0 duplicates
354088 + 0 mapped (39.99% : N/A)
885528 + 0 paired in sequencing
442764 + 0 read1
442764 + 0 read2
323502 + 0 properly paired (36.53% : N/A)
324284 + 0 with itself and mate mapped
29804 + 0 singletons (3.37% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
Summarize statistics from snakemake-output
samples-runs.txt
samtools flagstat data/02_align_to_self/838_S1.mapped.bam
samtools flagstat data/02_align_to_self/840_S2.mapped.bam
samtools flagstat data/02_align_to_self/820_S3.mapped.bam
samtools flagstat data/02_align_to_self/828_S4.mapped.bam
samtools flagstat data/02_align_to_self/815_S5.mapped.bam
samtools flagstat data/02_align_to_self/834_S6.mapped.bam
samtools flagstat data/02_align_to_self/808_S7.mapped.bam
samtools flagstat data/02_align_to_self/811_S8.mapped.bam
samtools flagstat data/02_align_to_self/837_S9.mapped.bam
samtools flagstat data/02_align_to_self/768_S10.mapped.bam
samtools flagstat data/02_align_to_self/773_S11.mapped.bam
samtools flagstat data/02_align_to_self/767_S12.mapped.bam
samtools flagstat data/02_align_to_self/810_S13.mapped.bam
samtools flagstat data/02_align_to_self/814_S14.mapped.bam
samtools flagstat data/02_align_to_self/10121-16_S15.mapped.bam --> 3c
Origin of hepatitis C virus genotype 3 in Africa as estimated
through an evolutionary analysis of the full-length genomes of nine
subtypes, including the newly sequenced 3d and 3e
samtools flagstat data/02_align_to_self/7510-15_S16.mapped.bam -->
samtools flagstat data/02_align_to_self/828-17_S17.mapped.bam
samtools flagstat data/02_align_to_self/8806-15_S18.mapped.bam
samtools flagstat data/02_align_to_self/9881-16_S19.mapped.bam
samtools flagstat data/02_align_to_self/8981-14_S20.mapped.bam
Consensus sequences of each and of all isolates
cp data/02_assembly/*.fasta ./
for sample in 838_S1 840_S2 820_S3 828_S4 815_S5 834_S6 808_S7 811_S8 837_S9 768_S10 773_S11 767_S12 810_S13 814_S14 10121-16_S15 7510-15_S16 828-17_S17 8806-15_S18 9881-16_S19 8981-14_S20; do
for sample in p953-84660-tsek p938-16972-nra p942-88507-nra p943-98523-nra p944-103323-nra p947-105565-nra p948-112830-nra; do \
mv ${sample}.fasta ${sample}.fa
cat all.fa ${sample}.fa >> all.fa
done
cat RSV_dedup.fa all.fa > RSV_all.fa
mafft --adjustdirection RSV_all.fa > RSV_all.aln
snp-sites RSV_all.aln -o RSV_all_.aln
Finding the next strain with Phylogenetics: send both HCV231_all.png and HCV231_all.pdf to the Nicole
#1, generate tree
cat SARS-CoV-2_len25000_w60_newheader.fa ~/rtpd_files/2029-AW_S5/idba_ud_assembly/gapped_contig.fa > CoV2_all.fa
mafft --adjustdirection CoV2_all.fa > CoV2_all.aln
snp-sites CoV2_all.aln -o CoV2_all_.aln
fasttree -gtr -nt RSV_all_.aln > RSV_all.tree
fasttree -gtr -nt Ortho_SNP_matrix_RAxML.fa > Ortho_SNP_matrix_RAxML.tree
raxml-ng --all --model GTR+G+ASC_LEWIS --prefix CoV2_all_raxml.aln --threads 1 --msa CoV2_all_.aln --bs-trees 1000 --redo
#raxml-ng --all --model GTR+G+ASC_LEWIS --prefix raxml-ng/snippy.core.aln --threads 1 --msa variants/snippy.core.aln --bs-trees 1000 --redo
#2, open tree on Dendroscope, from phylogenetic tree, get genotype-refs as follows,
1a: S10, S11, 814_S14(3-->1a?), S18 --> 1a_EF407457
1b: S12 --> 1b_M58335
2a: 815_S5(3-->2a?) --> 2a_D00944
2c: S20 --> 2c_D50409
3a: S3, S7, S8, S13, S15, S16, S19 --> 3c_KY620605
4d: S1, S2, S9 --> 4d_EU392172
4k: S4, S6 --> 4k_EU392173
--> KX249682.1
--> KX765935.1
--> KM517573.1
cd data/02_assembly/
cat p2.fasta p3e.fasta p4e.fasta p5e.fasta > all.fasta
sed -i -e 's/-1//g' all.fasta
#sed -i -e 's/e-1//g' all.fasta
mafft --adjustdirection --clustalout all.fasta > all.aln
# MANUALLY CORRECTION!
##POLISH the assembled contigs
#for sample in p953 p938 p942 p943 p944 p947 p948 p955 p954 p952 p951 p946 p945 p940; do
# rm ${sample}_polished.fa
# #seqtk sample ../../trimmed/${sample}_R1.fastq.gz 0.1 > ${sample}_0.1_R1.fastq
# #seqtk sample ../../trimmed/${sample}_R2.fastq.gz 0.1 > ${sample}_0.1_R2.fastq
# polish_viral_ref.sh -1 ../../trimmed/${sample}_R1.fastq.gz -2 ../../trimmed/${sample}_R2.fastq.gz -r ${sample}.fasta -o ${sample}_polished.fa -t 6
#done
for sample in p946 p954 p952 p948 p945 p947 p955 p943 p951 p942; do #all.aln
for sample in p944 p938 p953 p940; do #all2.aln
for sample in p2 p3 p4 p5; do
grep "${sample}" all.aln > REF${sample}.fasta
#cut -f2-2 -d$'\t' REF${sample}.fasta > REF${sample}.fast
sed -i -e "s/${sample} //g" REF${sample}.fasta
sed -i -e "s/${sample}-1 //g" REF${sample}.fasta
sed -i -e 's/-//g' REF${sample}.fasta
echo ">REF${sample}" > REF${sample}.header
cat REF${sample}.header REF${sample}.fasta > REF${sample}.fas
seqkit seq -u REF${sample}.fas -o REF${sample}.fa
cp REF${sample}.fa ${sample}.fa
mv REF${sample}.fa ../..
sed -i -e "s/REF//g" ${sample}.fa #still under data/02_assembly/
done
#ReferenceSeeker determines closely related reference genomes
#https://github.com/oschwengers/referenceseeker
(referenceseeker) jhuang@hamburg:~/DATA/Data_Holger_Efaecium$ ~/Tools/referenceseeker/bin/referenceseeker -v ~/REFs/bacteria-refseq/ shovill/noAB_wildtype/contigs.fasta
# Annotating the fasta using VAPiD
makeblastdb -in *.fasta -dbtype nucl
python ~/Tools/VAPiD/vapid3.py --db ~/REFs/all_virus/all_virus.fasta p946R.fa ~/REFs/template_Less.sbt
python ~/Tools/VAPiD/vapid3.py --db ~/REFs/all_virus/all_virus.fasta REFp944.fa ~/REFs/template_Less.sbt # KT581445.1 selected!
python ~/Tools/VAPiD/vapid3.py --db ~/REFs/all_virus/all_virus.fasta contigs_final.fasta ~/REFs/template_Amir.sbt
python ~/Tools/VAPiD/vapid3.py --online contigs_final.fasta ~/REFs/template_Amir.sbt
All packages under the viral-ngs4 env, note that novoalign is not installed. The used Novoalign path: /home/jhuang/Tools/novocraft_v3/novoalign; the used gatk: /usr/local/bin/gatk using /home/jhuang/Tools/GenomeAnalysisTK-3.6/GenomeAnalysisTK.jar (see the point 9).
mamba remove viral-ngs --all
mamba remove viral-ngs-env --all
conda remove viral-ngs-java7 --all
conda remove viral-ngs-java8 --all
conda remove viral-ngs-py36 --all
conda remove viral-ngs2 --all
conda remove viral-ngs3 --all
jhuang@WS-2290C:~$ conda activate viral-ngs4
(viral-ngs4) jhuang@WS-2290C:~$ conda list
# packages in environment at /home/jhuang/miniconda3/envs/viral-ngs4:
#
# Name Version Build Channel
_libgcc_mutex 0.1 conda_forge conda-forge
_openmp_mutex 4.5 2_gnu conda-forge
_r-mutex 1.0.1 anacondar_1 conda-forge
alsa-lib 1.2.3.2 h166bdaf_0 conda-forge
bamtools 2.5.2 hdcf5f25_5 bioconda
bedtools 2.31.1 hf5e1c6e_2 bioconda
binutils_impl_linux-64 2.43 h4bf12b8_2 conda-forge
binutils_linux-64 2.43 h4852527_2 conda-forge
biopython 1.79 py36h8f6f2f9_0 conda-forge
blast 2.6.0 boost1.64_2 bioconda
bmfilter 3.101 h4ac6f70_5 bioconda
bmtagger 3.101 h470a237_4 bioconda
bmtool 3.101 hdbdd923_5 bioconda
boost 1.64.0 py36_4 conda-forge
boost-cpp 1.64.0 1 conda-forge
bowtie 1.3.1 py36h769816f_3 bioconda
bowtie2 2.5.4 h7071971_4 bioconda
bwa 0.7.18 he4a0461_1 bioconda
bwidget 1.9.14 ha770c72_1 conda-forge
bzip2 1.0.8 h4bc722e_7 conda-forge
c-ares 1.34.2 heb4867d_0 conda-forge
ca-certificates 2024.9.24 h06a4308_0
cairo 1.16.0 h18b612c_1001 conda-forge
cd-hit 4.8.1 h43eeafb_10 bioconda
cd-hit-auxtools 4.8.1 h4ac6f70_3 bioconda
certifi 2021.5.30 py36h5fab9bb_0 conda-forge
curl 7.68.0 hf8cf82a_0 conda-forge
cycler 0.11.0 pyhd8ed1ab_0 conda-forge
dbus 1.13.6 hfdff14a_1 conda-forge
diamond 2.1.10 h43eeafb_2 bioconda
expat 2.6.4 h5888daf_0 conda-forge
extract_fullseq 3.101 h4ac6f70_5 bioconda
fastqc 0.12.1 hdfd78af_0 bioconda
font-ttf-dejavu-sans-mono 2.37 hab24e00_0 conda-forge
fontconfig 2.14.1 hef1e5e3_0
freetype 2.12.1 h267a509_2 conda-forge
fribidi 1.0.10 h36c2ea0_0 conda-forge
future 0.18.2 py36h5fab9bb_3 conda-forge
gap2seq 2.1 boost1.64_1 bioconda
gatk 3.6 hdfd78af_11 bioconda
gcc_impl_linux-64 14.2.0 h6b349bd_1 conda-forge
gcc_linux-64 14.2.0 h5910c8f_5 conda-forge
gettext 0.22.5 he02047a_3 conda-forge
gettext-tools 0.22.5 he02047a_3 conda-forge
gfortran_impl_linux-64 14.2.0 hc73f493_1 conda-forge
gfortran_linux-64 14.2.0 hda50785_5 conda-forge
giflib 5.2.2 hd590300_0 conda-forge
glib 2.66.3 h58526e2_0 conda-forge
graphite2 1.3.13 h59595ed_1003 conda-forge
gsl 2.4 h294904e_1006 conda-forge
gst-plugins-base 1.14.5 h0935bb2_2 conda-forge
gstreamer 1.14.5 h36ae1b5_2 conda-forge
gxx_impl_linux-64 14.2.0 h2c03514_1 conda-forge
gxx_linux-64 14.2.0 h9423afd_5 conda-forge
harfbuzz 2.4.0 h37c48d4_1 conda-forge
icu 58.2 hf484d3e_1000 conda-forge
jpeg 9e h0b41bf4_3 conda-forge
kernel-headers_linux-64 3.10.0 he073ed8_18 conda-forge
keyutils 1.6.1 h166bdaf_0 conda-forge
kiwisolver 1.3.1 py36h605e78d_1 conda-forge
kmer-jellyfish 2.3.1 h4ac6f70_2 bioconda
krb5 1.16.4 h2fd8d38_0 conda-forge
last 876 py36_0 bioconda
lcms2 2.12 hddcbb42_0 conda-forge
ld_impl_linux-64 2.43 h712a8e2_2 conda-forge
libasprintf 0.22.5 he8f35ee_3 conda-forge
libasprintf-devel 0.22.5 he8f35ee_3 conda-forge
libblas 3.9.0 25_linux64_openblas conda-forge
libcblas 3.9.0 25_linux64_openblas conda-forge
libcurl 7.68.0 hda55be3_0 conda-forge
libdeflate 1.21 h4bc722e_0 conda-forge
libedit 3.1.20191231 he28a2e2_2 conda-forge
libev 4.33 hd590300_2 conda-forge
libexpat 2.6.4 h5888daf_0 conda-forge
libffi 3.2.1 he1b5a44_1007 conda-forge
libgcc 14.2.0 h77fa898_1 conda-forge
libgcc-devel_linux-64 14.2.0 h41c2201_101 conda-forge
libgcc-ng 14.2.0 h69a702a_1 conda-forge
libgettextpo 0.22.5 he02047a_3 conda-forge
libgettextpo-devel 0.22.5 he02047a_3 conda-forge
libgfortran-ng 7.5.0 h14aa051_20 conda-forge
libgfortran4 7.5.0 h14aa051_20 conda-forge
libgfortran5 14.2.0 hd5240d6_1 conda-forge
libglib 2.66.3 hbe7bbb4_0 conda-forge
libgomp 14.2.0 h77fa898_1 conda-forge
libiconv 1.17 hd590300_2 conda-forge
libidn11 1.33 h7b6447c_0
liblapack 3.9.0 25_linux64_openblas conda-forge
libnghttp2 1.51.0 hdcd2b5c_0 conda-forge
libnsl 2.0.1 hd590300_0 conda-forge
libopenblas 0.3.28 pthreads_h94d23a6_0 conda-forge
libpng 1.6.43 h2797004_0 conda-forge
libsanitizer 14.2.0 h2a3dede_1 conda-forge
libsqlite 3.46.0 hde9e2c9_0 conda-forge
libssh2 1.10.0 haa6b8db_3 conda-forge
libstdcxx 14.2.0 hc0a3c3a_1 conda-forge
libstdcxx-devel_linux-64 14.2.0 h41c2201_101 conda-forge
libstdcxx-ng 14.2.0 h4852527_1 conda-forge
libtiff 4.2.0 hf544144_3 conda-forge
libuuid 1.0.3 h7f8727e_2
libwebp-base 1.4.0 hd590300_0 conda-forge
libxcb 1.17.0 h8a09558_0 conda-forge
libxcrypt 4.4.36 hd590300_1 conda-forge
libxml2 2.9.14 h74e7548_0
libzlib 1.2.13 h4ab18f5_6 conda-forge
llvm-openmp 8.0.1 hc9558a2_0 conda-forge
mafft 7.221 0 bioconda
make 4.4.1 hb9d3cd8_2 conda-forge
matplotlib 3.3.4 py36h5fab9bb_0 conda-forge
matplotlib-base 3.3.4 py36hd391965_0 conda-forge
mummer4 4.0.0rc1 pl5321hdbdd923_7 bioconda
muscle 3.8.1551 h7d875b9_6 bioconda
mvicuna 1.0 h4ac6f70_10 bioconda
ncurses 6.5 he02047a_1 conda-forge
numpy 1.19.5 py36hfc0c790_2 conda-forge
olefile 0.46 pyh9f0ad1d_1 conda-forge
openjdk 8.0.412 hd590300_1 conda-forge
openjpeg 2.4.0 hb52868f_1 conda-forge
openmp 8.0.1 0 conda-forge
openssl 1.1.1w hd590300_0 conda-forge
pandas 1.1.5 py36h284efc9_0 conda-forge
pango 1.42.4 h7062337_4 conda-forge
parallel 20240922 ha770c72_0 conda-forge
pcre 8.45 h9c3ff4c_0 conda-forge
perl 5.32.1 7_hd590300_perl5 conda-forge
picard 3.0.0 hdfd78af_0 bioconda
pigz 2.6 h27cfd23_0
pillow 8.2.0 py36ha6010c0_1 conda-forge
pip 21.3.1 pyhd8ed1ab_0 conda-forge
pixman 0.38.0 h516909a_1003 conda-forge
prinseq 0.20.4 hdfd78af_5 bioconda
pthread-stubs 0.4 hb9d3cd8_1002 conda-forge
pybedtools 0.9.0 py36h7281c5b_1 bioconda
pyparsing 3.1.4 pyhd8ed1ab_0 conda-forge
pyqt 5.9.2 py36hcca6a23_4 conda-forge
pysam 0.16.0 py36h873a209_0 bioconda
python 3.6.7 h381d211_1004 conda-forge
python-dateutil 2.8.2 pyhd8ed1ab_0 conda-forge
python_abi 3.6 2_cp36m conda-forge
pytz 2023.3.post1 pyhd8ed1ab_0 conda-forge
pyyaml 5.4.1 py36h8f6f2f9_1 conda-forge
qt 5.9.7 h52cfd70_2 conda-forge
r-assertthat 0.2.1 r36h6115d3f_2 conda-forge
r-backports 1.2.1 r36hcfec24a_0 conda-forge
r-base 3.6.1 h9bb98a2_1
r-bitops 1.0_7 r36hcfec24a_0 conda-forge
r-brio 1.1.2 r36hcfec24a_0 conda-forge
r-callr 3.7.0 r36hc72bb7e_0 conda-forge
r-catools 1.18.2 r36h03ef668_0 conda-forge
r-cli 2.5.0 r36hc72bb7e_0 conda-forge
r-colorspace 2.0_1 r36hcfec24a_0 conda-forge
r-crayon 1.4.1 r36hc72bb7e_0 conda-forge
r-desc 1.3.0 r36hc72bb7e_0 conda-forge
r-diffobj 0.3.4 r36hcfec24a_0 conda-forge
r-digest 0.6.27 r36h03ef668_0 conda-forge
r-ellipsis 0.3.2 r36hcfec24a_0 conda-forge
r-evaluate 0.14 r36h6115d3f_2 conda-forge
r-fansi 0.4.2 r36hcfec24a_0 conda-forge
r-farver 2.1.0 r36h03ef668_0 conda-forge
r-ggplot2 3.3.3 r36hc72bb7e_0 conda-forge
r-glue 1.4.2 r36hcfec24a_0 conda-forge
r-gplots 3.1.1 r36hc72bb7e_0 conda-forge
r-gsalib 2.1 r36_1002 conda-forge
r-gtable 0.3.0 r36h6115d3f_3 conda-forge
r-gtools 3.8.2 r36hcdcec82_1 conda-forge
r-isoband 0.2.4 r36h03ef668_0 conda-forge
r-jsonlite 1.7.2 r36hcfec24a_0 conda-forge
r-kernsmooth 2.23_20 r36h742201e_0 conda-forge
r-labeling 0.4.2 r36h142f84f_0 conda-forge
r-lattice 0.20_44 r36hcfec24a_0 conda-forge
r-lifecycle 1.0.0 r36hc72bb7e_0 conda-forge
r-magrittr 2.0.1 r36hcfec24a_1 conda-forge
r-mass 7.3_54 r36hcfec24a_0 conda-forge
r-matrix 1.3_3 r36he454529_0 conda-forge
r-mgcv 1.8_35 r36he454529_0 conda-forge
r-munsell 0.5.0 r36h6115d3f_1003 conda-forge
r-nlme 3.1_152 r36h859d828_0 conda-forge
r-pillar 1.6.1 r36hc72bb7e_0 conda-forge
r-pkgconfig 2.0.3 r36h6115d3f_1 conda-forge
r-pkgload 1.2.1 r36h03ef668_0 conda-forge
r-plyr 1.8.6 r36h0357c0b_1 conda-forge
r-praise 1.0.0 r36h6115d3f_1004 conda-forge
r-processx 3.5.2 r36hcfec24a_0 conda-forge
r-ps 1.6.0 r36hcfec24a_0 conda-forge
r-r6 2.5.0 r36hc72bb7e_0 conda-forge
r-rcolorbrewer 1.1_2 r36h6115d3f_1003 conda-forge
r-rcpp 1.0.6 r36h03ef668_0 conda-forge
r-rematch2 2.1.2 r36h6115d3f_1 conda-forge
r-reshape 0.8.8 r36hcdcec82_2 conda-forge
r-rlang 0.4.11 r36hcfec24a_0 conda-forge
r-rprojroot 2.0.2 r36hc72bb7e_0 conda-forge
r-rstudioapi 0.13 r36hc72bb7e_0 conda-forge
r-scales 1.1.1 r36h6115d3f_0 conda-forge
r-testthat 3.0.2 r36h03ef668_0 conda-forge
r-tibble 3.1.2 r36hcfec24a_0 conda-forge
r-utf8 1.2.1 r36hcfec24a_0 conda-forge
r-vctrs 0.3.8 r36hcfec24a_1 conda-forge
r-viridislite 0.4.0 r36hc72bb7e_0 conda-forge
r-waldo 0.2.5 r36hc72bb7e_0 conda-forge
r-withr 2.4.2 r36hc72bb7e_0 conda-forge
readline 7.0 hf8c457e_1001 conda-forge
salmon 0.14.2 ha0cc327_0 bioconda
samtools 1.6 h244ad75_5 bioconda
setuptools 58.0.4 py36h5fab9bb_2 conda-forge
sip 4.19.8 py36hf484d3e_1000 conda-forge
six 1.16.0 pyh6c4a22f_0 conda-forge
snpeff 4.1l hdfd78af_8 bioconda
spades 3.15.5 h95f258a_1 bioconda
sqlite 3.28.0 h8b20d00_0 conda-forge
srprism 2.4.24 h6a68c12_5 bioconda
sysroot_linux-64 2.17 h4a8ded7_18 conda-forge
tbb 2020.3 hfd86e86_0
tbl2asn 25.7 h9ee0642_1 bioconda
tk 8.6.13 noxft_h4845f30_101 conda-forge
tktable 2.10 h8bc8fbc_6 conda-forge
tornado 6.1 py36h8f6f2f9_1 conda-forge
trimmomatic 0.39 hdfd78af_2 bioconda
trinity 2.8.5 h8b12597_5 bioconda
tzdata 2024b hc8b5060_0 conda-forge
unzip 6.0 h611a1e1_0
vphaser2 2.0 h7a259b3_14 bioconda
wheel 0.37.1 pyhd8ed1ab_0 conda-forge
xorg-libice 1.0.10 h7f98852_0 conda-forge
xorg-libsm 1.2.2 h470a237_5 conda-forge
xorg-libx11 1.8.10 h4f16b4b_0 conda-forge
xorg-libxau 1.0.11 hb9d3cd8_1 conda-forge
xorg-libxdmcp 1.1.5 hb9d3cd8_0 conda-forge
xorg-libxext 1.3.6 hb9d3cd8_0 conda-forge
xorg-libxfixes 6.0.1 hb9d3cd8_0 conda-forge
xorg-libxi 1.8.2 hb9d3cd8_0 conda-forge
xorg-libxrender 0.9.11 hb9d3cd8_1 conda-forge
xorg-libxtst 1.2.5 hb9d3cd8_3 conda-forge
xorg-xorgproto 2024.1 hb9d3cd8_1 conda-forge
xz 5.2.6 h166bdaf_0 conda-forge
yaml 0.2.5 h7f98852_2 conda-forge
zlib 1.2.13 h4ab18f5_6 conda-forge
zstd 1.5.6 ha6fb4c9_0 conda-forge
commands of viral-ngs
bin/interhost.py
Enter a subcommand to view additional information:
interhost.py snpEff [...]
Annotate variants in VCF file with translation consequences
using snpEff.
interhost.py align_mafft [...]
Run the mafft alignment on the input FASTA file.
interhost.py multichr_mafft [...]
Run the mafft alignment on a series of chromosomes provided
in sample-partitioned FASTA files. Output as FASTA. (i.e.
file1.fasta would contain chr1, chr2, chr3; file2.fasta
would also contain chr1, chr2, chr3)
bin/ncbi.py
Enter a subcommand to view additional information:
ncbi.py tbl_transfer [...]
This function takes an NCBI TBL file describing features on
a genome(genes, etc) and transfers them to a new genome.
ncbi.py tbl_transfer_prealigned [...]
This breaks out the ref and alt sequences into separate
fasta files, and thencreates unified files containing the
reference sequence first and the alt second. Each of these
unified filesis then passed as a cmap to
tbl_transfer_common. This function expects to receive one
fasta file containing a multialignment of a single
segment/chromosome alongwith the respective reference
sequence for that segment/chromosome. It also expects a
reference containing allreference segments/chromosomes, so
that the reference sequence can be identified in the input
file by name. Italso expects a list of reference tbl files,
where each file is named according to the ID present for
itscorresponding sequence in the refFasta. For each non-
reference sequence present in the inputFasta, two files
arewritten: a fasta containing the segment/chromosome for
the same, along with its corresponding feature table
ascreated by tbl_transfer_common.
ncbi.py fetch_fastas [...]
This function downloads and saves the FASTA filesfrom the
Genbank CoreNucleotide database given a given list of
accession IDs.
ncbi.py fetch_feature_tables [...]
This function downloads and savesfeature tables from the
Genbank CoreNucleotide database given a given list of
accession IDs.
ncbi.py fetch_genbank_records [...]
This function downloads and savesfull flat text records from
Genbank CoreNucleotide database given a given list of
accession IDs.
ncbi.py prep_genbank_files [...]
Prepare genbank submission files. Requires .fasta and .tbl
files as input,as well as numerous other metadata files for
the submission. Creates adirectory full of files (.sqn in
particular) that can be sent to GenBank.
ncbi.py prep_sra_table [...]
This is a very lazy hack that creates a basic table that can
bepasted into various columns of an SRA submission
spreadsheet. It probablydoesn't work in all cases.
~/Scripts/check_sequence_differences.py
#!/usr/bin/env python3
from Bio import AlignIO
import sys
# Check if correct arguments are provided
if len(sys.argv) != 2:
print("Usage: python check_sequence_differences.py <input_clustal>")
sys.exit(1)
# Get the input file name from the command-line arguments
input_file = sys.argv[1]
# Read the alignment from the input CLUSTAL file
alignment = AlignIO.read(input_file, "clustal")
# Extract the sequences for easy comparison
seq_op = alignment[0].seq
seq_hsv1 = alignment[1].seq
seq_hsv_klinik = alignment[2].seq
# Initialize a list to store positions with differences
differences = []
# Iterate over each position in the alignment
for i in range(len(seq_op)):
op_base = seq_op[i]
hsv1_base = seq_hsv1[i]
hsv_klinik_base = seq_hsv_klinik[i]
# Compare the sequences at the current position
if op_base != hsv1_base or op_base != hsv_klinik_base or hsv1_base != hsv_klinik_base:
differences.append((i + 1, op_base, hsv1_base, hsv_klinik_base))
# Print the differences
if differences:
print("Differences found at the following positions:")
for diff in differences:
pos, op_base, hsv1_base, hsv_klinik_base = diff
print(f"Position {pos}: OP297860.1 = {op_base}, HSV1_S1-1 = {hsv1_base}, HSV-Klinik_S2-1 = {hsv_klinik_base}")
else:
print("No differences found between the sequences.")
~/Scripts/summarize_snippy_res.py
import pandas as pd
import glob
import argparse
import os
#python3 summarize_snps_indels.py snippy_HDRNA_01/snippy
#The following record for 2365295 is wrong, since I am sure in the HDRNA_01_K010, it should be a 'G', since in HDRNA_01_K010.csv
#CP133676,2365295,snp,A,G,G:178 A:0
#
#The current output is as follows:
#CP133676,2365295,A,snp,A,A,A,A,A,A,A,A,A,A,None,,,,,,None,None
#CP133676,2365295,A,snp,A,A,A,A,A,A,A,A,A,A,nan,,,,,,nan,nan
#grep -v "None,,,,,,None,None" summary_snps_indels.csv > summary_snps_indels_.csv
#BUG: CP133676,2365295,A,snp,A,A,A,A,A,A,A,A,A,A,nan,,,,,,nan,nan
import pandas as pd
import glob
import argparse
import os
def main(base_directory):
# List of isolate identifiers
isolates = ["HSV1_S1", "HSV-Klinik_S2"]
expected_columns = ["CHROM", "POS", "REF", "ALT", "TYPE", "EFFECT", "LOCUS_TAG", "GENE", "PRODUCT"]
# Find all CSV files in the directory and its subdirectories
csv_files = glob.glob(os.path.join(base_directory, '**', '*.csv'), recursive=True)
# Create an empty DataFrame to store the summary
summary_df = pd.DataFrame()
# Iterate over each CSV file
for file_path in csv_files:
# Extract isolate identifier from the file name
isolate = os.path.basename(file_path).replace('.csv', '')
df = pd.read_csv(file_path)
# Ensure all expected columns are present, adding missing ones as empty columns
for col in expected_columns:
if col not in df.columns:
df[col] = None
# Extract relevant columns
df = df[expected_columns]
# Ensure consistent data types
df = df.astype({"CHROM": str, "POS": int, "REF": str, "ALT": str, "TYPE": str, "EFFECT": str, "LOCUS_TAG": str, "GENE": str, "PRODUCT": str})
# Add the isolate column with the ALT allele
df[isolate] = df["ALT"]
# Drop columns that are not needed for the summary
df = df.drop(["ALT"], axis=1)
if summary_df.empty:
summary_df = df
else:
summary_df = pd.merge(summary_df, df, on=["CHROM", "POS", "REF", "TYPE", "EFFECT", "LOCUS_TAG", "GENE", "PRODUCT"], how="outer")
# Fill missing values with the REF allele for each isolate column
for isolate in isolates:
if isolate in summary_df.columns:
summary_df[isolate] = summary_df[isolate].fillna(summary_df["REF"])
else:
summary_df[isolate] = summary_df["REF"]
# Rename columns to match the required format
summary_df = summary_df.rename(columns={
"CHROM": "CHROM",
"POS": "POS",
"REF": "REF",
"TYPE": "TYPE",
"EFFECT": "Effect",
"LOCUS_TAG": "Gene_name",
"GENE": "Biotype",
"PRODUCT": "Product"
})
# Replace any remaining None or NaN values in the non-isolate columns with empty strings
summary_df = summary_df.fillna("")
# Add empty columns for Impact, Functional_Class, Codon_change, Protein_and_nucleotide_change, Amino_Acid_Length
summary_df["Impact"] = ""
summary_df["Functional_Class"] = ""
summary_df["Codon_change"] = ""
summary_df["Protein_and_nucleotide_change"] = ""
summary_df["Amino_Acid_Length"] = ""
# Reorder columns
cols = ["CHROM", "POS", "REF", "TYPE"] + isolates + ["Effect", "Impact", "Functional_Class", "Codon_change", "Protein_and_nucleotide_change", "Amino_Acid_Length", "Gene_name", "Biotype"]
summary_df = summary_df.reindex(columns=cols)
# Remove duplicate rows
summary_df = summary_df.drop_duplicates()
# Save the summary DataFrame to a CSV file
output_file = os.path.join(base_directory, "summary_snps_indels.csv")
summary_df.to_csv(output_file, index=False)
print("Summary CSV file created successfully at:", output_file)
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Summarize SNPs and Indels from CSV files.")
parser.add_argument("directory", type=str, help="Base directory containing the CSV files in subdirectories.")
args = parser.parse_args()
main(args.directory)
~/Scripts/merge_snps_indels.py
import pandas as pd
import argparse
import os
def merge_files(input_file1, input_file2, output_file):
# Read the input files
df1 = pd.read_csv(input_file1)
df2 = pd.read_csv(input_file2, sep='\t')
# Merge the dataframes on the 'POS' column, keeping only the rows that have common 'POS' values
merged_df = pd.merge(df2, df1[['POS']], on='POS', how='inner')
# Remove duplicate rows
merged_df.drop_duplicates(inplace=True)
# Save the merged dataframe to the output file
merged_df.to_csv(output_file, index=False)
print("Merged file created successfully at:", output_file)
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Merge two SNP and Indel files based on the 'POS' column.")
parser.add_argument("input_file1", type=str, help="Path to the first input file (summary_snps_indels.csv).")
parser.add_argument("input_file2", type=str, help="Path to the second input file (All_SNPs_indels_annotated.txt).")
parser.add_argument("output_file", type=str, help="Path to the output file.")
args = parser.parse_args()
merge_files(args.input_file1, args.input_file2, args.output_file)
~/Scripts/convert_fasta_to_clustal.py
#!/usr/bin/env python3
from Bio import AlignIO
import sys
# Check if the correct number of arguments is provided
if len(sys.argv) != 3:
print("Usage: python convert_fasta_to_clustal.py <input_fasta> <output_clustal>")
sys.exit(1)
# Get the input and output file names from command-line arguments
input_file = sys.argv[1]
output_file = sys.argv[2]
# Read the input FASTA file
alignment = AlignIO.read(input_file, "fasta")
# Write the alignment to the output CLUSTAL file
with open(output_file, "w") as out_file:
AlignIO.write(alignment, out_file, "clustal")
print(f"Conversion complete! The CLUSTAL file is saved as {output_file}.")
~/Scripts/convert_clustal_to_clustal.py
#!/usr/bin/env python3
from Bio import AlignIO
import sys
# Check if correct arguments are provided
if len(sys.argv) != 3:
print("Usage: python convert_clustal_to_fasta.py <input_clustal> <output_clustal>")
sys.exit(1)
# Get the input and output file names from command-line arguments
input_file = sys.argv[1]
output_file = sys.argv[2]
# Read the CLUSTAL alignment
alignment = AlignIO.read(input_file, "clustal")
# Extract sequences (assuming three sequences)
op_seq = alignment[0].seq
hsv1_seq = alignment[1].seq
hsv_klinik_seq = alignment[2].seq
# Make sure the sequences have the same length
if len(op_seq) != len(hsv1_seq) or len(op_seq) != len(hsv_klinik_seq):
print("Error: Sequences have different lengths!")
sys.exit(1)
# Prepare new sequences for HSV1 and HSV-Klinik
new_hsv1_seq = []
new_hsv_klinik_seq = []
# Iterate through each position of the sequences
for i in range(len(op_seq)):
op_base = op_seq[i]
hsv1_base = hsv1_seq[i]
hsv_klinik_base = hsv_klinik_seq[i]
# Apply the rules for replacing bases in HSV1_S1-1 and HSV-Klinik_S2-1
if hsv1_base in ['N', '-']:
# Replace with OP297860.1 base
new_hsv1_seq.append(op_base)
else:
# Otherwise, keep the original base
new_hsv1_seq.append(hsv1_base)
if hsv_klinik_base in ['N', '-']:
# Replace with OP297860.1 base
new_hsv_klinik_seq.append(op_base)
else:
# Otherwise, keep the original base
new_hsv_klinik_seq.append(hsv_klinik_base)
# Update the sequences in the alignment
alignment[1].seq = "".join(new_hsv1_seq)
alignment[2].seq = "".join(new_hsv_klinik_seq)
# Write the modified alignment back to a file in CLUSTAL format
with open(output_file, "w") as out_file:
AlignIO.write(alignment, out_file, "clustal")
print(f"Conversion complete! The modified CLUSTAL file is saved as {output_file}.")
~/Scripts/convert_clustal_to_fasta.py
#!/usr/bin/env python3
from Bio import AlignIO
import sys
# Check if the correct number of arguments is provided
if len(sys.argv) != 3:
print("Usage: python convert_clustal_to_fasta.py <input_clustal> <output_fasta>")
sys.exit(1)
# Get the input and output file names from command-line arguments
input_file = sys.argv[1]
output_file = sys.argv[2]
# Read the input CLUSTAL file
alignment = AlignIO.read(input_file, "clustal")
# Write the alignment to the output FASTA file
with open(output_file, "w") as out_file:
AlignIO.write(alignment, out_file, "fasta")
print(f"Conversion complete! The FASTA file is saved as {output_file}.")
~/Scripts/filter_isnv.py
#!/usr/bin/env python3
import sys
import pandas as pd
# Check for correct command-line arguments
if len(sys.argv) != 3:
print("Usage: python filter_isnv.py <input_file> <min_freq>")
sys.exit(1)
input_file = sys.argv[1]
min_freq = float(sys.argv[2])
# Load the data into a pandas DataFrame
data = pd.read_csv(input_file, sep='\t')
# Filter out records where all records at the same position have iSNV_freq < min_freq
def filter_isnv(data, min_freq):
# Group data by 'chr' and 'pos' to check records at each position
grouped = data.groupby(['chr', 'pos'])
# Keep groups where at least one record has iSNV_freq >= min_freq
filtered_data = grouped.filter(lambda x: any(x['iSNV_freq'] >= min_freq))
return filtered_data
# Apply the filter
filtered_data = filter_isnv(data, min_freq)
# Output the filtered data
output_file = "filtered_" + input_file
filtered_data.to_csv(output_file, sep='\t', index=False)
print(f"Filtered data saved to {output_file}")
点赞本文的读者
还没有人对此文章表态
没有评论
RNA-seq Tam on CP059040.1 (Acinetobacter baumannii strain ATCC 19606)
Typing of 81 S. epidermidis samples (Luise)
Co-Authorship Network Generator using scraped data from Google Scholar via SerpAPI
© 2023 XGenes.com Impressum