Prepare the databases for vrap

I used a two-tier strategy: first annotate the contigs using the virus-specific and bacteria-specific databases, then use the more general nt and nr databases. The results are attached. For some samples, for example S5, several contigs were detected as gammaherpesvirus. For the bacteria, the results are more conserved.
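The hand-off between the two tiers can be sketched as follows. This is only an illustration, not how vrap implements it internally: it assumes the first-tier hits are in BLAST tabular format (`-outfmt 6`, query ID in column 1), and the function name `contigs_without_hits` is hypothetical. It lists the contigs that got no hit in the specific databases and would go on to nt/nr.

```shell
# Sketch: list contig IDs that have no hit in the first (specific) tier.
# Assumption: hits file is BLAST tabular output with the query ID in column 1.
contigs_without_hits() {
    local contigs_fasta="$1"  # assembled contigs (FASTA)
    local hits_tsv="$2"       # tier-1 hits (tab-separated)
    local hit_ids
    hit_ids=$(mktemp)

    # Unique IDs of contigs with at least one tier-1 hit
    cut -f1 "$hits_tsv" | sort -u > "$hit_ids"

    # All contig IDs (header text up to the first whitespace, '>' removed),
    # minus the IDs that already have a hit
    grep '^>' "$contigs_fasta" \
        | sed -e 's/^>//' -e 's/[[:space:]].*$//' \
        | sort -u \
        | comm -23 - "$hit_ids"

    rm -f "$hit_ids"
}
```

The resulting ID list could then be used to extract the remaining contigs before searching the general databases.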

 # -- txid10239 (Virus) and Taxonomy ID: 2 (Bacteria) --
 # -- Virus --
 #TODO: update the virus sequences from 1,100,000 to 1,288,629 (up to 2020/07/01); for bacteria we can use RefSeq (up to 2020/07/01)!
 #Search order: --virus bacteria-refseq-fasta, then virus sequences and virus proteins as default databases, then nt and nr!
 #TODO!: download bact_nt_db and use in '--virus bact_nt_db'!
 #  pip install ncbi-genome-download
 #  ncbi-genome-download -F fasta bacteria
 #  ncbi-genome-download -F fasta viral
 #  https://www.ncbi.nlm.nih.gov/genome/microbes/
 #  https://www.biostars.org/p/9503245/

Download bacteria RefSeq with datasets

 #https://www.ncbi.nlm.nih.gov/datasets/docs/v1/download-and-install/
 #The NCBI Datasets command-line tools are datasets and dataformat.

 #datasets download genome taxon bacteria --assembly-source refseq --dehydrated --filename bacteria_refseq.zip
 ~/Tools/datasets download genome taxon bacteria --assembly-source refseq --dehydrated --exclude-protein --exclude-genomic-cds --exclude-rna --exclude-gff3 --filename bacteria_refseq_fasta.zip
 ~/Tools/datasets download genome taxon bacteria #2,231,190 records (all assembly sources, not dehydrated)
 ~/Tools/datasets download genome taxon bacteria --assembly-source refseq --dehydrated --exclude-protein --exclude-genomic-cds --exclude-rna --exclude-gff3 --filename bacteria_refseq.zip  #325,471 records
 #~/Tools/datasets download genome taxon virus #97,281 records
 #~/Tools/datasets download genome taxon virus --assembly-source refseq --dehydrated --exclude-protein --exclude-genomic-cds --exclude-rna --exclude-gff3 --filename virus_refseq.zip #14,992

 #Unzip the file
 unzip bacteria_refseq.zip -d bacteria_refseq
 unzip virus_refseq.zip -d virus_refseq

 #Rehydrate the package. The dehydrated option is recommended because it is faster and more reliable, despite the additional steps. By default, the data package includes genomic, transcript, protein and CDS sequences, in addition to GFF3; the --exclude-* flags in the download commands above restrict it to the genomic FASTA sequences.
 ~/Tools/datasets rehydrate --directory bacteria_refseq/
 ~/Tools/datasets rehydrate --directory virus_refseq/  #29984
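
After rehydration, the genomic FASTA files are spread over one directory per assembly. A minimal sketch for merging them into a single FASTA file is shown here. It assumes the usual package layout `<package>/ncbi_dataset/data/<accession>/*.fna` and GNU `find`/`sort`/`xargs`. The function name and the output file name are hypothetical, and whether vrap expects a merged FASTA or a prebuilt database for `--virus` is not stated in these notes.

```shell
# Sketch: merge all rehydrated genomic FASTA files of a package into one file.
# Assumption: files live under <pkg>/ncbi_dataset/data/<accession>/*.fna
merge_genomic_fasta() {
    local pkg_dir="$1"    # e.g. bacteria_refseq
    local out_fasta="$2"  # e.g. bacteria_refseq.fasta (hypothetical name)
    local data_dir="$pkg_dir/ncbi_dataset/data"
    local n_files
    local n_seqs

    if [ ! -d "$data_dir" ]; then
        echo "missing $data_dir (was the package unzipped and rehydrated?)" >&2
        return 1
    fi

    # Sorted, NUL-separated file list: reproducible order, and xargs copes
    # with the very large number of files in a bacterial RefSeq download.
    find "$data_dir" -type f -name '*.fna' -print0 \
        | sort -z \
        | xargs -0 -r cat > "$out_fasta"

    # Report how many genome files and sequences ended up in the output.
    n_files=$(find "$data_dir" -type f -name '*.fna' | wc -l | tr -d ' ')
    n_seqs=$(grep -c '^>' "$out_fasta" || true)
    echo "files=$n_files sequences=$n_seqs"
}
```

For example, `merge_genomic_fasta bacteria_refseq bacteria_refseq.fasta` would write the merged file and print the file and sequence counts, which can be compared with the record counts noted next to the download commands.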

Run vrap.py with --host genome.fa --virus bacteria_refseq [default viral_db up to 2020_07_01] -n nt -a nr

 # -- Virus --
 vrap/vrap.py -1 ../231114_VH00358_62_AACYCYWM5_cfDNA/p20430/635290002_CMV_S4_R1_001.fastq.gz -2 ../231114_VH00358_62_AACYCYWM5_cfDNA/p20430/635290002_CMV_S4_R2_001.fastq.gz -o CMV_S4_unbiased2 --host /home/jhuang/REFs/genome.fa -n /mnt/h1/jhuang/blast/nt -a /mnt/h1/jhuang/blast/nr -t 40 -l 200
 vrap/vrap.py -1 ../231114_VH00358_62_AACYCYWM5_cfDNA/p20431/635850623_EBV_S5_R1_001.fastq.gz -2 ../231114_VH00358_62_AACYCYWM5_cfDNA/p20431/635850623_EBV_S5_R2_001.fastq.gz -o EBV_S5_unbiased2 --host /home/jhuang/REFs/genome.fa -n /mnt/h1/jhuang/blast/nt -a /mnt/h1/jhuang/blast/nr -t 40 -l 200

 # -- Control --
 vrap/vrap.py -1 ../231114_VH00358_62_AACYCYWM5_cfDNA/p20428/neg_control_S2_R1_001.fastq.gz -2 ../231114_VH00358_62_AACYCYWM5_cfDNA/p20428/neg_control_S2_R2_001.fastq.gz -o neg_control_S2_unbiased2 --host /home/jhuang/REFs/genome.fa -n /mnt/h1/jhuang/blast/nt -a /mnt/h1/jhuang/blast/nr -t 40 -l 200   

 # -- Bacteria --
 vrap/vrap.py -1 ../231114_VH00358_62_AACYCYWM5_cfDNA/p20427/635031018_E_faecium_S1_R1_001.fastq.gz -2 ../231114_VH00358_62_AACYCYWM5_cfDNA/p20427/635031018_E_faecium_S1_R2_001.fastq.gz -o E_faecium_S1_unbiased2 --host /home/jhuang/REFs/genome.fa -n /mnt/h1/jhuang/blast/nt -a /mnt/h1/jhuang/blast/nr -t 40 -l 200
 vrap/vrap.py -1 ../231114_VH00358_62_AACYCYWM5_cfDNA/p20429/635724976_S_aureus_epidermidis_S3_R1_001.fastq.gz -2 ../231114_VH00358_62_AACYCYWM5_cfDNA/p20429/635724976_S_aureus_epidermidis_S3_R2_001.fastq.gz -o S_aureus_epidermidis_S3_unbiased2 --host /home/jhuang/REFs/genome.fa -n /mnt/h1/jhuang/blast/nt -a /mnt/h1/jhuang/blast/nr -t 40 -l 200
 #END
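
The five commands above differ only in the sample directory, the file prefix and the output name, so they can be generated from a small sample table. The following is a sketch under that assumption. The paths and options are copied from the commands above, while the function names `build_vrap_cmd` and `print_all_vrap_cmds` are hypothetical. It only prints the commands (a dry run), so they can be reviewed before being executed.

```shell
# Sketch: generate one vrap.py command per sample from a small table.
RUN_DIR="../231114_VH00358_62_AACYCYWM5_cfDNA"
HOST_FA="/home/jhuang/REFs/genome.fa"
NT_DB="/mnt/h1/jhuang/blast/nt"
NR_DB="/mnt/h1/jhuang/blast/nr"

# Print the vrap.py command line for one sample (does not run it).
build_vrap_cmd() {
    local subdir="$1"   # e.g. p20430
    local prefix="$2"   # e.g. 635290002_CMV_S4
    local outname="$3"  # e.g. CMV_S4
    printf 'vrap/vrap.py -1 %s -2 %s -o %s --host %s -n %s -a %s -t 40 -l 200\n' \
        "$RUN_DIR/$subdir/${prefix}_R1_001.fastq.gz" \
        "$RUN_DIR/$subdir/${prefix}_R2_001.fastq.gz" \
        "${outname}_unbiased2" \
        "$HOST_FA" "$NT_DB" "$NR_DB"
}

# Print the commands for all samples: directory, file prefix, output name.
print_all_vrap_cmds() {
    while read -r subdir prefix outname; do
        build_vrap_cmd "$subdir" "$prefix" "$outname"
    done <<'EOF'
p20427 635031018_E_faecium_S1 E_faecium_S1
p20428 neg_control_S2 neg_control_S2
p20429 635724976_S_aureus_epidermidis_S3 S_aureus_epidermidis_S3
p20430 635290002_CMV_S4 CMV_S4
p20431 635850623_EBV_S5 EBV_S5
EOF
}
```

Running `print_all_vrap_cmds` lists the commands. Once they look right, the output can be saved to a script file and executed.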

