Author Archives: gene_x

Transposable element (TEs) and structural variant (SV) detection in bacterial genomes

For a bacterial genome such as Acinetobacter baumannii, the pipeline would be slightly different than the one used for the human due to the simpler genome structure, the absence of introns, and the different nature of repetitive elements compared to eukaryotes. Here is a pipeline tailored to bacterial genomes:

Quality Control of Sequencing Reads: Use FastQC for quality control checks on raw sequence data.
```
fastqc your_reads.fastq.gz
```

Read Trimming: Trim adapters and low-quality bases using Trimmomatic.

trimmomatic PE your_reads_R1.fastq.gz your_reads_R2.fastq.gz \
  your_reads_R1_paired.fastq.gz your_reads_R1_unpaired.fastq.gz \
  your_reads_R2_paired.fastq.gz your_reads_R2_unpaired.fastq.gz \
  ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 SLIDINGWINDOW:4:20 MINLEN:36

De Novo Assembly or Reference Alignment: If a closely related reference genome is available:

bwa mem reference_genome.fasta your_reads_R1_paired.fastq.gz your_reads_R2_paired.fastq.gz > aligned_reads.sam

Or for de novo assembly of the bacterial genome:

spades.py -1 your_reads_R1_paired.fastq.gz -2 your_reads_R2_paired.fastq.gz --careful -o spades_output

Then, continue with the contigs (if de novo assembly was performed):

bwa mem reference_genome.fasta spades_output/contigs.fasta > aligned_contigs.sam

SAM/BAM Conversion and Sorting: Use SAMtools to convert and sort the alignment files.

samtools view -bS aligned_reads.sam > aligned_reads.bam
samtools sort aligned_reads.bam -o aligned_reads_sorted.bam
samtools index aligned_reads_sorted.bam

Detection of Transposable Elements: For bacterial genomes, tools like ISfinder, TnpPred, or custom scripts utilizing BLAST can be used to identify Insertion Sequences (IS) and other mobile elements.
```
# For ISfinder, it's usually a web-based tool or database search.

# For TnpPred:
TnpPred.py -i spades_output/contigs.fasta -o TnpPred_output
```
Annotation of Assembled Genome or Contigs: Use Prokka to annotate the assembled genome or contigs.
```
prokka --outdir prokka_output --prefix ab_contigs spades_output/contigs.fasta
```
Structural Variant Detection: For structural variants including transposable elements, you can use tools like MUMmer for comparing the assembled contigs to the reference genome.
```
nucmer --mum reference_genome.fasta spades_output/contigs.fasta -p output_prefix
```
Visualization: Tools like Artemis or IGV can be used to visualize the annotated genome and identify regions with transposable elements.
```
# Launch Artemis or IGV and load your BAM files and annotations for visualization.
```

Tools for detection of Transposable Elements (namely the step 5 above)

ISfinder:

# ISfinder does not come with a command-line tool. It's an online resource.
# You would download your sequence's IS annotations from ISfinder after submitting your sequences on their website.

RepeatMasker:

RepeatMasker -species bacteria -pa 4 -xsmall your_genome_sequence.fasta

TnpPred:

# First, download the TnpPred tool, then:
perl TnpPred.pl -i your_genome_sequence.fasta -o output_directory

MobileElementFinder (MEF):

# Assuming you have installed MobileElementFinder, the basic command would be:
python MobileElementFinder.py -i your_contigs.fasta -o mef_output -d mef_database_path

OASIS:

# OASIS is an online service, so you would use the web interface to submit your sequence data.

Meta-Mobilome:

# Similarly, Meta-Mobilome is an online tool, you would need to upload your data through their web portal.

ICEberg:

# ICEberg is also used via a web interface for annotation and detection of ICEs.

MobilomeFINDER:

# This service is web-based as well. You will need to interact with the MobilomeFINDER platform through the browser.

For structural variant (SV) detection in bacterial genomes (namely the step 7 above), we can consider using a range of tools designed to detect large genomic rearrangements, such as insertions, deletions, inversions, and translocations. Note that SV detection tools often require pre-processed data, such as alignment files (e.g., BAM files), which you need to create by mapping your reads to a reference genome with tools like BWA or Bowtie2. Some of these tools are also designed with eukaryotic genomes in mind, so their default settings might not be optimal for bacterial genomes, and you may need to adjust parameters accordingly. Here are some command-line examples for some of the tools that can be used for SV detection:
- MUMmer (particularly nucmer and show-diff for comparing assemblies):
```
nucmer --mum reference.fasta query.fasta -p output_prefix
show-diff -r output_prefix.delta > output_prefix.diffs
```
- DELly (originally designed for human genomes, but can be adapted for bacteria with long-read data):
```
delly call -g reference.fasta -o output.bcf input.bam
```
- Pindel (can detect large deletions and insertions):
```
pindel -f reference.fasta -i config.txt -c ALL -o output_prefix
```
- LUMPY (a probabilistic framework for SV discovery):
```
lumpyexpress -B input.bam -S input.splitters.bam -D input.discordants.bam -o output.vcf
```
- BreakDancer (can predict various types of SVs):
```
breakdancer-max config.txt > output.ctx
```
- breseq (especially good for short-read data in bacteria):
```
breseq -r reference.gbk input_reads.fastq -o output_directory
```

LT manuscript TODOs

Leave a reply

Define “promoter types” by examining the frequency and distribution of the ‘GRGGC’ motif on both + and – strands within the promoter region, and categorize promoters based on the identified “promoter types.”
Investigate the reasons for observing only a small proportion of peaks in the promoter regions in my analysis (e.g., 203 out of 1938 peaks in NHDF samples).
Evaluate the consistency of peaks between replicates for both NHDF and K331A samples.
Identify the overlap between peaks and integration sites.

Perform enhancer analysis using the additional data for H3K4me1 and H3K27ac marks, obtained from public repositories. /home/jhuang/DATA/Data_Denise_sT_immune_evasion_PUBLISHING/Data_Denise_ChIPSeq_Protocol2/Data_H3K27ac/Raw_Data

#H3K4me3_H3K27ac__H3K27me3_H3K9me3
V_8_1_6_p601_d8_D1_H3K4me3.fastq.gz,V_8_1_6_p601_d8_D1_input.fastq.gz,p601_H3K4me3_D1
V_8_1_5_p601_d8_D2_H3K4me3.fastq.gz,V_8_1_5_p601_d8_D2_input.fastq.gz,p601_H3K4me3_D2
V_8_1_6_p604_d8_D1_H3K4me3.fastq.gz,V_8_1_6_p604_d8_D1_input.fastq.gz,p604_H3K4me3_D1
V_8_1_5_p604_d8_D2_H3K4me3.fastq.gz,V_8_1_5_p604_d8_D2_input.fastq.gz,p604_H3K4me3_D2
V_8_1_6_p601_d8_D1_H3K27me3.fastq.gz,V_8_1_6_p601_d8_D1_input.fastq.gz,p601_H3K27me3_D1
V_8_1_5_p601_d8_D2_H3K27me3.fastq.gz,V_8_1_5_p601_d8_D2_input.fastq.gz,p601_H3K27me3_D2
V_8_1_6_p604_d8_D1_H3K27me3.fastq.gz,V_8_1_6_p604_d8_D1_input.fastq.gz,p604_H3K27me3_D1
V_8_1_5_p604_d8_D2_H3K27me3.fastq.gz,V_8_1_5_p604_d8_D2_input.fastq.gz,p604_H3K27me3_D2
V_8_1_7_p601_d8_D1_H3K9me3.fastq.gz,V_8_1_6_p601_d8_D1_input.fastq.gz,p601_H3K9me3_D1
V_8_1_7_p601_d8_D2_H3K9me3.fastq.gz,V_8_1_5_p601_d8_D2_input.fastq.gz,p601_H3K9me3_D2
V_8_1_7_p604_d8_D1_H3K9me3.fastq.gz,V_8_1_6_p604_d8_D1_input.fastq.gz,p604_H3K9me3_D1
V_8_1_7_p604_d8_D2_H3K9me3.fastq.gz,V_8_1_5_p604_d8_D2_input.fastq.gz,p604_H3K9me3_D2
V_8_1_8_p601_d8_D1_H3K27ac.fastq.gz,V_8_1_6_p601_d8_D1_input.fastq.gz,p601_H3K27ac_D1
V_8_1_8_p601_d8_D2_H3K27ac.fastq.gz,V_8_1_5_p601_d8_D2_input.fastq.gz,p601_H3K27ac_D2
V_8_1_8_p604_d8_D1_H3K27ac.fastq.gz,V_8_1_6_p604_d8_D1_input.fastq.gz,p604_H3K27ac_D1
V_8_1_8_p604_d8_D2_H3K27ac.fastq.gz,V_8_1_5_p604_d8_D2_input.fastq.gz,p604_H3K27ac_D2

#H3K27ac_H3K4me1_public
#https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM733662
#https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1003526
NHDF_H3K4me1_r1.fastq.gz
NHDF_H3K4me1_r2.fastq.gz
NHDF_H3K27ac_r1.fastq.gz
NHDF_H3K27ac_r2.fastq.gz
NHDF_Control_r1.fastq.gz
NHDF_Control_r2.fastq.gz

In the PCA plot, adjust the x-axis, y-axis and z-axis to their actual scales, and exclude d3 samples.
Identify the significant genes by conducting analyses focused on the specific sequencing data of interest (for example, only input the LT_d8 and control_d8 data for the LT_vs_control analysis), and then compare the lists of significant genes derived from these targeted analyses with those from the comprehensive analysis.
```
#Error during "macs2 callpeak"
(rnaseq) jhuang@hamm:~/DATA/Data_Denise_LT_DNA_Binding/ChIPseq_histone_hg38/H3K27ac_H3K4me1_public$ nextflow run NGI-ChIPseq/main.nf --reads '/mnt/h1/jhuang/DATA/Data_Denise_LT_DNA_Binding/ChIPseq_histone_hg38/H3K27ac_H3K4me1_public/Raw_Data/*.fastq.gz' --genome hg38 --macsconfig macs.config --saveReference --saveAlignedIntermediates --singleEnd --blacklist_filtering -profile standard --project NHDF_enhancer_analysis_hg38 -resume
```
new

Cell line T antigens Nr of peaks

293 Early region 2219

293 LTtr ?

PFSK1 Early region 7795

PFSK1 LTtr ?

nHDF LT 1938

nHDF LTtr 381

WaGa sT+LTtr (endogenous) ?

old

Cell line T antigens Nr of peaks

293 Early region 2219

293 LTtr ?

PFSK1 Early region 7795

nHDF LT 1938

nHDF LTtr ?

nHDF LTtr 381

WaGa sT+LTtr (endogenous) ?
A brief overview of each cell line and their relevance to MCPyV:
- HEK293:
  - Origin: Human embryonic kidney cells, but with characteristics of immature neuronal cells.
  - MCPyV: These cells are often used in virology research due to their high transfection efficiency. They are not naturally permissive to MCPyV infection, but they can be used to study the virus by introducing viral genes through transfection.
- PFSK-1:
  - Origin: Derived from a primary cerebral tumor of a 22-month-old child, these cells have neuronal properties.
  - MCPyV: Not typically associated with MCPyV research, as these cells are of neural origin. MCC typically does not arise from neural tissue.
- nHDF (Normal Human Dermal Fibroblasts):
  - Origin: These are primary cells derived from normal human dermal skin.
  - MCPyV: While not derived from Merkel cells, fibroblasts like nHDF can be used in studies to understand the impact of MCPyV on skin cells or to provide a non-cancerous control in MCPyV studies.
- WaGa:
  - Origin: A cell line derived from human Merkel cell carcinoma.
  - MCPyV: WaGa cells are MCPyV-positive, meaning they contain the virus, and are used in research to understand the biology of MCPyV-associated MCC.
- MKL-1:
  - Origin: Also a cell line derived from human Merkel cell carcinoma.
  - MCPyV: MKL-1 cells are MCPyV-positive and have been used extensively in studies to understand MCPyV’s role in Merkel cell carcinoma pathogenesis.
Each of these cell lines offers a different context for studying the biology of MCPyV, its infection mechanism, the cellular response to the virus, and the development of MCC. The HEK293 line is often used for molecular cloning and gene expression studies due to its ease of transfection and high growth rate. In contrast, WaGa and MKL-1 cells, being derived from MCC, provide a more disease-relevant system to study the virus’ oncogenic mechanisms.

In virology and cancer research, choosing the appropriate cell line is crucial, as it can significantly affect the relevance and applicability of the findings. The HEK293 cells would provide a basic understanding of the virus-host interactions at the molecular level, whereas WaGa and MKL-1 cells would be used to study the virus in the context of its natural disease state. PFSK-1 and nHDF might serve as non-MCC comparators or be used in MCPyV research for more specialized purposes.
Sample code explanation

Cell line	T antigens	Nr of peaks
293	Early region	2219
293	LTtr	?
PFSK1	Early region	7795
PFSK1	LTtr	?
nHDF	LT	1938
nHDF	LTtr	381
WaGa	sT+LTtr (endogenous)	?

Cell line	T antigens	Nr of peaks
293	Early region	2219
293	LTtr	?
PFSK1	Early region	7795
nHDF	LT	1938
nHDF	LTtr	?
nHDF	LTtr	381
WaGa	sT+LTtr (endogenous)	?

Markdown	Less	Pretty
Still	`renders`	nicely
1	2	3

Item	Price	# In stock
Juicy Apples	1.99	7
Bananas	1.89	5234

    | Sample Code | Sample Name |
    | ----------- | ----------- |
    | untreated_d0_DonorI | untreated_Day0 |
    | untreated_d0_DonorII | untreated_Day0 |
    | p601_d3_DonorII | mCherry-control_Day3 | 
    | p604_d3_DonorII | sT_Day3 |
    | p601_d8_DonorII | mCherry-control_Day8 |
    | p604_d8_DonorII | sT_Day8 |
    | p601_d3_DonorI | mCherry-control_Day3 |
    | p604_d3_DonorI | sT_Day3 |
    | p601_d8_DonorI | mCherry-control_Day8 |
    | p604_d8_DonorI | sT_Day8 |
    | p600_d3_DonorII | GFP-control_Day3 |
    | p605_d3_DonorII | LTtr_Day3 |
    | p600_d8_DonorII | GFP-control_Day8 |
    | p605_d8_DonorII | LTtr_Day8 |
    | p600_d3_DonorI | GFP-control_Day3 |
    | p605_d3_DonorI | LTtr_Day3 |
    | p600_d8_DonorI | GFP-control_Day8 |
    | p605_d8_DonorI | LTtr_Day8 |
    | p602_d8_DonorII | LT_Day8 |
    | p602_d8_DonorI | LT_Day8 |
    | p600and601_d12_DonorI | GFP+mCherry-control_Day12 |
    | p604and605_d12_DonorI | sT+LTtr_Day12 |
    | p600and601_d9_DonorII | GFP+mCherry-control_Day9 |
    | p604and605_d9_DonorII | sT+LTtr_Day9 |
    | p602_d3_DonorI | LT_Day3 |
    | p602_d3_DonorII | LT_Day3 |
    | p602and604_d3_DonorI | sT+LT_Day3 |
    | p602and604_d3_DonorII | sT+LT_Day3 |

Plasmid Design for Genetic Engineering

Leave a reply

Plasmid design is a critical aspect of molecular biology and genetic engineering, where a circular DNA molecule (plasmid) is engineered to replicate within a host organism independently. Here’s an overview of the plasmid design process:

Define Purpose:

Determine the goal of your plasmid (e.g., protein expression, gene knockout, CRISPR-Cas9 genome editing, reporter gene expression).
Select a Backbone:

Choose a plasmid backbone suitable for your host organism (e.g., pUC19 for E. coli, pYES2 for yeast). Consider the copy number, origin of replication, and antibiotic resistance markers.
Insert Selection:

Decide on the gene or DNA fragment you want to insert (e.g., a gene of interest, sgRNA, reporter gene). Ensure that your insert has the proper regulatory elements for expression (promoters, enhancers, terminators).
Restriction Sites:

Choose restriction enzymes for cloning that are compatible with your insert and plasmid backbone. Make sure the restriction sites are unique within the plasmid and insert.
Multiple Cloning Site (MCS):

Incorporate a MCS if you plan to use the plasmid for multiple projects or inserts.
Tagging:

If protein expression and purification are your goals, include tags for detection or purification (e.g., His-tag, FLAG-tag).
Sequences for Stability:

Add sequences that ensure plasmid stability, such as centromere sequences for yeast plasmids or partitioning sequences for low-copy-number plasmids.
Selection Markers:

Choose an antibiotic resistance gene that allows for selection in your host organism.
Verification Elements:

Include reporter genes (like GFP) or lacZ alpha for blue/white screening if you need to verify cloning success.
Sequencing Primers:

Design sequencing primers for confirmation of the insert after cloning.
Software Tools:

Use plasmid design software (e.g., SnapGene, Benchling, Geneious) to assemble your plasmid map and verify that all elements are correctly oriented.
Synthesis or Assembly:

Depending on the complexity, synthesize the DNA directly or assemble it using molecular cloning techniques (restriction cloning, Gibson assembly, Golden Gate cloning, or TOPO cloning).
Validation:

Validate your plasmid construct through restriction digestion, PCR, and sequencing.
Documentation:

Keep detailed records of your plasmid design and the reasoning behind each element to facilitate future troubleshooting and replication of your work.

Remember that successful plasmid design also depends on a clear understanding of the biological context in which the plasmid will be used, such as the specific needs of the host organism and the experimental application.

Scaffolding and finishing an assembly with a reference genome

Leave a reply

Below are example commands for each step of scaffolding and finishing an assembly with a reference genome. Please note that the specific parameters may need to be adjusted based on your data and computational resources. Replace reference.fasta, contigs.fasta, reads_1.fq, and reads_2.fq with the paths to your reference genome, your contig sequences, and your paired-end sequencing reads, respectively. Furthermore, software like RagTag and GapFiller may require additional setup steps or files, like read mapping beforehand or specific configuration files. Always refer to the documentation of each software for the most accurate and effective usage.

Alignment of Contigs to Reference:

nucmer --maxgap=500 --mincluster=100 --prefix=out reference.fasta contigs.fasta

Visualization: For visualizing the alignment, you can create a plot using MUMmer’s mummerplot or view it using a genome browser like IGV.
```
mummerplot --png --layout --filter --prefix=plot_out out.delta
```
Scaffolding: Using RagTag to scaffold contigs against a reference.
```
ragtag.py scaffold reference.fasta contigs.fasta
```
Gap Filling: Using GapFiller to fill gaps within the scaffolds. Here config.txt will include paths to your reads and the initial assembly.
```
GapFiller.pl -b config.txt -m 50 -o 2 -r 0.7 -n 10 -d 50 -i 3 -g 10 -t 30 -T 10 -B 200
```

Validation: Re-mapping reads to your assembly to check for consistency, using Bowtie2 for example:

bowtie2-build assembly.fasta assembly_index
bowtie2 -x assembly_index -1 reads_1.fq -2 reads_2.fq -S aligned.sam

Manual Curation: There is no direct command for this step as it involves manually inspecting and editing the assembly, but tools like Consed or Gap5 from the Staden package can be used for manual editing.

Annotation: Using Prokka for bacterial genome annotation:

prokka --outdir my_annotation --prefix my_bacteria assembly.fasta

Examples

ragtag.py correct ATCC17978.fasta  shovill/A19_17978_HQ/contigs.fa

spades.py -1 results_ATCC19606/raw_data/A6_WT_HQ_R1.fastq.gz -2 results_ATCC19606/raw_data/A6_WT_HQ_R2.fastq.gz --careful --trusted-contigs ATCC19606.fasta -o spades_A6_ATCC19606
spades.py -1 results_ATCC19606/raw_data/A10_CraA_HQ_R1.fastq.gz -2 results_ATCC19606/raw_data/A10_CraA_HQ_R2.fastq.gz --careful --trusted-contigs ATCC19606.fasta -o spades_A10_ATCC19606
spades.py -1 results_AYE/raw_data/A12_AYE_HQ_R1.fastq.gz -2 results_AYE/raw_data/A12_AYE_HQ_R2.fastq.gz --careful --trusted-contigs AYE.fasta -o spades_A12_AYE
spades.py -1 raw_data/A19_17978_HQ_R1.fastq.gz -2 raw_data/A19_17978_HQ_R2.fastq.gz --careful --trusted-contigs ATCC17978.fasta -o spades_A19_ATCC17978

In SPAdes, the --careful and --trusted-contigs flags serve different purposes:

--careful:

    This flag is used to reduce the number of mismatches and short indels. SPAdes performs additional mismatch correction and indel searching within the assembly algorithm when this flag is on. It makes the assembly process slower but can improve the quality of the final assembly by being more conservative in the handling of errors.
    The careful mode is particularly useful when you have high-quality sequencing data and you want to minimize the number of errors in the final assembly.

--trusted-contigs:

    The --trusted-contigs option is used to provide SPAdes with high-quality contig sequences that you trust to be correct. This can be useful when you have a reference genome or partial assembly that is known to be of high quality, and you want to use this information to guide the assembly process.
    SPAdes will use these trusted contigs to assist in the assembly of the reads, essentially using them as a scaffold to fill in gaps and extend contigs. This can help in resolving repetitive regions or complex areas of the genome that are difficult to assemble using reads alone.

Using --trusted-contigs does not make SPAdes any more conservative in terms of error correction; it simply provides additional information to help guide the assembly. The --careful mode, on the other hand, actively seeks to minimize errors throughout the assembly process.

When combined, --careful and --trusted-contigs can be used to create a high-quality assembly that benefits from both error minimization and the use of trusted reference sequences.

#debug the conda environment bengal3_ac3 on hamm
mamba install -c anaconda openjdk=11

Calling peaks using findPeaks of HOMER

Leave a reply

https://fonseca.lab.mcgill.ca/resources/20220810_ATAC_Analysis_Demo/guide.html#peaks

Introduction
Accessing the data
Peak calling
Annotating peak files
Filtering the data
Visualizing the data
Gene set enrichment analysis
Principal component analysis
Motif analysis

H3K4me3 is associated with active promoters and is commonly found in narrow peaks. Thus, you might consider using the “-style factor” parameter when using Homer’s findPeaks command, which is designed for transcription factors and histone modifications with narrow peaks.

Here’s a basic usage of findPeaks for H3K4me3:

findPeaks <tag directory> -style factor -o auto -i <input tag directory>
#makeTagDirectory <alignment file>
conda activate myperl
#makeTagDirectory <output directory> <input file> -genome hg38
for sample in p783_ChIP_DonorI p783_ChIP_DonorII p783_input_DonorI p783_input_DonorII; do
   makeTagDirectory ${sample} ../picard/${sample}.dedup.sorted.bam -genome hg38
done

The “-style factor” option sets several parameters automatically (like the size of the sliding window, the required local fold enrichment over the background, and others) that are suitable for factor-style peaks.

Remember that peak calling can be somewhat subjective and might require additional parameter tuning and downstream filtering depending on the specific characteristics of your dataset, your experiment, and your research question.

In some cases, you may want to adjust parameters such as the -size (to specify the size of the region used for peak finding), -minDist (to specify the minimum distance between peaks), or others. However, for most typical use cases, the default parameters set by the “-style factor” option should work well for H3K4me3.

H3K27me3 (trimethylation of the 27th lysine residue on the H3 protein) is a histone modification associated with transcriptional repression, and it is often found in broad domains rather than narrow peaks. Thus, you might want to use the “-style histone” parameter when using Homer’s findPeaks command, which is designed for histone modifications with broad peaks.

Here’s a basic usage of findPeaks for H3K27me3:

findPeaks <tag directory> -style histone -o auto -i <input tag directory>
#<tag directory> should be replaced with the name of your tag directory.

The “-style histone” option sets several parameters automatically that are suitable for histone-style peaks, including larger window sizes and a more stringent fold-enrichment requirement over local background.

H3K27ac (acetylation of the 27th lysine residue on the H3 protein) is a histone modification typically associated with active promoters and enhancers. Unlike H3K27me3, which tends to occur in broader regions, H3K27ac often forms relatively sharp peaks similar to those of transcription factor binding sites.

Here’s a basic usage of findPeaks for H3K27ac:

findPeaks <tag directory> -style histone -o auto -i <input tag directory>
#<tag directory> should be replaced with the name of your tag directory.

The “-style histone” option is designed to work well with many histone modifications, including H3K27ac. This setting automatically adjusts several parameters to values that are typically appropriate for histone modifications.

However, peak calling is not an exact science and can be somewhat subjective. Depending on the specific characteristics of your dataset and your research question, you might need to adjust the parameters. Some parameters that you may want to experiment with include -size (to specify the size of the region used for peak finding), -minDist (to specify the minimum distance between peaks), or -region (to call broader enriched regions rather than sharp peaks).

In some cases, using “-style factor” might be more appropriate for H3K27ac data, given the sharpness of the peaks. You can try both and see which set of peaks makes more sense in your specific experimental context. It’s also always a good idea to visualize some of the called peaks in a genome browser to ensure that the peaks look reasonable.

H3K4me1 (monomethylation of the fourth lysine residue on the H3 protein) is a histone modification usually associated with enhancer regions. The peaks of H3K4me1 often span a broad region.

Here’s a basic usage of findPeaks for H3K4me1:

findPeaks <tag directory> -style histone -o auto -i <input tag directory>
#<tag directory> should be replaced with the name of your tag directory.

The -style histone option is designed to work well with many histone modifications, including H3K4me1. This setting automatically adjusts several parameters to values that are typically appropriate for histone modifications.

However, peak calling is not an exact science and the “best” parameters can depend on various factors including the specifics of your experiment and your research question. You might need to adjust parameters based on your specific needs. Some parameters that you may want to adjust include -size (to specify the size of the region used for peak finding), -minDist (to specify the minimum distance between peaks), or -region (to call broader enriched regions rather than sharp peaks).

In the case of H3K4me1, the peaks are generally broader than other histone modifications such as H3K4me3. Therefore, you might want to consider increasing the -size parameter or use the -region option to better capture these broad regions of enrichment.

As always, after peak calling, you should visualize some of the called peaks in a genome browser to ensure that they look reasonable and align well with the experimental data.

H3K9me3 is a histone modification often associated with transcriptional repression and heterochromatin, and it tends to form large domains rather than sharp peaks.

Here’s a basic usage of findPeaks for H3K9me3:

findPeaks <tag directory> -style histone -o auto -i <input tag directory>
#<tag directory> should be replaced with the name of your tag directory.

The -style histone option is designed to work well with many histone modifications. This setting automatically adjusts several parameters to values that are typically appropriate for histone modifications.

However, for modifications like H3K9me3 that often cover broad domains, you might want to use the -region option to call enriched regions rather than individual peaks. This will capture the broader domains characteristic of H3K9me3:

findPeaks <tag directory> -style histone -region -o auto -i <input tag directory>

Peak calling is not an exact science and the “best” parameters can depend on various factors including the specifics of your experiment and your research question. Therefore, you may need to adjust parameters based on your specific needs.

Finally, it’s always good to check the output of your peak calling by visualizing some of the called peaks in a genome browser. This will ensure they align well with the experimental data and look like what you would expect given what is known about the biology of the histone mark.

#Default Parameters:
findPeaks Macrophage-H3K4me1/ -i Macrophage-IgG -size 150 -minDist 370 > output.txt (i.e. defaults)

#Recommend Parameters for fixed width peaks (i.e. for motif finding):
findPeaks Macrophage-H3K4me1/ -i Macrophage-IgG -size 1000 -minDist 2500 > output.txt

#Default Parameters for variable length peaks.
findPeaks Macrophage-H3K4me1/ -i Macrophage-IgG -region -size 150 -minDist 370 > output.txt

#Effect on variable length peaks if we increase minDist to 1000.
findPeaks Macrophage-H3K4me1/ -i Macrophage-IgG -region -size 150 -minDist 1000 > output.txt

#Recommend Parameters for variable length peaks (H3K4me1 at least).
findPeaks Macrophage-H3K4me1/ -i Macrophage-IgG -region -size 1000 -minDist 2500 > output.txt

#Effect on variable peaks if we increase minDist to 10000 (H3K4me1 at least).
findPeaks Macrophage-H3K4me1/ -i Macrophage-IgG -region -size 1000 -minDist 10000 > output.txt

A decade of advances in transposon-insertion sequencing

Leave a reply

摘要

自现代转座子插入测序（TIS）方法引入以来已有10年的时间了。这些方法结合了全基因组转座子突变和高通量测序，用于估计细菌基因组中每个遗传组分的适应性贡献或必需性。2009年，发布了四种TIS变体：转座子测序（Tn-Seq）、转座子定向插入位点测序（TraDIS）、插入测序（INSeq）和高通量插入跟踪深度测序（HITS）。此后，TIS已成为分子微生物学家的重要工具之一，它是少数直接将表型与基因型联系起来并最终能够指派基因功能的全基因组技术之一。在本综述中，我们讨论了TIS最近应用于回答重大生物学问题的情况。我们还探讨了基于TIS的新兴和跨学科方法，着眼于未来的应用前景。

引言

转座子插入测序（TIS）方法结合了大规模转座子突变和下一代测序技术，以同时估计细菌基因组中每个遗传特征的必需性和/或适应性贡献。TIS的一个优势在于，实验是使用混合的转座子文库进行的，这使得能够以高通量的方式直接将表型与基因型联系起来。TIS的最终目标是阐明每个基因组特征的功能，因此是帮助解读不断增加的基因组测序数据的关键工具。在足够的密度下，TIS方法不仅对检测突变体适应性的微小变化足够敏感，而且足够精确，能够检验基因以及基因间区域、启动子区域和编码区域内的必需蛋白质域。2009年，公布了四种TIS方法的变体：转座子测序（Tn-Seq）、转座子定向插入位点测序（TraDIS）、插入测序（INSeq）和高通量插入跟踪深度测序（HITS）。自那以后，TIS已成为我们分子生物学工具包中的宝贵工具，其全部用途仍在探索中。

TIS工作流程的基本过程概述在图1中。简而言之，它开始于构建饱和突变体文库（图1A），方法是通过转化或接合等方式将随机插入的转座子（通常是Tn5或mariner转座子）引入感兴趣的菌株中。目标是创建一个细菌群体，其中每个细胞在基因组中携带一个单一的转座子插入，当细胞混合在一起时，每个遗传组分在不同的位点被多次破坏。通过直接测序初始文库中转座子侧翼区域，可以识别那些不容许插入的潜在必需特征。或者，可以将文库置于选择性条件下，例如抗生素应激（图1B），以查询在该环境中生存和生长所涉及的非必需特征。这些条件重要组分是通过插入频率在选择过程中显著变化而定义的，这是通过在选择前后进行测序确定的。在实验选择中频率降低的具有破坏性转座子插入的基因组特征被认为对适应性重要；这些特征可能包括在抗生素选择期间的抗生素抗性基因或在感染模型中的毒力因子。插入频率增加的特征被认为在测试条件下具有不利影响，包括对增强适应性特征的负调控因子，或在那些条件下不必要的代谢成本高昂的系统。

How to run the pipeline vrap?

Leave a reply

download viral database

cp download_db.py down_db.py.backup
sed -i -e 's/39733/1396/g' download_db.py
#NOTE the download_db.py cannot run alone, rather than via "vrap.py -u"
(damian) jhuang@hamburg:~/DATA/Data_Tam_Acinetobacter_baumannii$ ./vrap/vrap.py -u
mv vrap/database/viral_db vrap/database/viral_db_Bacillus_cereus

run vrap.py

#--host=/home/jhuang/REFs/Sus_scrofa.Sscrofa11.1.dna.toplevel.fa 
vrap.py -1 ./results_ATCC19606/trimmed/A10_CraA_HQ_trimmed_P_1.fastq.gz -2 ./results_ATCC19606/trimmed/A10_CraA_HQ_trimmed_P_2.fastq.gz -o A10_vrap_out --virus=/home/jhuang/DATA/Data_Tam_Acinetobacter_baumannii/ATCC19606.fasta -n /media/jhuang/Titisee/GAMOLA2/Blast_db/nt -a /media/jhuang/Titisee/GAMOLA2/Blast_db/nr -t 4
#vrap/vrap.py -1 220211_M03701_0272_000000000-JWYW9/p12568/BovHepV_Bulgaria_19_S3_R1_001.fastq.gz -2 220211_M03701_0272_000000000-JWYW9/p12568/BovHepV_Bulgaria_19_S3_R2_001.fastq.gz -o BovHepV_Bulgaria_19_on_Hepacivirus --host=/home/jhuang/REFs/Bos_taurus.ARS-UCD1.2.dna.toplevel.fa  -t 3
#IMPORTANT, it doesn't work, since the update will always downlod the repeated sequences causing ERROR! It needs to delete it as follows!!

#In summary: --(optional) host --> bowtie2/host (bowtie_database for host-substraction) + blast/db/host (in nt_dbs, custom blastn database before downloaded_db!)
#            --(optional) virus --> blast/db/virus (in nt_dbs, custom blastn database before downloaded_db!)
#            --(always) defined in download_db.py (setting by modifying the file) and run default --> database/viral_db/viral_nucleotide (nt_dbs)
#                                                                                        --> database/viral_db/viral_protein (prot_db)
#            --(optional) nt+nr
#Therefore, the results substracted host should not need the blast with host, however it it in nt_dbs!

#Under (vrap) jhuang@hamm:~/DATA/Data_Tam_Acinetobacter_baumannii$ 
#DEBUG: {path}/external_tools/Lighter-master/lighter-->/home/jhuang/miniconda3/envs/vrap/bin/lighter
#DEBUG: {path}/external_tools/SPAdes-3.11.1-Linux/bin/spades.py-->/home/jhuang/miniconda3/envs/vrap/bin/spades.py

#A6: --host=/home/jhuang/REFs/Sus_scrofa.Sscrofa11.1.dna.toplevel.fa  --virus=/home/jhuang/DATA/Data_Tam_Acinetobacter_baumannii/vrap/database/viral_db_Bacillus_cereus/nucleotide_Bacillus_cereus_0.01.fasta
vrap/vrap.py -1 ./results_ATCC19606/trimmed/A6_WT_HQ_trimmed_P_1.fastq.gz -2 ./results_ATCC19606/trimmed/A6_WT_HQ_trimmed_P_2.fastq.gz -o A6_vrap_out  -n /mnt/h1/jhuang/blast/nt -a /mnt/h1/jhuang/blast/nr -t 55

#A10　
#START checking blast db: no host and with virus --> mapping to 5 dbs: blast/db/virus, viral_nucleotide, viral_protein, nt, nr
vrap/vrap.py -1 ./results_ATCC19606/trimmed/A10_CraA_HQ_trimmed_P_1.fastq.gz -2 ./results_ATCC19606/trimmed/A10_CraA_HQ_trimmed_P_2.fastq.gz -o A10_vrap_out --virus=/home/jhuang/DATA/Data_Tam_Acinetobacter_baumannii/vrap/database/viral_db_Bacillus_cereus/nucleotide_Bacillus_cereus_0.01.fasta   -n /mnt/h1/jhuang/blast/nt -a /mnt/h1/jhuang/blast/nr -t 55

#A12
vrap/vrap.py -1 ./results_AYE/trimmed/A12_AYE_HQ_trimmed_P_1.fastq.gz -2 ./results_AYE/trimmed/A12_AYE_HQ_trimmed_P_2.fastq.gz -o A12_vrap_out  -n /mnt/h1/jhuang/blast/nt -a /mnt/h1/jhuang/blast/nr -t 55

#A19 
vrap/vrap.py -1 ./results_ATCC17978/trimmed/A19_17978_HQ_trimmed_P_1.fastq.gz -2 ./results_ATCC17978/trimmed/A19_17978_HQ_trimmed_P_2.fastq.gz -o A19_vrap_out  -n /mnt/h1/jhuang/blast/nt -a /mnt/h1/jhuang/blast/nr -t 55

#START VrAP
#START checking blast db
#START lighter
#START estimate k-mer size
#k-mer size: 5964136
#/home/jhuang/miniconda3/envs/vrap/bin/lighter -r ./results_ATCC19606/trimmed/A6_WT_HQ_trimmed_P_1.fastq.gz -r ./results_ATCC19606/trimmed/A6_WT_HQ_trimmed_P_2.fastq.gz -K 20 5964136 -od /mnt/h1/jhuang/DATA/Data_Tam_Acinetobacter_baumannii/A6_vrap_out/lighter -t 55 2>> /mnt/h1/jhuang/DATA/Data_Tam_Acinetobacter_baumannii/A6_vrap_out/vrap.log
#START flash
#/home/jhuang/Tools/vrap/external_tools/FLASH-1.2.11/flash /mnt/h1/jhuang/DATA/Data_Tam_Acinetobacter_baumannii/A6_vrap_out/lighter/A6_WT_HQ_trimmed_P_1.cor.fq.gz /mnt/h1/jhuang/DATA/Data_Tam_Acinetobacter_baumannii/A6_vrap_out/lighter/A6_WT_HQ_trimmed_P_2.cor.fq.gz -o flash -d /mnt/h1/jhuang/DATA/Data_Tam_Acinetobacter_baumannii/A6_vrap_out/flash -M 1000 >> /mnt/h1/jhuang/DATA/Data_Tam_Acinetobacter_baumannii/A6_vrap_out/vrap.log
#START spades

vrap.py (code)

#!/usr/bin/env python3

from orf import ORF
from optparse import OptionParser
import os
import glob
from Bio import SeqIO
from pathlib import Path
# see blast.py
# .ncbirc --> BLASTDB=/media/jhuang/Titisee/GAMOLA2/Blast_db/ --> /mnt/Samsung_T5/Blast_db/
from blast import Blast
from hmmer import HMMER
from bowtie import Bowtie
# download_db.py only for txid2697049 [Severe acute respiratory syndrome coronavirus 2]
from download_db import Database
import glob
import logging
from math import exp
from scipy.optimize import brentq

#old = "_2010"
old = ""

class VrAP:
    """ VrAP """

    def __init__ (self):
        """ Class initialiser """
        self.version = "1.0.5"
        option = self.option_parser()
        self.path = os.path.dirname(os.path.realpath(__file__))

        self.paired_end = True
        self.p1  = option.p1
        self.p2  = option.p2
        self.single = ""
        self.cor1 = self.p1
        self.cor2 = self.p2
        self.cpu = option.cpu
        self.contigs = ""
        self.contig_length = option.clength
        self.host = option.host
        self.virus = option.virus
        self.nt = option.nt
        self.nr = option.nr
        self.bt2idx = option.bt2idx
        self.update = option.update
        self.gbt2 = option.gbowtie
        self.pre = option.pre
        self.assembly = option.assembly
        self.noblast = option.noblast

        if option.out!=None:
            self.out = os.path.abspath(option.out)
        else:
            self.out = "./"

        if self.update:
            self.update_database()

        if self.cpu==None:
            self.cpu = 1

        if self.p2==None:
            self.single = self.p1
            self.paired_end = False

        self.create_output_dir()

        open("{}/vrap.log".format(self.out),"w").close()
        self.log("START VrAP")

        with open("{}/vrap.log".format(self.out),"a") as log_opt:
            for opt, value in option.__dict__.items():
                log_opt.write("{}:\t{}\n".format(opt,value))

        if self.pre:
            self.log("START only read correction and assembly!")

        self.check_input()

    def check_input (self):
        """ check input """

        bt2idx = self.bt2idx
        if self.bt2idx!=None:
            bt2idx = self.bt2idx+".1.bt2"

        input_files = [self.p1,self.p2,self.host,self.virus,bt2idx]

        for f in input_files:
            #print(f)
            if f != None:
                if not os.path.exists(f):

                    print("{} don't exists\nPlease check input files\nVrAP quit\n".format(f))
                    quit()

        if self.p1 == None:
            self.parser.print_help()
            print("-1/--p1 is mandatory\nPlease check input file\nVrAP quit\n")
            quit()

        #EXPLANATION: in total, mapping to 4 dbs: viral_nucleotide, viral_protein, nt, nr
        dbs = ["{path}/database/viral_db/viral_nucleotide".format(path=self.path),"{path}/database/viral_db/viral_protein".format(path=self.path),self.nr,self.nt]
        status = []
        self.log("START checking blast db")

        for db in dbs:
            x = os.system("{path}/external_tools/blast/blastdbcmd -db {db} -info >> {out}/vrap.log 2>> {out}/vrap.log".format(path=self.path,db=db,out=self.out))
            status.append(x)

        #COMMENTED
        if status[0] != 0 or status[1] != 0:
            self.update_database()

        if self.nr != None and status[2] != 0:
            print("No alias or index file found for {}".format(self.nr))
            quit()

        if self.nt != None and status[3] != 0:
            print("No alias or index file found for {}".format(self.nt))
            quit()

    def log (self,message,err=False):
        """ log bowtie run """
        print(message)
        LOG_FILENAME = "{}/vrap.log".format(self.out)
        FORMAT = '\n[VrAP] %(asctime)-15s %(message)s\n'

        logging.basicConfig(filename=LOG_FILENAME,format=FORMAT,level=logging.DEBUG)
        if err:
            logging.exception(message)
        else:
            logging.info(message)
        #logging.exception()

    def option_parser(self):
        """ Read input options """
        usage = "usage: %prog -1 [FILE] [options]"
        parser = OptionParser(usage=usage,version="%prog {}".format(self.version))

        parser.add_option("-1","--p1", dest="p1",help="first paired end file" , metavar="FILE")
        parser.add_option("-2","--p2", dest="p2",help="second paired end file", metavar="FILE")
        parser.add_option("-t","--threads", dest="cpu",help="number of threads to use", metavar="INT",type="int",default=1)
        parser.add_option("-r","--host", dest="host",help="reference host sequences", metavar="FASTA")
        parser.add_option("-v","--virus", dest="virus",help="reference viral sequences", metavar="FASTA")
        parser.add_option("-n","--nt", dest="nt",help="nucleotide blast database (e.g. nt)", metavar="DATABASE")
        parser.add_option("-a","--nr", dest="nr",help="protein blast database (e.g. nr)", metavar="DATABASE")
        parser.add_option("-o","--outdir", dest="out",help="output directory", metavar="DIR")
        parser.add_option("-b","--bt2idx", dest="bt2idx",help="bowtie2 index", metavar="FILE")
        parser.add_option("-l","--clength", dest="clength",help="minimum contig length", metavar="INT",default=500)
        parser.add_option("-u","--update", dest="update",help="update viral database",  default=False, action="store_true")
        parser.add_option("-g","--gbt2", dest="gbowtie",help="use global PATH of bowtie2",  default=False, action="store_true")
        parser.add_option("-p","--pre", dest="pre",help="read correction only",  default=False, action="store_true")
        parser.add_option("-q","--assembly", dest="assembly",help="assembly only",  default=False, action="store_true")
        parser.add_option("-k","--noblast", dest="noblast",help="skip blast search",  default=False, action="store_true")
        self.parser = parser
        (options, args) = parser.parse_args()

        return options

    def create_output_dir(self):
        """ Create output directory """
        if not os.path.exists(self.out):
            os.makedirs(self.out)

    def EMfit2(self,F0, f1, F1, k):
        epstol = 1e-8
        x = F1 / F0
        e = 0
        e0 = 1
        x0 = 0
        maxit = 1000
        it = 0
        while abs(x - x0) / x + abs(e - e0) > epstol and it < maxit:
            it = it + 1
            #print x, e, abs(x-x0)/x
            x0 = x
            e0 = e
            x1 = -100
            while abs(x - x1) / x > epstol / 2:
                x1 = x
                x  = F1 / F0 * (3 * k * (1 - exp(-x * e / (3 * k))) + 1 - exp(-x * (1 - e)))
            def func(t):
                return f1 / F1 - (t * exp(-x * t / (3 * k)) + (1 - t) * exp(-x * (1 - t)))
            print(func(0))
            print(func(1.2))
            e = brentq(func, 0, 1.1)
        return x, e

    def genome_size (self):
        """ estimate genome size """
        self.log("START estimate k-mer size")

        lighter_path = "{}/lighter".format(self.out)
        if not os.path.exists(lighter_path):
            os.makedirs(lighter_path)

        if self.paired_end:
            input_data = "{} {}".format(os.path.abspath(self.p1),os.path.abspath(self.p2),self.out)
        else:
            input_data = "{}".format(os.path.abspath(self.p1))

        command = "{path}/external_tools/KmerStream-master/KmerStream --tsv -k 20 -t {cpu} {input_data} -o {lighter}/kmer  2>> {out}/vrap.log".format(path=self.path,lighter=lighter_path,cpu=self.cpu,out=self.out,input_data=input_data)

        os.system(command)
        size = 30000

        try:
            kmer = "{lighter}/kmer".format(lighter=lighter_path)
            if os.path.exists(kmer):
                with open(kmer) as f:
                    head = next(f)
                    value = next(f).split()

                value = [int(i) for i in value] #list comprehension
                size = value[2]-value[3]
        except Exception as e:
            vrap.log(e,True)

        self.log("k-mer size: {}".format(size))

        return size

    def lighter(self):
        """ Run ligther for read correction """
        self.log("START lighter")

        if self.paired_end:
            input_data = "-r {} -r {}".format(self.p1,self.p2)
        else:
            input_data = "-r {}".format(self.p1)

        size = self.genome_size()
        if size<30000:
            size = 30000
        #MODIFIED, since lighter in external_tool doesn't work.
        #{path}/external_tools/Lighter-master/lighter
        command = "/home/jhuang/anaconda3/bin/lighter {input} -K 20 {size} -od {out}/lighter -t {cpu} 2>> {out}/vrap.log".format(path=self.path,input=input_data,out=self.out,cpu=self.cpu,size=size)
        print(command)
        os.system(command)

        if self.paired_end:
            b1 = os.path.basename(self.p1).split(".")[0]
            self.cor1 = glob.glob("{}/lighter/{}*cor*".format(self.out,b1))[0]
            b2 = os.path.basename(self.p2).split(".")[0]
            self.cor2 = glob.glob("{}/lighter/{}*cor*".format(self.out,b2))[0]
        else:
            b1 = os.path.basename(self.single).split(".")[0]
            self.single = glob.glob("{}/lighter/{}*cor*".format(self.out,b1))[0]

    def flash (self):
        """ Run flash for paired end reads """
        self.log("START flash")

        if self.paired_end:
            command = "{path}/external_tools/FLASH-1.2.11/flash {cor1} {cor2} -o flash -d {out}/flash -M 1000 >> {out}/vrap.log".format(path=self.path,cor1=self.cor1,cor2=self.cor2,out=self.out)
            print(command)
            os.system(command)

            self.cor1   = "{}/flash/flash.notCombined_1.fastq".format(self.out)
            self.cor2   = "{}/flash/flash.notCombined_2.fastq".format(self.out)
            self.single = "{}/flash/flash.extendedFrags.fastq".format(self.out)

    def bowtie (self):
        """ Map reads to reference sequence """
        bowtie2 = Bowtie(self.gbt2,self.out,self.cpu)

        bowtie_path = "{}/bowtie".format(self.out)
        if not os.path.exists(bowtie_path):
            os.makedirs(bowtie_path)

        if self.bt2idx==None and self.host!=None:
            self.log("START bowtie2-build")
            self.bt2idx = "{}/bowtie/host".format(self.out)
            bowtie2.build(self.host)

        if self.bt2idx != None:
            if self.paired_end:
                self.log("START bowtie2 pe")
                bowtie2.pe(self.cor1,self.cor2,self.single,self.bt2idx)
                self.cor1 = "{}/bowtie/bowtie.un.1.fastq".format(self.out)
                self.cor2 = "{}/bowtie/bowtie.un.2.fastq".format(self.out)
                self.single = "{}/bowtie/bowtie.un.fastq".format(self.out)
            else:
                self.log("START bowtie2 se")
                bowtie2.se(self.single,self.bt2idx)
                self.single = "{}/bowtie/bowtie.un.fastq".format(self.out)

    def update_database (self):
        """ Update viral DB"""
        #self.log("START update viral DB")
        #EXPLANATION: maybe it needs to DELETE the directory viral_db.
        #typs = ["nucleotide"]
        #typs = ["protein"]
        typs = ["nucleotide","protein"]
        #path = os.path.dirname(os.path.realpath(__file__))
        db_path = "{}/database/viral_db/".format(self.path)

        if not os.path.exists(db_path):
                os.makedirs(db_path)

        for typ in typs:
            self.log("START update viral {} DB".format(typ))
            database = Database(typ,db_path,self.cpu)

            database.download_id()
            database.start_download()

            blast = Blast(self.cpu, self.out)
            db_type = "nucl"
            if "protein" in typ:
                db_type = "prot"

            self.log("\nSTART {} makeblastdb".format(typ))
            blast.makeblastdb(db_path+typ+".fa", "{}viral_{}".format(db_path,typ), db_type)
            #print(db_path+typ+".fa", "{}viral_{}".format(db_path,typ), db_type)
            #/home/jhuang/Tools/vrap/database/viral_db/nucleotide.fa /home/jhuang/Tools/vrap/database/viral_db/viral_nucleotide nucl
            #MANUAL_RUNNING
            #V5 didn't suitable
            #/home/jhuang/DATA/RNAHiSwitch/ncbi-blast-2.11.0+/bin/makeblastdb -in nuleotide.fa -dbtype nucl
            #/home/jhuang/DATA/RNAHiSwitch/ncbi-blast-2.11.0+/bin/makeblastdb -in protein.fa -dbtype prot
            #USING_V4
            #https://www.biostars.org/p/200742/
            #vrap/external_tools/blast/makeblastdb -in nucleotide.fa -dbtype nucl -parse_seqids -out virus_nucleotide  #-taxid_map nucl_gb.accession2taxid 
            #vrap/external_tools/blast/makeblastdb -in protein.fa -dbtype prot -parse_seqids -out virus_protein     #-taxid_map nucl_gb.accession2taxid
            #search_term="txid2697049[Organism:exp] NOT txid2[Organism:exp] NOT txid2759[Organism:exp] NOT txid2157[Organism:exp] {}".format(date)
            #virus_search_term="txid10239[Organism:exp] NOT txid2[Organism:exp] NOT txid2759[Organism:exp] NOT txid2157[Organism:exp] {}".format(date)
            #txid2=Bacteria, txid2759=Eukaryota, txid2157=Archaea, txid10239=Viruses, txid2697049=Severe acute respiratory syndrome coronavirus 2

    def spades (self):
        """ Run spades """
        self.log("START spades")
        paired = ""

        if self.paired_end:
            paired  = "-1 {} -2 {}".format(self.cor1,self.cor2)

        k       = "-k 33,55,77,99,127"
        options = "--cov-cutoff off --only-assembler --careful"
        #options = "--cov-cutoff off --only-assembler"

        single  = self.single
        if len(self.single)>0:
            single  = "--s1 {}".format(self.single)

        cpu     = "-t {}".format(self.cpu)
        output  = "-o {}/spades".format(self.out)
        os.system("{path}/external_tools/SPAdes-3.11.1-Linux/bin/spades.py {paired} {single} {k} {options} {cpu} {output} >> {out}/vrap.log".format(path=self.path, paired=paired, single=single, k=k, options=options, cpu=cpu, output=output,out=self.out))

        #for assembly only
        self.contigs = "{}/spades/contigs.fasta".format(self.out)
        self.exclude = set()
        self.scaffold = {}

    def cap3 (self):
        """ Runs cap3 """
        self.log("START cap3")
        self.contigs = "{}/spades/contigs.fasta".format(self.out)
        cap3_out = "{}/cap3".format(self.out)
        #print(cap3_out)
        if not os.path.exists(cap3_out):
            os.makedirs(cap3_out)

        os.system("{}/external_tools/cap3 '{}' -y 100 >{}/cap3/cap3.out".format(self.path,self.contigs,self.out))

        b = False
        i = 0
        consensi = {}
        consensus = ""
        group = {}
        name = ""
        exclude = set()

        with open("{}/cap3/cap3.out".format(self.out)) as f:
            for line in f:

                if not b:

                    if "*******************" in line:
                        name = line.split("* ")[1].split(" *")[0]
                        #print(name)
                        group[name] = set()

                    if "NODE" in line:
                        ary = line.split()
                        for n in ary:
                            if "NODE" in n:
                                tmp = n.replace("+","").replace("-","")
                                group[name].add(tmp)
                                exclude.add(tmp)

                if "DETAILED" in line:
                    b = True

                if b:

                    if "*******************" in line:
                        name = line.split("* ")[1].split(" *")[0]
                        #print(name)

                    ary = line.split()
                    #print(ary)
                    if len(ary) > 0:

                        if "consensus" in line:
                            seq = ary[1]
                            tmp = consensi.get(name,"")
                            tmp+=seq.replace("-","N")
                            consensi[name] = tmp

        self.scaffold = consensi
        self.exclude = exclude

    def clean_output (self):
        """ Clean spades output and include scaffolds """

        fasta_sequences = SeqIO.parse(open(self.contigs),'fasta')
        self.result = "{}/vrap_contig.fasta".format(self.out)
        output = open(self.result,"w")

        for fasta in fasta_sequences:
            if fasta.name not in self.exclude and len(fasta.seq)>=self.contig_length:
                output.write(">{}\n{}\n".format(fasta.name,fasta.seq))
        i = 1
        for name,seq in self.scaffold.items():
            if len(seq)>=self.contig_length:
                output.write(">CAP_{}_length_{}\n{}\n".format(i,len(seq),seq))
                i+=1
        output.close()

    def calculate_orf_density (self):
        """ calculate orf density """
        self.log("START calculating orf density")
        self.result = "{}/vrap_contig.fasta".format(self.out)
        orf = ORF(self.result,self.out)
        self.orf_density = orf.orf_density()

    def label_as_virus (self):
        """ Add virus tag to name """
        out_name = "{}/blast/custom_viral_seq.fa".format(self.out)
        output = open(out_name,"w")

        with open(self.virus, "r") as handle:
            for record in SeqIO.parse(handle, "fasta"):
                output.write(">{}\n{}\n".format(record.name+" [virus_user_db]",record.seq))

        output.close()

        return out_name

    def blast (self):
        """ blast """
        self.log("START blast")
        self.result = "{}/vrap_contig.fasta".format(self.out)

        blast_path = "{}/blast".format(self.out)
        blast = Blast(self.cpu, self.out)

        blast.read_fasta(self.result)

        #EXPLANATION: all virus and host-databases are saved in nt_dbs
        #             blast performed only after assembly(self.result=vrap_contig.fasta) against host+virus --> 
        #             since the host-reads are not deleted before the assembly, 
        #             vrap_contigs.fasta should contain contigs of host and virus
        #sort blast/blastn.csv and blast/blastn.fa
        nt_dbs = "{}/database/viral_db{}/viral_nucleotide".format(self.path,old)
        e_val = "1e-4"
        max_target = 1

        if not os.path.exists(blast_path):
            os.makedirs(blast_path)

        if not os.path.exists(blast_path+"/db"):
            os.makedirs(blast_path+"/db")

        #EXPLANATION
        #if the option --virus is set, make the given fasta blast-format and add it to vrap_out/blast/db/virus 
        #NO record added! nt_dbs should keep only 1 record!
        if self.virus != None:
            custom_viral_seq = self.label_as_virus()
            blast.makeblastdb(custom_viral_seq,"{}/blast/db/virus".format(self.out),"nucl")
            nt_dbs += " {}/blast/db/virus".format(self.out)
        #MODIFIED, add the self.host causing error. COMMENTED them!
        #EXPLANATION: if the option --host is set, make the given fasta blast-format and add it to vrap_out/blast/db/host
        #There are in total 3 nt_dbs, 1 prot_db, +[nt][nr], and 1 bowtie database:
        # nt_dbs /home/jhuang/Tools/vrap/database/viral_db/viral_nucleotide, ./blast/db/virus, ./blast/db/host
        # prot_db /home/jhuang/Tools/vrap/database/viral_db/viral_protein
        # bowtie_database: ./bowtie/host for substraction host-reads
        #DEBUG: /home/jhuang/Tools/vrap/external_tools/blast/blastn -num_threads 4 -query /home/jhuang/DATA/Data_Becher_Picornavirus_BovHepV/F21_1400_Liquor_vrap_out/vrap_contig.fasta -db "/home/jhuang/Tools/vrap/database/viral_db/viral_nucleotide /home/jhuang/DATA/Data_Becher_Picornavirus_BovHepV/F21_1400_Liquor_vrap_out/blast/db/host"  -evalue 1e-4 -outfmt "6 qseqid qstart qend sstart send evalue length pident sseqid stitle qcovs qcovhsp sacc slen qlen" -max_target_seqs 1 -out /home/jhuang/DATA/Data_Becher_Picornavirus_BovHepV/F21_1400_Liquor_vrap_out/blast/blastn.csv
        #--> it works!
        #Warning: [blastn] Examining 5 or more matches is recommended
        #grep -v "scrofa" blastn.csv > blastn_.csv;
        #DEBUG: BLAST Database creation error: Error: Duplicate seq_ids are found: GB|MF170920.1
        #if self.host != None:
        #   blast.makeblastdb(self.host,"{}/blast/db/host".format(self.out),"nucl")
        #   nt_dbs += " {}/blast/db/host".format(self.out)

        #EXPLANATION: blastn.csv is empty means blastn finds nonthing!
        ##nucleotide db search
        blast_out = "{}/blast/blastn.csv".format(self.out)
        blast.blastn(self.result, nt_dbs, e_val, max_target, blast_out)
        no_match_fa = "{}/blast/blastn.fa".format(self.out)
        no_results_found = blast.no_match(blast_out,no_match_fa)

        #EXPLANATION: blastx -num_threads 14 -query /home/jhuang/DATA/Data_Becher_Picornavirus_BovHepV/F21_1399_1400_Serum_vrap_out/blast/blastn.fa -db /home/jhuang/Tools/vrap/database/viral_db/viral_protein -evalue 1e-6 -outfmt 6 qseqid qstart qend sstart send evalue le+ 
        ##protein db search
        if no_results_found > 0:
            e_val = "1e-6"
            blast_out = "{}/blast/blastx.csv".format(self.out)
            prot_db = "{}/database/viral_db{}/viral_protein".format(self.path,old)
            blast.blastx(no_match_fa, prot_db, e_val, max_target, blast_out)
            no_match_fa = "{}/blast/blastx.fa".format(self.out)
            no_results_found = blast.no_match(blast_out,no_match_fa)

        #EXPLANATION: since self.nt and self.nr are not given, the following code won't run!
        if self.nt != None and no_results_found > 0:
            e_val = "1e-4"
            blast_out = "{}/blast/blastn_nt.csv".format(self.out)
            blast.blastn(no_match_fa, self.nt, e_val, max_target, blast_out)
            no_match_fa = "{}/blast/blastn_nt.fa".format(self.out)
            no_results_found = blast.no_match(blast_out,no_match_fa)

        if self.nr != None and no_results_found > 0:
            e_val = "1e-6"
            blast_out = "{}/blast/blastx_nr.csv".format(self.out)
            blast.blastx(no_match_fa, self.nr, e_val, max_target, blast_out)
            no_match_fa = "{}/blast/blastx_nr.fa".format(self.out)
            no_results_found = blast.no_match(blast_out,no_match_fa)

    def hmmer (self):
        """ hidden markov modell search """
        self.log("START HMMER")
        hmmer = HMMER(self.cpu, self.out)
        hmmer.run_hmmer()

    def summary (self):
        """ summarize output """

        if len(self.orf_density)>0:
            blast = Blast(self.cpu, self.out)
            self.blast_results = blast.summary(self.out)

            hmmer = HMMER(self.cpu, self.out)
            hmmer_output = hmmer.read_hmmer_output()

        with open("{}/vrap_summary{}.csv".format(self.out,old),"w") as f:
            f.write("name;qleng;orf_d;hmmer_eval;hmm_model;virus?;ident;qcov;tcov;tlength;tid;tname;mean_eval\n")
            for name,orf_d in self.orf_density.items():
                results = {"qlength":0,"tid":"-","tname":"-","tlength":0,"qcovarage":0,"ident":0,"tcovarage":0,"eval":1}

                if name in self.blast_results:
                    results = self.blast_results[name]

                hmmer = 1
                hmm_model = ""

                if name in hmmer_output:
                    hmmer,hmm_model = max(hmmer_output[name],key=lambda x:x[0])

                qleng = int(name.split("_")[3])
                virus = "host"

                if "virus" in results["tname"] or "phage" in results["tname"]:
                    virus = "virus"
                else:
                    if (len(results["tname"])<2):
                        if ((orf_d>0.9 and qleng <=1000) or (orf_d>0.7 and qleng >1000)):
                            virus = "potential virus"
                        else:
                            virus = "-"

                    if hmmer<=1e-5:
                        virus = "virus"

                tmp ="{0};{1:d};{2:.2f};{3:.2e};{4};{5};{6:.2f};{7:.2f};{8:.2f};{9:d};{10};{11};{12:.2e}\n".format(name,qleng,orf_d,hmmer,hmm_model,virus,results["ident"],results["qcovarage"],results["tcovarage"],results["tlength"],results["tid"],results["tname"].replace(";",","),results["eval"])
                f.write(tmp)

vrap = VrAP()

try:
    if not vrap.assembly:
        vrap.lighter()
        vrap.flash()

    if not vrap.pre:
        if not vrap.assembly:
            vrap.bowtie()

        vrap.spades()

        if not vrap.assembly:
            vrap.cap3()

        vrap.clean_output()

        vrap.calculate_orf_density()

        if len(vrap.orf_density)>0:
            vrap.hmmer()
            if not vrap.noblast:
                vrap.blast()

        vrap.summary()

    vrap.log("VrAP pipeline finished.\nThank you for using VrAP!")
except Exception as e:
    vrap.log(e,True)

Using vrap and bacto to process Acinetobacter baumannii isolates

Leave a reply

input data

#Deciphering Evolutionary Trajectories in Acinetobacter baumannii: A Comparative Study of Isolates ATCC 19606, AYE, and ATCC 17978

#PRJNA1041744

A6 WT - Acinetobacter baumannii ATCC 19606, CP059040
A10 CraA - Acinetobacter baumannii ATCC 19606, CP059040
BIT33_RS17375: adeJ
BIT33_RS17370: adeI
BIT33_RS11855: craA (MKNIQTTALNRTTLMFPLALVLFEFAVYIGNDLIQPAMLAITED
             FGVSATWAPSSMSFYLLGGASVAWLLGPLSDRLGRKKVLLSGVLFFALCCFLILLTRQ
             IEHFLTLRFLQGIGLSVISAVGYAAIQENFAERDAIKVMALMANISLLAPLLGPVLGA
             FLIDYVSWHWGFVAIALLALLSWVGLKKQMPSHKVSVTKQPFSYLFDDFKKVFSNRQF
             LGLTLALPLVGMPLMLWIALSPIILVDELKLTSVQYGLAQFPVFLGLIVGNIVLIKII
             DRLALGKTVLIGLPIMLTGTLILILGVVWQAYLIPCLLIGMTLICFGEGISFSVLYRF
             ALMSSEVSKGTVAAAVSMLLMTSFFAMIELVRYLYTQFHLWAFVLSAFAFIALWFTQP
             RLALKREMQERVAQDLH) MFS transporter --> H0N29_01695
364663 - 365770

A12 AYE - Acinetobacter baumannii AYE,         CU459141
A19 17978 - Acinetobacter baumannii ATCC 17978  CP033110
A1S_2736: adeJ
A1S_2735: adeI
A1S_3146: craA       

mv A6WT_HQ_R1.fq.gz A6_WT_HQ_R1.fastq.gz 
mv A6WT_HQ_R2.fq.gz A6_WT_HQ_R2.fastq.gz 
mv A10CraA_HQ_R1.fq.gz A10_CraA_HQ_R1.fastq.gz 
mv A10CraA_HQ_R2.fq.gz A10_CraA_HQ_R2.fastq.gz 
mv A12AYE_HQ_R1.fq.gz A12_AYE_HQ_R1.fastq.gz 
mv A12AYE_HQ_R2.fq.gz A12_AYE_HQ_R2.fastq.gz 
mv A1917978_HQ_R1.fq.gz A19_17978_HQ_R1.fastq.gz 
mv A1917978_HQ_R2.fq.gz A19_17978_HQ_R2.fastq.gz

#Nucleotide Accession Prefixes
#AP,BS                                                      DDBJ           Genome project data
#AL,BX,CR,CT,CU,FP,FQ,FR                                    EMBL           Genome project data
#AE,CP,CY                                                   GenBank        Genome project data

#transposon detection using gubbins

#TODOs: Submit A6, A12 and A19 to NCBI, A10 not, since the assembly is very bad! Could do mapping plot of A10 on A6, and three genebank-files to Tam!     
#Wuen Ee Foong
#Jiabin Huang
#Heng-Keat Tam, Hengyang Medical College, University of South China
#pA12AYE1

prokka

for sample in A19_17978_HQ; do
prokka --force --outdir prokka/${sample} --cpus 2 --usegenus --genus Acinetobacter --kingdom Bacteria --species baumannii --addgenes --addmrna --prefix ${sample} --locustag ${sample} shovill/${sample}/contigs.fa -hmm /media/jhuang/Titisee/GAMOLA2/TIGRfam_db/TIGRFAMs_15.0_HMM.LIB
done

(prokka on titisee) for the remaining steps (f,f,f,f,f,t,f(variants_calling),t,t,t in bacto-0.1.json)

#TODO: annotate the SNPs with antimicrobial gene database!
#TODO: ask Qianjie die sample are directly gotten from the patient, oder did some motification on the reference genome?

3.1. vrap (TODO: deploy vrap on xgenes.com, it does not use the database. under (vrap) jhuang@hamm[hamburg])

    #In summary: --(optional) host --> bowtie/host (bowtie_database for host-substraction)
    #            --(optional) virus --> blast/db/virus (blast1, in nt_dbs, custom blastn database before downloaded_db!)
    #            --(always) defined in download_db.py (setting by modifying the file) and run default --> database/viral_db/viral_nucleotide (blast2, in nt_dbs)
    #                                                                                                 --> database/viral_db/viral_protein (blast3, in prot_db)
    #            --(optional) nt+nr (blast4 and blast5)

    #Under (vrap) jhuang@hamm or (bengal3_ac3) jhuang@hamburg
    #DEBUG: {path}/external_tools/Lighter-master/lighter-->/home/jhuang/miniconda3/envs/vrap/bin/lighter
    #DEBUG: {path}/external_tools/SPAdes-3.11.1-Linux/bin/spades.py-->/home/jhuang/miniconda3/envs/vrap/bin/spades.py

    #gi|1690449040|gb|CP029454.1|    Bacillus cereus strain FORC087 chromosome, complete genome
    #gi|1621560499|gb|CP039269.1|    Bacillus cereus strain MH19 chromosome, complete genome
    #gi|1783154430|gb|CP040340.1|    Bacillus cereus strain DLOU-Tangshan chromosome, complete genome
    #CP059040        3980852 10      70      71
    #CU459141        3936291 4037743 70      71
    #CP033110        3902037 8030278 70      71

    #CP059040 --> A6 and A10
    #CU459141 --> A12 
    #CP033110 --> A19

    #Command line: /home/jhuang/miniconda3/envs/bengal3_ac3/bin/spades.py -1 A6_WT_HQ_R1.fastq.gz -2 A6_WT_HQ_R2.fastq.gz --careful --trusted-contigs CP059040.fasta -o spades_A6_ATCC19606

    #CP059040_RagTag 3866625 17      3866625 3866626
    #CAP_1_length_17300      17300   3866663 17300   17301
    #CAP_2_length_928        928     3883982 928     929
    #NODE_17_length_11061_cov_141.020669     11061   3884948 11061   11062
    #NODE_19_length_1654_cov_105.893255      1654    3896046 1654    1655
    #NODE_20_length_1102_cov_153.073846      1102    3897737 1102    1103
    #NODE_21_length_1040_cov_172.078861      1040    3898876 1040    1041

    #NOTE that we need to adjust the taxid in download_db.py to download viral_nucleotide and viral_protein before running vrap/vrap.py.

    #A10　
    #Blast with 5 DBs when --virus (blast/db/virus), viral_nucleotide (download_db.rb), viral_protein (download_db.rb), nt, nr
    #Bowtie2 with 1 DB when --host (bowtie/host)
    #NOTE the fasta files given with --host and --virus should be deduplicated!
    #NOTE that in vrap_CP059040.py the trusted-contigs is used "--trusted-contigs /mnt/h1/jhuang/DATA/Data_Tam_Acinetobacter_baumannii/CP074710.fasta"
    vrap/vrap_CP059040.py -1 ./results_ATCC19606/trimmed/A10_CraA_HQ_trimmed_P_1.fastq.gz -2 ./results_ATCC19606/trimmed/A10_CraA_HQ_trimmed_P_2.fastq.gz -o A10_vrap_out --host Bacillus_cereus.fasta --virus=Acinetobacter_baumannii.fasta   -n /mnt/h1/jhuang/blast/nt -a /mnt/h1/jhuang/blast/nr -t 55

    #A6
    vrap/vrap_CP059040.py -1 ./results_ATCC19606/trimmed/A6_WT_HQ_trimmed_P_1.fastq.gz -2 ./results_ATCC19606/trimmed/A6_WT_HQ_trimmed_P_2.fastq.gz -o A6_vrap_out  -n /mnt/h1/jhuang/blast/nt -a /mnt/h1/jhuang/blast/nr -t 55

    #A12
    vrap/vrap_CU459141.py -1 ./results_AYE/trimmed/A12_AYE_HQ_trimmed_P_1.fastq.gz -2 ./results_AYE/trimmed/A12_AYE_HQ_trimmed_P_2.fastq.gz -o A12_vrap_out  -n /mnt/h1/jhuang/blast/nt -a /mnt/h1/jhuang/blast/nr -t 55

    #A19 
    vrap/vrap_CP033110.py -1 ./results_ATCC17978/trimmed/A19_17978_HQ_trimmed_P_1.fastq.gz -2 ./results_ATCC17978/trimmed/A19_17978_HQ_trimmed_P_2.fastq.gz -o A19_vrap_out  -n /mnt/h1/jhuang/blast/nt -a /mnt/h1/jhuang/blast/nr -t 55

    # -- A10 LOG --
    START VrAP
    START checking blast db
    START lighter
    START estimate k-mer size
    k-mer size: 15037634
    /home/jhuang/miniconda3/envs/vrap/bin/lighter -r A10_CraA_HQ_trimmed_P_1.fastq.gz -r A10_CraA_HQ_trimmed_P_2.fastq.gz -K 20 15037634 -od A10_vrap_out/lighter -t 55
    START flash
    /home/jhuang/Tools/vrap/external_tools/FLASH-1.2.11/flash A10_vrap_out/lighter/A10_CraA_HQ_trimmed_P_1.cor.fq.gz A10_vrap_out/lighter/A10_CraA_HQ_trimmed_P_2.cor.fq.gz -o flash -d A10_vrap_out/flash -M 1000

    START bowtie2-build
    START bowtie2 pe
    #MAPPING to 1 DB: bowtie/host

    START spades    
    START cap3
    #/home/jhuang/Tools/vrap/external_tools/cap3 A10_vrap_out/spades/contigs.fasta -y 100  
    START calculating orf density
    START HMMER

    START blast
    #grep ">" A10_vrap_out/blast/custom_viral_seq.fa == "--virus Acinetobacter_baumannii.fasta" 
    #MAPPING to 5 DBs: blast/db/virus[1], viral_nucleotide (defined in the download_db.py)[2], viral_protein (defined in the download_db.py)[3], nt[4], nr[5]
    #INPUT_SEQ: 
    #(Assuming, if annotated with the previous database, will not translate to the next level). For the example below: we have no record finding in [2,1], then we have 1317-0 in blast/blastn.fa; In [3] we find 'cut -f1 blastx.csv | sort -u | wc -l = 567', then we have 1317-567=750; In [4], we 'cut -f1 blastn_nt.csv | sort -u | wc -l=710', then we have 750-710=40 in blast/blastn_nt.fa; In [5], we find 'cut -f1 blastn_nr.csv | sort -u | wc -l=39', then we remain 40-39=1 record in blastx_nr.fa.
    #grep ">" vrap_contig.fasta | wc -l   #1317 
    #grep ">" blast/blastn.fa | wc -l     #1317
    #grep ">" blast/blastx.fa | wc -l     #750
    #grep ">" blast/blastn_nt.fa | wc -l  #40
    #grep ">" blast/blastx_nr.fa | wc -l  #1
    #--
    #grep ">" vrap_contig.fasta | wc -l   #2370 
    #grep ">" blast/blastn.fa | wc -l     #2306 --> 64 contigs should belong to A. baumannii! can be annotated with --custom_virus (3 speicies of Acinetobacter baumannii) and downloaded [virus_nt_db] (2.6 G Acinetobacter baumannii)
    #grep ">" blast/blastx.fa | wc -l     #1245: 1061 contigs can be annotated with downloaded [virus_aa_db]
    #grep ">" blast/blastn_nt.fa | wc -l  #41: 1204 contigs can be annotated with nt
    #grep ">" blast/blastx_nr.fa | wc -l  #1: 40 contigs can be annotated with nr 
    #--> extract 64 contigs using 'python3 extract_unique_sequences.py A10_vrap_out/vrap_contig.fasta A10_vrap_out/blast/blastn.fa A10_extracted_contigs.fa'

    [1_INDEX]/home/jhuang/Tools/vrap/external_tools/blast/makeblastdb -in /mnt/h1/jhuang/DATA/Data_Tam_Acinetobacter_baumannii/A10_vrap_out/blast/custom_viral_seq.fa -dbtype nucl -parse_seqids -out /mnt/h1/jhuang/DATA/Data_Tam_Acinetobacter_baumannii/A10_vrap_out/blast/db/virus >> /mnt/h1/jhuang/DATA/Data_Tam_Acinetobacter_baumannii/A10_vrap_out/vrap.log 2>> /mnt/h1/jhuang/DATA/Data_Tam_Acinetobacter_baumannii/A10_vrap_out/vrap.log

    [2,1]/home/jhuang/Tools/vrap/external_tools/blast/blastn -num_threads 55 -query /mnt/h1/jhuang/DATA/Data_Tam_Acinetobacter_baumannii/A10_vrap_out/vrap_contig.fasta -db "/home/jhuang/Tools/vrap/database/viral_db/viral_nucleotide /mnt/h1/jhuang/DATA/Data_Tam_Acinetobacter_baumannii/A10_vrap_out/blast/db/virus"  -evalue 1e-4 -outfmt "6 qseqid qstart qend sstart send evalue length pident sseqid stitle qcovs qcovhsp sacc slen qlen" -max_target_seqs 1 -out /mnt/h1/jhuang/DATA/Data_Tam_Acinetobacter_baumannii/A10_vrap_out/blast/blastn.csv >> /mnt/h1/jhuang/DATA/Data_Tam_Acinetobacter_baumannii/A10_vrap_out/vrap.log

    [3]/home/jhuang/Tools/vrap/external_tools/blast/blastx -num_threads 55 -query /mnt/h1/jhuang/DATA/Data_Tam_Acinetobacter_baumannii/A10_vrap_out/blast/blastn.fa -db "/home/jhuang/Tools/vrap/database/viral_db/viral_protein"  -evalue 1e-6 -outfmt "6 qseqid qstart qend sstart send evalue length pident sseqid stitle qcovs qcovhsp sacc slen qlen" -max_target_seqs 1 -out /mnt/h1/jhuang/DATA/Data_Tam_Acinetobacter_baumannii/A10_vrap_out/blast/blastx.csv >> /mnt/h1/jhuang/DATA/Data_Tam_Acinetobacter_baumannii/A10_vrap_out/vrap.log

    [4]/home/jhuang/Tools/vrap/external_tools/blast/blastn -num_threads 55 -query /mnt/h1/jhuang/DATA/Data_Tam_Acinetobacter_baumannii/A10_vrap_out/blast/blastx.fa -db "/mnt/h1/jhuang/blast/nt"  -evalue 1e-4 -outfmt "6 qseqid qstart qend sstart send evalue length pident sseqid stitle qcovs qcovhsp sacc slen qlen" -max_target_seqs 1 -out /mnt/h1/jhuang/DATA/Data_Tam_Acinetobacter_baumannii/A10_vrap_out/blast/blastn_nt.csv >> /mnt/h1/jhuang/DATA/Data_Tam_Acinetobacter_baumannii/A10_vrap_out/vrap.log

    [5]/home/jhuang/Tools/vrap/external_tools/blast/blastx -num_threads 55 -query /mnt/h1/jhuang/DATA/Data_Tam_Acinetobacter_baumannii/A10_vrap_out/blast/blastn_nt.fa -db "/mnt/h1/jhuang/blast/nr"  -evalue 1e-6 -outfmt "6 qseqid qstart qend sstart send evalue length pident sseqid stitle qcovs qcovhsp sacc slen qlen" -max_target_seqs 1 -out /mnt/h1/jhuang/DATA/Data_Tam_Acinetobacter_baumannii/A10_vrap_out/blast/blastx_nr.csv >> /mnt/h1/jhuang/DATA/Data_Tam_Acinetobacter_baumannii/A10_vrap_out/vrap.log

    VrAP pipeline finished.
    Thank you for using VrAP!

3.2. ragtag.py ((bengal3_ac3) jhuang@hamm)

    #ragtag.py patch ragtag_output_scaffold/ragtag.scaffold.fasta CP059040.fasta

    ragtag.py scaffold CP059040.fasta A6_vrap_out/vrap_contig.fasta
        Acinetobacter baumannii strain ATCC 19606 plasmid unnamed, complete sequence    Acinetobacter baumannii     19653   39248   62%     0.0     99.97%  18598   CP074586.1

    python3 extract_unique_sequences.py vrap_contig.fasta blast/blastn.fa A10_extracted_contigs.fa
    ragtag.py scaffold CP059040.fasta A10_extracted_contigs.fa
    python3 filter_contigs.py A10_ragtag_output/ragtag.scaffold.fasta A10_ragtag_contigs_2000.fa
    # Check the converage to determine if they are chr or plasmids!

    ragtag.py scaffold CU459141.fasta A12_vrap_out/vrap_contig.fasta
        Acinetobacter baumannii str. AYE plasmid p2ABAYE, complete genome   Acinetobacter baumannii AYE     16835   18307   100%    0.0     99.99%  9661    CU459138.1
        Acinetobacter baumannii str. AYE plasmid p1ABAYE, complete genome   Acinetobacter baumannii AYE     9494    10856   100%    0.0     100.00%     5644    CU459137.1
        Acinetobacter baumannii str. AYE plasmid p4ABAYE, complete genome   Acinetobacter baumannii AYE     3709    5265    100%    0.0     99.95%  2726    CU459139.1
        Acinetobacter baumannii str. AYE plasmid p3ABAYE, complete genome   Acinetobacter baumannii AYE     1.384e+05   1.838e+05   100%    0.0     100.00%     94413   CU459140.1

    ragtag.py scaffold CP033110.fasta A19_vrap_out/vrap_contig.fasta
    #using CP074710.1 in the round2 (see below)
        Acinetobacter baumannii strain ATCC 17978 plasmid unnamed1, complete sequence   Acinetobacter baumannii     2.373e+05   2.463e+05   100%    0.0     100.00%     148956  CP074709.1
        Acinetobacter baumannii strain ATCC 17978 plasmid unnamed5, complete sequence   Acinetobacter baumannii     24842   50362   100%    0.0     100.00%     38118   CP074711.1
        Acinetobacter baumannii strain ATCC 17978 plasmid unnamed3, complete sequence   Acinetobacter baumannii     20953   37147   100%    0.0     100.00%     20351   CP074707.1
        Acinetobacter baumannii ATCC 17978 chromosome, complete genome  Acinetobacter baumannii ATCC 17978  13245   16527   100%    0.0     100.00%     4005343     CP053098.1

3.3 vrap round2

    #A10　
    vrap/vrap_A6_chr.py -1 ./results_ATCC19606/trimmed/A10_CraA_HQ_trimmed_P_1.fastq.gz -2 ./results_ATCC19606/trimmed/A10_CraA_HQ_trimmed_P_2.fastq.gz -o A10_vrap_out --host Bacillus_cereus.fasta --virus=Acinetobacter_baumannii.fasta   -n /mnt/h1/jhuang/blast/nt -a /mnt/h1/jhuang/blast/nr -t 55

    #A19 
    vrap/vrap_CP074710.py -1 ./results_ATCC17978/trimmed/A19_17978_HQ_trimmed_P_1.fastq.gz -2 ./results_ATCC17978/trimmed/A19_17978_HQ_trimmed_P_2.fastq.gz -o A19_vrap_out  -n /mnt/h1/jhuang/blast/nt -a /mnt/h1/jhuang/blast/nr -t 55

3.4 ragtag round2

    python3 extract_unique_sequences.py A10_vrap_out/vrap_contig.fasta A10_vrap_out/blast/blastn.fa A10_vrap_out/A10_extracted_contigs.fa
    ragtag.py scaffold A6_chr.fasta A10_vrap_out/A10_extracted_contigs.fa
    mv ragtag_output A10_ragtag_output
    python3 filter_contigs.py A10_ragtag_output/ragtag.scaffold.fasta A10_ragtag_contigs_2000.fa

    ragtag.py scaffold CP074710.fasta A19_vrap_out/vrap_contig.fasta
    Acinetobacter baumannii strain ATCC 17978 plasmid unnamed5, complete sequence   Acinetobacter baumannii     24842   50362   100%    0.0     100.00%     38118   CP074711.1
    Acinetobacter baumannii strain ATCC 17978 plasmid unnamed3, complete sequence   Acinetobacter baumannii     20953   37147   100%    0.0     100.00%     20351   CP074707.1

3.5 ragtag round3

    ragtag.py patch CP059040.fasta A6_vrap_out/vrap_contig.fasta
    ragtag.py scaffold CP059040.fasta A6_vrap_out/vrap_contig.fasta

    python3 extract_unique_sequences.py vrap_contig.fasta blast/blastn.fa A10_extracted_contigs.fa
    ragtag.py scaffold CP059040.fasta A10_vrap_out/A10_extracted_contigs.fa
    python3 filter_contigs.py A10_ragtag_output/ragtag.scaffold.fasta A10_ragtag_contigs_2000.fa
    # Check the converage to determine if they are chr or plasmids!
    #>CAP_12_length_2972
    #Bacillus paranthracis strain Bc006 chromosome, complete genome     Bacillus paranthracis   4691    5472    100%    0.0     99.73%  5304537     CP119629.1
    #Bacillus paranthracis strain Bc006 chromosome, complete genome     Bacillus paranthracis   9762    11557   97%     0.0     99.87%  5304537     CP119629.1
    #Bacillus paranthracis strain Bc006 chromosome, complete genome     Bacillus paranthracis   14883   1.353e+05   100%    0.0     99.98%  5304537     CP119629.1
    #CP119629.1 Submitted (10-MAR-2023) College of Animal Science, Ahau, Chagnjianxilu, Anhui 230000, China

    ragtag.py patch CU459141.fasta A12_vrap_out/vrap_contig.fasta
    ragtag.py scaffold CU459141.fasta A12_vrap_out/vrap_contig.fasta

    ragtag.py patch CP074710.fasta A19_vrap_out/vrap_contig.fasta
    ragtag.py scaffold CP074710.fasta A19_vrap_out/vrap_contig.fasta

    A10:
    CP059040_RagTag 3941335 17      60      61
    CAP_11_length_2972      2972    4007061 60      61
    CAP_1_length_17300      17300   4010103 60      61
    samtools faidx A10_ragtag_contigs_2000.fa CAP_1_length_17300 > A10_plasmid.fasta

3.6 To map reads from one isolate (isolate 2) to the genome of another isolate (isolate 1), check coverage, and generate a consensus FASTA from the alignment, with regions not covered by the mapping masked as ‘N’, you’ll need to follow a series of steps using bioinformatics tools.

    # -- Mapping to CP059040 --
    5447038 + 0 in total (QC-passed reads + QC-failed reads)
    5447038 + 0 primary
    0 + 0 secondary
    0 + 0 supplementary
    0 + 0 duplicates
    0 + 0 primary duplicates
    3346945 + 0 mapped (61.45% : N/A)
    3346945 + 0 primary mapped (61.45% : N/A)
    4361512 + 0 paired in sequencing
    2180756 + 0 read1
    2180756 + 0 read2
    2579038 + 0 properly paired (59.13% : N/A)
    2681114 + 0 with itself and mate mapped
    21988 + 0 singletons (0.50% : N/A)
    0 + 0 with mate mapped to a different chr
    0 + 0 with mate mapped to a different chr (mapQ>=5)

    # -- Mapping to A6 --
    5447044 + 0 in total (QC-passed reads + QC-failed reads)
    5447044 + 0 primary
    0 + 0 secondary
    0 + 0 supplementary
    0 + 0 duplicates
    0 + 0 primary duplicates
    3437574 + 0 mapped (63.11% : N/A)
    3437574 + 0 primary mapped (63.11% : N/A)
    4361524 + 0 paired in sequencing
    2180762 + 0 read1
    2180762 + 0 read2
    2646686 + 0 properly paired (60.68% : N/A)
    2753738 + 0 with itself and mate mapped
    22055 + 0 singletons (0.51% : N/A)
    1484 + 0 with mate mapped to a different chr
    1390 + 0 with mate mapped to a different chr (mapQ>=5)

    #- Sort and Index the BAM File:
    #    samtools sort -o sorted.bam input.bam
    #    samtools index sorted.bam
    #- Call Variants:
    #    bcftools mpileup -Ou -f reference.fasta sorted.bam | bcftools call -mv -Oz -o calls.vcf.gz
    #    bcftools index calls.vcf.gz
    #- Generate Consensus FASTA:
    #    bcftools consensus -f reference.fasta calls.vcf.gz -o consensus.fasta

    Step 1: Align Reads to Reference Genome
        First, align the reads from isolate 2 to the genome of isolate 1. For example, using BWA:

            bwa mem reference_genome.fasta isolate2_reads.fastq > aligned.sam

        or, use vrap to generate input.bam

            #for bowtie (1 DB): --host Bacillus_cereus.fasta 
            #for blast (5 DBs): --virus=Acinetobacter_baumannii.fasta -n /mnt/h1/jhuang/blast/nt -a /mnt/h1/jhuang/blast/nr, and two default download_db.py
            #A6_chr_plasmid.fasta reads filtered, should only the contamination reads remaining! 
            vrap/vrap.py -1 ./results_ATCC19606/trimmed/A10_CraA_HQ_trimmed_P_1.fastq.gz -2 ./results_ATCC19606/trimmed/A10_CraA_HQ_trimmed_P_2.fastq.gz -o A10_vrap_host_A6 --host A6_chr_plasmid.fasta --noblast -t 55

    Step 2: Convert SAM to BAM, Sort and Index
        Convert the SAM file to BAM, sort it, and create an index:

            samtools view -Sb mapped > aligned.bam
            samtools sort aligned.bam -o sorted.bam
            samtools index sorted.bam

    Step 3: Coverage Analysis
        Check the coverage of the genome regions. Regions with no coverage will be masked later.

            #bedtools genomecov -ibam sorted.bam -bg > coverage.bed
            #BUG above: By default, Bedtools doesn't explicitly list regions with zero coverage in the output of genomecov. To get around this, you can use the -bga option with genomecov. This option ensures that regions with zero coverage are also included in the output. Here's how you can modify your command:
            bedtools genomecov -ibam sorted.bam -bga > coverage.bed

    Step 4: Create a Bed File for Masking
        Create a BED file that specifies regions with no or low coverage. Assume that any region with coverage less than a certain threshold (e.g., 1x) should be masked:

            awk '$4 < 1 {print $1 "\t" $2 "\t" $3}' coverage.bed > low_coverage.bed

    Step 5: Generate Consensus Sequence
        Generate a consensus sequence, using bcftools, and mask regions with low or no coverage:

            bcftools mpileup -Ou -f ../../CP059040.fasta sorted.bam | bcftools call -mv --ploidy 1 -Oz -o calls.vcf.gz
            bcftools index calls.vcf.gz
            bcftools consensus -f ../../CP059040.fasta -m low_coverage.bed calls.vcf.gz -o consensus.fasta
            #The site CP059040:4 overlaps with another variant, skipping...
            #The site CP059040:3125036 overlaps with another variant, skipping...
            #Applied 58 variants

            bcftools mpileup -Ou -f ../../A6_chr_plasmid.fasta sorted.bam | bcftools call -mv --ploidy 1 -Oz -o calls.vcf.gz
            bcftools index calls.vcf.gz
            bcftools consensus -f ../../A6_chr_plasmid.fasta -m low_coverage.bed calls.vcf.gz -o consensus.fasta
            #The site seq00000000:4 overlaps with another variant, skipping...
            #The site seq00000000:3125036 overlaps with another variant, skipping...
            #Applied 61 variants
            #--> TODO: could report the following regions are not covered in A10 comparing to A6.
            #seq00000000     364666  365765
            #seq00000000     2810907 2813323
            #seq00000000     2813552 2813779
            #seq00000000     2813923 2819663
            #seq00000000     2821839 2831921
            #seq00000000     2832750 2832753
            #seq00000000     2833977 2860354
            #seq00000000     2860505 2861780
            #seq00000000     2861931 2861951
            #seq00000000     2862102 2862234
            #seq00000000     2862385 2863434
            #seq00000000     3124939 3124943
            #seq00000000     3125009 3125010

        In this command, the -m option in bcftools consensus is used to specify a BED file for masking low-coverage regions.

varicant calling using spandx

4.2-1. generate genebank in snpEff

    mkdir ~/anaconda3/envs/spandx/share/snpeff-4.3.1t-5/data/ATCC17978 
    cp ATCC17978.gb ~/anaconda3/envs/spandx/share/snpeff-4.3.1t-5/data/ATCC17978/genes.gbk
    mkdir ~/anaconda3/envs/spandx/share/snpeff-4.3.1t-5/data/ATCC19606 
    cp ATCC19606.gb ~/anaconda3/envs/spandx/share/snpeff-4.3.1t-5/data/ATCC19606/genes.gbk
    mkdir ~/anaconda3/envs/spandx/share/snpeff-4.3.1t-5/data/AYE 
    cp AYE.gb ~/anaconda3/envs/spandx/share/snpeff-4.3.1t-5/data/AYE/genes.gbk

    vim ~/anaconda3/envs/spandx/share/snpeff-4.3.1t-5/snpEff.config
    /home/jhuang/anaconda3/envs/spandx/bin/snpEff build -genbank ATCC17978      -d
    /home/jhuang/anaconda3/envs/spandx/bin/snpEff build -genbank ATCC19606      -d
    /home/jhuang/anaconda3/envs/spandx/bin/snpEff build -genbank AYE      -d

    /home/jhuang/anaconda3/envs/spandx/bin/snpEff build -genbank ATCC17978
    #        Protein check:  ATCC17978       OK: 3644        Not found: 81   Errors: 0       Error percentage: 0.0%
    /home/jhuang/anaconda3/envs/spandx/bin/snpEff build -genbank ATCC19606
    #        Protein check:  ATCC19606       OK: 3702        Not found: 0    Errors: 0       Error percentage: 0.0%
    /home/jhuang/anaconda3/envs/spandx/bin/snpEff build -genbank AYE
    #        Protein check:  AYE     OK: 3607        Not found: 52   Errors: 0       Error percentage: 0.0%

4.3. using spandx calling variants

    #Note that we recall the wild type (isolate 150) SNPs to its own assembly to increase the precision.
    gzip A19_17978_HQ_trimmed_P_1.fastq A19_17978_HQ_trimmed_P_2.fastq
    #ln -s /home/jhuang/Tools/spandx/ spandx
    #-t
    snpEff eff -nodownload -no-downstream -no-intergenic -ud 100 -v CP040849.1 noAB_wildtype_trimmed.PASS.snps.vcf > noAB_wildtype_trimmed.PASS.snps.annotated.vcf
    #
    #               'CP040849'      2659111 Standard
    (spandx) jhuang@hamm:~/DATA/Data_Tam_Acinetobacter_baumannii/results_ATCC17978$ nextflow run spandx/main.nf --fastq "trimmed/*_P_{1,2}.fastq.gz" --ref ATCC17978.fasta --annotation --database ATCC17978 -resume

    gzip A12_AYE_HQ_trimmed_P_1.fastq A12_AYE_HQ_trimmed_P_2.fastq
    ln -s /home/jhuang/Tools/spandx/ spandx
    nextflow run spandx/main.nf --fastq "trimmed/*_P_{1,2}.fastq.gz" --ref AYE.fasta --annotation --database AYE -resume

    gzip A6_WT_HQ_trimmed_P_1.fastq A6_WT_HQ_trimmed_P_2.fastq A10_CraA_HQ_trimmed_P_1.fastq A10_CraA_HQ_trimmed_P_2.fastq
    ln -s /home/jhuang/Tools/spandx/ spandx
    nextflow run spandx/main.nf --fastq "trimmed/*_P_{1,2}.fastq.gz" --ref ATCC19606.fasta --annotation  --database ATCC19606 -resume

    # DEBUG: "--mixtures --structural" does not work well!
    #pindel -f ATCC19606.fasta -T 2 -i pindel.bam.config -o pindel.out                 
    #pindel -f ATCC19606.fasta -T 2 -i pindel.bam.config -o pindel.out

6, filtering process of SPANDx results

    #Input file is Outputs/Phylogeny_and_annotation/All_SNPs_indels_annotated.txt (61203)
    #- Step1: filtering $6!=$7 (330)
    awk '{if($6!=$7) print}' < All_SNPs_indels_annotated.txt > All_SNPs_indels_annotated_.txt
    #- Step2: restricted the first 7 columns (240)
    cut -d$'\t' -f1-7 All_SNPs_indels_annotated_.txt > f1_7
    #- Step3: filtering examples as "G/C" (1450)
    grep -v "/" f1_7 > f1_7_
    #- Step4: filtering examples as ".       A" (170)
    grep -v "\." f1_7_ > f1_7__
    #- Step5: filtering examples as "T       *" (162), since we want to have only SNPs differing between BK20399 and GE3138.
    grep -v "*" f1_7__ > f1_7___
    #- Step6: only SNPs (50).
    grep -v "INDEL" f1_7___ > f1_7____
    #- Step7: get the complete rows with the positions
    cut -d$'\t' -f2-2 f1_7____ > f2
    replace "\n" with " All_SNPs_indels_annotated_.txt >> All_SNPs_annotated.txt\ngrep "
    grep "CHROM" All_SNPs_indels_annotated_.txt > All_SNPs_annotated.txt
    grep "46882" All_SNPs_indels_annotated_.txt >> All_SNPs_annotated.txt
    grep "46892" All_SNPs_indels_annotated_.txt >> All_SNPs_annotated.txt
    grep "46936" All_SNPs_indels_annotated_.txt >> All_SNPs_annotated.txt
    ....
    #- Step8: Missense mutations ()
    grep "missense_variant" All_SNPs_annotated.txt

7, indel calling using freebayes from merged bam

    mkdir freebayes
    cd freebayes
    #run picard unter java17
    (java17) picard-tools AddOrReplaceReadGroups I=../Outputs/bams/A12_AYE_HQ_trimmed.dedup.bam O=A12_AYE_readgroup-added.bam SORT_ORDER=coordinate CREATE_INDEX=true RGPL=illumina RGID=A12_AYE RGSM=A12_AYE RGLB=standard RGPU=A12_AYE VALIDATION_STRINGENCY=LENIENT
    #(java17) picard-tools AddOrReplaceReadGroups I=../Outputs/bams/148_trimmed.dedup.bam O=148_readgroup-added.bam SORT_ORDER=coordinate CREATE_INDEX=true RGPL=illumina RGID=148 RGSM=148 RGLB=standard RGPU=148 VALIDATION_STRINGENCY=LENIENT
    #(java17) picard-tools AddOrReplaceReadGroups I=../Outputs/bams/149_trimmed.dedup.bam O=149_readgroup-added.bam SORT_ORDER=coordinate CREATE_INDEX=true RGPL=illumina RGID=149 RGSM=149 RGLB=standard RGPU=149 VALIDATION_STRINGENCY=LENIENT
    #(java17) picard-tools MergeSamFiles INPUT=150_readgroup-added.bam INPUT=148_readgroup-added.bam INPUT=149_readgroup-added.bam OUTPUT=merged_picard.bam
    #(java17) picard-tools CreateSequenceDictionary R=wildtype_150.fasta O=wildtype_150.dict

    #Note that the reference is 'wildtype_150'.
    # Direkt SNP calling is not suggested. It is better if we take a realigning step before SNP and Indel-calling.
    #conda install -c bioconda gatk4 gatk
    #gatk-register /home/jhuang/Tools/GenomeAnalysisTK-3.8-1-0-gf15c1c3ef/GenomeAnalysisTK.jar
    #need wildtype_150.fasta.fai and wildtype_150.dict

    #under the bengal3_ac3 environment
    (bengal3_ac3) samtools index merged_picard.bam 
    (bengal3_ac3) gatk -T RealignerTargetCreator -I merged_picard.bam  -R /media/jhuang/Elements1/Data_Holger_Klebsiella_pneumoniae_SNP/wildtype_150.fasta -o realigned.merged.bam.list
    (bengal3_ac3) gatk -T IndelRealigner -I merged_picard.bam -R /media/jhuang/Elements1/Data_Holger_Klebsiella_pneumoniae_SNP/wildtype_150.fasta -targetIntervals realigned.merged.bam.list -o merged_realigned.bam

    freebayes --fasta-reference /media/jhuang/Elements1/Data_Holger_Klebsiella_pneumoniae_SNP_PUBLISHING/wildtype_150.fasta -b merged_realigned.bam -p 1 -i -X -u | /home/jhuang/Tools/freebayes_git/freebayes/vcflib/bin/vcffilter -f 'QUAL > 30' > merged.realigned.snps_filtered_freebayes.vcf

    freebayes --fasta-reference /media/jhuang/Elements1/Data_Holger_Klebsiella_pneumoniae_SNP_PUBLISHING/wildtype_150.fasta -b merged_realigned.bam -p 1 -I -X -u | /home/jhuang/Tools/freebayes_git/freebayes/vcflib/bin/vcffilter -f 'QUAL > 30' > merged.realigned.indels_filtered_freebayes.vcf
    #BETTER with realigned bam (see commands above): freebayes --fasta-reference /media/jhuang/Elements1/Data_Holger_Klebsiella_pneumoniae_SNP_PUBLISHING/wildtype_150.fasta -b merged_picard.bam   -p 1 -I -X -u | /home/jhuang/Tools/freebayes_git/freebayes/vcflib/bin/vcffilter -f 'QUAL > 30' > merged.indels_filtered_freebayes.vcf

    #freebayes -f {params.reference} \
    #  --ploidy 1 \
    #  --min-coverage {params.mincov} \
    #  --min-mapping-quality {params.mapqual} \
    #  --min-base-quality {params.minbasequal} \
    #  {input.forward} {input.reverse} > {output.vcf}

    #To call structural variants (indels) and single nucleotide polymorphisms (SNPs) using FreeBayes, you'll need to follow a series of steps. FreeBayes is a Bayesian genetic variant detector designed to find small polymorphisms, specifically SNPs, indels, MNPs (multi-nucleotide polymorphisms), and complex events (composite insertion and substitution events) smaller than the length of a read.

    #FreeBayes is primarily designed for calling small-scale variants like SNPs and indels (insertions and deletions) but is not typically used for large-scale structural variant detection (like large insertions, deletions, inversions, or translocations). Structural variants are usually larger in size, often exceeding the length of a single read, which is beyond the scope of what FreeBayes is optimized to handle.
    #For calling structural variants, you would generally use other tools that are specifically designed for this purpose, such as:
        Delly: For large deletions, tandem duplications, inversions, and translocations.
        LUMPY: A probabilistic framework for structural variant discovery.
        Manta: Designed for efficient calling of structural variants in cancer and germline studies.
        CNVnator: For discovering copy number variants.
    #If you're specifically looking to use FreeBayes for larger indels that might be on the edge of what is considered a small-scale variant, you can adjust the parameters to be more permissive with respect to variant size. However, this is not typically recommended for true structural variants.
    For small indels and SNPs, FreeBayes is quite effective, and you would follow the general procedure of aligning your reads, processing the BAM files, and then running FreeBayes with appropriate parameters set for your study's requirements. But again, for structural variant detection, other tools would be more suitable.

8, merge results from two variant calling methods!

    #-- post-processing --
    awk '{if($6!=$7) print}' < All_SNPs_indels_annotated.txt > All_SNPs_indels_annotated_.txt
    cut -d$'\t' -f1-7 All_SNPs_indels_annotated_.txt > f1_7
    grep -v "/" f1_7 > f1_7_
    grep -v "\." f1_7_ > f1_7__
    grep -v "*" f1_7__ > f1_7___

    grep -v "INDEL" f1_7___ > f1_7_SNPs
    cut -d$'\t' -f2-2 f1_7_SNPs > f1_7_SNPs_f2
    \n --> " All_SNPs_indels_annotated_.txt >> All_SNPs_annotated.txt\ngrep "
    #grep "CHROM" All_SNPs_indels_annotated_.txt > All_SNPs_annotated.txt
    #grep "46882" All_SNPs_indels_annotated_.txt >> All_SNPs_annotated.txt
    #grep "46892" All_SNPs_indels_annotated_.txt >> All_SNPs_annotated.txt

    grep -v "SNP" f1_7___ > f1_7_indels
    cut -d$'\t' -f2-2 f1_7_indels > f1_7_indels_f2
    \n --> " All_SNPs_indels_annotated_.txt >> All_indels_annotated.txt\ngrep "
    #grep "CHROM" All_SNPs_indels_annotated_.txt > All_indels_annotated.txt
    #grep "49848" All_SNPs_indels_annotated_.txt >> All_indels_annotated.txt
    #grep "53872" All_SNPs_indels_annotated_.txt >> All_indels_annotated.txt

9 (optional), PubMLST, ResFinder or RGI calling https://cge.cbs.dtu.dk/services/ https://cge.cbs.dtu.dk/services/MLST/ https://pubmlst.org/bigsdb?db=pubmlst_pacnes_seqdef&page=batchProfiles&scheme_id=3 https://cge.food.dtu.dk/services/ResFinder/ https://card.mcmaster.ca/home https://www.pseudomonas.com/ https://www.pseudomonas.com/amr/list https://card.mcmaster.ca/analyze/rgi The antimicrobial resistance genes were predicted by the tool RGI 6.0.0 which utilizes the database CARD 3.2.5 (PMID: 31665441). !!!!#check the difference of SNPs in the two files, and then look for the SNPs in my tables comparing the two isolates!!!!

    cut -d$'\t' -f9-21 BK20399_contigs.fa.txt > f9_21
    cut -d$'\t' -f2-6 BK20399_contigs.fa.txt > f2_6
    #Best_Hit_ARO --> Antibiotic Resistance Ontology (ARO) Term
    paste f9_21 f2_6 > BK20399_.txt
    #grep "CARD_Protein_Sequence" BK20399_.txt > header
    grep -v "CARD_Protein_Sequence" BK20399_.txt > BK20399__.txt
    sort BK20399__.txt > BK20399_sorted.txt

    cut -d$'\t' -f9-21 GE3138_contigs.fa.txt > f9_21
    cut -d$'\t' -f2-6 GE3138_contigs.fa.txt > f2_6
    paste f9_21 f2_6 > GE3138_.txt
    grep -v "CARD_Protein_Sequence" GE3138_.txt > GE3138__.txt
    sort GE3138__.txt > GE3138_sorted.txt

    cut -d$'\t' -f1-6 BK20399_sorted.txt > BK20399_f1_6.txt  
    cut -d$'\t' -f1-6 GE3138_sorted.txt > GE3138_f1_6.txt
    diff BK20399_f1_6.txt GE3138_f1_6.txt
    #-->< APH(3'')-Ib   99.63   3002639 protein homolog model   n/a n/a

    cat header BK20399_sorted.txt > BK20399.txt
    cat header GE3138_sorted.txt > GE3138.txt

    ~/Tools/csv2xls-0.4/csv_to_xls.py BK20399.txt GE3138.txt -d$'\t' -o ARG_calling.xls

    #---- another example ----
    cut -d$'\t' -f9-21 AW27149_before.txt > f9_21
    cut -d$'\t' -f2-6 AW27149_before.txt > f2_6
    #Best_Hit_ARO --> Antibiotic Resistance Ontology (ARO) Term
    paste f9_21 f2_6 > AW27149_before_.txt
    grep "CARD_Protein_Sequence" AW27149_before_.txt > header
    grep -v "CARD_Protein_Sequence" AW27149_before_.txt > AW27149_before__.txt
    sort AW27149_before__.txt > AW27149_before_sorted.txt

    cut -d$'\t' -f9-21 AW48641_after.txt > f9_21
    cut -d$'\t' -f2-6 AW48641_after.txt > f2_6
    paste f9_21 f2_6 > AW48641_after_.txt
    grep -v "CARD_Protein_Sequence" AW48641_after_.txt > AW48641_after__.txt
    sort AW48641_after__.txt > AW48641_after_sorted.txt

    cut -d$'\t' -f1-6 AW27149_before_sorted.txt > AW27149_before_f1_6.txt  
    cut -d$'\t' -f1-6 AW48641_after_sorted.txt > AW48641_after_f1_6.txt
    diff AW27149_before_f1_6.txt AW48641_after_f1_6.txt
    #48c48
    #< Pseudomonas aeruginosa CpxR   100.0   3004054 protein homolog model   n/a     n/a
    #---
    #> Pseudomonas aeruginosa CpxR   99.56   3004054 protein homolog model   n/a     n/a

    cat header AW27149_before_sorted.txt > ../RGI_calling_AW27149_before.txt
    cat header AW48641_after_sorted.txt > ../RGI_calling_AW48641_after.txt

    #mlst add header "isolate   species ST  acsA    aroE    guaA    mutL    nuoD    ppsA    trpE"
    mv 
    ~/Tools/csv2xls-0.4/csv_to_xls.py All_SNPs_annotated.csv RGI_calling_AW27149_before.txt RGI_calling_AW48641_after.txt mlst.txt -d$'\t' -o SNP_ARG_calling.xls

    #Find the common SNP between SNP calling and RGI calling, since the SNP calling has bad annotation, marked yellow.

Kraken2 Installation and Usage Guide

Leave a reply

kraken2 –db Minikraken2_v2 input_sequences.fasta –output output.kraken2

kraken2 –db=/home/jhuang/Tools/k2_standard_20230605 –output=A10_output.txt –report=A10_report.txt –paired ./results_ATCC19606/trimmed/A10_CraA_HQ_trimmed_P_1.fastq.gz ./results_ATCC19606/trimmed/A10_CraA_HQ_trimmed_P_2.fastq.gz

Kraken2 is a system for assigning taxonomic labels to short DNA sequences, such as those produced by genome sequencing technologies. It achieves high speeds by utilizing exact alignments of k-mers and a novel classification algorithm.

Here’s a step-by-step guide to installing and running Kraken2:

Install Kraken2: Kraken2 is available on GitHub. You can use git to clone the repository and then compile it.

git clone https://github.com/DerrickWood/kraken2.git
cd kraken2
./install_kraken2.sh .
#export PATH=/home/jhuang/Tools/kraken2:$PATH
#git clone https://github.com/jenniferlu717/Bracken.git
#./install_bracken.sh .

This will install Kraken2 in the current directory.

Download or build a database: Kraken2 requires a database to classify sequences. You can either download a pre-built database or build one yourself. Download a pre-built database:

#https://benlangmead.github.io/aws-indexes/k2
wget https://genome-idx.s3.amazonaws.com/kraken/k2_standard_20230605.tar.gz
#https://ccb.jhu.edu/software/kraken2/index.shtml?t=downloads
#https://lomanlab.github.io/mockcommunity/mc_databases.html

Kraken2 provides some standard databases that can be downloaded. Here’s how you can download the MiniKraken2_v2 database (8GB):

kraken2-build --download-library bacteria --db Minikraken2_v2
kraken2-build --download-library viruses --db Minikraken2_v2
kraken2-build --build --db Minikraken2_v2

Or, build a custom database: Building a database requires sequence data in the form of FASTA files. Here’s an example of building a bacterial database:

kraken2-build --download-taxonomy --db MY_DB
kraken2-build --download-library bacteria --db MY_DB
kraken2-build --build --db MY_DB

Classify sequences with Kraken2: Once you have a database, you can use Kraken2 to classify sequences. Here’s how you can classify a set of sequences in a file named input_sequences.fasta using the MiniKraken2_v2 database:

kraken2 --db Minikraken2_v2 input_sequences.fasta --output output.kraken2
#/home/jhuang/Tools/kraken2/kraken2 --db=/home/jhuang/Tools/k2_standard_20230605 --output=output.txt --report=report.txt --paired trimmed/Rotavirus_S3_R1.fastq.gz trimmed/Rotavirus_S3_R2.fastq.gz
  Loading database information... done.
  11351299 sequences (3196.22 Mbp) processed in 562.717s (1210.3 Kseq/m, 340.80 Mbp/m).
    11271104 sequences classified (99.29%)
    80195 sequences unclassified (0.71%)

This will produce an output file (output.kraken2) with the classification results.

Visualize results (Optional): Kraken2’s output can be complex. Visualization tools, such as Pavian or Krona, can be helpful for interpreting the results. These tools provide interactive pie charts and other graphics that can assist in understanding the taxonomic classifications provided by Kraken2.
```
#upload /home/jhuang/DATA/Data_Rotavirus/report.txt to https://fbreitwieser.shinyapps.io/pavian/
/usr/bin/convert Screenshot\ 2023-09-12\ at\ 11-09-13\ Pavian.png -crop 980x1260+280+300 output.png
```

Notes: Ensure that you have enough disk space. Building and storing databases, especially comprehensive ones, can require a significant amount of space. Kraken2’s speed and accuracy can be influenced by the size and content of the database. It’s always a trade-off between classification speed and the breadth of taxa you wish to detect.

sT manuscript revision: log2FC of sT itself at 3 and 8 dpt

Leave a reply

nextflow

#under hamm
ln -s /home/jhuang/Tools/nf-core-rnaseq-3.12.0/ rnaseq
#Debug the following error: added "--minAssignedFrags 0 \\" to modules/nf-core/salmon/quant/main.nf option "salmon quant" and added "--min_mapped_reads 0" in the nextflow command
(rnaseq_2021) jhuang@hamm:~/DATA/Data_Denise_sT_RNASeq$ /usr/local/bin/nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results_JN707599 --fasta JN707599.fasta --gtf JN707599.gtf -profile test_full -resume --max_memory 124.GB --max_time 2400.h --save_reference --aligner star_salmon --gtf_extra_attributes gene_id --gtf_group_features transcript_id --featurecounts_group_type gene_id --featurecounts_feature_type transcript --skip_rseqc --skip_dupradar --skip_preseq --skip_biotype_qc --skip_deseq2_qc --skip_multiqc --min_mapped_reads 0
(rnaseq_2021) jhuang@hamm:~/DATA/Data_Denise_sT_RNASeq$ /usr/local/bin/nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results_LT --fasta JN707599.fasta --gtf LT.gtf -profile test_full -resume --max_memory 124.GB --max_time 2400.h --save_reference --aligner star_salmon --gtf_extra_attributes gene_id --gtf_group_features transcript_id --featurecounts_group_type gene_id --featurecounts_feature_type transcript --skip_rseqc --skip_dupradar --skip_preseq --skip_biotype_qc --skip_deseq2_qc --skip_multiqc --min_mapped_reads 0
(rnaseq_2021) jhuang@hamm:~/DATA/Data_Denise_sT_RNASeq$ /usr/local/bin/nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results_sT --fasta JN707599.fasta --gtf sT.gtf -profile test_full -resume --max_memory 124.GB --max_time 2400.h --save_reference --aligner star_salmon --gtf_extra_attributes gene_id --gtf_group_features transcript_id --featurecounts_group_type gene_id --featurecounts_feature_type transcript --skip_rseqc --skip_dupradar --skip_preseq --skip_biotype_qc --skip_deseq2_qc --skip_multiqc --min_mapped_reads 0

#DEBUG
#V_8_4_1_p602_d8_DonorI,V_8_4_2_p602_d8_DonorI.fastq.gz,,auto
#V_8_5_1_p602and604_d3_DonorI,V_8_5_1_602and604_d3_DonorI.fastq.gz,,auto
#V_8_5_2_p602and604_d3_DonorII,V_8_5_2_602and604_d3_DonorII.fastq.gz,,auto
#mamba install -c bioconda rsem
#mamba install fq

construct DESeqDataSet including host+viral gene expressions

# Import the required libraries
library("AnnotationDbi")
library("clusterProfiler")
library("ReactomePA")
library(gplots)
library(tximport)
library(DESeq2)
library("org.Hs.eg.db")
library(dplyr)
library(tidyverse)
setwd("~/DATA/Data_Denise_sT_immune_evasion_PUBLISHING/Data_Denise_RNASeq/results_JN707599/star_salmon")
# Define paths to your Salmon output quantification files

files <- c("V_8_0_mock_DonorI" = "./V_8_0_mock_DonorI/quant.sf",
"V_8_0_mock_DonorII" = "./V_8_0_mock_DonorII/quant.sf",
"V_8_1_5_p601_d3_DonorII" = "./V_8_1_5_p601_d3_DonorII/quant.sf",
"V_8_1_5_p604_d3_DonorII" = "./V_8_1_5_p604_d3_DonorII/quant.sf",
"V_8_1_5_p601_d8_DonorII" = "./V_8_1_5_p601_d8_DonorII/quant.sf",
"V_8_1_5_p604_d8_DonorII" = "./V_8_1_5_p604_d8_DonorII/quant.sf",
"V_8_1_6_p601_d3_DonorI" = "./V_8_1_6_p601_d3_DonorI/quant.sf",
"V_8_1_6_p604_d3_DonorI" = "./V_8_1_6_p604_d3_DonorI/quant.sf",
"V_8_1_6_p601_d8_DonorI" = "./V_8_1_6_p601_d8_DonorI/quant.sf",
"V_8_1_6_p604_d8_DonorI" = "./V_8_1_6_p604_d8_DonorI/quant.sf",
"V_8_2_3_p600_d3_DonorII" = "./V_8_2_3_p600_d3_DonorII/quant.sf",
"V_8_2_3_p605_d3_DonorII" = "./V_8_2_3_p605_d3_DonorII/quant.sf",
"V_8_2_3_p600_d8_DonorII" = "./V_8_2_3_p600_d8_DonorII/quant.sf",
"V_8_2_3_p605_d8_DonorII" = "./V_8_2_3_p605_d8_DonorII/quant.sf",
"V_8_2_4_p600_d3_DonorI" = "./V_8_2_4_p600_d3_DonorI/quant.sf",
"V_8_2_4_p605_d3_DonorI" = "./V_8_2_4_p605_d3_DonorI/quant.sf",
"V_8_2_4_p600_d8_DonorI" = "./V_8_2_4_p600_d8_DonorI/quant.sf",
"V_8_2_4_p605_d8_DonorI" = "./V_8_2_4_p605_d8_DonorI/quant.sf",
"V_8_4_1_p602_d8_DonorII" = "./V_8_4_1_p602_d8_DonorII/quant.sf",
"V_8_4_1_p602_d8_DonorI" = "./V_8_4_1_p602_d8_DonorI/quant.sf",
"V_8_3_1_p600and601_d12_DonorI" = "./V_8_3_1_p600and601_d12_DonorI/quant.sf",
"V_8_3_1_p604and605_d12_DonorI" = "./V_8_3_1_p604and605_d12_DonorI/quant.sf",
"V_8_3_2_p600and601_d9_DonorII" = "./V_8_3_2_p600and601_d9_DonorII/quant.sf",
"V_8_3_2_p604and605_d9_DonorII" = "./V_8_3_2_p604and605_d9_DonorII/quant.sf",
"V_8_4_2_p602_d3_DonorI" = "./V_8_4_2_p602_d3_DonorI/quant.sf",
"V_8_4_2_p602_d3_DonorII" = "./V_8_4_2_p602_d3_DonorII/quant.sf",
"V_8_5_1_p602and604_d3_DonorI" = "./V_8_5_1_sTplusLT_d3_Donor1/quant.sf",
"V_8_5_2_p602and604_d3_DonorII" = "./V_8_5_2_sTplusLT_d3_Donor2/quant.sf")

# Import the transcript abundance data with tximport
txi <- tximport(files, type = "salmon", txIn = TRUE, txOut = TRUE)

column_names <- colnames(txi$counts)
output_string <- paste(column_names, collapse = ", ")
cat(output_string, "\n")
#"V_8_0_mock_DonorI","V_8_0_mock_DonorII","V_8_1_5_p601_d3_DonorII","V_8_1_5_p604_d3_DonorII","V_8_1_5_p601_d8_DonorII","V_8_1_5_p604_d8_DonorII","V_8_1_6_p601_d3_DonorI","V_8_1_6_p604_d3_DonorI","V_8_1_6_p601_d8_DonorI","V_8_1_6_p604_d8_DonorI","V_8_2_3_p600_d3_DonorII","V_8_2_3_p605_d3_DonorII","V_8_2_3_p600_d8_DonorII","V_8_2_3_p605_d8_DonorII","V_8_2_4_p600_d3_DonorI","V_8_2_4_p605_d3_DonorI","V_8_2_4_p600_d8_DonorI","V_8_2_4_p605_d8_DonorI","V_8_4_1_p602_d8_DonorII","V_8_4_1_p602_d8_DonorI","V_8_3_1_p600and601_d12_DonorI","V_8_3_1_p604and605_d12_DonorI","V_8_3_2_p600and601_d9_DonorII","V_8_3_2_p604and605_d9_DonorII","V_8_4_2_p602_d3_DonorI","V_8_4_2_p602_d3_DonorII","V_8_5_1_p602and604_d3_DonorI","V_8_5_2_p602and604_d3_DonorII"

col_order <- c("V_8_0_mock_DonorI","V_8_0_mock_DonorII","V_8_1_5_p601_d3_DonorII","V_8_1_5_p604_d3_DonorII","V_8_1_5_p601_d8_DonorII","V_8_1_5_p604_d8_DonorII","V_8_1_6_p601_d3_DonorI","V_8_1_6_p604_d3_DonorI","V_8_1_6_p601_d8_DonorI","V_8_1_6_p604_d8_DonorI","V_8_2_3_p600_d3_DonorII","V_8_2_3_p605_d3_DonorII","V_8_2_3_p600_d8_DonorII","V_8_2_3_p605_d8_DonorII","V_8_2_4_p600_d3_DonorI","V_8_2_4_p605_d3_DonorI","V_8_2_4_p600_d8_DonorI","V_8_2_4_p605_d8_DonorI","V_8_4_1_p602_d8_DonorII","V_8_4_1_p602_d8_DonorI","V_8_3_1_p600and601_d12_DonorI","V_8_3_1_p604and605_d12_DonorI","V_8_3_2_p600and601_d9_DonorII","V_8_3_2_p604and605_d9_DonorII","V_8_4_2_p602_d3_DonorI","V_8_4_2_p602_d3_DonorII","V_8_5_1_p602and604_d3_DonorI","V_8_5_2_p602and604_d3_DonorII")
#reordered.txi <- txi[,col_order]

identical(column_names,col_order)

condition = as.factor(c("untreated","untreated", "p601_d3","p604_d3", "p601_d8","p604_d8",   "p601_d3","p604_d3","p601_d8","p604_d8",  "p600_d3","p605_d3","p600_d8", "p605_d8",  "p600_d3","p605_d3","p600_d8","p605_d8",  "p602_d8","p602_d8",  "p600and601_d9d12", "p604and605_d9d12","p600and601_d9d12","p604and605_d9d12",     "p602_d3","p602_d3",    "p602and604_d3","p602and604_d3"))

batch = as.factor(c("200420", "200420", "190927", "190927",    "190927", "190927", "190228", "190228",    "190228", "190228", "191008", "191008",    "191008", "191008", "190228", "190228",     "190228", "190228", "200817", "200817",       "200420", "200420", "200817", "200817",      "210302", "210302",  "210504","210504"))

ids = as.factor(c("untreated_DonorI","untreated_DonorII", "p601_d3_DonorII","p604_d3_DonorII", "p601_d8_DonorII","p604_d8_DonorII",   "p601_d3_DonorI","p604_d3_DonorI","p601_d8_DonorI","p604_d8_DonorI",  "p600_d3_DonorII","p605_d3_DonorII","p600_d8_DonorII", "p605_d8_DonorII",  "p600_d3_DonorI","p605_d3_DonorI","p600_d8_DonorI","p605_d8_DonorI",  "p602_d8_DonorII","p602_d8_DonorI",  "p600and601_d12_DonorI", "p604and605_d12_DonorI","p600and601_d9_DonorII","p604and605_d9_DonorII",        "p602_d3_DonorI","p602_d3_DonorII",      "p602and604_d3_DonorI","p602and604_d3_DonorII"))

donor = as.factor(c("DonorI","DonorII",  "DonorII","DonorII", "DonorII","DonorII",   "DonorI","DonorI","DonorI","DonorI",  "DonorII","DonorII","DonorII","DonorII",  "DonorI","DonorI","DonorI","DonorI",  "DonorII","DonorI",   "DonorI", "DonorI","DonorII","DonorII",        "DonorI","DonorII",    "DonorI","DonorII"))

# Define the colData for DESeq2
#colData = data.frame(row.names=column_names, condition=condition, donor=donor, batch=batch, ids=ids)
colData <- data.frame(condition=condition, donor=donor, row.names=names(files))

# -- transcript-level count data (for virus) --
# Create DESeqDataSet object
dds <- DESeqDataSetFromTximport(txi, colData=colData, design=~condition+donor)
write.csv(counts(dds), file="transcript_counts.csv")

# -- gene-level count data (for virus) --
# Read in the tx2gene map from salmon_tx2gene.tsv
tx2gene <- read.table("salmon_tx2gene.tsv", header=FALSE, stringsAsFactors=FALSE)
# Set the column names
colnames(tx2gene) <- c("transcript_id", "gene_id", "gene_name")
# Remove the gene_name column if not needed
tx2gene <- tx2gene[,1:2]
# Import and summarize the Salmon data with tximport
txi <- tximport(files, type = "salmon", tx2gene = tx2gene, txOut = FALSE)
dds <- DESeqDataSetFromTximport(txi, colData=colData, design=~condition+donor)
#dds <- dds[rowSums(counts(dds) > 3) > 2, ]    #60605-->26543
write.csv(counts(dds, normalized=FALSE), file="gene_counts.csv")

# -- merge the raw counts of human and microbe --
setwd("~/DATA/Data_Denise_sT_immune_evasion_PUBLISHING/Data_Denise_RNASeq/results/featureCounts")
d.raw<- read.delim2("merged_gene_counts_2_f3_f5.txt",sep="\t", header=TRUE, row.names=1)
colnames(d.raw)<- c("V_8_0_mock_DonorI","V_8_1_6_p601_d8_DonorI","V_8_1_5_p604_d3_DonorII","V_8_1_6_p604_d3_DonorI","V_8_2_3_p605_d8_DonorII","V_8_0_mock_DonorII","V_8_1_5_p601_d8_DonorII","V_8_2_3_p605_d3_DonorII","V_8_1_6_p604_d8_DonorI","V_8_2_3_p600_d8_DonorII","V_8_2_4_p600_d8_DonorI","V_8_2_4_p600_d3_DonorI","V_8_1_5_p604_d8_DonorII","V_8_1_5_p601_d3_DonorII","V_8_1_6_p601_d3_DonorI","V_8_2_3_p600_d3_DonorII","V_8_2_4_p605_d3_DonorI","V_8_4_1_p602_d8_DonorI","V_8_3_2_p604and605_d9_DonorII","V_8_4_2_p602_d3_DonorI","V_8_4_2_p602_d3_DonorII","V_8_2_4_p605_d8_DonorI","V_8_3_1_p600and601_d12_DonorI","V_8_4_1_p602_d8_DonorII","V_8_3_1_p604and605_d12_DonorI","V_8_3_2_p600and601_d9_DonorII", "V_8_5_1_p602and604_d3_DonorI", "V_8_5_2_p602and604_d3_DonorII")  #26364
col_order <- c("V_8_0_mock_DonorI","V_8_0_mock_DonorII","V_8_1_5_p601_d3_DonorII", "V_8_1_5_p604_d3_DonorII", "V_8_1_5_p601_d8_DonorII","V_8_1_5_p604_d8_DonorII",   "V_8_1_6_p601_d3_DonorI","V_8_1_6_p604_d3_DonorI","V_8_1_6_p601_d8_DonorI","V_8_1_6_p604_d8_DonorI",  "V_8_2_3_p600_d3_DonorII","V_8_2_3_p605_d3_DonorII","V_8_2_3_p600_d8_DonorII", "V_8_2_3_p605_d8_DonorII",  "V_8_2_4_p600_d3_DonorI","V_8_2_4_p605_d3_DonorI","V_8_2_4_p600_d8_DonorI","V_8_2_4_p605_d8_DonorI",  "V_8_4_1_p602_d8_DonorII","V_8_4_1_p602_d8_DonorI",  "V_8_3_1_p600and601_d12_DonorI", "V_8_3_1_p604and605_d12_DonorI","V_8_3_2_p600and601_d9_DonorII","V_8_3_2_p604and605_d9_DonorII",    "V_8_4_2_p602_d3_DonorI","V_8_4_2_p602_d3_DonorII",    "V_8_5_1_p602and604_d3_DonorI", "V_8_5_2_p602and604_d3_DonorII")
reordered.raw <- d.raw[,col_order]
write.csv(reordered.raw, file="merged_gene_counts_2_f3_f5_reordered.txt")

# Confirm the identification of the headers of two files before merging!
#kate ./results/featureCounts/merged_gene_counts_2_f3_f5_reordered.txt
#"","V_8_0_mock_DonorI","V_8_0_mock_DonorII","V_8_1_5_p601_d3_DonorII","V_8_1_5_p604_d3_DonorII","V_8_1_5_p601_d8_DonorII","V_8_1_5_p604_d8_DonorII","V_8_1_6_p601_d3_DonorI","V_8_1_6_p604_d3_DonorI","V_8_1_6_p601_d8_DonorI","V_8_1_6_p604_d8_DonorI","V_8_2_3_p600_d3_DonorII","V_8_2_3_p605_d3_DonorII","V_8_2_3_p600_d8_DonorII","V_8_2_3_p605_d8_DonorII","V_8_2_4_p600_d3_DonorI","V_8_2_4_p605_d3_DonorI","V_8_2_4_p600_d8_DonorI","V_8_2_4_p605_d8_DonorI","V_8_4_1_p602_d8_DonorII","V_8_4_1_p602_d8_DonorI","V_8_3_1_p600and601_d12_DonorI","V_8_3_1_p604and605_d12_DonorI","V_8_3_2_p600and601_d9_DonorII","V_8_3_2_p604and605_d9_DonorII","V_8_4_2_p602_d3_DonorI","V_8_4_2_p602_d3_DonorII","V_8_5_1_p602and604_d3_DonorI","V_8_5_2_p602and604_d3_DonorII"

#results_JN707599/star_salmon/gene_counts.csv
#"","V_8_0_mock_DonorI","V_8_0_mock_DonorII","V_8_1_5_p601_d3_DonorII","V_8_1_5_p604_d3_DonorII","V_8_1_5_p601_d8_DonorII","V_8_1_5_p604_d8_DonorII","V_8_1_6_p601_d3_DonorI","V_8_1_6_p604_d3_DonorI","V_8_1_6_p601_d8_DonorI","V_8_1_6_p604_d8_DonorI","V_8_2_3_p600_d3_DonorII","V_8_2_3_p605_d3_DonorII","V_8_2_3_p600_d8_DonorII","V_8_2_3_p605_d8_DonorII","V_8_2_4_p600_d3_DonorI","V_8_2_4_p605_d3_DonorI","V_8_2_4_p600_d8_DonorI","V_8_2_4_p605_d8_DonorI","V_8_4_1_p602_d8_DonorII","V_8_4_1_p602_d8_DonorI","V_8_3_1_p600and601_d12_DonorI","V_8_3_1_p604and605_d12_DonorI","V_8_3_2_p600and601_d9_DonorII","V_8_3_2_p604and605_d9_DonorII","V_8_4_2_p602_d3_DonorI","V_8_4_2_p602_d3_DonorII","V_8_5_1_p602and604_d3_DonorI","V_8_5_2_p602and604_d3_DonorII"

#cat ../Data_Denise_RNASeq/results/featureCounts/merged_gene_counts_2_f3_f5_reordered.txt ../Data_Denise_RNASeq/results_JN707599/star_salmon/gene_counts.csv > merged_gene_counts.csv
#~/Tools/csv2xls-0.4/csv_to_xls.py merged_gene_counts.csv -d',' -o raw_gene_counts.xls;

setwd("~/DATA/Data_Denise_sT_immune_evasion_PUBLISHING/Data_Denise_RNASeq_Merged/")
d.raw <- read.csv("merged_gene_counts.csv", header=TRUE, row.names=1)
dds <- DESeqDataSetFromMatrix(countData=d.raw, colData=colData, design=~condition+donor)
dim(counts(dds))
head(counts(dds), 10)
rld <- rlogTransformation(dds)

# draw simple pca and heatmap
library(gplots) 
library("RColorBrewer")
#mat <- assay(rld)
#mm <- model.matrix(~condition, colData(rld))
#mat <- limma::removeBatchEffect(mat, batch=rld$batch, design=mm)
#assay(rld) <- mat
# -- pca --
png("pca.png", 1200, 800)
plotPCA(rld, intgroup=c("condition"))
dev.off()
# -- heatmap --
png("heatmap.png", 1200, 800)
distsRL <- dist(t(assay(rld)))
mat <- as.matrix(distsRL)
hc <- hclust(distsRL)
hmcol <- colorRampPalette(brewer.pal(9,"GnBu"))(100)
heatmap.2(mat, Rowv=as.dendrogram(hc),symm=TRUE, trace="none",col = rev(hmcol), margin=c(13, 13))
dev.off()

(corrected of last step) construct DESeqDataSet including host+viral gene expressions

column_names <- colnames(d.raw)
col_order <- c("V_8_0_mock_DonorI","V_8_0_mock_DonorII","V_8_1_5_p601_d3_DonorII","V_8_1_5_p604_d3_DonorII","V_8_1_5_p601_d8_DonorII","V_8_1_5_p604_d8_DonorII","V_8_1_6_p601_d3_DonorI","V_8_1_6_p604_d3_DonorI","V_8_1_6_p601_d8_DonorI","V_8_1_6_p604_d8_DonorI","V_8_2_3_p600_d3_DonorII","V_8_2_3_p605_d3_DonorII","V_8_2_3_p600_d8_DonorII","V_8_2_3_p605_d8_DonorII","V_8_2_4_p600_d3_DonorI","V_8_2_4_p605_d3_DonorI","V_8_2_4_p600_d8_DonorI","V_8_2_4_p605_d8_DonorI","V_8_4_1_p602_d8_DonorII","V_8_4_1_p602_d8_DonorI","V_8_3_1_p600and601_d12_DonorI","V_8_3_1_p604and605_d12_DonorI","V_8_3_2_p600and601_d9_DonorII","V_8_3_2_p604and605_d9_DonorII","V_8_4_2_p602_d3_DonorI","V_8_4_2_p602_d3_DonorII","V_8_5_1_p602and604_d3_DonorI","V_8_5_2_p602and604_d3_DonorII")
identical(column_names,col_order)

condition = as.factor(c("untreated","untreated", "p601_d3","p604_d3", "p601_d8","p604_d8",   "p601_d3","p604_d3","p601_d8","p604_d8",  "p600_d3","p605_d3","p600_d8", "p605_d8",  "p600_d3","p605_d3","p600_d8","p605_d8",  "p602_d8","p602_d8",  "p600and601_d9d12", "p604and605_d9d12","p600and601_d9d12","p604and605_d9d12",     "p602_d3","p602_d3",    "p602and604_d3","p602and604_d3"))
batch = as.factor(c("200420", "200420", "190927", "190927",    "190927", "190927", "190228", "190228",    "190228", "190228", "191008", "191008",    "191008", "191008", "190228", "190228",     "190228", "190228", "200817", "200817",       "200420", "200420", "200817", "200817",      "210302", "210302",  "210504","210504"))
ids = as.factor(c("untreated_DonorI","untreated_DonorII", "p601_d3_DonorII","p604_d3_DonorII", "p601_d8_DonorII","p604_d8_DonorII",   "p601_d3_DonorI","p604_d3_DonorI","p601_d8_DonorI","p604_d8_DonorI",  "p600_d3_DonorII","p605_d3_DonorII","p600_d8_DonorII", "p605_d8_DonorII",  "p600_d3_DonorI","p605_d3_DonorI","p600_d8_DonorI","p605_d8_DonorI",  "p602_d8_DonorII","p602_d8_DonorI",  "p600and601_d12_DonorI", "p604and605_d12_DonorI","p600and601_d9_DonorII","p604and605_d9_DonorII",        "p602_d3_DonorI","p602_d3_DonorII",      "p602and604_d3_DonorI","p602and604_d3_DonorII"))
donor = as.factor(c("DonorI","DonorII",  "DonorII","DonorII", "DonorII","DonorII",   "DonorI","DonorI","DonorI","DonorI",  "DonorII","DonorII","DonorII","DonorII",  "DonorI","DonorI","DonorI","DonorI",  "DonorII","DonorI",   "DonorI", "DonorI","DonorII","DonorII",        "DonorI","DonorII",    "DonorI","DonorII"))
colData <- data.frame(condition=condition, donor=donor, row.names=column_names)

setwd("~/DATA/Data_Denise_sT_PUBLISHING/Data_Denise_RNASeq_Merged_Corrected/")
d.raw <- read.csv("merged_gene_counts_corrected.csv", header=TRUE, row.names=1)
dds <- DESeqDataSetFromMatrix(countData=d.raw, colData=colData, design=~condition+donor)
dim(counts(dds))
head(counts(dds), 10)
rld <- rlogTransformation(dds)

# draw simple pca and heatmap
library(gplots) 
library("RColorBrewer")
#mat <- assay(rld)
#mm <- model.matrix(~condition, colData(rld))
#mat <- limma::removeBatchEffect(mat, batch=rld$batch, design=mm)
#assay(rld) <- mat
# -- pca --
png("pca.png", 1200, 800)
plotPCA(rld, intgroup=c("condition"))
dev.off()
# -- heatmap --
png("heatmap.png", 1200, 800)
distsRL <- dist(t(assay(rld)))
mat <- as.matrix(distsRL)
hc <- hclust(distsRL)
hmcol <- colorRampPalette(brewer.pal(9,"GnBu"))(100)
heatmap.2(mat, Rowv=as.dendrogram(hc),symm=TRUE, trace="none",col = rev(hmcol), margin=c(13, 13))
dev.off()

select the differentially expressed genes

#untreated_DonorI,untreated
#untreated_DonorII,untreated
#p601_d3_DonorII,mCherry control
#p604_d3_DonorII,sT
#p601_d8_DonorII,mCherry control
#p604_d8_DonorII,sT
#p601_d3_DonorI,mCherry control
#p604_d3_DonorI,sT
#p601_d8_DonorI,mCherry control
#p604_d8_DonorI,sT
#
#p600_d3_DonorII,GFP control
#p605_d3_DonorII,LTtr 
#p600_d8_DonorII,GFP control
#p605_d8_DonorII,LTtr 
#p600_d3_DonorI,GFP control
#p605_d3_DonorI,LTtr 
#p600_d8_DonorI,GFP control
#p605_d8_DonorI,LTtr 
#p602_d3_DonorI,LT 
#p602_d3_DonorII,LT 
#p602_d8_DonorII,LT 
#p602_d8_DonorI,LT 
#
#p600and601_d12_DonorI,GFP+mCherry control
#p604and605_d12_DonorI,sT+LTtr
#p600and601_d9_DonorII,GFP+mCherry control
#p604and605_d9_DonorII,sT+LTtr
#
#p602and604_d3_DonorI,sT+LT 
#p602and604_d3_DonorII,sT+LT 

#CONSOLE: mkdir /home/jhuang/DATA/Data_Denise_sT_immune_evasion_PUBLISHING/Data_Denise_RNASeq_Merged/degenes
#---- relevel to control ----
dds$condition <- relevel(dds$condition, "p601_d3")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("p604_d3_vs_p601_d3")

dds$condition <- relevel(dds$condition, "p600_d3")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("p605_d3_vs_p600_d3")

dds$condition <- relevel(dds$condition, "p600_d3")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("p602_d3_vs_p600_d3")

dds$condition <- relevel(dds$condition, "p601_d8")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("p604_d8_vs_p601_d8")

dds$condition <- relevel(dds$condition, "p600_d8")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("p605_d8_vs_p600_d8")

dds$condition <- relevel(dds$condition, "p600_d8")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("p602_d8_vs_p600_d8")

library(biomaRt)
listEnsembl()
listMarts()
#--> total 69, 27  GRCh38.p7 and 39  GRCm38.p4
#ensembl <- useEnsembl(biomart = "ensembl", dataset = "hsapiens_gene_ensembl", version="104")
#GRCh37.p13
ensembl <- useEnsembl("ensembl","hsapiens_gene_ensembl", host="https://grch37.ensembl.org")
datasets <- listDatasets(ensembl)
# -- 1. export res_df containing both human and virus genes --
#for (i in clist) {
  i<-clist[1]
  contrast = paste("condition", i, sep="_")
  res = results(dds, name=contrast)
  res <- res[!is.na(res$log2FoldChange),]
  res_df <- as.data.frame(res)
  write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(i, "all.txt", sep="-"))
  up <- subset(res_df, padj<=0.05 & log2FoldChange>=2)
  down <- subset(res_df, padj<=0.05 & log2FoldChange<=-2)
  write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(i, "up.txt", sep="-"))
  write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(i, "down.txt", sep="-"))
#}

# -- 2. annatete human genes 'geness_res', note that the virus genes in that is ignored in the process since they are not in the database --
#> dim(geness) [1] 21938     9 using GRCh38.p7
#> dim(geness) [1] 23517     9 using GRCh37.p13
#for (i in clist) {
  #i<-clist[1]
  contrast = paste("condition", i, sep="_")
  res = results(dds, name=contrast)
  res <- res[!is.na(res$log2FoldChange),]
  geness <- getBM(attributes = c('ensembl_gene_id', 'external_gene_name', 'gene_biotype', 'entrezgene_id', 'chromosome_name', 'start_position', 'end_position', 'strand', 'description'),
      filters = 'external_gene_name',
      values = rownames(res), 
      mart = ensembl)
  geness_uniq <- distinct(geness, ensembl_gene_id, .keep_all= TRUE)
  #merge by column by common colunmn name, in the case "GENEID"
  res$geneSymbol = rownames(res)
  #identical(rownames(res), rownames(geness_uniq))
  res_df <- as.data.frame(res)
  geness_res <- merge(geness_uniq, res_df, by.x="external_gene_name", by.y="geneSymbol")
  dim(geness_res)  #22044    15
  rownames(geness_res) <- geness_res$ensembl_gene_id
  geness_res$ensembl_gene_id <- NULL
#}
# -- 3. prepare annatete virus genes --
virus_genes <- c("LT", "sT", "VP1", "VP2", "VP3")
virus_rows <- res_df[rownames(res_df) %in% virus_genes, ]
virus_rows$external_gene_name <- rownames(virus_rows)
virus_rows$chromosome_name <- "JN707599"
# Define default values based on data type
default_values <- list(
  character = NULL,
  numeric = 0,
  integer = 0L,
  logical = FALSE
)
# Ensure that virus_rows has the same columns as geness_res
for (col in colnames(geness_res)) {
  if (!col %in% colnames(virus_rows)) {
    data_type <- class(geness_res[[col]])[1]
    default_value <- default_values[[data_type]]
    virus_rows[[col]] <- rep(default_value, nrow(virus_rows))
  }
}
missing_cols <- setdiff(colnames(geness_res), colnames(virus_rows))
for (col in missing_cols) {
  virus_rows[[col]] <- NA  # Or another default value as appropriate
}
# Reorder columns in virus_rows to match the order in geness_res
virus_rows <- virus_rows[, colnames(geness_res), drop = FALSE]
# -- 4. merge them together --
#for (i in clist) {
  merged_df <- rbind(geness_res, virus_rows)
  merged_df_sorted <- as.data.frame(merged_df[order(merged_df$padj),])
  write.csv(merged_df_sorted, file = paste(i, "all_annotated.txt", sep="-"))
  up <- subset(merged_df_sorted, padj<=0.05 & log2FoldChange>=2)
  down <- subset(merged_df_sorted, padj<=0.05 & log2FoldChange<=-2)
  viral <- merged_df_sorted[merged_df_sorted$external_gene_name %in% c("LT", "sT", "VP1", "VP2", "VP3"), ]
  write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(i, "up_annotated.txt", sep="-"))
  write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(i, "down_annotated.txt", sep="-"))
  write.csv(as.data.frame(viral), file = paste(i, "viral_annotated.txt", sep="-"))
#}

# -- 5. draw graphics --
library(ggrepel)
#geness_res <- read.csv(file = "p605_d8_vs_p600_d8-all_annotated.txt", sep=",", row.names=1)
#i<-"p605_d8_vs_p600_d8"
geness_res <- merged_df_sorted
# Color setting
geness_res$Color <- ifelse(geness_res$padj > 0.05 | abs(geness_res$log2FoldChange) < 2, "gray", 
                          ifelse(geness_res$log2FoldChange > 0, "red", "blue"))
# Predefined genes colored in green
predefined_genes <- c("LT", "sT", "VP1", "VP2", "VP3") 
geness_res$Color[geness_res$external_gene_name %in% predefined_genes] <- "green"
geness_res$invert_Padj <- (-log10(geness_res$padj)) * sign(geness_res$log2FoldChange)
top_g <- unique(c(geness_res[order(geness_res$invert_Padj, decreasing = TRUE), 'external_gene_name'][1:200],
                geness_res[order(geness_res$invert_Padj, decreasing = FALSE), 'external_gene_name'][1:200]))
# Define the original and compressed ranges
original_range <- c(80, 115)
compressed_range <- c(80.0, 90.0)
# Calculate breaks for the y-axis
y_breaks_below <- seq(0, 80, by=5)
y_breaks_compressed <- c(80.0, 90.0)
y_breaks_above <- c()
y_breaks <- c(y_breaks_below, y_breaks_compressed, y_breaks_above)
y_labels_below <- seq(0, 80, by=5)
y_labels_compressed <- c(80, 115)
y_labels_above <- c()
y_labels <- c(y_labels_below, y_labels_compressed, y_labels_above)
# Adjust the p-values based on the ranges
geness_res$adjusted_pvalue <- with(geness_res, 
                                  ifelse(-log10(padj) > original_range[1] & -log10(padj) <= original_range[2],
                                          ((-log10(padj) - original_range[1]) / (original_range[2] - original_range[1])) * (compressed_range[2] - compressed_range[1]) + compressed_range[1],
                                          ifelse(-log10(padj) > original_range[2], 
                                                -log10(padj) - (original_range[2] - original_range[1]) + (compressed_range[2] - compressed_range[1]),
                                                -log10(padj))))
# Create the plot
png(paste(i, "png", sep="."), width=1000, height=1000)
ggplot(geness_res, aes(x = log2FoldChange, y = adjusted_pvalue, color = Color, label = external_gene_name)) + 
  geom_vline(xintercept = c(2, -2), lty = "dashed", size = 1.5) +  
  geom_hline(yintercept = -log10(0.05), lty = "dashed", size = 1.5) +     
  geom_point(size = 3) +
  labs(x = "log2(Fold change)", y = "-log10(P-adj)", color = "Significance") + 
  scale_color_identity() +
  geom_text_repel(data = subset(geness_res, external_gene_name %in% top_g & padj < 0.05 & (abs(log2FoldChange) >= 2)), 
                  size = 7,   
                  point.padding = 0.15, 
                  color = "black", 
                  min.segment.length = .1, 
                  box.padding = .2, 
                  lwd = 2) + 
  theme_bw(base_size = 24) +
  theme(legend.position = "bottom") +
  #annotate("rect", xmin = -Inf, xmax = Inf, ymin = compressed_range[1], ymax = compressed_range[1], linetype = "dashed", color = "grey") +
  #annotate("rect", xmin = -Inf, xmax = Inf, ymin = compressed_range[2], ymax = compressed_range[2], linetype = "dashed", color = "grey") +
  #annotate("text", x = -Inf, y = compressed_range[1], label = "/", hjust = 0, size = 10) +
  #annotate("text", x = -Inf, y = compressed_range[2], label = "/", hjust = 0, size = 10) +
  scale_y_continuous(breaks = sort(y_breaks), labels = sort(y_labels))
dev.off()

#Under DIR degenes under KONSOLE and delete "start_position","end_position","strand" and so on.
#~/Tools/csv2xls-0.4/csv_to_xls.py viral_gene_counts.csv p604_d8_vs_p601_d8-viral_annotated.txt p605_d8_vs_p600_d8-viral_annotated.txt p602_d8_vs_p600_d8-viral_annotated.txt p604_d3_vs_p601_d3-viral_annotated.txt p605_d3_vs_p600_d3-viral_annotated.txt p602_d3_vs_p600_d3-viral_annotated.txt -d$',' -o viral_raw_counts_and_comparisons.xls;
#dpt: days post-transfection

Microbial bioinformatics

Microbial bioinformatics uses computational tools to analyze genomes, track evolution, and study functions in microorganisms, including bacteria and viruses.

Author Archives: gene_x

Transposable element (TEs) and structural variant (SV) detection in bacterial genomes

LT manuscript TODOs

Plasmid Design for Genetic Engineering

Scaffolding and finishing an assembly with a reference genome

Calling peaks using findPeaks of HOMER

A decade of advances in transposon-insertion sequencing

How to run the pipeline vrap?

Using vrap and bacto to process Acinetobacter baumannii isolates

Kraken2 Installation and Usage Guide

sT manuscript revision: log2FC of sT itself at 3 and 8 dpt