-
install rnaseq on sage
#-- version 2 -- conda install -c conda-forge mamba mamba create -n rnaseq -c conda-forge -c bioconda -c defaults python fastqc trim-galore star hisat2 picard csvtk preseq rseqc samtools conda activate rnaseq pip3 install deeptools pip3 install multiqc mamba install -c conda-forge -c bioconda -c defaults -c r stringtie subread gffread r-data.table r-gplots bioconductor-dupradar bioconductor-edger #(rnaseq) [jhuang@sage ~]$ which nextflow #/usr/local/bin/nextflow conda install -c bionconda fq mamba install -c bioconda ucsc-bedclip ucsc-bedgraphtobigwig mamba install -c bioconda rsem mamba install -c bioconda salmon mamba install -c conda-forge -c bioconda -c defaults -c r r-data.table r-gplots bioconductor-dupradar bioconductor-edger bioconductor-deseq2 mamba install -c conda-forge openjdk=17 conda install -c bioconda bedtools qualimap #under R install.packages("BiocManager") BiocManager::install("tximport") BiocManager::install("tximeta") install.packages("optparse") install.packages(“pheatmap”) (rnaseq2) [jhuang@sage Data_Soeren_RNA-seq_2023]$ nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results_GRCh38 --genome GRCh38 -profile test_full -resume --max_memory 300.GB --max_time 2400.h --save_reference --aligner star_salmon --skip_deseq2_qc (rnaseq) nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results_1585 --fasta 1585.fasta --gtf 1585_m_.gtf -profile test_full -resume --max_memory 200.GB --max_time 2400.h --save_reference --aligner star_salmon --skip_rseqc --skip_dupradar --skip_preseq --skip_biotype_qc --skip_deseq2_qc --min_mapped_reads 0 #--gtf_extra_attributes gene_name #-- version 1 -- #conda env create -f ~/Tools/rnaseq/environment.yml conda create -n rnaseq -c conda-forge -c bioconda -c defaults python=3.6 fastqc trim-galore star=2.6.1d hisat2 conda activate rnaseq conda install -c conda-forge -c bioconda -c defaults picard csvtk preseq rseqc samtools pip3 install deeptools pip3 install multiqc conda install -c bioconda stringtie subread gffread conda install -c conda-forge -c bioconda -c defaults -c r r-data.table r-gplots conda install -c conda-forge -c bioconda -c defaults -c r bioconductor-dupradar bioconductor-edger conda install nextflow=20.04
-
install chipseq
conda create -n chipseq -c conda-forge -c bioconda -c defaults -c r python=3.9 fastqc cutadapt trim-galore bwa samtools conda activate chipseq conda install picard bedtools phantompeakqualtools conda install -c conda-forge -c bioconda -c defaults -c r r-base #ERROR_SINCE_R_NGSPLOT_ONLY_FOR_PYTHON2 conda install -c conda-forge -c bioconda -c defaults -c r r-ngsplot sudo apt install macs conda install -c bioconda multiqc pip3 install deeptools conda install nextflow=20.04 #conda env chipseq2 is still NOT working. conda create -n chipseq2 -c conda-forge -c bioconda -c defaults python=2.7 fastqc cutadapt trim-galore bwa samtools conda activate chipseq2 conda install -c conda-forge -c bioconda -c defaults picard bedtools phantompeakqualtools conda install -c conda-forge -c bioconda -c defaults -c r deeptools r-base r-ngsplot macs multiqc conda install -c conda-forge -c bioconda -c defaults -c r r-ade4 r-assertthat r-base r-bh r-biocmanager r-bit r-bit64 r-bitops r-blob r-catools r-codetools r-crayon r-curl r-dbi r-digest r-domc r-foreach r-formatr r-futile.logger r-futile.options r-glue r-hms r-httr r-hwriter r-idr r-iterators r-jsonlite r-lambda.r r-lattice r-latticeextra r-lazyeval r-magrittr r-mass r-matrix r-matrixstats r-memoise r-mime r-openssl r-pkgconfig r-plogr r-prettyunits r-progress r-r6 r-rcolorbrewer r-rcpp r-rcurl r-rlang r-rsqlite r-segmented r-seqinr r-snow r-spp r-stringi r-stringr r-survival r-venndiagram r-xml conda install -c conda-forge -c bioconda -c defaults -c r bioconductor-annotationdbi bioconductor-annotationfilter bioconductor-biobase bioconductor-biocgenerics bioconductor-biocparallel bioconductor-biomart bioconductor-biostrings bioconductor-bsgenome bioconductor-chippeakanno bioconductor-delayedarray bioconductor-ensembldb bioconductor-genomeinfodb bioconductor-genomeinfodbdata bioconductor-genomicalignments bioconductor-genomicfeatures bioconductor-genomicranges bioconductor-go.db bioconductor-graph bioconductor-iranges bioconductor-limma bioconductor-multtest bioconductor-protgenerics bioconductor-rbgl bioconductor-regioner bioconductor-rsamtools bioconductor-rsubread bioconductor-rtracklayer bioconductor-s4vectors bioconductor-shortread bioconductor-summarizedexperiment bioconductor-xvector bioconductor-zlibbioc conda install nextflow=20.04 #first_try (chipseq) nextflow run NGI-ChIPseq/main.nf --reads '/home/jhuang/DATA/Data_Denise_LT_DNA_Binding/enhancer_analysis/Raw_Data/*.fastq.gz' --genome hg38 --macsconfig macs.config --saveReference --saveAlignedIntermediates --singleEnd --blacklist_filtering -profile standard --project NHDF_enhancer_analysis_hg38 -resume #--notrim (chipseq) nextflow run NGI-ChIPseq/main.nf --reads '/home/jhuang/DATA/Data_Denise_LT_DNA_Binding/enhancer_analysis/Raw_Data/*.fastq.gz' --genome hg38 --macsconfig macs.config --saveAlignedIntermediates --notrim --saveTrimmed --singleEnd --blacklist_filtering -profile standard --project NHDF_enhancer_analysis_hg38 -resume #test (chipseq) nextflow run NGI-ChIPseq/main.nf --reads '/home/jhuang/DATA/Data_Denise_LT_DNA_Binding/Raw_Data/p783_*.fastq.gz' --genome hg38 --macsconfig macs.config --saveReference --saveAlignedIntermediates --singleEnd --blacklist_filtering -profile standard --project NHDF_enhancer_analysis_hg38 -resume
-
install spandx
conda create --name spandx2 python=3.8 conda activate spandx2 conda install -c conda-forge -c bioconda -c defaults art trimmomatic bwa bedtools=2.28.0 seqtk pindel mosdepth samtools=1.9 picard gatk4 snpeff=4.3.1t nextflow=22 fasttree #[genbank copying] mkdir ~/miniconda3/envs/spandx2/share/snpeff-4.3.1t-5/data/WA_plasmid cp WA_plasmid.gb ~/miniconda3/envs/spandx2/share/snpeff-4.3.1t-5/data/WA_plasmid/genes.gbk vim ~/miniconda3/envs/spandx2/share/snpeff-4.3.1t-5/snpEff.config /home/jhuang/miniconda3/envs/spandx2/bin/snpEff build -genbank WA_plasmid -d nextflow run spandx/main.nf --fastq "Raw_Data_RNAseq_K331A_d8_SPANDx/*.fastq.gz" --ref LT_wt.fasta --annotation --database LT_wildtype --pairing SE -resume (spandx2) nextflow run spandx/main.nf --fastq "raw_data/*_R{1,2}.fastq.gz" --ref WA_chr.fasta --annotation --database WA_chr -resume (spandx2) nextflow run spandx/main.nf --fastq "raw_data/*_R{1,2}.fastq.gz" --ref WA_plasmid.fasta --annotation --database WA_plasmid -resume #Note that the files in variants contain only SNPs, but the files in snippy contain SNPs+INDELs which can be used to consensus the results of SPANDx, resulting in the final results of SNPs+INDELs. ~/DATA/Data_Benjamin_Yersinia_SNP/variants_WA_chr$ vim snippy.core.vcf ~/DATA/Data_Benjamin_Yersinia_SNP/snippy_WA_chr/Wacton_S96/Wacton_S96.txt ~/DATA/Data_Benjamin_Yersinia_SNP/snippy_WA_chr/Wacton_S96/Wacton_S96.filt.vcf #FOLLOWING INSTALLATION CANNOT FINISH CALCULATION --> MAYBE IT NOW WORKS, BUT NEEDS TO GIVE AT LEAST SAMPLES, AS SAME AS IN SPANDX2 conda create --name spandx python=3.8 #conda install -c conda-forge -c bioconda -c defaults art trimmomatic bwa bedtools seqtk pindel mosdepth samtools picard gatk4 snpeff nextflow fasttree #-->picard-3.0.0-1, gatk4-4.4.0.0-0, snpeff-5.1-2, trimmomatic-0.39-2 installed! #DEBUG1: Due to using snpeff-5.1-2 by deleting the option '-t' in all 'snpEff eff -t ...' since '-t' is not a option in the new version. #DEBUG2: Don't login spandx after login base, since in the case the env uses /home/jhuang/anaconda3/bin/java. # - /usr/bin/java: openjdk version "11.0.18" 2023-01-17 # - /home/jhuang/anaconda3/bin/java: openjdk version "1.8.0_152-release" # - /home/jhuang/anaconda3/envs/spandx/bin/java: openjdk version "17.0.3-internal" 2022-04-19 #DEBUG3: using '-' instead of '_' # mv V_8_2_4_p600_d8_DonorI.fastq.gz control-d8-DI.fastq.gz
-
install bengal3_ac3
conda create -n bengal3_ac3 python=3.6 conda activate bengal3_ac3 conda install -c conda-forge -c bioconda -c defaults prokka conda install -c conda-forge -c bioconda -c defaults shovill mlst conda install -c conda-forge -c bioconda -c defaults trimmomatic fastqc conda install -c conda-forge -c bioconda -c defaults roary conda install -c conda-forge -c bioconda -c defaults snippy conda install -c conda-forge -c bioconda -c defaults fasttree raxml-ng conda install -c conda-forge -c bioconda -c defaults gubbins snp-sites conda install -c conda-forge -c bioconda -c defaults openjdk=11 TODO: snippy is still NOT working in hamm, we have to run the modul "variants_calling" on hamburg.
-
install qiime1
conda create -n qiime1 -c bioconda qiime
-
export environments
#!/bin/bash # Get a list of conda environments env_list=$(conda env list | awk '{print $1}' | tail -n +4) # Iterate through each environment and export it to a YAML file for env in $env_list; do echo "conda activate $env" echo "conda env export > \"${env}_exported.yml\"" echo "conda deactivate" done #conda create --name spandx2 --clone spandx #conda env remove --name spandx #conda activate spandx #conda env export > spandx.yml #rsync -a -P *.yml xxx@xxx:~/xxx/xxx #conda env create -f spandx.yml conda activate base conda env export > "base_exported.yml" conda deactivate conda activate HLCA_mapping_env conda env export > "HLCA_mapping_env_exported.yml" conda deactivate conda activate assembly2 conda env export > "assembly2_exported.yml" conda deactivate conda activate bac3 conda env export > "bac3_exported.yml" conda deactivate conda activate bactopia conda env export > "bactopia_exported.yml" conda deactivate
Author Archives: gene_x
RNA-seq data analysis of Yersinia on GRCh38
YopM is a systematic antagonist of YopP in suppressing inflammatory gene expression in Yersinia-infected human macrophages
-
goal
# 0th heatmap see manuscript: The # 1.th heatmap mock WAC WAP 1.5 + 6 hours: WAC, WA314 comparing to mock # 2.th heatmap deltaYopM, deltaYopP, deltaYopQ Vergleich zu WAP (6 hours). The sample deltaYopMP is completed deleted: deltaM, deltaP, deltaQ to WA314
-
download the batch 1 data
#-- PRJEB10086 -- mock_6h_a mock_6h_b mock_90min_a mock_90min_b WA314_6h_a WA314_6h_b WA314_90min_a WA314_90min_b WA314dYopM_6h_a WA314dYopM_6h_b WA314dYopM_90min_a WA314dYopM_90min_b #-- PRJEB45780 -- wget ftp.sra.ebi.ac.uk/vol1/run/ERR592/ERR5924425/mock_DoA.fastq.gz wget ftp.sra.ebi.ac.uk/vol1/run/ERR592/ERR5924426/mock_DoII.fastq.gz #mv mock_DoA.fastq.gz mock_6h_DoA.fastq.gz #mv mock_DoII.fastq.gz mock_6h_DoII.fastq.gz ftp.sra.ebi.ac.uk/vol1/run/ERR592/ERR5924427/WAC1_5h_DoA.fastq.gz ftp.sra.ebi.ac.uk/vol1/run/ERR592/ERR5924428/WAC1_5h_DoII.fastq.gz ftp.sra.ebi.ac.uk/vol1/run/ERR592/ERR5924429/WAC6h_DoA.fastq.gz ftp.sra.ebi.ac.uk/vol1/run/ERR592/ERR5924430/WAC6h_DoII.fastq.gz ftp.sra.ebi.ac.uk/vol1/run/ERR592/ERR5924431/WAP1_5h_DoA.fastq.gz ftp.sra.ebi.ac.uk/vol1/run/ERR592/ERR5924421/WAP1_5h_DoII.fastq.gz ftp.sra.ebi.ac.uk/vol1/run/ERR592/ERR5924422/WAP6h_DoA.fastq.gz ftp.sra.ebi.ac.uk/vol1/run/ERR592/ERR5924423/WAP6h_DoII.fastq.gz ftp.sra.ebi.ac.uk/vol1/run/ERR592/ERR5924420/dP6h_DoII.fastq.gz ftp.sra.ebi.ac.uk/vol1/run/ERR592/ERR5924424/dP6h_DoIII.fastq.gz #-- PRJEB44958 -- ftp.sra.ebi.ac.uk/vol1/run/ERR608/ERR6087955/dYopP_1.5h_DonorII.fastq.gz ftp.sra.ebi.ac.uk/vol1/run/ERR608/ERR6087956/dYopP_1.5h_DonorIII.fastq.gz wget ftp.sra.ebi.ac.uk/vol1/run/ERR608/ERR6087951/dYopMP_1.5h_DonorII.fastq.gz wget ftp.sra.ebi.ac.uk/vol1/run/ERR608/ERR6087952/dYopMP_1.5h_DonorIII.fastq.gz wget ftp.sra.ebi.ac.uk/vol1/run/ERR608/ERR6087953/dYopMP_6h_DonorII.fastq.gz wget ftp.sra.ebi.ac.uk/vol1/run/ERR608/ERR6087954/dYopMP_6h_DonorIII.fastq.gz # ---- 1st. batch: 1.5h + 6h ---- mock WAC WA314 == WAP dYopM,dYopP E-MTAB-10602 E-MTAB-10473 Project: PRJEB45780 Pathogenic bacteria Yersinia enterocolitica injects virulence plasmid-encoded effectors through the type three secretion system into macrophages to modulate gene expression. Here we analyzed the effect on gene expression in primary human macrophages of Y. enterocolitica strains lacking effector YopP (1.5 h infection) or effectors YopP and YopM (1.5 h or 6 h infection) simultaneously using RNA-seq. This is part of a larger sequencing experiment for which other samples can be found in EMBL-EBI (www.ebi.ac.uk/arrayexpress) under accession number E-MTAB-10473 and European Nucleotide Archive (ENA) at http://www.ebi.ac.uk/ena/data/view/PRJEB10086. ./ena-file-download-read_run-PRJEB10086-submitted_ftp-20231220-1619_sorted.sh #? check WA314_90min_a.fastq.gz == WAP1_5h_DoA.fastq.gz #? check WA314_90min_b.fastq.gz == WAP1_5h_DoII.fastq.gz #? check WA314_6h_a.fastq.gz == WAP6h_DoA.fastq.gz #? check WA314_6h_b.fastq.gz == WAP6h_DoII.fastq.gz #TODO: if they are not the same, take WA314_* as the WAP-samples in the 1st batch! cat mock_90min_a_L004_R1_001.fastq.gz mock_90min_a_L005_R1_001.fastq.gz mock_90min_a_L006_R1_001.fastq.gz mock_90min_a_L007_R1_001.fastq.gz mock_90min_a_L008_R1_001.fastq.gz > mock_90min_a.fastq.gz cat mock_90min_b_L004_R1_001.fastq.gz mock_90min_b_L005_R1_001.fastq.gz mock_90min_b_L006_R1_001.fastq.gz mock_90min_b_L007_R1_001.fastq.gz mock_90min_b_L008_R1_001.fastq.gz > mock_90min_b.fastq.gz cat mock_6h_a_L004_R1_001.fastq.gz mock_6h_a_L005_R1_001.fastq.gz mock_6h_a_L006_R1_001.fastq.gz mock_6h_a_L007_R1_001.fastq.gz mock_6h_a_L008_R1_001.fastq.gz > mock_6h_a.fastq.gz cat mock_6h_b_L004_R1_001.fastq.gz mock_6h_b_L005_R1_001.fastq.gz mock_6h_b_L006_R1_001.fastq.gz mock_6h_b_L007_R1_001.fastq.gz mock_6h_b_L008_R1_001.fastq.gz > mock_6h_b.fastq.gz cat WA314_90min_a_L004_R1_001.fastq.gz WA314_90min_a_L005_R1_001.fastq.gz WA314_90min_a_L006_R1_001.fastq.gz WA314_90min_a_L007_R1_001.fastq.gz WA314_90min_a_L008_R1_001.fastq.gz > WA314_90min_a.fastq.gz cat WA314_90min_b_L004_R1_001.fastq.gz WA314_90min_b_L005_R1_001.fastq.gz WA314_90min_b_L006_R1_001.fastq.gz WA314_90min_b_L007_R1_001.fastq.gz WA314_90min_b_L008_R1_001.fastq.gz > WA314_90min_b.fastq.gz cat WA314_6h_a_L004_R1_001.fastq.gz WA314_6h_a_L005_R1_001.fastq.gz WA314_6h_a_L006_R1_001.fastq.gz WA314_6h_a_L007_R1_001.fastq.gz WA314_6h_a_L008_R1_001.fastq.gz > WA314_6h_a.fastq.gz cat WA314_6h_b_L004_R1_001.fastq.gz WA314_6h_b_L005_R1_001.fastq.gz WA314_6h_b_L006_R1_001.fastq.gz WA314_6h_b_L007_R1_001.fastq.gz WA314_6h_b_L008_R1_001.fastq.gz > WA314_6h_b.fastq.gz cat WA314dYopM_90min_a_L004_R1_001.fastq.gz WA314dYopM_90min_a_L005_R1_001.fastq.gz WA314dYopM_90min_a_L006_R1_001.fastq.gz WA314dYopM_90min_a_L007_R1_001.fastq.gz WA314dYopM_90min_a_L008_R1_001.fastq.gz > WA314dYopM_90min_a.fastq.gz cat WA314dYopM_90min_b_L004_R1_001.fastq.gz WA314dYopM_90min_b_L005_R1_001.fastq.gz WA314dYopM_90min_b_L006_R1_001.fastq.gz WA314dYopM_90min_b_L007_R1_001.fastq.gz WA314dYopM_90min_b_L008_R1_001.fastq.gz > WA314dYopM_90min_b.fastq.gz cat WA314dYopM_6h_a_L004_R1_001.fastq.gz WA314dYopM_6h_a_L005_R1_001.fastq.gz WA314dYopM_6h_a_L006_R1_001.fastq.gz WA314dYopM_6h_a_L007_R1_001.fastq.gz WA314dYopM_6h_a_L008_R1_001.fastq.gz > WA314dYopM_6h_a.fastq.gz cat WA314dYopM_6h_b_L004_R1_001.fastq.gz WA314dYopM_6h_b_L005_R1_001.fastq.gz WA314dYopM_6h_b_L006_R1_001.fastq.gz WA314dYopM_6h_b_L007_R1_001.fastq.gz WA314dYopM_6h_b_L008_R1_001.fastq.gz > WA314dYopM_6h_b.fastq.gz #END ./WAC1_5h_DoA.fastq.gz ./WAC1_5h_DoII.fastq.gz ./WAC6h_DoA.fastq.gz ./WAC6h_DoII.fastq.gz ./WAP1_5h_DoA.fastq.gz ./WAP1_5h_DoII.fastq.gz ./WAP6h_DoA.fastq.gz ./WAP6h_DoII.fastq.gz ./dYopP_1.5h_DonorII.fastq.gz ./dYopP_1.5h_DonorIII.fastq.gz ./dP6h_DoII.fastq.gz ./dP6h_DoIII.fastq.gz mv dP6h_DoII.fastq.gz dYopP_6h_DoII.fastq.gz mv dP6h_DoIII.fastq.gz dYopP_6h_DoIII.fastq.gz mv WAC1_5h_DoA.fastq.gz WAC_1.5h_DoA.fastq.gz mv WAC1_5h_DoII.fastq.gz WAC_1.5h_DoII.fastq.gz mv WAC6h_DoA.fastq.gz WAC_6h_DoA.fastq.gz mv WAC6h_DoII.fastq.gz WAC_6h_DoII.fastq.gz mv WAP1_5h_DoA.fastq.gz WAP_1.5h_DoA.fastq.gz mv WAP1_5h_DoII.fastq.gz WAP_1.5h_DoII.fastq.gz mv WAP6h_DoA.fastq.gz WAP_6h_DoA.fastq.gz mv WAP6h_DoII.fastq.gz WAP_6h_DoII.fastq.gz mv WA314dYopM_90min_a.fastq.gz dYopM_90min_a.fastq.gz mv WA314dYopM_90min_b.fastq.gz dYopM_90min_b.fastq.gz mv WA314dYopM_6h_a.fastq.gz dYopM_6h_a.fastq.gz mv WA314dYopM_6h_b.fastq.gz dYopM_6h_b.fastq.gz #Desperated! #wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR592/ERR5924425/mock_DoA.fastq.gz #wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR592/ERR5924426/mock_DoII.fastq.gz #wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR608/ERR6087953/dYopMP_6h_DonorII.fastq.gz #wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR608/ERR6087954/dYopMP_6h_DonorIII.fastq.gz #wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR608/ERR6087951/dYopMP_1.5h_DonorII.fastq.gz #wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR608/ERR6087952/dYopMP_1.5h_DonorIII.fastq.gz
-
download the public data and prepare the batch 2 data
# ---- data public (naive and LPS is public data)---- jhuang@hamburg:/media/jhuang/Elements2/Data_Indra_RNASeq_GSM2262901/raw_data #Dataset name GEO ID Series Experiment #- naive macrophages GSM2679941 GSE100382 RNA-seq (RNA_N_rep1) # SRR5746425 4,000,000 204M 136Mb 2017-06-25 # SRR5746426 4,000,000 204M 137Mb 2017-06-25 # SRR5746427 4,000,000 204M 137.2Mb 2017-06-25 # SRR5746428 4,000,000 204M 135.6Mb 2017-06-25 # SRR5746429 4,000,000 204M 134.5Mb 2017-06-25 # SRR5746430 4,000,000 204M 136.2Mb 2017-06-25 # SRR5746431 903,661 46.1M 32.7Mb 2017-06-25 cp RNA_N_rep1.fastq.gz ../raw_data/naive_rep1.fastq.gz #- naive macrophages GSM2679942 GSE100382 RNA-seq (RNA_N_rep2) # SRR5746432 14,176,210 723M 350.2Mb 2017-06-25 # SRR5746433 14,157,173 722M 353.4Mb 2017-06-25 cp RNA_N_rep2.fastq.gz ../raw_data/naive_rep2.fastq.gz #- LPS stimulated macrophages GSM2679944 GSE100382 RNA-seq (RNA_L_rep1) # SRR5746436 4,000,000 204M 135.8Mb 2017-06-25 # SRR5746437 4,000,000 204M 137.1Mb 2017-06-25 # SRR5746438 4,000,000 204M 136.6Mb 2017-06-25 # SRR5746439 4,000,000 204M 135.4Mb 2017-06-25 # SRR5746440 4,000,000 204M 135.3Mb 2017-06-25 # SRR5746441 2,178,850 111.1M 76.7Mb 2017-06-25 cp RNA_L_rep1.fastq.gz ../raw_data/LPS_rep1.fastq.gz #- LPS stimulated macrophages GSM2679945 GSE100382 RNA-seq (RNA_L_rep2) # SRR5746442 17,351,249 884.9M 428Mb 2017-06-25 # SRR5746443 17,318,750 883.3M 431.8Mb 2017-06-25 cp RNA_L_rep2.fastq.gz ../raw_data/LPS_rep2.fastq.gz #- naive macrophages GSM2262901 GSE85243 RNA-seq (RPMI_d6_6095) # SRR4004025 11,559,434 485.5M 338.3Mb 2016-11-21 cp RPMI_d6_6095.fastq.gz ../raw_data/naive_rep3.fastq.gz #- naive macrophages GSM2262902 GSE85243 RNA-seq (RPMI_d6_6718) # SRR4004026 22,911,696 962.3M 635Mb 2016-11-21 cp RPMI_d6_6718.fastq.gz ../raw_data/naive_rep4.fastq.gz #- LPS stimulated macrophages GSM2262906 GSE85243 RNA-seq (RPMI_d6_Restim_8749) # SRR4004030 26,566,129 1.1G 681.3Mb 2016-11-21 cp RPMI_d6_Restim_8749.fastq.gz ../raw_data/LPS_rep3.fastq.gz #- LPS stimulated macrophages GSM2262907 GSE85243 RNA-seq (RPMI_d6_Restim_8754) # SRR4004031 28,719,690 1.2G 731.2Mb 2016-11-21 cp RPMI_d6_Restim_8754.fastq.gz ../raw_data/LPS_rep4.fastq.gz #END # ---- 2nd. batch: 6h jhuang@hamburg:~/DATA/Data_Soeren_RNA-seq_2022/Raw_Data ---- mock WAC WAP deltaYopQ --> Vergleich zu WAP mv MOCK_A.fastq.gz ../raw_data/ mv MOCK_B.fastq.gz ../raw_data/ mv WAC_A.fastq.gz ../raw_data/ mv WAC_B.fastq.gz ../raw_data/ mv WAP_A.fastq.gz ../raw_data/ mv WAP_B.fastq.gz ../raw_data/ mv deltaQ_A.fastq.gz ../raw_data/ mv deltaQ_B.fastq.gz ../raw_data/
-
run nextflow
(rnaseq) [jhuang@sage DATA]$ pwd /home/jhuang/DATA /home/jhuang/REFs/Homo_sapiens/Ensembl #"/home/jhuang/REFs/Homo_sapiens/UCSC/hg38/blacklists/hg38-blacklist.bed" nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results_GRCh38 --genome GRCh38 -profile test_full -resume --max_memory 300.GB --max_time 2400.h --save_reference --aligner star_salmon --skip_deseq2_qc --skip_fastqc ln -s ~/Tools/rnaseq/assets/multiqc_config.yaml multiqc_config.yaml multiqc -f --config multiqc_config.yaml . 2>&1 rm multiqc_config.yaml
-
summary:
#ENSG00000000419 DPM1 385.616 #grep "ENSG00000000419" quant.genes.sf #ENSG00000000419 1159.45 909.453 15.8259 385.616 Attached, you will find the results of the RNASeq analysis discussed in our last online meeting. The raw counts are located in Yersinia_RNAseq_results_GRCh38/star_salmon/salmon.merged.gene_counts.tsv. Note that this table was generated using the STAR-Salmon strategy, with STAR aligning the reads and Salmon quantifying transcript abundances. Salmon's advanced probabilistic model assigns fractional counts to transcripts, particularly when reads map to multiple similar transcripts. This is the reason non-integer numbers appear in this table. The samples included in the analysis are as follows: batch 1: ./mock_90min_a.fastq.gz ./mock_90min_b.fastq.gz ./mock_6h_a.fastq.gz ./mock_6h_b.fastq.gz ./WAC_1.5h_DoA.fastq.gz ./WAC_1.5h_DoII.fastq.gz ./WAC_6h_DoA.fastq.gz ./WAC_6h_DoII.fastq.gz ./WAP_1.5h_DoA.fastq.gz ./WAP_1.5h_DoII.fastq.gz ./WAP_6h_DoA.fastq.gz ./WAP_6h_DoII.fastq.gz ./WA314_90min_a.fastq.gz ./WA314_90min_b.fastq.gz ./WA314_6h_a.fastq.gz ./WA314_6h_b.fastq.gz ./dYopM_90min_a.fastq.gz ./dYopM_90min_b.fastq.gz ./dYopM_6h_a.fastq.gz ./dYopM_6h_b.fastq.gz ./dYopP_1.5h_DonorII.fastq.gz ./dYopP_1.5h_DonorIII.fastq.gz ./dYopP_6h_DoII.fastq.gz ./dYopP_6h_DoIII.fastq.gz ./mock_6h_DoA.fastq.gz ./mock_6h_DoII.fastq.gz ./dYopMP_1.5h_DonorII.fastq.gz ./dYopMP_1.5h_DonorIII.fastq.gz ./dYopMP_6h_DonorII.fastq.gz ./dYopMP_6h_DonorIII.fastq.gz batch public: naive_rep1.fastq.gz naive_rep2.fastq.gz LPS_rep1.fastq.gz LPS_rep2.fastq.gz naive_rep3.fastq.gz naive_rep4.fastq.gz LPS_rep3.fastq.gz LPS_rep4.fastq.gz batch 2: MOCK_A.fastq.gz (6h) MOCK_B.fastq.gz (6h) WAC_A.fastq.gz (6h) WAC_B.fastq.gz (6h) WAP_A.fastq.gz (6h) WAP_B.fastq.gz (6h) deltaQ_A.fastq.gz (6h) deltaQ_B.fastq.gz (6h)
merging of hg38 counts of two tables and cbind virus counts
#-- merge the hg38 counts of 38 samples and the hg38 counts of LT_K331A --
# Load necessary libraries
library(dplyr)
# Read the first table
first_table <- read.csv("merged_gene_counts.csv", stringsAsFactors = FALSE)
# Read the second table
second_table <- read.csv("~/DATA/Data_Denise_LT_K331A_RNASeq/results_GRCh38/star_salmon/gene_name_gene_counts.csv", sep = ',', stringsAsFactors = FALSE)
# Extract the relevant columns from the second table
second_table_filtered <- second_table %>%
select(gene_name, LT_K331A_DI, LT_K331A_DII)
# Summarise the second table by gene_name, summing the LT_K331A_DI and LT_K331A_DII columns for duplicate gene names
second_table_summarised <- second_table_filtered %>%
group_by(gene_name) %>%
summarise(LT_K331A_DI = sum(LT_K331A_DI, na.rm = TRUE), LT_K331A_DII = sum(LT_K331A_DII, na.rm = TRUE))
# Merge the tables by matching the 'gene_name' from the second table to the gene_name of the first table
merged_table <- left_join(first_table, second_table_summarised, by = "gene_name")
# Replace NA values with 0
merged_table[is.na(merged_table)] <- 0
# View the result
print(head(merged_table))
# Optionally, write the result to a new file
write.csv(merged_table, "gene_counts_hg38_30samples.csv", row.names = TRUE)
#-- cbind the hg38 and virus counts --
# Load necessary libraries
library(dplyr)
# Read the first table
gene_counts_hg38_30samples <- read.csv("gene_counts_hg38_30samples.csv", stringsAsFactors = FALSE, row.names=1)
# Read the second table
salmon_merged_gene_counts <- read.csv("~/DATA/Data_Denise_LT_K331A_RNASeq/results_JN707599/star_salmon/salmon.merged.gene_counts.csv", sep = ',', stringsAsFactors = FALSE)
# Rename the columns in the second table as specified
names(salmon_merged_gene_counts)[names(salmon_merged_gene_counts) == "LT_K331A_d8_DonorI"] <- "LT_K331A_DI"
names(salmon_merged_gene_counts)[names(salmon_merged_gene_counts) == "LT.K331A.d8.DII_re"] <- "LT_K331A_DII"
# Determine which columns are in both tables
common_columns <- intersect(names(gene_counts_hg38_30samples), names(salmon_merged_gene_counts))
# Select only the common columns from the second table (ignoring the extra 5 columns)
salmon_merged_gene_counts <- salmon_merged_gene_counts[, common_columns]
# Sort the columns of the second table to match the first table
# Ensuring that 'gene_name' remains the first column
cols_order <- c("gene_name", setdiff(names(gene_counts_hg38_30samples), "gene_name"))
salmon_merged_gene_counts <- salmon_merged_gene_counts[, cols_order]
# Merge the tables by pasting together
# Since the columns are now aligned, this should paste them side by side
merged_table <- cbind(gene_counts_hg38_30samples, salmon_merged_gene_counts[,-1]) # -1 to not duplicate gene_name
# Replace NA values with 0 in the entire table
merged_table[is.na(merged_table)] <- 0
# View the result
print(head(merged_table))
# Optionally, write the result to a new file
write.csv(merged_table, "updated_gene_counts_hg38_30samples.csv", row.names = FALSE)
DNA-damage specific ubiquitination
DNA损伤特异性泛素化是指在DNA损伤响应中,泛素分子特异性地附加到蛋白质上的细胞过程。这一过程是DNA损伤响应(DDR)机制的重要组成部分,DDR是一种复杂的细胞网络,负责检测和修复受损DNA,维护基因组稳定性,防止突变。以下是更详细的概述:
- 泛素化:这是一种翻译后修饰过程,其中一个名为泛素的小蛋白质共价地附加到目标蛋白质上的赖氨酸残基上。这个过程由一系列酶进行:E1(泛素激活酶),E2(泛素结合酶)和E3(泛素连接酶)。这些酶通过级联反应将泛素从E1转移到E2,最终通过E3的作用转移到目标蛋白上。
- DNA损伤:DNA可以由各种因素损伤,如紫外线辐射、化学试剂和代谢副产品。当检测到这种损伤时,细胞启动DDR以修复DNA并防止在细胞分裂过程中传播错误。
- 泛素化在DDR中的作用:对DNA损伤的响应中,涉及DDR途径的特定蛋白质被泛素化。这种泛素化可以具有几种功能:
* 信号传导:泛素化可以改变被修饰蛋白的功能,激活或失活酶,或作为其他蛋白质识别并结合到受损部位的信号。
* 招募:它可以帮助招募DNA修复蛋白到损伤部位。例如,泛素链可以被具有泛素结合域的蛋白质识别,导致它们在DNA损伤部位累积。
* 蛋白酶体降解:在某些情况下,泛素化将蛋白质靶向蛋白酶体降解,蛋白酶体是一个大型的蛋白质复合体,负责降解和回收蛋白质。这可以帮助调节DDR中涉及的蛋白质水平或移除可能妨碍修复过程的蛋白质。
- 泛素链的类型:泛素链的性质(例如,泛素分子之间的连接类型)可以决定泛素化的结果。不同的连接可以导致目标蛋白质的不同命运,包括蛋白酶体降解、蛋白质相互作用的变化或活性的改变。
理解DNA损伤特异性泛素化在分子生物学和遗传学领域至关重要,特别是在研究癌症和其他疾病中,这些疾病通常存在DNA损伤和修复机制的失调。它也是开发治疗药物的目标,因为调节这些途径可以增强癌症治疗(如化疗和放疗)的效果,这些治疗依赖于在癌细胞中引起DNA损伤。
Rad51 Foci and UL36DUB: The formation of Rad51 foci, which are crucial for DNA repair, is not impacted by the deubiquitinating activity of UL36, a protein found in herpesviruses.
DUB Activity in Human Herpesviruses: The deubiquitinating (DUB) activity, which involves the removal of ubiquitin from proteins, is a conserved function across human herpesviruses, influencing their interaction with host cells.
Role of ICP0 in the DNA damage response
Rad51 foci refer to the visible aggregations or clusters of the Rad51 protein within the nucleus of a cell, typically observed using fluorescence microscopy. Rad51 is a crucial protein involved in the homologous recombination repair pathway, a mechanism used by cells to repair double-strand breaks in DNA. The formation of Rad51 foci is an important cellular response to DNA damage, particularly during the S and G2 phases of the cell cycle. Here’s a bit more detail:
- Function of Rad51: Rad51 plays a central role in the search for homology and strand pairing during homologous recombination. It forms a nucleoprotein filament on single-stranded DNA (ssDNA) that is essential for the homologous pairing and strand exchange with a complementary DNA strand. This process is critical for accurate DNA repair.
- Formation of Rad51 Foci: In response to DNA damage, particularly double-strand breaks, Rad51 is recruited to the sites of damage and forms nuclear foci. These foci represent the active sites where DNA repair is taking place.
- Detection of Rad51 Foci: Rad51 foci can be visualized using immunofluorescence techniques, where antibodies specific to Rad51 are used, followed by detection with fluorescent dyes. The presence and number of these foci in a cell can be indicative of the cell’s ability to repair DNA damage effectively.
- Clinical Relevance: The formation of Rad51 foci can be used as a biomarker for assessing the proficiency of homologous recombination repair in cells. This has implications in cancer biology, as cells deficient in homologous recombination repair (such as those lacking functional BRCA1 or BRCA2) exhibit impaired formation of Rad51 foci. These cells are more susceptible to certain types of chemotherapy, such as PARP inhibitors.
In summary, Rad51 foci are indicators of active DNA repair processes within the cell, and their study is important in understanding cellular responses to DNA damage, as well as in the context of cancer diagnosis and treatment.
Early viral and host transcriptome atlas of lytic nuclear DNA virus infection
A “lytic nuclear DNA virus infection” refers to a type of viral infection where a DNA virus enters a host cell, replicates in the nucleus, and eventually causes the lysis (destruction) of the cell. Here’s a breakdown of the process:
-
Infection: The virus attaches to the host cell and injects its DNA into it. For nuclear DNA viruses, this DNA is transported into the nucleus of the host cell.
-
Replication and Transcription: Inside the nucleus, the viral DNA is replicated and transcribed using the host cell’s machinery. This process leads to the production of viral proteins and the replication of the viral genome.
-
Assembly: Newly formed viral genomes and structural proteins are assembled into new virus particles within the host cell.
-
Lysis and Release: In the lytic cycle, once a significant number of new viruses have been produced, the host cell’s membrane is disrupted (lysis), leading to the death of the cell and the release of the new viruses, which can then infect surrounding cells.
This lytic phase is contrasted with the lysogenic phase seen in some viruses, where the viral DNA integrates into the host genome and can remain dormant for a period before becoming active.
Lytic nuclear DNA viruses include many well-known pathogens such as the Herpesviruses (including Herpes Simplex Virus and Varicella-Zoster Virus), Adenoviruses, and some types of Papillomaviruses. These viruses are significant in medicine due to their roles in various diseases, ranging from mild conditions like colds and skin warts to more severe illnesses like certain cancers and encephalitis. Understanding the lytic cycle of these viruses is crucial for developing treatments and preventive measures against these viral infections.
4sU-tagging, ribosome profiling and ATAC-Seq
https://www.virologie.uni-wuerzburg.de/virologie/ags-virologie/ag-doelken/research/
-
Principle of 4sU-tagging: 4-thiouridine (4sU) is added to cell culture medium to start metabolic labelling of newly transcribed (nascent) RNA. 5 to 60 min later total RNA is isolated. Following thiol-specific biotinylation, total RNA is separated into labelled nascent and unlabelled pre-existing RNA. All three RNA fractions are suitable for down-stream analyses including qRT-PCR, microarrays and RNA-seq.
-
Principle of ribosome profiling: Following a mild cell lysis, a stringent RNAse digest is performed. The short fragment of RNA molecules actively being translated at the time of cell lysis is protected from RNAse treatment within the ribosome and can then be recovered, cloned and sequenced. Thereby, a detailed picture of translational activity at the time of cell lysis is obtained
-
Chromatin accessibility profiling by ATAC-Seq
https://www.nature.com/articles/s41596-022-00692-9 https://www.illumina.com/techniques/multiomics/epigenetics/atac-seq-chromatin-accessibility.html
-
The assay for transposase-accessible chromatin with sequencing (ATAC-Seq) is a popular method for determining chromatin accessibility across the genome. By sequencing regions of open chromatin, ATAC-Seq can help you uncover how chromatin packaging and other factors affect gene expression. ATAC-Seq does not require prior knowledge of regulatory elements, making it a powerful epigenetic discovery tool. It has been used to better understand chromatin accessibility, transcription factor binding, and gene regulation in complex diseases, embryonic development, T-cell activation, and cancer. ATAC-Seq can be performed on bulk cell populations or on single cells at high resolution.
-
How Does ATAC-Seq Work? In ATAC-Seq, genomic DNA is exposed to Tn5, a highly active transposase. Tn5 simultaneously fragments DNA, preferentially inserts into open chromatin sites, and adds sequencing primers (a process known as tagmentation). The sequenced DNA identifies the open chromatin and data analysis can provide insight into gene regulation. Additionally, ATAC-Seq can be combined with other methods, such as RNA sequencing, for a multiomic approach to studying gene expression. Subsequent experiments often include ChIP-Seq, Methyl-Seq, or Hi-C-Seq to further characterize forms of epigenetic regulation.
Which hypervariable regions do the amplicon sequences belong to: V3–V4 or V4–V5?
#Questions about V3–V4 primers for 16S rRNA amplicon sequencing, and calculating overlap
https://forum.qiime2.org/t/questions-about-v3-v4-primers-for-16s-rrna-amplicon-sequencing-and-calculating-overlap/20250
#Comparison of Two 16S rRNA Primers (V3–V4 and V4–V5) for Studies of Arctic Microbial Communities
https://pure.mpg.de/rest/items/item_3335003/component/file_3339382/content
#Two different hypervariable regions of the bacterial 16S rRNA gene were amplified using aliquots of the isolated DNA from each sample.
- The V3–V4 region was amplified using the S-D-Bact-0341-b-S-17 (5′-CCTACGGGNGGCWGCAG-3′) and the S-D-Bact-0785-a-A-21 (5′-GACTACHVGGGTATCTAATCC-3′) primers (Klindworth et al., 2013).
- The V4–V5 regions was amplified using the 515F-Y (5′-GTGYCAGCMGCCGCGGTAA-3′) and the 926R (5′-CCGYCAATTYMTTTRAGTTT-3′) primers (Parada et al., 2016).
A. Klindworth, E. Pruesse, T. Schweer, J. Peplies, C. Quast, M. Horn, F.O. Glöckner
Evaluation of general 16S ribosomal RNA gene PCR primers for classical and next-generation sequencing-based diversity studies
Nucleic Acids Res., 41 (2013), 10.1093/nar/gks808
e1–e1
# check if the data are from V3-V4 or V4-V5?
mkdir pandaseq.out_V3_V4
for file in cutadapted_16S/filtered_R1/*_R1.fastq.gz; do echo "pandaseq -f ${file} -r ${file/_R1/_R2} -l 300 -p CCTACGGGNGGCWGCAG -q GACTACHVGGGTATCTAATCC -w pandaseq.out_V3_V4/$(echo $file | cut -d'/' -f3 | cut -d'_' -f1)_merged.fasta >> LOG_pandaseq_V3_V4"; done
mkdir pandaseq.out_V4_V5
for file in cutadapted_16S/filtered_R1/*_R1.fastq.gz; do echo "pandaseq -f ${file} -r ${file/_R1/_R2} -l 300 -p GTGYCAGCMGCCGCGGTAA -q GACTACHVGGGTATCTAATCC -w pandaseq.out_V4_V5/$(echo $file | cut -d'/' -f3 | cut -d'_' -f1)_merged.fasta >> LOG_pandaseq_V4_V5"; done
#repalce the second _R1 to _R2
pandaseq -f cutadapted_16S/filtered_R1/AH-LH_R1.fastq.gz -r cutadapted_16S/filtered_R2/AH-LH_R2.fastq.gz -l 300 -p CCTACGGGNGGCWGCAG -q GACTACHVGGGTATCTAATCC -w pandaseq.out_V3_V4/AH-LH_merged.fasta >> LOG_pandaseq_V3_V4
pandaseq -f cutadapted_16S/filtered_R1/AH-NLH_R1.fastq.gz -r cutadapted_16S/filtered_R2/AH-NLH_R2.fastq.gz -l 300 -p CCTACGGGNGGCWGCAG -q GACTACHVGGGTATCTAATCC -w pandaseq.out_V3_V4/AH-NLH_merged.fasta >> LOG_pandaseq_V3_V4
pandaseq -f cutadapted_16S/filtered_R1/AH-Nose_R1.fastq.gz -r cutadapted_16S/filtered_R2/AH-Nose_R2.fastq.gz -l 300 -p CCTACGGGNGGCWGCAG -q GACTACHVGGGTATCTAATCC -w pandaseq.out_V3_V4/AH-Nose_merged.fasta >> LOG_pandaseq_V3_V4
pandaseq -f cutadapted_16S/filtered_R1/AL-LH_R1.fastq.gz -r cutadapted_16S/filtered_R2/AL-LH_R2.fastq.gz -l 300 -p CCTACGGGNGGCWGCAG -q GACTACHVGGGTATCTAATCC -w pandaseq.out_V3_V4/AL-LH_merged.fasta >> LOG_pandaseq_V3_V4
pandaseq -f cutadapted_16S/filtered_R1/AH-LH_R1.fastq.gz -r cutadapted_16S/filtered_R2/AH-LH_R2.fastq.gz -l 300 -p GTGYCAGCMGCCGCGGTAA -q GACTACHVGGGTATCTAATCC -w pandaseq.out_V4_V5/AH-LH_merged.fasta >> LOG_pandaseq_V4_V5
pandaseq -f cutadapted_16S/filtered_R1/AH-NLH_R1.fastq.gz -r cutadapted_16S/filtered_R2/AH-NLH_R2.fastq.gz -l 300 -p GTGYCAGCMGCCGCGGTAA -q GACTACHVGGGTATCTAATCC -w pandaseq.out_V4_V5/AH-NLH_merged.fasta >> LOG_pandaseq_V4_V5
pandaseq -f cutadapted_16S/filtered_R1/AH-Nose_R1.fastq.gz -r cutadapted_16S/filtered_R2/AH-Nose_R2.fastq.gz -l 300 -p GTGYCAGCMGCCGCGGTAA -q GACTACHVGGGTATCTAATCC -w pandaseq.out_V4_V5/AH-Nose_merged.fasta >> LOG_pandaseq_V4_V5
pandaseq -f cutadapted_16S/filtered_R1/AL-LH_R1.fastq.gz -r cutadapted_16S/filtered_R2/AL-LH_R2.fastq.gz -l 300 -p GTGYCAGCMGCCGCGGTAA -q GACTACHVGGGTATCTAATCC -w pandaseq.out_V4_V5/AL-LH_merged.fasta >> LOG_pandaseq_V4_V5
#-p GTGYCAGCMGCCGCGGTAA -q GACTACHVGGGTATCTAATCC --> Used!
#-rwxrwxrwx 1 jhuang jhuang 39590797 Jan 11 17:09 AH-LH_merged.fasta (80630 sequences)
pandaseq -f cutadapted_16S/filtered_R1/AH-LH_R1.fastq.gz -r cutadapted_16S/filtered_R2/AH-LH_R2.fastq.gz -l 300 -w pandaseq.out_notfiltering_with_primer/AH-LH_merged.fasta >> LOG_pandaseq_notfiltering_with_primer
EV vs OMV
The terms “EV” and “outer membrane vesicle (OMV)” are related to biological structures, particularly in the context of bacterial or cell biology. They refer to different types of vesicles or membrane-bound structures, and here’s the difference between them:
-
EV (Extracellular Vesicle):
- Definition: EV is a broad term used to describe vesicles that are released from cells into the extracellular environment. These vesicles can be produced by various cell types, including eukaryotic cells (such as human cells) and some prokaryotic cells (like bacteria).
- Origin: EVs can originate from different cellular compartments, including the plasma membrane, endosomes, and even the cytoplasm. They are involved in intercellular communication, transporting various molecules (such as proteins, nucleic acids, lipids) between cells.
- Function: EVs have diverse functions, including cell-to-cell signaling, immune response modulation, and the transfer of genetic material between cells. They play a crucial role in various physiological and pathological processes.
-
Outer Membrane Vesicle (OMV):
- Definition: OMVs are a specific type of extracellular vesicle produced by Gram-negative bacteria. They are derived from the outer membrane of these bacteria and contain various components from the bacterial cell envelope, including lipopolysaccharides (LPS), outer membrane proteins, and periplasmic contents.
- Origin: OMVs are produced as blebs or protrusions from the outer membrane of Gram-negative bacteria. They are released into the extracellular environment and can be found in the surrounding medium during bacterial growth.
- Function: OMVs have multiple functions in bacterial physiology and interactions with their environment. They can serve as a delivery system for bacterial toxins, virulence factors, and other molecules. OMVs also play a role in bacterial communication and host-pathogen interactions.
In summary, “EV” is a general term encompassing vesicles released by various cell types, including both eukaryotic and some prokaryotic cells, for diverse purposes. On the other hand, “OMV” specifically refers to vesicles produced by Gram-negative bacteria, which are derived from their outer membrane and contain components relevant to bacterial biology and interactions. While OMVs are a subset of EVs, not all EVs are OMVs, as EVs can come from different sources and serve different functions.
Utilizing Random Forest to Differentiate Bacterial vs. Viral Infections via Host Gene Expression Signatures
Random Forest is a machine learning model that falls under the category of ensemble learning methods. It is particularly known for classification and regression tasks but can also be used for other machine learning tasks like clustering.
A Random Forest model works by constructing a multitude of decision trees during the training process and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees. Random decision forests correct for decision trees’ habit of overfitting to their training set.
The “random” part of the name comes from two key aspects of randomness used in the model:
-
Bootstrap aggregating (bagging): Each tree in the forest is trained on a random subset of the data points, and this subset is drawn with replacement (meaning some samples can be used multiple times).
-
Feature Randomness: When splitting a node during the construction of the tree, the split that is chosen is no longer the best split among all features. Instead, the split that is picked is the best split among a random subset of the features.
This approach of combining multiple trees helps to reduce the variance, leading to better performance and robustness than a single decision tree, especially on complex datasets. Random Forests are widely used because they are easy to use, can handle binary, categorical, and numerical data, require very little input preparation, and often produce a model that performs very well without complex tuning.
Using machine learning, specifically a Random Forest classifier, to distinguish between bacterial and viral infections based on transcript expression levels involves several steps. This process generally includes data preparation, feature selection, model training, validation, and testing. Here’s a more detailed breakdown of how you might implement this:
-
Data Collection and Preparation
- Collect Data: Obtain gene expression data from patients with known bacterial and viral infections. This data might come from public databases or your own experimental data.
- Preprocess Data: Normalize the expression levels to make the data comparable across samples. Handling missing values and filtering out low-variance transcripts can also improve model performance.
-
Feature Selection
- Identify Features: The features are the expression levels of various transcripts. You might start with a large number of potential features. (We can choose the published transcripts as starting features!)
- Reduce Dimensionality: Use techniques like Principal Component Analysis (PCA), variance thresholding, or mutual information to reduce the number of features. This step is crucial to improve the model’s performance and reduce overfitting.
-
Model Training
- Split the Data: Divide your dataset into training, validation, and testing sets. A common split ratio is 70% training, 15% validation, and 15% testing.
- Train the Random Forest: Use the training data to fit a Random Forest classifier. This involves choosing a set of parameters, such as the number of trees in the forest (n_estimators) and the depth of the trees (max_depth). Tools like scikit-learn in Python provide straightforward implementations.
-
Model Validation and Tuning
- Cross-Validation: Use cross-validation on the training set to estimate the model’s performance. Adjust the model parameters based on validation results.
- Hyperparameter Tuning: Techniques like grid search or random search can help identify the best parameters for your Random Forest model.
-
Testing and Evaluation
- Test the Model: Use the unseen test data to evaluate the model’s performance. Metrics like accuracy, precision, recall, and the area under the ROC curve (AUC) can provide insight into how well your model distinguishes between bacterial and viral infections.
- Feature Importance: Random Forest can provide information on which transcripts (features) are most important for classification, offering potential biological insights.
-
Implementation Tools
- Python Libraries: Libraries like scikit-learn for machine learning, pandas for data manipulation, and matplotlib or seaborn for visualization are commonly used.
- R Packages: For users more comfortable with R, packages like randomForest, caret, and tidyverse can be used for similar steps.
Example Code Snippet (Python with scikit-learn)
from sklearn.model_selection import train_test_split from sklearn.ensemble import RandomForestClassifier from sklearn.metrics import accuracy_score import pandas as pd # Assuming `data` is a Pandas DataFrame with the last column as the target label X = data.iloc[:, :-1] # Features: Expression levels y = data.iloc[:, -1] # Target: Infection type (0 for viral, 1 for bacterial) # Splitting the dataset X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42) # Creating and training the Random Forest model model = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42) model.fit(X_train, y_train) # Predicting and evaluating the model predictions = model.predict(X_test) print("Accuracy:", accuracy_score(y_test, predictions)) #In the context of machine learning, X_train and y_train are variables used during the training phase of a model. They represent the "features" and "target" (or "labels") of your training dataset, respectively. #X_train: This variable holds the features of your dataset that are used to make predictions. The "X" usually stands for the input variables or independent variables. In the case of transcript expression levels to distinguish between bacterial and viral infections, X_train would contain the expression levels of various transcripts for each sample in the training set. It's often structured as a matrix or DataFrame where each row represents a sample and each column represents a feature (e.g., the level of expression of a specific transcript). #y_train: This variable contains the target or labels for the training dataset. The "y" represents the output or dependent variable that you are trying to predict. In your case, y_train would contain the classification of each sample as either bacterial or viral, corresponding to the rows in X_train. y_train is typically a vector or Series, where each entry corresponds to the label of a sample in X_train. #Together, X_train and y_train are used to "train" or fit a machine learning model. The model learns from these training examples: it tries to understand the relationship between the features (X_train) and the labels (y_train) so that it can accurately predict the labels of new, unseen data. After the model is trained, it can then be tested using a separate dataset (not used during training) to evaluate its performance, typically using X_test and y_test, which hold the features and labels of the test dataset, respectively. #The three parameters you mentioned are among the key hyperparameters used to configure a Random Forest model. Here's what each of them means: #* n_estimators # - Definition: The number of trees in the forest. # - Purpose: More trees increase the model's robustness, making its predictions more stable, but also increase computational cost. Finding the right number of trees is a balance between improving model performance and computational efficiency. # - Default Value in scikit-learn: As of version 0.22, the default value is 100. #* max_depth # - Definition: The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples. # - Purpose: This parameter controls the depth of each tree. Deeper trees can model more complex patterns but also increase the risk of overfitting. Limiting the depth of trees can help in creating simpler models that generalize better. # - Default Value in scikit-learn: The default value is None, meaning the nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples. #* random_state # - Definition: Controls both the randomness of the bootstrapping of the samples used when building trees (if bootstrap=True) and the sampling of the features to consider when looking for the best split at each node. # - Purpose: This parameter makes the model's output reproducible. It ensures that the same sequence of random numbers is generated each time you run the model with that specific seed. It's helpful for debugging and for situations where you need reproducibility. # - Default Value in scikit-learn: The default value is None, which means the random number generator is the RandomState instance used by np.random. #Adjusting these parameters can significantly impact the model's performance and training time. It's common practice to use techniques like cross-validation and grid search to find the optimal values for these and other hyperparameters based on the specific dataset you're working with.
-
Final Thoughts
- Iterate: Machine learning is an iterative process. You may need to go back and adjust your preprocessing, feature selection, or model parameters based on test results.
- Biological Interpretation: Beyond classification, consider how the identified transcripts and their expression levels contribute to the biological understanding of infection responses.
Unicycler vs. Trycycler
-
prapare the input sequencing data
NGS.id Sample.name ONT_barcode jk3332 5179R1 Native Barcode NB01 jk3333 1585 Native Barcode NB02 jk3334 1585V Native Barcode NB03 jk3335 5179 Native Barcode NB04 jk3336 HD_05_2 Native Barcode NB05 jk3337 HD_05_2_K5 Native Barcode NB06 jk3338 HD_05_2_K6 Native Barcode NB07
-
assembly using trycycler
cat FAN41335_pass_barcode01_46d24d87_69a75752_0.fastq.gz FAN41335_pass_barcode01_46d24d87_69a75752_1.fastq.gz > FAN41335_pass_barcode01.fastq.gz cat FAN41335_pass_barcode03_46d24d87_69a75752_0.fastq.gz FAN41335_pass_barcode03_46d24d87_69a75752_1.fastq.gz > FAN41335_pass_barcode03.fastq.gz cat FAN41335_pass_barcode04_46d24d87_69a75752_0.fastq.gz FAN41335_pass_barcode04_46d24d87_69a75752_1.fastq.gz FAN41335_pass_barcode04_46d24d87_69a75752_2.fastq.gz > FAN41335_pass_barcode04.fastq.gz cat FAN41335_pass_barcode05_46d24d87_69a75752_0.fastq.gz FAN41335_pass_barcode05_46d24d87_69a75752_1.fastq.gz > FAN41335_pass_barcode05.fastq.gz unicycler -1 s-epidermidis-5179-r1_R1.fastq.gz -2 s-epidermidis-5179-r1_R2.fastq.gz -l long_reads/240109_FAN41335_S_epidermidis/fastq_pass/barcode01/FAN41335_pass_barcode01.fastq.gz --mode normal -t 55 -o 5179R1_normal unicycler -1 s-epidermidis-1585_R1.fastq.gz -2 s-epidermidis-1585_R2.fastq.gz -l long_reads/240109_FAN41335_S_epidermidis/fastq_pass/barcode02/FAN41335_pass_barcode02.fastq.gz --mode normal -t 55 -o 1585_normal #3 no short sequencing unicycler -1 s-epidermidis-5179_R1.fastq.gz -2 s-epidermidis-5179_R2.fastq.gz -l long_reads/240109_FAN41335_S_epidermidis/fastq_pass/barcode04/FAN41335_pass_barcode04.fastq.gz --mode normal -t 55 -o 5179_normal unicycler -1 HD5_2_S38_R1_001.fastq.gz -2 HD5_2_S38_R2_001.fastq.gz -l long_reads/240109_FAN41335_S_epidermidis/fastq_pass/barcode05/FAN41335_pass_barcode05.fastq.gz --mode normal -t 55 -o HD05_2_normal unicycler -1 275_K5_Holger_S92_R1_001.fastq.gz -2 275_K5_Holger_S92_R2_001.fastq.gz -l long_reads/240109_FAN41335_S_epidermidis/fastq_pass/barcode06/FAN41335_pass_barcode06.fastq.gz --mode normal -t 55 -o HD05_2_K5_normal unicycler -1 276_K6_Holger_S95_R1_001.fastq.gz -2 276_K6_Holger_S95_R2_001.fastq.gz -l long_reads/240109_FAN41335_S_epidermidis/fastq_pass/barcode07/FAN41335_pass_barcode07.fastq.gz --mode normal -t 55 -o HD05_2_K6_normal unicycler -1 s-epidermidis-5179-r1_R1.fastq.gz -2 s-epidermidis-5179-r1_R2.fastq.gz -l long_reads/240109_FAN41335_S_epidermidis/fastq_pass/barcode01/FAN41335_pass_barcode01.fastq.gz --mode bold -t 55 -o 5179R1_bold unicycler -1 s-epidermidis-1585_R1.fastq.gz -2 s-epidermidis-1585_R2.fastq.gz -l long_reads/240109_FAN41335_S_epidermidis/fastq_pass/barcode02/FAN41335_pass_barcode02.fastq.gz --mode bold -t 55 -o 1585_bold #3 no short sequencing unicycler -1 s-epidermidis-5179_R1.fastq.gz -2 s-epidermidis-5179_R2.fastq.gz -l long_reads/240109_FAN41335_S_epidermidis/fastq_pass/barcode04/FAN41335_pass_barcode04.fastq.gz --mode bold -t 55 -o 5179_bold unicycler -1 HD5_2_S38_R1_001.fastq.gz -2 HD5_2_S38_R2_001.fastq.gz -l long_reads/240109_FAN41335_S_epidermidis/fastq_pass/barcode05/FAN41335_pass_barcode05.fastq.gz --mode bold -t 55 -o HD05_2_bold unicycler -1 275_K5_Holger_S92_R1_001.fastq.gz -2 275_K5_Holger_S92_R2_001.fastq.gz -l long_reads/240109_FAN41335_S_epidermidis/fastq_pass/barcode06/FAN41335_pass_barcode06.fastq.gz --mode bold -t 55 -o HD05_2_K5_bold unicycler -1 276_K6_Holger_S95_R1_001.fastq.gz -2 276_K6_Holger_S95_R2_001.fastq.gz -l long_reads/240109_FAN41335_S_epidermidis/fastq_pass/barcode07/FAN41335_pass_barcode07.fastq.gz --mode bold -t 55 -o HD05_2_K6_bold unicycler -1 s-epidermidis-5179-r1_R1.fastq.gz -2 s-epidermidis-5179-r1_R2.fastq.gz -l long_reads/240109_FAN41335_S_epidermidis/fastq_pass/barcode01/FAN41335_pass_barcode01.fastq.gz --mode bold -t 55 -o 5179R1_bold unicycler -1 s-epidermidis-1585_R1.fastq.gz -2 s-epidermidis-1585_R2.fastq.gz -l long_reads/240109_FAN41335_S_epidermidis/fastq_pass/barcode02/FAN41335_pass_barcode02.fastq.gz --mode bold -t 55 -o 1585_bold #3 no short sequencing unicycler -1 s-epidermidis-5179_R1.fastq.gz -2 s-epidermidis-5179_R2.fastq.gz -l long_reads/240109_FAN41335_S_epidermidis/fastq_pass/barcode04/FAN41335_pass_barcode04.fastq.gz --mode bold -t 55 -o 5179_bold unicycler -1 HD5_2_S38_R1_001.fastq.gz -2 HD5_2_S38_R2_001.fastq.gz -l long_reads/240109_FAN41335_S_epidermidis/fastq_pass/barcode05/FAN41335_pass_barcode05.fastq.gz --mode bold -t 55 -o HD05_2_bold unicycler -1 275_K5_Holger_S92_R1_001.fastq.gz -2 275_K5_Holger_S92_R2_001.fastq.gz -l long_reads/240109_FAN41335_S_epidermidis/fastq_pass/barcode06/FAN41335_pass_barcode06.fastq.gz --mode bold -t 55 -o HD05_2_K5_bold unicycler -1 276_K6_Holger_S95_R1_001.fastq.gz -2 276_K6_Holger_S95_R2_001.fastq.gz -l long_reads/240109_FAN41335_S_epidermidis/fastq_pass/barcode07/FAN41335_pass_barcode07.fastq.gz --mode bold -t 55 -o HD05_2_K6_bold unicycler -1 s-epidermidis-5179-r1_R1.fastq.gz -2 s-epidermidis-5179-r1_R2.fastq.gz -l long_reads/240109_FAN41335_S_epidermidis/fastq_pass/barcode01/FAN41335_pass_barcode01.fastq.gz --mode conservative -t 55 -o 5179R1_conservative unicycler -1 s-epidermidis-1585_R1.fastq.gz -2 s-epidermidis-1585_R2.fastq.gz -l long_reads/240109_FAN41335_S_epidermidis/fastq_pass/barcode02/FAN41335_pass_barcode02.fastq.gz --mode conservative -t 55 -o 1585_conservative #3 no short sequencing unicycler -1 s-epidermidis-5179_R1.fastq.gz -2 s-epidermidis-5179_R2.fastq.gz -l long_reads/240109_FAN41335_S_epidermidis/fastq_pass/barcode04/FAN41335_pass_barcode04.fastq.gz --mode conservative -t 55 -o 5179_conservative unicycler -1 HD5_2_S38_R1_001.fastq.gz -2 HD5_2_S38_R2_001.fastq.gz -l long_reads/240109_FAN41335_S_epidermidis/fastq_pass/barcode05/FAN41335_pass_barcode05.fastq.gz --mode conservative -t 55 -o HD05_2_conservative unicycler -1 275_K5_Holger_S92_R1_001.fastq.gz -2 275_K5_Holger_S92_R2_001.fastq.gz -l long_reads/240109_FAN41335_S_epidermidis/fastq_pass/barcode06/FAN41335_pass_barcode06.fastq.gz --mode conservative -t 55 -o HD05_2_K5_conservative unicycler -1 276_K6_Holger_S95_R1_001.fastq.gz -2 276_K6_Holger_S95_R2_001.fastq.gz -l long_reads/240109_FAN41335_S_epidermidis/fastq_pass/barcode07/FAN41335_pass_barcode07.fastq.gz --mode conservative -t 55 -o HD05_2_K6_conservative ragtag.py scaffold ../assembly_flye_HD05_2/assembly.fasta assembly.fasta ragtag.py patch ragtag.scaffold.fasta ../../assembly_flye_HD05_2/assembly.fasta grep -o 'N' ragtag.patch.fasta | wc -l makeblastdb -in ../assembly_flye_HD05_2/assembly.fasta -dbtype nucl blastn -db ../assembly_flye_HD05_2/assembly.fasta -query 1-4.fasta -out assmbly_vs_flye.blastn -evalue 0.00000000001 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1
-
install the trycycler environment
nextdenovo_dir="/path/to/NextDenovo" nextpolish_dir="/path/to/NextPolish" genome_size="2500000" #2 503 927 /home/jhuang/Tools/canu/build/bin/canu -p canu -d canu_temp -fast genomeSize="$genome_size" useGrid=false maxThreads="$threads" -nanopore read_subsets/sample_"$i".fastq /home/jhuang/Tools/Trycycler/scripts/canu_trim.py canu_temp/canu.contigs.fasta > assemblies/assembly_"$i".fasta /home/jhuang/Tools/Minipolish/miniasm_and_minipolish.sh read_subsets/sample_"$i".fastq "$threads" > assemblies/assembly_"$i".gfa /home/jhuang/Tools/NECAT/Linux-amd64/bin/necat.pl config config.txt /home/jhuang/Tools/NECAT/Linux-amd64/bin/necat.pl bridge config.txt /home/jhuang/Tools/raven/build/bin/raven --threads "$threads" --disable-checkpoints --graphical-fragment-assembly assemblies/assembly_"$i".gfa read_subsets/sample_"$i".fastq > assemblies/assembly_"$i".fasta #https://github.com/rrwick (Bandage, Unicycler, Filtlong, Trycycler, Polypolish install canu, flye, raven, miniasm, minipolish, any2fasta via 'mamba install' #install fastp, medaka, polypolish, masurca (install Polca) with 'mamba install' install NextDenovo and NextPolish from https://github.com/Nextomics wget https://github.com/Nextomics/NextDenovo/releases/latest/download/NextDenovo.tgz tar -vxzf NextDenovo.tgz && cd NextDenovo #cd NextDenovo && make wget https://github.com/Nextomics/NextPolish/releases/download/v1.4.1/NextPolish.tgz pip install paralleltask tar -vxzf NextPolish.tgz && cd NextPolish #&& make git clone https://github.com/rrwick/Minipolish.git $ wget https://github.com/xiaochuanle/NECAT/releases/download/v0.0.1_update20200803/necat_20200803_Linux-amd64.tar.gz $ tar xzvf necat_20200803_Linux-amd64.tar.gz $ cd NECAT/Linux-amd64/bin $ export PATH=$PATH:$(pwd) # Install canu and raven under ~/Tools/ git clone https://github.com/marbl/canu.git cd canu/src make -j 50 #<number of threads> git clone https://github.com/lbcb-sci/raven && cd raven cmake -S ./ -B./build -DRAVEN_BUILD_EXE=1 -DCMAKE_BUILD_TYPE=Release cmake --build build # Adapt the script trycycler_assembly_extra-thorough.sh with the following complete paths. /home/jhuang/Tools/NECAT/Linux-amd64/bin/necat.pl /home/jhuang/Tools/Minipolish/miniasm_and_minipolish.sh
-
assembly using trycycler
TODO (IMPORTANT): assmeble all genomes using the following methods. compare them to the unicycler results. (trycycler) jhuang@hamm:~/DATA/Data_Holger_S.epidermidis_1585_5179_HD05$ ./trycycler_assembly_extra-thorough.sh #In the HD05 project, we use the following strategies! I. At first construct the genome only with Trycycler (Trycycler: a consensus long-read assembly tool), cp long_reads/240109_FAN41335_S_epidermidis/fastq_pass/barcode01/FAN41335_pass_barcode01.fastq.gz trycycler_5179R1/reads.fastq.gz cp long_reads/240109_FAN41335_S_epidermidis/fastq_pass/barcode02/FAN41335_pass_barcode02.fastq.gz trycycler_1585/reads.fastq.gz cp long_reads/240109_FAN41335_S_epidermidis/fastq_pass/barcode03/FAN41335_pass_barcode03.fastq.gz trycycler_1585v/reads.fastq.gz cp long_reads/240109_FAN41335_S_epidermidis/fastq_pass/barcode04/FAN41335_pass_barcode04.fastq.gz trycycler_5179/reads.fastq.gz cp long_reads/240109_FAN41335_S_epidermidis/fastq_pass/barcode05/FAN41335_pass_barcode05.fastq.gz trycycler_HD05_2/reads.fastq.gz cp long_reads/240109_FAN41335_S_epidermidis/fastq_pass/barcode06/FAN41335_pass_barcode06.fastq.gz trycycler_HD05_2_K5/reads.fastq.gz cp long_reads/240109_FAN41335_S_epidermidis/fastq_pass/barcode07/FAN41335_pass_barcode07.fastq.gz trycycler_HD05_2_K6/reads.fastq.gz for sample in trycycler_5179R1 trycycler_1585 trycycler_1585v trycycler_5179 trycycler_HD05_2 trycycler_HD05_2_K5 trycycler_HD05_2_K6; do cd ${sample}; ../trycycler_assembly_extra-thorough.sh; cd ..; done #TODO: further steps (see https://github.com/rrwick/Trycycler/wiki) for sample in trycycler_5179R1 trycycler_1585 trycycler_1585v trycycler_5179 trycycler_HD05_2 trycycler_HD05_2_K5 trycycler_HD05_2_K6; do cd ${sample}; ../trycycler_assembly_extra-thorough_raven.sh; cd ..; done #TODO: further steps (see https://github.com/rrwick/Trycycler/wiki) for sample in trycycler_5179R1 trycycler_1585 trycycler_1585v trycycler_5179 trycycler_HD05_2 trycycler_HD05_2_K5 trycycler_HD05_2_K6; do cd ${sample}; ../trycycler_assembly_extra-thorough_canu.sh; cd ..; done #TODO: further steps (see https://github.com/rrwick/Trycycler/wiki) for sample in trycycler_5179R1 trycycler_1585 trycycler_1585v trycycler_5179 trycycler_HD05_2 trycycler_HD05_2_K5 trycycler_HD05_2_K6; do cd ${sample}; trycycler cluster --threads 55 --assemblies assemblies/*.fasta --reads reads.fastq --out_dir trycycler; cd ..; done trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_001 #Error: failed to circularise sequence D_bctg00000000 because its start could not be found in other sequences. You can either trim some sequence off the start of D_bctg00000000 or exclude the sequence altogether and try again. #Error: failed to circularise sequence E_ctg000010 for multiple reasons. You must either repair this sequence or exclude it and then try running trycycler reconcile again. #Error: failed to circularise sequence W_ctg000000 because its start could not be found in other sequences. You can either trim some sequence off the start of W_ctg000000 or exclude the sequence altogether and try again. trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_001 #Error: failed to circularise sequence K_ctg000000 because its end could not be found in other sequences. You can either trim some sequence off the end of K_ctg000000 or exclude the sequence altogether and try #Worst-1kbp: W_Utg714 trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_001 #Error: failed to circularise sequence T_contig_1 because its end could not be found in other sequences. You can either trim some sequence off the end of T_contig_1 or exclude the sequence altogether and try again. # Worst-1kbp: D_bctg00000000, J_bctg00000000, P_bctg00000000 trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_001 #Error: failed to circularise sequence A_tig00000003 because its start could not be found in other sequences. You can either trim some sequence off the start of A_tig00000003 or exclude the sequence altogether and try again. #Error: failed to circularise sequence E_ctg000000 because its start could not be found in other sequences. You can either trim some sequence off the start of E_ctg000000 or exclude the sequence altogether and try again. #Error: failed to circularise sequence Q_ctg000000 because its end could not be found in other sequences. You can either trim some sequence off the end of Q_ctg000000 or exclude the sequence altogether and try again. # Worst-1kbp: L_Utg716, X_Utg654 trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_001 trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_001 trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_001 trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_002 #M_tig00000002, S_tig00000003, A_tig00000003, C_utg000003l, G_tig00000002, I_utg000002l #E_ctg000000, K_ctg000000, Q_ctg000000 trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_003 trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_004 trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_005 trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_006 # #--> When finished, Trycycler reconcile will make 2_all_seqs.fasta in the cluster directory, a multi-FASTA file containing each of the contigs ready for multiple sequence alignment. trycycler msa --threads 55 --cluster_dir trycycler/cluster_001 trycycler msa --threads 55 --cluster_dir trycycler/cluster_002 trycycler msa --threads 55 --cluster_dir trycycler/cluster_003 trycycler msa --threads 55 --cluster_dir trycycler/cluster_004 trycycler msa --threads 55 --cluster_dir trycycler/cluster_005 #--> When finished, Trycycler reconcile will make a 3_msa.fasta file in the cluster directory #generate 4_reads.fastq for each contig! trycycler partition --threads 55 --reads reads.fastq --cluster_dirs trycycler/cluster_* #trycycler partition --threads 55 --reads reads.fastq --cluster_dirs trycycler/cluster_001 trycycler/cluster_002 trycycler/cluster_003 trycycler consensus --threads 55 --cluster_dir trycycler/cluster_001 trycycler consensus --threads 55 --cluster_dir trycycler/cluster_002 trycycler consensus --threads 55 --cluster_dir trycycler/cluster_003 trycycler consensus --threads 55 --cluster_dir trycycler/cluster_004 trycycler consensus --threads 55 --cluster_dir trycycler/cluster_005 #!!NOTE that we take the isolates of HD05_2_K5 and HD05_2_K6 assembled by Unicycler instead of Trycycler!! # TODO (TODAY), generate the 3 datasets below! # TODO (IMPORTANT): write a Email to Holger, say the short sequencing of HD5_2 is not correct, since the 3 datasets! However, the MTxxxxxxx is confirmed not in K5 and K6! TODO: variant calling needs the short-sequencing, they are not dorable without the correct short-reads! resequencing? It is difficult to call variants only from long-reads since too much errors in long-reads! #TODO: check the MT sequence if in the isolates, more deteiled annotations come late! #II. Comparing the results of Trycycler with Unicycler. #III. Eventually add the plasmids assembled from unicycler to the final results. E.g. add the 4 plasmids to K5 and K6
-
Polishing after Trycycler
#1. Oxford Nanopore sequencer (Ignored due to the samtools version incompatibility!) # for c in trycycler/cluster_*; do # medaka_consensus -i "$c"/4_reads.fastq -d "$c"/7_final_consensus.fasta -o "$c"/medaka -m r941_min_sup_g507 -t 12 # mv "$c"/medaka/consensus.fasta "$c"/8_medaka.fasta # rm -r "$c"/medaka "$c"/*.fai "$c"/*.mmi # clean up # done # cat trycycler/cluster_*/8_medaka.fasta > trycycler/consensus.fasta #2. Short-read polishing #---- 5179_R1 (2) ---- # mean read depth: 205.8x # 188 bp have a depth of zero (99.9924% coverage) # 355 positions changed (0.0144% of total positions) # estimated pre-polishing sequence accuracy: 99.9856% (Q38.42) #Step 1: read QC fastp --in1 ../../s-epidermidis-5179-r1_R1.fastq.gz --in2 ../../s-epidermidis-5179-r1_R2.fastq.gz --out1 1.fastq.gz --out2 2.fastq.gz --unpaired1 u.fastq.gz --unpaired2 u.fastq.gz #Step 2: Polypolish for cluster in cluster_001 cluster_002; do bwa index ${cluster}/7_final_consensus.fasta bwa mem -t 55 -a ${cluster}/7_final_consensus.fasta 1.fastq.gz > ${cluster}/alignments_1.sam bwa mem -t 55 -a ${cluster}/7_final_consensus.fasta 2.fastq.gz > ${cluster}/alignments_2.sam polypolish polish ${cluster}/7_final_consensus.fasta ${cluster}/alignments_1.sam ${cluster}/alignments_2.sam > ${cluster}/polypolish.fasta done #Step 3: POLCA for cluster in cluster_001 cluster_002; do cd ${cluster} polca.sh -a polypolish.fasta -r "../../../s-epidermidis-5179-r1_R1.fastq.gz ../../../s-epidermidis-5179-r1_R2.fastq.gz" -t 55 -m 120G cd .. done #Substitution Errors: 37 #Insertion/Deletion Errors: 2 #Assembly Size: 2470001 #Consensus Quality: 99.9984 #Substitution Errors: 4 #Insertion/Deletion Errors: 0 #Assembly Size: 17748 #Consensus Quality: 99.9775 #Step 4: (optional) more rounds and/or other polishers #After one round of Polypolish and one round of POLCA, your assembly should be in very good shape! #However, there may still be a few lingering errors. You can try running additional rounds of Polypolish or POLCA to see if they make any more changes. for cluster in cluster_001 cluster_002; do cd ${cluster} polca.sh -a polypolish.fasta.PolcaCorrected.fa -r "../../../s-epidermidis-5179-r1_R1.fastq.gz ../../../s-epidermidis-5179-r1_R2.fastq.gz" -t 55 -m 120G cd .. done Substitution Errors: 13 Insertion/Deletion Errors: 0 Assembly Size: 2470004 Consensus Quality: 99.9995 Substitution Errors: 0 Insertion/Deletion Errors: 0 Assembly Size: 17748 Consensus Quality: 100 for cluster in cluster_001; do cd ${cluster} polca.sh -a polypolish.fasta.PolcaCorrected.fa.PolcaCorrected.fa -r "../../../s-epidermidis-5179-r1_R1.fastq.gz ../../../s-epidermidis-5179-r1_R2.fastq.gz" -t 55 -m 120G cd .. done #Substitution Errors: 0 #Insertion/Deletion Errors: 0 #Assembly Size: 2470004 #Consensus Quality: 100 #---- 1585 (4) ---- # mean read depth: 174.7x # 8,297 bp have a depth of zero (99.6604% coverage) # 271 positions changed (0.0111% of total positions) # estimated pre-polishing sequence accuracy: 99.9889% (Q39.55) #Step 1: read QC fastp --in1 ../../s-epidermidis-1585_R1.fastq.gz --in2 ../../s-epidermidis-1585_R2.fastq.gz --out1 1.fastq.gz --out2 2.fastq.gz --unpaired1 u.fastq.gz --unpaired2 u.fastq.gz #Step 2: Polypolish for cluster in cluster_001 cluster_002 cluster_003 cluster_004; do bwa index ${cluster}/7_final_consensus.fasta bwa mem -t 55 -a ${cluster}/7_final_consensus.fasta 1.fastq.gz > ${cluster}/alignments_1.sam bwa mem -t 55 -a ${cluster}/7_final_consensus.fasta 2.fastq.gz > ${cluster}/alignments_2.sam polypolish polish ${cluster}/7_final_consensus.fasta ${cluster}/alignments_1.sam ${cluster}/alignments_2.sam > ${cluster}/polypolish.fasta done #Step 3: POLCA for cluster in cluster_001 cluster_002 cluster_003 cluster_004; do cd ${cluster} polca.sh -a polypolish.fasta -r "../../../s-epidermidis-1585_R1.fastq.gz ../../../s-epidermidis-1585_R2.fastq.gz" -t 55 -m 120G cd .. done #Substitution Errors: 7 #Insertion/Deletion Errors: 4 #Assembly Size: 2443174 #Consensus Quality: 99.9995 #Substitution Errors: 0 #Insertion/Deletion Errors: 0 #Assembly Size: 9014 #Consensus Quality: 100 #Substitution Errors: 0 #Insertion/Deletion Errors: 0 #Assembly Size: 9014 #Consensus Quality: 100 #Substitution Errors: 0 #Insertion/Deletion Errors: 0 #Assembly Size: 2344 #Consensus Quality: 100 #Step 4: (optional) more rounds and/or other polishers #After one round of Polypolish and one round of POLCA, your assembly should be in very good shape! #However, there may still be a few lingering errors. You can try running additional rounds of Polypolish or POLCA to see if they make any more changes. for cluster in cluster_001; do cd ${cluster} polca.sh -a polypolish.fasta.PolcaCorrected.fa -r "../../../s-epidermidis-1585_R1.fastq.gz ../../../s-epidermidis-1585_R2.fastq.gz" -t 55 -m 120G cd .. done #Substitution Errors: 0 #Insertion/Deletion Errors: 0 #Assembly Size: 2443176 #Consensus Quality: 100 #---- 1585 derived from unicycler, under 1585_normal/unicycler (4) ---- #Step 0: copy chrom and plasmid1, plasmid2, plasmid3 to cluster_001/7_final_consensus.fasta, ... #Step 1: read QC fastp --in1 ../../s-epidermidis-1585_R1.fastq.gz --in2 ../../s-epidermidis-1585_R2.fastq.gz --out1 1.fastq.gz --out2 2.fastq.gz --unpaired1 u.fastq.gz --unpaired2 u.fastq.gz #Step 2: Polypolish for cluster in cluster_001 cluster_002 cluster_003 cluster_004; do bwa index ${cluster}/7_final_consensus.fasta bwa mem -t 55 -a ${cluster}/7_final_consensus.fasta 1.fastq.gz > ${cluster}/alignments_1.sam bwa mem -t 55 -a ${cluster}/7_final_consensus.fasta 2.fastq.gz > ${cluster}/alignments_2.sam polypolish polish ${cluster}/7_final_consensus.fasta ${cluster}/alignments_1.sam ${cluster}/alignments_2.sam > ${cluster}/polypolish.fasta done #Polishing 1 (2,443,574 bp): #mean read depth: 174.7x #8,298 bp have a depth of zero (99.6604% coverage) #52 positions changed (0.0021% of total positions) #estimated pre-polishing sequence accuracy: 99.9979% (Q46.72) #Polishing 2 (9,014 bp): #mean read depth: 766.5x #3 bp have a depth of zero (99.9667% coverage) #0 positions changed (0.0000% of total positions) #estimated pre-polishing sequence accuracy: 100.0000% (Q∞) #Polishing 7 (2,344 bp): #mean read depth: 2893.0x #4 bp have a depth of zero (99.8294% coverage) #0 positions changed (0.0000% of total positions) #estimated pre-polishing sequence accuracy: 100.0000% (Q∞) #Polishing 8 (2,255 bp): #mean read depth: 2719.6x #4 bp have a depth of zero (99.8226% coverage) #0 positions changed (0.0000% of total positions) #estimated pre-polishing sequence accuracy: 100.0000% (Q∞) #Step 3: POLCA for cluster in cluster_001 cluster_002 cluster_003 cluster_004; do cd ${cluster} polca.sh -a polypolish.fasta -r "../../../s-epidermidis-1585_R1.fastq.gz ../../../s-epidermidis-1585_R2.fastq.gz" -t 55 -m 120G cd .. done #Substitution Errors: 7 #Insertion/Deletion Errors: 4 #Assembly Size: 2443598 #Consensus Quality: 99.9995 #Substitution Errors: 0 #Insertion/Deletion Errors: 0 #Assembly Size: 9014 #Consensus Quality: 100 #Substitution Errors: 0 #Insertion/Deletion Errors: 0 #Assembly Size: 2344 #Consensus Quality: 100 #Substitution Errors: 0 #Insertion/Deletion Errors: 0 #Assembly Size: 2255 #Consensus Quality: 100 #Step 4: (optional) more rounds and/or other polishers #After one round of Polypolish and one round of POLCA, your assembly should be in very good shape! #However, there may still be a few lingering errors. You can try running additional rounds of Polypolish or POLCA to see if they make any more changes. for cluster in cluster_001; do cd ${cluster} polca.sh -a polypolish.fasta.PolcaCorrected.fa -r "../../../s-epidermidis-1585_R1.fastq.gz ../../../s-epidermidis-1585_R2.fastq.gz" -t 55 -m 120G cd .. done #Substitution Errors: 0 #Insertion/Deletion Errors: 0 #Assembly Size: 2443600 #Consensus Quality: 100 #-- 1585v (1, no short reads, waiting) -- # TODO! #-- 5179 (2) -- #mean read depth: 120.7x #7,547 bp have a depth of zero (99.6946% coverage) #356 positions changed (0.0144% of total positions) #estimated pre-polishing sequence accuracy: 99.9856% (Q38.41) #Step 1: read QC fastp --in1 ../../s-epidermidis-5179_R1.fastq.gz --in2 ../../s-epidermidis-5179_R2.fastq.gz --out1 1.fastq.gz --out2 2.fastq.gz --unpaired1 u.fastq.gz --unpaired2 u.fastq.gz #Step 2: Polypolish for cluster in cluster_001 cluster_002; do bwa index ${cluster}/7_final_consensus.fasta bwa mem -t 55 -a ${cluster}/7_final_consensus.fasta 1.fastq.gz > ${cluster}/alignments_1.sam bwa mem -t 55 -a ${cluster}/7_final_consensus.fasta 2.fastq.gz > ${cluster}/alignments_2.sam polypolish polish ${cluster}/7_final_consensus.fasta ${cluster}/alignments_1.sam ${cluster}/alignments_2.sam > ${cluster}/polypolish.fasta done #Step 3: POLCA for cluster in cluster_001 cluster_002; do cd ${cluster} polca.sh -a polypolish.fasta -r "../../../s-epidermidis-5179_R1.fastq.gz ../../../s-epidermidis-5179_R2.fastq.gz" -t 55 -m 120G cd .. done #Substitution Errors: 49 #Insertion/Deletion Errors: 23 #Assembly Size: 2471418 #Consensus Quality: 99.9971 #Substitution Errors: 0 #Insertion/Deletion Errors: 0 #Assembly Size: 17748 #Consensus Quality: 100 #Step 4: (optional) more rounds POLCA for cluster in cluster_001; do cd ${cluster} polca.sh -a polypolish.fasta.PolcaCorrected.fa -r "../../../s-epidermidis-5179_R1.fastq.gz ../../../s-epidermidis-5179_R2.fastq.gz" -t 55 -m 120G cd .. done #Substitution Errors: 10 #Insertion/Deletion Errors: 5 #Assembly Size: 2471442 #Consensus Quality: 99.9994 for cluster in cluster_001; do cd ${cluster} polca.sh -a polypolish.fasta.PolcaCorrected.fa.PolcaCorrected.fa -r "../../../s-epidermidis-5179_R1.fastq.gz ../../../s-epidermidis-5179_R2.fastq.gz" -t 55 -m 120G cd .. done Substitution Errors: 6 Insertion/Deletion Errors: 0 Assembly Size: 2471445 Consensus Quality: 99.9998 for cluster in cluster_001; do cd ${cluster} polca.sh -a polypolish.fasta.PolcaCorrected.fa.PolcaCorrected.fa.PolcaCorrected.fa -r "../../../s-epidermidis-5179_R1.fastq.gz ../../../s-epidermidis-5179_R2.fastq.gz" -t 55 -m 120G cd .. done Substitution Errors: 0 Insertion/Deletion Errors: 0 Assembly Size: 2471445 Consensus Quality: 100 #-- HD5_2 (2): without the short-sequencing we cannot correct the base-calling! -- # !ERROR to be REPORTED, the #Polishing cluster_001_consensus (2,504,140 bp): #mean read depth: 94.4x #240,420 bp have a depth of zero (90.3991% coverage) #56,894 positions changed (2.2720% of total positions) #estimated pre-polishing sequence accuracy: 97.7280% (Q16.44) /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_1_S37_R1_001.fastq /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_1_S37_R2_001.fastq /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_2_S38_R1_001.fastq /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_2_S38_R2_001.fastq /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_3_S39_R1_001.fastq /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_3_S39_R2_001.fastq /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_4_S40_R1_001.fastq /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_4_S40_R2_001.fastq /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_5_S41_R1_001.fastq /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_5_S41_R2_001.fastq /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_6_S42_R1_001.fastq /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_6_S42_R2_001.fastq /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_7_S43_R1_001.fastq /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_7_S43_R2_001.fastq /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_8_S44_R1_001.fastq /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_8_S44_R2_001.fastq /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_9_S45_R1_001.fastq /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_9_S45_R2_001.fastq /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_10_S46_R1_001.fastq /media/jhuang/Elements2/Data_Anna12_HAPDICS_HyAsP/180821_rohde/HD5_10_S46_R2_001.fastq #Step 1: read QC fastp --in1 ../../HD5_2_S38_R1_001.fastq.gz --in2 ../../HD5_2_S38_R2_001.fastq.gz --out1 1.fastq.gz --out2 2.fastq.gz --unpaired1 u.fastq.gz --unpaired2 u.fastq.gz # NOTE that the following steps are not run since the short-reads are not correct! # #Step 2: Polypolish # for cluster in cluster_001 cluster_005; do # bwa index ${cluster}/7_final_consensus.fasta # bwa mem -t 55 -a ${cluster}/7_final_consensus.fasta 1.fastq.gz > ${cluster}/alignments_1.sam # bwa mem -t 55 -a ${cluster}/7_final_consensus.fasta 2.fastq.gz > ${cluster}/alignments_2.sam # polypolish polish ${cluster}/7_final_consensus.fasta ${cluster}/alignments_1.sam ${cluster}/alignments_2.sam > ${cluster}/polypolish.fasta # done # #Step 3: POLCA # for cluster in cluster_001 cluster_005; do # cd ${cluster} # polca.sh -a polypolish.fasta -r "../../../HD5_2_S38_R1_001.fastq.gz ../../../HD5_2_S38_R2_001.fastq.gz" -t 55 -m 120G # cd .. # done # #Step 4: (optional) more rounds POLCA # for cluster in cluster_001; do # cd ${cluster} # polca.sh -a polypolish.fasta.PolcaCorrected.fa -r "../../../HD5_2_S38_R1_001.fastq.gz ../../../HD5_2_S38_R2_001.fastq.gz" -t 55 -m 120G # cd .. # done # NOTE that the plasmids of HD5_2_K5 and HD5_2_K6 were copied from Unicycler! #-- HD5_2_K5 (4) -- mean read depth: 87.1x 25 bp have a depth of zero (99.9990% coverage) 1,085 positions changed (0.0433% of total positions) estimated pre-polishing sequence accuracy: 99.9567% (Q33.63) #Step 1: read QC fastp --in1 ../../275_K5_Holger_S92_R1_001.fastq.gz --in2 ../../275_K5_Holger_S92_R2_001.fastq.gz --out1 1.fastq.gz --out2 2.fastq.gz --unpaired1 u.fastq.gz --unpaired2 u.fastq.gz #Step 2: Polypolish for cluster in cluster_001 cluster_002 cluster_003 cluster_004; do bwa index ${cluster}/7_final_consensus.fasta bwa mem -t 55 -a ${cluster}/7_final_consensus.fasta 1.fastq.gz > ${cluster}/alignments_1.sam bwa mem -t 55 -a ${cluster}/7_final_consensus.fasta 2.fastq.gz > ${cluster}/alignments_2.sam polypolish polish ${cluster}/7_final_consensus.fasta ${cluster}/alignments_1.sam ${cluster}/alignments_2.sam > ${cluster}/polypolish.fasta done #Step 3: POLCA for cluster in cluster_001 cluster_002 cluster_003 cluster_004; do cd ${cluster} polca.sh -a polypolish.fasta -r "../../../275_K5_Holger_S92_R1_001.fastq.gz ../../../275_K5_Holger_S92_R2_001.fastq.gz" -t 55 -m 120G cd .. done #Substitution Errors: 146 #Insertion/Deletion Errors: 2 #Assembly Size: 2504401 #Consensus Quality: 99.9941 #Substitution Errors: 41 #Insertion/Deletion Errors: 0 #Assembly Size: 41288 #Consensus Quality: 99.9007 #Substitution Errors: 0 #Insertion/Deletion Errors: 0 #Assembly Size: 9191 #Consensus Quality: 100 #Substitution Errors: 0 #Insertion/Deletion Errors: 0 #Assembly Size: 2767 #Consensus Quality: 100 #Step 4: (optional) more rounds POLCA for cluster in cluster_001 cluster_002; do cd ${cluster} polca.sh -a polypolish.fasta.PolcaCorrected.fa -r "../../../275_K5_Holger_S92_R1_001.fastq.gz ../../../275_K5_Holger_S92_R2_001.fastq.gz" -t 55 -m 120G cd .. done #Substitution Errors: 41 #Insertion/Deletion Errors: 0 #Assembly Size: 2504401 #Consensus Quality: 99.9984 #Substitution Errors: 8 #Insertion/Deletion Errors: 0 #Assembly Size: 41288 #Consensus Quality: 99.9806 for cluster in cluster_001 cluster_002; do cd ${cluster} polca.sh -a polypolish.fasta.PolcaCorrected.fa.PolcaCorrected.fa -r "../../../275_K5_Holger_S92_R1_001.fastq.gz ../../../275_K5_Holger_S92_R2_001.fastq.gz" -t 55 -m 120G cd .. done #Substitution Errors: 8 #Insertion/Deletion Errors: 0 #Assembly Size: 2504401 #Consensus Quality: 99.9997 #Substitution Errors: 4 #Insertion/Deletion Errors: 0 #Assembly Size: 41288 #Consensus Quality: 99.9903 for cluster in cluster_001 cluster_002; do cd ${cluster} polca.sh -a polypolish.fasta.PolcaCorrected.fa.PolcaCorrected.fa.PolcaCorrected.fa -r "../../../275_K5_Holger_S92_R1_001.fastq.gz ../../../275_K5_Holger_S92_R2_001.fastq.gz" -t 55 -m 120G cd .. done #Substitution Errors: 8 #Insertion/Deletion Errors: 0 #Assembly Size: 2504401 #Consensus Quality: 99.9997 #Substitution Errors: 4 #Insertion/Deletion Errors: 0 #Assembly Size: 41288 #Consensus Quality: 99.9903 #-- HD5_2_K6 (4) -- #mean read depth: 116.7x #4 bp have a depth of zero (99.9998% coverage) #1,022 positions changed (0.0408% of total positions) #estimated pre-polishing sequence accuracy: 99.9592% (Q33.89) #Step 1: read QC fastp --in1 ../../276_K6_Holger_S95_R1_001.fastq.gz --in2 ../../276_K6_Holger_S95_R2_001.fastq.gz --out1 1.fastq.gz --out2 2.fastq.gz --unpaired1 u.fastq.gz --unpaired2 u.fastq.gz #Step 2: Polypolish for cluster in cluster_001 cluster_002 cluster_003 cluster_004; do bwa index ${cluster}/7_final_consensus.fasta bwa mem -t 55 -a ${cluster}/7_final_consensus.fasta 1.fastq.gz > ${cluster}/alignments_1.sam bwa mem -t 55 -a ${cluster}/7_final_consensus.fasta 2.fastq.gz > ${cluster}/alignments_2.sam polypolish polish ${cluster}/7_final_consensus.fasta ${cluster}/alignments_1.sam ${cluster}/alignments_2.sam > ${cluster}/polypolish.fasta done #Step 3: POLCA for cluster in cluster_001 cluster_002 cluster_003 cluster_004; do cd ${cluster} polca.sh -a polypolish.fasta -r "../../../276_K6_Holger_S95_R1_001.fastq.gz ../../../276_K6_Holger_S95_R2_001.fastq.gz" -t 55 -m 120G cd .. done #Substitution Errors: 164 #Insertion/Deletion Errors: 2 #Assembly Size: 2504398 #Consensus Quality: 99.9934 #Substitution Errors: 22 #Insertion/Deletion Errors: 0 #Assembly Size: 41288 #Consensus Quality: 99.9467 #Substitution Errors: 0 #Insertion/Deletion Errors: 0 #Assembly Size: 9191 #Consensus Quality: 100 #Substitution Errors: 0 #Insertion/Deletion Errors: 0 #Assembly Size: 2767 #Consensus Quality: 100 #Step 4: (optional) more rounds POLCA for cluster in cluster_001 cluster_002; do cd ${cluster} polca.sh -a polypolish.fasta.PolcaCorrected.fa -r "../../../276_K6_Holger_S95_R1_001.fastq.gz ../../../276_K6_Holger_S95_R2_001.fastq.gz" -t 55 -m 120G cd .. done #Substitution Errors: 32 #Insertion/Deletion Errors: 0 #Assembly Size: 2504400 #Consensus Quality: 99.9987 #Substitution Errors: 0 #Insertion/Deletion Errors: 0 #Assembly Size: 41288 #Consensus Quality: 100 for cluster in cluster_001; do cd ${cluster} polca.sh -a polypolish.fasta.PolcaCorrected.fa.PolcaCorrected.fa -r "../../../276_K6_Holger_S95_R1_001.fastq.gz ../../../276_K6_Holger_S95_R2_001.fastq.gz" -t 55 -m 120G cd .. done #Substitution Errors: 4 #Insertion/Deletion Errors: 0 #Assembly Size: 2504400 #Consensus Quality: 99.9998 for cluster in cluster_001; do cd ${cluster} polca.sh -a polypolish.fasta.PolcaCorrected.fa.PolcaCorrected.fa.PolcaCorrected.fa -r "../../../276_K6_Holger_S95_R1_001.fastq.gz ../../../276_K6_Holger_S95_R2_001.fastq.gz" -t 55 -m 120G cd .. done #Substitution Errors: 2 #Insertion/Deletion Errors: 0 #Assembly Size: 2504400 #Consensus Quality: 99.9999
-
Results by directly using Unicycler
#----------------------- 5179R1_normal ----------------------- >1 length=2468563 depth=1.00x circular=true >2 length=17748 depth=1.42x circular=true Component Segments Links Length N50 Longest segment Status total 2 2 2,486,311 2,468,563 2,468,563 1 1 1 2,468,563 2,468,563 2,468,563 complete 2 1 1 17,748 17,748 17,748 complete Segment Length Depth Starting gene Position Strand Identity Coverage 1 2,468,563 1.00x UniRef90_Q5HJZ9 1,212,460 forward 100.0% 100.0% 2 17,748 1.42x UniRef90_A0A0H2VIR3 4,804 reverse 93.2% 99.7% # ---- 5179_bold ---- Segment Length Depth Starting gene Position Strand Identity Coverage 1 2,469,173 1.00x UniRef90_Q5HJZ9 1,901,872 reverse 100.0% 100.0% 2 17,749 2.27x UniRef90_A0A0H2VIR3 4,771 forward 93.2% 99.7% 4 4,595 10.19x none found 8 2,449 17.14x none found >1 length=2469173 depth=1.00x circular=true >2 length=17749 depth=2.27x circular=true >3 length=4761 depth=0.44x >4 length=4595 depth=10.19x circular=true >5 length=3735 depth=0.29x >6 length=3718 depth=0.42x >7 length=3573 depth=0.52x >8 length=2449 depth=17.14x circular=true >9 length=2411 depth=0.35x >10 length=2371 depth=0.32x >11 length=2365 depth=0.43x >12 length=1637 depth=0.44x >13 length=1568 depth=0.66x >14 length=1505 depth=0.65x >15 length=1403 depth=0.93x >16 length=1329 depth=0.55x makeblastdb -in assembly.fasta -dbtype nucl blastn -task blastn-short -db ../HD05_2_K5_conservative/assembly.fasta -query assembly.fasta -out 2-16_vs_1.blastn -evalue 0.00000000001 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 #TODO: manually fill the gap in the HD05_2 genome! 5 1 99.946 3728 1 1 1 3728 1535666 1539392 0.0 7366 6 1 99.973 3718 0 1 1 3718 702963 706679 0.0 7355 7 1 99.888 3573 1 3 1 3573 1764622 1768191 0.0 7027 9 1 100.000 2411 0 0 1 2411 1060914 1063324 0.0 4779 10 1 100.000 2371 0 0 1 2371 615275 612905 0.0 4700 11 1 99.958 2365 0 1 1 2365 1088713 1086350 0.0 4672 12 1 100.000 1637 0 0 1 1637 146635 144999 0.0 3245 13 1 99.936 1568 0 1 1 1568 2024197 2025763 0.0 3092 14 1 100.000 1505 0 0 1 1505 2445480 2443976 0.0 2983 15 1 100.000 1403 0 0 1 1403 197723 196321 0.0 2781 16 1 99.925 1329 1 0 1 1329 49854 48526 0.0 2627 # -------------------- 1585_normal -------------------- >1 length=2443574 depth=1.00x circular=true #contig_1 2442282 10 60 61 >2 length=9014 depth=3.72x circular=true >3 length=4388 depth=0.89x >4 length=3443 depth=0.48x >5 length=3338 depth=0.48x >6 length=3336 depth=0.45x >7 length=2344 depth=11.44x circular=true >8 length=2255 depth=9.81x circular=true >9 length=1929 depth=0.37x >10 length=1703 depth=1.67x >11 length=1605 depth=0.26x >12 length=1381 depth=0.56x >13 length=1360 depth=0.39x >14 length=1281 depth=0.41x >15 length=1163 depth=0.51x >16 length=1088 depth=0.24x 2594107 ragtag.py scaffold ../HD05_2_K5_normal/assembly.fasta assembly.fasta ragtag.py patch ragtag.scaffold.fasta ../../HD05_2_K5_normal/assembly.fasta grep -o 'N' ragtag.patch.fasta | wc -l 3 1 99.977 4388 0 1 1 4388 2410738 2406352 0.0 8683 4 1 99.942 3443 0 2 1 3443 2222741 2219301 0.0 6794 5 1 99.970 3338 0 1 1 3338 455636 452300 0.0 6601 6 1 99.940 3336 0 2 1 3336 1617740 1614407 0.0 6581 9 1 99.948 1929 0 1 1 1929 1321522 1319595 0.0 3808 10 1 99.941 1703 1 0 1 1703 90503 88801 0.0 3368 11 1 99.938 1605 0 1 1 1605 2361795 2363398 0.0 3166 12 1 99.928 1381 0 1 1 1381 241092 242471 0.0 2722 13 1 100.000 1360 0 0 1 1360 1157897 1159256 0.0 2696 14 1 100.000 1281 0 0 1 1281 218323 219603 0.0 2539 15 1 100.000 1163 0 0 1 1163 2077536 2078698 0.0 2305 16 1 100.000 1088 0 0 1 1088 283284 284371 0.0 2157 >1 length=2503585 depth=1.00x circular=true >2 length=41288 depth=3.32x circular=true >3 length=9191 depth=8.29x circular=true >4 length=2767 depth=9.36x circular=true >1 length=2503927 depth=1.00x circular=true >2 length=41288 depth=3.77x circular=true >3 length=9191 depth=7.83x circular=true >4 length=2767 depth=10.11x circular=true #-------------------------- 1585V #[2024-01-17 13:42:28] INFO: Assembly statistics: Total length: 2438882 vs 2443574 Fragments: 1 Fragments N50: 2438882 Largest frg: 2438882 Scaffolds: 0 Mean coverage: 47 unicycler -1 s-epidermidis-5179-r1_R1.fastq.gz -2 s-epidermidis-5179-r1_R2.fastq.gz -l long_reads/240109_FAN41335_S_epidermidis/fastq_pass/barcode01/FAN41335_pass_barcode01.fastq.gz --mode conservative -t 55 -o 5179R1_conservative unicycler -1 s-epidermidis-1585_R1.fastq.gz -2 s-epidermidis-1585_R2.fastq.gz -l long_reads/240109_FAN41335_S_epidermidis/fastq_pass/barcode02/FAN41335_pass_barcode02.fastq.gz --mode conservative -t 55 -o 1585_conservative #3 no short sequencing unicycler -1 s-epidermidis-5179_R1.fastq.gz -2 s-epidermidis-5179_R2.fastq.gz -l long_reads/240109_FAN41335_S_epidermidis/fastq_pass/barcode04/FAN41335_pass_barcode04.fastq.gz --mode conservative -t 55 -o 5179_conservative unicycler -1 HD5_2_S38_R1_001.fastq.gz -2 HD5_2_S38_R2_001.fastq.gz -l long_reads/240109_FAN41335_S_epidermidis/fastq_pass/barcode05/FAN41335_pass_barcode05.fastq.gz --mode conservative -t 55 -o HD05_2_conservative unicycler -1 275_K5_Holger_S92_R1_001.fastq.gz -2 275_K5_Holger_S92_R2_001.fastq.gz -l long_reads/240109_FAN41335_S_epidermidis/fastq_pass/barcode06/FAN41335_pass_barcode06.fastq.gz --mode conservative -t 55 -o HD05_2_K5_conservative unicycler -1 276_K6_Holger_S95_R1_001.fastq.gz -2 276_K6_Holger_S95_R2_001.fastq.gz -l long_reads/240109_FAN41335_S_epidermidis/fastq_pass/barcode07/FAN41335_pass_barcode07.fastq.gz --mode conservative -t 55 -o HD05_2_K6_conservative # ---- 1 5179R1 2469692 ---- >1 length=2468563 depth=1.00x circular=true >2 length=17748 depth=1.42x circular=true # ---- 2 1585 2442282 ---- (compring to Trycyler chrom is 2443176 nt) >1 length=2443574 depth=1.00x circular=true >2 length=9014 depth=3.72x circular=true >7 length=2344 depth=11.44x circular=true >8 length=2255 depth=9.81x circular=true # ---- 3 1585v 2438882 ---- #using long sequencing only 1 # ---- 4 5179_bold 2471107+17740 ---- >1 length=2469173 depth=1.00x circular=true >2 length=17749 depth=2.27x circular=true >4 length=4595 depth=10.19x circular=true >8 length=2449 depth=17.14x circular=true # ---- 5 HD05_2 2504622 ---- # Note in HD05_2_bold_hq_lq including the bad long-reads. >1 length=965875 depth=0.95x >2 length=855325 depth=1.00x >3 length=582944 depth=1.02x >4 length=183656 depth=1.02x >5 length=13570 depth=4.73x circular=true >6 length=1503 depth=4.85x >7 length=1271 depth=5.06x >8 length=227 depth=2.03x >9 length=153 depth=0.93x >10 length=152 depth=1.09x # trycycler: 2503231 (yes), 9183 (yes), 22394, 18541 -- # ---- 6 HD05_2_K5 2504656+41290+9191 ---- conservative >1 length=2503585 depth=1.00x circular=true >2 length=41288 depth=3.32x circular=true >3 length=9191 depth=8.29x circular=true >4 length=2767 depth=9.36x circular=true # ---- 7 HD05_2_K6 2504588+41285+9192 ---- conservative >1 length=2503927 depth=1.00x circular=true >2 length=41288 depth=3.77x circular=true >3 length=9191 depth=7.83x circular=true >4 length=2767 depth=10.11x circular=true ragtag.py scaffold ../assembly_flye_HD05_2/assembly.fasta assembly.fasta ragtag.py patch ragtag.scaffold.fasta ../../assembly_flye_HD05_2/assembly.fasta grep -o 'N' ragtag.patch.fasta | wc -l makeblastdb -in ../assembly_flye_HD05_2/assembly.fasta -dbtype nucl blastn -db ../assembly_flye_HD05_2/assembly.fasta -query 1-4.fasta -out assmbly_vs_flye.blastn -evalue 0.000001 -num_threads 15 -outfmt 6 -strand minus -max_target_seqs 1
-
Submit all genomes to NCBI
TODO: If 1585V using only the long reads to assemble the genome! BioSample accession BioProject: PRJNA1038700 Staphylococcus epidermidis strain:1585v | isolate:1585v Genome sequencing SAMN38198576 Pathogen: clinical or host-associated sample from Staphylococcus epidermidis 0 SRAs Status To be released Release date 2027-11-10 Created 2023-11-10 15:24 Updated 2023-11-17 16:57 Sample name 1585v Package Pathogen: clinical or host-associated; version 1.0 Organism Name: Staphylococcus epidermidis Taxonomy ID: 1282 Attributes Attribute name Attribute value collected by H R collection date 2004 geographic location Germany: Hamburg host Homo sapiens host disease port-catheter infection isolation source port-catheter isolate missing strain 1585v latitude and longitude 53.551672 N 9.955081 E https://www.ncbi.nlm.nih.gov/genbank/genomesubmit/#run_pgap
-
Background of 1585v and 1585
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8346721/ S. epidermidis 1585 is known to be biofilm-negative in laboratory media, but to form biofilm in the presence of human serum.
In contrast, S. epidermidis 1585 v is a variant derived from strain 1585 in which, due to a chromosomal re-arrangement, a 460 kDa isoform of Embp is overexpressed even in TSB, while mutant M135 is an isogenic mutant of 1585 v in which expression of Embp is interrupted by insertion of transposon Tn917.
Staphylococcus epidermidis (S. epidermidis) 1585 is a specific strain of S. epidermidis that is classified as a wild-type strain. Wild-type in bacterial terminology refers to the strain of an organism that is found in nature, as opposed to those that have been modified or mutated in a laboratory setting. Here are some key points about the 1585 wild-type strain of S. epidermidis:
-
No Embp Production in TSB: One notable characteristic of the S. epidermidis 1585 strain is that it does not produce Embp (extracellular matrix binding protein) when grown in TSB (Tryptic Soy Broth). Embp is a protein that plays a crucial role in the biofilm formation and adherence of bacteria to surfaces. The absence of Embp production in this strain could impact its ability to form biofilms, a common virulence factor in Staphylococcus infections.
-
Biofilm Formation: S. epidermidis is known for its ability to form biofilms, especially on medical devices, leading to infections that are difficult to treat. The fact that the 1585 strain doesn’t produce Embp in TSB suggests it may have a reduced capacity for biofilm formation under these conditions, which could be significant in understanding and managing such infections.
-
Research and Clinical Implications: Studying wild-type strains like S. epidermidis 1585 is important for understanding the natural behavior and characteristics of the species. Since this strain behaves differently from other strains in terms of Embp production and possibly biofilm formation, it can provide insights into the mechanisms and genetic factors that control these processes. This knowledge is valuable for developing strategies to prevent and treat infections, especially in hospital and healthcare settings where S. epidermidis infections are common.
-
Genetic Studies: The 1585 strain can also serve as a baseline or control in genetic studies. By comparing the genome and behavior of 1585 with other strains of S. epidermidis, researchers can identify genetic variations and mutations that may be responsible for different phenotypes, such as increased virulence or antibiotic resistance.
Model Organism for Understanding Staphylococcal Behavior: As a wild-type strain, 1585 offers a model for studying the natural state of S. epidermidis. This is crucial for understanding the fundamental biology of the bacterium, which can help in the development of treatments and interventions against infections caused by more virulent or drug-resistant strains. In summary, the S. epidermidis
1585 wild-type strain is significant in microbiological research due to its natural characteristics, particularly its behavior in biofilm formation and Embp production. Understanding these aspects can contribute to better insights into the pathogenicity and treatment of Staphylococcus infections, particularly in clinical settings where these bacteria are a common source of nosocomial infections.