
Prepare the virus X14112 GTF for the nextflow run

  1. install conda environment rnaseq_2021

    conda create -n rnaseq_2021 python=3.6.7
    conda activate rnaseq_2021
    conda install -c conda-forge -c bioconda -c defaults  nextflow=21.04 fastqc=0.11.8 trim-galore=0.5.0 star=2.6.1d hisat2=2.1.0
    conda install -c conda-forge -c bioconda -c defaults  picard=2.18.27 csvtk=0.17.0 preseq=2.0.3 
    conda install -c conda-forge -c bioconda -c defaults  samtools=1.9
    conda install -c conda-forge -c bioconda -c defaults  gffread=0.9.12
    conda install -c conda-forge -c bioconda -c defaults  subread=1.6.4
    conda install -c conda-forge -c bioconda -c defaults  deeptools=3.2.0
    conda install -c conda-forge -c bioconda -c defaults  multiqc=1.7    #*
    conda install -c conda-forge -c bioconda -c defaults  conda-forge::r-data.table=1.12.0 conda-forge::r-gplots=3.0.1.1 bioconductor-dupradar=1.12.1 bioconductor-edger=3.24.1 
    conda install -c conda-forge -c bioconda -c defaults  stringtie=1.3.5
    conda install -c conda-forge -c bioconda -c defaults  rseqc=3.0.0
  2. prepare the virus gtf-file

    # -- processing for virus gtf --
    # #gffread X14112.1_gene.gff -T -o X14112.1_gene.gtf
    # cp X14112.1.gff X14112.1.gff3
    # #gffread -E -F --bed X14112.1.gff3 -o X14112.1.bed (change the name errors in 1 intron and 2 genes)
    # grep "^##" X14112.1.gff3 > X14112.1_gene.gff3
    # 
    # # --try to filter the file with genes --> failed --
    # grep "ID=gene" X14112.1.gff3 >> X14112.1_gene.gff
    # cp X14112.1.gff3 X14112.1.gff
    
    # -- generating *_gene.gtf file containing only gene records --
    python3 gff2gtf.py
    # -- check if gene_id is unique --
    cut -f9- -d$'\t' X14112.1_gene.gtf > temp
    cut -f1 -d';' temp > temp_  #111
    sort temp_ > temp_1
    sort temp_ -u > temp_2
    diff temp_1 temp_2
    #39d38
    #< ID=gene-UL29
    #59d57
    #< ID=gene-UL43
    #--> delete short ones of the repeated records --> 109 records
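    The same uniqueness check can be scripted; a minimal Python sketch (it assumes the attribute column layout of the GTF written by gff2gtf.py, with the ID/gene_id as the first attribute):

    # check_gene_ids.py -- sketch of the cut/sort/diff check above
    from collections import Counter

    counts = Counter()
    with open("X14112.1_gene.gtf") as gtf:
        for line in gtf:
            if line.startswith("#") or not line.strip():
                continue
            attrs = line.rstrip("\n").split("\t")[8]      # 9th column: attributes
            counts[attrs.split(";")[0].strip()] += 1      # first attribute, e.g. ID=gene-UL29

    for attr, n in counts.items():
        if n > 1:
            print(attr, n)                                # expect ID=gene-UL29, ID=gene-UL43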
    
    python3 extends.py #generating the file X14112.1_gene_extended.gtf
    #Then replace 'transcript_id "exon' --> 'transcript_id "rna' in X14112.1_gene_extended.gtf
    
    gffread -E -F --bed X14112.1_gene_extended.gtf -o X14112.1_gene_extended.bed
    #-->bed contains 109 transcript-name for example "rna-gene-RS1-2"
    
    ##!!!!OPTIONAL!!!!: no need to change the feature type '\tgene\t' to '\texon\t', since X14112.1_gene_extended.gtf already contains exon records.
    nextflow run rnaseq_old/main.nf --reads "/home/jhuang/DATA/Data_Manja_RNAseq_Organoids_Virus/Raw_Data/*.fastq.gz" --fasta "X14112.1.fasta" --gtf "X14112.1_gene_extended.gtf" --bed12 "X14112.1_gene_extended.bed"  --singleEnd -profile standard --aligner star --fcGroupFeaturesType gene_biotype --skip_genebody_coverage -resume --saveReference
    
    # -- correct some special records (optional, as the processing above didn't generate these records) --
    # delete the lines starting with "#"
    # replace "X14112.1:146805..151063" to IE175
    # replace "X14112.1:133941..146107" to IE68
    # add ;gene=IE68 ;gene=IE175 to the corresponding lines
    # -- Python code for converting GFF to GTF --
    # open the input file for reading and the output file for writing
    
    # -- scripts choose gene or exon or mRNA --
    # python3 gff2gtf.py #X14112.1.gff-->X14112.1.gtf
    # replace '; transcript_id "gene' to '; transcript_id "tx'
    # !!!!VERY_IMPORTANT!!!!: change type '\tgene\t' to '\texon\t'! 
    # sed -i -e "s/\tgene\t/\texon\t/g" X14112.1_gene.gff # since default is --featurecounts_feature_type 'exon'. 
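    The helper script gff2gtf.py itself is not shown in this post. A minimal Python sketch of what it is assumed to do, based on the notes above (keep the gene records of the GFF3 file and rewrite the attribute column into GTF-style gene_id/transcript_id; file names as above):

    # gff2gtf.py -- assumed behaviour only; the real script may differ
    with open("X14112.1.gff3") as gff, open("X14112.1_gene.gtf", "w") as gtf:
        for line in gff:
            if line.startswith("#"):
                continue
            cols = line.rstrip("\n").split("\t")
            if len(cols) != 9 or cols[2] != "gene":
                continue
            # GFF3 attributes look like: ID=gene-UL29;Name=UL29;...
            attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
            gene_id = attrs.get("ID", "NA")
            cols[8] = 'gene_id "%s"; transcript_id "%s";' % (gene_id, gene_id)
            gtf.write("\t".join(cols) + "\n")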
  3. nextflow command: the input should be *.umi_extract.fastq.gz.

    #SUCCESSFUL
    (rnaseq) jhuang@hamburg:~/DATA/Data_Manja_RNAseq_Organoids_Virus$ /home/jhuang/anaconda3/bin/nextflow run rnaseq_old/main.nf --reads "/home/jhuang/DATA/Data_Manja_RNAseq_Organoids_Virus/Raw_Data/*.fastq.gz" --fasta "/home/jhuang/DATA/Data_Manja_RNAseq_Organoids_Virus/X14112.1.fasta" --gtf "/home/jhuang/DATA/Data_Manja_RNAseq_Organoids_Virus/X14112.1_gene_extended.gtf" --bed12 "/home/jhuang/DATA/Data_Manja_RNAseq_Organoids_Virus/X14112.1_gene_extended.bed" --singleEnd -profile standard --aligner hisat2 --fcGroupFeaturesType gene_biotype --skip_genebody_coverage --skip_preseq -resume --saveReference
    
    #NOT_TESTED
    (rnaseq_2021) jhuang@hamm:~/DATA/Data_Manja_RNAseq_Organoids_Virus$ nextflow run rnaseq_old/main.nf --reads "/home/jhuang/DATA/Data_Manja_RNAseq_Organoids_Virus/Raw_Data/*.fastq.gz" --fasta "/home/jhuang/DATA/Data_Manja_RNAseq_Organoids_Virus/X14112.1.fasta" --gtf "/home/jhuang/DATA/Data_Manja_RNAseq_Organoids_Virus/X14112.1_gene_extended.gtf" --bed12 "/home/jhuang/DATA/Data_Manja_RNAseq_Organoids_Virus/X14112.1_gene_extended.bed" --singleEnd -profile standard --aligner star  -resume --saveReference
  4. snippet of the human hg38 GTF serving as a pattern

    1       ensembl_havana  gene    685679  686673  .       -       .       gene_id "ENSG00000284662"; gene_version "1"; gene_name "OR4F16"; gene_source "ensembl_havana"; gene_biotype "protein_coding";
    
    1       ensembl_havana  transcript      685679  686673  .       -       .       gene_id "ENSG00000284662"; gene_version "1"; transcript_id "ENST00000332831"; transcript_version "4"; gene_name "OR4F16"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "OR4F16-201"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS41221"; tag "basic"; transcript_support_level "NA (assigned to previous version 3)";
    
    1       ensembl_havana  exon    685679  686673  .       -       .       gene_id "ENSG00000284662"; gene_version "1"; transcript_id "ENST00000332831"; transcript_version "4"; exon_number "1"; gene_name "OR4F16"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "OR4F16-201"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS41221"; exon_id "ENSE00002324228"; exon_version "3"; tag "basic"; transcript_support_level "NA (assigned to previous version 3)";
    
    1       ensembl_havana  CDS     685719  686654  .       -       0       gene_id "ENSG00000284662"; gene_version "1"; transcript_id "ENST00000332831"; transcript_version "4"; exon_number "1"; gene_name "OR4F16"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "OR4F16-201"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS41221"; protein_id "ENSP00000329982"; protein_version "2"; tag "basic"; transcript_support_level "NA (assigned to previous version 3)";
    
    1       ensembl_havana  gene    1211340 1214153 .       -       .       gene_id "ENSG00000186827"; gene_version "11"; gene_name "TNFRSF4"; gene_source "ensembl_havana"; gene_biotype "protein_coding";
    
    1       havana  transcript      1211340 1214138 .       -       .       gene_id "ENSG00000186827"; gene_version "11"; transcript_id "ENST00000497869"; transcript_version "5"; gene_name "TNFRSF4"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "TNFRSF4-203"; transcript_source "havana"; transcript_biotype "retained_intron"; transcript_support_level "2";
    
    1       havana  exon    1213983 1214138 .       -       .       gene_id "ENSG00000186827"; gene_version "11"; transcript_id "ENST00000497869"; transcript_version "5"; exon_number "1"; gene_name "TNFRSF4"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "TNFRSF4-203"; transcript_source "havana"; transcript_biotype "retained_intron"; exon_id "ENSE00001480264"; exon_version "3"; transcript_support_level "2";
    1       havana  exon    1212992 1213785 .       -       .       gene_id "ENSG00000186827"; gene_version "11"; transcript_id "ENST00000497869"; transcript_version "5"; exon_number "2"; gene_name "TNFRSF4"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "TNFRSF4-203"; transcript_source "havana"; transcript_biotype "retained_intron"; exon_id "ENSE00001906619"; exon_version "1"; transcript_support_level "2";
    1       havana  exon    1212638 1212704 .       -       .       gene_id "ENSG00000186827"; gene_version "11"; transcript_id "ENST00000497869"; transcript_version "5"; exon_number "3"; gene_name "TNFRSF4"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "TNFRSF4-203"; transcript_source "havana"; transcript_biotype "retained_intron"; exon_id "ENSE00003550137"; exon_version "1"; transcript_support_level "2";
    1       havana  exon    1211942 1212138 .       -       .       gene_id "ENSG00000186827"; gene_version "11"; transcript_id "ENST00000497869"; transcript_version "5"; exon_number "4"; gene_name "TNFRSF4"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "TNFRSF4-203"; transcript_source "havana"; transcript_biotype "retained_intron"; exon_id "ENSE00003604411"; exon_version "1"; transcript_support_level "2";
    1       ensembl_havana  exon    1211704 1211832 .       -       .       gene_id "ENSG00000186827"; gene_version "11"; transcript_id "ENST00000379236"; transcript_version "4"; exon_number "6"; gene_name "TNFRSF4"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "TNFRSF4-201"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS11"; exon_id "ENSE00001333051"; exon_version "1"; tag "basic"; transcript_support_level "1 (assigned to previous version 3)";
    1       ensembl_havana  CDS     1211704 1211832 .       -       2       gene_id "ENSG00000186827"; gene_version "11"; transcript_id "ENST00000379236"; transcript_version "4"; exon_number "6"; gene_name "TNFRSF4"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "TNFRSF4-201"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS11"; protein_id "ENSP00000368538"; protein_version "3"; tag "basic"; transcript_support_level "1 (assigned to previous version 3)";
    1       ensembl_havana  exon    1211340 1211625 .       -       .       gene_id "ENSG00000186827"; gene_version "11"; transcript_id "ENST00000379236"; transcript_version "4"; exon_number "7"; gene_name "TNFRSF4"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "TNFRSF4-201"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS11"; exon_id "ENSE00001915458"; exon_version "2"; tag "basic"; transcript_support_level "1 (assigned to previous version 3)";
    1       ensembl_havana  CDS     1211558 1211625 .       -       2       gene_id "ENSG00000186827"; gene_version "11"; transcript_id "ENST00000379236"; transcript_version "4"; exon_number "7"; gene_name "TNFRSF4"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "TNFRSF4-201"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS11"; protein_id "ENSP00000368538"; protein_version "3"; tag "basic"; transcript_support_level "1 (assigned to previous version 3)";
    1       ensembl_havana  stop_codon      1211555 1211557 .       -       0       gene_id "ENSG00000186827"; gene_version "11"; transcript_id "ENST00000379236"; transcript_version "4"; exon_number "7"; gene_name "TNFRSF4"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "TNFRSF4-201"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS11"; tag "basic"; transcript_support_level "1 (assigned to previous version 3)";
    1       ensembl_havana  five_prime_utr  1214128 1214153 .       -       .       gene_id "ENSG00000186827"; gene_version "11"; transcript_id "ENST00000379236"; transcript_version "4"; gene_name "TNFRSF4"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "TNFRSF4-201"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS11"; tag "basic"; transcript_support_level "1 (assigned to previous version 3)";
    1       ensembl_havana  three_prime_utr 1211340 1211554 .       -       .       gene_id "ENSG00000186827"; gene_version "11"; transcript_id "ENST00000379236"; transcript_version "4"; gene_name "TNFRSF4"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "TNFRSF4-201"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS11"; tag "basic"; transcript_support_level "1 (assigned to previous version 3)";
    
    1       havana  transcript      1211340 1214138 .       -       .       gene_id "ENSG00000186827"; gene_version "11"; transcript_id "ENST00000497869"; transcript_version "5"; gene_name "TNFRSF4"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "TNFRSF4-203"; transcript_source "havana"; transcript_biotype "retained_intron"; transcript_support_level "2";
    1       havana  exon    1213983 1214138 .       -       .       gene_id "ENSG00000186827"; gene_version "11"; transcript_id "ENST00000497869"; transcript_version "5"; exon_number "1"; gene_name "TNFRSF4"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "TNFRSF4-203"; transcript_source "havana"; transcript_biotype "retained_intron"; exon_id "ENSE00001480264"; exon_version "3"; transcript_support_level "2";
    1       havana  exon    1212992 1213785 .       -       .       gene_id "ENSG00000186827"; gene_version "11"; transcript_id "ENST00000497869"; transcript_version "5"; exon_number "2"; gene_name "TNFRSF4"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "TNFRSF4-203"; transcript_source "havana"; transcript_biotype "retained_intron"; exon_id "ENSE00001906619"; exon_version "1"; transcript_support_level "2";
    1       havana  exon    1212638 1212704 .       -       .       gene_id "ENSG00000186827"; gene_version "11"; transcript_id "ENST00000497869"; transcript_version "5"; exon_number "3"; gene_name "TNFRSF4"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "TNFRSF4-203"; transcript_source "havana"; transcript_biotype "retained_intron"; exon_id "ENSE00003550137"; exon_version "1"; transcript_support_level "2";
    1       havana  exon    1211942 1212138 .       -       .       gene_id "ENSG00000186827"; gene_version "11"; transcript_id "ENST00000497869"; transcript_version "5"; exon_number "4"; gene_name "TNFRSF4"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "TNFRSF4-203"; transcript_source "havana"; transcript_biotype "retained_intron"; exon_id "ENSE00003604411"; exon_version "1"; transcript_support_level "2";
    1       havana  exon    1211340 1211832 .       -       .       gene_id "ENSG00000186827"; gene_version "11"; transcript_id "ENST00000497869"; transcript_version "5"; exon_number "5"; gene_name "TNFRSF4"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "TNFRSF4-203"; transcript_source "havana"; transcript_biotype "retained_intron"; exon_id "ENSE00001923078"; exon_version "1"; transcript_support_level "2";
    
    1       havana  transcript      1212019 1213498 .       -       .       gene_id "ENSG00000186827"; gene_version "11"; transcript_id "ENST00000453580"; transcript_version "1"; gene_name "TNFRSF4"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "TNFRSF4-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; transcript_support_level "3";
    1       havana  exon    1213395 1213498 .       -       .       gene_id "ENSG00000186827"; gene_version "11"; transcript_id "ENST00000453580"; transcript_version "1"; exon_number "1"; gene_name "TNFRSF4"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "TNFRSF4-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; exon_id "ENSE00001680308"; exon_version "1"; transcript_support_level "3";
    1       havana  exon    1212992 1213093 .       -       .       gene_id "ENSG00000186827"; gene_version "11"; transcript_id "ENST00000453580"; transcript_version "1"; exon_number "2"; gene_name "TNFRSF4"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "TNFRSF4-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; exon_id "ENSE00003495433"; exon_version "1"; transcript_support_level "3";
    1       havana  exon    1212638 1212704 .       -       .       gene_id "ENSG00000186827"; gene_version "11"; transcript_id "ENST00000453580"; transcript_version "1"; exon_number "3"; gene_name "TNFRSF4"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "TNFRSF4-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; exon_id "ENSE00003550137"; exon_version "1"; transcript_support_level "3";
    1       havana  exon    1212019 1212138 .       -       .       gene_id "ENSG00000186827"; gene_version "11"; transcript_id "ENST00000453580"; transcript_version "1"; exon_number "4"; gene_name "TNFRSF4"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "TNFRSF4-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; exon_id "ENSE00001723250"; exon_version "1"; transcript_support_level "3";
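    To check how closely the virus GTF follows this pattern, a small Python sketch (file name taken from step 2; whether transcript records are strictly required depends on the tools, e.g. featureCounts only needs exon records):

    # check_gtf_pattern.py -- sketch: list gene_ids that lack gene/transcript/exon records
    import re
    from collections import defaultdict

    features = defaultdict(set)
    with open("X14112.1_gene_extended.gtf") as gtf:
        for line in gtf:
            if line.startswith("#"):
                continue
            cols = line.rstrip("\n").split("\t")
            m = re.search(r'gene_id "([^"]+)"', cols[8])
            if m:
                features[m.group(1)].add(cols[2])         # feature type: gene/transcript/exon/...

    for gene_id, feats in sorted(features.items()):
        missing = {"gene", "transcript", "exon"} - feats
        if missing:
            print(gene_id, "missing:", ", ".join(sorted(missing)))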
  5. generate wig files

    samtools faidx X14112.1.fasta
    cut -f1,2 X14112.1.fasta.fai > results/markDuplicates/chrom.sizes
    cd results/markDuplicates/
    for sample in control_r1 control_r2 HSV.d2_r1 HSV.d2_r2 HSV.d4_r1 HSV.d4_r2 HSV.d6_r1 HSV.d6_r2 HSV.d8_r1 HSV.d8_r2; do
        #bamCoverage -b ${sample}.umi_extract.sorted.markDups.bam -o ${sample}.bw
        bedtools genomecov -ibam ${sample}.umi_extract.sorted.markDups.bam -bg > ${sample}.bedgraph   # needed for bedGraphToBigWig below
        bedGraphToBigWig ${sample}.bedgraph chrom.sizes ${sample}.bw
        bigWigToWig ${sample}.bw ${sample}.wig
    done
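    As a cross-check, the per-base coverage/bedGraph can also be produced in Python; a sketch assuming pysam is installed and the BAM files are indexed (the pipeline above uses bedtools genomecov and bedGraphToBigWig):

    # make_bedgraph.py -- sketch only; fine for a small virus genome, slow for hg38
    import pysam

    bam = pysam.AlignmentFile("control_r1.umi_extract.sorted.markDups.bam", "rb")
    with open("control_r1.bedgraph", "w") as out:
        for contig, length in zip(bam.references, bam.lengths):
            acgt = bam.count_coverage(contig, 0, length)          # per-base counts for A,C,G,T
            depth = [sum(col) for col in zip(*acgt)]
            start, prev = 0, depth[0] if depth else 0
            for pos in range(1, len(depth)):
                if depth[pos] != prev:
                    if prev > 0:
                        out.write(f"{contig}\t{start}\t{pos}\t{prev}\n")
                    start, prev = pos, depth[pos]
            if depth and prev > 0:
                out.write(f"{contig}\t{start}\t{length}\t{prev}\n")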
  6. import and clean data using R

    #BiocManager::install(c("DESeq2"))
    requiredPackages1 <-c("AnnotationDbi","clusterProfiler","ReactomePA","org.Hs.eg.db","DESeq2", "gplots", "RColorBrewer")
    ipak <- function(pkg){
            new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
            if (length(new.pkg))
                    install.packages(new.pkg, dependencies = TRUE)
            sapply(pkg, require, character.only = TRUE)
    }
    ipak(requiredPackages1)
    #requiredPackages2 <- c("tidyverse")
    #ipak(requiredPackages2)
    
    #cut -f1-1 merged_gene_counts.txt > col1.txt
    #paste -d$'\t' col1.txt merged_gene_counts2.txt > merged_gene_counts3.txt
    #sed -i 's/gene-//g' merged_gene_counts3.txt
    
    #replace "X14112.1" to "X14112"; delete "rna-gene-"; get X14112.1_gene_extended2.gtf; using it in IGV
    
    library("AnnotationDbi")
    library("clusterProfiler")
    library("ReactomePA")
    library("org.Hs.eg.db")
    library(DESeq2)
    library(gplots)
    setwd("~/DATA/Data_Manja_RNAseq_Organoids_Virus/results/featureCounts")
    
    #---- dataset (27 samples; first import all samples, then subset to 27-3-1-2) ----
    d.raw<- read.delim2("merged_gene_counts3.txt",sep="\t", header=TRUE, row.names=1)
    #> head(d.raw,0)
    # [1] HSV.d4_r2.umi_extract.sorted.bam  HSV.d6_r1.umi_extract.sorted.bam 
    # [3] HSV.d4_r1.umi_extract.sorted.bam  control_r1.umi_extract.sorted.bam
    # [5] HSV.d2_r2.umi_extract.sorted.bam  control_r2.umi_extract.sorted.bam
    # [7] HSV.d2_r1.umi_extract.sorted.bam  HSV.d8_r2.umi_extract.sorted.bam 
    # [9] HSV.d8_r1.umi_extract.sorted.bam  HSV.d6_r2.umi_extract.sorted.bam 
    
    col.order <-  c("control_r1.umi_extract.sorted.bam","control_r2.umi_extract.sorted.bam", "HSV.d2_r1.umi_extract.sorted.bam","HSV.d2_r2.umi_extract.sorted.bam", "HSV.d4_r1.umi_extract.sorted.bam","HSV.d4_r2.umi_extract.sorted.bam", "HSV.d6_r1.umi_extract.sorted.bam","HSV.d6_r2.umi_extract.sorted.bam", "HSV.d8_r1.umi_extract.sorted.bam","HSV.d8_r2.umi_extract.sorted.bam")
    d <- d.raw[,col.order]  #reordered.raw
    #d <- reordered.raw[rowSums(reordered.raw>3)>2,]
    
    colnames(d) =  c("control_r1","control_r2", "HSV.d2_r1","HSV.d2_r2", "HSV.d4_r1","HSV.d4_r2", "HSV.d6_r1","HSV.d6_r2", "HSV.d8_r1","HSV.d8_r2")
    
    # Define the replicates and condition of the samples
    ids <- factor(c("control_r1","control_r2", "HSV.d2_r1","HSV.d2_r2", "HSV.d4_r1","HSV.d4_r2", "HSV.d6_r1","HSV.d6_r2", "HSV.d8_r1","HSV.d8_r2"))
    replicate <- factor(c("r1", "r2", "r1", "r2", "r1", "r2", "r1", "r2", "r1", "r2"))
    condition <- factor(c("control", "control", "HSV.d2", "HSV.d2", "HSV.d4", "HSV.d4", "HSV.d6", "HSV.d6", "HSV.d8", "HSV.d8"))
    
    # Construct the DESeqDataSet
    cData = data.frame(row.names=colnames(d), replicate=replicate, condition=condition, ids=ids)
    dds<-DESeqDataSetFromMatrix(countData=d, colData=cData, design=~condition)
    
    # Run DESeq (earlier runs skipped this step, probably WRONG?)
    dds <- DESeq(dds)
    
    # Apply the rlog transformation
    rld <- rlogTransformation(dds)
    #rld <- vst(dds)  #vsd
    
    #-- save raw_data as xls --
    write.csv(d, file="d.csv")
    #~/Tools/csv2xls-0.4/csv_to_xls.py d.csv -d$',' -o d.xls
  7. plot PCA and heatmap, and remove batch effect

    # -- pca --
    png("pca.png", 1200, 800)
    plotPCA(rld, intgroup=c("condition"))
    #plotPCA(rld, intgroup = c("condition", "batch"))
    #plotPCA(rld, intgroup = c("condition", "ids"))
    #plotPCA(rld, "batch")
    dev.off()
    
    # -- heatmap --
    ## generate the pairwise comparison between samples
    library(gplots) 
    library("RColorBrewer")
    png("heatmap.png", 1200, 800)
    distsRL <- dist(t(assay(rld)))
    mat <- as.matrix(distsRL)
    #paste( rld$dex, rld$cell, sep="-" )
    #rownames(mat) <- colnames(mat) <- with(colData(dds),paste(condition,batch, sep=":"))
    rownames(mat) <- colnames(mat) <- with(colData(dds),paste(condition,ids, sep=":"))
    hc <- hclust(distsRL)
    hmcol <- colorRampPalette(brewer.pal(9,"GnBu"))(100)
    heatmap.2(mat, Rowv=as.dendrogram(hc),symm=TRUE, trace="none",col = rev(hmcol), margin=c(13, 13))
    dev.off()
  8. select the differentially expressed genes

    #https://galaxyproject.eu/posts/2020/08/22/three-steps-to-galaxify-your-tool/
    #https://www.biostars.org/p/282295/
    #https://www.biostars.org/p/335751/
    
    #> condition
    # [1] control control HSV.d2  HSV.d2  HSV.d4  HSV.d4  HSV.d6  HSV.d6  HSV.d8  HSV.d8 
    #Levels: control HSV.d2 HSV.d4 HSV.d6 HSV.d8
    
    #CONSOLE: mkdir /home/jhuang/DATA/Data_Manja_RNAseq_Organoids_Virus/results/featureCounts/degenes
    setwd("/home/jhuang/DATA/Data_Manja_RNAseq_Organoids_Virus/results/featureCounts/degenes")
    #---- relevel to control ----
    dds$condition <- relevel(dds$condition, "control")
    dds = DESeq(dds, betaPrior=FALSE)
    resultsNames(dds)
    clist <- c("HSV.d2_vs_control","HSV.d4_vs_control", "HSV.d6_vs_control", "HSV.d8_vs_control")
    
    dds$condition <- relevel(dds$condition, "HSV.d2")
    dds = DESeq(dds, betaPrior=FALSE)
    resultsNames(dds)
    clist <- c("HSV.d4_vs_HSV.d2", "HSV.d6_vs_HSV.d2", "HSV.d8_vs_HSV.d2")
    
    dds$condition <- relevel(dds$condition, "HSV.d4")
    dds = DESeq(dds, betaPrior=FALSE)
    resultsNames(dds)
    clist <- c("HSV.d6_vs_HSV.d4", "HSV.d8_vs_HSV.d4")
    
    dds$condition <- relevel(dds$condition, "HSV.d6")
    dds = DESeq(dds, betaPrior=FALSE)
    resultsNames(dds)
    clist <- c("HSV.d8_vs_HSV.d6")
    
    for (i in clist) {
      contrast = paste("condition", i, sep="_")
      res = results(dds, name=contrast)
      res <- res[!is.na(res$log2FoldChange),]
      res_df <- as.data.frame(res)
    
      write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(i, "all.txt", sep="-"))
      up <- subset(res_df, padj<=0.05 & log2FoldChange>=1.2)
      down <- subset(res_df, padj<=0.05 & log2FoldChange<=-1.2)
      write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(i, "up.txt", sep="-"))
      write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(i, "down.txt", sep="-"))
    }
    
    ##https://github.com/kevinblighe/EnhancedVolcano
    #BiocManager::install("EnhancedVolcano")
    library("EnhancedVolcano")
    #for i in HSV.d2_vs_control HSV.d4_vs_control HSV.d6_vs_control HSV.d8_vs_control; do
    #for i in HSV.d4_vs_HSV.d2 HSV.d6_vs_HSV.d2 HSV.d8_vs_HSV.d2; do
    #for i in HSV.d6_vs_HSV.d4 HSV.d8_vs_HSV.d4; do
    for i in HSV.d8_vs_HSV.d6; do
      echo "contrast = paste(\"condition\", \"${i}\", sep=\"_\")"
      echo "res = results(dds, name=contrast)"
      #echo "res <- res[!is.na(res$log2FoldChange),]"
      echo "res <- na.omit(res)"
      echo "res_df <- as.data.frame(res)"
      #selectLab = selectLab_italics,
      echo "png(\"${i}.png\",width=1200, height=1000)"
      #legendPosition = 'right',legendLabSize = 12,  arrowheads = FALSE,
      echo "EnhancedVolcano(res, lab = rownames(res),x = 'log2FoldChange',y = 'padj', pCutoff=5e-2, FCcutoff=1.2, title='', subtitleLabSize = 18, pointSize = 3.0, labSize = 5.0, colAlpha=1, legendIconSize = 4.0, drawConnectors = TRUE, widthConnectors = 0.5, colConnectors = 'black', subtitle=expression(\"$(echo $i | cut -d'_' -f1) versus $(echo $i | cut -d'_' -f3)\"))"
      echo "dev.off()"
    done
    
    #under DIR degenes under KONSOLE
    for i in HSV.d2_vs_control HSV.d4_vs_control HSV.d6_vs_control HSV.d8_vs_control HSV.d4_vs_HSV.d2 HSV.d6_vs_HSV.d2 HSV.d8_vs_HSV.d2 HSV.d6_vs_HSV.d4 HSV.d8_vs_HSV.d4 HSV.d8_vs_HSV.d6; do
      echo "~/Tools/csv2xls-0.4/csv_to_xls.py ${i}-all.txt ${i}-up.txt ${i}-down.txt -d$',' -o ${i}.xls;"
    done

9 (optional). cluster the genes and draw a heatmap

    install.packages("gplots")
    library("gplots")

    for i in HSV.d2_vs_control HSV.d4_vs_control HSV.d6_vs_control HSV.d8_vs_control HSV.d4_vs_HSV.d2 HSV.d6_vs_HSV.d2 HSV.d8_vs_HSV.d2 HSV.d6_vs_HSV.d4 HSV.d8_vs_HSV.d4 HSV.d8_vs_HSV.d6; do
      echo "cut -d',' -f1-1 ${i}-up.txt > ${i}-up.id"
      echo "cut -d',' -f1-1 ${i}-down.txt > ${i}-down.id"
    done

    cat *.id | sort -u > ids
    #add a Gene_Id header as the first line and delete the quotation marks ("")
    GOI <- read.csv("ids")$Gene_Id
    RNASeq.NoCellLine <- assay(rld)

    #clustering methods: "ward.D", "ward.D2", "single", "complete", "average" (= UPGMA), "mcquitty" (= WPGMA), "median" (= WPGMC) or "centroid" (= UPGMC).  pearson or spearman
    datamat = RNASeq.NoCellLine[GOI, ]
    write.csv(as.data.frame(datamat), file ="significant_gene_expressions.txt")
    hr <- hclust(as.dist(1-cor(t(datamat), method="pearson")), method="complete")
    hc <- hclust(as.dist(1-cor(datamat, method="spearman")), method="complete")
    mycl = cutree(hr, h=max(hr$height)/1.5)
    mycol = c("YELLOW", "BLUE", "ORANGE", "MAGENTA", "CYAN", "RED", "GREEN", "MAROON", "LIGHTBLUE", "PINK", "MAGENTA", "LIGHTCYAN", "LIGHTRED", "LIGHTGREEN");
    mycol = mycol[as.vector(mycl)]

    #png("DEGs_heatmap.png", width=900, height=800)
    #cex.lab=10, labRow="",
    png("DEGs_heatmap.png", width=900, height=1000)
    heatmap.2(as.matrix(datamat),Rowv=as.dendrogram(hr),Colv = NA, dendrogram = 'row',
                scale='row',trace='none',col=bluered(75), 
                RowSideColors = mycol, margins=c(10,20), cexRow=1.5, srtCol=45, lhei = c(2, 8))  #rownames(datamat)  
    #heatmap.2(datamat, Rowv=as.dendrogram(hr), col=bluered(75), scale="row", RowSideColors=mycol, trace="none", margin=c(5,5), sepwidth=c(0,0), dendrogram = 'row', Colv = 'false', density.info='none', labRow="", srtCol=30, lhei=c(0.1,2))
    dev.off()

    #### cluster members #####
    write.csv(names(subset(mycl, mycl == '1')),file='cluster1_YELLOW.txt')
    write.csv(names(subset(mycl, mycl == '2')),file='cluster2_DARKBLUE.txt') 
    write.csv(names(subset(mycl, mycl == '3')),file='cluster3_DARKORANGE.txt')  
    #~/Tools/csv2xls-0.4/csv_to_xls.py cluster*.txt -d',' -o DEGs_heatmap_cluster_members.xls

    ~/Tools/csv2xls-0.4/csv_to_xls.py \
    significant_gene_expressions.txt \
    -d',' -o DEGs_heatmap_expression_data.xls;

Peak calling using HOMER, combining SICER and MACS2

  1. nextflow processing data

    V_8_1_6_p601_d8_D1_H3K4me3.fastq.gz,V_8_1_6_p601_d8_D1_input.fastq.gz,p601_H3K4me3_D1
    V_8_1_5_p601_d8_D2_H3K4me3.fastq.gz,V_8_1_5_p601_d8_D2_input.fastq.gz,p601_H3K4me3_D2
    V_8_1_6_p604_d8_D1_H3K4me3.fastq.gz,V_8_1_6_p604_d8_D1_input.fastq.gz,p604_H3K4me3_D1
    V_8_1_5_p604_d8_D2_H3K4me3.fastq.gz,V_8_1_5_p604_d8_D2_input.fastq.gz,p604_H3K4me3_D2
    V_8_1_6_p601_d8_D1_H3K27me3.fastq.gz,V_8_1_6_p601_d8_D1_input.fastq.gz,p601_H3K27me3_D1
    V_8_1_5_p601_d8_D2_H3K27me3.fastq.gz,V_8_1_5_p601_d8_D2_input.fastq.gz,p601_H3K27me3_D2
    V_8_1_6_p604_d8_D1_H3K27me3.fastq.gz,V_8_1_6_p604_d8_D1_input.fastq.gz,p604_H3K27me3_D1
    V_8_1_5_p604_d8_D2_H3K27me3.fastq.gz,V_8_1_5_p604_d8_D2_input.fastq.gz,p604_H3K27me3_D2
    V_8_1_7_p601_d8_D1_H3K9me3.fastq.gz,V_8_1_6_p601_d8_D1_input.fastq.gz,p601_H3K9me3_D1
    V_8_1_7_p601_d8_D2_H3K9me3.fastq.gz,V_8_1_5_p601_d8_D2_input.fastq.gz,p601_H3K9me3_D2
    V_8_1_7_p604_d8_D1_H3K9me3.fastq.gz,V_8_1_6_p604_d8_D1_input.fastq.gz,p604_H3K9me3_D1
    V_8_1_7_p604_d8_D2_H3K9me3.fastq.gz,V_8_1_5_p604_d8_D2_input.fastq.gz,p604_H3K9me3_D2
    V_8_1_8_p601_d8_D1_H3K27ac.fastq.gz,V_8_1_6_p601_d8_D1_input.fastq.gz,p601_H3K27ac_D1
    V_8_1_8_p601_d8_D2_H3K27ac.fastq.gz,V_8_1_5_p601_d8_D2_input.fastq.gz,p601_H3K27ac_D2
    V_8_1_8_p604_d8_D1_H3K27ac.fastq.gz,V_8_1_6_p604_d8_D1_input.fastq.gz,p604_H3K27ac_D1
    V_8_1_8_p604_d8_D2_H3K27ac.fastq.gz,V_8_1_5_p604_d8_D2_input.fastq.gz,p604_H3K27ac_D2
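    The macs.config lines above follow the simple pattern ChIP_fastq,input_fastq,analysis_id; a small Python sketch that writes them from a list of pairings (only the first two pairs shown here, extend with the remaining samples listed above):

    # write_macs_config.py -- sketch
    pairs = [
        ("V_8_1_6_p601_d8_D1_H3K4me3", "V_8_1_6_p601_d8_D1_input", "p601_H3K4me3_D1"),
        ("V_8_1_5_p601_d8_D2_H3K4me3", "V_8_1_5_p601_d8_D2_input", "p601_H3K4me3_D2"),
    ]
    with open("macs.config", "w") as cfg:
        for chip, ctrl, analysis_id in pairs:
            cfg.write(f"{chip}.fastq.gz,{ctrl}.fastq.gz,{analysis_id}\n")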
    
    nextflow run NGI-ChIPseq/main.nf --reads '/home/jhuang/DATA/Data_Denise_LT_DNA_Binding/ChIPseq_histone_hg38/H3K4me3_H3K27ac__H3K27me3_H3K9me3/Raw_Data_GEO_uploaded/*.fastq.gz' --genome hg38 --macsconfig macs.config --saveReference --saveAlignedIntermediates --singleEnd --blacklist_filtering -profile standard --project NHDF_enhancer_analysis_hg38 -resume
    
    nextflow run NGI-ChIPseq/main.nf --reads '/mnt/h1/jhuang/DATA/Data_Denise_LT_DNA_Binding/ChIPseq_histone_hg38/H3K27ac_H3K4me1_public/Raw_Data/*.fastq.gz' --genome hg38 --macsconfig macs.config --saveReference --saveAlignedIntermediates --singleEnd --blacklist_filtering -profile standard --project NHDF_enhancer_analysis_hg38 -resume
    
    (DEBUG: Control doesn't work well!)
    lrwxrwxrwx 1 jhuang jhuang   37 Mai 19 11:19 NHDF-Ad_H3K4me1_r1.fastq.gz -> ../Raw_Data_orig/SRR568344_1.fastq.gz
    lrwxrwxrwx 1 jhuang jhuang   37 Mai 19 11:19 NHDF-Ad_H3K4me1_r2.fastq.gz -> ../Raw_Data_orig/SRR568345_1.fastq.gz
    lrwxrwxrwx 1 jhuang jhuang   37 Mai 19 11:19 NHDF-Ad_H3K27ac_r1.fastq.gz -> ../Raw_Data_orig/SRR227397_1.fastq.gz
    lrwxrwxrwx 1 jhuang jhuang   37 Mai 19 11:19 NHDF-Ad_H3K27ac_r2.fastq.gz -> ../Raw_Data_orig/SRR227398_1.fastq.gz
    lrwxrwxrwx 1 jhuang jhuang   37 Mai 19 11:19 NHDF-Ad_Control_r1.fastq.gz -> ../Raw_Data_orig/SRR227590_1.fastq.gz
    lrwxrwxrwx 1 jhuang jhuang   37 Mai 19 11:19 NHDF-Ad_Control_r2.fastq.gz -> ../Raw_Data_orig/SRR227591_1.fastq.gz
  2. make HOMER tag directories and run findPeaks with HOMER (under myperl)

    conda activate myperl
    
    #Why do I need to give "-genome hg38" in makeTagDirectory?
    #If you don't provide a genome with the -genome option, HOMER will only count the number of tags in each region without any genomic context or sequence information. So it is essential to include this information when creating a tag directory if you plan to perform any genome-based analysis.
    
    makeTagDirectory p601_d8_D1_input ../results/picard/V_8_1_6_p601_d8_D1_input.dedup.sorted.bam -genome hg38
    makeTagDirectory p601_d8_D2_input ../results/picard/V_8_1_5_p601_d8_D2_input.dedup.sorted.bam -genome hg38
    makeTagDirectory p604_d8_D1_input ../results/picard/V_8_1_6_p604_d8_D1_input.dedup.sorted.bam -genome hg38
    makeTagDirectory p604_d8_D2_input ../results/picard/V_8_1_5_p604_d8_D2_input.dedup.sorted.bam -genome hg38
    makeTagDirectory p601_d8_D1_H3K4me3 ../results/picard/V_8_1_6_p601_d8_D1_H3K4me3.dedup.sorted.bam -genome hg38
    makeTagDirectory p601_d8_D2_H3K4me3 ../results/picard/V_8_1_5_p601_d8_D2_H3K4me3.dedup.sorted.bam -genome hg38
    makeTagDirectory p604_d8_D1_H3K4me3 ../results/picard/V_8_1_6_p604_d8_D1_H3K4me3.dedup.sorted.bam -genome hg38
    makeTagDirectory p604_d8_D2_H3K4me3 ../results/picard/V_8_1_5_p604_d8_D2_H3K4me3.dedup.sorted.bam -genome hg38
    makeTagDirectory p601_d8_D1_H3K27me3 ../results/picard/V_8_1_6_p601_d8_D1_H3K27me3.dedup.sorted.bam -genome hg38
    makeTagDirectory p601_d8_D2_H3K27me3 ../results/picard/V_8_1_5_p601_d8_D2_H3K27me3.dedup.sorted.bam -genome hg38
    makeTagDirectory p604_d8_D1_H3K27me3 ../results/picard/V_8_1_6_p604_d8_D1_H3K27me3.dedup.sorted.bam -genome hg38
    makeTagDirectory p604_d8_D2_H3K27me3 ../results/picard/V_8_1_5_p604_d8_D2_H3K27me3.dedup.sorted.bam -genome hg38
    makeTagDirectory p601_d8_D1_H3K27ac ../results/picard/V_8_1_8_p601_d8_D1_H3K27ac.dedup.sorted.bam -genome hg38
    makeTagDirectory p601_d8_D2_H3K27ac ../results/picard/V_8_1_8_p601_d8_D2_H3K27ac.dedup.sorted.bam -genome hg38
    makeTagDirectory p604_d8_D1_H3K27ac ../results/picard/V_8_1_8_p604_d8_D1_H3K27ac.dedup.sorted.bam -genome hg38
    makeTagDirectory p604_d8_D2_H3K27ac ../results/picard/V_8_1_8_p604_d8_D2_H3K27ac.dedup.sorted.bam -genome hg38
    
    for sample in p601_d8_D1_input p601_d8_D2_input p604_d8_D1_input p604_d8_D2_input  p601_d8_D1_H3K4me3 p601_d8_D2_H3K4me3 p604_d8_D1_H3K4me3 p604_d8_D2_H3K4me3  p601_d8_D1_H3K27me3 p601_d8_D2_H3K27me3 p604_d8_D1_H3K27me3 p604_d8_D2_H3K27me3  p601_d8_D1_H3K27ac p601_d8_D2_H3K27ac p604_d8_D1_H3K27ac p604_d8_D2_H3K27ac; do
        makeUCSCfile ${sample} -pseudo 1 -bigWig /home/jhuang/REFs/hg38.chromSizes -o auto -style chipseq    -norm 1e7 -normLength 100 -fsize 1
    done
    
    # -- not necessary any more: using MACS2 and SICER instead of using findPeaks --
    # #factor (transcription factor ChIP-Seq, uses -center, output: peaks.txt,  default)
    # #histone (histone modification ChIP-Seq, region based, uses -region -size 500 -L 0, regions.txt)
    # for sample in p601_d8_D1 p601_d8_D2 p604_d8_D1 p604_d8_D2; do
    #   #Finding peaks of size 1000, no closer than 2000
    #   findPeaks ${sample}_H3K4me3 -style factor -size 1000  -o auto -i ${sample}_input
    #   #-minDist <#> (minimum distance between peaks, default: peak size x2)
    #   #findPeaks ${sample}_H3K27me3 -style histone -region -size 3000 -minDist 5000 -o auto -i ${sample}_input
    #   #findPeaks ${sample}_H3K27ac -style factor -size 200 -minDist 200 -o auto -i ${sample}_input
    #   #findPeaks ${sample}_H3K4me1 -style histone -region -size 1000 -minDist 2500 -o auto -i ${sample}_input
    # done
    
    ./p601_d8_D1_H3K4me3/peaks.txt
    ./p601_d8_D2_H3K4me3/peaks.txt
    ./p604_d8_D1_H3K4me3/peaks.txt
    ./p604_d8_D2_H3K4me3/peaks.txt
    
    ./p601_d8_D1_H3K27me3/regions.txt
    ./p601_d8_D2_H3K27me3/regions.txt
    ./p604_d8_D1_H3K27me3/regions.txt
    ./p604_d8_D2_H3K27me3/regions.txt
    
    for dir in p601_d8_D1_H3K4me3 p601_d8_D2_H3K4me3 p604_d8_D1_H3K4me3 p604_d8_D2_H3K4me3; do
    awk -v OFS='\t' '{print $2, $3, $4, $1, $6}' ./${dir}/peaks.txt > ${dir}_peaks.bed
    grep -v "#" ${dir}_peaks.bed | sort -k1,1 -k2,2n > ${dir}_sorted_peaks.bed
    done
    
    for dir in p601_d8_D1_H3K27me3 p601_d8_D2_H3K27me3 p604_d8_D1_H3K27me3 p604_d8_D2_H3K27me3; do
    awk -v OFS='\t' '{print $2, $3, $4, $1, $6}' ./${dir}/regions.txt > ${dir}_regions.bed
    grep -v "#" ${dir}_regions.bed | sort -k1,1 -k2,2n > ${dir}_sorted_regions.bed
    done
    
    #DEBUG: why are the BAM files so small?
    makeTagDirectory NHDF-Ad_Control_r1 /home/jhuang/DATA/Data_Denise_LT_DNA_Bindung/results_ChIPseq_histone_hg38/H3K27ac_H3K4me1_public/results/picard/NHDF_Control_r1.dedup.sorted.bam -genome hg38
    makeTagDirectory NHDF-Ad_Control_r2 /home/jhuang/DATA/Data_Denise_LT_DNA_Bindung/results_ChIPseq_histone_hg38/H3K27ac_H3K4me1_public/results/picard/NHDF_Control_r2.dedup.sorted.bam -genome hg38
    makeTagDirectory NHDF-Ad_H3K27ac_r1 /home/jhuang/DATA/Data_Denise_LT_DNA_Bindung/results_ChIPseq_histone_hg38/H3K27ac_H3K4me1_public/results/picard/NHDF_H3K27ac_r1.dedup.sorted.bam -genome hg38
    makeTagDirectory NHDF-Ad_H3K27ac_r2 /home/jhuang/DATA/Data_Denise_LT_DNA_Bindung/results_ChIPseq_histone_hg38/H3K27ac_H3K4me1_public/results/picard/NHDF_H3K27ac_r2.dedup.sorted.bam -genome hg38
    makeTagDirectory NHDF-Ad_H3K4me1_r1 /home/jhuang/DATA/Data_Denise_LT_DNA_Bindung/results_ChIPseq_histone_hg38/H3K27ac_H3K4me1_public/results/picard/NHDF_H3K4me1_r1.dedup.sorted.bam -genome hg38
    makeTagDirectory NHDF-Ad_H3K4me1_r2 /home/jhuang/DATA/Data_Denise_LT_DNA_Bindung/results_ChIPseq_histone_hg38/H3K27ac_H3K4me1_public/results/picard/NHDF_H3K4me1_r2.dedup.sorted.bam -genome hg38
    
    NHDF-Ad_Control_r1 NHDF-Ad_Control_r2 NHDF-Ad_H3K27ac_r1 NHDF-Ad_H3K27ac_r2 NHDF-Ad_H3K4me1_r1 NHDF-Ad_H3K4me1_r2
    
    > The (myperl) environment provides HOMER, ~/Tools/diffreps/bin/diffReps.pl, MACS2 and ~/Tools/SICER1.1/SICER/SICER.sh
  3. combine the diffReps.pl and HOMER annotatePeaks.pl

    #Dynamic regions were defined as MACS (H3K4me3, H3K27ac) or SICER (H3K4me1, H3K27me3) peaks overlapping significantly (≥ 2-fold change, adjusted P-value ≤ 0.05) up- or down-regulated differentially enriched regions from diffReps in the three pairwise comparisons WAC vs mock, WA314 vs mock and WAC vs WA314.
    
    #STEP1
    #--> not given "--gname hg38"
    ## Step4: Annotate differential sites.
    #unless($noanno or $gname eq ''){
    #        `region_analysis.pl -i $report -r -d refseq -g $gname`;
    #}
    ## Step5: Look for hotspots.
    #unless($nohs){
    #        my $hotspot = $report . '.hotspot';
    #        `findHotspots.pl -d $report -o $hotspot`;
    #}
    ~/Tools/diffreps/bin/diffReps.pl -tr ../results/picard/V_8_1_6_p601_d8_D1_H3K4me3.dedup.sorted.bed ../results/picard/V_8_1_5_p601_d8_D2_H3K4me3.dedup.sorted.bed -co ../results/picard/V_8_1_6_p601_d8_D1_input.dedup.sorted.bed ../results/picard/V_8_1_5_p601_d8_D2_input.dedup.sorted.bed  --report output_results  --chrlen /home/jhuang/REFs/hg38.chromSizes --nsd sharp
    
    #STEP2
    #replace 'Chr' with '#Chr' in the header line of output_results
    grep -v "#" output_results | sort -k1,1 -k2,2n > output_results_
    awk 'BEGIN {OFS="\t"} {print $1, $2, $3, "diffreps_peak_"NR, $12}' output_results_ > H3K4me3.bed
    #grep -v "#" H3K4me3.bed | sort -k1,1 -k2,2n > H3K4me3_sorted_peaks.bed
    
    #STEP3 (under myperl): peak calling with MACS2 for narrow peaks, SICER for broad peaks!
    #the output of diffReps.pl was converted to a BED file in STEP2; now annotate it:
    annotatePeaks.pl H3K4me3.bed hg38 > H3K4me3_annotated_peaks.txt
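    For reference, the STEP2 conversion above written in Python (a sketch; it assumes the diffReps report is tab-separated, with Chr/Start/End in columns 1-3 and the value used as BED score in column 12, as in the awk one-liner):

    # diffreps_to_bed.py -- sketch of STEP2 above
    rows = []
    with open("output_results") as rep:
        for line in rep:
            if line.startswith("#") or line.startswith("Chr\t"):
                continue
            cols = line.rstrip("\n").split("\t")
            rows.append((cols[0], int(cols[1]), int(cols[2]), cols[11]))

    rows.sort(key=lambda r: (r[0], r[1]))
    with open("H3K4me3.bed", "w") as bed:
        for i, (chrom, start, end, score) in enumerate(rows, 1):
            bed.write(f"{chrom}\t{start}\t{end}\tdiffreps_peak_{i}\t{score}\n")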
  4. combine MACS2 with getDifferentialPeaksReplicates.pl

    replace the initial peak identification by using your MACS2 output.
    
    #http://homer.ucsd.edu/homer/ngs/diffExpression.html
    #getDifferentialPeaksReplicates.pl = findPeaks + annotatePeaks.pl + getDiffExpression.pl 
    #annotatePeaks.pl tss hg38 -raw -d H3K4me3-Mock-rep1/ H3K4me3-Mock-rep2/ H3K4me3-WNT-rep1/ H3K4me3-WNT-rep3/ > countTable.peaks.txt
    
    Here's an outline of how we might be able to replace the initial peak identification by using your MACS2 output.
    
    #TODO: use MACS2 to call peaks on the H3K27ac data.

4.1. MACS2 peak calling

  #macs2 --> bed --> annotatePeaks.pl
  conda activate ngi_chipseq_ac2
  macs2 callpeak -t ../results/picard/V_8_1_6_p601_d8_D1_H3K4me3.dedup.sorted.bam -c ../results/picard/V_8_1_6_p601_d8_D1_input.dedup.sorted.bam -f BAM -g hs -n p601_d8_D1 -q 0.05
  macs2 callpeak -t ../results/picard/V_8_1_5_p601_d8_D2_H3K4me3.dedup.sorted.bam -c ../results/picard/V_8_1_5_p601_d8_D2_input.dedup.sorted.bam -f BAM -g hs -n p601_d8_D2 -q 0.05
  macs2 callpeak -t ../results/picard/V_8_1_6_p604_d8_D1_H3K4me3.dedup.sorted.bam -c ../results/picard/V_8_1_6_p604_d8_D1_input.dedup.sorted.bam -f BAM -g hs -n p604_d8_D1 -q 0.05
  macs2 callpeak -t ../results/picard/V_8_1_5_p604_d8_D2_H3K4me3.dedup.sorted.bam -c ../results/picard/V_8_1_5_p604_d8_D2_input.dedup.sorted.bam -f BAM -g hs -n p604_d8_D2 -q 0.05

  awk 'OFS="\t" {print $1, $2, $3, $4, $5}' p601_d8_D1_peaks.narrowPeak > p601_d8_D1_peaks.bed
  awk 'OFS="\t" {print $1, $2, $3, $4, $5}' p601_d8_D2_peaks.narrowPeak > p601_d8_D2_peaks.bed
  awk 'OFS="\t" {print $1, $2, $3, $4, $5}' p604_d8_D1_peaks.narrowPeak > p604_d8_D1_peaks.bed
  awk 'OFS="\t" {print $1, $2, $3, $4, $5}' p604_d8_D2_peaks.narrowPeak > p604_d8_D2_peaks.bed

  #annotatePeaks.pl p601_d8_D1_peaks.bed hg38 > p601_d8_D1_annotated_peaks.txt
  #annotatePeaks.pl p601_d8_D2_peaks.bed hg38 > p601_d8_D2_annotated_peaks.txt
  #annotatePeaks.pl p604_d8_D1_peaks.bed hg38 > p604_d8_D1_annotated_peaks.txt
  #annotatePeaks.pl p604_d8_D2_peaks.bed hg38 > p604_d8_D2_annotated_peaks.txt

4.2. Convert your MACS2 peaks to HOMER-compatible format. You can do this manually or with a script. For example:

It’s possible to use more information from the MACS2 output file to create a more informative peaks.txt file for HOMER. However, it’s important to note that some information that HOMER needs for its differential peak analysis is not available in the MACS2 output (such as Normalized Tag Count, Control Tags, and others). But we can certainly map more of the available MACS2 columns to the corresponding HOMER columns.

  #The following awk command can be used to convert more MACS2 information into the HOMER format:

  cd macs2
  awk 'BEGIN{OFS="\t"}{print $1,$2,$3,"Peak_"NR,$5,$6,$7,$8,$9,$10}' macs2_peaks.bed > macs2_peaks.txt
  awk 'BEGIN{OFS="\t"} NR > 1 {print $10, $1, $2, $3, "+", "0", "0", $6, $6, "0", $8, $5, $9, "0", "0"}' p601_d8_D1_peaks.xls > p601_d8_D1_macs2_peaks.txt
  awk 'BEGIN{OFS="\t"} NR > 1 {print $10, $1, $2, $3, "+", "0", "0", $6, $6, "0", $8, $5, $9, "0", "0"}' p601_d8_D2_peaks.xls > p601_d8_D2_macs2_peaks.txt
  awk 'BEGIN{OFS="\t"} NR > 1 {print $10, $1, $2, $3, "+", "0", "0", $6, $6, "0", $8, $5, $9, "0", "0"}' p604_d8_D1_peaks.xls > p604_d8_D1_macs2_peaks.txt
  awk 'BEGIN{OFS="\t"} NR > 1 {print $10, $1, $2, $3, "+", "0", "0", $6, $6, "0", $8, $5, $9, "0", "0"}' p604_d8_D2_peaks.xls > p604_d8_D2_macs2_peaks.txt

  This command will:

      * Skip the header line (NR > 1)
      * Map the MACS2 peak name ($10) to the HOMER PeakID
      * Map the MACS2 chromosome, start, and end ($1, $2, $3) to the HOMER chr, start, end
      * Use a placeholder "+" for the HOMER strand
      * Use a placeholder "0" for the HOMER Normalized Tag Count and Focus Ratio
      * Map the MACS2 pileup ($6) to the HOMER findPeaks Score and Total Tags
      * Use a placeholder "0" for the HOMER Control Tags
      * Map the MACS2 fold_enrichment ($8) to the HOMER Fold Change vs Control
      * Map the MACS2 abs_summit ($5) to the HOMER p-value vs Control
      * Map the MACS2 -log10(qvalue) ($9) to the HOMER Fold Change vs Local
      * Use a placeholder "0" for the HOMER p-value vs Local and Clonal Fold Change

  This script is limited by the differences in the information provided by MACS2 and HOMER. While it makes use of as much information as possible from the MACS2 output, some columns in the HOMER format still have to be filled with placeholder values.
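  For readability, the same column mapping as the awk one-liners above in Python (a sketch; it assumes the MACS2 .xls column order chr, start, end, length, abs_summit, pileup, -log10(pvalue), fold_enrichment, -log10(qvalue), name, and skips comment and header lines instead of only the first line):

  # macs2xls_to_homer.py -- sketch; same placeholders as described above
  def convert(xls, out):
      with open(xls) as fin, open(out, "w") as fout:
          for line in fin:
              if line.startswith("#") or line.startswith("chr\t") or not line.strip():
                  continue
              c = line.rstrip("\n").split("\t")
              # PeakID, chr, start, end, strand, NormTagCount, FocusRatio, Score, TotalTags,
              # ControlTags, FC_vs_Control, pval_vs_Control, FC_vs_Local, pval_vs_Local, ClonalFC
              row = [c[9], c[0], c[1], c[2], "+", "0", "0", c[5], c[5],
                     "0", c[7], c[4], c[8], "0", "0"]
              fout.write("\t".join(row) + "\n")

  convert("p601_d8_D1_peaks.xls", "p601_d8_D1_macs2_peaks.txt")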

- The following awk command can be used to convert more SICER information into the HOMER format (TODO), or use findPeaks.pl directly.

- The following awk command can be used to convert more diffReps.pl information into the HOMER format (TODO).

4.3. Associate the converted peak files with their respective tag directories. In HOMER, peak files can be associated with a tag directory by placing them in the tag directory with the filename “peaks.txt”.

  mv homer/p601_d8_D1_H3K4me3/peaks.txt homer/p601_d8_D1_H3K4me3/peaks_raw.txt
  mv homer/p601_d8_D2_H3K4me3/peaks.txt homer/p601_d8_D2_H3K4me3/peaks_raw.txt
  mv homer/p604_d8_D1_H3K4me3/peaks.txt homer/p604_d8_D1_H3K4me3/peaks_raw.txt
  mv homer/p604_d8_D2_H3K4me3/peaks.txt homer/p604_d8_D2_H3K4me3/peaks_raw.txt
  cp macs2/p601_d8_D1_macs2_peaks.txt homer/p601_d8_D1_H3K4me3/peaks.txt
  cp macs2/p601_d8_D2_macs2_peaks.txt homer/p601_d8_D2_H3K4me3/peaks.txt
  cp macs2/p604_d8_D1_macs2_peaks.txt homer/p604_d8_D1_H3K4me3/peaks.txt
  cp macs2/p604_d8_D2_macs2_peaks.txt homer/p604_d8_D2_H3K4me3/peaks.txt

  #Repeat this for each of your tag directories.
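  A small Python sketch that automates the mv/cp pattern above for all four H3K4me3 tag directories (directory layout assumed as in this write-up):

  # swap_in_macs2_peaks.py -- sketch
  import shutil
  from pathlib import Path

  for s in ["p601_d8_D1", "p601_d8_D2", "p604_d8_D1", "p604_d8_D2"]:
      tagdir = Path("homer") / f"{s}_H3K4me3"
      peaks = tagdir / "peaks.txt"
      if peaks.exists():
          peaks.rename(tagdir / "peaks_raw.txt")          # keep the original findPeaks result
      shutil.copy(Path("macs2") / f"{s}_macs2_peaks.txt", peaks)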

4.4. The program getDifferentialPeaksReplicates.pl essentially performs 3 steps; step 2 was modified here.

First, it will pool the target tag directories and input directories separately into pooled experiments and perform an initial peak identification (using findPeaks). Pooling the experiments is generally more sensitive than trying to merge the individual peak files coming from each experiment (although this can be done using the "-use <peak file>" option if each directory already has a peak file associated with it). Next, it will quantify the reads at the initial putative peaks across each of the target and input tag directories using annotatePeaks.pl. Finally, it calls getDiffExpression.pl and ultimately passes these values to the R/Bioconductor package DESeq2 to calculate enrichment values for each peak, returning only those peaks that pass a given fold enrichment (default: 2-fold) and FDR cutoff (default 5%). We can run getDifferentialPeaksReplicates.pl with the -use option to specify that the provided peaks should be used instead of calling findPeaks:

    #-- Successful modification of the script getDifferentialPeaksReplicates.pl --
    #The -d parameter in the mergePeaks function in HOMER is used to specify the maximum distance between peak centers
    #change Max distance to merge to 30000 bp in getDifferentialPeaksReplicates.pl
    #mergePeaks -d 30000 temp_sorted | sort

    #conda list homer  #4.11
    mergePeaks p601_d8_D1_H3K27me3/peaks.txt p601_d8_D2_H3K27me3/peaks.txt >  mergePeaks_res.txt
    python3 update_header.py
    cat p601_d8_D1_H3K27me3/peaks.txt p601_d8_D2_H3K27me3/peaks.txt > temp
    awk '{print $2 "\t" $3 "\t" $4 "\t" $1}' temp | sort -k1,1 -k2,2n | bedtools merge -d 1000 > bedtools_res.txt
    python3 adjust_mergePeaks_res.py

    #check if the results are correct
    cut -d$'\t' -f2-4 filtered_mergePeaks_res.txt > control1
    diff control1 bedtools_res.txt
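    The helper scripts update_header.py and adjust_mergePeaks_res.py are not shown in this post. Judging from the consistency check above (columns 2-4 of filtered_mergePeaks_res.txt must equal bedtools_res.txt), adjust_mergePeaks_res.py is assumed to rebuild the merged peak list on the bedtools intervals; a minimal Python sketch under that assumption:

    # adjust_mergePeaks_res.py -- assumed behaviour only; the real script is not shown
    with open("bedtools_res.txt") as bed, open("filtered_mergePeaks_res.txt", "w") as out:
        for i, line in enumerate(bed, 1):
            chrom, start, end = line.split()[:3]
            out.write(f"Merged-Peak-{i}\t{chrom}\t{start}\t{end}\t+\n")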

  #(myperl) jhuang@hamburg:~/DATA/Data_Denise_LT_DNA_Bindung/results_chipseq_histone_hg38/H3K4me3_H3K27ac__H3K27me3_H3K9me3/homer$
  #getDifferentialPeaksReplicates.pl -use <tagdir1>/peaks.txt,<tagdir2>/peaks.txt ...  #TOO TIME_CONSUMING, using the original version instead:
  getDifferentialPeaksReplicates.pl -t p601_d8_D1_H3K4me3 p601_d8_D2_H3K4me3 -i p601_d8_D1_input p601_d8_D2_input -genome hg38 -use peaks.txt > p601_d8_H3K4me3_macs2_peaks.txt
  getDifferentialPeaksReplicates.pl -t p604_d8_D1_H3K4me3 p604_d8_D2_H3K4me3 -i p604_d8_D1_input p604_d8_D2_input -genome hg38 -use peaks.txt > p604_d8_H3K4me3_macs2_peaks.txt
  #Remember to replace <tagdir>/peaks.txt ... with the paths to your own tag directories and peak files.

4.5. Draw plots

  import matplotlib.pyplot as plt
  from matplotlib_venn import venn2

  venn2(subsets=(2476, 3567, 22719), set_labels=('Donor 1', 'Donor 2'))
  plt.title('Peaks between p601 d8 H3K4me3 Donors')
  plt.xlabel('Number of Elements')
  plt.ylabel('')
  plt.savefig('Peak_Consistency_Between_p601_d8_H3K4me3_Donors.png', dpi=300, bbox_inches='tight')
  #2476+3567+22719=28762

  venn2(subsets=(2681, 3410, 19044), set_labels=('Donor 1', 'Donor 2'))
  plt.title('Peaks between p604 d8 H3K4me3 Donors')
  plt.xlabel('Number of Elements')
  plt.ylabel('')
  plt.savefig('Peak_Consistency_Between_p604_d8_H3K4me3_Donors.png', dpi=300, bbox_inches='tight')
  #2681+3410+19044=25135

  awk 'NR>1 {print $2 "\t" $3 "\t" $4 "\t" $1}' p601_d8_H3K4me3_macs2_peaks.txt > p601_d8_H3K4me3_macs2_peaks.bed
  awk 'NR>1 {print $2 "\t" $3 "\t" $4 "\t" $1}' p604_d8_H3K4me3_macs2_peaks.txt > p604_d8_H3K4me3_macs2_peaks.bed
  ~/Tools/csv2xls-0.4/csv_to_xls.py p601_d8_H3K4me3_macs2_peaks.txt -d$'\t' -o p601_d8_H3K4me3_macs2_peaks.xls
  ~/Tools/csv2xls-0.4/csv_to_xls.py p604_d8_H3K4me3_macs2_peaks.txt -d$'\t' -o p604_d8_H3K4me3_macs2_peaks.xls

5.1. SICER peak calling (under env myperl)

  #SICER --> bed --> annotatePeaks.pl
  mkdir sicer; cd sicer;
  ln -s ../results/picard/V_8_1_6_p601_d8_D1_H3K27me3.dedup.sorted.bed .
  ln -s ../results/picard/V_8_1_6_p601_d8_D1_input.dedup.sorted.bed .
  ln -s ../results/picard/V_8_1_5_p601_d8_D2_H3K27me3.dedup.sorted.bed .
  ln -s ../results/picard/V_8_1_5_p601_d8_D2_input.dedup.sorted.bed .
  ln -s ../results/picard/V_8_1_6_p604_d8_D1_H3K27me3.dedup.sorted.bed .
  ln -s ../results/picard/V_8_1_6_p604_d8_D1_input.dedup.sorted.bed .
  ln -s ../results/picard/V_8_1_5_p604_d8_D2_H3K27me3.dedup.sorted.bed .
  ln -s ../results/picard/V_8_1_5_p604_d8_D2_input.dedup.sorted.bed .

  #chr10:49,003,170-51,222,175
  #chr10:48,528,307-50,747,312
  #chr10:58,422,744-60,641,749
  mkdir p601_d8_D1 p601_d8_D2 p604_d8_D1 p604_d8_D2

  #/home/jhuang/Tools/SICER1.1/SICER/SICER.sh [InputDir] [bed file] [control file] [OutputDir] [Species] [redundancy threshold] [window size (bp)] [fragment size] [effective genome fraction] [gap size (bp)] [FDR]
  # 10000 is the window size, 30000 is the gap size, 160 is the fragment size
  ~/Tools/SICER1.1/SICER/SICER.sh . V_8_1_6_p601_d8_D1_H3K27me3.dedup.sorted.bed V_8_1_6_p601_d8_D1_input.dedup.sorted.bed p601_d8_D1 hg38 1 10000 160 0.74 30000 0.01;
  ~/Tools/SICER1.1/SICER/SICER.sh . V_8_1_5_p601_d8_D2_H3K27me3.dedup.sorted.bed V_8_1_5_p601_d8_D2_input.dedup.sorted.bed p601_d8_D2 hg38 1 10000 160 0.74 30000 0.01;
  ~/Tools/SICER1.1/SICER/SICER.sh . V_8_1_6_p604_d8_D1_H3K27me3.dedup.sorted.bed V_8_1_6_p604_d8_D1_input.dedup.sorted.bed p604_d8_D1 hg38 1 10000 160 0.74 30000 0.01;
  ~/Tools/SICER1.1/SICER/SICER.sh . V_8_1_5_p604_d8_D2_H3K27me3.dedup.sorted.bed V_8_1_5_p604_d8_D2_input.dedup.sorted.bed p604_d8_D2 hg38 1 10000 160 0.74 30000 0.01;

  #TODO: - check if the peak calling works well in IGV!
  #      - call peaks for H3K27me1, adjust the SICER parameters!
  #      - transform them into peaks.txt of HOMER, and call:
  getDifferentialPeaksReplicates.pl -t p601_d8_D1_H3K27me3 p601_d8_D2_H3K27me3 -i p601_d8_D1_input p601_d8_D2_input -genome hg38 -use peaks.txt > p601_d8_H3K27me3_sicer_peaks.txt
  getDifferentialPeaksReplicates.pl -t p604_d8_D1_H3K27me3 p604_d8_D2_H3K27me3 -i p604_d8_D1_input p604_d8_D2_input -genome hg38 -use peaks.txt > p604_d8_H3K27me3_sicer_peaks.txt
  #      - write a mail to Denise, sending the bigWig files of H3K4me3 and H3K27me3 and the called peaks; by the way, ask if she needs the gene members of the red colors in the PCA plot!

  #Note that for histone marks we merge with 'cat file1.bed file2.bed | sort -k1,1 -k2,2n | bedtools merge > merged.bed'
  #Note that for factors we use HOMER getDifferentialPeaksReplicates from the beginning: "mergePeaks.pl --> DESeq recheck"!
  mergePeaks sicer/p601_d8_D1/V_8_1_6_p601_d8_D1_H3K27me3.dedup.sorted-W10000-G30000-FDR0.01-island.bed sicer/p601_d8_D2/V_8_1_5_p601_d8_D2_H3K27me3.dedup.sorted-W10000-G30000-FDR0.01-island.bed -d 10000 > mergedPeaks1.txt
  cat sicer/p601_d8_D1/V_8_1_6_p601_d8_D1_H3K27me3.dedup.sorted-W10000-G30000-FDR0.01-island.bed sicer/p601_d8_D2/V_8_1_5_p601_d8_D2_H3K27me3.dedup.sorted-W10000-G30000-FDR0.01-island.bed | sort -k1,1 -k2,2n | bedtools merge > merged_p601_d8.bed  #4957 (4388, 4822)
  awk 'BEGIN{OFS="\t"; print "PeakID\tchr\tstart\tend\tstrand"} {print "Peak-"NR, $1, $2, $3, "."}' merged_p601_d8.bed > homer_peaks.bed

  #!!!!!!!!!!!!!!!!!!!!!!!!!!!!!TODO!!!!!!!!!!!!!!!!!!!!!!!!: The bedtools merge command does not generate output in the HOMER peak format directly. However, you can use a combination of Unix command line tools (e.g., awk, sort) to convert bedtools output into HOMER format.
  #bedtools merge -i input.bed > merged.bed
  #awk 'BEGIN{OFS="\t"; print "PeakID\tchr\tstart\tend\tstrand"} {print "Peak-"NR, $1, $2, $3, "."}' merged.bed > homer_peaks.bed

5.2-3. Convert your SICER peaks to HOMER-compatible format, and replace them in homer/${sample}/peaks.txt. You can do this manually or with a script.

- The following awk command can be used to convert the SICER information into the HOMER format:

  mv homer/p601_d8_D1_H3K27me3/regions.txt homer/p601_d8_D1_H3K27me3/regions_raw.txt
  mv homer/p601_d8_D2_H3K27me3/regions.txt homer/p601_d8_D2_H3K27me3/regions_raw.txt
  mv homer/p604_d8_D1_H3K27me3/regions.txt homer/p604_d8_D1_H3K27me3/regions_raw.txt
  mv homer/p604_d8_D2_H3K27me3/regions.txt homer/p604_d8_D2_H3K27me3/regions_raw.txt
  #NOTE that we should name the file peaks.txt, since the input samples contain only peaks.txt.
  awk 'BEGIN{OFS="\t"} {print $1"-"$2"-"$3,$1,$2,$3,"+",$4,0,0,$4,$5,0,$6,0}' sicer/p601_d8_D1/V_8_1_6_p601_d8_D1_H3K27me3.dedup.sorted-W10000-G30000-islands-summary-FDR0.01 > homer/p601_d8_D1_H3K27me3/peaks.txt
  awk 'BEGIN{OFS="\t"} {print $1"-"$2"-"$3,$1,$2,$3,"+",$4,0,0,$4,$5,0,$6,0}' sicer/p601_d8_D2/V_8_1_5_p601_d8_D2_H3K27me3.dedup.sorted-W10000-G30000-islands-summary-FDR0.01 > homer/p601_d8_D2_H3K27me3/peaks.txt
  awk 'BEGIN{OFS="\t"} {print $1"-"$2"-"$3,$1,$2,$3,"+",$4,0,0,$4,$5,0,$6,0}' sicer/p604_d8_D1/V_8_1_6_p604_d8_D1_H3K27me3.dedup.sorted-W10000-G30000-islands-summary-FDR0.01 > homer/p604_d8_D1_H3K27me3/peaks.txt
  awk 'BEGIN{OFS="\t"} {print $1"-"$2"-"$3,$1,$2,$3,"+",$4,0,0,$4,$5,0,$6,0}' sicer/p604_d8_D2/V_8_1_5_p604_d8_D2_H3K27me3.dedup.sorted-W10000-G30000-islands-summary-FDR0.01 > homer/p604_d8_D2_H3K27me3/peaks.txt

  #The format is similar to the following.
  #chr1-83000-89999    chr1  83000   89999   +  21  0  0  21  27  0  5.623763510806874e-10  0  0  0
  #chr1-859000-865999  chr1  859000  865999  +  23  0  0  23  28  0  3.501056563888787e-11  0  0  0

  This command will:

      * PeakID: format it as "chr-start-end".
      * chr: same as SICER.
      * start: same as SICER.
      * end: same as SICER.
      * strand: SICER doesn't provide strand information, we could fill with a default value '+'.
      * Normalized Tag Count: same as the normalized read count column in SICER.
      * focus ratio: SICER doesn't provide this, we could fill with a default value '0'.
      * findPeaks Score: corresponds to the Fold Change or FDR column in SICER.
      * Total Tags: SICER doesn't provide this, we could fill with a default value '0'.
      * Control Tags (normalized to IP Experiment): same as the Clonal reads or local background column in SICER.
      * Fold Change vs Control: SICER doesn't provide this, we could fill with a default value '0'.
      * p-value vs Control: same as the p-value column in SICER.
      * Fold Change vs Local: SICER doesn't provide this, we could fill with a default value '0'.
      * p-value vs Local: SICER doesn't provide this, we could fill with a default value '0'.
      * Clonal Fold Change: SICER doesn't provide this, we could fill with a default value '0'.

5.4. The program getDifferentialPeaksReplicates.pl essentially performs 3 steps; step 2 was modified here.

  conda activate myperl
  cd homer
  #diff /home/jhuang/anaconda3/envs/myperl/bin/getDifferentialPeaksReplicates.pl /home/jhuang/homer/bin/getDifferentialPeaksReplicates.pl
  #vim /home/jhuang/anaconda3/envs/myperl/bin/getDifferentialPeaksReplicates.pl
  #max_distance_to_merge at line 203 to 133000, "-d <#>"
  ./getDifferentialPeaksReplicates.pl -t p601_d8_D1_H3K27me3 p601_d8_D2_H3K27me3 -i p601_d8_D1_input p601_d8_D2_input -genome hg38 -use peaks.txt > p601_d8_H3K27me3_sicer_regions.txt
  ./getDifferentialPeaksReplicates.pl -t p604_d8_D1_H3K27me3 p604_d8_D2_H3K27me3 -i p604_d8_D1_input p604_d8_D2_input -genome hg38 -use peaks.txt > p604_d8_H3K27me3_sicer_regions.txt

5.5. Draw plots
  grep "p601_d8_D1_H3K27me3/peaks.txt|p601_d8_D2_H3K27me3/peaks.txt" p601_d8_H3K27me3_sicer_regions.txt | wc -l   # 799
  grep "p601_d8_D1_H3K27me3/peaks.txt" p601_d8_H3K27me3_sicer_regions.txt | wc -l                                 # 867 - 799 = 68
  grep "p601_d8_D2_H3K27me3/peaks.txt" p601_d8_H3K27me3_sicer_regions.txt | wc -l                                 # 3019 - 799 = 2220
  # 68 + 799 + 2220 = 3087
  grep "p604_d8_D1_H3K27me3/peaks.txt|p604_d8_D2_H3K27me3/peaks.txt" p604_d8_H3K27me3_sicer_regions.txt | wc -l   # 677
  grep "p604_d8_D1_H3K27me3/peaks.txt" p604_d8_H3K27me3_sicer_regions.txt | wc -l                                 # 740 - 677 = 63
  grep "p604_d8_D2_H3K27me3/peaks.txt" p604_d8_H3K27me3_sicer_regions.txt | wc -l                                 # 2393 - 677 = 1716
  # 63 + 677 + 1716 = 2456

  import matplotlib.pyplot as plt
  from matplotlib_venn import venn2

  venn2(subsets=(2476, 3567, 22719), set_labels=('Donor 1', 'Donor 2'))
  plt.title('Peaks between p601 d8 H3K27me3 Donors')
  plt.xlabel('Number of Elements')
  plt.ylabel('')
  plt.savefig('Peak_Consistency_Between_p601_d8_H3K27me3_Donors.png', dpi=300, bbox_inches='tight')
  #2476+3567+22719=28762

  venn2(subsets=(2681, 3410, 19044), set_labels=('Donor 1', 'Donor 2'))
  plt.title('Peaks between p604 d8 H3K27me3 Donors')
  plt.xlabel('Number of Elements')
  plt.ylabel('')
  plt.savefig('Peak_Consistency_Between_p604_d8_H3K27me3_Donors.png', dpi=300, bbox_inches='tight')
  #2681+3410+19044=25135

  awk 'NR>1 {print $2 "\t" $3 "\t" $4 "\t" $1}' p601_d8_H3K27me3_sicer_regions.txt > p601_d8_H3K27me3_sicer_regions.bed
  awk 'NR>1 {print $2 "\t" $3 "\t" $4 "\t" $1}' p604_d8_H3K27me3_sicer_regions.txt > p604_d8_H3K27me3_sicer_regions.bed
  ~/Tools/csv2xls-0.4/csv_to_xls.py p601_d8_H3K27me3_sicer_regions.txt -d$'\t' -o p601_d8_H3K27me3_sicer_regions.xls
  ~/Tools/csv2xls-0.4/csv_to_xls.py p604_d8_H3K27me3_sicer_regions.txt -d$'\t' -o p604_d8_H3K27me3_sicer_regions.xls
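  The donor-only subset sizes passed to venn2 follow from the grep totals above; a small helper that does the arithmetic:

  # venn_counts.py -- compute (D1-only, D2-only, shared) for venn2 from the totals above
  def venn_subsets(total_d1, total_d2, shared):
      return (total_d1 - shared, total_d2 - shared, shared)

  print(venn_subsets(867, 3019, 799))    # p601 d8 H3K27me3 -> (68, 2220, 799)
  print(venn_subsets(740, 2393, 677))    # p604 d8 H3K27me3 -> (63, 1716, 677)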
6. getDifferentialPeaksReplicates.pl

#!/usr/bin/env perl
use warnings;
use lib "/home/jhuang/homer/.//bin";
my $homeDir = "/home/jhuang/homer/./";

my $foldThresh = 2;
my $fdrThresh = 0.05;
my $peakFoldInput = 2;
my $peakFdrInput = 0.001;
my $style = "";

sub printCMD {
    print STDERR "\n\tUsage: getDifferentialPeaksReplicates.pl [options] -t <IP tagdir1> [IP tagdir2] ...\n";
    print STDERR "\t                                           -b <background tagdir1> [background tagdir2] ...\n";
    print STDERR "\t                                           -i <Input tagdir1> [Input tagdir2] ...\n";
    print STDERR "\t\tNote: if input is provided, peaks will be called.\n";
    print STDERR "\n\tOptions:\n";
    #print STDERR "\t\t-F <#> (fold enrichment over bg, default: $foldThresh)\n";
    #print STDERR "\t\t-fdr <#> (FDR over input, default: $fdrThresh)\n";
    print STDERR "\t\t-f <#> (fold enrichment over bg, default: $foldThresh)\n";
    print STDERR "\t\t-q <#> (FDR over bg, default: $fdrThresh)\n";
    print STDERR "\t\t-fdr <#>, -F <#>, -L <#> (parameters for findPeaks)\n";
    print STDERR "\t\t-genome (genome version to use for annotation)\n";
    print STDERR "\t\t-DESeq2 | -DESeq | -edgeR (differential stats algorithm, default: DESeq2)\n";
    print STDERR "\t\t-balanced (normalize signal across peaks, default: normalize to read totals)\n";
    print STDERR "\t\t-fragLength <#> (standardize estimated fragment length across analysis)\n";
    print STDERR "\t\t-all (report all peaks, not just differentially regulated)\n";
    print STDERR "\n\tPeak finding directives:\n";
    print STDERR "\t\t-style (findPeaks style to use for finding peaks)\n";
    print STDERR "\t\t-use (use existing peaks in tag directories)\n";
    print STDERR "\t\t-p (use specific peak file instead of tagDir/peaks.txt or finding new one)\n";
    print STDERR "\t\tOther options will be passed to findPeaks\n";
    print STDERR "\n";
    exit;
}

my @targets = ();
my @background = ();
my @inputs = ();
my $findPeaksOpts = "";
my $use = "";
my $givenPeakFile = '';
my $norm2total = "-norm2total";
my $diffAlg = "-DESeq2";
my $genome = 'none';
my $annOptions = "";
my $fragLength = '';
my $allFlag = 0;

my $ogCmd = "getDifferentialPeaksReplicates.pl";
for (my $i=0;$i<@ARGV;$i++) {
    $ogCmd .= " " . $ARGV[$i];
}

for (my $i=0;$i<@ARGV;$i++) {
    if ($ARGV[$i] eq '-t') {
        $i++;
        while ($i < @ARGV) {
            if ($ARGV[$i] =~ /^-/) { $i--; last; }
            push(@targets, $ARGV[$i++]);
        }
    } elsif ($ARGV[$i] eq '-i') {
        $i++;
        while ($i < @ARGV) {
            if ($ARGV[$i] =~ /^-/) { $i--; last; }
            push(@inputs, $ARGV[$i++]);
        }
    } elsif ($ARGV[$i] eq '-b') {
        $i++;
        while ($i < @ARGV) {
            if ($ARGV[$i] =~ /^-/) { $i--; last; }
            push(@background, $ARGV[$i++]);
        }
    } elsif ($ARGV[$i] eq '-f') {
        $foldThresh = $ARGV[++$i];
    } elsif ($ARGV[$i] eq '-genome') {
        $genome = $ARGV[++$i];
    } elsif ($ARGV[$i] eq '-q') {
        $fdrThresh = $ARGV[++$i];
    } elsif ($ARGV[$i] eq '-use') {
        $use = $ARGV[++$i];
        if ($use eq 'tss.txt') {
            $annOptions .= " -fragLength 1 -strand +";
            $fragLength = " -fragLength 1" if ($fragLength eq '');
        }
    } elsif ($ARGV[$i] eq '-p') {
        $givenPeakFile = $ARGV[++$i];
    } elsif ($ARGV[$i] eq '-style') {
        $style = $ARGV[++$i];
        if ($style eq 'tss') {
            $annOptions .= " -fragLength 1 -strand +";
            $fragLength = " -fragLength 1" if ($fragLength eq '');
        }
    } elsif ($ARGV[$i] eq '-edgeR') {
        $diffAlg = $ARGV[$i];
    } elsif ($ARGV[$i] eq '-fragLength') {
        $fragLength = " -fragLength $ARGV[++$i]";
    } elsif ($ARGV[$i] eq '-DESeq2') {
        $diffAlg = $ARGV[$i];
    } elsif ($ARGV[$i] eq '-all') {
        $allFlag = 1;
    } elsif ($ARGV[$i] eq '-DESeq') {
        $diffAlg = $ARGV[$i];
    } elsif ($ARGV[$i] eq '-balanced') {
        $norm2total = "";
    } elsif ($ARGV[$i] eq '-h' || $ARGV[$i] eq '--help' || $ARGV[$i] eq '--') {
        printCMD();
    } else {
        $findPeaksOpts .= " " . $ARGV[$i];
        #print STDERR "!!! \"$ARGV[$i]\" not recognized\n";
        #printCMD();
    }
}

my $rand = rand();
my %toDelete = ();

if ($diffAlg eq '-edgeR' && $norm2total ne '') {
    print STDERR "!!! Error, -edgeR requires \"-balanced\" to work correctly!!!\n";
    exit;
}
$log2Thresh = log($foldThresh)/log(2);

if (@targets < 1) {
    print STDERR "!!! Error, need at least one target directory!!!\n";
    printCMD();
}
if (scalar(@inputs) + scalar(@background) < 1) {
    print STDERR "\t!!! Error: program requires either input or background experiments to perform differential calculations!\n";
    exit;
}

my $targetDirs = '';
my $targetStr = "";
foreach(@targets) {
    $targetDirs .= " \"$_\"";
    $targetStr .= " target";
}
my $inputDirs = '';
my $inputStr = "";
foreach(@inputs) {
    $inputDirs .= " \"$_\"";
    $inputStr .= " input";
}
my $bgDirs = '';
my $bgStr = "";
foreach(@background) {
    $bgDirs .= " \"$_\"";
    $bgStr .= " bg";
}

if ($givenPeakFile eq '') {
    $peakFile = $rand . ".peaks";
    $toDelete{$peakFile}=1;
} else {
    $peakFile = $givenPeakFile;
}

if ($findPeaksOpts ne '') {
    print STDERR "\tUsing the following extra parameters for findPeaks:\n\t\t$findPeaksOpts\n";
}

print STDERR "\tStep1: Defining Putative Peak Set\n";
if ($use eq '' && $givenPeakFile eq '') {
    print STDERR "\t\tFinding peaks in merged meta-experiment from target tag directories\n";
    my $targetDir = $rand . ".targetTagDir";
    `makeTagDirectory \"$targetDir\" -d $targetDirs $fragLength`;
    $toDelete{$targetDir}=1;

    my $inputDir = $rand . ".inputTagDir";
    my $cmd = "findPeaks \"$targetDir\"";
    if ($style eq '') {
        $style = 'factor';
        print STDERR "\tUsing -style $style...\n";
    }
    $cmd .= " -style $style";
    $cmd .= $fragLength . " " . $findPeaksOpts;
    if (@inputs > 0) {
        `makeTagDirectory \"$inputDir\" -d $inputDirs $fragLength`;
        $toDelete{$inputDir}=1;
        $cmd .= " -i \"$inputDir\"";
    }
    #print STDERR "`$cmd > \"$peakFile\"`\n";
    `$cmd > \"$peakFile\"`;
} elsif ($givenPeakFile eq '' && $use ne '') {
    my $files = "";
    my @allDirs = ();
    push(@allDirs, @targets, @inputs, @background);
    print STDERR "\t\tUsing existing peak files for features:\n";
    foreach(@allDirs) {
        my $p = $_ . "/" . $use;
        if (-e $p) {
            print STDERR "\t\t\t$p\n";
            $files .= " \"$p\"";
        }
    }
    #MODIFIED
    #`mergePeaks $files > \"$peakFile\"`;
    #print "$files\n";
    #print "\"$peakFile\"\n";
    `mergePeaks $files -d 1000 > mergePeaks_res.txt`;
    `python3 update_header.py`;
    `cat $files > merged_peaks.txt`;
    ## Split the list of files into an array
    #my @files = split / /, $files;
    # Open the output file
    #open(my $out, '>', 'merged_peaks.txt') or die "Cannot open merged_peaks.txt: $!";
    #foreach my $file (@files) {
    #    print("xxxxx$file\n");
    #    open(my $in, '<', $file) or die "Cannot open $file: $!";
    #    while (<$in>) {
    #        s/\t+$//;  # removes trailing tabs
    #        print $out "$_\n";
    #    }
    #    close $in;
    #}
    #close $out;
    system("awk '{print \$2 \"\\t\" \$3 \"\\t\" \$4 \"\\t\" \$1}' merged_peaks.txt | sort -k1,1 -k2,2n | bedtools merge -d 1000 > bedtools_res.txt");
    `python3 adjust_mergePeaks_res.py`;
    `mv filtered_mergePeaks_res.txt \"$peakFile\"`;
}

$rawFile = $rand . ".raw.txt";
$diffFile = $rand . ".diff.txt";
$upFile = $rand . ".Up_target_vs_bg.txt";
$downFile = $rand . ".Down_target_vs_bg.txt";
if ($bgDirs eq '') {
    $bgDirs = $inputDirs;
    $bgStr = $inputStr;
    @background = @inputs;
    $upFile = $rand . ".Up_target_vs_input.txt";
    $downFile = $rand . ".Down_target_vs_input.txt";
}

print STDERR "\n\tStep2: Quantifying reads across target/background/input tag directories\n\n";
#print STDERR "`annotatePeaks.pl \"$peakFile\" none -d $bgDirs $targetDirs -raw > \"$rawFile\"`;\n";
`annotatePeaks.pl \"$peakFile\" $genome -d $bgDirs $targetDirs -raw $annOptions $fragLength > \"$rawFile\"`;

#print STDERR "`getDiffExpression.pl \"$rawFile\" $bgStr $targetStr $norm2total $diffAlg -fdr $fdrThresh -log2fold $log2Thresh -export $rand > $diffFile`;\n";
#
print STDERR "\n\tStep3: Calling R for differential enrichment statistics ($diffAlg)\n\n";
`getDiffExpression.pl \"$rawFile\" $bgStr $targetStr $norm2total $diffAlg -fdr $fdrThresh -log2fold $log2Thresh -export $rand > $diffFile`;

$toDelete{$rawFile}=1;
$toDelete{$diffFile}=1;
$toDelete{$upFile}=1;
$toDelete{$downFile}=1;

print "#cmd=$ogCmd|";
my $ofile = $upFile;
$ofile = $diffFile if ($allFlag);
open IN, $ofile;
while (<IN>) {
    print $_;
}
close IN;

foreach(keys %toDelete) {
    next if ($_ eq '/');
    #print STDERR "\trm -r \"$_\"\n";
    `rm -r "$_"`;
}

7. update_header.py

import os

# File path
file_path = 'mergePeaks_res.txt'

# Read in the file
with open(file_path, 'r') as file:
    lines = file.readlines()

# Split the first line into components
components = lines[0].split("\t")

# Change the first component to "name"
components[0] = "name"

# Join the components back into a single string
lines[0] = "\t".join(components)

# Write the file back out
with open(file_path, 'w') as file:
    file.writelines(lines)

8. adjust_mergePeaks_res.py

import pandas as pd

# Read the mergePeaks result file and the bedtools result file
mergePeaks_df = pd.read_csv('mergePeaks_res.txt', sep='\t', header=0)
bedtools_df = pd.read_csv('bedtools_res.txt', sep='\t', header=None, names=['chr', 'start', 'end'])

# Function to check if a peak is within bedtools ranges
# def is_in_bedtools_range(row):

filtered_mergePeaks_df = []
for _, row in mergePeaks_df.iterrows():
    for _, bed_row in bedtools_df.iterrows():
        if row['chr'] == bed_row['chr'] and row['start'] >= bed_row['start'] and row['end'] <= bed_row['end']:
            row['start'] = bed_row['start']
            row['end'] = bed_row['end']
            filtered_mergePeaks_df.append(row)

# Convert the filtered results to a DataFrame
filtered_mergePeaks_df = pd.DataFrame(filtered_mergePeaks_df)

## Write the filtered results to a file
#filtered_mergePeaks_df.to_csv('filtered_mergePeaks_res.txt', sep='\t', index=False)

## Filter the mergePeaks rows
#filtered_mergePeaks_df = mergePeaks_df[mergePeaks_df.apply(is_in_bedtools_range, axis=1)]

## Sort and drop duplicates
sorted_df = filtered_mergePeaks_df.sort_values(by=['chr', 'start', 'end'])
deduplicated_df = sorted_df.drop_duplicates(subset=['chr', 'start', 'end'])

## Save to file
deduplicated_df.to_csv('filtered_mergePeaks_res.txt', sep='\t', index=False)
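Note that adjust_mergePeaks_res.py compares every peak against every bedtools interval, which can become slow for large peak sets. A possible alternative, shown only as a sketch and using the same file and column names as the scripts above, snaps each peak to the disjoint bedtools-merged interval that contains it via a pandas IntervalIndex:

import pandas as pd

# Same inputs and column names as adjust_mergePeaks_res.py
mp = pd.read_csv('mergePeaks_res.txt', sep='\t')
bed = pd.read_csv('bedtools_res.txt', sep='\t', header=None, names=['chr', 'start', 'end'])

snapped = []
for chrom, peaks in mp.groupby('chr'):
    intervals = bed[bed['chr'] == chrom].reset_index(drop=True)
    if intervals.empty:
        continue
    # bedtools merge yields disjoint intervals, so each position falls into at most one of them
    idx = pd.IntervalIndex.from_arrays(intervals['start'], intervals['end'], closed='both')
    pos = idx.get_indexer(peaks['start'])          # -1 where the peak start is in no interval
    keep = pos >= 0
    peaks, pos = peaks[keep].copy(), pos[keep]
    # require full containment (as in the original script), then snap to the interval bounds
    contained = peaks['end'].to_numpy() <= intervals['end'].to_numpy()[pos]
    peaks, pos = peaks[contained].copy(), pos[contained]
    peaks['start'] = intervals['start'].to_numpy()[pos]
    peaks['end'] = intervals['end'].to_numpy()[pos]
    snapped.append(peaks)

result = (pd.concat(snapped)
          .sort_values(['chr', 'start', 'end'])
          .drop_duplicates(subset=['chr', 'start', 'end']))
result.to_csv('filtered_mergePeaks_res.txt', sep='\t', index=False)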

Interaction Between T Cells and B Cells in the Immune Response

T cells and B cells interact in several ways as part of the immune response. Here’s a general overview of their interaction:

Antigen presentation: When a pathogen invades the body, it is engulfed by a type of cell known as an antigen-presenting cell (APC). The APC processes the pathogen and displays fragments of it, known as antigens, on its surface. B cells can also act as antigen-presenting cells.

T cell activation: A type of T cell known as a helper T cell (Th cell), typically CD4+, can recognize these antigens. The Th cell binds to the antigen, which activates the T cell. This process usually requires additional signals from the APC, provided through other surface molecules.

B cell activation: Once activated, the Th cell can interact with B cells that are displaying the same antigen. The T cell releases signaling molecules known as cytokines, which help to activate the B cell.

Antibody production: Once activated, the B cell begins to proliferate and differentiate into plasma cells. These plasma cells produce antibodies that are specific to the antigen. These antibodies can then neutralize the pathogen or mark it for destruction by other immune cells.

Memory cells: Some of the B cells and T cells will differentiate into memory cells. These cells “remember” the antigen and can mount a rapid response if the same pathogen invades the body again.

So, the interaction between T cells and B cells is crucial for the adaptive immune response. It allows the immune system to mount a targeted response to specific pathogens and to remember those pathogens in case of future invasions.

The epitope comes into play in the "T cell activation" step. At this stage, the helper T cell (Th cell) recognizes antigens on the surface of the antigen-presenting cell (APC). The antigen is displayed by MHC molecules that engage the T cell receptor (TCR), and the specific part of the antigen that the TCR recognizes is the epitope. This recognition triggers T cell activation, which in turn drives the subsequent steps of the immune response, such as B cell activation and antibody production.

The antigen itself is present on the antigen-presenting cell (APC). When a pathogen such as a bacterium or virus enters the body, antigen-presenting cells (for example, macrophages and dendritic cells) capture and process it and display parts of the processed pathogen (antigens) on their surface. T cells then recognize and bind these antigens through their T cell receptors (TCRs), which triggers the immune response.

In this context, an "epitope" is the specific region of an antigen (i.e., a part of a pathogen protein) that can be recognized and bound by the immune system, in particular by antibodies or T cell receptors. Saying that "the epitope appears on the T cell" really means that the T cell receptor (TCR) recognizes and binds this specific region of the antigen, not that the epitope itself sits on the T cell. During T cell activation, the TCR recognizes and binds the antigen's epitope on the surface of the APC; this signal is then relayed to the T cell and triggers the immune response. The epitope is therefore the key element of the antigen in its interaction with the T cell.

Generation of Heatmap from DEGs Data and Annotation of Identified Gene Clusters

This script processes gene expression data, specifically DEGs (differentially expressed genes), and creates a heatmap visualizing the expression patterns. The steps involved are as follows:

  1. Package Installation and Library Loading: The script first ensures that essential packages are installed and then loads them. Some of the key packages include “gplots” for generating heatmaps, “readxl” and “writexl” for reading and writing Excel data, and “biomaRt” for fetching gene annotation data from Ensembl.

  2. Data Input: It reads in the gene expression data from an Excel file named “DEGs_heatmap_data.xls”.

  3. Hierarchical Clustering: The script performs hierarchical clustering on the data, using Pearson correlation across genes (rows) and Spearman correlation across samples (columns) as distance measures.

  4. Heatmap Generation: A heatmap is generated to visualize the clustered data, and this visualization is saved as an image file named “DEGs_heatmap.png”.

  5. Annotation and Data Segregation: The genes are further grouped into clusters, and for each cluster, annotation details such as gene ID, gene name, chromosome name, start and end positions, and more are fetched from Ensembl. This annotated data for each cluster is stored with the expression data in distinct data frames.

  6. Output: All the processed clusters are then compiled and written to an Excel file named “gene_clusters.xlsx”, with each cluster having its designated sheet.

This script aids in the identification and exploration of gene expression patterns and further provides essential annotations for identified gene clusters.

#ensure you have the following packages installed. If not, you'll have to install them
install.packages("gplots")
install.packages("readxl")
install.packages("writexl")
install.packages("dplyr")
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("biomaRt")

library(gplots)
library(readxl)
library(writexl)
library(dplyr)
library(biomaRt)
listEnsembl()
listMarts()
ensembl <- useEnsembl(biomart = "ensembl", dataset = "hsapiens_gene_ensembl", version="104")
datasets <- listDatasets(ensembl)

# Read the Excel file
datamat = read_excel(path = "DEGs_heatmap_data.xls", sheet = 1, col_names = TRUE)
datamat <- as.data.frame(datamat)
rownames(datamat) <- datamat[, 1]
datamat <- datamat[, -1] # Remove the first column which is now the row names

hr <- hclust(as.dist(1-cor(t(datamat), method="pearson")), method="complete")
hc <- hclust(as.dist(1-cor(datamat, method="spearman")), method="complete")
mycl = cutree(hr, h=max(hr$height)/1.2)
mycol = c("YELLOW", "DARKBLUE", "DARKORANGE", "DARKMAGENTA", "DARKCYAN", "DARKRED",  "MAROON", "DARKGREEN", "LIGHTBLUE", "PINK", "MAGENTA", "LIGHTCYAN","LIGHTGREEN", "BLUE", "ORANGE", "CYAN", "RED", "GREEN");
mycol = mycol[as.vector(mycl)]
png("DEGs_heatmap.png", width=900, height=1010)
heatmap.2(as.matrix(datamat),Rowv=as.dendrogram(hr),Colv = NA, dendrogram = 'row',
            scale='row',trace='none',col=bluered(75),
            RowSideColors = mycol, labRow="", srtCol=30, keysize=0.72, cexRow = 2, cexCol = 1.4)
dev.off()

#### cluster members #####
subset_1<-names(subset(mycl, mycl == '1'))
subset_1_ <- getBM(attributes = c('ensembl_gene_id', 'external_gene_name', 'gene_biotype', 'entrezgene_id', 'chromosome_name', 'start_position', 'end_position', 'strand', 'description'),
      filters = 'ensembl_gene_id',
      values = subset_1,
      mart = ensembl)
subset_1_uniq <- distinct(subset_1_, ensembl_gene_id, .keep_all= TRUE)
subset_1_expr  <- datamat[subset_1,]
subset_1_expr$ENSEMBL = rownames(subset_1_expr)
cluster1_YELLOW <- merge(subset_1_uniq, subset_1_expr, by.x="ensembl_gene_id", by.y="ENSEMBL")
#write.csv(cluster1_YELLOW,file='cluster1_YELLOW.txt')

subset_2<-names(subset(mycl, mycl == '2'))
subset_2_ <- getBM(attributes = c('ensembl_gene_id', 'external_gene_name', 'gene_biotype', 'entrezgene_id', 'chromosome_name', 'start_position', 'end_position', 'strand', 'description'),
      filters = 'ensembl_gene_id',
      values = subset_2,
      mart = ensembl)
subset_2_uniq <- distinct(subset_2_, ensembl_gene_id, .keep_all= TRUE)
subset_2_expr  <- datamat[subset_2,]
subset_2_expr$ENSEMBL = rownames(subset_2_expr)
cluster2_DARKBLUE <- merge(subset_2_uniq, subset_2_expr, by.x="ensembl_gene_id", by.y="ENSEMBL")
#write.csv(cluster2_DARKBLUE,file='cluster2_DARKBLUE.txt')

subset_3<-names(subset(mycl, mycl == '3'))
subset_3_ <- getBM(attributes = c('ensembl_gene_id', 'external_gene_name', 'gene_biotype', 'entrezgene_id', 'chromosome_name', 'start_position', 'end_position', 'strand', 'description'),
      filters = 'ensembl_gene_id',
      values = subset_3,
      mart = ensembl)
subset_3_uniq <- distinct(subset_3_, ensembl_gene_id, .keep_all= TRUE)
subset_3_expr  <- datamat[subset_3,]
subset_3_expr$ENSEMBL = rownames(subset_3_expr)
cluster3_DARKORANGE <- merge(subset_3_uniq, subset_3_expr, by.x="ensembl_gene_id", by.y="ENSEMBL")
#write.csv(cluster3_DARKORANGE,file='cluster3_DARKORANGE.txt')

subset_4<-names(subset(mycl, mycl == '4'))
subset_4_ <- getBM(attributes = c('ensembl_gene_id', 'external_gene_name', 'gene_biotype', 'entrezgene_id', 'chromosome_name', 'start_position', 'end_position', 'strand', 'description'),
      filters = 'ensembl_gene_id',
      values = subset_4,
      mart = ensembl)
subset_4_uniq <- distinct(subset_4_, ensembl_gene_id, .keep_all= TRUE)
subset_4_expr  <- datamat[subset_4,]
subset_4_expr$ENSEMBL = rownames(subset_4_expr)
cluster4_DARKMAGENTA <- merge(subset_4_uniq, subset_4_expr, by.x="ensembl_gene_id", by.y="ENSEMBL")
#write.csv(cluster4_DARKMAGENTA,file='cluster4_DARKMAGENTA.txt')

subset_5<-names(subset(mycl, mycl == '5'))
subset_5_ <- getBM(attributes = c('ensembl_gene_id', 'external_gene_name', 'gene_biotype', 'entrezgene_id', 'chromosome_name', 'start_position', 'end_position', 'strand', 'description'),
      filters = 'ensembl_gene_id',
      values = subset_5,
      mart = ensembl)
subset_5_uniq <- distinct(subset_5_, ensembl_gene_id, .keep_all= TRUE)
subset_5_expr  <- datamat[subset_5,]
subset_5_expr$ENSEMBL = rownames(subset_5_expr)
cluster5_DARKCYAN <- merge(subset_5_uniq, subset_5_expr, by.x="ensembl_gene_id", by.y="ENSEMBL")
#write.csv(cluster5_DARKCYAN,file='cluster5_DARKCYAN.txt')

subset_6<-names(subset(mycl, mycl == '6'))
subset_6_ <- getBM(attributes = c('ensembl_gene_id', 'external_gene_name', 'gene_biotype', 'entrezgene_id', 'chromosome_name', 'start_position', 'end_position', 'strand', 'description'),
      filters = 'ensembl_gene_id',
      values = subset_6,
      mart = ensembl)
subset_6_uniq <- distinct(subset_6_, ensembl_gene_id, .keep_all= TRUE)
subset_6_expr  <- datamat[subset_6,]
subset_6_expr$ENSEMBL = rownames(subset_6_expr)
cluster6_DARKRED <- merge(subset_6_uniq, subset_6_expr, by.x="ensembl_gene_id", by.y="ENSEMBL")
#write.csv(cluster6_DARKRED,file='cluster6_DARKRED.txt')

write_xlsx(list(
  "Cluster 1 YELLOW" = cluster1_YELLOW,
  "Cluster 2 DARKBLUE" = cluster2_DARKBLUE,
  "Cluster 3 DARKORANGE" = cluster3_DARKORANGE,
  "Cluster 4 DARKMAGENTA" = cluster4_DARKMAGENTA,
  "Cluster 5 DARKCYAN" = cluster5_DARKCYAN,
  "Cluster 6 DARKRED" = cluster6_DARKRED
), "gene_clusters.xlsx")

Gene Set Variation Analysis (GSVA) and Visualization of Gene Sets from Excel Signatures

#install.packages("readxl")
library(readxl)

# Path to the Excel file
file_path <- "Signatures.xls"

#example of a signature:
#geneSymbol geneEntrezID    ENSEMBL GeneSet
#CD160  11126   ENSG00000117281 Anergic or act. T cells
#CD244  51744   ENSG00000122223 Anergic or act. T cells
#CTLA4  1493    ENSG00000163599 Anergic or act. T cells
#HAVCR2 84868   ENSG00000135077 Anergic or act. T cells
#ICOS   29851   ENSG00000163600 Anergic or act. T cells
#KLRG1  10219   ENSG00000139187 Anergic or act. T cells
#LAG3   3902    ENSG00000089692 Anergic or act. T cells
#PDCD1  5133    ENSG00000188389 Anergic or act. T cells
#PDCD1  5133    ENSG00000276977 Anergic or act. T cells

# Get the names of the sheets
sheet_names <- excel_sheets(file_path)

# Initialize an empty list to hold gene sets
geneSets <- list()

# Loop over each sheet, extract the ENSEMBL IDs, and add to the list
for (sheet in sheet_names) {
  # Read the sheet
  data <- read_excel(file_path, sheet = sheet)

  # Process the GeneSet names (replacing spaces with underscores, for example)
  gene_set_name <- gsub(" ", "_", unique(data$GeneSet)[1])

  # Add ENSEMBL IDs to the list
  geneSets[[gene_set_name]] <- as.character(data$ENSEMBL)
}

# Print the result to check
print(geneSets)

# 1. Compute GSVA scores (GSVA and ggplot2 are needed below; `exprs` is assumed to be the
#    normalized expression matrix with Ensembl gene IDs as row names, and `dds` the DESeqDataSet
#    from the upstream DESeq2 analysis):
library(GSVA)
library(ggplot2)
gsva_scores <- gsva(exprs, geneSets, method="gsva")

# 2. Convert to data.frame for ggplot:
gsva_df <- as.data.frame(t(gsva_scores))

# 3. Add conditions to gsva_df:
gsva_df$Condition <- dds$condition

# 4. Filter the gsva_df to retain only the desired conditions:
gsva_df_filtered <- gsva_df[gsva_df$Condition %in% c("Ace2_mock_2h", "Ace2_inf_24h"), ]

# 5. Define a function to plot violin plots:
# Update the condition levels in gsva_df_filtered to ensure the desired order on x-axis:
gsva_df_filtered$Condition <- gsub("Ace2_mock_2h", "mock", gsva_df_filtered$Condition)
gsva_df_filtered$Condition <- gsub("Ace2_inf_24h", "infection", gsva_df_filtered$Condition)
gsva_df_filtered$Condition <- factor(gsva_df_filtered$Condition, levels = c("mock", "infection"))
plot_violin <- function(data, gene_name) {
  # Calculate the t-test p-value for the two conditions
  condition1_data <- data[data$Condition == "mock", gene_name]
  condition2_data <- data[data$Condition == "infection", gene_name]
  p_value <- t.test(condition1_data, condition2_data)$p.value

  # Convert p-value to annotation
  p_annotation <- ifelse(p_value < 0.01, "**", ifelse(p_value < 0.05, "*", ""))
  rounded_p_value <- paste0("p = ", round(p_value, 2))

  plot_title <- gsub("_", " ", gene_name)
  p <- ggplot(data, aes(x=Condition, y=!!sym(gene_name))) +
    geom_violin(linewidth=1.2) + 
    labs(title=plot_title, y="GSVA Score") +
    ylim(-1, 1) +
    theme_light() +
    theme(
      axis.title.x = element_text(size=12),
      axis.title.y = element_text(size=12),
      axis.text.x  = element_text(size=10),
      axis.text.y  = element_text(size=10),
      plot.title   = element_text(size=12, hjust=0.5)
    )

  # Add p-value annotation to the plot
  p <- p + annotate("text", x=1.5, y=0.9, label=paste0(p_annotation, " ", rounded_p_value), size=5, hjust=0.5)

  return(p)
}

# 6. Generate the list of plots:
#genes <- colnames(gsva_df_filtered)[!colnames(gsva_df_filtered) %in% "Condition"]
#plots_list <- lapply(genes, function(gene) plot_violin(gsva_df_filtered, gene))

# 6. Generate the list of plots in a predefined order:
desired_order <- c("Platelets","Granulocytes","LDG","pDC","Anti-inflammation",  "Pro-inflam._IL-1","Dendritic_cells","MHC_II","Alt._complement","TNF",  "NLRP3_inflammasome","Unfolded_protein","B_cells","Monocyte_cell_surface","Inflammasome",  "Monocyte_secreted","IL-1_cytokines","SNOR_low_UP","CD40_activated","Lectin_complement",  "Classical_complement","Cell_cycle","Plasma_cells","IG_chains","Erythrocytes",  "IL-6R_complex","IFN","TCRB","TCRA","Cyt._act._T_cells",  "TCRG","T_cells","CD8T-NK-NKT","Anergic_or_act._T_cells","T_activated",  "NK_cells","TCRD","T_regs","SNOR_low_DOWN","Monocytes",  "Myeloid_cells","Neutrophils")
genes <- colnames(gsva_df_filtered)[!colnames(gsva_df_filtered) %in% "Condition"]
genes <- genes[match(desired_order, genes)]
plots_list <- lapply(genes, function(gene) plot_violin(gsva_df_filtered, gene))

# 7. Pad the list of plots:
remaining_plots <- 6 - (length(plots_list) %% 6)
if (remaining_plots != 6) {
  plots_list <- c(plots_list, rep(list(NULL), remaining_plots))
}

# 8. Create the plots and arrange them in a grid:
library(gridExtra)
plots_matrix <- matrix(plots_list, ncol=6, byrow=T)
#do.call("grid.arrange", c(plots_matrix, list(ncol=6)))

# 9. Save the plots to a PNG:
png("All_Violin_Plots.png", width=1000, height=1000)
do.call("grid.arrange", c(plots_matrix, list(ncol=6)))
dev.off()

Understanding xrandr Output and Connection Types

xrandr
Screen 0: minimum 8 x 8, current 3840 x 1080, maximum 16384 x 16384
VGA-0 connected primary 1920x1080+0+0 (normal left inverted right x axis y axis) 521mm x 293mm
  1920x1080     60.00*+
  1680x1050     59.95  
  1600x900      60.00  
  1440x900      59.89  
  1280x1024     75.02    60.02  
  1280x800      59.81  
  1280x720      60.00  
  1152x864      75.00  
  1024x768      75.03    70.07    60.00  
  800x600       75.00    72.19    60.32    56.25  
  640x480       75.00    72.81    59.94  
DVI-D-0 disconnected (normal left inverted right x axis y axis)
HDMI-0 connected 1920x1080+1920+0 (normal left inverted right x axis y axis) 521mm x 293mm
  1920x1080     60.00*+  50.00  
  1680x1050     59.95  
  1600x900      60.00  
  1440x900      59.89  
  1280x1024     75.02    60.02  
  1280x800      59.81  
  1280x720      60.00    50.00  
  1152x864      75.00  
  1024x768      75.03    70.07    60.00  
  800x600       75.00    72.19    60.32    56.25  
  720x576       50.00  
  720x480       59.94  

From the provided output:

  1. Screen 0:

    • The minimum screen resolution your system supports is 8×8.
    • Your current screen resolution is 3840×1080 (which suggests a dual monitor setup, each with a resolution of 1920×1080).
    • The maximum resolution supported is 16384×16384.
  2. VGA-0:

    • This is the first display connected via a VGA port.
    • It is currently set as the primary display.
    • Its current resolution is 1920×1080 (the * denotes the current resolution and the + denotes the preferred resolution).
    • It supports various other resolutions as listed.
  3. DVI-D-0:

    • This display is currently disconnected.
    • No other information about this display is provided since it’s not connected.
  4. HDMI-0:

    • This is the second display connected via an HDMI port.
    • Its current resolution is 1920×1080.
    • It supports various other resolutions, similar to VGA-0.

Our two screens (VGA-0 and HDMI-0) are side by side, creating a total screen resolution of 3840×1080.

From the xrandr output, it’s not directly evident that we are using an HDMI-to-DVI cable. However, we can infer this from the connection types and their states:

  1. HDMI-0 connected: This indicates that a device is connected to the HDMI port of your computer.
  2. DVI-D-0 disconnected: This indicates that there is no direct connection to the DVI port on your computer.

If we know that our second monitor only has a DVI input and we’re using an HDMI-to-DVI cable to connect it, then the “HDMI-0 connected” state in the xrandr output is a result of the HDMI end of our converter cable being connected to our computer’s HDMI port.

However, it’s essential to clarify that xrandr only shows the connection state from the perspective of the computer’s output ports. It won’t specify cable types or conversion methods. We usually determine the use of converter cables by knowing our hardware and connection setup.