Author Archives: gene_x

transponson off-target analysis

TODO: edit and sort the following text after vacation

# Input 9,181,515x2 reads
# -- starting point: initial_mutants_rep1.trimmed1_failed_trim

# -- generate bam-file containing  not-mapped reads --
#I have a sam-file containing 8586773 alignments from read1. I have 9181515 read2. Can I produce the sorted bam files of 8586773 paired-ends on WA-314_m.fna.

#1115132/2=557566 + 8586773 = 9144339 != 9183535
I have the paired read1 and read2, update the bwa commands
samtools view initial_mutants_rep1.trimmed1_failed_trim | cut -f1 > read1_ids.txt
seqtk subseq initial_mutants_rep1.reads1 read1_ids.txt > filtered_initial_mutants_rep1.reads1.fastq
seqtk subseq initial_mutants_rep1.reads2 read1_ids.txt > filtered_initial_mutants_rep1.reads2.fastq

bwa index ../WA-314_locusid.fasta
bwa mem -M ../WA-314_locusid.fasta filtered_initial_mutants_rep1.reads1.fastq filtered_initial_mutants_rep1.reads2.fastq > initial_mutants_rep1_no_insertion_sites.sam
# Convert SAM to BAM
samtools view -Sb initial_mutants_rep1_no_insertion_sites.sam > initial_mutants_rep1_no_insertion_sites.bam
# Sort the BAM file
samtools sort initial_mutants_rep1_no_insertion_sites.bam -o initial_mutants_rep1_sorted_no_insertion_sites.bam
# Index the sorted BAM file
samtools index initial_mutants_rep1_sorted_no_insertion_sites.bam

#Visualization using tools like IGV to inspect the alignment and distribution of reads.

#Analysis and Visualization: perform various analyses, such as coverage checks, quality assessments.

samtools view -b -f 2 initial_mutants_rep1_sorted_no_insertion_sites.bam > initial_mutants_rep1_properly_paired.bam
bedtools genomecov -ibam initial_mutants_rep1_properly_paired.bam -bg > initial_mutants_rep1_coverage.bedgraph
samtools depth initial_mutants_rep1_properly_paired.bam > initial_mutants_rep1_depth.txt
python3 ~/Scripts/draw_coverages.py

#----> TODO_1 (TOMORROW): draw coverage plots of the reads! senden the results to them!
#      TODO_2 (TOMORROW): generate the saturation plots tomorrow!)

samtools faidx WA-314_m.fasta "gi|420257081|ref|NZ_AKKR01000009.1|contig_1_1":31429-31582

Are the following 2 sequences related? for example, subsequences, or reverse complementary?
>gi|420257081|ref|NZ_AKKR01000009.1|contig_1_1:31429-31582
CGTGATCAGGTCTTTTTAGCGACTAAGTTCGGTATTGTGCGTGACCCCACTGATCCCAGG
GTACGCGGTGTGAGCAGCCGTCCTGATTATATCCGCCAGTCGGTTGATGGCAGCCTTAAG
CGCTTGGGTGTTGATGTCATTGATCTTTATTACC
>transposon
CTGTCTCTTATACACATCTCAACCATCATCGATGAATTGTGTCTCAAAATCTCTGATGTTAC
ATTGCACAAGATAAAAATATATCATCATGAACAATAAAACTGTCTGCTTACATAAACAGTAA
TACAAGGGGTGTTATGAGCCATATTCAACGGGAAACGTCTTGCTCGAGGCCGCGATTAAATT
CCAACATGGATGCTGATTTATATGGGTATAAATGGGCTCGCGATAATGTCGGGCAATCAGGT
GCGACAATCTATCGATTGTATGGGAAGCCCGATGCGCCAGAGTTGTTTCTGAAACATGGCAA
AGGTAGCGTTGCCAATGATGTTACAGATGAGATGGTCAGACTAAACTGGCTGACGGAATTTA
TGCCTCTTCCGACCATCAAGCATTTTATCCGTACTCCTGATGATGCATGGTTACTCACCACT
GCGATCCCCGGAAAAACAGCATTCCAGGTATTAGAAGAATATCCTGATTCAGGTGAAAATAT
TGTTGATGCGCTGGCAGTGTTCCTGCGCCGGTTGCATTCGATTCCTGTTTGTAATTGTCCTT
TTAACAGCGATCGCGTATTTCGTCTCGCTCAGGCGCAATCACGAATGAATAACGGTTTGGTT
GATGCGAGTGATTTTGATGACGAGCGTAATGGCTGGCCTGTTGAACAAGTCTGGAAAGAAAT
GCATAAACTTTTGCCATTCTCACCGGATTCAGTCGTCACTCATGGTGATTTCTCACTTGATA
ACCTTATTTTTGACGAGGGGAAATTAATAGGTTGTATTGATGTTGGACGAGTCGGAATCGCA
GACCGATACCAGGATCTTGCCATCCTATGGAACTGCCTCGGTGAGTTTTCTCCTTCATTACA
GAAACGGCTTTTTCAAAAATATGGTATTGATAATCCTGATATGAATAAATTGCAGTTTCATT
TGATGCTCGATGAGTTTTTCTAATCAGAATTGGTTAATTGGTTGTAACACTGGCAGAGCATT
ACGCTGACTTGACGGGACGGCGGCTTTGTTGAATAAATCGAACTTTTGCTGAGTTGAAGGAT
CAGATCACGCATCTTCCCGACAACGCAGACCGTTCCGTGGCAAAGCAAAAGTTCAAAATCAC
CAACTGGTCCACCTACAACAAAGCTCTCATCAACCGTGGCGGGGATCCTCTAGAGTCGACCT
GCAGGCATGCAAGCTTCAGGGTTGAGATGTGTATAAGAGACAG

To determine if the paired reads provide evidence of an insertion site, we need to analyze how the reads align to both the reference genome and the transposon sequence. Typically, for transposon sequencing, we expect to see reads that span the junction between the transposon and the insertion site in the genome.

Given the provided read information and the transposon sequence, here are some key points to consider:

    Read Alignment to the Reference Genome:
        The read is aligned to positions on the reference genome NZ_AKKR01000009 from 31,438 to 31,543 on the reverse strand.
        The CIGAR string 106M52S indicates that 106 bases of the read are mapped to the reference, while the remaining 52 bases are soft clipped (not mapped to the reference).

    Mate Read Information:
        The mate read starts at position 31,408 on the forward strand of the same reference sequence.
        The insert size is -135, suggesting the reads are overlapping or very close to each other.

    Supplementary Alignments:
        There are supplementary alignments on the reference genome:
            From 31,212 to 31,269 on the forward strand (58bp)
            From 31,438 to 31,543 on the reverse strand (107bp)

    Soft Clipped Sequence:
        The 52 soft clipped bases from the read may contain the transposon sequence or part of it, indicating a junction between the transposon and the genome.

To check for evidence of an insertion site, we should:

    Extract the soft clipped sequence.
    Compare it with the transposon sequence to see if it matches or partially matches the transposon sequence.

Extracting the Soft Clipped Sequence

Given the read length is 158bp and 106 bases are mapped, the remaining 52 bases are soft clipped.

The soft clipped sequence should be extracted from the end of the read.
Check if the Soft Clipped Sequence Matches the Transposon Sequence

You can use sequence alignment tools or a simple script to compare the soft clipped sequence with the transposon sequence.

Let's extract and compare the soft clipped sequence manually.
Manual Extraction and Comparison

Assume the read sequence is as follows:

less

Read sequence (158bp): [106 bases aligned][52 bases soft clipped]

To extract the 52 soft clipped bases:

    If the read is provided or can be retrieved from the FASTQ file, you can directly extract the sequence.
    Compare the extracted sequence with the start of the transposon sequence.

Example Extraction and Comparison Script

Here is a Python script to extract the soft clipped sequence and compare it with the transposon sequence:

python

# Example read sequence (for illustration, actual sequence needed from FASTQ)
read_sequence = "ACGT... (158bp read sequence here)"

# Extract the 52 soft clipped bases from the end
soft_clipped_sequence = read_sequence[-52:]

# Provided transposon sequence
transposon_sequence = ("CTGTCTCTTATACACATCTCAACCATCATCGATGAATTGTGTCTCAAAATCTCTGATGTTAC"
                       "ATTGCACAAGATAAAAATATATCATCATGAACAATAAAACTGTCTGCTTACATAAACAGTAA"
                       "TACAAGGGGTGTTATGAGCCATATTCAACGGGAAACGTCTTGCTCGAGGCCGCGATTAAATT"
                       "CCAACATGGATGCTGATTTATATGGGTATAAATGGGCTCGCGATAATGTCGGGCAATCAGGT"
                       "GCGACAATCTATCGATTGTATGGGAAGCCCGATGCGCCAGAGTTGTTTCTGAAACATGGCAA"
                       "AGGTAGCGTTGCCAATGATGTTACAGATGAGATGGTCAGACTAAACTGGCTGACGGAATTTA"
                       "TGCCTCTTCCGACCATCAAGCATTTTATCCGTACTCCTGATGATGCATGGTTACTCACCACT"
                       "GCGATCCCCGGAAAAACAGCATTCCAGGTATTAGAAGAATATCCTGATTCAGGTGAAAATAT"
                       "TGTTGATGCGCTGGCAGTGTTCCTGCGCCGGTTGCATTCGATTCCTGTTTGTAATTGTCCTT"
                       "TTAACAGCGATCGCGTATTTCGTCTCGCTCAGGCGCAATCACGAATGAATAACGGTTTGGTT"
                       "GATGCGAGTGATTTTGATGACGAGCGTAATGGCTGGCCTGTTGAACAAGTCTGGAAAGAAAT"
                       "GCATAAACTTTTGCCATTCTCACCGGATTCAGTCGTCACTCATGGTGATTTCTCACTTGATA"
                       "ACCTTATTTTTGACGAGGGGAAATTAATAGGTTGTATTGATGTTGGACGAGTCGGAATCGCA"
                       "GACCGATACCAGGATCTTGCCATCCTATGGAACTGCCTCGGTGAGTTTTCTCCTTCATTACA"
                       "GAAACGGCTTTTTCAAAAATATGGTATTGATAATCCTGATATGAATAAATTGCAGTTTCATT"
                       "TGATGCTCGATGAGTTTTTCTAATCAGAATTGGTTAATTGGTTGTAACACTGGCAGAGCATT"
                       "ACGCTGACTTGACGGGACGGCGGCTTTGTTGAATAAATCGAACTTTTGCTGAGTTGAAGGAT"
                       "CAGATCACGCATCTTCCCGACAACGCAGACCGTTCCGTGGCAAAGCAAAAGTTCAAAATCAC"
                       "CAACTGGTCCACCTACAACAAAGCTCTCATCAACCGTGGCGGGGATCCTCTAGAGTCGACCT"
                       "GCAGGCATGCAAGCTTCAGGGTTGAGATGTGTATAAGAGACAG")

# Check if the soft clipped sequence matches the start of the transposon sequence
if soft_clipped_sequence in transposon_sequence:
    print("The soft clipped sequence matches the transposon sequence, indicating a potential insertion site.")
else:
    print("No match found. The soft clipped sequence does not match the transposon sequence.")

# Print the sequences for reference
print("Soft clipped sequence:", soft_clipped_sequence)
print("Transposon sequence start:", transposon_sequence[:52])

Conclusion

To determine if the paired reads provide evidence of an insertion site:

    Extract the soft clipped sequence from the read.
    Compare the soft clipped sequence with the transposon sequence to see if there is a match.
    If there is a match, it indicates a junction between the transposon and the genome, suggesting an insertion site.

The analysis will confirm if the reads are evidence of an insertion site by showing the overlap or junction between the transposon and the genomic sequence.

#--------------
#VH00358:89:AAFC5MTM5:1:2202:39325:27126_:N:0:TTTCTCTA+CTCGACG

I have a lot of reads over 20000 reads in certain positions with soft clips (see the attached image). could you write a script check how many reads from initial_mutants_rep1_no_insertion_sites.sam can confirm the NZ_AKKR01000009:31537 is a insertion site?

python check_insertion_site.py

>transposon
CTGTCTCTTATACACATCTCAACCATCATCGATGAATTGTGTCTCAAAATCTCTGATGTTACATTGCACAAGATAAAAATATATCATCATGAACAATAAAACTGTCTGCTTACATAAACAGTAATACAAGGGGTGTTATGAGCCATATTCAACGGGAAACGTCTTGCTCGAGGCCGCGATTAAATTCCAACATGGATGCTGATTTATATGGGTATAAATGGGCTCGCGATAATGTCGGGCAATCAGGTGCGACAATCTATCGATTGTATGGGAAGCCCGATGCGCCAGAGTTGTTTCTGAAACATGGCAAAGGTAGCGTTGCCAATGATGTTACAGATGAGATGGTCAGACTAAACTGGCTGACGGAATTTATGCCTCTTCCGACCATCAAGCATTTTATCCGTACTCCTGATGATGCATGGTTACTCACCACTGCGATCCCCGGAAAAACAGCATTCCAGGTATTAGAAGAATATCCTGATTCAGGTGAAAATATTGTTGATGCGCTGGCAGTGTTCCTGCGCCGGTTGCATTCGATTCCTGTTTGTAATTGTCCTTTTAACAGCGATCGCGTATTTCGTCTCGCTCAGGCGCAATCACGAATGAATAACGGTTTGGTTGATGCGAGTGATTTTGATGACGAGCGTAATGGCTGGCCTGTTGAACAAGTCTGGAAAGAAATGCATAAACTTTTGCCATTCTCACCGGATTCAGTCGTCACTCATGGTGATTTCTCACTTGATAACCTTATTTTTGACGAGGGGAAATTAATAGGTTGTATTGATGTTGGACGAGTCGGAATCGCAGACCGATACCAGGATCTTGCCATCCTATGGAACTGCCTCGGTGAGTTTTCTCCTTCATTACAGAAACGGCTTTTTCAAAAATATGGTATTGATAATCCTGATATGAATAAATTGCAGTTTCATTTGATGCTCGATGAGTTTTTCTAATCAGAATTGGTTAATTGGTTGTAACACTGGCAGAGCATTACGCTGACTTGACGGGACGGCGGCTTTGTTGAATAAATCGAACTTTTGCTGAGTTGAAGGATCAGATCACGCATCTTCCCGACAACGCAGACCGTTCCGTGGCAAAGCAAAAGTTCAAAATCACCAACTGGTC
|CACCTACAACAAAGCTCTCATCAACCGTGGCGGGGATCCTCTAGAGTCGACCTGCAGGCATGCAAGCTTCAGGGTTGAGATGTGTATAAGAGACAG

add the step to compare the clipped sequences to the transposon sequences. to locate which subsequences of transposon are inserted?

#In "saturation_graph.png," the x-axis represents the input paired reads number, and the y-axis shows the counts for both the template count and TAs hit. It suggests that deeper sequencing will yield more template counts and TAs hits, particularly for the template counts.

samtools flagstat initial_mutants_rep1_sorted_no_insertion_sites.bam
17836098 + 0 in total (QC-passed reads + QC-failed reads)
17173546 + 0 primary
662552 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
17000489 + 0 mapped (95.32% : N/A)
16337937 + 0 primary mapped (95.13% : N/A)
17173546 + 0 paired in sequencing
8586773 + 0 read1
8586773 + 0 read2
15942110 + 0 properly paired (92.83% : N/A)
16234318 + 0 with itself and mate mapped
103619 + 0 singletons (0.60% : N/A)
167124 + 0 with mate mapped to a different chr
163885 + 0 with mate mapped to a different chr (mapQ>=5)

ME Erkennungs Sequenz 5 ́ AGATGTGTATAAGAGACAG 3 ́
ME Sequenz Komplementär: 3 ́ TCTACACATATTCTCTGTC 5 ́
ME Sequenz Rev Komplementär: 5 ́ CTGTCTCTTATACACATCT 3 ́
Kan-2 FP-1 Forward Primer 5 ́ ACCTACAACAAAGCTCTCATCAACC 3 ́
Kan-2 RP-1 Reverse Primer 5 ́ GCAATGTAACATCAGAGATTTTGAG 3 ́
Ampli-Start Senior et al 5’ ACCTACAACAAAGCTCTCATCAACC
 3 ́
Sequenzierungs-Strategie (NEBNext Ultra II DNA Modifikation)
NEB_i5_Tn_01
5'-ACACTCTTTCCCTACACGACGCTCTTCCGATCTACCTACAACAAAGCTCTCATCAACC-3'
NEB_i5_Tn_02
5'-ACACTCTTTCCCTACACGACGCTCTTCCGATCTCACCTACAACAAAGCTCTCATCAAC-3'
NEB_i5_Tn_03
5'-ACACTCTTTCCCTACACGACGCTCTTCCGATCTCCTACAACAAAGCTCTCATCAACCG-3'
>> mix 3 forward primer equimolar to create diversity in start of
read 1 (frameshift at the transposon amplification sequence)
NEB_i7_rc
5'-GACTGGAGTTCAGACGTGTGCTCTTCCGATC-3'

>transposon
CTGTCTCTTATACACATCTCAACCATCATCGATGAATTGTGTCTCAAAATCTCTGATGTTACATTGCACAAGATAAAAATATATCATCATGAACAATAAAACTGTCTGCTTACATAAACAGTAATACAAGGGGTGTTATGAGCCATATTCAACGGGAAACGTCTTGCTCGAGGCCGCGATTAAATTCCAACATGGATGCTGATTTATATGGGTATAAATGGGCTCGCGATAATGTCGGGCAATCAGGTGCGACAATCTATCGATTGTATGGGAAGCCCGATGCGCCAGAGTTGTTTCTGAAACATGGCAAAGGTAGCGTTGCCAATGATGTTACAGATGAGATGGTCAGACTAAACTGGCTGACGGAATTTATGCCTCTTCCGACCATCAAGCATTTTATCCGTACTCCTGATGATGCATGGTTACTCACCACTGCGATCCCCGGAAAAACAGCATTCCAGGTATTAGAAGAATATCCTGATTCAGGTGAAAATATTGTTGATGCGCTGGCAGTGTTCCTGCGCCGGTTGCATTCGATTCCTGTTTGTAATTGTCCTTTTAACAGCGATCGCGTATTTCGTCTCGCTCAGGCGCAATCACGAATGAATAACGGTTTGGTTGATGCGAGTGATTTTGATGACGAGCGTAATGGCTGGCCTGTTGAACAAGTCTGGAAAGAAATGCATAAACTTTTGCCATTCTCACCGGATTCAGTCGTCACTCATGGTGATTTCTCACTTGATAACCTTATTTTTGACGAGGGGAAATTAATAGGTTGTATTGATGTTGGACGAGTCGGAATCGCAGACCGATACCAGGATCTTGCCATCCTATGGAACTGCCTCGGTGAGTTTTCTCCTTCATTACAGAAACGGCTTTTTCAAAAATATGGTATTGATAATCCTGATATGAATAAATTGCAGTTTCATTTGATGCTCGATGAGTTTTTCTAATCAGAATTGGTTAATTGGTTGTAACACTGGCAGAGCATTACGCTGACTTGACGGGACGGCGGCTTTGTTGAATAAATCGAACTTTTGCTGAGTTGAAGGATCAGATCACGCATCTTCCCGACAACGCAGACCGTTCCGTGGCAAAGCAAAAGTTCAAAATCACCAACTGGTC
|CACCTACAACAAAGCTCTCATCAACCGTGGCGGGGATCCTCTAGAGTCGACCTGCAGGCATGCAAGCTTCAGGGTTGAGATGTGTATAAGAGACAG

AGAGCTTTGTTGTAGGTG -->
GTGGATGTTGTTTCGAGA
TCTCGAAACAACATCCAC
CACCTACAACAAAGCTCT *

AGAGCTTTGTTGTAGG

Draw plots for miRNAs generated by COMPSRA

pca.png

pca2.png

miRNA_heatmap_keeping_replicates.png

miRNA_heatmap_merging_replicates.png

  1. Generate the following files according to STEPS 1-4 from http://xgenes.com/article/article-content/239/small-rna-sequencing-processing-in-the-example-of-smallrna-7/, http://xgenes.com/article/article-content/232/small-rna-sequencing-processing-in-the-example-of-smallrna-7/, and http://xgenes.com/article/article-content/156/small-rna-processing/. For COMPSRA_MERGE_0_miRNA.txt, we also need STEP 5 to add the read numbers of MCPyV-M1.

    COMPSRA_MERGE_0_miRNA.txt *
    COMPSRA_MERGE_0_piRNA.txt *
    COMPSRA_MERGE_0_snRNA.txt *
    COMPSRA_MERGE_0_tRNA.txt
    COMPSRA_MERGE_0_snoRNA.txt
    COMPSRA_MERGE_0_circRNA.txt
  2. Input files for miRNA are two files: COMPSRA_MERGE_0_miRNA.txt and ids under “/media/jhuang/Elements/Data_Ute/Data_Ute_smallRNA_7/miRNA_re”

    • COMPSRA_MERGE_0_miRNA.txt

      #./our_out_without_cutadapt/COMPSRA_MERGE_0_miRNA.txt #./our_out_on_hg38/COMPSRA_MERGE_0_miRNA.txt #./our_out_on_hg38+JN707599_miRNA_merging_replicates/COMPSRA_MERGE_0_miRNA.txt diff ./our_out_on_hg38/COMPSRA_MERGE_0_miRNA.txt ./our_out_on_hg38+JN707599_miRNA_merging_replicates/COMPSRA_MERGE_0_miRNA.txt #–> different! #BUG: why using the same code resulting in different miRNA results! A little different!! #–> using ./our_out_on_hg38+JN707599_miRNA_merging_replicates/COMPSRA_MERGE_0_miRNA.txt

      cp our_out_on_hg38+JN707599_miRNA_merging_replicates/COMPSRA_MERGE_0_miRNA.txt miRNA_re cd miRNA_re

    • prepare the file ids

      #cp ./our_out_on_hg38+JN707599_miRNA_merging_replicates/ids miRNA_re for i in EV_vs_parental sT_Dox_vs_sT_DMSO sT_Dox_vs_scr_Dox scr_Dox_vs_scr_DMSO sT_DMSO_vs_scr_DMSO; do echo “cut -d’,’ -f1-1 ${i}-up.txt > ${i}-up.id”; echo “cut -d’,’ -f1-1 ${i}-down.txt > ${i}-down.id”; done cat *.id | sort -u > ids #add Gene_Id in the first line, delete the “”

  3. Draw plots with R using DESeq2

    #BiocManager::install("AnnotationDbi")
    #BiocManager::install("clusterProfiler")
    #BiocManager::install(c("ReactomePA","org.Hs.eg.db"))
    #BiocManager::install("limma")
    library("AnnotationDbi")
    library("clusterProfiler")
    library("ReactomePA")
    library("org.Hs.eg.db")
    library(DESeq2)
    library(gplots)
    library(limma)
    # Check the current library paths
    .libPaths()
    #setwd("/home/jhuang/DATA/Data_Ute/Data_Ute_smallRNA_7/our_out_on_hg38+JN707599_2024_corrected/")
    
    d.raw<- read.delim2("COMPSRA_MERGE_0_miRNA.txt",sep="\t", header=TRUE, row.names=1)
    d.raw$X <- NULL
    #colnames(d.raw)<- c("WaGa_EV", "MKL1_EV", "WaGa_wt", "MKL1_wt")
    d.raw[] <- lapply(d.raw, as.numeric)
    
    #MCPyV-M1   0   0   29  0   23  0   30  8   0   0   202 196
    # New row to add
    new_row <- data.frame(
      X0505_WaGa_sT_DMSO = 0,
      X1905_WaGa_sT_DMSO = 0,
      X0505_WaGa_sT_Dox = 29,
      X1905_WaGa_sT_Dox = 0,
      X0505_WaGa_scr_DMSO = 23,
      X1905_WaGa_scr_DMSO = 0,
      X0505_WaGa_scr_Dox = 30,
      X1905_WaGa_scr_Dox = 8,
      X0505_WaGa_wt = 0,
      X1905_WaGa_wt = 0,
      control_MKL1 = 202,
      control_WaGa = 196,
      row.names = "MCPyV-M1"
    )
    
    # Add the new row to the data frame
    d.raw <- rbind(d.raw, new_row)
    
    #cell_line = as.factor(c("WaGa","WaGa", "WaGa","WaGa", "WaGa","WaGa", "WaGa","WaGa", "WaGa","WaGa", "MKL1","WaGa"))
    #condition = as.factor(c("sT","sT", "sT","sT", "scr","scr", "scr","scr", "wt","wt", "control","control"))
    #treatment = as.factor(c("DMSO","DMSO", "Dox","Dox", "DMSO","DMSO", "Dox","Dox", "wt","wt", "control","control"))
    
    EV_or_parental = as.factor(c("EV","EV", "EV","EV", "EV","EV", "EV","EV", "EV","EV", "parental","parental"))
    donor = as.factor(c("0505","1905", "0505","1905", "0505","1905", "0505","1905", "0505","1905", "0505","1905"))
    replicates = as.factor(c("sT_DMSO","sT_DMSO", "sT_Dox","sT_Dox", "scr_DMSO","scr_DMSO", "scr_Dox","scr_Dox", "wt","wt", "control","control"))
    ids = as.factor(c("0505_WaGa_sT_DMSO","1905_WaGa_sT_DMSO","0505_WaGa_sT_Dox","1905_WaGa_sT_Dox","0505_WaGa_scr_DMSO","1905_WaGa_scr_DMSO","0505_WaGa_scr_Dox","1905_WaGa_scr_Dox","0505_WaGa_wt","1905_WaGa_wt","control_MKL1","control_WaGa"))
    
    cData = data.frame(row.names=colnames(d.raw), replicates=replicates, ids=ids, donor=donor, EV_or_parental=EV_or_parental)
    dds<-DESeqDataSetFromMatrix(countData=d.raw, colData=cData, design=~replicates+donor)
    
    rld <- rlogTransformation(dds)
    
    # -- before pca --
    png("pca.png", 1200, 800)
    plotPCA(rld, intgroup=c("replicates"))
    #plotPCA(rld, intgroup = c("replicates", "batch"))
    #plotPCA(rld, intgroup = c("replicates", "ids"))
    #plotPCA(rld, "batch")
    dev.off()
    
    png("pca2.png", 1200, 800)
    plotPCA(rld, intgroup=c("donor"))
    dev.off()
    
    ##### STEP3: prepare all_genes #####
    rld <- rlogTransformation(dds)
    mat <- assay(rld)
    mm <- model.matrix(~replicates, colData(rld))
    mat <- limma::removeBatchEffect(mat, batch=rld$donor, design=mm)
    assay(rld) <- mat
    RNASeq.NoCellLine <- assay(rld)
    # reorder the columns
    colnames(RNASeq.NoCellLine) = c("0505 WaGa sT DMSO","1905 WaGa sT DMSO","0505 WaGa sT Dox","1905 WaGa sT Dox","0505 WaGa scr DMSO","1905 WaGa scr DMSO","0505 WaGa scr Dox","1905 WaGa scr Dox","0505 WaGa wt","1905 WaGa wt","control MKL1","control WaGa")
    col.order <-c("control MKL1",  "control WaGa","0505 WaGa wt","1905 WaGa wt","0505 WaGa sT DMSO","1905 WaGa sT DMSO","0505 WaGa sT Dox","1905 WaGa sT Dox","0505 WaGa scr DMSO","1905 WaGa scr DMSO","0505 WaGa scr Dox","1905 WaGa scr Dox")
    RNASeq.NoCellLine <- RNASeq.NoCellLine[,col.order]
    
    #Option4: manully defining
    #for i in EV_vs_parental sT_Dox_vs_sT_DMSO sT_Dox_vs_scr_Dox scr_Dox_vs_scr_DMSO sT_DMSO_vs_scr_DMSO; do echo "cut -d',' -f1-1 ${i}-up.txt > ${i}-up.id"; echo "cut -d',' -f1-1 ${i}-down.txt > ${i}-down.id"; done
    #cat *.id | sort -u > ids
    ##add Gene_Id in the first line, delete the ""
    GOI <- read.csv("ids")$Gene_Id
    datamat = RNASeq.NoCellLine[GOI, ]
    
    ##### STEP4: clustering the genes and draw heatmap #####
    datamat <- datamat[,-1]  #delete the sample "control MKL1"
    colnames(datamat)[1] <- "WaGa control"  #rename the isolate names according to the style of RNA-seq as follows?
    colnames(datamat)[2] <- "WaGa wildtype 0505"
    colnames(datamat)[3] <- "WaGa wildtype 1905"
    colnames(datamat)[4] <- "WaGa sT DMSO 0505"
    colnames(datamat)[5] <- "WaGa sT DMSO 1905"
    colnames(datamat)[6] <- "WaGa sT Dox 0505"
    colnames(datamat)[7] <- "WaGa sT Dox 1905"
    colnames(datamat)[8] <- "WaGa scr DMSO 0505"
    colnames(datamat)[9] <- "WaGa scr DMSO 1905"
    colnames(datamat)[10] <- "WaGa scr Dox 0505"
    colnames(datamat)[11] <- "WaGa scr Dox 1905"
    write.csv(datamat, file ="gene_expression_keeping_replicates.txt")
    
    #"ward.D"’, ‘"ward.D2"’,‘"single"’, ‘"complete"’, ‘"average"’ (= UPGMA), ‘"mcquitty"’(= WPGMA), ‘"median"’ (= WPGMC) or ‘"centroid"’ (= UPGMC)
    hr <- hclust(as.dist(1-cor(t(datamat), method="pearson")), method="complete")
    hc <- hclust(as.dist(1-cor(datamat, method="spearman")), method="complete")
    mycl = cutree(hr, h=max(hr$height)/2.0)
    mycol = c("YELLOW", "BLUE", "ORANGE", "CYAN", "GREEN", "MAGENTA", "GREY", "LIGHTCYAN", "RED",     "PINK", "DARKORANGE", "MAROON",  "LIGHTGREEN", "DARKBLUE",  "DARKRED",   "LIGHTBLUE", "DARKCYAN",  "DARKGREEN", "DARKMAGENTA");
    
    mycol = mycol[as.vector(mycl)]
    png("miRNA_heatmap_keeping_replicates.png", width=800, height=1000)
    #svg("DEGs_heatmap_keeping_replicates.svg", width=6, height=8)
    heatmap.2(as.matrix(datamat),
      Rowv=as.dendrogram(hr),
      Colv=NA,
      dendrogram='row',
      labRow="",
      scale='row',
      trace='none',
      col=bluered(75),
      RowSideColors=mycol,
      srtCol=20,
      lhei=c(1,8),
      #cexRow=1.2,   # Increase row label font size
      cexCol=1.7,    # Increase column label font size
      margin=c(7,1)
     )
    dev.off()
    
    #### cluster members #####
    write.csv(names(subset(mycl, mycl == '1')),file='YELLOW.txt')
    write.csv(names(subset(mycl, mycl == '2')),file='BLUE.txt')
    write.csv(names(subset(mycl, mycl == '3')),file='ORANGE.txt')
    write.csv(names(subset(mycl, mycl == '4')),file='CYAN.txt')
    write.csv(names(subset(mycl, mycl == '5')),file='GREEN.txt')
    write.csv(names(subset(mycl, mycl == '6')),file='MAGENTA.txt')
    write.csv(names(subset(mycl, mycl == '7')),file='GREY.txt')
    write.csv(names(subset(mycl, mycl == '8')),file='LIGHTCYAN.txt')
    write.csv(names(subset(mycl, mycl == '9')),file='RED.txt')
    #~/Tools/csv2xls-0.4/csv_to_xls.py gene_expression_keeping_replicates.txt YELLOW.txt ORANGE.txt BLUE.txt GREEN.txt CYAN.txt MAGENTA.txt LIGHTCYAN.txt GREY.txt RED.txt -d',' -o miRNA_heatmap_keeping_replicates.xls
    
    # merging replicates
    datamat <- cbind(datamat, "WaGa wildtype" = rowMeans(datamat[, 2:3]))
    datamat <- cbind(datamat, "WaGa sT DMSO" = rowMeans(datamat[, 4:5]))
    datamat <- cbind(datamat, "WaGa sT Dox" = rowMeans(datamat[, 6:7]))
    datamat <- cbind(datamat, "WaGa scr DMSO" = rowMeans(datamat[, 8:9]))
    datamat <- cbind(datamat, "WaGa scr Dox" = rowMeans(datamat[, 10:11]))
    datamat <- datamat[,c(-2:-11)]
    write.csv(datamat, file ="gene_expression_merging_replicates.txt")
    
    # Ensure 'mycl' is calculated properly.
    mycl <- cutree(hr, h=max(hr$height)/2.2)
    # TODO: Rearrange the colors of the plot *_merging_replicates.png to match those in *_keeping_replicates.png!
    # mycol = c("YELLOW", "BLUE", "ORANGE", "CYAN", "GREEN", "MAGENTA", "GREY", "LIGHTCYAN", "RED",     "PINK", "DARKORANGE", "MAROON",  "LIGHTGREEN", "DARKBLUE",  "DARKRED",   "LIGHTBLUE", "DARKCYAN",  "DARKGREEN", "DARKMAGENTA");
    
    # Now map your clusters to colors, making sure that there's one color for each row:
    actualColors <- mycol[mycl]  # Assign colors based on cluster assignment
    
    # Then use these 'actualColors' in your heatmap:
    png("miRNA_heatmap_merging_replicates.png", width=800, height=1000)
    heatmap.2(as.matrix(datamat),
              Rowv=as.dendrogram(hr),
              Colv=NA,
              dendrogram='row',
              labRow="",
              scale='row',
              trace='none',
              col=bluered(75),
              RowSideColors=actualColors, # Update this part
              srtCol=20,
              lhei=c(1,8),
              cexCol=1.7,    # Increase column label font size
              margin=c(7,1)
            )
    dev.off()
    
    #### cluster members #####
    write.csv(names(subset(mycl, mycl == '1')),file='YELLOW.txt')
    write.csv(names(subset(mycl, mycl == '2')),file='BLUE.txt')
    write.csv(names(subset(mycl, mycl == '3')),file='ORANGE.txt')
    write.csv(names(subset(mycl, mycl == '4')),file='CYAN.txt')
    write.csv(names(subset(mycl, mycl == '5')),file='GREEN.txt')
    write.csv(names(subset(mycl, mycl == '6')),file='MAGENTA.txt')
    write.csv(names(subset(mycl, mycl == '7')),file='GREY.txt')
    write.csv(names(subset(mycl, mycl == '8')),file='LIGHTCYAN.txt')
    #write.csv(names(subset(mycl, mycl == '9')),file='RED.txt')
    #~/Tools/csv2xls-0.4/csv_to_xls.py gene_expression_merging_replicates.txt YELLOW.txt BLUE.txt ORANGE.txt CYAN.txt MAGENTA.txt GREEN.txt LIGHTCYAN.txt GREY.txt -d',' -o miRNA_heatmap_merging_replicates.xls
    #100+11+4+7+2+5+14+63=206
  4. Display viral transcripts found in mRNA-seq (or small RNA) MKL-1, WaGa EVs compared to parental cells

    http://xgenes.com/article/article-content/87/display-viral-transcripts-found-in-mrna-seq-mkl-1-waga-evs-compared-to-cells/
  5. TODO: why miRNA_heatmap_keeping_replicates.png and miRNA_heatmap_merging_replicates.png are different while gene_expression_keeping_replicates.txt and gene_expression_merging_replicates.txt are the same?

    diff our_out_on_hg38+JN707599_miRNA_keeping_replicates/gene_expression_keeping_replicates.txt miRNA_re/gene_expression_keeping_replicates.txt
    diff our_out_on_hg38+JN707599_miRNA_merging_replicates/gene_expression_merging_replicates.txt miRNA_re/gene_expression_merging_replicates.txt
    
    display our_out_on_hg38+JN707599_miRNA_keeping_replicates/miRNA_heatmap_keeping_replicates.png
    display our_out_on_hg38+JN707599_miRNA_merging_replicates/miRNA_heatmap_merging_replicates.png
    display miRNA_re/miRNA_heatmap_keeping_replicates.png
    display miRNA_re/miRNA_heatmap_merging_replicates.png
    
    #TODO: If necessary, in the next revision, say the plots cannot be reproduced due to the software-update.

Draw plots for piRNA generated by COMPSRA

piRNA_pca.png

piRNA_heatmap.png

  1. Generate the following files according to STEPS 1-4 from http://xgenes.com/article/article-content/239/small-rna-sequencing-processing-in-the-example-of-smallrna-7/, http://xgenes.com/article/article-content/232/small-rna-sequencing-processing-in-the-example-of-smallrna-7/, and http://xgenes.com/article/article-content/156/small-rna-processing/. For COMPSRA_MERGE_0_miRNA.txt, we also need STEP 5 to add the read numbers of MCPyV-M1.

    COMPSRA_MERGE_0_miRNA.txt
    COMPSRA_MERGE_0_piRNA.txt *
    COMPSRA_MERGE_0_snRNA.txt
    COMPSRA_MERGE_0_tRNA.txt
    COMPSRA_MERGE_0_snoRNA.txt
    COMPSRA_MERGE_0_circRNA.txt
  2. Input files for piRNA are two files: COMPSRA_MERGE_0_piRNA.txt and ids

    • COMPSRA_MERGE_0_piRNA.txt

      #The former are more precise due to the reads from virus will be mapped on the virus-genome diff ./our_out_on_hg38+JN707599/COMPSRA_MERGE_0_piRNA.txt ./our_out_on_hg38/COMPSRA_MERGE_0_piRNA.txt diff ./our_out_on_hg38+JN707599/COMPSRA_MERGE_0_snRNA.txt ./our_out_on_hg38/COMPSRA_MERGE_0_snRNA.txt cp ../our_out_on_hg38+JN707599/COMPSRA_MERGE_0_piRNA.txt .

    • prepare the file ids

      #see Option4: manully defining

  3. Draw plots with R using DESeq2

    #BiocManager::install("AnnotationDbi")
    #BiocManager::install("clusterProfiler")
    #BiocManager::install(c("ReactomePA","org.Hs.eg.db"))
    #BiocManager::install("limma")
    library("AnnotationDbi")
    library("clusterProfiler")
    library("ReactomePA")
    library("org.Hs.eg.db")
    library(DESeq2)
    library(gplots)
    library(limma)
    # Check the current library paths
    .libPaths()
    #setwd("/home/jhuang/DATA/Data_Ute/Data_Ute_smallRNA_7/our_out_on_hg38+JN707599_2024_corrected/")
    
    d.raw<- read.delim2("COMPSRA_MERGE_0_piRNA.txt",sep="\t", header=TRUE, row.names=1)
    d.raw$X <- NULL
    d.raw[] <- lapply(d.raw, as.numeric)
    
    EV_or_parental = as.factor(c("EV","EV", "EV","EV", "EV","EV", "EV","EV", "EV","EV", "parental","parental"))
    donor = as.factor(c("0505","1905", "0505","1905", "0505","1905", "0505","1905", "0505","1905", "0505","1905"))
    replicates = as.factor(c("sT_DMSO","sT_DMSO", "sT_Dox","sT_Dox", "scr_DMSO","scr_DMSO", "scr_Dox","scr_Dox", "wt","wt", "control","control"))
    ids = as.factor(c("0505_WaGa_sT_DMSO","1905_WaGa_sT_DMSO","0505_WaGa_sT_Dox","1905_WaGa_sT_Dox","0505_WaGa_scr_DMSO","1905_WaGa_scr_DMSO","0505_WaGa_scr_Dox","1905_WaGa_scr_Dox","0505_WaGa_wt","1905_WaGa_wt","control_MKL1","control_WaGa"))
    
    cData = data.frame(row.names=colnames(d.raw), replicates=replicates, ids=ids, donor=donor, EV_or_parental=EV_or_parental)
    dds<-DESeqDataSetFromMatrix(countData=d.raw, colData=cData, design=~replicates+donor)
    
    rld <- rlogTransformation(dds)
    
    # -- before pca --
    png("pca.png", 1200, 800)
    plotPCA(rld, intgroup=c("replicates"))
    #plotPCA(rld, intgroup = c("replicates", "batch"))
    #plotPCA(rld, intgroup = c("replicates", "ids"))
    #plotPCA(rld, "batch")
    dev.off()
    
    png("pca2.png", 1200, 800)
    plotPCA(rld, intgroup=c("donor"))
    dev.off()
    
    #### STEP2: DEGs ####
    #convert bam to bigwig using deepTools by feeding inverse of DESeq’s size Factor
    sizeFactors(dds)
    #NULL
    dds <- estimateSizeFactors(dds)
    sizeFactors(dds)
    normalized_counts <- counts(dds, normalized=TRUE)
    write.table(normalized_counts, file="normalized_counts.txt", sep="\t", quote=F, col.names=NA)
    
    #---- * to untreated ----
    dds<-DESeqDataSetFromMatrix(countData=d.raw, colData=cData, design=~EV_or_parental+donor)
    dds$EV_or_parental <- relevel(dds$EV_or_parental, "parental")
    dds = DESeq(dds, betaPrior=FALSE)
    resultsNames(dds)
    clist <- c("EV_vs_parental")
    for (i in clist) {
      contrast = paste("EV_or_parental", i, sep="_")
      res = results(dds, name=contrast)
      res <- res[!is.na(res$log2FoldChange),]
      #https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#why-are-some-p-values-set-to-na
      res$padj <- ifelse(is.na(res$padj), 1, res$padj)
      res_df <- as.data.frame(res)
      write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(i, "all.txt", sep="-"))
      up <- subset(res_df, padj<=0.1 & log2FoldChange>=2)
      down <- subset(res_df, padj<=0.1 & log2FoldChange<=-2)
      write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(i, "up.txt", sep="-"))
      write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(i, "down.txt", sep="-"))
    }
    #~/Tools/csv2xls-0.4/csv_to_xls.py EV_vs_parental-all.txt EV_vs_parental-up.txt EV_vs_parental-down.txt -d$',' -o EV_vs_parental.xls;
    
    dds<-DESeqDataSetFromMatrix(countData=d.raw, colData=cData, design=~replicates+donor)
    dds$replicates <- relevel(dds$replicates, "sT_DMSO")
    dds = DESeq(dds, betaPrior=FALSE)
    resultsNames(dds)
    clist <- c("sT_Dox_vs_sT_DMSO")
    
    dds$replicates <- relevel(dds$replicates, "scr_Dox")
    dds = DESeq(dds, betaPrior=FALSE)
    resultsNames(dds)
    clist <- c("sT_Dox_vs_scr_Dox")
    
    dds$replicates <- relevel(dds$replicates, "scr_DMSO")
    dds = DESeq(dds, betaPrior=FALSE)
    resultsNames(dds)
    clist <- c("scr_Dox_vs_scr_DMSO", "sT_DMSO_vs_scr_DMSO")
    
    for (i in clist) {
      contrast = paste("replicates", i, sep="_")
      res = results(dds, name=contrast)
      res <- res[!is.na(res$log2FoldChange),]
      #https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#why-are-some-p-values-set-to-na
      res$padj <- ifelse(is.na(res$padj), 1, res$padj)
      res_df <- as.data.frame(res)
      write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(i, "all.txt", sep="-"))
      up <- subset(res_df, padj<=0.1 & log2FoldChange>=2)
      down <- subset(res_df, padj<=0.1 & log2FoldChange<=-2)
      write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(i, "up.txt", sep="-"))
      write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(i, "down.txt", sep="-"))
    }
    
    ~/Tools/csv2xls-0.4/csv_to_xls.py \
    sT_Dox_vs_sT_DMSO-all.txt \
    sT_Dox_vs_sT_DMSO-up.txt \
    sT_Dox_vs_sT_DMSO-down.txt \
    -d$',' -o sT_Dox_vs_sT_DMSO.xls;
    
    ~/Tools/csv2xls-0.4/csv_to_xls.py \
    sT_Dox_vs_scr_Dox-all.txt \
    sT_Dox_vs_scr_Dox-up.txt \
    sT_Dox_vs_scr_Dox-down.txt \
    -d$',' -o sT_Dox_vs_scr_Dox.xls;
    
    ~/Tools/csv2xls-0.4/csv_to_xls.py \
    scr_Dox_vs_scr_DMSO-all.txt \
    scr_Dox_vs_scr_DMSO-up.txt \
    scr_Dox_vs_scr_DMSO-down.txt \
    -d$',' -o scr_Dox_vs_scr_DMSO.xls;
    
    ~/Tools/csv2xls-0.4/csv_to_xls.py \
    sT_DMSO_vs_scr_DMSO-all.txt \
    sT_DMSO_vs_scr_DMSO-up.txt \
    sT_DMSO_vs_scr_DMSO-down.txt \
    -d$',' -o sT_DMSO_vs_scr_DMSO.xls;
    
    ##### STEP3: prepare all_genes #####
    rld <- rlogTransformation(dds)
    mat <- assay(rld)
    mm <- model.matrix(~replicates, colData(rld))
    mat <- limma::removeBatchEffect(mat, batch=rld$donor, design=mm)
    assay(rld) <- mat
    RNASeq.NoCellLine <- assay(rld)
    # reorder the columns
    colnames(RNASeq.NoCellLine) = c("0505 WaGa sT DMSO","1905 WaGa sT DMSO","0505 WaGa sT Dox","1905 WaGa sT Dox","0505 WaGa scr DMSO","1905 WaGa scr DMSO","0505 WaGa scr Dox","1905 WaGa scr Dox","0505 WaGa wt","1905 WaGa wt","control MKL1","control WaGa")
    col.order <-c("control MKL1",  "control WaGa","0505 WaGa wt","1905 WaGa wt","0505 WaGa sT DMSO","1905 WaGa sT DMSO","0505 WaGa sT Dox","1905 WaGa sT Dox","0505 WaGa scr DMSO","1905 WaGa scr DMSO","0505 WaGa scr Dox","1905 WaGa scr Dox")
    RNASeq.NoCellLine <- RNASeq.NoCellLine[,col.order]
    
    #Option4: manully defining
    #for i in EV_vs_parental sT_Dox_vs_sT_DMSO sT_Dox_vs_scr_Dox scr_Dox_vs_scr_DMSO sT_DMSO_vs_scr_DMSO; do echo "cut -d',' -f1-1 ${i}-up.txt > ${i}-up.id"; echo "cut -d',' -f1-1 ${i}-down.txt > ${i}-down.id"; done
    #cat *.id | sort -u > ids
    ##add Gene_Id in the first line, delete the ""
    GOI <- read.csv("ids")$Gene_Id
    datamat = RNASeq.NoCellLine[GOI, ]
    
    ##### STEP4: clustering the genes and draw heatmap #####
    datamat <- datamat[,-1]  #delete the sample "control MKL1"
    colnames(datamat)[1] <- "WaGa control"  #rename the isolate names according to the style of RNA-seq as follows?
    colnames(datamat)[2] <- "WaGa wildtype 0505"
    colnames(datamat)[3] <- "WaGa wildtype 1905"
    colnames(datamat)[4] <- "WaGa sT DMSO 0505"
    colnames(datamat)[5] <- "WaGa sT DMSO 1905"
    colnames(datamat)[6] <- "WaGa sT Dox 0505"
    colnames(datamat)[7] <- "WaGa sT Dox 1905"
    colnames(datamat)[8] <- "WaGa scr DMSO 0505"
    colnames(datamat)[9] <- "WaGa scr DMSO 1905"
    colnames(datamat)[10] <- "WaGa scr Dox 0505"
    colnames(datamat)[11] <- "WaGa scr Dox 1905"
    write.csv(datamat, file ="gene_expression_keeping_replicates.txt")
    
    #"ward.D"’, ‘"ward.D2"’,‘"single"’, ‘"complete"’, ‘"average"’ (= UPGMA), ‘"mcquitty"’(= WPGMA), ‘"median"’ (= WPGMC) or ‘"centroid"’ (= UPGMC)
    hr <- hclust(as.dist(1-cor(t(datamat), method="pearson")), method="complete")
    hc <- hclust(as.dist(1-cor(datamat, method="spearman")), method="complete")
    mycl = cutree(hr, h=max(hr$height)/1.5)
    mycol = c("YELLOW", "BLUE", "ORANGE", "CYAN", "GREEN", "MAGENTA", "GREY", "LIGHTCYAN", "RED",     "PINK", "DARKORANGE", "MAROON",  "LIGHTGREEN", "DARKBLUE",  "DARKRED",   "LIGHTBLUE", "DARKCYAN",  "DARKGREEN", "DARKMAGENTA");
    
    mycol = mycol[as.vector(mycl)]
    png("piRNA_heatmap_keeping_replicates.png", width=800, height=1000)
    #svg("DEGs_heatmap_keeping_replicates.svg", width=6, height=8)
    heatmap.2(as.matrix(datamat),
      Rowv=as.dendrogram(hr),
      Colv=NA,
      dendrogram='row',
      labRow="",
      scale='row',
      trace='none',
      col=bluered(75),
      RowSideColors=mycol,
      srtCol=20,
      lhei=c(1,8),
      #cexRow=1.2,   # Increase row label font size
      cexCol=1.7,    # Increase column label font size
      margin=c(7,1)
     )
    dev.off()
    
    #### cluster members #####
    write.csv(names(subset(mycl, mycl == '1')),file='YELLOW.txt')
    write.csv(names(subset(mycl, mycl == '2')),file='BLUE.txt')
    write.csv(names(subset(mycl, mycl == '3')),file='ORANGE.txt')
    #~/Tools/csv2xls-0.4/csv_to_xls.py gene_expression_keeping_replicates.txt YELLOW.txt ORANGE.txt BLUE.txt -d',' -o piRNA_heatmap_keeping_replicates.xls
    
    mv piRNA_heatmap_keeping_replicates.png piRNA_heatmap.png
    mv piRNA_heatmap_keeping_replicates.xls piRNA_heatmap.xls
    mv pca.png piRNA_pca.png
    mv EV_vs_parental.xls piRNA_EV_vs_parental.xls
    # --> SENDING piRNA_pca.png, piRNA_EV_vs_parental.xls, piRNA_heatmap.png and piRNA_heatmap.xls
    
    # ---- NOT WORKING WELL ----
    # merging replicates
    datamat <- cbind(datamat, "WaGa wildtype" = rowMeans(datamat[, 2:3]))
    datamat <- cbind(datamat, "WaGa sT DMSO" = rowMeans(datamat[, 4:5]))
    datamat <- cbind(datamat, "WaGa sT Dox" = rowMeans(datamat[, 6:7]))
    datamat <- cbind(datamat, "WaGa scr DMSO" = rowMeans(datamat[, 8:9]))
    datamat <- cbind(datamat, "WaGa scr Dox" = rowMeans(datamat[, 10:11]))
    datamat <- datamat[,c(-2:-11)]
    write.csv(datamat, file ="gene_expression_merging_replicates.txt")
    
    # Ensure 'mycl' is calculated properly.
    mycl <- cutree(hr, h=max(hr$height)/2.2)
    # TODO: Rearrange the colors of the plot *_merging_replicates.png to match those in *_keeping_replicates.png!
    # mycol = c("YELLOW", "BLUE", "ORANGE", "CYAN", "GREEN", "MAGENTA", "GREY", "LIGHTCYAN", "RED",     "PINK", "DARKORANGE", "MAROON",  "LIGHTGREEN", "DARKBLUE",  "DARKRED",   "LIGHTBLUE", "DARKCYAN",  "DARKGREEN", "DARKMAGENTA");
    
    # Now map your clusters to colors, making sure that there's one color for each row:
    actualColors <- mycol[mycl]  # Assign colors based on cluster assignment
    
    # Then use these 'actualColors' in your heatmap:
    png("piRNA_heatmap_merging_replicates.png", width=800, height=1000)
    heatmap.2(as.matrix(datamat),
              Rowv=as.dendrogram(hr),
              Colv=NA,
              dendrogram='row',
              labRow="",
              scale='row',
              trace='none',
              col=bluered(75),
              RowSideColors=actualColors, # Update this part
              srtCol=20,
              lhei=c(1,8),
              cexCol=1.7,    # Increase column label font size
              margin=c(7,1)
            )
    dev.off()
    
    #### cluster members #####
    write.csv(names(subset(mycl, mycl == '1')),file='YELLOW.txt')
    write.csv(names(subset(mycl, mycl == '2')),file='BLUE.txt')
    write.csv(names(subset(mycl, mycl == '3')),file='ORANGE.txt')
    write.csv(names(subset(mycl, mycl == '4')),file='CYAN.txt')
    write.csv(names(subset(mycl, mycl == '5')),file='GREEN.txt')
    write.csv(names(subset(mycl, mycl == '6')),file='MAGENTA.txt')
    write.csv(names(subset(mycl, mycl == '7')),file='GREY.txt')
    write.csv(names(subset(mycl, mycl == '8')),file='LIGHTCYAN.txt')
    #write.csv(names(subset(mycl, mycl == '9')),file='RED.txt')
    #~/Tools/csv2xls-0.4/csv_to_xls.py gene_expression_merging_replicates.txt YELLOW.txt BLUE.txt ORANGE.txt CYAN.txt MAGENTA.txt GREEN.txt LIGHTCYAN.txt GREY.txt -d',' -o piRNA_heatmap_merging_replicates.xls

Draw plots for snRNA generated by COMPSRA

snRNA_pca.png

snRNA_heatmap.png

  1. Generate the following files according to STEPS 1-4 from http://xgenes.com/article/article-content/239/small-rna-sequencing-processing-in-the-example-of-smallrna-7/, http://xgenes.com/article/article-content/232/small-rna-sequencing-processing-in-the-example-of-smallrna-7/, and http://xgenes.com/article/article-content/156/small-rna-processing/. For COMPSRA_MERGE_0_miRNA.txt, we also need STEP 5 to add the read numbers of MCPyV-M1.

    COMPSRA_MERGE_0_miRNA.txt
    COMPSRA_MERGE_0_piRNA.txt
    COMPSRA_MERGE_0_snRNA.txt *
    COMPSRA_MERGE_0_tRNA.txt
    COMPSRA_MERGE_0_snoRNA.txt
    COMPSRA_MERGE_0_circRNA.txt
  2. Input files for snRNA are two files: COMPSRA_MERGE_0_snRNA.txt and ids

    • COMPSRA_MERGE_0_snRNA.txt

      #The former are more precise due to the reads from virus will be mapped on the virus-genome diff ./our_out_on_hg38+JN707599/COMPSRA_MERGE_0_snRNA.txt ./our_out_on_hg38/COMPSRA_MERGE_0_snRNA.txt cp ../our_out_on_hg38+JN707599/COMPSRA_MERGE_0_snRNA.txt .

    • prepare the file ids

      #see Option4: manully defining

  3. Draw plots with R using DESeq2

    #BiocManager::install("AnnotationDbi")
    #BiocManager::install("clusterProfiler")
    #BiocManager::install(c("ReactomePA","org.Hs.eg.db"))
    #BiocManager::install("limma")
    library("AnnotationDbi")
    library("clusterProfiler")
    library("ReactomePA")
    library("org.Hs.eg.db")
    library(DESeq2)
    library(gplots)
    library(limma)
    # Check the current library paths
    .libPaths()
    #setwd("/home/jhuang/DATA/Data_Ute/Data_Ute_smallRNA_7/our_out_on_hg38+JN707599_2024_corrected/")
    
    d.raw<- read.delim2("COMPSRA_MERGE_0_snRNA.txt",sep="\t", header=TRUE, row.names=1)
    d.raw$X <- NULL
    d.raw[] <- lapply(d.raw, as.numeric)
    
    EV_or_parental = as.factor(c("EV","EV", "EV","EV", "EV","EV", "EV","EV", "EV","EV", "parental","parental"))
    donor = as.factor(c("0505","1905", "0505","1905", "0505","1905", "0505","1905", "0505","1905", "0505","1905"))
    replicates = as.factor(c("sT_DMSO","sT_DMSO", "sT_Dox","sT_Dox", "scr_DMSO","scr_DMSO", "scr_Dox","scr_Dox", "wt","wt", "control","control"))
    ids = as.factor(c("0505_WaGa_sT_DMSO","1905_WaGa_sT_DMSO","0505_WaGa_sT_Dox","1905_WaGa_sT_Dox","0505_WaGa_scr_DMSO","1905_WaGa_scr_DMSO","0505_WaGa_scr_Dox","1905_WaGa_scr_Dox","0505_WaGa_wt","1905_WaGa_wt","control_MKL1","control_WaGa"))
    
    cData = data.frame(row.names=colnames(d.raw), replicates=replicates, ids=ids, donor=donor, EV_or_parental=EV_or_parental)
    dds<-DESeqDataSetFromMatrix(countData=d.raw, colData=cData, design=~replicates+donor)
    
    rld <- rlogTransformation(dds)
    
    # -- before pca --
    png("pca.png", 1200, 800)
    plotPCA(rld, intgroup=c("replicates"))
    #plotPCA(rld, intgroup = c("replicates", "batch"))
    #plotPCA(rld, intgroup = c("replicates", "ids"))
    #plotPCA(rld, "batch")
    dev.off()
    
    png("pca2.png", 1200, 800)
    plotPCA(rld, intgroup=c("donor"))
    dev.off()
    
    #### STEP2: DEGs ####
    #convert bam to bigwig using deepTools by feeding inverse of DESeq’s size Factor
    sizeFactors(dds)
    #NULL
    dds <- estimateSizeFactors(dds)
    sizeFactors(dds)
    normalized_counts <- counts(dds, normalized=TRUE)
    write.table(normalized_counts, file="normalized_counts.txt", sep="\t", quote=F, col.names=NA)
    
    #---- * to untreated ----
    dds<-DESeqDataSetFromMatrix(countData=d.raw, colData=cData, design=~EV_or_parental+donor)
    dds$EV_or_parental <- relevel(dds$EV_or_parental, "parental")
    dds = DESeq(dds, betaPrior=FALSE)
    resultsNames(dds)
    clist <- c("EV_vs_parental")
    for (i in clist) {
      contrast = paste("EV_or_parental", i, sep="_")
      res = results(dds, name=contrast)
      res <- res[!is.na(res$log2FoldChange),]
      #https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#why-are-some-p-values-set-to-na
      res$padj <- ifelse(is.na(res$padj), 1, res$padj)
      res_df <- as.data.frame(res)
      write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(i, "all.txt", sep="-"))
      up <- subset(res_df, padj<=0.1 & log2FoldChange>=2)
      down <- subset(res_df, padj<=0.1 & log2FoldChange<=-2)
      write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(i, "up.txt", sep="-"))
      write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(i, "down.txt", sep="-"))
    }
    #~/Tools/csv2xls-0.4/csv_to_xls.py EV_vs_parental-all.txt EV_vs_parental-up.txt EV_vs_parental-down.txt -d$',' -o EV_vs_parental.xls;
    
    dds<-DESeqDataSetFromMatrix(countData=d.raw, colData=cData, design=~replicates+donor)
    dds$replicates <- relevel(dds$replicates, "sT_DMSO")
    dds = DESeq(dds, betaPrior=FALSE)
    resultsNames(dds)
    clist <- c("sT_Dox_vs_sT_DMSO")
    
    dds$replicates <- relevel(dds$replicates, "scr_Dox")
    dds = DESeq(dds, betaPrior=FALSE)
    resultsNames(dds)
    clist <- c("sT_Dox_vs_scr_Dox")
    
    dds$replicates <- relevel(dds$replicates, "scr_DMSO")
    dds = DESeq(dds, betaPrior=FALSE)
    resultsNames(dds)
    clist <- c("scr_Dox_vs_scr_DMSO", "sT_DMSO_vs_scr_DMSO")
    
    for (i in clist) {
      contrast = paste("replicates", i, sep="_")
      res = results(dds, name=contrast)
      res <- res[!is.na(res$log2FoldChange),]
      #https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#why-are-some-p-values-set-to-na
      res$padj <- ifelse(is.na(res$padj), 1, res$padj)
      res_df <- as.data.frame(res)
      write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(i, "all.txt", sep="-"))
      up <- subset(res_df, padj<=0.1 & log2FoldChange>=2)
      down <- subset(res_df, padj<=0.1 & log2FoldChange<=-2)
      write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(i, "up.txt", sep="-"))
      write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(i, "down.txt", sep="-"))
    }
    
    ~/Tools/csv2xls-0.4/csv_to_xls.py \
    sT_Dox_vs_sT_DMSO-all.txt \
    sT_Dox_vs_sT_DMSO-up.txt \
    sT_Dox_vs_sT_DMSO-down.txt \
    -d$',' -o sT_Dox_vs_sT_DMSO.xls;
    
    ~/Tools/csv2xls-0.4/csv_to_xls.py \
    sT_Dox_vs_scr_Dox-all.txt \
    sT_Dox_vs_scr_Dox-up.txt \
    sT_Dox_vs_scr_Dox-down.txt \
    -d$',' -o sT_Dox_vs_scr_Dox.xls;
    
    ~/Tools/csv2xls-0.4/csv_to_xls.py \
    scr_Dox_vs_scr_DMSO-all.txt \
    scr_Dox_vs_scr_DMSO-up.txt \
    scr_Dox_vs_scr_DMSO-down.txt \
    -d$',' -o scr_Dox_vs_scr_DMSO.xls;
    
    ~/Tools/csv2xls-0.4/csv_to_xls.py \
    sT_DMSO_vs_scr_DMSO-all.txt \
    sT_DMSO_vs_scr_DMSO-up.txt \
    sT_DMSO_vs_scr_DMSO-down.txt \
    -d$',' -o sT_DMSO_vs_scr_DMSO.xls;
    
    ##### STEP3: prepare all_genes #####
    rld <- rlogTransformation(dds)
    mat <- assay(rld)
    mm <- model.matrix(~replicates, colData(rld))
    mat <- limma::removeBatchEffect(mat, batch=rld$donor, design=mm)
    assay(rld) <- mat
    RNASeq.NoCellLine <- assay(rld)
    # reorder the columns
    colnames(RNASeq.NoCellLine) = c("0505 WaGa sT DMSO","1905 WaGa sT DMSO","0505 WaGa sT Dox","1905 WaGa sT Dox","0505 WaGa scr DMSO","1905 WaGa scr DMSO","0505 WaGa scr Dox","1905 WaGa scr Dox","0505 WaGa wt","1905 WaGa wt","control MKL1","control WaGa")
    col.order <-c("control MKL1",  "control WaGa","0505 WaGa wt","1905 WaGa wt","0505 WaGa sT DMSO","1905 WaGa sT DMSO","0505 WaGa sT Dox","1905 WaGa sT Dox","0505 WaGa scr DMSO","1905 WaGa scr DMSO","0505 WaGa scr Dox","1905 WaGa scr Dox")
    RNASeq.NoCellLine <- RNASeq.NoCellLine[,col.order]
    
    #Option4: manully defining
    #for i in EV_vs_parental sT_Dox_vs_sT_DMSO sT_Dox_vs_scr_Dox scr_Dox_vs_scr_DMSO sT_DMSO_vs_scr_DMSO; do echo "cut -d',' -f1-1 ${i}-up.txt > ${i}-up.id"; echo "cut -d',' -f1-1 ${i}-down.txt > ${i}-down.id"; done
    #cat *.id | sort -u > ids
    ##add Gene_Id in the first line, delete the ""
    GOI <- read.csv("ids")$Gene_Id
    datamat = RNASeq.NoCellLine[GOI, ]
    
    ##### STEP4: clustering the genes and draw heatmap #####
    datamat <- datamat[,-1]  #delete the sample "control MKL1"
    colnames(datamat)[1] <- "WaGa control"  #rename the isolate names according to the style of RNA-seq as follows?
    colnames(datamat)[2] <- "WaGa wildtype 0505"
    colnames(datamat)[3] <- "WaGa wildtype 1905"
    colnames(datamat)[4] <- "WaGa sT DMSO 0505"
    colnames(datamat)[5] <- "WaGa sT DMSO 1905"
    colnames(datamat)[6] <- "WaGa sT Dox 0505"
    colnames(datamat)[7] <- "WaGa sT Dox 1905"
    colnames(datamat)[8] <- "WaGa scr DMSO 0505"
    colnames(datamat)[9] <- "WaGa scr DMSO 1905"
    colnames(datamat)[10] <- "WaGa scr Dox 0505"
    colnames(datamat)[11] <- "WaGa scr Dox 1905"
    write.csv(datamat, file ="gene_expression_keeping_replicates.txt")
    
    #"ward.D"’, ‘"ward.D2"’,‘"single"’, ‘"complete"’, ‘"average"’ (= UPGMA), ‘"mcquitty"’(= WPGMA), ‘"median"’ (= WPGMC) or ‘"centroid"’ (= UPGMC)
    hr <- hclust(as.dist(1-cor(t(datamat), method="pearson")), method="complete")
    hc <- hclust(as.dist(1-cor(datamat, method="spearman")), method="complete")
    mycl = cutree(hr, h=max(hr$height)/1.5)
    mycol = c("YELLOW", "BLUE", "ORANGE", "CYAN", "GREEN", "MAGENTA", "GREY", "LIGHTCYAN", "RED",     "PINK", "DARKORANGE", "MAROON",  "LIGHTGREEN", "DARKBLUE",  "DARKRED",   "LIGHTBLUE", "DARKCYAN",  "DARKGREEN", "DARKMAGENTA");
    
    mycol = mycol[as.vector(mycl)]
    png("snRNA_heatmap_keeping_replicates.png", width=800, height=1000)
    #svg("DEGs_heatmap_keeping_replicates.svg", width=6, height=8)
    heatmap.2(as.matrix(datamat),
      Rowv=as.dendrogram(hr),
      Colv=NA,
      dendrogram='row',
      labRow="",
      scale='row',
      trace='none',
      col=bluered(75),
      RowSideColors=mycol,
      srtCol=20,
      lhei=c(1,8),
      #cexRow=1.2,   # Increase row label font size
      cexCol=1.7,    # Increase column label font size
      margin=c(7,1)
     )
    dev.off()
    
    #### cluster members #####
    write.csv(names(subset(mycl, mycl == '1')),file='YELLOW.txt')
    write.csv(names(subset(mycl, mycl == '2')),file='BLUE.txt')
    write.csv(names(subset(mycl, mycl == '3')),file='ORANGE.txt')
    #~/Tools/csv2xls-0.4/csv_to_xls.py gene_expression_keeping_replicates.txt YELLOW.txt ORANGE.txt BLUE.txt -d',' -o snRNA_heatmap_keeping_replicates.xls
    
    mv snRNA_heatmap_keeping_replicates.png snRNA_heatmap.png
    mv snRNA_heatmap_keeping_replicates.xls snRNA_heatmap.xls
    mv pca.png snRNA_pca.png
    mv EV_vs_parental.xls snRNA_EV_vs_parental.xls
    mv sT_DMSO_vs_scr_DMSO.xls snRNA_sT_DMSO_vs_scr_DMSO.xls
    mv sT_Dox_vs_scr_Dox.xls snRNA_sT_Dox_vs_scr_Dox.xls
    mv sT_Dox_vs_sT_DMSO.xls snRNA_sT_Dox_vs_sT_DMSO.xls
    mv scr_Dox_vs_scr_DMSO.xls snRNA_scr_Dox_vs_scr_DMSO.xls
    # --> SENDING snRNA_*.png, snRNA_EV_vs_parental.xls, and snRNA_heatmap.xls
    
    # ---- NOT WORKING WELL ----
    # merging replicates
    datamat <- cbind(datamat, "WaGa wildtype" = rowMeans(datamat[, 2:3]))
    datamat <- cbind(datamat, "WaGa sT DMSO" = rowMeans(datamat[, 4:5]))
    datamat <- cbind(datamat, "WaGa sT Dox" = rowMeans(datamat[, 6:7]))
    datamat <- cbind(datamat, "WaGa scr DMSO" = rowMeans(datamat[, 8:9]))
    datamat <- cbind(datamat, "WaGa scr Dox" = rowMeans(datamat[, 10:11]))
    datamat <- datamat[,c(-2:-11)]
    write.csv(datamat, file ="gene_expression_merging_replicates.txt")
    
    # Ensure 'mycl' is calculated properly.
    mycl <- cutree(hr, h=max(hr$height)/2.9)
    # mycol = c("YELLOW", "BLUE", "ORANGE", "CYAN", "GREEN", "MAGENTA", "GREY", "LIGHTCYAN", "RED",     "PINK", "DARKORANGE", "MAROON",  "LIGHTGREEN", "DARKBLUE",  "DARKRED",   "LIGHTBLUE", "DARKCYAN",  "DARKGREEN", "DARKMAGENTA");
    
    # Now map your clusters to colors, making sure that there's one color for each row:
    actualColors <- mycol[mycl]  # Assign colors based on cluster assignment
    
    # Then use these 'actualColors' in your heatmap:
    png("snRNA_heatmap_merging_replicates.png", width=800, height=1000)
    heatmap.2(as.matrix(datamat),
              Rowv=as.dendrogram(hr),
              Colv=NA,
              dendrogram='row',
              labRow="",
              scale='row',
              trace='none',
              col=bluered(75),
              RowSideColors=actualColors, # Update this part
              srtCol=20,
              lhei=c(1,8),
              cexCol=1.7,    # Increase column label font size
              margin=c(7,1)
            )
    dev.off()
    
    #### cluster members #####
    write.csv(names(subset(mycl, mycl == '1')),file='YELLOW.txt')
    write.csv(names(subset(mycl, mycl == '2')),file='BLUE.txt')
    write.csv(names(subset(mycl, mycl == '3')),file='ORANGE.txt')
    write.csv(names(subset(mycl, mycl == '4')),file='CYAN.txt')
    write.csv(names(subset(mycl, mycl == '5')),file='GREEN.txt')
    write.csv(names(subset(mycl, mycl == '6')),file='MAGENTA.txt')
    write.csv(names(subset(mycl, mycl == '7')),file='GREY.txt')
    write.csv(names(subset(mycl, mycl == '8')),file='LIGHTCYAN.txt')
    #write.csv(names(subset(mycl, mycl == '9')),file='RED.txt')
    #~/Tools/csv2xls-0.4/csv_to_xls.py gene_expression_merging_replicates.txt YELLOW.txt BLUE.txt ORANGE.txt CYAN.txt MAGENTA.txt GREEN.txt LIGHTCYAN.txt GREY.txt -d',' -o snRNA_heatmap_merging_replicates.xls

高维数据推断:简介

Text source from https://rafalab.dfci.harvard.edu/

  • 高通量技术已将基础生物学和生物医学科学从数据贫乏的学科转变为数据密集的学科。
  • 一个具体的例子来自于研究基因表达的领域。
  • 基因表达是将DNA(生命的蓝图)复制到RNA(蛋白质合成模板,生命的构建基石)的过程。
  • 在1990年代,基因表达数据的分析相当于在纸上识别黑点或从标准曲线中提取几个数字。
  • 随着微阵列等高通量技术的出现,这一过程突然变为处理成千上万个数字。
  • 最近,RNA测序进一步增加了数据的复杂性。
  • 生物学家从用眼睛或简单的总结来分类结果,转变为每个样本有成千上万(现在是数百万)的测量数据需要分析。
  • 在本章中,我们将重点讨论高通量测量背景下的统计推断。
  • 具体而言,我们关注使用统计检验检测组间差异的问题,并以有意义的方式量化不确定性。
  • 我们还介绍了在分析高通量数据时应结合推断使用的探索性数据分析技术。
  • 在后续章节中,我们将研究聚类、机器学习、因子分析和多层次建模背后的统计学原理。
  • 由于有大量可用的公共数据集,我们使用了几个基因表达的例子。
  • 尽管如此,您将学习的统计技术在其他使用高通量技术的领域也已被证明是有用的。
  • 微阵列、下一代测序、fMRI和质谱等技术都生成数据来回答一些问题,而我们在这里学到的知识将是不可或缺的。

数据包

  • 我们将在接下来的章节中使用的几个示例最好通过R包获取。
  • 这些包可以从GitHub获得,并可以使用devtools包中的install_github函数安装。

三张表

  • 我们在本书中用作示例的大多数数据都是通过高通量技术生成的。

  • 这些技术测量成千上万个特征。

  • 特征的例子包括基因、基因组的单个碱基位置、基因组区域或图像像素强度。

  • 每个特定的测量产品由一组特定的特征定义。

  • 例如,特定的基因表达微阵列产品由它测量的基因集定义。

  • 一个特定的研究通常会使用一个产品对若干实验单位(如个体)进行测量。

  • 最常见的实验单位是个体,但也可以由其他实体定义,例如肿瘤的不同部分。

  • 我们通常按照实验术语将实验单位称为样本。

  • 重要的是不要将这些与前几章提到的样本(例如“随机样本”)混淆。

  • 因此,高通量实验通常由三张表定义:一张包含高通量测量数据,另外两张分别包含关于这张表的列和行的信息。

  • 因为一个数据集通常由一组实验单位定义,而一个产品定义了一组固定的特征,所以高通量测量数据可以存储在一个n × m的矩阵中,其中n是单位数,m是特征数。

  • 在R中,惯例是存储这些矩阵的转置。以下是一个基因表达数据集的示例:

    library(GSE5859Subset)
    data(GSE5859Subset) ##这会加载三张表
    dim(geneExpression)
  • 我们有来自24个个体(实验单位)的血液样本中8793个基因的RNA表达测量数据。

  • 对于大多数统计分析,我们还需要关于这些个体的信息:dim(sampleInfo)

  • 例如,在这种情况下,数据最初是为了比较不同族群的基因表达而收集的。

  • 其中一列,filenames,使我们能够将此表的行与测量表的列连接起来。

    match(sampleInfo$filename,colnames(geneExpression))
    head(match(geneAnnotation$PROBEID,rownames(geneExpression)))

Long-Read Transposon Insertion-site Sequencing vs. short-read TIS methods

https://pubmed.ncbi.nlm.nih.gov/35241765/

Overview_of_LoRTIS_sequencing

  • Transposon insertion site sequencing (TIS) is a powerful method for associating genotype to phenotype.
  • However, all TIS methods described to date use short nucleotide sequence reads which cannot uniquely determine the locations of transposon insertions within repeating genomic sequences where the repeat units are longer than the sequence read length.
  • To overcome this limitation, we have developed a TIS method using Oxford Nanopore sequencing technology that generates and uses long nucleotide sequence reads; we have called this method LoRTIS (Long-Read Transposon Insertion-site Sequencing).
  • LoRTIS enabled the unique localisation of transposon insertion sites within long repetitive genetic elements of E. coli, such as the transposase genes of insertion sequences and copies of the ~ 5 kb ribosomal RNA operon.
  • We demonstrate that LoRTIS is reproducible, gives comparable results to short-read TIS methods for essential genes, and better resolution around repeat elements.
  • The Oxford Nanopore sequencing device that we used is cost-effective, small and easily portable.
  • Thus, LoRTIS is an efficient means of uniquely identifying transposon insertion sites within long repetitive genetic elements and can be easily transported to, and used in, laboratories that lack access to expensive DNA sequencing facilities.

Introduction

  • Transposon insertion site sequencing (TIS) is a robust tool used to identify the genetic loci associated with the phenotype of an ­organism1.
  • The general approach for TIS methods has been developed over the last decade to understand the essential genes in many bacterial and eukaryotic ­species2,3. * TIS has helped the scientific community to understand genes involved in responses to a wide range of stress conditions and has enabled identification of many novel susceptibility and survival ­mechanisms1.
  • The first step in TIS methods is the generation of a transposon mutant library in which the resulting pool of mutants contains many random transposon insertions in all genes (except essential genes) of the target organism at multiple positions4.
  • The location of transposon insertions is identified by sequencing from the transposon insertion site into adjacent regions of the target ­genome5.
  • Saturated mutant libraries can provide very high resolution coverage of the genome allowing the importance of regions within genes to be identified as well as simply a role for the gene ­itself6,7.
  • To identify the repertoire of genes essential for responses to a particular stress condition, the transposon mutant library is grown under that stress condition (test) and in the absence of that stress (control).
  • Differences in the prevalence of transposon mutants within the pools are compared between the stress condition of interest and the control6.
  • Mutants with an advantage following exposure to the stress condition will proliferate while those that are disadvantaged will become depleted.
  • Comparison of the total number of inserts at each site between control and condition experiments indicates loci of i­nterest8.
  • The DNA of mutant libraries is sequenced, using a customised protocol, [e.g. transposon-directed insertion site sequencing (TraDIS), transposon sequencing (Tn-seq), high-throughput insertion tracking by deep sequencing (HITS) and insertion sequencing (INSeq)], from the transposon reading into the adjacent genome to identify the location of transposon insertions5,9–11.
  • These TIS methods use different transposon-genome junction enrichment techniques including restriction enzyme digestion, (e.g. MmeI), or the use of customised primer annealing followed by high-throughput sequencing of the resulting ­libraries5,12. These transposon based methods all use short-read sequencing, and the majority of them use either Illumina or Ion torrent sequencing ­platforms13.
  • Sequence reads will not map uniquely to repeat sequences within a reference genome that are longer than the nucleotide sequence reads themselves (Fig. 1)14. * The sequence read length generated by Illumina-based approaches is around 300 base pairs (for a single read), and so reads cannot map uniquely to repeat-sequence motifs that are longer than these ­reads15.
  • Examples of long repeat regions in E. coli include transposase insertion sequences, which are usually over 600 bp, and the several copies of the ribosomal RNA operons, which are over 5 ­kb16.
  • The repeat regions are even a greater challenge in eukaryotic genomes which are much larger and have numerous repeating genetic elements.
  • Many repeating elements are of evolutionary importance, and/or have roles in replication, recombination, transcription and genome ­rearrangement17.
  • Thus, transposon insertions into different copies of a repeating element may have different phenotypic effects.
  • Therefore, determining the unique locations of transposon insertion sites in repeating elements is essential for formulating meaningful hypotheses from TIS data.
  • Here, we describe LoRTIS (Long-Read Transposon Insertion-site Sequencing), which overcomes the limitations resulting from using short-read nucleotide sequences.
  • In LoRTIS, as with other TIS methods, we have used the known sequences of the transposon as an anchor point from which to generate DNA fragments by amplification of DNA into the unknown transposon insertion sites for sequencing.
  • However, in LoRTIS, by modifying this amplification step, much longer DNA fragments are generated, from which long sequence reads are possible using a MinION nanopore sequencer (Oxford Nanopore)18.
  • The resulting long-reads are then aligned with the nucleotide sequence of a reference genome, and the locations identifying the precise positions of the transposon insertions sites determined.
  • The longer reads can span repeat sequences and read into adjacent unique sites.
  • The long-reads generated by LoRTIS were able to locate transposon insertion sites uniquely within repeating sequence elements of the ribosomal RNA operons, which are approximately 5 kb.
  • In addition, we incorporated different nucleotide sequence identifiers (indexes) into replicates to demonstrate that multiple samples can be sequenced on a single flow cell.
  • Thus, LoRTIS can identify hundreds of thousands of transposon insertion sites uniquely and overcomes the limitation of short-reads generated by other TIS methods.

Methods: DNA extraction and experimental setup

  • The Escherichia coli strain BW25113 large transposon mutant library described by Yasir et al. was used in this study6. Genomic DNA was extracted using the Quick-DNA Fungal/Bacterial 96 Kit (Zymo Research).

  • LoRTIS and TraDIS-Xpress samples were prepared in duplicate, from DNA extraction to sequencing library preparation. For LoRTIS each replicate was sequenced on a separate flow cell, so Native Barcodes (NBD) were added to each replicate to test the multiplexing capability of the LoRTIS method. NBD details are shown in Supplementary Table 5.

Methods: Preparation of DNA fragments for LoRTIS and TraDIS-Xpress

  • For LoRTIS sequencing library preparation,1 μg of genomic DNA was tagmented using an Illumina compatible MuSeek library preparation kit (Thermo Fisher Scientific) and the tagmented DNA was cleaned using 0.5× volume of AMPure XP magnetic beads (Beckman-Coulter). LongAmp Taq DNA polymerase (New England Biolabs, UK) was used with biotinylated oligonucleotide primers (Supplementary Table 5) that hybridise specifically within the transposon to amplify DNA fragments containing transposon insertion junctions (Supplementary Table 6 PCR1a). A second primer that anneals to MuSeek adapters (‘IonTMu-02’) was used to generate biotinylated double-stranded PCR products containing transposon-genome junctions (Supplementary Table 6 PCR1b). These PCR products were enriched using a Dynabeads™ kilobaseBINDER™ kit (Thermo Fisher Scientific). A nested PCR reaction was carried out using NEB LongAmp Taq polymerase to generate DNA fragments with Oxford Nanopore multiplexing adapters that lacked biotin (Supplementary Table 6 PCR2). The PCR products from this reaction were purified using 0.5× volume of AMPure XP magnetic beads (Beckman-Coulter) and were end-repaired, then nanopore sequencing adapters were ligated according to the Oxford Nanopore protocol.

  • For nucleotide sequencing by LoRTIS, the resulting DNA fragments were loaded onto an Oxford Nanopore MinION flowcell and sequenced for up to 48 h (Fig. 7). Using the LoRTIS protocol, nanopore long-read nucleotide sequence data were generated from two separate runs on an Oxford Nanopore flow cell to generate two replicate data sets that could be compared for reproducibility.

Methods: Bioinformatics

  • FAST5 format data from MinION was processed using Guppy Basecalling software (version 3.6.0) running in High Accuracy Calling (HAC) mode and using GPUs on the Quadram Institute cloud infrastructure. The resulting sequence data in FASTQ format were demultiplexed using QCat (version 1.1.0) and those reads containing transposon nucleotide sequences were identified, trimmed and retained, again using QCat. The nucleotide sequence reads were then located to the BW25113 reference genome (CP009273) using Bio-TraDIS (version 1.4.1)27 and Bio-LoRTIS software (version 0.0.5)28 in a similar way as for short-read data, except that steps to remove the transposon sequences were skipped and Minimap2 (version 2.17-r941)29 was substituted in the place of SMALT. Results from BioTradis, including plot files, were outputted in the same format as for short-read data.

  • The insertion patterns at candidate loci were inspected visually using Artemis (version 18.1.0) which was also used to capture images for figures30. Gene essentiality was determined using tradis_essentiality.R from BioTradis, and scatter charts were plotted to determine reproducibility using Microsoft Excel after normalising for numbers of reads in each replicate. The correlation between TraDIS-Xpress and LoRTIS data was calculated using Spearman’s rank correlation coefficient.

TraDIS bioinformatics toolkit:

    #https://www.biorxiv.org/content/10.1101/2022.05.26.493556v1.full.pdf
    Nano-Bio-TraDIS: https://github.com/quadram-institute-bioscience/LoRTIS
    http://sanger-pathogens.github.io/Bio-Tradis/

Downdetector: https://downdetector.com/status/openai/

散光和老花

1.揉眼睛是散光的最大原因。
2.查检查结果,拍下来。约一个眼科医生或好的验光师。下次选择抗蓝光涂层和防反光涂层。
3.带一个尺子测量一下屏幕的距离。

"Blau-Licht-Filter" und "Entspiegelung" sind zwei wichtige Beschichtungen für Brillengläser, die unterschiedliche Vorteile bieten:
Blau-Licht-Filter (Blau-Licht-Blocker)

Blau-Licht-Filter, auch als Anti-Blau-Licht-Beschichtung bekannt, hilft, schädliches blaues Licht zu reduzieren, das von digitalen Bildschirmen (Computer, Smartphones, Tablets) und künstlichen Lichtquellen ausgestrahlt wird. Diese Beschichtung kann helfen:

    Augenbelastung zu verringern
    Schlafqualität zu verbessern
    Das Risiko von Augenschäden durch langfristige Einwirkung von blauem Licht zu reduzieren

Entspiegelung (Anti-Reflex-Beschichtung)

Eine Entspiegelungsbeschichtung reduziert Reflexionen auf der Brillenoberfläche, was zu klarerem und komfortablerem Sehen führt. Diese Beschichtung bietet folgende Vorteile:

    Weniger störende Reflexionen, besonders beim Autofahren in der Nacht oder bei hellem Licht
    Bessere Ästhetik, da die Brille weniger blendet und die Augen besser sichtbar sind
    Erhöhte Haltbarkeit und einfache Reinigung, da viele Entspiegelungsbeschichtungen auch wasser- und schmutzabweisend sind

Beide Beschichtungen können kombiniert werden, um maximalen Komfort und Schutz für Ihre Augen zu bieten.

https://www.allaboutvision.com/de/augenerkrankungen/presbyopie/presbyopie-optionen/

“散光”是Stabsichtigkeit,或Astigmatismus
Presbyopie Alterssichtigkeit
Kombination of Stabsichtigkeit und Presbyopie Brille
Brille für Alterssichtigkeit und Stabsichtigkeit

#https://www.fielmann.it/de/wissen/fehlsichtigkeiten/astigmatismus/
Astigmatismus: „Alles unscharf – wenn man einen Knick in der Optik hat…“

老花眼镜不太适合长期在屏幕前工作。老花眼镜主要设计用于近距离阅读,它们的焦点通常在30-40厘米左右的距离,这对于看书或查看手机是理想的。然而,当您在电脑上工作时,屏幕通常会放置在更远的位置(约60-70厘米),这种情况下,老花眼镜的焦点并不适用。

为了更舒适地进行屏幕工作,建议使用专门设计的屏幕眼镜或渐进多焦镜。屏幕眼镜通常具有中距离的焦点,非常适合电脑屏幕的工作距离。渐进多焦镜则具有多个焦点,可以涵盖从近距离到中距离的视野,非常适合需要频繁在不同距离间切换视野的情况,如在电脑屏幕和桌面文件之间切换。

总结来说,如果您长期在屏幕前工作,老花眼镜可能会导致视力模糊、眼睛疲劳甚至头痛,因此更适合选择屏幕眼镜或渐进多焦镜。

#------

是的,您可以选择结合老花和散光矫正的镜片。这种镜片通常称为多焦点镜片或渐进多焦镜片,它们能够同时矫正老花和散光。

以下是一些常见的选择:

    渐进多焦镜片:
        特点:这种镜片在同一镜片中具有多个焦点区域,能够从近距离(阅读)到中距离(电脑工作)再到远距离(看远处)提供连续的清晰视野。
        优点:适合需要在不同距离间切换视野的情况,既能解决老花问题,又能矫正散光。

    双光或三光镜片:
        特点:这种镜片在同一镜片中具有两个或三个不同的焦点区域,通常是近距离和远距离(双光镜片)或近距离、中距离和远距离(三光镜片)。
        优点:能够满足不同距离的视力需求,适合需要固定距离工作的情况。

    专门的屏幕眼镜:
        特点:这种镜片专门为中距离(电脑工作距离)设计,同时可以根据您的具体情况加入老花和散光的矫正。
        优点:专为屏幕工作设计,可以有效减少眼睛疲劳和不适。

建议步骤:

    眼科检查:首先,进行详细的眼科检查,确定您的老花度数和散光度数。
    咨询专业眼镜店:与专业的眼镜店或验光师进行咨询,选择适合您的镜片类型。
    定制镜片:根据您的具体需求和验光结果定制适合的镜片。

注意事项:

    正确的验光结果:确保验光结果准确,以便定制的镜片能够完美适合您的视力需求。
    镜片材质和涂层:考虑选择抗蓝光涂层和防反光涂层,以提高在电脑工作时的舒适度。
    佩戴习惯的适应:如果您选择渐进多焦镜片,可能需要一些时间来适应佩戴的习惯。

通过上述方式,您可以获得既能矫正老花又能矫正散光的镜片,提升在屏幕前工作的舒适度和效率

#------

如果您不需要在不同距离间切换视野,并且只是长期在屏幕前工作,那么一副专门的屏幕眼镜可能是最适合您的选择。屏幕眼镜的设计是专门为中距离视野优化的,这正好适合电脑屏幕的工作距离(约60-70厘米)。这种眼镜能帮助您在使用电脑时保持视力清晰,同时减少眼睛疲劳和不适。

以下是选择屏幕眼镜的一些建议:

    定制镜片:确保镜片是根据您的视力情况定制的,而不是使用超市购买的标准老花眼镜。这样可以保证每只眼睛的度数都是最适合您的。

    抗蓝光涂层:考虑选择具有抗蓝光涂层的镜片,减少长时间看屏幕对眼睛的伤害。

    舒适的框架:选择轻便、佩戴舒适的镜框,避免长时间佩戴造成的不适。

总结而言,如果您长期在固定距离的屏幕前工作,专门的屏幕眼镜会比老花眼镜更合适,能有效提高工作舒适度和效率。

Lässt ab Mitte 40 die Sehkraft nach, haben viele Menschen zunehmend Probleme, wenn sie längere Zeit am Monitor arbeiten. Muss es dann gleich eine Gleitsichtbrille sein?

Schauen Sie auch den größten Teil des Tages auf einen Bildschirm? Wer dann mit Kopfschmerzen, trockenen Augen oder verschwommener Sicht und Verspannungen im Nacken- und Schulterbereich zu kämpfen hat, sollte sich auf die Suche nach den Ursachen machen. Stecken Sehdefizite hinter den Problemen, können Ihnen laut Kuratorium Gutes Sehen (KGS) je nach Alter folgende Lösungen helfen.
Ab 45 Jahren: Angepasste Lesebrille für den Übergang

Lässt Mitte 40 die Sehkraft nach, können sich Betroffene, die am Computer arbeiten, auch erst mal mit einer regulären Lesebrille behelfen. Vorausgesetzt, die Alterssichtigkeit liegt bei einem Wert von unter +1,25 Dioptrien.

Ein Modell vom Discounter ist aber nicht unbedingt die beste Wahl. Selten haben beide Augen die gleiche Sehstärke, heißt es vom KSG – die Gläser von Lesebrillen aus dem Discounter weisen aber üblicherweise jeweils die gleiche Dioptrien auf. Besser also: auch die Lesebrille individuell beim Optiker anpassen lassen.

Außerdem wichtig: Beim Blick in den Raum sollten Sie die Lesebrille absetzen, sonst verschwimmt die Sicht. Dieses Problem lässt sich mit einer Bildschirmbrille mit Gleitsichtfunktion umgehen. Damit können die Augen sowohl den Monitor als auch die Tastatur sowie den Raum dahinter klar erfassen.

    Lesen Sie auch: Leiden Sie unter dem Office-Eye-Syndrom?

Ab 50 Jahren: Bildschirmbrille mit oder ohne Gleitsicht

Spätestens wenn die Alterssichtigkeit um das 50. Lebensjahr +1,25 Dioptrien beträgt, ist es ratsam, über eine spezielle Bildschirmbrille nachzudenken, so das KGS. Ein Modell mit Gleitsichtfunktion biete einen großen Sehbereich für mittlere und kurze Distanzen. Der Fokus liegt auf dem Monitor, der obere Bereich des Glases ermöglicht gleichzeitig scharfes Sehen im Raum. Im Alltag eignet sich die Brille den Infos zufolge ebenfalls zum Lesen, Kochen oder Handwerken.

Eine Bildschirmbrille ohne Gleitsichtgläser ist in diesem Alter ebenfalls eine Option. Beschäftigte müssen sie aber immer absetzen, wenn sie etwa den Raum scharf wahrnahmen möchten.
Was genau bedeutet Dioptrien (dpt)?

Dioptrien ist die Maßeinheit, die die Brechkraft der Brillengläser angibt. Mithilfe dieser Zahl können die Gläser mit der richtigen Wölbung geschliffen werden, sodass die individuelle Fehlsichtigkeit korrigiert wird. Ein Pluszeichen (+) steht dabei für eine Weitsichtigkeit (Dioptrien im positiven Bereich), ein Minuszeichen (-) bedeutet eine Kurzsichtigkeit (Dioptrien im negativen Bereich).
Ab 55 Jahren: Bildschirmbrille mit Gleitsicht

Ab dem 55. Lebensjahr steige die Alterssichtigkeit auf über +1,5 Dioptrien an, so das Kuratorium. Dann sei eine Bildschirmbrille mit Gleitsichtfunktion die beste Lösung. Am besten messen Beschäftigte für eine passende Einstellung beim Optiker vorab die Entfernungen zum Monitor, zur Tastatur und zu einem Blatt auf dem Schreibtisch.

    Kräuter, Kompressen und Co.: Diese Hausmittel helfen bei trockenen Augen
    Gefahr von Fehlhaltungen: Wann Sie Gleitsichtbrillen nicht tragen sollten
    Selbsttest: Jung geblieben? Ermitteln Sie Ihr biologisches Alter
    Anzeige Jérémy Desplanches: Olympia 2024 wird sein letztes großes Turnier

Einstärkenbrillen und Gleitsichtbrillen für den Alltag sind dann keine wirkliche Alternative mehr – letztere haben laut KGS einen zu schmalen Korrektionsbereich für die Monitordistanz.

到了四十多岁,如果视力开始下降,许多人在长时间使用电脑时会遇到越来越多的问题。那么,是否需要立即配备一副渐进多焦镜呢?

您是否大部分时间都在看屏幕?如果您经常感到头痛、眼睛干涩或视力模糊以及颈部和肩部紧张,应该查找原因。如果这些问题是由于视力缺陷引起的,根据“好视力基金会”(KGS)的建议,根据您的年龄以下解决方案可能会有所帮助。

45岁以上:过渡期的定制阅读眼镜

如果在四十多岁时视力开始下降,电脑工作者可以先使用常规的阅读眼镜,前提是老花眼的度数低于+1.25屈光度(Dioptrien)。

但是,超市购买的阅读眼镜不一定是最好的选择。KGS表示,少有两只眼睛的视力完全相同,而超市的阅读眼镜通常具有相同的屈光度。因此,更好的选择是到眼镜店定制阅读眼镜。

此外,重要的是,当您看远处时,应摘下阅读眼镜,否则视力会变得模糊。这一问题可以通过配备具有渐进多焦功能的屏幕眼镜来解决。这样,眼睛可以清晰地看到屏幕、键盘以及背后的房间。

50岁以上:带或不带渐进多焦功能的屏幕眼镜

KGS表示,老花眼的度数在50岁左右达到+1.25屈光度时,建议考虑配备专门的屏幕眼镜。具有渐进多焦功能的眼镜可以提供较大的中短距离视野。焦点在屏幕上,镜片的上部区域同时可以清晰地看到房间。日常生活中,这种眼镜也适用于阅读、烹饪或手工制作。

在这个年龄,没有渐进多焦功能的屏幕眼镜也是一种选择。但如果需要清晰地看房间,使用者必须随时摘下眼镜。

屈光度(dpt)是什么意思?

屈光度是表示眼镜镜片折射力的单位。通过这个数字,镜片可以被研磨成适当的弯曲度,以矫正个人的视力缺陷。加号(+)表示远视(正屈光度),减号(-)表示近视(负屈光度)。

55岁以上:带渐进多焦功能的屏幕眼镜

KGS表示,55岁以上时,老花眼的度数会上升到+1.5屈光度以上,此时带有渐进多焦功能的屏幕眼镜是最佳解决方案。为了最佳配镜,工作者最好提前在眼镜店测量到屏幕、键盘和桌上文件的距离。

单焦点眼镜和日常使用的渐进多焦眼镜不再是有效的替代品——根据KGS的说法,后者对屏幕距离的矫正范围太窄。

#最近早上醒来,眼睛特别干,看东西特别模糊,大概过半小时后好转。

根据您的年龄和症状,早上醒来视力模糊并在一段时间后好转,这可能与几种因素有关。以下是一些常见原因及其解释:
可能的原因

    干眼症
        解释:晚上睡觉时,泪液分泌减少,导致眼睛干涩,早上醒来时可能会感觉视力模糊。
        建议:使用人工泪液或眼药水,特别是在早晨醒来时。

    老花眼(Presbyopie)
        解释:45岁是老花眼的常见发病年龄,早上眼睛肌肉可能还未完全放松,导致近距离视力模糊。
        建议:使用阅读眼镜或多焦点眼镜。

    夜间眼压波动
        解释:睡觉时眼压波动可能导致角膜形状暂时性变化,影响早晨的视力。
        建议:定期检查眼压,尤其是有青光眼风险的人群。

    泪液质量或数量不足
        解释:泪液的质量或数量不足会导致泪膜不稳定,早晨视力模糊。
        建议:改善泪液质量,使用泪膜稳定剂。

    其他健康问题
        解释:某些全身性疾病如糖尿病、高血压等也可能导致视力波动。
        建议:定期体检,控制血糖和血压。

建议的行动

    咨询眼科医生
        建议您尽快预约眼科医生进行详细检查,以排除任何潜在的眼部疾病,并获取专业的诊断和治疗建议。

    保持眼睛湿润
        使用人工泪液或润滑眼药水,特别是在早晨醒来时使用,可以缓解干眼症状。

    改善生活习惯
        确保良好的睡眠姿势和环境,避免眼睛过度疲劳。
        保持良好的饮食习惯,多摄入有助于眼睛健康的营养素,如维生素A、C、E和Omega-3脂肪酸。

    定期体检
        进行全面的身体检查,确保没有全身性疾病影响您的视力。

总结

早上视力模糊可能由多种原因引起,最好通过专业的眼科检查来确定具体原因。根据您的症状和年龄,老花眼和干眼症是比较常见的原因,但也不排除其他因素的影响。定期检查眼睛健康和整体健康是确保视力保持良好的重要措施。

Logics of Transposon sequencing

original reads:
>read_1/1
    AAGAGTTAGTAGGGCGGGCGATTAAGGGCAAACGTGATCAGGTCTTTTTAGCGACTAAGT
TCGGTATTGTGCGTGACCCCACTGATCCCAGGGTACGCGGTGTGAGCAGCCGTCCTGATT
ATATCCGCCAGTCG  GTTGATGAGAGCTTTGTTGTAGGTG

TODO_TOMORROW: follow the logics below!
vs. NEB_i5_Tn_02
5'-ACACTCTTTCCCTACACGACGCTCTTCCGATCTCACCTACAACAAAGCTCTCATCAAC-3'-bacterial_genomics
duplicate_PCR generate rc fragments
                 adapter_removed!  |GTGGA  ...     AGAGTAGTTG                       ...  GATTGAGAA
                                                                                 <-------read_R1--

>read_1/2
AATGAAGAGTTAGTAGGGCGGGCGATTAAGGGCAAACGTGATCAGGTCTTTTTAGCGACT
AAGTTCGGTATTGTGCGTGACCCCACTGATCCCAGGGTACGCGGTGTGAGCAGCCGTCCT
GATTATATCCGCCAGTCGGTTGATG  AGAGCTTTGTTGT

SAM-format

The two records in the SAM-format represent two reads that are part of a pair from the sequencing data. Below is the analysis of these two records:

Record 1 VH00358:89:AAFC5MTM5:1:1101:62048:1038_:N:0:TTTCTCTA+CTCGACG 83 NZ_AKKR01000009 78640 60 110M20S = 78640-110 ATCATAACCGTCGGCTGAATAAGCAAGATTATAAAACCCTCACTCTGGCGGCTTTAGGCGGAGCGCTTGAGTTTTATGACTTCATTATTTTCGTTTTCTTTGCTGCGGTTGATGAGAGCGGGGGTGTAGG * NM:i:0 MD:Z:110 MC:Z:110M20S AS:i:110 XS:i:0

Record 2 VH00358:89:AAFC5MTM5:1:1101:62048:1038_:N:0:TTTCTCTA+CTCGACG 163 NZ_AKKR01000009 78640 60 110M20S = 78640110 ATCATAACCGTCGGCTGAATAAGCAAGATTATAAAACCCTCACTCTGGCGGCTTTAGGCGGAGCGCTTGAGTTTTATGACTTCATTATTTTCGTTTTCTTTGCTGCGGTTGATGAGAGCTTTGTTGTAGG * NM:i:0 MD:Z:110 MC:Z:110M20S AS:i:110 XS:i:0

Analysis

  • Read Names: Both reads have the same name: VH00358:89:AAFC5MTM5:1:1101:62048:1038_:N:0:TTTCTCTA+CTCGACG. This indicates they are part of the same pair.

  • Flags: First Record (83): 83 indicates the read is the second read in the pair (0x40 is not set), and it is mapped in reverse complement (0x10 is set). It also indicates that the read is part of a properly paired alignment (0x2 is set). Second Record (163): 163 indicates the read is the first read in the pair (0x40 is set), and it is mapped in reverse complement (0x10 is set). It also indicates that the read is part of a properly paired alignment (0x2 is set).

  • Reference Name: Both reads are mapped to the same reference sequence: NZ_AKKR01000009.

  • Position: Both reads are mapped to the same position: 78640.

  • Mapping Quality: Both reads have a mapping quality score of 60, which indicates high confidence in the mapping.

  • CIGAR String: Both reads have the same CIGAR string: 110M20S. This means the first 110 bases are matched/mismatched (alignment) and the last 20 bases are soft-clipped (not aligned to the reference).

  • Mate Information: The = symbol indicates that the mate is mapped to the same reference. The 78640-110 and 78640110 values indicate the positions of the mates.

  • Sequence: The sequences are nearly identical for the aligned parts: Record 1: ATCATAACCGTCGGCTGAATAAGCAAGATTATAAAACCCTCACTCTGGCGGCTTTAGGCGGAGCGCTTGAGTTTTATGACTTCATTATTTTCGTTTTCTTTGCTGCGGTTGATGAGAGCGGGGGTGTAGG Record 2: ATCATAACCGTCGGCTGAATAAGCAAGATTATAAAACCCTCACTCTGGCGGCTTTAGGCGGAGCGCTTGAGTTTTATGACTTCATTATTTTCGTTTTCTTTGCTGCGGTTGATGAGAGCTTTGTTGTAGG

  • Optional Fields: NM:i:0: Number of mismatches is 0. MD:Z:110: Mismatch string indicating no mismatches in the 110 aligned bases. MC:Z:110M20S: CIGAR string for the mate. AS:i:110: Alignment score for the read. XS:i:0: Suboptimal alignment score.

Conclusion

These two SAM records represent a pair of reads mapped to the same position on the reference sequence NZ_AKKR01000009. Both reads are aligned with high confidence (mapping quality 60), and they have no mismatches in the aligned portion of their sequences. The 110M20S CIGAR string indicates that the first 110 bases are aligned, and the last 20 bases are soft-clipped. Both reads are reverse complemented.

Step-by-Step Analysis of Transposon Insertion Sites

Transposon_Seq_Strategie-0

Transposon_Seq_Strategie-1

  1. The script processes paired-end sequencing data from a SAM file to identify reads mapping to a specific reference and position, outputting the results in FASTQ format. It imports necessary Biopython libraries, defines a function to parse the CIGAR string, and includes a function to save sequences as FASTQ files. The main function, extract_reads, extracts read pairs from the SAM file that map to a given reference and position, checking for soft-clipped sequences indicative of transposon insertions. It categorizes the reads into original, bacterial, and transposon sequences, and saves them in separate FASTQ files. The script concludes by printing the counts of each type of read pair saved. (Need to be debugged!)

    python3 check_insertion_site_with_transposon16.py
    
    import re
    from Bio import SeqIO
    from Bio.Seq import Seq
    from Bio.SeqRecord import SeqRecord
    
    def parse_cigar(cigar):
        """
        Parse the CIGAR string and return a list of tuples (operation, length).
        """
        return [(int(length), op) for length, op in re.findall(r'(\d+)([MIDNSHP=X])', cigar)]
    
    def save_fastq(sequences, file_name, read_ids):
        """
        Save sequences to a FASTQ file.
        """
        records = [SeqRecord(Seq(seq), id=read_id, description="", letter_annotations={"phred_quality": [40]*len(seq)})
                for seq, read_id in zip(sequences, read_ids)]
        SeqIO.write(records, file_name, "fastq")
    
    def extract_reads(sam_file, reference, position):
        """
        Extract paired reads mapping to a specific position and categorize them.
        """
        original_reads_R1 = []
        original_reads_R2 = []
        bacterial_reads_R1 = []
        bacterial_reads_R2 = []
        transposon_reads_R1 = []
        transposon_reads_R2 = []
    
        read_ids_R1 = []
        read_ids_R2 = []
    
        read_pairs = {}  # To store read pairs
    
        with open(sam_file, 'r') as file:
            for line in file:
                if line.startswith('@'):
                    continue
    
                parts = line.split('\t')
                flag = int(parts[1])
                ref_name = parts[2]
                start_pos = int(parts[3])
                cigar = parts[5]
                seq = parts[9]
                read_name = parts[0]
    
                # Skip unmapped reads
                if ref_name == '*' or cigar == '*':
                    continue
    
                # Skip reads not mapped to the specified reference
                if ref_name != reference:
                    continue
    
                # Determine if the read is R1 or R2 based on the flag
                is_read1 = flag & 64 != 0
                is_read2 = flag & 128 != 0
    
                # Parse the CIGAR string
                parsed_cigar = parse_cigar(cigar)
    
                # Calculate the end position of the aligned part of the read
                aligned_length = sum(length for length, op in parsed_cigar if op in "M=X")
                end_pos = start_pos + aligned_length - 1
    
                # Store the read and its pair
                if read_name not in read_pairs:
                    read_pairs[read_name] = [None, None]  # [R1, R2]
                if is_read1:
                    read_pairs[read_name][0] = (seq, start_pos, end_pos, parsed_cigar, read_name)
                if is_read2:
                    read_pairs[read_name][1] = (seq, start_pos, end_pos, parsed_cigar, read_name)
    
        for read_name, pair in read_pairs.items():
            if pair[0] and pair[1]:
                read1, read2 = pair
    
                # Check if one of the reads contains a soft-clip
                has_soft_clip1 = any(op == 'S' for length, op in read1[3])
                has_soft_clip2 = any(op == 'S' for length, op in read2[3])
    
                if has_soft_clip1 or has_soft_clip2:
                    original_reads_R1.append(read1[0])
                    original_reads_R2.append(read2[0])
                    read_ids_R1.append(read1[4] + "/1")
                    read_ids_R2.append(read2[4] + "/2")
    
                    # Extract bacterial parts only
                    bacterial_seq1 = ''.join([read1[0][i:i+length] for i, (length, op) in enumerate(read1[3]) if op in "M=X"])
                    bacterial_seq2 = ''.join([read2[0][i:i+length] for i, (length, op) in enumerate(read2[3]) if op in "M=X"])
                    bacterial_reads_R1.append(bacterial_seq1)
                    bacterial_reads_R2.append(bacterial_seq2)
    
                    # Extract transposon parts only (soft-clipped parts)
                    transposon_seq1 = ''.join([read1[0][i:i+length] for i, (length, op) in enumerate(read1[3]) if op == 'S'])
                    transposon_seq2 = ''.join([read2[0][i:i+length] for i, (length, op) in enumerate(read2[3]) if op == 'S'])
                    transposon_reads_R1.append(transposon_seq1)
                    transposon_reads_R2.append(transposon_seq2)
    
        save_fastq(original_reads_R1, 'original_reads_R1.fastq', read_ids_R1)
        save_fastq(original_reads_R2, 'original_reads_R2.fastq', read_ids_R2)
        save_fastq(bacterial_reads_R1, 'bacterial_reads_R1.fastq', read_ids_R1)
        save_fastq(bacterial_reads_R2, 'bacterial_reads_R2.fastq', read_ids_R2)
        save_fastq(transposon_reads_R1, 'transposon_reads_R1.fastq', read_ids_R1)
        save_fastq(transposon_reads_R2, 'transposon_reads_R2.fastq', read_ids_R2)
    
        print(f"Total original read pairs saved: {len(original_reads_R1)}")
        print(f"Total bacterial read pairs saved: {len(bacterial_reads_R1)}")
        print(f"Total transposon read pairs saved: {len(transposon_reads_R1)}")
    
        return len(original_reads_R1), len(bacterial_reads_R1), len(transposon_reads_R1)
    
    # Example usage
    sam_file = 'initial_mutants_rep1_no_insertion_sites.sam'
    reference = 'NZ_AKKR01000009'
    position = 31537
    
    original_pairs_count, bacterial_pairs_count, transposon_pairs_count = extract_reads(sam_file, reference, position)
    
    print(f"Number of original read pairs: {original_pairs_count}")
    print(f"Number of bacterial read pairs: {bacterial_pairs_count}")
    print(f"Number of transposon read pairs: {transposon_pairs_count}")
  2. Reversing Complementing FASTQ Reads. This effectively swaps and reverse complements the read pairs.

    seqtk seq -r original_reads_R1.fastq > original_reads_rc_R2.fastq
    seqtk seq -r original_reads_R2.fastq > original_reads_rc_R1.fastq
  3. Counting Specific Sequences in FASTQ Files

    grep "CCTACAACAAAGCTCTCATCAAC" original_reads_rc_R1.fastq | wc -l  #19186
    grep "CCTACAACAAAGCTCTCATCAAC" original_reads_rc_R2.fastq | wc -l  #42069
  4. Running Transposon Position Profiling (TPP)

    #This command runs the tpp (Transposon Position Profiling) tool, which is used for mapping transposon insertions.
    #    -bwa /usr/bin/bwa: Specifies the path to the BWA alignment tool.
    #    -protocol Tn5: Specifies the protocol used, in this case, Tn5 transposon.
    #    -ref contig_1_1.fna: Specifies the reference genome file.
    #    -reads1 original_reads_rc_R1.fastq -reads2 original_reads_rc_R2.fastq: Specifies the input FASTQ files.
    #    -output contig_1_1_rc: Specifies the output prefix.
    #    -primer CCTACAACAAAGCTCTCATCAAC: Specifies the primer sequence used for mapping.
    #    -mismatches 1: Allows up to 1 mismatch in the alignment.
    #    -bwa-alg mem: Uses the mem algorithm in BWA.
    #    -replicon-ids contig_1_1: Specifies the replicon IDs.
    python3 ~/.local/bin/tpp -bwa /usr/bin/bwa -protocol Tn5 -ref contig_1_1.fna -reads1 original_reads_rc_R1.fastq -reads2 original_reads_rc_R2.fastq -output contig_1_1_rc -primer CCTACAACAAAGCTCTCATCAAC -mismatches 1 -bwa-alg mem -replicon-ids contig_1_1
    mv tpp.cfg contig_1_1_rc.tpp.cfg
    # TA_sites: 84291
    # TAs_hit: 54
    # density: 0.001
    # max_count (among templates): 1
    # max_site (coordinate): 84038
    
    #This command runs tpp again but only uses original_reads_rc_R1.fastq.
    #Analysis: Indicates an issue because BWA alignments expected read2 to be from the bacterial genome, which #was not the case.
    python3 ~/.local/bin/tpp -bwa /usr/bin/bwa -protocol Tn5 -ref contig_1_1.fna -reads1 original_reads_rc_R1.fastq -output contig_1_1_rc_only_reads1 -primer CCTACAACAAAGCTCTCATCAAC -mismatches 1 -bwa-alg mem -replicon-ids contig_1_1
    mv tpp.cfg contig_1_1_rc_only_reads1.tpp.cfg
    #Max_count (among templates): 2013
    #Max_site (coordinate): 31531
    
    #This command runs tpp again but only uses original_reads_rc_R2.fastq.
    python3 ~/.local/bin/tpp -bwa /usr/bin/bwa -protocol Tn5 -ref contig_1_1.fna -reads1 original_reads_rc_R2.fastq -output contig_1_1_rc_only_reads2 -primer CCTACAACAAAGCTCTCATCAAC -mismatches 1 -bwa-alg mem -replicon-ids contig_1_1
    mv tpp.cfg contig_1_1_rc_only_reads2.tpp.cfg
    #Max_count (among templates): 7210
    #Max_site (coordinate): 31531
  5. Conclusion: With the pipeline described above, we aim to identify the read coverage at the first junction site between the bacterial genome and the transposon (see Figure_1.png). By default, we focus only on the second junction site between the transposon and the bacterial genome. For an example of how to identify these junction sites, see the following command.

Figure_1

    #-primer AGATGTGTATAAGAGACAG -mismatches 1
    #-primer TAAGAGACAG -mismatches 1
    #mismatches1 1
    #primer_start_window 0,159
    #window_size -1
    #transposon Tn5
    #protocol Tn5
    python3 ~/.local/bin/tpp -bwa /usr/bin/bwa -protocol Tn5 -ref WA-314_m.fna -reads1 ./240405_VH00358_89_AAFC5MTM5/kr1/initial_mutants_rep1_S25_R1_001.fastq -reads2 ./240405_VH00358_89_AAFC5MTM5/kr1/initial_mutants_rep1_S25_R2_001.fastq -output initial_mutants_rep1 -primer TAAGAGACAG -mismatches 1 -bwa-alg mem -replicon-ids contig_2_10,contig_2_9,contig_2_8,contig_2_7,contig_2_6,contig_2_5,contig_2_3,contig_2_2,contig_5_10,contig_5_11,contig_5_12,contig_5_13,contig_5_15,contig_5_16,contig_5_17,contig_5_18,contig_5_9,contig_5_8,contig_5_7,contig_5_6,contig_5_5,contig_5_4,contig_5_3,contig_5_2,contig_5_1,contig_4_2,contig_4_1,contig_3_59,contig_3_58,contig_3_57,contig_3_56,contig_3_55,contig_3_54,contig_3_53,contig_3_52,contig_3_51,contig_3_50,contig_3_49,contig_3_48,contig_3_47,contig_3_46,contig_3_44,contig_3_43,contig_3_42,contig_3_41,contig_3_40,contig_3_39,contig_3_38,contig_3_37,contig_3_36,contig_3_35,contig_3_34,contig_3_33,contig_3_32,contig_3_31,contig_3_30,contig_3_29,contig_3_28,contig_3_27,contig_3_26,contig_3_25,contig_3_24,contig_3_23,contig_3_22,contig_3_21,contig_3_20,contig_3_17,contig_3_16,contig_3_15,contig_3_14,contig_3_13,contig_3_12,contig_3_11,contig_3_9,contig_3_8,contig_3_7,contig_3_6,contig_3_5,contig_3_3,contig_3_2,contig_3_1,contig_1_48,contig_1_47,contig_1_46,contig_1_45,contig_1_44,contig_1_43,contig_1_42,contig_1_41,contig_1_40,contig_1_39,contig_1_38,contig_1_37,contig_1_34,contig_1_33,contig_1_32,contig_1_31,contig_1_28,contig_1_27,contig_1_26,contig_1_25,contig_1_24,contig_1_22,contig_1_20,contig_1_19,contig_1_18,contig_1_17,contig_1_16,contig_1_15,contig_1_14,contig_1_13,contig_1_12,contig_1_10,contig_1_8,contig_1_7,contig_1_6,contig_1_5,contig_1_4,contig_1_3,contig_1_2,contig_1_1,contig_C8715,contig_C8943,contig_C9371,contig_C8939,contig_C9357,contig_C8991,contig_C9445,contig_C8689
    mv tpp.cfg initial_mutants_rep1.tpp.cfg