Author Archives: gene_x

transponson saturation test

TODO: edit and sort the following text after vacation

# Define the number of reads to sample
for i in {1..9}; do
  num_reads=$((i * 1000000))

  # Sample the reads
  seqtk sample -s100 ./240405_VH00358_89_AAFC5MTM5/kr1/initial_mutants_rep1_S25_R1_001.fastq $num_reads > ./samples/initial_mutants_rep1_S25_R1_${num_reads}.fastq
  seqtk sample -s100 ./240405_VH00358_89_AAFC5MTM5/kr1/initial_mutants_rep1_S25_R2_001.fastq $num_reads > ./samples/initial_mutants_rep1_S25_R2_${num_reads}.fastq
done

#set the parameters in ./lib/python3.10/site-packages/pytpp/tpp_tools.py
#set primer_start_window
    vars.primer_start_window = 0,159
#we didn't give primer-parameter, therefore the prefix is defined as TAAGAGACAG. We have only set a condition, namely prefix (see below).
    if "primer" not in kwargs and vars.transposon == "Tn5":
        vars.prefix = "TAAGAGACAG"
#Thank you for your feedback. In my last analysis, I focused on verifying whether read1 contains the end segment "TAAGAGACAG" of the 95-nt transposon sequence, allowing for one mismatch. This step is pivotal as accurately locating this segment in read1 is essential for pinpointing the exact insertion position of the transposon. The presence of "TAAGAGACAG" is critical for downstream analysis because without it, we cannot accurately determine transposon insertion sites. Utilizing this strategy, we found that approximately 6% of read1 in the initial_mutants_rep1 sample contained "TAAGAGACAG" (allowing for one mismatch). Almost all of these reads also start with the sequences "ACCTACAACAAAGCTCTCATCAACC", "CACCTACAACAAAGCTCTCATCAAC", or "CCTACAACAAAGCTCTCATCAACCG".
#For reference, I have listed the 95-nt sequences from the transposon as follows: ACCTACAACAAAGCTCTCATCAAC CGTGGCGGGGATCCTCTAGAGTCGACCTGCAGGCATGCAAGCTTCAGGGTTGAGATGTGTA TAAGAGACAG.

conda deactivate

# ------ for 1000000 paired reads ------
python3 ~/.local/bin/tpp -bwa /usr/bin/bwa -protocol Tn5 -ref WA-314_m.fna -reads1 ./samples/initial_mutants_rep1_S25_R1_1000000.fastq -reads2 ./samples/initial_mutants_rep1_S25_R2_1000000.fastq -output 1000000 -mismatches 1 -bwa-alg mem -replicon-ids contig_2_10,contig_2_9,contig_2_8,contig_2_7,contig_2_6,contig_2_5,contig_2_3,contig_2_2,contig_5_10,contig_5_11,contig_5_12,contig_5_13,contig_5_15,contig_5_16,contig_5_17,contig_5_18,contig_5_9,contig_5_8,contig_5_7,contig_5_6,contig_5_5,contig_5_4,contig_5_3,contig_5_2,contig_5_1,contig_4_2,contig_4_1,contig_3_59,contig_3_58,contig_3_57,contig_3_56,contig_3_55,contig_3_54,contig_3_53,contig_3_52,contig_3_51,contig_3_50,contig_3_49,contig_3_48,contig_3_47,contig_3_46,contig_3_44,contig_3_43,contig_3_42,contig_3_41,contig_3_40,contig_3_39,contig_3_38,contig_3_37,contig_3_36,contig_3_35,contig_3_34,contig_3_33,contig_3_32,contig_3_31,contig_3_30,contig_3_29,contig_3_28,contig_3_27,contig_3_26,contig_3_25,contig_3_24,contig_3_23,contig_3_22,contig_3_21,contig_3_20,contig_3_17,contig_3_16,contig_3_15,contig_3_14,contig_3_13,contig_3_12,contig_3_11,contig_3_9,contig_3_8,contig_3_7,contig_3_6,contig_3_5,contig_3_3,contig_3_2,contig_3_1,contig_1_48,contig_1_47,contig_1_46,contig_1_45,contig_1_44,contig_1_43,contig_1_42,contig_1_41,contig_1_40,contig_1_39,contig_1_38,contig_1_37,contig_1_34,contig_1_33,contig_1_32,contig_1_31,contig_1_28,contig_1_27,contig_1_26,contig_1_25,contig_1_24,contig_1_22,contig_1_20,contig_1_19,contig_1_18,contig_1_17,contig_1_16,contig_1_15,contig_1_14,contig_1_13,contig_1_12,contig_1_10,contig_1_8,contig_1_7,contig_1_6,contig_1_5,contig_1_4,contig_1_3,contig_1_2,contig_1_1,contig_C8715,contig_C8943,contig_C9371,contig_C8939,contig_C9357,contig_C8991,contig_C9445,contig_C8689

# Break-down of total reads (1000000):
#  939516 reads (94.0%) lack the expected Tn prefix
# Break-down of trimmed reads with valid Tn prefix (60484), some of them are not starting with the sequences "ACCTACAACAAAGCTCTCATCAACC", "CACCTACAACAAAGCTCTCATCAAC", or "CCTACAACAAAGCTCTCATCAACCG". --> only 47994 are in the Excel-table.

mv tpp.cfg 1000000.tpp.cfg
mkdir 1M
mv 1000000* 1M
cd 1M
cp 1000000.tn_stats 1000000.tn_stats_
#Delete all general statistics before the table data in initial_mutants_rep1.tn_stats_; delete the content after "# FR_corr (Fwd templates vs. Rev templates):"
sed -i 's/read_count (TA sites only, for Himar1)/read_counts/g' *.tn_stats_
sed -i 's/NZ_mean (among templates)/NZ_mean (mean template count over non-zero TA sites)/g' *.tn_stats_
python3 ../parse_tn_stats.py 1000000.tn_stats_ 1000000.tn_stats.xlsx

#calculate the sum of the first and second columns by "=SUM(B2:B130)" and "=SUM(C2:C130)", =SUM(F2:F130)  47994 and 44916, 29258

# ------ for 2000000 paired reads ------
python3 ~/.local/bin/tpp -bwa /usr/bin/bwa -protocol Tn5 -ref WA-314_m.fna -reads1 ./samples/initial_mutants_rep1_S25_R1_2000000.fastq -reads2 ./samples/initial_mutants_rep1_S25_R2_2000000.fastq -output 2000000 -mismatches 1 -bwa-alg mem -replicon-ids contig_2_10,contig_2_9,contig_2_8,contig_2_7,contig_2_6,contig_2_5,contig_2_3,contig_2_2,contig_5_10,contig_5_11,contig_5_12,contig_5_13,contig_5_15,contig_5_16,contig_5_17,contig_5_18,contig_5_9,contig_5_8,contig_5_7,contig_5_6,contig_5_5,contig_5_4,contig_5_3,contig_5_2,contig_5_1,contig_4_2,contig_4_1,contig_3_59,contig_3_58,contig_3_57,contig_3_56,contig_3_55,contig_3_54,contig_3_53,contig_3_52,contig_3_51,contig_3_50,contig_3_49,contig_3_48,contig_3_47,contig_3_46,contig_3_44,contig_3_43,contig_3_42,contig_3_41,contig_3_40,contig_3_39,contig_3_38,contig_3_37,contig_3_36,contig_3_35,contig_3_34,contig_3_33,contig_3_32,contig_3_31,contig_3_30,contig_3_29,contig_3_28,contig_3_27,contig_3_26,contig_3_25,contig_3_24,contig_3_23,contig_3_22,contig_3_21,contig_3_20,contig_3_17,contig_3_16,contig_3_15,contig_3_14,contig_3_13,contig_3_12,contig_3_11,contig_3_9,contig_3_8,contig_3_7,contig_3_6,contig_3_5,contig_3_3,contig_3_2,contig_3_1,contig_1_48,contig_1_47,contig_1_46,contig_1_45,contig_1_44,contig_1_43,contig_1_42,contig_1_41,contig_1_40,contig_1_39,contig_1_38,contig_1_37,contig_1_34,contig_1_33,contig_1_32,contig_1_31,contig_1_28,contig_1_27,contig_1_26,contig_1_25,contig_1_24,contig_1_22,contig_1_20,contig_1_19,contig_1_18,contig_1_17,contig_1_16,contig_1_15,contig_1_14,contig_1_13,contig_1_12,contig_1_10,contig_1_8,contig_1_7,contig_1_6,contig_1_5,contig_1_4,contig_1_3,contig_1_2,contig_1_1,contig_C8715,contig_C8943,contig_C9371,contig_C8939,contig_C9357,contig_C8991,contig_C9445,contig_C8689

mv tpp.cfg 2000000.tpp.cfg
mkdir 2M
mv 2000000* 2M
cd 2M
cp 2000000.tn_stats 2000000.tn_stats_
#Delete all general statistics before the table data in initial_mutants_rep1.tn_stats_; delete the content after "# FR_corr (Fwd templates vs. Rev templates):"
sed -i 's/read_count (TA sites only, for Himar1)/read_counts/g' *.tn_stats_
sed -i 's/NZ_mean (among templates)/NZ_mean (mean template count over non-zero TA sites)/g' *.tn_stats_
python3 ../parse_tn_stats.py 2000000.tn_stats_ 2000000.tn_stats.xlsx

#calculate the sum of the first and second columns by "=SUM(B2:B130)" and "=SUM(C2:C130)", =SUM(F2:F130)  96067 84578           43293

# ------ for 3000000 paired reads ------
python3 ~/.local/bin/tpp -bwa /usr/bin/bwa -protocol Tn5 -ref WA-314_m.fna -reads1 ./samples/initial_mutants_rep1_S25_R1_3000000.fastq -reads2 ./samples/initial_mutants_rep1_S25_R2_3000000.fastq -output 3000000 -mismatches 1 -bwa-alg mem -replicon-ids contig_2_10,contig_2_9,contig_2_8,contig_2_7,contig_2_6,contig_2_5,contig_2_3,contig_2_2,contig_5_10,contig_5_11,contig_5_12,contig_5_13,contig_5_15,contig_5_16,contig_5_17,contig_5_18,contig_5_9,contig_5_8,contig_5_7,contig_5_6,contig_5_5,contig_5_4,contig_5_3,contig_5_2,contig_5_1,contig_4_2,contig_4_1,contig_3_59,contig_3_58,contig_3_57,contig_3_56,contig_3_55,contig_3_54,contig_3_53,contig_3_52,contig_3_51,contig_3_50,contig_3_49,contig_3_48,contig_3_47,contig_3_46,contig_3_44,contig_3_43,contig_3_42,contig_3_41,contig_3_40,contig_3_39,contig_3_38,contig_3_37,contig_3_36,contig_3_35,contig_3_34,contig_3_33,contig_3_32,contig_3_31,contig_3_30,contig_3_29,contig_3_28,contig_3_27,contig_3_26,contig_3_25,contig_3_24,contig_3_23,contig_3_22,contig_3_21,contig_3_20,contig_3_17,contig_3_16,contig_3_15,contig_3_14,contig_3_13,contig_3_12,contig_3_11,contig_3_9,contig_3_8,contig_3_7,contig_3_6,contig_3_5,contig_3_3,contig_3_2,contig_3_1,contig_1_48,contig_1_47,contig_1_46,contig_1_45,contig_1_44,contig_1_43,contig_1_42,contig_1_41,contig_1_40,contig_1_39,contig_1_38,contig_1_37,contig_1_34,contig_1_33,contig_1_32,contig_1_31,contig_1_28,contig_1_27,contig_1_26,contig_1_25,contig_1_24,contig_1_22,contig_1_20,contig_1_19,contig_1_18,contig_1_17,contig_1_16,contig_1_15,contig_1_14,contig_1_13,contig_1_12,contig_1_10,contig_1_8,contig_1_7,contig_1_6,contig_1_5,contig_1_4,contig_1_3,contig_1_2,contig_1_1,contig_C8715,contig_C8943,contig_C9371,contig_C8939,contig_C9357,contig_C8991,contig_C9445,contig_C8689

mv tpp.cfg 3000000.tpp.cfg
mkdir 3M
mv 3000000* 3M
cd 3M
cp 3000000.tn_stats 3000000.tn_stats_
#Delete all general statistics before the table data in initial_mutants_rep1.tn_stats_; delete the content after "# FR_corr (Fwd templates vs. Rev templates):"
sed -i 's/read_count (TA sites only, for Himar1)/read_counts/g' *.tn_stats_
sed -i 's/NZ_mean (among templates)/NZ_mean (mean template count over non-zero TA sites)/g' *.tn_stats_
python3 ../parse_tn_stats.py 3000000.tn_stats_ 3000000.tn_stats.xlsx

#calculate the sum of the first and second columns by "=SUM(B2:B130)" and "=SUM(C2:C130)", =SUM(F2:F130)    143700  119668          51657

# ------ for 4000000 paired reads ------
python3 ~/.local/bin/tpp -bwa /usr/bin/bwa -protocol Tn5 -ref WA-314_m.fna -reads1 ./samples/initial_mutants_rep1_S25_R1_4000000.fastq -reads2 ./samples/initial_mutants_rep1_S25_R2_4000000.fastq -output 4000000 -mismatches 1 -bwa-alg mem -replicon-ids contig_2_10,contig_2_9,contig_2_8,contig_2_7,contig_2_6,contig_2_5,contig_2_3,contig_2_2,contig_5_10,contig_5_11,contig_5_12,contig_5_13,contig_5_15,contig_5_16,contig_5_17,contig_5_18,contig_5_9,contig_5_8,contig_5_7,contig_5_6,contig_5_5,contig_5_4,contig_5_3,contig_5_2,contig_5_1,contig_4_2,contig_4_1,contig_3_59,contig_3_58,contig_3_57,contig_3_56,contig_3_55,contig_3_54,contig_3_53,contig_3_52,contig_3_51,contig_3_50,contig_3_49,contig_3_48,contig_3_47,contig_3_46,contig_3_44,contig_3_43,contig_3_42,contig_3_41,contig_3_40,contig_3_39,contig_3_38,contig_3_37,contig_3_36,contig_3_35,contig_3_34,contig_3_33,contig_3_32,contig_3_31,contig_3_30,contig_3_29,contig_3_28,contig_3_27,contig_3_26,contig_3_25,contig_3_24,contig_3_23,contig_3_22,contig_3_21,contig_3_20,contig_3_17,contig_3_16,contig_3_15,contig_3_14,contig_3_13,contig_3_12,contig_3_11,contig_3_9,contig_3_8,contig_3_7,contig_3_6,contig_3_5,contig_3_3,contig_3_2,contig_3_1,contig_1_48,contig_1_47,contig_1_46,contig_1_45,contig_1_44,contig_1_43,contig_1_42,contig_1_41,contig_1_40,contig_1_39,contig_1_38,contig_1_37,contig_1_34,contig_1_33,contig_1_32,contig_1_31,contig_1_28,contig_1_27,contig_1_26,contig_1_25,contig_1_24,contig_1_22,contig_1_20,contig_1_19,contig_1_18,contig_1_17,contig_1_16,contig_1_15,contig_1_14,contig_1_13,contig_1_12,contig_1_10,contig_1_8,contig_1_7,contig_1_6,contig_1_5,contig_1_4,contig_1_3,contig_1_2,contig_1_1,contig_C8715,contig_C8943,contig_C9371,contig_C8939,contig_C9357,contig_C8991,contig_C9445,contig_C8689

mv tpp.cfg 4000000.tpp.cfg
mkdir 4M
mv 4000000* 4M
cd 4M
cp 4000000.tn_stats 4000000.tn_stats_
#Delete all general statistics before the table data in initial_mutants_rep1.tn_stats_; delete the content after "# FR_corr (Fwd templates vs. Rev templates):"
sed -i 's/read_count (TA sites only, for Himar1)/read_counts/g' *.tn_stats_
sed -i 's/NZ_mean (among templates)/NZ_mean (mean template count over non-zero TA sites)/g' *.tn_stats_
python3 ../parse_tn_stats.py 4000000.tn_stats_ 4000000.tn_stats.xlsx

#calculate the sum of the first and second columns by "=SUM(B2:B130)" and "=SUM(C2:C130)", =SUM(F2:F130)    191626  151444          57684

# ------ for 5000000 paired reads ------
python3 ~/.local/bin/tpp -bwa /usr/bin/bwa -protocol Tn5 -ref WA-314_m.fna -reads1 ./samples/initial_mutants_rep1_S25_R1_5000000.fastq -reads2 ./samples/initial_mutants_rep1_S25_R2_5000000.fastq -output 5000000 -mismatches 1 -bwa-alg mem -replicon-ids contig_2_10,contig_2_9,contig_2_8,contig_2_7,contig_2_6,contig_2_5,contig_2_3,contig_2_2,contig_5_10,contig_5_11,contig_5_12,contig_5_13,contig_5_15,contig_5_16,contig_5_17,contig_5_18,contig_5_9,contig_5_8,contig_5_7,contig_5_6,contig_5_5,contig_5_4,contig_5_3,contig_5_2,contig_5_1,contig_4_2,contig_4_1,contig_3_59,contig_3_58,contig_3_57,contig_3_56,contig_3_55,contig_3_54,contig_3_53,contig_3_52,contig_3_51,contig_3_50,contig_3_49,contig_3_48,contig_3_47,contig_3_46,contig_3_44,contig_3_43,contig_3_42,contig_3_41,contig_3_40,contig_3_39,contig_3_38,contig_3_37,contig_3_36,contig_3_35,contig_3_34,contig_3_33,contig_3_32,contig_3_31,contig_3_30,contig_3_29,contig_3_28,contig_3_27,contig_3_26,contig_3_25,contig_3_24,contig_3_23,contig_3_22,contig_3_21,contig_3_20,contig_3_17,contig_3_16,contig_3_15,contig_3_14,contig_3_13,contig_3_12,contig_3_11,contig_3_9,contig_3_8,contig_3_7,contig_3_6,contig_3_5,contig_3_3,contig_3_2,contig_3_1,contig_1_48,contig_1_47,contig_1_46,contig_1_45,contig_1_44,contig_1_43,contig_1_42,contig_1_41,contig_1_40,contig_1_39,contig_1_38,contig_1_37,contig_1_34,contig_1_33,contig_1_32,contig_1_31,contig_1_28,contig_1_27,contig_1_26,contig_1_25,contig_1_24,contig_1_22,contig_1_20,contig_1_19,contig_1_18,contig_1_17,contig_1_16,contig_1_15,contig_1_14,contig_1_13,contig_1_12,contig_1_10,contig_1_8,contig_1_7,contig_1_6,contig_1_5,contig_1_4,contig_1_3,contig_1_2,contig_1_1,contig_C8715,contig_C8943,contig_C9371,contig_C8939,contig_C9357,contig_C8991,contig_C9445,contig_C8689

mv tpp.cfg 5000000.tpp.cfg
mkdir 5M
mv 5000000* 5M
cd 5M
cp 5000000.tn_stats 5000000.tn_stats_
#Delete all general statistics before the table data in initial_mutants_rep1.tn_stats_; delete the content after "# FR_corr (Fwd templates vs. Rev templates):"
sed -i 's/read_count (TA sites only, for Himar1)/read_counts/g' *.tn_stats_
sed -i 's/NZ_mean (among templates)/NZ_mean (mean template count over non-zero TA sites)/g' *.tn_stats_
python3 ../parse_tn_stats.py 5000000.tn_stats_ 5000000.tn_stats.xlsx

#calculate the sum of the first and second columns by "=SUM(B2:B130)" and "=SUM(C2:C130)", =SUM(F2:F130)    239814  180369          62089

# ------ for 6000000 paired reads ------
python3 ~/.local/bin/tpp -bwa /usr/bin/bwa -protocol Tn5 -ref WA-314_m.fna -reads1 ./samples/initial_mutants_rep1_S25_R1_6000000.fastq -reads2 ./samples/initial_mutants_rep1_S25_R2_6000000.fastq -output 6000000 -mismatches 1 -bwa-alg mem -replicon-ids contig_2_10,contig_2_9,contig_2_8,contig_2_7,contig_2_6,contig_2_5,contig_2_3,contig_2_2,contig_5_10,contig_5_11,contig_5_12,contig_5_13,contig_5_15,contig_5_16,contig_5_17,contig_5_18,contig_5_9,contig_5_8,contig_5_7,contig_5_6,contig_5_5,contig_5_4,contig_5_3,contig_5_2,contig_5_1,contig_4_2,contig_4_1,contig_3_59,contig_3_58,contig_3_57,contig_3_56,contig_3_55,contig_3_54,contig_3_53,contig_3_52,contig_3_51,contig_3_50,contig_3_49,contig_3_48,contig_3_47,contig_3_46,contig_3_44,contig_3_43,contig_3_42,contig_3_41,contig_3_40,contig_3_39,contig_3_38,contig_3_37,contig_3_36,contig_3_35,contig_3_34,contig_3_33,contig_3_32,contig_3_31,contig_3_30,contig_3_29,contig_3_28,contig_3_27,contig_3_26,contig_3_25,contig_3_24,contig_3_23,contig_3_22,contig_3_21,contig_3_20,contig_3_17,contig_3_16,contig_3_15,contig_3_14,contig_3_13,contig_3_12,contig_3_11,contig_3_9,contig_3_8,contig_3_7,contig_3_6,contig_3_5,contig_3_3,contig_3_2,contig_3_1,contig_1_48,contig_1_47,contig_1_46,contig_1_45,contig_1_44,contig_1_43,contig_1_42,contig_1_41,contig_1_40,contig_1_39,contig_1_38,contig_1_37,contig_1_34,contig_1_33,contig_1_32,contig_1_31,contig_1_28,contig_1_27,contig_1_26,contig_1_25,contig_1_24,contig_1_22,contig_1_20,contig_1_19,contig_1_18,contig_1_17,contig_1_16,contig_1_15,contig_1_14,contig_1_13,contig_1_12,contig_1_10,contig_1_8,contig_1_7,contig_1_6,contig_1_5,contig_1_4,contig_1_3,contig_1_2,contig_1_1,contig_C8715,contig_C8943,contig_C9371,contig_C8939,contig_C9357,contig_C8991,contig_C9445,contig_C8689

mv tpp.cfg 6000000.tpp.cfg
mkdir 6M
mv 6000000* 6M
cd 6M
cp 6000000.tn_stats 6000000.tn_stats_
#Delete all general statistics before the table data in initial_mutants_rep1.tn_stats_; delete the content after "# FR_corr (Fwd templates vs. Rev templates):"
sed -i 's/read_count (TA sites only, for Himar1)/read_counts/g' *.tn_stats_
sed -i 's/NZ_mean (among templates)/NZ_mean (mean template count over non-zero TA sites)/g' *.tn_stats_
python3 ../parse_tn_stats.py 6000000.tn_stats_ 6000000.tn_stats.xlsx

#calculate the sum of the first and second columns by "=SUM(B2:B130)" and "=SUM(C2:C130)", =SUM(F2:F130)    287396  206325          65438

# ------ for 7000000 paired reads ------
python3 ~/.local/bin/tpp -bwa /usr/bin/bwa -protocol Tn5 -ref WA-314_m.fna -reads1 ./samples/initial_mutants_rep1_S25_R1_7000000.fastq -reads2 ./samples/initial_mutants_rep1_S25_R2_7000000.fastq -output 7000000 -mismatches 1 -bwa-alg mem -replicon-ids contig_2_10,contig_2_9,contig_2_8,contig_2_7,contig_2_6,contig_2_5,contig_2_3,contig_2_2,contig_5_10,contig_5_11,contig_5_12,contig_5_13,contig_5_15,contig_5_16,contig_5_17,contig_5_18,contig_5_9,contig_5_8,contig_5_7,contig_5_6,contig_5_5,contig_5_4,contig_5_3,contig_5_2,contig_5_1,contig_4_2,contig_4_1,contig_3_59,contig_3_58,contig_3_57,contig_3_56,contig_3_55,contig_3_54,contig_3_53,contig_3_52,contig_3_51,contig_3_50,contig_3_49,contig_3_48,contig_3_47,contig_3_46,contig_3_44,contig_3_43,contig_3_42,contig_3_41,contig_3_40,contig_3_39,contig_3_38,contig_3_37,contig_3_36,contig_3_35,contig_3_34,contig_3_33,contig_3_32,contig_3_31,contig_3_30,contig_3_29,contig_3_28,contig_3_27,contig_3_26,contig_3_25,contig_3_24,contig_3_23,contig_3_22,contig_3_21,contig_3_20,contig_3_17,contig_3_16,contig_3_15,contig_3_14,contig_3_13,contig_3_12,contig_3_11,contig_3_9,contig_3_8,contig_3_7,contig_3_6,contig_3_5,contig_3_3,contig_3_2,contig_3_1,contig_1_48,contig_1_47,contig_1_46,contig_1_45,contig_1_44,contig_1_43,contig_1_42,contig_1_41,contig_1_40,contig_1_39,contig_1_38,contig_1_37,contig_1_34,contig_1_33,contig_1_32,contig_1_31,contig_1_28,contig_1_27,contig_1_26,contig_1_25,contig_1_24,contig_1_22,contig_1_20,contig_1_19,contig_1_18,contig_1_17,contig_1_16,contig_1_15,contig_1_14,contig_1_13,contig_1_12,contig_1_10,contig_1_8,contig_1_7,contig_1_6,contig_1_5,contig_1_4,contig_1_3,contig_1_2,contig_1_1,contig_C8715,contig_C8943,contig_C9371,contig_C8939,contig_C9357,contig_C8991,contig_C9445,contig_C8689

mv tpp.cfg 7000000.tpp.cfg
mkdir 7M
mv 7000000* 7M
cd 7M
cp 7000000.tn_stats 7000000.tn_stats_
#Delete all general statistics before the table data in initial_mutants_rep1.tn_stats_; delete the content after "# FR_corr (Fwd templates vs. Rev templates):"
sed -i 's/read_count (TA sites only, for Himar1)/read_counts/g' *.tn_stats_
sed -i 's/NZ_mean (among templates)/NZ_mean (mean template count over non-zero TA sites)/g' *.tn_stats_
python3 ../parse_tn_stats.py 7000000.tn_stats_ 7000000.tn_stats.xlsx

#calculate the sum of the first and second columns by "=SUM(B2:B130)" and "=SUM(C2:C130)", =SUM(F2:F130)    335593  230313          68150

# ------ for 8000000 paired reads ------
python3 ~/.local/bin/tpp -bwa /usr/bin/bwa -protocol Tn5 -ref WA-314_m.fna -reads1 ./samples/initial_mutants_rep1_S25_R1_8000000.fastq -reads2 ./samples/initial_mutants_rep1_S25_R2_8000000.fastq -output 8000000 -mismatches 1 -bwa-alg mem -replicon-ids contig_2_10,contig_2_9,contig_2_8,contig_2_7,contig_2_6,contig_2_5,contig_2_3,contig_2_2,contig_5_10,contig_5_11,contig_5_12,contig_5_13,contig_5_15,contig_5_16,contig_5_17,contig_5_18,contig_5_9,contig_5_8,contig_5_7,contig_5_6,contig_5_5,contig_5_4,contig_5_3,contig_5_2,contig_5_1,contig_4_2,contig_4_1,contig_3_59,contig_3_58,contig_3_57,contig_3_56,contig_3_55,contig_3_54,contig_3_53,contig_3_52,contig_3_51,contig_3_50,contig_3_49,contig_3_48,contig_3_47,contig_3_46,contig_3_44,contig_3_43,contig_3_42,contig_3_41,contig_3_40,contig_3_39,contig_3_38,contig_3_37,contig_3_36,contig_3_35,contig_3_34,contig_3_33,contig_3_32,contig_3_31,contig_3_30,contig_3_29,contig_3_28,contig_3_27,contig_3_26,contig_3_25,contig_3_24,contig_3_23,contig_3_22,contig_3_21,contig_3_20,contig_3_17,contig_3_16,contig_3_15,contig_3_14,contig_3_13,contig_3_12,contig_3_11,contig_3_9,contig_3_8,contig_3_7,contig_3_6,contig_3_5,contig_3_3,contig_3_2,contig_3_1,contig_1_48,contig_1_47,contig_1_46,contig_1_45,contig_1_44,contig_1_43,contig_1_42,contig_1_41,contig_1_40,contig_1_39,contig_1_38,contig_1_37,contig_1_34,contig_1_33,contig_1_32,contig_1_31,contig_1_28,contig_1_27,contig_1_26,contig_1_25,contig_1_24,contig_1_22,contig_1_20,contig_1_19,contig_1_18,contig_1_17,contig_1_16,contig_1_15,contig_1_14,contig_1_13,contig_1_12,contig_1_10,contig_1_8,contig_1_7,contig_1_6,contig_1_5,contig_1_4,contig_1_3,contig_1_2,contig_1_1,contig_C8715,contig_C8943,contig_C9371,contig_C8939,contig_C9357,contig_C8991,contig_C9445,contig_C8689

mv tpp.cfg 8000000.tpp.cfg
mkdir 8M
mv 8000000* 8M
cd 8M
cp 8000000.tn_stats 8000000.tn_stats_
#Delete all general statistics before the table data in initial_mutants_rep1.tn_stats_; delete the content after "# FR_corr (Fwd templates vs. Rev templates):"
sed -i 's/read_count (TA sites only, for Himar1)/read_counts/g' *.tn_stats_
sed -i 's/NZ_mean (among templates)/NZ_mean (mean template count over non-zero TA sites)/g' *.tn_stats_
python3 ../parse_tn_stats.py 8000000.tn_stats_ 8000000.tn_stats.xlsx

#calculate the sum of the first and second columns by "=SUM(B2:B130)" and "=SUM(C2:C130)", =SUM(F2:F130)    383769  252191          70463

# ------ for 9000000 paired reads ------
python3 ~/.local/bin/tpp -bwa /usr/bin/bwa -protocol Tn5 -ref WA-314_m.fna -reads1 ./samples/initial_mutants_rep1_S25_R1_9000000.fastq -reads2 ./samples/initial_mutants_rep1_S25_R2_9000000.fastq -output 9000000 -mismatches 1 -bwa-alg mem -replicon-ids contig_2_10,contig_2_9,contig_2_8,contig_2_7,contig_2_6,contig_2_5,contig_2_3,contig_2_2,contig_5_10,contig_5_11,contig_5_12,contig_5_13,contig_5_15,contig_5_16,contig_5_17,contig_5_18,contig_5_9,contig_5_8,contig_5_7,contig_5_6,contig_5_5,contig_5_4,contig_5_3,contig_5_2,contig_5_1,contig_4_2,contig_4_1,contig_3_59,contig_3_58,contig_3_57,contig_3_56,contig_3_55,contig_3_54,contig_3_53,contig_3_52,contig_3_51,contig_3_50,contig_3_49,contig_3_48,contig_3_47,contig_3_46,contig_3_44,contig_3_43,contig_3_42,contig_3_41,contig_3_40,contig_3_39,contig_3_38,contig_3_37,contig_3_36,contig_3_35,contig_3_34,contig_3_33,contig_3_32,contig_3_31,contig_3_30,contig_3_29,contig_3_28,contig_3_27,contig_3_26,contig_3_25,contig_3_24,contig_3_23,contig_3_22,contig_3_21,contig_3_20,contig_3_17,contig_3_16,contig_3_15,contig_3_14,contig_3_13,contig_3_12,contig_3_11,contig_3_9,contig_3_8,contig_3_7,contig_3_6,contig_3_5,contig_3_3,contig_3_2,contig_3_1,contig_1_48,contig_1_47,contig_1_46,contig_1_45,contig_1_44,contig_1_43,contig_1_42,contig_1_41,contig_1_40,contig_1_39,contig_1_38,contig_1_37,contig_1_34,contig_1_33,contig_1_32,contig_1_31,contig_1_28,contig_1_27,contig_1_26,contig_1_25,contig_1_24,contig_1_22,contig_1_20,contig_1_19,contig_1_18,contig_1_17,contig_1_16,contig_1_15,contig_1_14,contig_1_13,contig_1_12,contig_1_10,contig_1_8,contig_1_7,contig_1_6,contig_1_5,contig_1_4,contig_1_3,contig_1_2,contig_1_1,contig_C8715,contig_C8943,contig_C9371,contig_C8939,contig_C9357,contig_C8991,contig_C9445,contig_C8689

mv tpp.cfg 9000000.tpp.cfg
mkdir 9M
mv 9000000* 9M
cd 9M
cp 9000000.tn_stats 9000000.tn_stats_
#Delete all general statistics before the table data in initial_mutants_rep1.tn_stats_; delete the content after "# FR_corr (Fwd templates vs. Rev templates):"
sed -i 's/read_count (TA sites only, for Himar1)/read_counts/g' *.tn_stats_
sed -i 's/NZ_mean (among templates)/NZ_mean (mean template count over non-zero TA sites)/g' *.tn_stats_
python3 ../parse_tn_stats.py 9000000.tn_stats_ 9000000.tn_stats.xlsx

#calculate the sum of the first and second columns by "=SUM(B2:B130)" and "=SUM(C2:C130)", =SUM(F2:F130)    432255  272550          72303

# ------ for 9181515  paired reads ------
# ------ for the complete reads (9181515 x 2 reads), we have the folloiwng number of insertion sites: 441057    276060          72595

#------ Draw saturation graph in python3 ------

# Updated y-axis data
template_count = [44916, 84578, 119668, 151444, 180369, 206325, 230313, 252191, 272550, 276060]
TAs_hit = [29258, 43293, 51657, 57684, 62089, 65438, 68150, 70463, 72303, 72595]

# Plotting the graph with updated data
plt.figure(figsize=(10, 6))

# Plot template_count
plt.plot(paired_reads, template_count, 'o-', label='Template Count', color='blue')

# Plot TAs_hit
plt.plot(paired_reads, TAs_hit, 'o-', label='TAs Hit', color='red')

# Labels and title
plt.xlabel('Input Paired Reads Number')
plt.ylabel('Count')
plt.title('Saturation Curve for Template Count and TAs Hit')
plt.legend()

# Show plot
plt.grid(True)
plt.show()

transponson off-target analysis

Leave a reply

TODO: edit and sort the following text after vacation

# Input 9,181,515x2 reads
# -- starting point: initial_mutants_rep1.trimmed1_failed_trim

# -- generate bam-file containing  not-mapped reads --
#I have a sam-file containing 8586773 alignments from read1. I have 9181515 read2. Can I produce the sorted bam files of 8586773 paired-ends on WA-314_m.fna.

#1115132/2=557566 + 8586773 = 9144339 != 9183535
I have the paired read1 and read2, update the bwa commands
samtools view initial_mutants_rep1.trimmed1_failed_trim | cut -f1 > read1_ids.txt
seqtk subseq initial_mutants_rep1.reads1 read1_ids.txt > filtered_initial_mutants_rep1.reads1.fastq
seqtk subseq initial_mutants_rep1.reads2 read1_ids.txt > filtered_initial_mutants_rep1.reads2.fastq

bwa index ../WA-314_locusid.fasta
bwa mem -M ../WA-314_locusid.fasta filtered_initial_mutants_rep1.reads1.fastq filtered_initial_mutants_rep1.reads2.fastq > initial_mutants_rep1_no_insertion_sites.sam
# Convert SAM to BAM
samtools view -Sb initial_mutants_rep1_no_insertion_sites.sam > initial_mutants_rep1_no_insertion_sites.bam
# Sort the BAM file
samtools sort initial_mutants_rep1_no_insertion_sites.bam -o initial_mutants_rep1_sorted_no_insertion_sites.bam
# Index the sorted BAM file
samtools index initial_mutants_rep1_sorted_no_insertion_sites.bam

#Visualization using tools like IGV to inspect the alignment and distribution of reads.

#Analysis and Visualization: perform various analyses, such as coverage checks, quality assessments.

samtools view -b -f 2 initial_mutants_rep1_sorted_no_insertion_sites.bam > initial_mutants_rep1_properly_paired.bam
bedtools genomecov -ibam initial_mutants_rep1_properly_paired.bam -bg > initial_mutants_rep1_coverage.bedgraph
samtools depth initial_mutants_rep1_properly_paired.bam > initial_mutants_rep1_depth.txt
python3 ~/Scripts/draw_coverages.py

#----> TODO_1 (TOMORROW): draw coverage plots of the reads! senden the results to them!
#      TODO_2 (TOMORROW): generate the saturation plots tomorrow!)

samtools faidx WA-314_m.fasta "gi|420257081|ref|NZ_AKKR01000009.1|contig_1_1":31429-31582

Are the following 2 sequences related? for example, subsequences, or reverse complementary?
>gi|420257081|ref|NZ_AKKR01000009.1|contig_1_1:31429-31582
CGTGATCAGGTCTTTTTAGCGACTAAGTTCGGTATTGTGCGTGACCCCACTGATCCCAGG
GTACGCGGTGTGAGCAGCCGTCCTGATTATATCCGCCAGTCGGTTGATGGCAGCCTTAAG
CGCTTGGGTGTTGATGTCATTGATCTTTATTACC
>transposon
CTGTCTCTTATACACATCTCAACCATCATCGATGAATTGTGTCTCAAAATCTCTGATGTTAC
ATTGCACAAGATAAAAATATATCATCATGAACAATAAAACTGTCTGCTTACATAAACAGTAA
TACAAGGGGTGTTATGAGCCATATTCAACGGGAAACGTCTTGCTCGAGGCCGCGATTAAATT
CCAACATGGATGCTGATTTATATGGGTATAAATGGGCTCGCGATAATGTCGGGCAATCAGGT
GCGACAATCTATCGATTGTATGGGAAGCCCGATGCGCCAGAGTTGTTTCTGAAACATGGCAA
AGGTAGCGTTGCCAATGATGTTACAGATGAGATGGTCAGACTAAACTGGCTGACGGAATTTA
TGCCTCTTCCGACCATCAAGCATTTTATCCGTACTCCTGATGATGCATGGTTACTCACCACT
GCGATCCCCGGAAAAACAGCATTCCAGGTATTAGAAGAATATCCTGATTCAGGTGAAAATAT
TGTTGATGCGCTGGCAGTGTTCCTGCGCCGGTTGCATTCGATTCCTGTTTGTAATTGTCCTT
TTAACAGCGATCGCGTATTTCGTCTCGCTCAGGCGCAATCACGAATGAATAACGGTTTGGTT
GATGCGAGTGATTTTGATGACGAGCGTAATGGCTGGCCTGTTGAACAAGTCTGGAAAGAAAT
GCATAAACTTTTGCCATTCTCACCGGATTCAGTCGTCACTCATGGTGATTTCTCACTTGATA
ACCTTATTTTTGACGAGGGGAAATTAATAGGTTGTATTGATGTTGGACGAGTCGGAATCGCA
GACCGATACCAGGATCTTGCCATCCTATGGAACTGCCTCGGTGAGTTTTCTCCTTCATTACA
GAAACGGCTTTTTCAAAAATATGGTATTGATAATCCTGATATGAATAAATTGCAGTTTCATT
TGATGCTCGATGAGTTTTTCTAATCAGAATTGGTTAATTGGTTGTAACACTGGCAGAGCATT
ACGCTGACTTGACGGGACGGCGGCTTTGTTGAATAAATCGAACTTTTGCTGAGTTGAAGGAT
CAGATCACGCATCTTCCCGACAACGCAGACCGTTCCGTGGCAAAGCAAAAGTTCAAAATCAC
CAACTGGTCCACCTACAACAAAGCTCTCATCAACCGTGGCGGGGATCCTCTAGAGTCGACCT
GCAGGCATGCAAGCTTCAGGGTTGAGATGTGTATAAGAGACAG

To determine if the paired reads provide evidence of an insertion site, we need to analyze how the reads align to both the reference genome and the transposon sequence. Typically, for transposon sequencing, we expect to see reads that span the junction between the transposon and the insertion site in the genome.

Given the provided read information and the transposon sequence, here are some key points to consider:

    Read Alignment to the Reference Genome:
        The read is aligned to positions on the reference genome NZ_AKKR01000009 from 31,438 to 31,543 on the reverse strand.
        The CIGAR string 106M52S indicates that 106 bases of the read are mapped to the reference, while the remaining 52 bases are soft clipped (not mapped to the reference).

    Mate Read Information:
        The mate read starts at position 31,408 on the forward strand of the same reference sequence.
        The insert size is -135, suggesting the reads are overlapping or very close to each other.

    Supplementary Alignments:
        There are supplementary alignments on the reference genome:
            From 31,212 to 31,269 on the forward strand (58bp)
            From 31,438 to 31,543 on the reverse strand (107bp)

    Soft Clipped Sequence:
        The 52 soft clipped bases from the read may contain the transposon sequence or part of it, indicating a junction between the transposon and the genome.

To check for evidence of an insertion site, we should:

    Extract the soft clipped sequence.
    Compare it with the transposon sequence to see if it matches or partially matches the transposon sequence.

Extracting the Soft Clipped Sequence

Given the read length is 158bp and 106 bases are mapped, the remaining 52 bases are soft clipped.

The soft clipped sequence should be extracted from the end of the read.
Check if the Soft Clipped Sequence Matches the Transposon Sequence

You can use sequence alignment tools or a simple script to compare the soft clipped sequence with the transposon sequence.

Let's extract and compare the soft clipped sequence manually.
Manual Extraction and Comparison

Assume the read sequence is as follows:

less

Read sequence (158bp): [106 bases aligned][52 bases soft clipped]

To extract the 52 soft clipped bases:

    If the read is provided or can be retrieved from the FASTQ file, you can directly extract the sequence.
    Compare the extracted sequence with the start of the transposon sequence.

Example Extraction and Comparison Script

Here is a Python script to extract the soft clipped sequence and compare it with the transposon sequence:

python

# Example read sequence (for illustration, actual sequence needed from FASTQ)
read_sequence = "ACGT... (158bp read sequence here)"

# Extract the 52 soft clipped bases from the end
soft_clipped_sequence = read_sequence[-52:]

# Provided transposon sequence
transposon_sequence = ("CTGTCTCTTATACACATCTCAACCATCATCGATGAATTGTGTCTCAAAATCTCTGATGTTAC"
                       "ATTGCACAAGATAAAAATATATCATCATGAACAATAAAACTGTCTGCTTACATAAACAGTAA"
                       "TACAAGGGGTGTTATGAGCCATATTCAACGGGAAACGTCTTGCTCGAGGCCGCGATTAAATT"
                       "CCAACATGGATGCTGATTTATATGGGTATAAATGGGCTCGCGATAATGTCGGGCAATCAGGT"
                       "GCGACAATCTATCGATTGTATGGGAAGCCCGATGCGCCAGAGTTGTTTCTGAAACATGGCAA"
                       "AGGTAGCGTTGCCAATGATGTTACAGATGAGATGGTCAGACTAAACTGGCTGACGGAATTTA"
                       "TGCCTCTTCCGACCATCAAGCATTTTATCCGTACTCCTGATGATGCATGGTTACTCACCACT"
                       "GCGATCCCCGGAAAAACAGCATTCCAGGTATTAGAAGAATATCCTGATTCAGGTGAAAATAT"
                       "TGTTGATGCGCTGGCAGTGTTCCTGCGCCGGTTGCATTCGATTCCTGTTTGTAATTGTCCTT"
                       "TTAACAGCGATCGCGTATTTCGTCTCGCTCAGGCGCAATCACGAATGAATAACGGTTTGGTT"
                       "GATGCGAGTGATTTTGATGACGAGCGTAATGGCTGGCCTGTTGAACAAGTCTGGAAAGAAAT"
                       "GCATAAACTTTTGCCATTCTCACCGGATTCAGTCGTCACTCATGGTGATTTCTCACTTGATA"
                       "ACCTTATTTTTGACGAGGGGAAATTAATAGGTTGTATTGATGTTGGACGAGTCGGAATCGCA"
                       "GACCGATACCAGGATCTTGCCATCCTATGGAACTGCCTCGGTGAGTTTTCTCCTTCATTACA"
                       "GAAACGGCTTTTTCAAAAATATGGTATTGATAATCCTGATATGAATAAATTGCAGTTTCATT"
                       "TGATGCTCGATGAGTTTTTCTAATCAGAATTGGTTAATTGGTTGTAACACTGGCAGAGCATT"
                       "ACGCTGACTTGACGGGACGGCGGCTTTGTTGAATAAATCGAACTTTTGCTGAGTTGAAGGAT"
                       "CAGATCACGCATCTTCCCGACAACGCAGACCGTTCCGTGGCAAAGCAAAAGTTCAAAATCAC"
                       "CAACTGGTCCACCTACAACAAAGCTCTCATCAACCGTGGCGGGGATCCTCTAGAGTCGACCT"
                       "GCAGGCATGCAAGCTTCAGGGTTGAGATGTGTATAAGAGACAG")

# Check if the soft clipped sequence matches the start of the transposon sequence
if soft_clipped_sequence in transposon_sequence:
    print("The soft clipped sequence matches the transposon sequence, indicating a potential insertion site.")
else:
    print("No match found. The soft clipped sequence does not match the transposon sequence.")

# Print the sequences for reference
print("Soft clipped sequence:", soft_clipped_sequence)
print("Transposon sequence start:", transposon_sequence[:52])

Conclusion

To determine if the paired reads provide evidence of an insertion site:

    Extract the soft clipped sequence from the read.
    Compare the soft clipped sequence with the transposon sequence to see if there is a match.
    If there is a match, it indicates a junction between the transposon and the genome, suggesting an insertion site.

The analysis will confirm if the reads are evidence of an insertion site by showing the overlap or junction between the transposon and the genomic sequence.

#--------------
#VH00358:89:AAFC5MTM5:1:2202:39325:27126_:N:0:TTTCTCTA+CTCGACG

I have a lot of reads over 20000 reads in certain positions with soft clips (see the attached image). could you write a script check how many reads from initial_mutants_rep1_no_insertion_sites.sam can confirm the NZ_AKKR01000009:31537 is a insertion site?

python check_insertion_site.py

>transposon
CTGTCTCTTATACACATCTCAACCATCATCGATGAATTGTGTCTCAAAATCTCTGATGTTACATTGCACAAGATAAAAATATATCATCATGAACAATAAAACTGTCTGCTTACATAAACAGTAATACAAGGGGTGTTATGAGCCATATTCAACGGGAAACGTCTTGCTCGAGGCCGCGATTAAATTCCAACATGGATGCTGATTTATATGGGTATAAATGGGCTCGCGATAATGTCGGGCAATCAGGTGCGACAATCTATCGATTGTATGGGAAGCCCGATGCGCCAGAGTTGTTTCTGAAACATGGCAAAGGTAGCGTTGCCAATGATGTTACAGATGAGATGGTCAGACTAAACTGGCTGACGGAATTTATGCCTCTTCCGACCATCAAGCATTTTATCCGTACTCCTGATGATGCATGGTTACTCACCACTGCGATCCCCGGAAAAACAGCATTCCAGGTATTAGAAGAATATCCTGATTCAGGTGAAAATATTGTTGATGCGCTGGCAGTGTTCCTGCGCCGGTTGCATTCGATTCCTGTTTGTAATTGTCCTTTTAACAGCGATCGCGTATTTCGTCTCGCTCAGGCGCAATCACGAATGAATAACGGTTTGGTTGATGCGAGTGATTTTGATGACGAGCGTAATGGCTGGCCTGTTGAACAAGTCTGGAAAGAAATGCATAAACTTTTGCCATTCTCACCGGATTCAGTCGTCACTCATGGTGATTTCTCACTTGATAACCTTATTTTTGACGAGGGGAAATTAATAGGTTGTATTGATGTTGGACGAGTCGGAATCGCAGACCGATACCAGGATCTTGCCATCCTATGGAACTGCCTCGGTGAGTTTTCTCCTTCATTACAGAAACGGCTTTTTCAAAAATATGGTATTGATAATCCTGATATGAATAAATTGCAGTTTCATTTGATGCTCGATGAGTTTTTCTAATCAGAATTGGTTAATTGGTTGTAACACTGGCAGAGCATTACGCTGACTTGACGGGACGGCGGCTTTGTTGAATAAATCGAACTTTTGCTGAGTTGAAGGATCAGATCACGCATCTTCCCGACAACGCAGACCGTTCCGTGGCAAAGCAAAAGTTCAAAATCACCAACTGGTC
|CACCTACAACAAAGCTCTCATCAACCGTGGCGGGGATCCTCTAGAGTCGACCTGCAGGCATGCAAGCTTCAGGGTTGAGATGTGTATAAGAGACAG

add the step to compare the clipped sequences to the transposon sequences. to locate which subsequences of transposon are inserted?

#In "saturation_graph.png," the x-axis represents the input paired reads number, and the y-axis shows the counts for both the template count and TAs hit. It suggests that deeper sequencing will yield more template counts and TAs hits, particularly for the template counts.

samtools flagstat initial_mutants_rep1_sorted_no_insertion_sites.bam
17836098 + 0 in total (QC-passed reads + QC-failed reads)
17173546 + 0 primary
662552 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
17000489 + 0 mapped (95.32% : N/A)
16337937 + 0 primary mapped (95.13% : N/A)
17173546 + 0 paired in sequencing
8586773 + 0 read1
8586773 + 0 read2
15942110 + 0 properly paired (92.83% : N/A)
16234318 + 0 with itself and mate mapped
103619 + 0 singletons (0.60% : N/A)
167124 + 0 with mate mapped to a different chr
163885 + 0 with mate mapped to a different chr (mapQ>=5)

ME Erkennungs Sequenz 5 ́ AGATGTGTATAAGAGACAG 3 ́
ME Sequenz Komplementär: 3 ́ TCTACACATATTCTCTGTC 5 ́
ME Sequenz Rev Komplementär: 5 ́ CTGTCTCTTATACACATCT 3 ́
Kan-2 FP-1 Forward Primer 5 ́ ACCTACAACAAAGCTCTCATCAACC 3 ́
Kan-2 RP-1 Reverse Primer 5 ́ GCAATGTAACATCAGAGATTTTGAG 3 ́
Ampli-Start Senior et al 5’ ACCTACAACAAAGCTCTCATCAACC
 3 ́
Sequenzierungs-Strategie (NEBNext Ultra II DNA Modifikation)
NEB_i5_Tn_01
5'-ACACTCTTTCCCTACACGACGCTCTTCCGATCTACCTACAACAAAGCTCTCATCAACC-3'
NEB_i5_Tn_02
5'-ACACTCTTTCCCTACACGACGCTCTTCCGATCTCACCTACAACAAAGCTCTCATCAAC-3'
NEB_i5_Tn_03
5'-ACACTCTTTCCCTACACGACGCTCTTCCGATCTCCTACAACAAAGCTCTCATCAACCG-3'
>> mix 3 forward primer equimolar to create diversity in start of
read 1 (frameshift at the transposon amplification sequence)
NEB_i7_rc
5'-GACTGGAGTTCAGACGTGTGCTCTTCCGATC-3'

>transposon
CTGTCTCTTATACACATCTCAACCATCATCGATGAATTGTGTCTCAAAATCTCTGATGTTACATTGCACAAGATAAAAATATATCATCATGAACAATAAAACTGTCTGCTTACATAAACAGTAATACAAGGGGTGTTATGAGCCATATTCAACGGGAAACGTCTTGCTCGAGGCCGCGATTAAATTCCAACATGGATGCTGATTTATATGGGTATAAATGGGCTCGCGATAATGTCGGGCAATCAGGTGCGACAATCTATCGATTGTATGGGAAGCCCGATGCGCCAGAGTTGTTTCTGAAACATGGCAAAGGTAGCGTTGCCAATGATGTTACAGATGAGATGGTCAGACTAAACTGGCTGACGGAATTTATGCCTCTTCCGACCATCAAGCATTTTATCCGTACTCCTGATGATGCATGGTTACTCACCACTGCGATCCCCGGAAAAACAGCATTCCAGGTATTAGAAGAATATCCTGATTCAGGTGAAAATATTGTTGATGCGCTGGCAGTGTTCCTGCGCCGGTTGCATTCGATTCCTGTTTGTAATTGTCCTTTTAACAGCGATCGCGTATTTCGTCTCGCTCAGGCGCAATCACGAATGAATAACGGTTTGGTTGATGCGAGTGATTTTGATGACGAGCGTAATGGCTGGCCTGTTGAACAAGTCTGGAAAGAAATGCATAAACTTTTGCCATTCTCACCGGATTCAGTCGTCACTCATGGTGATTTCTCACTTGATAACCTTATTTTTGACGAGGGGAAATTAATAGGTTGTATTGATGTTGGACGAGTCGGAATCGCAGACCGATACCAGGATCTTGCCATCCTATGGAACTGCCTCGGTGAGTTTTCTCCTTCATTACAGAAACGGCTTTTTCAAAAATATGGTATTGATAATCCTGATATGAATAAATTGCAGTTTCATTTGATGCTCGATGAGTTTTTCTAATCAGAATTGGTTAATTGGTTGTAACACTGGCAGAGCATTACGCTGACTTGACGGGACGGCGGCTTTGTTGAATAAATCGAACTTTTGCTGAGTTGAAGGATCAGATCACGCATCTTCCCGACAACGCAGACCGTTCCGTGGCAAAGCAAAAGTTCAAAATCACCAACTGGTC
|CACCTACAACAAAGCTCTCATCAACCGTGGCGGGGATCCTCTAGAGTCGACCTGCAGGCATGCAAGCTTCAGGGTTGAGATGTGTATAAGAGACAG

AGAGCTTTGTTGTAGGTG -->
GTGGATGTTGTTTCGAGA
TCTCGAAACAACATCCAC
CACCTACAACAAAGCTCT *

AGAGCTTTGTTGTAGG

Draw plots for miRNAs generated by COMPSRA

Leave a reply

Generate the following files according to STEPS 1-4 from http://xgenes.com/article/article-content/239/small-rna-sequencing-processing-in-the-example-of-smallrna-7/, http://xgenes.com/article/article-content/232/small-rna-sequencing-processing-in-the-example-of-smallrna-7/, and http://xgenes.com/article/article-content/156/small-rna-processing/. For COMPSRA_MERGE_0_miRNA.txt, we also need STEP 5 to add the read numbers of MCPyV-M1.
```
COMPSRA_MERGE_0_miRNA.txt *
COMPSRA_MERGE_0_piRNA.txt *
COMPSRA_MERGE_0_snRNA.txt *
COMPSRA_MERGE_0_tRNA.txt
COMPSRA_MERGE_0_snoRNA.txt
COMPSRA_MERGE_0_circRNA.txt
```
Input files for miRNA are two files: COMPSRA_MERGE_0_miRNA.txt and ids under “/media/jhuang/Elements/Data_Ute/Data_Ute_smallRNA_7/miRNA_re”
- COMPSRA_MERGE_0_miRNA.txt
  
  #./our_out_without_cutadapt/COMPSRA_MERGE_0_miRNA.txt #./our_out_on_hg38/COMPSRA_MERGE_0_miRNA.txt #./our_out_on_hg38+JN707599_miRNA_merging_replicates/COMPSRA_MERGE_0_miRNA.txt diff ./our_out_on_hg38/COMPSRA_MERGE_0_miRNA.txt ./our_out_on_hg38+JN707599_miRNA_merging_replicates/COMPSRA_MERGE_0_miRNA.txt #–> different! #BUG: why using the same code resulting in different miRNA results! A little different!! #–> using ./our_out_on_hg38+JN707599_miRNA_merging_replicates/COMPSRA_MERGE_0_miRNA.txt
  
  cp our_out_on_hg38+JN707599_miRNA_merging_replicates/COMPSRA_MERGE_0_miRNA.txt miRNA_re cd miRNA_re
- prepare the file ids
  
  #cp ./our_out_on_hg38+JN707599_miRNA_merging_replicates/ids miRNA_re for i in EV_vs_parental sT_Dox_vs_sT_DMSO sT_Dox_vs_scr_Dox scr_Dox_vs_scr_DMSO sT_DMSO_vs_scr_DMSO; do echo “cut -d’,’ -f1-1 ${i}-up.txt > ${i}-up.id”; echo “cut -d’,’ -f1-1 ${i}-down.txt > ${i}-down.id”; done cat *.id | sort -u > ids #add Gene_Id in the first line, delete the “”

Draw plots with R using DESeq2

#BiocManager::install("AnnotationDbi")
#BiocManager::install("clusterProfiler")
#BiocManager::install(c("ReactomePA","org.Hs.eg.db"))
#BiocManager::install("limma")
library("AnnotationDbi")
library("clusterProfiler")
library("ReactomePA")
library("org.Hs.eg.db")
library(DESeq2)
library(gplots)
library(limma)
# Check the current library paths
.libPaths()
#setwd("/home/jhuang/DATA/Data_Ute/Data_Ute_smallRNA_7/our_out_on_hg38+JN707599_2024_corrected/")

d.raw<- read.delim2("COMPSRA_MERGE_0_miRNA.txt",sep="\t", header=TRUE, row.names=1)
d.raw$X <- NULL
#colnames(d.raw)<- c("WaGa_EV", "MKL1_EV", "WaGa_wt", "MKL1_wt")
d.raw[] <- lapply(d.raw, as.numeric)

#MCPyV-M1   0   0   29  0   23  0   30  8   0   0   202 196
# New row to add
new_row <- data.frame(
  X0505_WaGa_sT_DMSO = 0,
  X1905_WaGa_sT_DMSO = 0,
  X0505_WaGa_sT_Dox = 29,
  X1905_WaGa_sT_Dox = 0,
  X0505_WaGa_scr_DMSO = 23,
  X1905_WaGa_scr_DMSO = 0,
  X0505_WaGa_scr_Dox = 30,
  X1905_WaGa_scr_Dox = 8,
  X0505_WaGa_wt = 0,
  X1905_WaGa_wt = 0,
  control_MKL1 = 202,
  control_WaGa = 196,
  row.names = "MCPyV-M1"
)

# Add the new row to the data frame
d.raw <- rbind(d.raw, new_row)

#cell_line = as.factor(c("WaGa","WaGa", "WaGa","WaGa", "WaGa","WaGa", "WaGa","WaGa", "WaGa","WaGa", "MKL1","WaGa"))
#condition = as.factor(c("sT","sT", "sT","sT", "scr","scr", "scr","scr", "wt","wt", "control","control"))
#treatment = as.factor(c("DMSO","DMSO", "Dox","Dox", "DMSO","DMSO", "Dox","Dox", "wt","wt", "control","control"))

EV_or_parental = as.factor(c("EV","EV", "EV","EV", "EV","EV", "EV","EV", "EV","EV", "parental","parental"))
donor = as.factor(c("0505","1905", "0505","1905", "0505","1905", "0505","1905", "0505","1905", "0505","1905"))
replicates = as.factor(c("sT_DMSO","sT_DMSO", "sT_Dox","sT_Dox", "scr_DMSO","scr_DMSO", "scr_Dox","scr_Dox", "wt","wt", "control","control"))
ids = as.factor(c("0505_WaGa_sT_DMSO","1905_WaGa_sT_DMSO","0505_WaGa_sT_Dox","1905_WaGa_sT_Dox","0505_WaGa_scr_DMSO","1905_WaGa_scr_DMSO","0505_WaGa_scr_Dox","1905_WaGa_scr_Dox","0505_WaGa_wt","1905_WaGa_wt","control_MKL1","control_WaGa"))

cData = data.frame(row.names=colnames(d.raw), replicates=replicates, ids=ids, donor=donor, EV_or_parental=EV_or_parental)
dds<-DESeqDataSetFromMatrix(countData=d.raw, colData=cData, design=~replicates+donor)

rld <- rlogTransformation(dds)

# -- before pca --
png("pca.png", 1200, 800)
plotPCA(rld, intgroup=c("replicates"))
#plotPCA(rld, intgroup = c("replicates", "batch"))
#plotPCA(rld, intgroup = c("replicates", "ids"))
#plotPCA(rld, "batch")
dev.off()

png("pca2.png", 1200, 800)
plotPCA(rld, intgroup=c("donor"))
dev.off()

##### STEP3: prepare all_genes #####
rld <- rlogTransformation(dds)
mat <- assay(rld)
mm <- model.matrix(~replicates, colData(rld))
mat <- limma::removeBatchEffect(mat, batch=rld$donor, design=mm)
assay(rld) <- mat
RNASeq.NoCellLine <- assay(rld)
# reorder the columns
colnames(RNASeq.NoCellLine) = c("0505 WaGa sT DMSO","1905 WaGa sT DMSO","0505 WaGa sT Dox","1905 WaGa sT Dox","0505 WaGa scr DMSO","1905 WaGa scr DMSO","0505 WaGa scr Dox","1905 WaGa scr Dox","0505 WaGa wt","1905 WaGa wt","control MKL1","control WaGa")
col.order <-c("control MKL1",  "control WaGa","0505 WaGa wt","1905 WaGa wt","0505 WaGa sT DMSO","1905 WaGa sT DMSO","0505 WaGa sT Dox","1905 WaGa sT Dox","0505 WaGa scr DMSO","1905 WaGa scr DMSO","0505 WaGa scr Dox","1905 WaGa scr Dox")
RNASeq.NoCellLine <- RNASeq.NoCellLine[,col.order]

#Option4: manully defining
#for i in EV_vs_parental sT_Dox_vs_sT_DMSO sT_Dox_vs_scr_Dox scr_Dox_vs_scr_DMSO sT_DMSO_vs_scr_DMSO; do echo "cut -d',' -f1-1 ${i}-up.txt > ${i}-up.id"; echo "cut -d',' -f1-1 ${i}-down.txt > ${i}-down.id"; done
#cat *.id | sort -u > ids
##add Gene_Id in the first line, delete the ""
GOI <- read.csv("ids")$Gene_Id
datamat = RNASeq.NoCellLine[GOI, ]

##### STEP4: clustering the genes and draw heatmap #####
datamat <- datamat[,-1]  #delete the sample "control MKL1"
colnames(datamat)[1] <- "WaGa control"  #rename the isolate names according to the style of RNA-seq as follows?
colnames(datamat)[2] <- "WaGa wildtype 0505"
colnames(datamat)[3] <- "WaGa wildtype 1905"
colnames(datamat)[4] <- "WaGa sT DMSO 0505"
colnames(datamat)[5] <- "WaGa sT DMSO 1905"
colnames(datamat)[6] <- "WaGa sT Dox 0505"
colnames(datamat)[7] <- "WaGa sT Dox 1905"
colnames(datamat)[8] <- "WaGa scr DMSO 0505"
colnames(datamat)[9] <- "WaGa scr DMSO 1905"
colnames(datamat)[10] <- "WaGa scr Dox 0505"
colnames(datamat)[11] <- "WaGa scr Dox 1905"
write.csv(datamat, file ="gene_expression_keeping_replicates.txt")

#"ward.D"’, ‘"ward.D2"’,‘"single"’, ‘"complete"’, ‘"average"’ (= UPGMA), ‘"mcquitty"’(= WPGMA), ‘"median"’ (= WPGMC) or ‘"centroid"’ (= UPGMC)
hr <- hclust(as.dist(1-cor(t(datamat), method="pearson")), method="complete")
hc <- hclust(as.dist(1-cor(datamat, method="spearman")), method="complete")
mycl = cutree(hr, h=max(hr$height)/2.0)
mycol = c("YELLOW", "BLUE", "ORANGE", "CYAN", "GREEN", "MAGENTA", "GREY", "LIGHTCYAN", "RED",     "PINK", "DARKORANGE", "MAROON",  "LIGHTGREEN", "DARKBLUE",  "DARKRED",   "LIGHTBLUE", "DARKCYAN",  "DARKGREEN", "DARKMAGENTA");

mycol = mycol[as.vector(mycl)]
png("miRNA_heatmap_keeping_replicates.png", width=800, height=1000)
#svg("DEGs_heatmap_keeping_replicates.svg", width=6, height=8)
heatmap.2(as.matrix(datamat),
  Rowv=as.dendrogram(hr),
  Colv=NA,
  dendrogram='row',
  labRow="",
  scale='row',
  trace='none',
  col=bluered(75),
  RowSideColors=mycol,
  srtCol=20,
  lhei=c(1,8),
  #cexRow=1.2,   # Increase row label font size
  cexCol=1.7,    # Increase column label font size
  margin=c(7,1)
 )
dev.off()

#### cluster members #####
write.csv(names(subset(mycl, mycl == '1')),file='YELLOW.txt')
write.csv(names(subset(mycl, mycl == '2')),file='BLUE.txt')
write.csv(names(subset(mycl, mycl == '3')),file='ORANGE.txt')
write.csv(names(subset(mycl, mycl == '4')),file='CYAN.txt')
write.csv(names(subset(mycl, mycl == '5')),file='GREEN.txt')
write.csv(names(subset(mycl, mycl == '6')),file='MAGENTA.txt')
write.csv(names(subset(mycl, mycl == '7')),file='GREY.txt')
write.csv(names(subset(mycl, mycl == '8')),file='LIGHTCYAN.txt')
write.csv(names(subset(mycl, mycl == '9')),file='RED.txt')
#~/Tools/csv2xls-0.4/csv_to_xls.py gene_expression_keeping_replicates.txt YELLOW.txt ORANGE.txt BLUE.txt GREEN.txt CYAN.txt MAGENTA.txt LIGHTCYAN.txt GREY.txt RED.txt -d',' -o miRNA_heatmap_keeping_replicates.xls

# merging replicates
datamat <- cbind(datamat, "WaGa wildtype" = rowMeans(datamat[, 2:3]))
datamat <- cbind(datamat, "WaGa sT DMSO" = rowMeans(datamat[, 4:5]))
datamat <- cbind(datamat, "WaGa sT Dox" = rowMeans(datamat[, 6:7]))
datamat <- cbind(datamat, "WaGa scr DMSO" = rowMeans(datamat[, 8:9]))
datamat <- cbind(datamat, "WaGa scr Dox" = rowMeans(datamat[, 10:11]))
datamat <- datamat[,c(-2:-11)]
write.csv(datamat, file ="gene_expression_merging_replicates.txt")

# Ensure 'mycl' is calculated properly.
mycl <- cutree(hr, h=max(hr$height)/2.2)
# TODO: Rearrange the colors of the plot *_merging_replicates.png to match those in *_keeping_replicates.png!
# mycol = c("YELLOW", "BLUE", "ORANGE", "CYAN", "GREEN", "MAGENTA", "GREY", "LIGHTCYAN", "RED",     "PINK", "DARKORANGE", "MAROON",  "LIGHTGREEN", "DARKBLUE",  "DARKRED",   "LIGHTBLUE", "DARKCYAN",  "DARKGREEN", "DARKMAGENTA");

# Now map your clusters to colors, making sure that there's one color for each row:
actualColors <- mycol[mycl]  # Assign colors based on cluster assignment

# Then use these 'actualColors' in your heatmap:
png("miRNA_heatmap_merging_replicates.png", width=800, height=1000)
heatmap.2(as.matrix(datamat),
          Rowv=as.dendrogram(hr),
          Colv=NA,
          dendrogram='row',
          labRow="",
          scale='row',
          trace='none',
          col=bluered(75),
          RowSideColors=actualColors, # Update this part
          srtCol=20,
          lhei=c(1,8),
          cexCol=1.7,    # Increase column label font size
          margin=c(7,1)
        )
dev.off()

#### cluster members #####
write.csv(names(subset(mycl, mycl == '1')),file='YELLOW.txt')
write.csv(names(subset(mycl, mycl == '2')),file='BLUE.txt')
write.csv(names(subset(mycl, mycl == '3')),file='ORANGE.txt')
write.csv(names(subset(mycl, mycl == '4')),file='CYAN.txt')
write.csv(names(subset(mycl, mycl == '5')),file='GREEN.txt')
write.csv(names(subset(mycl, mycl == '6')),file='MAGENTA.txt')
write.csv(names(subset(mycl, mycl == '7')),file='GREY.txt')
write.csv(names(subset(mycl, mycl == '8')),file='LIGHTCYAN.txt')
#write.csv(names(subset(mycl, mycl == '9')),file='RED.txt')
#~/Tools/csv2xls-0.4/csv_to_xls.py gene_expression_merging_replicates.txt YELLOW.txt BLUE.txt ORANGE.txt CYAN.txt MAGENTA.txt GREEN.txt LIGHTCYAN.txt GREY.txt -d',' -o miRNA_heatmap_merging_replicates.xls
#100+11+4+7+2+5+14+63=206

Display viral transcripts found in mRNA-seq (or small RNA) MKL-1, WaGa EVs compared to parental cells

http://xgenes.com/article/article-content/87/display-viral-transcripts-found-in-mrna-seq-mkl-1-waga-evs-compared-to-cells/

TODO: why miRNA_heatmap_keeping_replicates.png and miRNA_heatmap_merging_replicates.png are different while gene_expression_keeping_replicates.txt and gene_expression_merging_replicates.txt are the same?

diff our_out_on_hg38+JN707599_miRNA_keeping_replicates/gene_expression_keeping_replicates.txt miRNA_re/gene_expression_keeping_replicates.txt
diff our_out_on_hg38+JN707599_miRNA_merging_replicates/gene_expression_merging_replicates.txt miRNA_re/gene_expression_merging_replicates.txt

display our_out_on_hg38+JN707599_miRNA_keeping_replicates/miRNA_heatmap_keeping_replicates.png
display our_out_on_hg38+JN707599_miRNA_merging_replicates/miRNA_heatmap_merging_replicates.png
display miRNA_re/miRNA_heatmap_keeping_replicates.png
display miRNA_re/miRNA_heatmap_merging_replicates.png

#TODO: If necessary, in the next revision, say the plots cannot be reproduced due to the software-update.

Draw plots for piRNA generated by COMPSRA

Leave a reply

Generate the following files according to STEPS 1-4 from http://xgenes.com/article/article-content/239/small-rna-sequencing-processing-in-the-example-of-smallrna-7/, http://xgenes.com/article/article-content/232/small-rna-sequencing-processing-in-the-example-of-smallrna-7/, and http://xgenes.com/article/article-content/156/small-rna-processing/. For COMPSRA_MERGE_0_miRNA.txt, we also need STEP 5 to add the read numbers of MCPyV-M1.
```
COMPSRA_MERGE_0_miRNA.txt
COMPSRA_MERGE_0_piRNA.txt *
COMPSRA_MERGE_0_snRNA.txt
COMPSRA_MERGE_0_tRNA.txt
COMPSRA_MERGE_0_snoRNA.txt
COMPSRA_MERGE_0_circRNA.txt
```
Input files for piRNA are two files: COMPSRA_MERGE_0_piRNA.txt and ids
- COMPSRA_MERGE_0_piRNA.txt
  
  #The former are more precise due to the reads from virus will be mapped on the virus-genome diff ./our_out_on_hg38+JN707599/COMPSRA_MERGE_0_piRNA.txt ./our_out_on_hg38/COMPSRA_MERGE_0_piRNA.txt diff ./our_out_on_hg38+JN707599/COMPSRA_MERGE_0_snRNA.txt ./our_out_on_hg38/COMPSRA_MERGE_0_snRNA.txt cp ../our_out_on_hg38+JN707599/COMPSRA_MERGE_0_piRNA.txt .
- prepare the file ids
  
  #see Option4: manully defining

Draw plots with R using DESeq2

#BiocManager::install("AnnotationDbi")
#BiocManager::install("clusterProfiler")
#BiocManager::install(c("ReactomePA","org.Hs.eg.db"))
#BiocManager::install("limma")
library("AnnotationDbi")
library("clusterProfiler")
library("ReactomePA")
library("org.Hs.eg.db")
library(DESeq2)
library(gplots)
library(limma)
# Check the current library paths
.libPaths()
#setwd("/home/jhuang/DATA/Data_Ute/Data_Ute_smallRNA_7/our_out_on_hg38+JN707599_2024_corrected/")

d.raw<- read.delim2("COMPSRA_MERGE_0_piRNA.txt",sep="\t", header=TRUE, row.names=1)
d.raw$X <- NULL
d.raw[] <- lapply(d.raw, as.numeric)

EV_or_parental = as.factor(c("EV","EV", "EV","EV", "EV","EV", "EV","EV", "EV","EV", "parental","parental"))
donor = as.factor(c("0505","1905", "0505","1905", "0505","1905", "0505","1905", "0505","1905", "0505","1905"))
replicates = as.factor(c("sT_DMSO","sT_DMSO", "sT_Dox","sT_Dox", "scr_DMSO","scr_DMSO", "scr_Dox","scr_Dox", "wt","wt", "control","control"))
ids = as.factor(c("0505_WaGa_sT_DMSO","1905_WaGa_sT_DMSO","0505_WaGa_sT_Dox","1905_WaGa_sT_Dox","0505_WaGa_scr_DMSO","1905_WaGa_scr_DMSO","0505_WaGa_scr_Dox","1905_WaGa_scr_Dox","0505_WaGa_wt","1905_WaGa_wt","control_MKL1","control_WaGa"))

cData = data.frame(row.names=colnames(d.raw), replicates=replicates, ids=ids, donor=donor, EV_or_parental=EV_or_parental)
dds<-DESeqDataSetFromMatrix(countData=d.raw, colData=cData, design=~replicates+donor)

rld <- rlogTransformation(dds)

# -- before pca --
png("pca.png", 1200, 800)
plotPCA(rld, intgroup=c("replicates"))
#plotPCA(rld, intgroup = c("replicates", "batch"))
#plotPCA(rld, intgroup = c("replicates", "ids"))
#plotPCA(rld, "batch")
dev.off()

png("pca2.png", 1200, 800)
plotPCA(rld, intgroup=c("donor"))
dev.off()

#### STEP2: DEGs ####
#convert bam to bigwig using deepTools by feeding inverse of DESeq’s size Factor
sizeFactors(dds)
#NULL
dds <- estimateSizeFactors(dds)
sizeFactors(dds)
normalized_counts <- counts(dds, normalized=TRUE)
write.table(normalized_counts, file="normalized_counts.txt", sep="\t", quote=F, col.names=NA)

#---- * to untreated ----
dds<-DESeqDataSetFromMatrix(countData=d.raw, colData=cData, design=~EV_or_parental+donor)
dds$EV_or_parental <- relevel(dds$EV_or_parental, "parental")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("EV_vs_parental")
for (i in clist) {
  contrast = paste("EV_or_parental", i, sep="_")
  res = results(dds, name=contrast)
  res <- res[!is.na(res$log2FoldChange),]
  #https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#why-are-some-p-values-set-to-na
  res$padj <- ifelse(is.na(res$padj), 1, res$padj)
  res_df <- as.data.frame(res)
  write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(i, "all.txt", sep="-"))
  up <- subset(res_df, padj<=0.1 & log2FoldChange>=2)
  down <- subset(res_df, padj<=0.1 & log2FoldChange<=-2)
  write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(i, "up.txt", sep="-"))
  write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(i, "down.txt", sep="-"))
}
#~/Tools/csv2xls-0.4/csv_to_xls.py EV_vs_parental-all.txt EV_vs_parental-up.txt EV_vs_parental-down.txt -d$',' -o EV_vs_parental.xls;

dds<-DESeqDataSetFromMatrix(countData=d.raw, colData=cData, design=~replicates+donor)
dds$replicates <- relevel(dds$replicates, "sT_DMSO")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("sT_Dox_vs_sT_DMSO")

dds$replicates <- relevel(dds$replicates, "scr_Dox")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("sT_Dox_vs_scr_Dox")

dds$replicates <- relevel(dds$replicates, "scr_DMSO")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("scr_Dox_vs_scr_DMSO", "sT_DMSO_vs_scr_DMSO")

for (i in clist) {
  contrast = paste("replicates", i, sep="_")
  res = results(dds, name=contrast)
  res <- res[!is.na(res$log2FoldChange),]
  #https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#why-are-some-p-values-set-to-na
  res$padj <- ifelse(is.na(res$padj), 1, res$padj)
  res_df <- as.data.frame(res)
  write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(i, "all.txt", sep="-"))
  up <- subset(res_df, padj<=0.1 & log2FoldChange>=2)
  down <- subset(res_df, padj<=0.1 & log2FoldChange<=-2)
  write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(i, "up.txt", sep="-"))
  write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(i, "down.txt", sep="-"))
}

~/Tools/csv2xls-0.4/csv_to_xls.py \
sT_Dox_vs_sT_DMSO-all.txt \
sT_Dox_vs_sT_DMSO-up.txt \
sT_Dox_vs_sT_DMSO-down.txt \
-d$',' -o sT_Dox_vs_sT_DMSO.xls;

~/Tools/csv2xls-0.4/csv_to_xls.py \
sT_Dox_vs_scr_Dox-all.txt \
sT_Dox_vs_scr_Dox-up.txt \
sT_Dox_vs_scr_Dox-down.txt \
-d$',' -o sT_Dox_vs_scr_Dox.xls;

~/Tools/csv2xls-0.4/csv_to_xls.py \
scr_Dox_vs_scr_DMSO-all.txt \
scr_Dox_vs_scr_DMSO-up.txt \
scr_Dox_vs_scr_DMSO-down.txt \
-d$',' -o scr_Dox_vs_scr_DMSO.xls;

~/Tools/csv2xls-0.4/csv_to_xls.py \
sT_DMSO_vs_scr_DMSO-all.txt \
sT_DMSO_vs_scr_DMSO-up.txt \
sT_DMSO_vs_scr_DMSO-down.txt \
-d$',' -o sT_DMSO_vs_scr_DMSO.xls;

##### STEP3: prepare all_genes #####
rld <- rlogTransformation(dds)
mat <- assay(rld)
mm <- model.matrix(~replicates, colData(rld))
mat <- limma::removeBatchEffect(mat, batch=rld$donor, design=mm)
assay(rld) <- mat
RNASeq.NoCellLine <- assay(rld)
# reorder the columns
colnames(RNASeq.NoCellLine) = c("0505 WaGa sT DMSO","1905 WaGa sT DMSO","0505 WaGa sT Dox","1905 WaGa sT Dox","0505 WaGa scr DMSO","1905 WaGa scr DMSO","0505 WaGa scr Dox","1905 WaGa scr Dox","0505 WaGa wt","1905 WaGa wt","control MKL1","control WaGa")
col.order <-c("control MKL1",  "control WaGa","0505 WaGa wt","1905 WaGa wt","0505 WaGa sT DMSO","1905 WaGa sT DMSO","0505 WaGa sT Dox","1905 WaGa sT Dox","0505 WaGa scr DMSO","1905 WaGa scr DMSO","0505 WaGa scr Dox","1905 WaGa scr Dox")
RNASeq.NoCellLine <- RNASeq.NoCellLine[,col.order]

#Option4: manully defining
#for i in EV_vs_parental sT_Dox_vs_sT_DMSO sT_Dox_vs_scr_Dox scr_Dox_vs_scr_DMSO sT_DMSO_vs_scr_DMSO; do echo "cut -d',' -f1-1 ${i}-up.txt > ${i}-up.id"; echo "cut -d',' -f1-1 ${i}-down.txt > ${i}-down.id"; done
#cat *.id | sort -u > ids
##add Gene_Id in the first line, delete the ""
GOI <- read.csv("ids")$Gene_Id
datamat = RNASeq.NoCellLine[GOI, ]

##### STEP4: clustering the genes and draw heatmap #####
datamat <- datamat[,-1]  #delete the sample "control MKL1"
colnames(datamat)[1] <- "WaGa control"  #rename the isolate names according to the style of RNA-seq as follows?
colnames(datamat)[2] <- "WaGa wildtype 0505"
colnames(datamat)[3] <- "WaGa wildtype 1905"
colnames(datamat)[4] <- "WaGa sT DMSO 0505"
colnames(datamat)[5] <- "WaGa sT DMSO 1905"
colnames(datamat)[6] <- "WaGa sT Dox 0505"
colnames(datamat)[7] <- "WaGa sT Dox 1905"
colnames(datamat)[8] <- "WaGa scr DMSO 0505"
colnames(datamat)[9] <- "WaGa scr DMSO 1905"
colnames(datamat)[10] <- "WaGa scr Dox 0505"
colnames(datamat)[11] <- "WaGa scr Dox 1905"
write.csv(datamat, file ="gene_expression_keeping_replicates.txt")

#"ward.D"’, ‘"ward.D2"’,‘"single"’, ‘"complete"’, ‘"average"’ (= UPGMA), ‘"mcquitty"’(= WPGMA), ‘"median"’ (= WPGMC) or ‘"centroid"’ (= UPGMC)
hr <- hclust(as.dist(1-cor(t(datamat), method="pearson")), method="complete")
hc <- hclust(as.dist(1-cor(datamat, method="spearman")), method="complete")
mycl = cutree(hr, h=max(hr$height)/1.5)
mycol = c("YELLOW", "BLUE", "ORANGE", "CYAN", "GREEN", "MAGENTA", "GREY", "LIGHTCYAN", "RED",     "PINK", "DARKORANGE", "MAROON",  "LIGHTGREEN", "DARKBLUE",  "DARKRED",   "LIGHTBLUE", "DARKCYAN",  "DARKGREEN", "DARKMAGENTA");

mycol = mycol[as.vector(mycl)]
png("piRNA_heatmap_keeping_replicates.png", width=800, height=1000)
#svg("DEGs_heatmap_keeping_replicates.svg", width=6, height=8)
heatmap.2(as.matrix(datamat),
  Rowv=as.dendrogram(hr),
  Colv=NA,
  dendrogram='row',
  labRow="",
  scale='row',
  trace='none',
  col=bluered(75),
  RowSideColors=mycol,
  srtCol=20,
  lhei=c(1,8),
  #cexRow=1.2,   # Increase row label font size
  cexCol=1.7,    # Increase column label font size
  margin=c(7,1)
 )
dev.off()

#### cluster members #####
write.csv(names(subset(mycl, mycl == '1')),file='YELLOW.txt')
write.csv(names(subset(mycl, mycl == '2')),file='BLUE.txt')
write.csv(names(subset(mycl, mycl == '3')),file='ORANGE.txt')
#~/Tools/csv2xls-0.4/csv_to_xls.py gene_expression_keeping_replicates.txt YELLOW.txt ORANGE.txt BLUE.txt -d',' -o piRNA_heatmap_keeping_replicates.xls

mv piRNA_heatmap_keeping_replicates.png piRNA_heatmap.png
mv piRNA_heatmap_keeping_replicates.xls piRNA_heatmap.xls
mv pca.png piRNA_pca.png
mv EV_vs_parental.xls piRNA_EV_vs_parental.xls
# --> SENDING piRNA_pca.png, piRNA_EV_vs_parental.xls, piRNA_heatmap.png and piRNA_heatmap.xls

# ---- NOT WORKING WELL ----
# merging replicates
datamat <- cbind(datamat, "WaGa wildtype" = rowMeans(datamat[, 2:3]))
datamat <- cbind(datamat, "WaGa sT DMSO" = rowMeans(datamat[, 4:5]))
datamat <- cbind(datamat, "WaGa sT Dox" = rowMeans(datamat[, 6:7]))
datamat <- cbind(datamat, "WaGa scr DMSO" = rowMeans(datamat[, 8:9]))
datamat <- cbind(datamat, "WaGa scr Dox" = rowMeans(datamat[, 10:11]))
datamat <- datamat[,c(-2:-11)]
write.csv(datamat, file ="gene_expression_merging_replicates.txt")

# Ensure 'mycl' is calculated properly.
mycl <- cutree(hr, h=max(hr$height)/2.2)
# TODO: Rearrange the colors of the plot *_merging_replicates.png to match those in *_keeping_replicates.png!
# mycol = c("YELLOW", "BLUE", "ORANGE", "CYAN", "GREEN", "MAGENTA", "GREY", "LIGHTCYAN", "RED",     "PINK", "DARKORANGE", "MAROON",  "LIGHTGREEN", "DARKBLUE",  "DARKRED",   "LIGHTBLUE", "DARKCYAN",  "DARKGREEN", "DARKMAGENTA");

# Now map your clusters to colors, making sure that there's one color for each row:
actualColors <- mycol[mycl]  # Assign colors based on cluster assignment

# Then use these 'actualColors' in your heatmap:
png("piRNA_heatmap_merging_replicates.png", width=800, height=1000)
heatmap.2(as.matrix(datamat),
          Rowv=as.dendrogram(hr),
          Colv=NA,
          dendrogram='row',
          labRow="",
          scale='row',
          trace='none',
          col=bluered(75),
          RowSideColors=actualColors, # Update this part
          srtCol=20,
          lhei=c(1,8),
          cexCol=1.7,    # Increase column label font size
          margin=c(7,1)
        )
dev.off()

#### cluster members #####
write.csv(names(subset(mycl, mycl == '1')),file='YELLOW.txt')
write.csv(names(subset(mycl, mycl == '2')),file='BLUE.txt')
write.csv(names(subset(mycl, mycl == '3')),file='ORANGE.txt')
write.csv(names(subset(mycl, mycl == '4')),file='CYAN.txt')
write.csv(names(subset(mycl, mycl == '5')),file='GREEN.txt')
write.csv(names(subset(mycl, mycl == '6')),file='MAGENTA.txt')
write.csv(names(subset(mycl, mycl == '7')),file='GREY.txt')
write.csv(names(subset(mycl, mycl == '8')),file='LIGHTCYAN.txt')
#write.csv(names(subset(mycl, mycl == '9')),file='RED.txt')
#~/Tools/csv2xls-0.4/csv_to_xls.py gene_expression_merging_replicates.txt YELLOW.txt BLUE.txt ORANGE.txt CYAN.txt MAGENTA.txt GREEN.txt LIGHTCYAN.txt GREY.txt -d',' -o piRNA_heatmap_merging_replicates.xls

Draw plots for snRNA generated by COMPSRA

Leave a reply

Generate the following files according to STEPS 1-4 from http://xgenes.com/article/article-content/239/small-rna-sequencing-processing-in-the-example-of-smallrna-7/, http://xgenes.com/article/article-content/232/small-rna-sequencing-processing-in-the-example-of-smallrna-7/, and http://xgenes.com/article/article-content/156/small-rna-processing/. For COMPSRA_MERGE_0_miRNA.txt, we also need STEP 5 to add the read numbers of MCPyV-M1.
```
COMPSRA_MERGE_0_miRNA.txt
COMPSRA_MERGE_0_piRNA.txt
COMPSRA_MERGE_0_snRNA.txt *
COMPSRA_MERGE_0_tRNA.txt
COMPSRA_MERGE_0_snoRNA.txt
COMPSRA_MERGE_0_circRNA.txt
```
Input files for snRNA are two files: COMPSRA_MERGE_0_snRNA.txt and ids
- COMPSRA_MERGE_0_snRNA.txt
  
  #The former are more precise due to the reads from virus will be mapped on the virus-genome diff ./our_out_on_hg38+JN707599/COMPSRA_MERGE_0_snRNA.txt ./our_out_on_hg38/COMPSRA_MERGE_0_snRNA.txt cp ../our_out_on_hg38+JN707599/COMPSRA_MERGE_0_snRNA.txt .
- prepare the file ids
  
  #see Option4: manully defining

Draw plots with R using DESeq2

#BiocManager::install("AnnotationDbi")
#BiocManager::install("clusterProfiler")
#BiocManager::install(c("ReactomePA","org.Hs.eg.db"))
#BiocManager::install("limma")
library("AnnotationDbi")
library("clusterProfiler")
library("ReactomePA")
library("org.Hs.eg.db")
library(DESeq2)
library(gplots)
library(limma)
# Check the current library paths
.libPaths()
#setwd("/home/jhuang/DATA/Data_Ute/Data_Ute_smallRNA_7/our_out_on_hg38+JN707599_2024_corrected/")

d.raw<- read.delim2("COMPSRA_MERGE_0_snRNA.txt",sep="\t", header=TRUE, row.names=1)
d.raw$X <- NULL
d.raw[] <- lapply(d.raw, as.numeric)

EV_or_parental = as.factor(c("EV","EV", "EV","EV", "EV","EV", "EV","EV", "EV","EV", "parental","parental"))
donor = as.factor(c("0505","1905", "0505","1905", "0505","1905", "0505","1905", "0505","1905", "0505","1905"))
replicates = as.factor(c("sT_DMSO","sT_DMSO", "sT_Dox","sT_Dox", "scr_DMSO","scr_DMSO", "scr_Dox","scr_Dox", "wt","wt", "control","control"))
ids = as.factor(c("0505_WaGa_sT_DMSO","1905_WaGa_sT_DMSO","0505_WaGa_sT_Dox","1905_WaGa_sT_Dox","0505_WaGa_scr_DMSO","1905_WaGa_scr_DMSO","0505_WaGa_scr_Dox","1905_WaGa_scr_Dox","0505_WaGa_wt","1905_WaGa_wt","control_MKL1","control_WaGa"))

cData = data.frame(row.names=colnames(d.raw), replicates=replicates, ids=ids, donor=donor, EV_or_parental=EV_or_parental)
dds<-DESeqDataSetFromMatrix(countData=d.raw, colData=cData, design=~replicates+donor)

rld <- rlogTransformation(dds)

# -- before pca --
png("pca.png", 1200, 800)
plotPCA(rld, intgroup=c("replicates"))
#plotPCA(rld, intgroup = c("replicates", "batch"))
#plotPCA(rld, intgroup = c("replicates", "ids"))
#plotPCA(rld, "batch")
dev.off()

png("pca2.png", 1200, 800)
plotPCA(rld, intgroup=c("donor"))
dev.off()

#### STEP2: DEGs ####
#convert bam to bigwig using deepTools by feeding inverse of DESeq’s size Factor
sizeFactors(dds)
#NULL
dds <- estimateSizeFactors(dds)
sizeFactors(dds)
normalized_counts <- counts(dds, normalized=TRUE)
write.table(normalized_counts, file="normalized_counts.txt", sep="\t", quote=F, col.names=NA)

#---- * to untreated ----
dds<-DESeqDataSetFromMatrix(countData=d.raw, colData=cData, design=~EV_or_parental+donor)
dds$EV_or_parental <- relevel(dds$EV_or_parental, "parental")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("EV_vs_parental")
for (i in clist) {
  contrast = paste("EV_or_parental", i, sep="_")
  res = results(dds, name=contrast)
  res <- res[!is.na(res$log2FoldChange),]
  #https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#why-are-some-p-values-set-to-na
  res$padj <- ifelse(is.na(res$padj), 1, res$padj)
  res_df <- as.data.frame(res)
  write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(i, "all.txt", sep="-"))
  up <- subset(res_df, padj<=0.1 & log2FoldChange>=2)
  down <- subset(res_df, padj<=0.1 & log2FoldChange<=-2)
  write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(i, "up.txt", sep="-"))
  write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(i, "down.txt", sep="-"))
}
#~/Tools/csv2xls-0.4/csv_to_xls.py EV_vs_parental-all.txt EV_vs_parental-up.txt EV_vs_parental-down.txt -d$',' -o EV_vs_parental.xls;

dds<-DESeqDataSetFromMatrix(countData=d.raw, colData=cData, design=~replicates+donor)
dds$replicates <- relevel(dds$replicates, "sT_DMSO")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("sT_Dox_vs_sT_DMSO")

dds$replicates <- relevel(dds$replicates, "scr_Dox")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("sT_Dox_vs_scr_Dox")

dds$replicates <- relevel(dds$replicates, "scr_DMSO")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("scr_Dox_vs_scr_DMSO", "sT_DMSO_vs_scr_DMSO")

for (i in clist) {
  contrast = paste("replicates", i, sep="_")
  res = results(dds, name=contrast)
  res <- res[!is.na(res$log2FoldChange),]
  #https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#why-are-some-p-values-set-to-na
  res$padj <- ifelse(is.na(res$padj), 1, res$padj)
  res_df <- as.data.frame(res)
  write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(i, "all.txt", sep="-"))
  up <- subset(res_df, padj<=0.1 & log2FoldChange>=2)
  down <- subset(res_df, padj<=0.1 & log2FoldChange<=-2)
  write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(i, "up.txt", sep="-"))
  write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(i, "down.txt", sep="-"))
}

~/Tools/csv2xls-0.4/csv_to_xls.py \
sT_Dox_vs_sT_DMSO-all.txt \
sT_Dox_vs_sT_DMSO-up.txt \
sT_Dox_vs_sT_DMSO-down.txt \
-d$',' -o sT_Dox_vs_sT_DMSO.xls;

~/Tools/csv2xls-0.4/csv_to_xls.py \
sT_Dox_vs_scr_Dox-all.txt \
sT_Dox_vs_scr_Dox-up.txt \
sT_Dox_vs_scr_Dox-down.txt \
-d$',' -o sT_Dox_vs_scr_Dox.xls;

~/Tools/csv2xls-0.4/csv_to_xls.py \
scr_Dox_vs_scr_DMSO-all.txt \
scr_Dox_vs_scr_DMSO-up.txt \
scr_Dox_vs_scr_DMSO-down.txt \
-d$',' -o scr_Dox_vs_scr_DMSO.xls;

~/Tools/csv2xls-0.4/csv_to_xls.py \
sT_DMSO_vs_scr_DMSO-all.txt \
sT_DMSO_vs_scr_DMSO-up.txt \
sT_DMSO_vs_scr_DMSO-down.txt \
-d$',' -o sT_DMSO_vs_scr_DMSO.xls;

##### STEP3: prepare all_genes #####
rld <- rlogTransformation(dds)
mat <- assay(rld)
mm <- model.matrix(~replicates, colData(rld))
mat <- limma::removeBatchEffect(mat, batch=rld$donor, design=mm)
assay(rld) <- mat
RNASeq.NoCellLine <- assay(rld)
# reorder the columns
colnames(RNASeq.NoCellLine) = c("0505 WaGa sT DMSO","1905 WaGa sT DMSO","0505 WaGa sT Dox","1905 WaGa sT Dox","0505 WaGa scr DMSO","1905 WaGa scr DMSO","0505 WaGa scr Dox","1905 WaGa scr Dox","0505 WaGa wt","1905 WaGa wt","control MKL1","control WaGa")
col.order <-c("control MKL1",  "control WaGa","0505 WaGa wt","1905 WaGa wt","0505 WaGa sT DMSO","1905 WaGa sT DMSO","0505 WaGa sT Dox","1905 WaGa sT Dox","0505 WaGa scr DMSO","1905 WaGa scr DMSO","0505 WaGa scr Dox","1905 WaGa scr Dox")
RNASeq.NoCellLine <- RNASeq.NoCellLine[,col.order]

#Option4: manully defining
#for i in EV_vs_parental sT_Dox_vs_sT_DMSO sT_Dox_vs_scr_Dox scr_Dox_vs_scr_DMSO sT_DMSO_vs_scr_DMSO; do echo "cut -d',' -f1-1 ${i}-up.txt > ${i}-up.id"; echo "cut -d',' -f1-1 ${i}-down.txt > ${i}-down.id"; done
#cat *.id | sort -u > ids
##add Gene_Id in the first line, delete the ""
GOI <- read.csv("ids")$Gene_Id
datamat = RNASeq.NoCellLine[GOI, ]

##### STEP4: clustering the genes and draw heatmap #####
datamat <- datamat[,-1]  #delete the sample "control MKL1"
colnames(datamat)[1] <- "WaGa control"  #rename the isolate names according to the style of RNA-seq as follows?
colnames(datamat)[2] <- "WaGa wildtype 0505"
colnames(datamat)[3] <- "WaGa wildtype 1905"
colnames(datamat)[4] <- "WaGa sT DMSO 0505"
colnames(datamat)[5] <- "WaGa sT DMSO 1905"
colnames(datamat)[6] <- "WaGa sT Dox 0505"
colnames(datamat)[7] <- "WaGa sT Dox 1905"
colnames(datamat)[8] <- "WaGa scr DMSO 0505"
colnames(datamat)[9] <- "WaGa scr DMSO 1905"
colnames(datamat)[10] <- "WaGa scr Dox 0505"
colnames(datamat)[11] <- "WaGa scr Dox 1905"
write.csv(datamat, file ="gene_expression_keeping_replicates.txt")

#"ward.D"’, ‘"ward.D2"’,‘"single"’, ‘"complete"’, ‘"average"’ (= UPGMA), ‘"mcquitty"’(= WPGMA), ‘"median"’ (= WPGMC) or ‘"centroid"’ (= UPGMC)
hr <- hclust(as.dist(1-cor(t(datamat), method="pearson")), method="complete")
hc <- hclust(as.dist(1-cor(datamat, method="spearman")), method="complete")
mycl = cutree(hr, h=max(hr$height)/1.5)
mycol = c("YELLOW", "BLUE", "ORANGE", "CYAN", "GREEN", "MAGENTA", "GREY", "LIGHTCYAN", "RED",     "PINK", "DARKORANGE", "MAROON",  "LIGHTGREEN", "DARKBLUE",  "DARKRED",   "LIGHTBLUE", "DARKCYAN",  "DARKGREEN", "DARKMAGENTA");

mycol = mycol[as.vector(mycl)]
png("snRNA_heatmap_keeping_replicates.png", width=800, height=1000)
#svg("DEGs_heatmap_keeping_replicates.svg", width=6, height=8)
heatmap.2(as.matrix(datamat),
  Rowv=as.dendrogram(hr),
  Colv=NA,
  dendrogram='row',
  labRow="",
  scale='row',
  trace='none',
  col=bluered(75),
  RowSideColors=mycol,
  srtCol=20,
  lhei=c(1,8),
  #cexRow=1.2,   # Increase row label font size
  cexCol=1.7,    # Increase column label font size
  margin=c(7,1)
 )
dev.off()

#### cluster members #####
write.csv(names(subset(mycl, mycl == '1')),file='YELLOW.txt')
write.csv(names(subset(mycl, mycl == '2')),file='BLUE.txt')
write.csv(names(subset(mycl, mycl == '3')),file='ORANGE.txt')
#~/Tools/csv2xls-0.4/csv_to_xls.py gene_expression_keeping_replicates.txt YELLOW.txt ORANGE.txt BLUE.txt -d',' -o snRNA_heatmap_keeping_replicates.xls

mv snRNA_heatmap_keeping_replicates.png snRNA_heatmap.png
mv snRNA_heatmap_keeping_replicates.xls snRNA_heatmap.xls
mv pca.png snRNA_pca.png
mv EV_vs_parental.xls snRNA_EV_vs_parental.xls
mv sT_DMSO_vs_scr_DMSO.xls snRNA_sT_DMSO_vs_scr_DMSO.xls
mv sT_Dox_vs_scr_Dox.xls snRNA_sT_Dox_vs_scr_Dox.xls
mv sT_Dox_vs_sT_DMSO.xls snRNA_sT_Dox_vs_sT_DMSO.xls
mv scr_Dox_vs_scr_DMSO.xls snRNA_scr_Dox_vs_scr_DMSO.xls
# --> SENDING snRNA_*.png, snRNA_EV_vs_parental.xls, and snRNA_heatmap.xls

# ---- NOT WORKING WELL ----
# merging replicates
datamat <- cbind(datamat, "WaGa wildtype" = rowMeans(datamat[, 2:3]))
datamat <- cbind(datamat, "WaGa sT DMSO" = rowMeans(datamat[, 4:5]))
datamat <- cbind(datamat, "WaGa sT Dox" = rowMeans(datamat[, 6:7]))
datamat <- cbind(datamat, "WaGa scr DMSO" = rowMeans(datamat[, 8:9]))
datamat <- cbind(datamat, "WaGa scr Dox" = rowMeans(datamat[, 10:11]))
datamat <- datamat[,c(-2:-11)]
write.csv(datamat, file ="gene_expression_merging_replicates.txt")

# Ensure 'mycl' is calculated properly.
mycl <- cutree(hr, h=max(hr$height)/2.9)
# mycol = c("YELLOW", "BLUE", "ORANGE", "CYAN", "GREEN", "MAGENTA", "GREY", "LIGHTCYAN", "RED",     "PINK", "DARKORANGE", "MAROON",  "LIGHTGREEN", "DARKBLUE",  "DARKRED",   "LIGHTBLUE", "DARKCYAN",  "DARKGREEN", "DARKMAGENTA");

# Now map your clusters to colors, making sure that there's one color for each row:
actualColors <- mycol[mycl]  # Assign colors based on cluster assignment

# Then use these 'actualColors' in your heatmap:
png("snRNA_heatmap_merging_replicates.png", width=800, height=1000)
heatmap.2(as.matrix(datamat),
          Rowv=as.dendrogram(hr),
          Colv=NA,
          dendrogram='row',
          labRow="",
          scale='row',
          trace='none',
          col=bluered(75),
          RowSideColors=actualColors, # Update this part
          srtCol=20,
          lhei=c(1,8),
          cexCol=1.7,    # Increase column label font size
          margin=c(7,1)
        )
dev.off()

#### cluster members #####
write.csv(names(subset(mycl, mycl == '1')),file='YELLOW.txt')
write.csv(names(subset(mycl, mycl == '2')),file='BLUE.txt')
write.csv(names(subset(mycl, mycl == '3')),file='ORANGE.txt')
write.csv(names(subset(mycl, mycl == '4')),file='CYAN.txt')
write.csv(names(subset(mycl, mycl == '5')),file='GREEN.txt')
write.csv(names(subset(mycl, mycl == '6')),file='MAGENTA.txt')
write.csv(names(subset(mycl, mycl == '7')),file='GREY.txt')
write.csv(names(subset(mycl, mycl == '8')),file='LIGHTCYAN.txt')
#write.csv(names(subset(mycl, mycl == '9')),file='RED.txt')
#~/Tools/csv2xls-0.4/csv_to_xls.py gene_expression_merging_replicates.txt YELLOW.txt BLUE.txt ORANGE.txt CYAN.txt MAGENTA.txt GREEN.txt LIGHTCYAN.txt GREY.txt -d',' -o snRNA_heatmap_merging_replicates.xls

高维数据推断：简介

Leave a reply

Text source from https://rafalab.dfci.harvard.edu/

高通量技术已将基础生物学和生物医学科学从数据贫乏的学科转变为数据密集的学科。
一个具体的例子来自于研究基因表达的领域。
基因表达是将DNA（生命的蓝图）复制到RNA（蛋白质合成模板，生命的构建基石）的过程。
在1990年代，基因表达数据的分析相当于在纸上识别黑点或从标准曲线中提取几个数字。
随着微阵列等高通量技术的出现，这一过程突然变为处理成千上万个数字。
最近，RNA测序进一步增加了数据的复杂性。
生物学家从用眼睛或简单的总结来分类结果，转变为每个样本有成千上万（现在是数百万）的测量数据需要分析。
在本章中，我们将重点讨论高通量测量背景下的统计推断。
具体而言，我们关注使用统计检验检测组间差异的问题，并以有意义的方式量化不确定性。
我们还介绍了在分析高通量数据时应结合推断使用的探索性数据分析技术。
在后续章节中，我们将研究聚类、机器学习、因子分析和多层次建模背后的统计学原理。
由于有大量可用的公共数据集，我们使用了几个基因表达的例子。
尽管如此，您将学习的统计技术在其他使用高通量技术的领域也已被证明是有用的。
微阵列、下一代测序、fMRI和质谱等技术都生成数据来回答一些问题，而我们在这里学到的知识将是不可或缺的。

数据包

我们将在接下来的章节中使用的几个示例最好通过R包获取。
这些包可以从GitHub获得，并可以使用devtools包中的install_github函数安装。

三张表

我们在本书中用作示例的大多数数据都是通过高通量技术生成的。
这些技术测量成千上万个特征。
特征的例子包括基因、基因组的单个碱基位置、基因组区域或图像像素强度。
每个特定的测量产品由一组特定的特征定义。
例如，特定的基因表达微阵列产品由它测量的基因集定义。
一个特定的研究通常会使用一个产品对若干实验单位（如个体）进行测量。
最常见的实验单位是个体，但也可以由其他实体定义，例如肿瘤的不同部分。
我们通常按照实验术语将实验单位称为样本。
重要的是不要将这些与前几章提到的样本（例如“随机样本”）混淆。
因此，高通量实验通常由三张表定义：一张包含高通量测量数据，另外两张分别包含关于这张表的列和行的信息。
因为一个数据集通常由一组实验单位定义，而一个产品定义了一组固定的特征，所以高通量测量数据可以存储在一个n × m的矩阵中，其中n是单位数，m是特征数。
在R中，惯例是存储这些矩阵的转置。以下是一个基因表达数据集的示例：
```
library(GSE5859Subset)
data(GSE5859Subset) ##这会加载三张表
dim(geneExpression)
```
我们有来自24个个体（实验单位）的血液样本中8793个基因的RNA表达测量数据。
对于大多数统计分析，我们还需要关于这些个体的信息：dim(sampleInfo)
例如，在这种情况下，数据最初是为了比较不同族群的基因表达而收集的。

其中一列，filenames，使我们能够将此表的行与测量表的列连接起来。

match(sampleInfo$filename,colnames(geneExpression))
head(match(geneAnnotation$PROBEID,rownames(geneExpression)))

Long-Read Transposon Insertion-site Sequencing vs. short-read TIS methods

Leave a reply

https://pubmed.ncbi.nlm.nih.gov/35241765/

Transposon insertion site sequencing (TIS) is a powerful method for associating genotype to phenotype.
However, all TIS methods described to date use short nucleotide sequence reads which cannot uniquely determine the locations of transposon insertions within repeating genomic sequences where the repeat units are longer than the sequence read length.
To overcome this limitation, we have developed a TIS method using Oxford Nanopore sequencing technology that generates and uses long nucleotide sequence reads; we have called this method LoRTIS (Long-Read Transposon Insertion-site Sequencing).
LoRTIS enabled the unique localisation of transposon insertion sites within long repetitive genetic elements of E. coli, such as the transposase genes of insertion sequences and copies of the ~ 5 kb ribosomal RNA operon.
We demonstrate that LoRTIS is reproducible, gives comparable results to short-read TIS methods for essential genes, and better resolution around repeat elements.
The Oxford Nanopore sequencing device that we used is cost-effective, small and easily portable.
Thus, LoRTIS is an efficient means of uniquely identifying transposon insertion sites within long repetitive genetic elements and can be easily transported to, and used in, laboratories that lack access to expensive DNA sequencing facilities.

Introduction

Transposon insertion site sequencing (TIS) is a robust tool used to identify the genetic loci associated with the phenotype of an organism1.
The general approach for TIS methods has been developed over the last decade to understand the essential genes in many bacterial and eukaryotic species2,3. * TIS has helped the scientific community to understand genes involved in responses to a wide range of stress conditions and has enabled identification of many novel susceptibility and survival mechanisms1.
The first step in TIS methods is the generation of a transposon mutant library in which the resulting pool of mutants contains many random transposon insertions in all genes (except essential genes) of the target organism at multiple positions4.
The location of transposon insertions is identified by sequencing from the transposon insertion site into adjacent regions of the target genome5.
Saturated mutant libraries can provide very high resolution coverage of the genome allowing the importance of regions within genes to be identified as well as simply a role for the gene itself6,7.
To identify the repertoire of genes essential for responses to a particular stress condition, the transposon mutant library is grown under that stress condition (test) and in the absence of that stress (control).
Differences in the prevalence of transposon mutants within the pools are compared between the stress condition of interest and the control6.
Mutants with an advantage following exposure to the stress condition will proliferate while those that are disadvantaged will become depleted.
Comparison of the total number of inserts at each site between control and condition experiments indicates loci of interest8.
The DNA of mutant libraries is sequenced, using a customised protocol, [e.g. transposon-directed insertion site sequencing (TraDIS), transposon sequencing (Tn-seq), high-throughput insertion tracking by deep sequencing (HITS) and insertion sequencing (INSeq)], from the transposon reading into the adjacent genome to identify the location of transposon insertions5,9–11.
These TIS methods use different transposon-genome junction enrichment techniques including restriction enzyme digestion, (e.g. MmeI), or the use of customised primer annealing followed by high-throughput sequencing of the resulting libraries5,12. These transposon based methods all use short-read sequencing, and the majority of them use either Illumina or Ion torrent sequencing platforms13.
Sequence reads will not map uniquely to repeat sequences within a reference genome that are longer than the nucleotide sequence reads themselves (Fig. 1)14. * The sequence read length generated by Illumina-based approaches is around 300 base pairs (for a single read), and so reads cannot map uniquely to repeat-sequence motifs that are longer than these reads15.
Examples of long repeat regions in E. coli include transposase insertion sequences, which are usually over 600 bp, and the several copies of the ribosomal RNA operons, which are over 5 kb16.
The repeat regions are even a greater challenge in eukaryotic genomes which are much larger and have numerous repeating genetic elements.
Many repeating elements are of evolutionary importance, and/or have roles in replication, recombination, transcription and genome rearrangement17.
Thus, transposon insertions into different copies of a repeating element may have different phenotypic effects.
Therefore, determining the unique locations of transposon insertion sites in repeating elements is essential for formulating meaningful hypotheses from TIS data.
Here, we describe LoRTIS (Long-Read Transposon Insertion-site Sequencing), which overcomes the limitations resulting from using short-read nucleotide sequences.
In LoRTIS, as with other TIS methods, we have used the known sequences of the transposon as an anchor point from which to generate DNA fragments by amplification of DNA into the unknown transposon insertion sites for sequencing.
However, in LoRTIS, by modifying this amplification step, much longer DNA fragments are generated, from which long sequence reads are possible using a MinION nanopore sequencer (Oxford Nanopore)18.
The resulting long-reads are then aligned with the nucleotide sequence of a reference genome, and the locations identifying the precise positions of the transposon insertions sites determined.
The longer reads can span repeat sequences and read into adjacent unique sites.
The long-reads generated by LoRTIS were able to locate transposon insertion sites uniquely within repeating sequence elements of the ribosomal RNA operons, which are approximately 5 kb.
In addition, we incorporated different nucleotide sequence identifiers (indexes) into replicates to demonstrate that multiple samples can be sequenced on a single flow cell.
Thus, LoRTIS can identify hundreds of thousands of transposon insertion sites uniquely and overcomes the limitation of short-reads generated by other TIS methods.

Methods: DNA extraction and experimental setup

The Escherichia coli strain BW25113 large transposon mutant library described by Yasir et al. was used in this study6. Genomic DNA was extracted using the Quick-DNA Fungal/Bacterial 96 Kit (Zymo Research).
LoRTIS and TraDIS-Xpress samples were prepared in duplicate, from DNA extraction to sequencing library preparation. For LoRTIS each replicate was sequenced on a separate flow cell, so Native Barcodes (NBD) were added to each replicate to test the multiplexing capability of the LoRTIS method. NBD details are shown in Supplementary Table 5.

Methods: Preparation of DNA fragments for LoRTIS and TraDIS-Xpress

For LoRTIS sequencing library preparation,1 μg of genomic DNA was tagmented using an Illumina compatible MuSeek library preparation kit (Thermo Fisher Scientific) and the tagmented DNA was cleaned using 0.5× volume of AMPure XP magnetic beads (Beckman-Coulter). LongAmp Taq DNA polymerase (New England Biolabs, UK) was used with biotinylated oligonucleotide primers (Supplementary Table 5) that hybridise specifically within the transposon to amplify DNA fragments containing transposon insertion junctions (Supplementary Table 6 PCR1a). A second primer that anneals to MuSeek adapters (‘IonTMu-02’) was used to generate biotinylated double-stranded PCR products containing transposon-genome junctions (Supplementary Table 6 PCR1b). These PCR products were enriched using a Dynabeads™ kilobaseBINDER™ kit (Thermo Fisher Scientific). A nested PCR reaction was carried out using NEB LongAmp Taq polymerase to generate DNA fragments with Oxford Nanopore multiplexing adapters that lacked biotin (Supplementary Table 6 PCR2). The PCR products from this reaction were purified using 0.5× volume of AMPure XP magnetic beads (Beckman-Coulter) and were end-repaired, then nanopore sequencing adapters were ligated according to the Oxford Nanopore protocol.
For nucleotide sequencing by LoRTIS, the resulting DNA fragments were loaded onto an Oxford Nanopore MinION flowcell and sequenced for up to 48 h (Fig. 7). Using the LoRTIS protocol, nanopore long-read nucleotide sequence data were generated from two separate runs on an Oxford Nanopore flow cell to generate two replicate data sets that could be compared for reproducibility.

Methods: Bioinformatics

FAST5 format data from MinION was processed using Guppy Basecalling software (version 3.6.0) running in High Accuracy Calling (HAC) mode and using GPUs on the Quadram Institute cloud infrastructure. The resulting sequence data in FASTQ format were demultiplexed using QCat (version 1.1.0) and those reads containing transposon nucleotide sequences were identified, trimmed and retained, again using QCat. The nucleotide sequence reads were then located to the BW25113 reference genome (CP009273) using Bio-TraDIS (version 1.4.1)27 and Bio-LoRTIS software (version 0.0.5)28 in a similar way as for short-read data, except that steps to remove the transposon sequences were skipped and Minimap2 (version 2.17-r941)29 was substituted in the place of SMALT. Results from BioTradis, including plot files, were outputted in the same format as for short-read data.
The insertion patterns at candidate loci were inspected visually using Artemis (version 18.1.0) which was also used to capture images for figures30. Gene essentiality was determined using tradis_essentiality.R from BioTradis, and scatter charts were plotted to determine reproducibility using Microsoft Excel after normalising for numbers of reads in each replicate. The correlation between TraDIS-Xpress and LoRTIS data was calculated using Spearman’s rank correlation coefficient.

TraDIS bioinformatics toolkit:

    #https://www.biorxiv.org/content/10.1101/2022.05.26.493556v1.full.pdf
    Nano-Bio-TraDIS: https://github.com/quadram-institute-bioscience/LoRTIS
    http://sanger-pathogens.github.io/Bio-Tradis/

Downdetector: https://downdetector.com/status/openai/

散光和老花

Leave a reply

1.揉眼睛是散光的最大原因。
2.查检查结果，拍下来。约一个眼科医生或好的验光师。下次选择抗蓝光涂层和防反光涂层。
3.带一个尺子测量一下屏幕的距离。

"Blau-Licht-Filter" und "Entspiegelung" sind zwei wichtige Beschichtungen für Brillengläser, die unterschiedliche Vorteile bieten:
Blau-Licht-Filter (Blau-Licht-Blocker)

Blau-Licht-Filter, auch als Anti-Blau-Licht-Beschichtung bekannt, hilft, schädliches blaues Licht zu reduzieren, das von digitalen Bildschirmen (Computer, Smartphones, Tablets) und künstlichen Lichtquellen ausgestrahlt wird. Diese Beschichtung kann helfen:

    Augenbelastung zu verringern
    Schlafqualität zu verbessern
    Das Risiko von Augenschäden durch langfristige Einwirkung von blauem Licht zu reduzieren

Entspiegelung (Anti-Reflex-Beschichtung)

Eine Entspiegelungsbeschichtung reduziert Reflexionen auf der Brillenoberfläche, was zu klarerem und komfortablerem Sehen führt. Diese Beschichtung bietet folgende Vorteile:

    Weniger störende Reflexionen, besonders beim Autofahren in der Nacht oder bei hellem Licht
    Bessere Ästhetik, da die Brille weniger blendet und die Augen besser sichtbar sind
    Erhöhte Haltbarkeit und einfache Reinigung, da viele Entspiegelungsbeschichtungen auch wasser- und schmutzabweisend sind

Beide Beschichtungen können kombiniert werden, um maximalen Komfort und Schutz für Ihre Augen zu bieten.

https://www.allaboutvision.com/de/augenerkrankungen/presbyopie/presbyopie-optionen/

“散光”是Stabsichtigkeit，或Astigmatismus
Presbyopie Alterssichtigkeit
Kombination of Stabsichtigkeit und Presbyopie Brille
Brille für Alterssichtigkeit und Stabsichtigkeit

#https://www.fielmann.it/de/wissen/fehlsichtigkeiten/astigmatismus/
Astigmatismus: „Alles unscharf – wenn man einen Knick in der Optik hat…“

老花眼镜不太适合长期在屏幕前工作。老花眼镜主要设计用于近距离阅读，它们的焦点通常在30-40厘米左右的距离，这对于看书或查看手机是理想的。然而，当您在电脑上工作时，屏幕通常会放置在更远的位置（约60-70厘米），这种情况下，老花眼镜的焦点并不适用。

为了更舒适地进行屏幕工作，建议使用专门设计的屏幕眼镜或渐进多焦镜。屏幕眼镜通常具有中距离的焦点，非常适合电脑屏幕的工作距离。渐进多焦镜则具有多个焦点，可以涵盖从近距离到中距离的视野，非常适合需要频繁在不同距离间切换视野的情况，如在电脑屏幕和桌面文件之间切换。

总结来说，如果您长期在屏幕前工作，老花眼镜可能会导致视力模糊、眼睛疲劳甚至头痛，因此更适合选择屏幕眼镜或渐进多焦镜。

#------

是的，您可以选择结合老花和散光矫正的镜片。这种镜片通常称为多焦点镜片或渐进多焦镜片，它们能够同时矫正老花和散光。

以下是一些常见的选择：

    渐进多焦镜片：
        特点：这种镜片在同一镜片中具有多个焦点区域，能够从近距离（阅读）到中距离（电脑工作）再到远距离（看远处）提供连续的清晰视野。
        优点：适合需要在不同距离间切换视野的情况，既能解决老花问题，又能矫正散光。

    双光或三光镜片：
        特点：这种镜片在同一镜片中具有两个或三个不同的焦点区域，通常是近距离和远距离（双光镜片）或近距离、中距离和远距离（三光镜片）。
        优点：能够满足不同距离的视力需求，适合需要固定距离工作的情况。

    专门的屏幕眼镜：
        特点：这种镜片专门为中距离（电脑工作距离）设计，同时可以根据您的具体情况加入老花和散光的矫正。
        优点：专为屏幕工作设计，可以有效减少眼睛疲劳和不适。

建议步骤：

    眼科检查：首先，进行详细的眼科检查，确定您的老花度数和散光度数。
    咨询专业眼镜店：与专业的眼镜店或验光师进行咨询，选择适合您的镜片类型。
    定制镜片：根据您的具体需求和验光结果定制适合的镜片。

注意事项：

    正确的验光结果：确保验光结果准确，以便定制的镜片能够完美适合您的视力需求。
    镜片材质和涂层：考虑选择抗蓝光涂层和防反光涂层，以提高在电脑工作时的舒适度。
    佩戴习惯的适应：如果您选择渐进多焦镜片，可能需要一些时间来适应佩戴的习惯。

通过上述方式，您可以获得既能矫正老花又能矫正散光的镜片，提升在屏幕前工作的舒适度和效率

#------

如果您不需要在不同距离间切换视野，并且只是长期在屏幕前工作，那么一副专门的屏幕眼镜可能是最适合您的选择。屏幕眼镜的设计是专门为中距离视野优化的，这正好适合电脑屏幕的工作距离（约60-70厘米）。这种眼镜能帮助您在使用电脑时保持视力清晰，同时减少眼睛疲劳和不适。

以下是选择屏幕眼镜的一些建议：

    定制镜片：确保镜片是根据您的视力情况定制的，而不是使用超市购买的标准老花眼镜。这样可以保证每只眼睛的度数都是最适合您的。

    抗蓝光涂层：考虑选择具有抗蓝光涂层的镜片，减少长时间看屏幕对眼睛的伤害。

    舒适的框架：选择轻便、佩戴舒适的镜框，避免长时间佩戴造成的不适。

总结而言，如果您长期在固定距离的屏幕前工作，专门的屏幕眼镜会比老花眼镜更合适，能有效提高工作舒适度和效率。

Lässt ab Mitte 40 die Sehkraft nach, haben viele Menschen zunehmend Probleme, wenn sie längere Zeit am Monitor arbeiten. Muss es dann gleich eine Gleitsichtbrille sein?

Schauen Sie auch den größten Teil des Tages auf einen Bildschirm? Wer dann mit Kopfschmerzen, trockenen Augen oder verschwommener Sicht und Verspannungen im Nacken- und Schulterbereich zu kämpfen hat, sollte sich auf die Suche nach den Ursachen machen. Stecken Sehdefizite hinter den Problemen, können Ihnen laut Kuratorium Gutes Sehen (KGS) je nach Alter folgende Lösungen helfen.
Ab 45 Jahren: Angepasste Lesebrille für den Übergang

Lässt Mitte 40 die Sehkraft nach, können sich Betroffene, die am Computer arbeiten, auch erst mal mit einer regulären Lesebrille behelfen. Vorausgesetzt, die Alterssichtigkeit liegt bei einem Wert von unter +1,25 Dioptrien.

Ein Modell vom Discounter ist aber nicht unbedingt die beste Wahl. Selten haben beide Augen die gleiche Sehstärke, heißt es vom KSG – die Gläser von Lesebrillen aus dem Discounter weisen aber üblicherweise jeweils die gleiche Dioptrien auf. Besser also: auch die Lesebrille individuell beim Optiker anpassen lassen.

Außerdem wichtig: Beim Blick in den Raum sollten Sie die Lesebrille absetzen, sonst verschwimmt die Sicht. Dieses Problem lässt sich mit einer Bildschirmbrille mit Gleitsichtfunktion umgehen. Damit können die Augen sowohl den Monitor als auch die Tastatur sowie den Raum dahinter klar erfassen.

    Lesen Sie auch: Leiden Sie unter dem Office-Eye-Syndrom?

Ab 50 Jahren: Bildschirmbrille mit oder ohne Gleitsicht

Spätestens wenn die Alterssichtigkeit um das 50. Lebensjahr +1,25 Dioptrien beträgt, ist es ratsam, über eine spezielle Bildschirmbrille nachzudenken, so das KGS. Ein Modell mit Gleitsichtfunktion biete einen großen Sehbereich für mittlere und kurze Distanzen. Der Fokus liegt auf dem Monitor, der obere Bereich des Glases ermöglicht gleichzeitig scharfes Sehen im Raum. Im Alltag eignet sich die Brille den Infos zufolge ebenfalls zum Lesen, Kochen oder Handwerken.

Eine Bildschirmbrille ohne Gleitsichtgläser ist in diesem Alter ebenfalls eine Option. Beschäftigte müssen sie aber immer absetzen, wenn sie etwa den Raum scharf wahrnahmen möchten.
Was genau bedeutet Dioptrien (dpt)?

Dioptrien ist die Maßeinheit, die die Brechkraft der Brillengläser angibt. Mithilfe dieser Zahl können die Gläser mit der richtigen Wölbung geschliffen werden, sodass die individuelle Fehlsichtigkeit korrigiert wird. Ein Pluszeichen (+) steht dabei für eine Weitsichtigkeit (Dioptrien im positiven Bereich), ein Minuszeichen (-) bedeutet eine Kurzsichtigkeit (Dioptrien im negativen Bereich).
Ab 55 Jahren: Bildschirmbrille mit Gleitsicht

Ab dem 55. Lebensjahr steige die Alterssichtigkeit auf über +1,5 Dioptrien an, so das Kuratorium. Dann sei eine Bildschirmbrille mit Gleitsichtfunktion die beste Lösung. Am besten messen Beschäftigte für eine passende Einstellung beim Optiker vorab die Entfernungen zum Monitor, zur Tastatur und zu einem Blatt auf dem Schreibtisch.

    Kräuter, Kompressen und Co.: Diese Hausmittel helfen bei trockenen Augen
    Gefahr von Fehlhaltungen: Wann Sie Gleitsichtbrillen nicht tragen sollten
    Selbsttest: Jung geblieben? Ermitteln Sie Ihr biologisches Alter
    Anzeige Jérémy Desplanches: Olympia 2024 wird sein letztes großes Turnier

Einstärkenbrillen und Gleitsichtbrillen für den Alltag sind dann keine wirkliche Alternative mehr – letztere haben laut KGS einen zu schmalen Korrektionsbereich für die Monitordistanz.

到了四十多岁，如果视力开始下降，许多人在长时间使用电脑时会遇到越来越多的问题。那么，是否需要立即配备一副渐进多焦镜呢？

您是否大部分时间都在看屏幕？如果您经常感到头痛、眼睛干涩或视力模糊以及颈部和肩部紧张，应该查找原因。如果这些问题是由于视力缺陷引起的，根据“好视力基金会”（KGS）的建议，根据您的年龄以下解决方案可能会有所帮助。

45岁以上：过渡期的定制阅读眼镜

如果在四十多岁时视力开始下降，电脑工作者可以先使用常规的阅读眼镜，前提是老花眼的度数低于+1.25屈光度（Dioptrien）。

但是，超市购买的阅读眼镜不一定是最好的选择。KGS表示，少有两只眼睛的视力完全相同，而超市的阅读眼镜通常具有相同的屈光度。因此，更好的选择是到眼镜店定制阅读眼镜。

此外，重要的是，当您看远处时，应摘下阅读眼镜，否则视力会变得模糊。这一问题可以通过配备具有渐进多焦功能的屏幕眼镜来解决。这样，眼睛可以清晰地看到屏幕、键盘以及背后的房间。

50岁以上：带或不带渐进多焦功能的屏幕眼镜

KGS表示，老花眼的度数在50岁左右达到+1.25屈光度时，建议考虑配备专门的屏幕眼镜。具有渐进多焦功能的眼镜可以提供较大的中短距离视野。焦点在屏幕上，镜片的上部区域同时可以清晰地看到房间。日常生活中，这种眼镜也适用于阅读、烹饪或手工制作。

在这个年龄，没有渐进多焦功能的屏幕眼镜也是一种选择。但如果需要清晰地看房间，使用者必须随时摘下眼镜。

屈光度（dpt）是什么意思？

屈光度是表示眼镜镜片折射力的单位。通过这个数字，镜片可以被研磨成适当的弯曲度，以矫正个人的视力缺陷。加号（+）表示远视（正屈光度），减号（-）表示近视（负屈光度）。

55岁以上：带渐进多焦功能的屏幕眼镜

KGS表示，55岁以上时，老花眼的度数会上升到+1.5屈光度以上，此时带有渐进多焦功能的屏幕眼镜是最佳解决方案。为了最佳配镜，工作者最好提前在眼镜店测量到屏幕、键盘和桌上文件的距离。

单焦点眼镜和日常使用的渐进多焦眼镜不再是有效的替代品——根据KGS的说法，后者对屏幕距离的矫正范围太窄。

#最近早上醒来，眼睛特别干，看东西特别模糊，大概过半小时后好转。

根据您的年龄和症状，早上醒来视力模糊并在一段时间后好转，这可能与几种因素有关。以下是一些常见原因及其解释：
可能的原因

    干眼症
        解释：晚上睡觉时，泪液分泌减少，导致眼睛干涩，早上醒来时可能会感觉视力模糊。
        建议：使用人工泪液或眼药水，特别是在早晨醒来时。

    老花眼（Presbyopie）
        解释：45岁是老花眼的常见发病年龄，早上眼睛肌肉可能还未完全放松，导致近距离视力模糊。
        建议：使用阅读眼镜或多焦点眼镜。

    夜间眼压波动
        解释：睡觉时眼压波动可能导致角膜形状暂时性变化，影响早晨的视力。
        建议：定期检查眼压，尤其是有青光眼风险的人群。

    泪液质量或数量不足
        解释：泪液的质量或数量不足会导致泪膜不稳定，早晨视力模糊。
        建议：改善泪液质量，使用泪膜稳定剂。

    其他健康问题
        解释：某些全身性疾病如糖尿病、高血压等也可能导致视力波动。
        建议：定期体检，控制血糖和血压。

建议的行动

    咨询眼科医生
        建议您尽快预约眼科医生进行详细检查，以排除任何潜在的眼部疾病，并获取专业的诊断和治疗建议。

    保持眼睛湿润
        使用人工泪液或润滑眼药水，特别是在早晨醒来时使用，可以缓解干眼症状。

    改善生活习惯
        确保良好的睡眠姿势和环境，避免眼睛过度疲劳。
        保持良好的饮食习惯，多摄入有助于眼睛健康的营养素，如维生素A、C、E和Omega-3脂肪酸。

    定期体检
        进行全面的身体检查，确保没有全身性疾病影响您的视力。

总结

早上视力模糊可能由多种原因引起，最好通过专业的眼科检查来确定具体原因。根据您的症状和年龄，老花眼和干眼症是比较常见的原因，但也不排除其他因素的影响。定期检查眼睛健康和整体健康是确保视力保持良好的重要措施。

Logics of Transposon sequencing

Leave a reply

original reads:
>read_1/1
    AAGAGTTAGTAGGGCGGGCGATTAAGGGCAAACGTGATCAGGTCTTTTTAGCGACTAAGT
TCGGTATTGTGCGTGACCCCACTGATCCCAGGGTACGCGGTGTGAGCAGCCGTCCTGATT
ATATCCGCCAGTCG  GTTGATGAGAGCTTTGTTGTAGGTG

TODO_TOMORROW: follow the logics below!
vs. NEB_i5_Tn_02
5'-ACACTCTTTCCCTACACGACGCTCTTCCGATCTCACCTACAACAAAGCTCTCATCAAC-3'-bacterial_genomics
duplicate_PCR generate rc fragments
                 adapter_removed!  |GTGGA  ...     AGAGTAGTTG                       ...  GATTGAGAA
                                                                                 <-------read_R1--

>read_1/2
AATGAAGAGTTAGTAGGGCGGGCGATTAAGGGCAAACGTGATCAGGTCTTTTTAGCGACT
AAGTTCGGTATTGTGCGTGACCCCACTGATCCCAGGGTACGCGGTGTGAGCAGCCGTCCT
GATTATATCCGCCAGTCGGTTGATG  AGAGCTTTGTTGT

SAM-format

Leave a reply

The two records in the SAM-format represent two reads that are part of a pair from the sequencing data. Below is the analysis of these two records:

Record 1 VH00358:89:AAFC5MTM5:1:1101:62048:1038_:N:0:TTTCTCTA+CTCGACG 83 NZ_AKKR01000009 78640 60 110M20S = 78640-110 ATCATAACCGTCGGCTGAATAAGCAAGATTATAAAACCCTCACTCTGGCGGCTTTAGGCGGAGCGCTTGAGTTTTATGACTTCATTATTTTCGTTTTCTTTGCTGCGGTTGATGAGAGCGGGGGTGTAGG * NM:i:0 MD:Z:110 MC:Z:110M20S AS:i:110 XS:i:0

Record 2 VH00358:89:AAFC5MTM5:1:1101:62048:1038_:N:0:TTTCTCTA+CTCGACG 163 NZ_AKKR01000009 78640 60 110M20S = 78640110 ATCATAACCGTCGGCTGAATAAGCAAGATTATAAAACCCTCACTCTGGCGGCTTTAGGCGGAGCGCTTGAGTTTTATGACTTCATTATTTTCGTTTTCTTTGCTGCGGTTGATGAGAGCTTTGTTGTAGG * NM:i:0 MD:Z:110 MC:Z:110M20S AS:i:110 XS:i:0

Analysis

Read Names: Both reads have the same name: VH00358:89:AAFC5MTM5:1:1101:62048:1038_:N:0:TTTCTCTA+CTCGACG. This indicates they are part of the same pair.
Flags: First Record (83): 83 indicates the read is the second read in the pair (0x40 is not set), and it is mapped in reverse complement (0x10 is set). It also indicates that the read is part of a properly paired alignment (0x2 is set). Second Record (163): 163 indicates the read is the first read in the pair (0x40 is set), and it is mapped in reverse complement (0x10 is set). It also indicates that the read is part of a properly paired alignment (0x2 is set).
Reference Name: Both reads are mapped to the same reference sequence: NZ_AKKR01000009.
Position: Both reads are mapped to the same position: 78640.
Mapping Quality: Both reads have a mapping quality score of 60, which indicates high confidence in the mapping.
CIGAR String: Both reads have the same CIGAR string: 110M20S. This means the first 110 bases are matched/mismatched (alignment) and the last 20 bases are soft-clipped (not aligned to the reference).
Mate Information: The = symbol indicates that the mate is mapped to the same reference. The 78640-110 and 78640110 values indicate the positions of the mates.
Sequence: The sequences are nearly identical for the aligned parts: Record 1: ATCATAACCGTCGGCTGAATAAGCAAGATTATAAAACCCTCACTCTGGCGGCTTTAGGCGGAGCGCTTGAGTTTTATGACTTCATTATTTTCGTTTTCTTTGCTGCGGTTGATGAGAGCGGGGGTGTAGG Record 2: ATCATAACCGTCGGCTGAATAAGCAAGATTATAAAACCCTCACTCTGGCGGCTTTAGGCGGAGCGCTTGAGTTTTATGACTTCATTATTTTCGTTTTCTTTGCTGCGGTTGATGAGAGCTTTGTTGTAGG
Optional Fields: NM:i:0: Number of mismatches is 0. MD:Z:110: Mismatch string indicating no mismatches in the 110 aligned bases. MC:Z:110M20S: CIGAR string for the mate. AS:i:110: Alignment score for the read. XS:i:0: Suboptimal alignment score.

Conclusion

These two SAM records represent a pair of reads mapped to the same position on the reference sequence NZ_AKKR01000009. Both reads are aligned with high confidence (mapping quality 60), and they have no mismatches in the aligned portion of their sequences. The 110M20S CIGAR string indicates that the first 110 bases are aligned, and the last 20 bases are soft-clipped. Both reads are reverse complemented.

Microbial bioinformatics

Microbial bioinformatics uses computational tools to analyze genomes, track evolution, and study functions in microorganisms, including bacteria and viruses.

Author Archives: gene_x

transponson saturation test

transponson off-target analysis

Draw plots for miRNAs generated by COMPSRA

Draw plots for piRNA generated by COMPSRA

Draw plots for snRNA generated by COMPSRA

高维数据推断：简介

Long-Read Transposon Insertion-site Sequencing vs. short-read TIS methods

散光和老花

Logics of Transposon sequencing

SAM-format