Author Archives: gene_x

YopQ Secretion Boxplot and Fitting Function

YopQ_Secretion_Barplot

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
import seaborn as sns

# Sample data
data = {
    "Sample": ["1 (04.10.23)", "2 (27.20.23)", "3 (30.11.23)", "1 (SR003)", "2 (SR004)", "3 (SR021)"],
    "YopQ1-5": [0.17864353, 0.08980754, 0.13318754, 0.039536, 0.07075792, 0.04852611],
    "YopQ1-10": [0.3663838, 0.45519364, 0.29828942, 0.18891885, 0.38740741, 0.30086734],
    "YopQ1-15": [0.49962418, 0.96844132, 0.77290941, 0.4739692, 0.74691256, 1.22698631],
    "YopQ1-25": [0.9791465, 0.85120313, 0.97336014, 0.4107113, 1.31386354, 0.69474945],
    "YopQ1-35": [1.32707017, 1.23130563, 1.23649531, 0.68362804, 0.78614696, 1.14339001],
    "YopQ1-45": [1.42828421, 1.32843428, 1.68480714, 0.84469081, 0.79365431, 1.58619928],
    "YopQ1-55": [0.83107654, 1.24622023, 0.92185733, 1.05695509, 0.77422497, 1.28645523],
    "YopQ1-65": [1.62036726, 1.21848867, 0.94667898, 0.91017609, 1.181266, 1.2667003],
    "YopQ1-76": [1.22602291, 0.85284654, 1.15612041, 0.82651662, 1.6654101, 1.17553764],
    "YopQ1-182(wt)": [1, 1, 1, 1, 1, 1]
}

# Create DataFrame
df = pd.DataFrame(data)

# Extract columns for plotting
yopq1_columns = df.columns[1:]
x_categories = np.arange(len(yopq1_columns))  # Numeric representation for curve fitting

# Calculate means and standard deviations for the bar plot
means = df.iloc[:, 1:].mean()
stds = df.iloc[:, 1:].std()

# Create the bar plot with error bars
plt.figure(figsize=(12, 6))
plt.bar(x_categories, means, yerr=stds, color='grey', capsize=5, edgecolor="black", zorder=2)

# Add jittered points using swarmplot for visualizing individual data points
sns.swarmplot(data=df.iloc[:, 1:], color="black", size=6, zorder=3, edgecolor="white")

# Define exponential plateau function
def plateau_exponential_func(x, a, b, c):
    return a * (1 - np.exp(-b * x)) + c

# Fit data using the mean of each group
popt, _ = curve_fit(plateau_exponential_func, x_categories, means.values, p0=[1, 0.1, 0.5], maxfev=10000)

# Generate fit curve data
x_fit = np.linspace(0, len(yopq1_columns) - 1, 100)
y_fit = plateau_exponential_func(x_fit, *popt)

# Plot fitting curve
plt.plot(x_fit, y_fit, 'r-', label=f'Fitting function\n($y = {popt[0]:.2f} \\cdot (1 - e^{{-{popt[1]:.2f}x}}) + {popt[2]:.2f}$)', zorder=4)

# Update labels and styles
plt.ylabel('Relative Secretion', fontsize=12)
plt.xticks(ticks=x_categories, labels=[
    '$YopQ^{1-5}$', '$YopQ^{1-10}$', '$YopQ^{1-15}$', '$YopQ^{1-25}$',
    '$YopQ^{1-35}$', '$YopQ^{1-45}$', '$YopQ^{1-55}$', '$YopQ^{1-65}$',
    '$YopQ^{1-76}$', '$YopQ^{1-182}(wt)$'
], rotation=15)

# Grid and legend
plt.grid(True, axis='y', linestyle='--', linewidth=0.7, zorder=0)
plt.legend(title='', loc='upper left')
plt.tight_layout()
plt.subplots_adjust(left=0.1)
plt.show()

## Save plot with high resolution
#plt.savefig("high_resolution_plot.png", dpi=300)  # Save with 300 DPI

YopQ_Secretion_Boxplot

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
import seaborn as sns

#Explanation of Key Parts:
#
#    sns.swarmplot: This function automatically adds jitter to the points, ensuring they don't overlap. By setting color="black", the points will be displayed in black, and size=5 sets the size of the points.
#
#    Boxplot: The sns.boxplot is still used to display the boxplot for the data distribution in grey (color="darkgrey").
#
#    Fitting Curve: The exponential curve fitting remains the same as before.
#
#    Legends: The legend for the points has been removed, and only the fitting function legend is shown.
#
#Benefits of Using swarmplot:
#
#    Automatic Jittering: It provides a nice way to scatter points without overlap, as it automatically adjusts the points’ positions along the x-axis.
#    Combining with Boxplot: You can easily combine it with the boxplot for showing both the distribution and the individual points.

# 数据集
data = {
    "Sample": ["1 (04.10.23)", "2 (27.20.23)", "3 (30.11.23)", "1 (SR003)", "2 (SR004)", "3 (SR021)"],
    "YopQ1-5": [0.17864353, 0.08980754, 0.13318754, 0.039536, 0.07075792, 0.04852611],
    "YopQ1-10": [0.3663838, 0.45519364, 0.29828942, 0.18891885, 0.38740741, 0.30086734],
    "YopQ1-15": [0.49962418, 0.96844132, 0.77290941, 0.4739692, 0.74691256, 1.22698631],
    "YopQ1-25": [0.9791465, 0.85120313, 0.97336014, 0.4107113, 1.31386354, 0.69474945],
    "YopQ1-35": [1.32707017, 1.23130563, 1.23649531, 0.68362804, 0.78614696, 1.14339001],
    "YopQ1-45": [1.42828421, 1.32843428, 1.68480714, 0.84469081, 0.79365431, 1.58619928],
    "YopQ1-55": [0.83107654, 1.24622023, 0.92185733, 1.05695509, 0.77422497, 1.28645523],
    "YopQ1-65": [1.62036726, 1.21848867, 0.94667898, 0.91017609, 1.181266, 1.2667003],
    "YopQ1-76": [1.22602291, 0.85284654, 1.15612041, 0.82651662, 1.6654101, 1.17553764],
    "YopQ1-182(wt)": [1, 1, 1, 1, 1, 1]
}

# 创建DataFrame
df = pd.DataFrame(data)

# 抽取YopQ1列
yopq1_columns = df.columns[1:]
x_categories = np.arange(len(yopq1_columns))  # 数值化类别索引

# 创建箱线图(灰色)
plt.figure(figsize=(12, 6))
sns.boxplot(data=df.iloc[:, 1:], color="darkgrey", zorder=2)  # Boxplot in dark grey

# Add jittered points using swarmplot (black points)
sns.swarmplot(data=df.iloc[:, 1:], color="black", size=6, zorder=3)  # Swarmplot adds jitter

# 定义指数趋近函数
def plateau_exponential_func(x, a, b, c):
    return a * (1 - np.exp(-b * x)) + c

# 计算每个类别中值并进行拟合
y_data = df.iloc[:, 1:].median().values

# 拟合指数趋近模型
popt, pcov = curve_fit(plateau_exponential_func, x_categories, y_data, p0=[1, 0.1, 0.5], maxfev=10000)
a, b, c = popt

# 生成拟合曲线数据
x_fit = np.linspace(0, len(yopq1_columns) - 1, 100)
y_fit = plateau_exponential_func(x_fit, a, b, c)

# 绘制拟合曲线
plt.plot(x_fit, y_fit, 'r-', label=f'Fitting function\n($y = {a:.2f} \\cdot (1 - e^{{-{b:.2f}x}}) + {c:.2f}$)', zorder=4)

# 更新轴标签
plt.xlabel('')   #YopQ1 (Categories)
plt.ylabel('Relative Secretion', fontsize=12)
plt.title('')    #Boxplot of YopQ1 Measurements with Plateau Exponential Fit and Raw Data

# 显示类别标签
labels = ['$YopQ^{1-5}$', '$YopQ^{1-10}$', '$YopQ^{1-15}$', '$YopQ^{1-25}$', '$YopQ^{1-35}$', '$YopQ^{1-45}$', '$YopQ^{1-55}$', '$YopQ^{1-65}$', '$YopQ^{1-76}$', '$YopQ^{1-182}(wt)$']
plt.xticks(ticks=x_categories, labels=labels, rotation=15)

# Remove the legend for the points, keep for fitting function
plt.legend(title='', loc='upper left')  # Only keep the legend for the fitting function

plt.grid(True)
plt.tight_layout()
plt.subplots_adjust(left=0.1)  # Increase the left margin to avoid cutting off ylabel
plt.show()

## Create a figure with specified size
#fig = plt.figure(figsize=(12, 9))  # (12, 9) corresponds to 1200x900 in pixels at 100 dpi
## Your plotting code goes here (e.g., boxplot, scatter plot, etc.)
## Save as PNG
#fig.savefig("plot.png", dpi=100)  # The dpi is 100, so figsize=(12, 9) will be 1200x900 pixels
## Save as SVG
#fig.savefig("plot.svg")
## Close the plot to release resources
#plt.close(fig)

RNA-seq Tam on CP059040.1 (Acinetobacter baumannii strain ATCC 19606)

nf-core_rnaseq_on_CP059040

http://xgenes.com/article/article-content/209/rna-seq-skin-organoids-on-grch38-chrhsv1-final/ http://xgenes.com/article/article-content/157/prepare-virus-gtf-for-nextflow-run/

Methods

Data was processed using nf-core/rnaseq v3.12.0 (doi: https://doi.org/10.5281/zenodo.1400710) of the nf-core collection of workflows (Ewels et al., 2020).

The pipeline was executed with Nextflow v22.10.5 (Di Tommaso et al., 2017) with the following command:

nextflow run rnaseq/main.nf –input samplesheet.csv –outdir results –fasta /home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040.fasta –gff /home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040.gff -profile docker -resume –max_cpus 55 –max_memory 512.GB –max_time 2400.h –save_align_intermeds –save_unaligned –save_reference –aligner star_salmon –gtf_group_features gene_id –gtf_extra_attributes gene_name –featurecounts_group_type gene_biotype –featurecounts_feature_type transcript

  1. prepare reference

    They are wildtype strains grown in different medium.
    AUM - artificial urine medium
    Urine - human urine
    MHB - Mueller-Hinton broth
    
    mkdir raw_data; cd raw_data
    ln -s ../X101SC24105589-Z01-J001/01.RawData/AUM-1/AUM-1_1.fq.gz AUM_r1_R1.fq.gz
    ln -s ../X101SC24105589-Z01-J001/01.RawData/AUM-1/AUM-1_2.fq.gz AUM_r1_R2.fq.gz
    ln -s ../X101SC24105589-Z01-J001/01.RawData/AUM-2/AUM-2_1.fq.gz AUM_r2_R1.fq.gz
    ln -s ../X101SC24105589-Z01-J001/01.RawData/AUM-2/AUM-2_2.fq.gz AUM_r2_R2.fq.gz
    ln -s ../X101SC24105589-Z01-J001/01.RawData/AUM-3/AUM-3_1.fq.gz AUM_r3_R1.fq.gz
    ln -s ../X101SC24105589-Z01-J001/01.RawData/AUM-3/AUM-3_2.fq.gz AUM_r3_R2.fq.gz
    
    ln -s ../X101SC24105589-Z01-J001/01.RawData/MHB-1/MHB-1_1.fq.gz MHB_r1_R1.fq.gz
    ln -s ../X101SC24105589-Z01-J001/01.RawData/MHB-1/MHB-1_2.fq.gz MHB_r1_R2.fq.gz
    ln -s ../X101SC24105589-Z01-J001/01.RawData/MHB-2/MHB-2_1.fq.gz MHB_r2_R1.fq.gz
    ln -s ../X101SC24105589-Z01-J001/01.RawData/MHB-2/MHB-2_2.fq.gz MHB_r2_R2.fq.gz
    ln -s ../X101SC24105589-Z01-J001/01.RawData/MHB-3/MHB-3_1.fq.gz MHB_r3_R1.fq.gz
    ln -s ../X101SC24105589-Z01-J001/01.RawData/MHB-3/MHB-3_2.fq.gz MHB_r3_R2.fq.gz
    
    ln -s ../X101SC24105589-Z01-J001/01.RawData/Urine-1/Urine-1_1.fq.gz Urine_r1_R1.fq.gz
    ln -s ../X101SC24105589-Z01-J001/01.RawData/Urine-1/Urine-1_2.fq.gz Urine_r1_R2.fq.gz
    ln -s ../X101SC24105589-Z01-J001/01.RawData/Urine-2/Urine-2_1.fq.gz Urine_r2_R1.fq.gz
    ln -s ../X101SC24105589-Z01-J001/01.RawData/Urine-2/Urine-2_2.fq.gz Urine_r2_R2.fq.gz
    ln -s ../X101SC24105589-Z01-J001/01.RawData/Urine-3/Urine-3_1.fq.gz Urine_r3_R1.fq.gz
    ln -s ../X101SC24105589-Z01-J001/01.RawData/Urine-3/Urine-3_2.fq.gz Urine_r3_R2.fq.gz
  2. (Optional) using trinity to find the most closely reference

    Trinity --seqType fq --max_memory 50G --left trimmed/wt_r1_R1.fastq.gz  --right trimmed/wt_r1_R2.fastq.gz --CPU 12
    
    #https://www.genome.jp/kegg/tables/br08606.html#prok
    acb     KGB     Acinetobacter baumannii ATCC 17978  2007    GenBank
    abm     KGB     Acinetobacter baumannii SDF     2008    GenBank
    aby     KGB     Acinetobacter baumannii AYE     2008    GenBank
    abc     KGB     Acinetobacter baumannii ACICU   2008    GenBank
    abn     KGB     Acinetobacter baumannii AB0057  2008    GenBank
    abb     KGB     Acinetobacter baumannii AB307-0294  2008    GenBank
    abx     KGB     Acinetobacter baumannii 1656-2  2012    GenBank
    abz     KGB     Acinetobacter baumannii MDR-ZJ06    2012    GenBank
    abr     KGB     Acinetobacter baumannii MDR-TJ  2012    GenBank
    abd     KGB     Acinetobacter baumannii TCDC-AB0715     2012    GenBank
    abh     KGB     Acinetobacter baumannii TYTH-1  2012    GenBank
    abad    KGB     Acinetobacter baumannii D1279779    2013    GenBank
    abj     KGB     Acinetobacter baumannii BJAB07104   2013    GenBank
    abab    KGB     Acinetobacter baumannii BJAB0715    2013    GenBank
    abaj    KGB     Acinetobacter baumannii BJAB0868    2013    GenBank
    abaz    KGB     Acinetobacter baumannii ZW85-1  2013    GenBank
    abk     KGB     Acinetobacter baumannii AbH12O-A2   2014    GenBank
    abau    KGB     Acinetobacter baumannii AB030   2014    GenBank
    abaa    KGB     Acinetobacter baumannii AB031   2014    GenBank
    abw     KGB     Acinetobacter baumannii AC29    2014    GenBank
    abal    KGB     Acinetobacter baumannii LAC-4   2015    GenBank
  3. Downloading CP059040.fasta and CP059040.gff from GenBank

  4. (Optional) Preparing CP059040.fasta, CP059040_gene.gff3 and CP059040.bed

    #Reference genome: https://www.ncbi.nlm.nih.gov/nuccore/CP059040
    cp /media/jhuang/Elements2/Data_Tam_RNASeq3/CP059040.fasta .     # Elements (Anna C.arnes)
    cp /media/jhuang/Elements2/Data_Tam_RNASeq3/CP059040_gene.gff3 .
    cp /media/jhuang/Elements2/Data_Tam_RNASeq3/CP059040_gene.gtf .
    cp /media/jhuang/Elements2/Data_Tam_RNASeq3/CP059040.bed .
    rsync -a -P CP059040.fasta jhuang@hamm:~/DATA/Data_Tam_RNAseq_2024/
    rsync -a -P CP059040_gene.gff3 jhuang@hamm:~/DATA/Data_Tam_RNAseq_2024/
    rsync -a -P CP059040.bed jhuang@hamm:~/DATA/Data_Tam_RNAseq_2024/
    (base) jhuang@WS-2290C:/media/jhuang/Elements2/Data_Tam_RNASeq3$ find . -name "CP059040*"
    ./CP059040.fasta
    ./CP059040.bed
    ./CP059040.gb
    ./CP059040.gff3
    ./CP059040.gff3_backup
    ./CP059040_full.gb
    ./CP059040_gene.gff3
    ./CP059040_gene.gtf
    ./CP059040_gene_old.gff3
    ./CP059040_rRNA.gff3
    ./CP059040_rRNA_v.gff3
    
    # ---- REF: Acinetobacter baumannii ATCC 17978 (DEBUG, gene_name failed) ----
    #gffread -E -F -T GCA_000015425.1_ASM1542v1_genomic.gff -o GCA_000015425.1_ASM1542v1_genomic.gtf_
    #grep "CDS" GCA_000015425.1_ASM1542v1_genomic.gtf_ > GCA_000015425.1_ASM1542v1_genomic.gtf
    #sed -i -e "s/\tCDS\t/\texon\t/g" GCA_000015425.1_ASM1542v1_genomic.gtf
    #gffread -E -F --bed GCA_000015425.1_ASM1542v1_genomic.gtf -o GCA_000015425.1_ASM1542v1_genomic.bed
    
    grep "locus_tag" GCA_000015425.1_ASM1542v1_genomic.gtf_ > GCA_000015425.1_ASM1542v1_genomic.gtf
    sed -i -e "s/\ttranscript\t/\texon\t/g" GCA_000015425.1_ASM1542v1_genomic.gtf # or using fc_count_type=transcript
    sed -i -e "s/\tgene_name\t/\tName\t/g" GCA_000015425.1_ASM1542v1_genomic.gtf
    gffread -E -F --bed GCA_000015425.1_ASM1542v1_genomic.gtf -o GCA_000015425.1_ASM1542v1_genomic.bed
    #grep "gene_name" GCA_000015425.1_ASM1542v1_genomic.gtf | wc -l  #69=3887-3803
    
    cp CP059040.gff3 CP059040_backup.gff3
    sed -i -e "s/\tGenbank\tgene\t/\tGenbank_gene\t/g" CP059040.gff3
    grep "Genbank_gene" CP059040.gff3 > CP059040_gene.gff3
    sed -i -e "s/\tGenbank_gene\t/\tGenbank\tgene\t/g" CP059040_gene.gff3
    
    #3796-3754=42--> they are pseudogene since grep "pseudogene" CP059040.gff3 | wc -l = 42
    # --------------------------------------------------------------------------------------------------------------------------------------------------
    # ---------- PREPARING gff3 file including gene_biotype=protein_coding+gene_biotype=tRNA = total(3754)) and gene_biotype=pseudogene(42) ------------
    cp CP059040.gff3 CP059040_backup.gff3
    sed -i -e "s/\tGenbank\tgene\t/\tGenbank_gene\t/g" CP059040.gff3
    grep "Genbank_gene" CP059040.gff3 > CP059040_gene.gff3
    sed -i -e "s/\tGenbank_gene\t/\tGenbank\tgene\t/g" CP059040_gene.gff3
    grep "gene_biotype=pseudogene" CP059040.gff3_backup >> CP059040_gene.gff3    #-->3796
    
    #The whole point of the GTF format was to standardise certain aspects that are left open in GFF. Hence, there are many different valid ways to encode the same information in a valid GFF format, and any parser or converter needs to be written specifically for the choices the author of the GFF file made. For example, a GTF file requires the gene ID attribute to be called "gene_id", while in GFF files, it may be "ID", "Gene", something different, or completely missing.
    # from gff3 to gtf
    sed -i -e "s/\tID=gene-/\tgene_id \"/g" CP059040_gene.gtf
    sed -i -e "s/;/\"; /g" CP059040_gene.gtf
    sed -i -e "s/=/=\"/g" CP059040_gene.gtf
    
    #sed -i -e "s/\n/\"\n/g" CP059040_gene.gtf
    #using editor instead!
    
    #The following is GTF-format.
    CP000521.1      Genbank exon    95      1492    .       +       .       transcript_id "gene0"; gene_id "gene0"; Name "A1S_0001"; gbkey "Gene"; gene_biotype "protein_coding"; locus_tag "A1S_0001";
    
    #NZ_MJHA01000001.1       RefSeq  region  1       8663    .       +       .       ID=id0;Dbxref=taxon:575584;Name=unnamed1;collected-by=IG Schaub;collection-date=1948;country=USA: Vancouver;culture-collection=ATCC:19606;gbkey=Src;genome=plasmid;isolation-source=urine;lat-lon=37.53 N 75.4 W;map=unlocalized;mol_type=genomic DNA;nat-host=Homo sapiens;plasmid-name=unnamed1;strain=ATCC 19606;type-material=type strain of Acinetobacter baumannii
    #NZ_MJHA01000001.1       RefSeq  gene    228     746     .       -       .       ID=gene0;Name=BIT33_RS00005;gbkey=Gene;gene_biotype=protein_coding;locus_tag=BIT33_RS00005;old_locus_tag=BIT33_18795
    #NZ_MJHA01000001.1       Protein Homology        CDS     228     746     .       -       0       ID=cds0;Parent=gene0;Dbxref=Genbank:WP_000839337.1;Name=WP_000839337.1;gbkey=CDS;inference=COORDINATES: similar to AA sequence:RefSeq:WP_000839337.1;product=hypothetical protein;protein_id=WP_000839337.1;transl_table=11
    
    ##gff-version 3
    ##sequence-region CP059040.1 1 3980852
    ##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=470
    
    gffread -E -F --bed CP059040.gff3 -o CP059040.bed    #-->3796
    ##prepare the GTF-format (see above) --> ERROR! ----> using CP059040.gff3
    ##stringtie adeIJ.abx_r1.sorted.bam -o adeIJ.abx_r1.sorted_transcripts.gtf -v -G /media/jhuang/Elements/Data_Tam_RNASeq3/CP059040.gff3 -A adeIJ.abx_r1.sorted.gene_abund.txt -C adeIJ.abx_r1.sorted.bam.cov_refs.gtf -e -b adeIJ.abx_r1.sorted_ballgown
    #[01/21 10:57:46] Loading reference annotation (guides)..
    #GFF warning: merging adjacent/overlapping segments of gene-H0N29_00815 on CP059040.1 (179715-179786, 179788-180810)
    #[01/21 10:57:46] 3796 reference transcripts loaded.
    #Default stack size for threads: 8388608
    #WARNING: no reference transcripts found for genomic sequence "gi|1906906720|gb|CP059040.1|"! (mismatched reference names?)
    #WARNING: no reference transcripts were found for the genomic sequences where reads were mapped!
    #Please make sure the -G annotation file uses the same naming convention for the genome sequences.
    #[01/21 10:58:30] All threads finished.
    
    #  ERROR: failed to find the gene identifier attribute in the 9th column of the provided GTF file.
    #  The specified gene identifier attribute is 'Name'
    #  An example of attributes included in your GTF annotation is 'ID=exon-H0N29_00075-1;Parent=rna-H0N29_00075;gbkey=rRNA;locus_tag=H0N29_00075;product=16S ribosomal RNA'
    #  The program has to termin
    
    #  ERROR: failed to find the gene identifier attribute in the 9th column of the provided GTF file.
    #  The specified gene identifier attribute is 'gene_biotype'
    #  An example of attributes included in your GTF annotation is 'ID=exon-H0N29_00075-1;Parent=rna-H0N29_00075;gbkey=rRNA;locus_tag=H0N29_00075;product=16S ribosomal RNA'
    #  The program has to terminate.
    
    #grep "ID=cds-" CP059040.gff3 | wc -l
    #grep "ID=exon-" CP059040.gff3 | wc -l
    #grep "ID=gene-" CP059040.gff3 | wc -l   #the same as H0N29_18980/5=3796
    grep "gbkey=" CP059040.gff3 | wc -l  7695
    grep "ID=id-" CP059040.gff3 | wc -l  5
    grep "locus_tag=" CP059040.gff3 | wc -l    7689
    #...
    cds   3701                             locus_tag=xxxx, no gene_biotype
    exon   96                              locus_tag=xxxx, no gene_biotype
    gene   3796                            locus_tag=xxxx, gene_biotype=xxxx,
    id  (riboswitch+direct_repeat,5)       both no --> ignoring them!!  # grep "ID=id-" CP059040.gff3
    rna    96                              locus_tag=xxxx, no gene_biotype
    ------------------
        7694
    
    cp CP059040.gff3_backup CP059040.gff3
    grep "^##" CP059040.gff3 > CP059040_gene.gff3
    grep "ID=gene" CP059040.gff3 >> CP059040_gene.gff3
    #!!!!VERY_IMPORTANT!!!!: change type '\tCDS\t' to '\texon\t'!
    sed -i -e "s/\tgene\t/\texon\t/g" CP059040_gene.gff3
  5. Preparing the directory trimmed

    mkdir trimmed trimmed_unpaired;
    for sample_id in AUM_r1 AUM_r2 AUM_r3 Urine_r1 Urine_r2 Urine_r3 MHB_r1 MHB_r2 MHB_r3; do \
    for sample_id in MHB_r1 MHB_r2 MHB_r3; do \
            java -jar /home/jhuang/Tools/Trimmomatic-0.36/trimmomatic-0.36.jar PE -threads 100 raw_data/${sample_id}_R1.fq.gz raw_data/${sample_id}_R2.fq.gz trimmed/${sample_id}_R1.fq.gz trimmed_unpaired/${sample_id}_R1.fq.gz trimmed/${sample_id}_R2.fq.gz trimmed_unpaired/${sample_id}_R2.fq.gz ILLUMINACLIP:/home/jhuang/Tools/Trimmomatic-0.36/adapters/TruSeq3-PE-2.fa:2:30:10:8:TRUE LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36 AVGQUAL:20; done 2> trimmomatic_pe.log;
    done
  6. Preparing samplesheet.csv

    sample,fastq_1,fastq_2,strandedness
    AUM_r1,AUM_r1_R1.fq.gz,AUM_r1_R2.fq.gz,auto
    AUM_r2,AUM_r2_R1.fq.gz,AUM_r2_R2.fq.gz,auto
    AUM_r3,AUM_r3_R1.fq.gz,AUM_r3_R2.fq.gz,auto
    MHB_r1,MHB_r1_R1.fq.gz,MHB_r1_R2.fq.gz,auto
    MHB_r2,MHB_r2_R1.fq.gz,MHB_r2_R2.fq.gz,auto
    MHB_r3,MHB_r3_R1.fq.gz,MHB_r3_R2.fq.gz,auto
    Urine_r1,Urine_r1_R1.fq.gz,Urine_r1_R2.fq.gz,auto
    Urine_r2,Urine_r2_R1.fq.gz,Urine_r2_R2.fq.gz,auto
    Urine_r3,Urine_r3_R1.fq.gz,Urine_r3_R2.fq.gz,auto
  7. nextflow run

    #Example1: http://xgenes.com/article/article-content/157/prepare-virus-gtf-for-nextflow-run/
    
    # ---- SUCCESSFUL with directly downloaded gff3 and fasta from NCBI using docker ----
    docker pull nfcore/rnaseq
    ln -s /home/jhuang/Tools/nf-core-rnaseq-3.12.0/ rnaseq
    (host_env) jhuang@WS-2290C:~/DATA/Data_Tam_RNAseq_2024$ /usr/local/bin/nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results    --fasta "/home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040.fasta" --gff "/home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040.gff"        -profile docker -resume  --max_cpus 55 --max_memory 512.GB --max_time 2400.h    --save_align_intermeds --save_unaligned --save_reference    --aligner 'star_salmon'    --gtf_group_features 'gene_id'  --gtf_extra_attributes 'gene_name' --featurecounts_group_type 'gene_biotype' --featurecounts_feature_type 'transcript'
    /usr/local/bin/nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results    --fasta "/home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040.fasta" --gff "/home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040.gff"        -profile docker -resume  --max_cpus 55 --max_memory 512.GB --max_time 2400.h    --save_align_intermeds --save_unaligned --save_reference    --aligner 'star_salmon'    --gtf_group_features 'gene_id'  --gtf_extra_attributes 'gene_name' --featurecounts_group_type 'gene_biotype' --featurecounts_feature_type 'transcript'
    
    # -- DEBUG_1 --
    #After checking the record (see below) in results/genome/CP059040.gtf, we have to change 'exon' to 'transcript', the default values are --gtf_group_features 'gene_id'  --gtf_extra_attributes 'gene_name' --featurecounts_group_type 'gene_biotype' --featurecounts_feature_type 'exon'
    #In ./results/genome/CP059040.gtf e.g. "CP059040.1      Genbank transcript      1       1398    .       +       .       transcript_id "gene-H0N29_00005"; gene_id "gene-H0N29_00005"; gene_name "dnaA"; Name "dnaA"; gbkey "Gene"; gene "dnaA"; gene_biotype "protein_coding"; locus_tag "H0N29_00005";"

Function of Small basic protein (Sbp) in Staphylococcus epidermidis biofilm matrix assembly: molecular mechanisms and spatio-temporal patterning

https://tu-dresden.de/mn/biologie/allgemeine_mikrobiologie/spp2389/research-groups/function-of-small-basic-protein-sbp-in-staphylococcus-epidermidis-biofilm-matrix-assembly-molecular-mechanisms-and-spatio-temporal-patterning

  • The assembly of multi-layered bacterial biofilm architectures is a prime example in which bacterial single cell traits are integrated and translated into population behaviour, which in turn provides cues for individual cells to prevail in changing potentially harsh environments.

  • A hallmark of biofilm formation is the production of an extracellular matrix, and in fact, biofilm matrix production and its functional consequences are outstanding examples of emergent functions of bacterial multicellularity.

  • The biofilm matrix consists of a plethora of various biomolecules, referred to as the “matrixome”.

  • Recently, we have identified 18 kDa Small basic protein (Sbp) from S. epidermidis biofilm matrix preparations.

  • Sbp accumulates at the biofilm – substratum interface and in the intercellular space and thereby functionally supports biofilm formation.

  • In addition, a sbp knock-mutant exhibits striking stationary growth phase defects and increased susceptibility against glycopeptides, both phenotypes being consistent with stringent response gene expression patterns as identified in transcriptome analysis.

  • The available evidence supports the hypothesis that Sbp matrix assembly occurs along temporally and spatially orchestrated events, providing an environment which fundamentally affects bacterial physiology.

  • We will use high resolution imaging in combination with biochemical approaches and single cell techniques to dissect principles of Sbp biofilm matrix assembly, identify and characterize involved interaction networks, and to provide insights into the importance of Sbp biofilm matrices for staphylococcal physiology.

  • In the overarching context, the project will provide molecular insights into general principles of dynamics driving bacterial multicellularity, and their inter-relation with bacterial physiology on population and single cell levels.

Cluster Analysis and Quantification of Segmented Volumes from IMARIS 3D Segmentation Data

    % the goal of this script is to determine cluster information starting from
    % the results extracted from IMARIS 3D segmentation 
    clear all

    clast_par=1.3; 
    % filename = '20241107 1457 18h_0002-1.tif DAPI.xls';
    % filename = '202406 1457dsbp pRB473_1.xls';
    % filename = '1457 1h 2.xls';
    %filename = '20231214 1457dsbp pRB473 sbp alfa 508_003-1.xls';
    filename = '20240321 1457dsbppRB473sbp ALFA508 0 1.xls';
    %filename = '202406 1457dsbp pRB473_1.xls';
    %The only two input necessary are the name of the file to be analized i.e file name and parameter that control how far away two
    % segmented volumes should be. clast_par < 1 means the distance between two segmented volume should be inferior to the sum of their mean radius
    % clast_par =1 the distance between two segmented volumes should be equal to the sum of their mean radius
    %  clast_par >1 the distance between two segmented volumes should be bigger than the sum of their mean radius

    % the data to be quantified is stored in an excel multipage file

    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% segmented volumes radius determination
    %read_volume=xlsread(filename,32);% opens page where the volumes are listed and extracts the volume information for each cluster
    read_volume=readtable(filename, 'Sheet', 32);
    siz_point=size(read_volume);     %it extracts the information about how many volumes were quantified 
    radius_nn(1:siz_point(1))=0;     %it creates a vector that will contain the estimated radius of all the detected volumes
    %radius_nn=1000*( read_volume(:,1)*3/(4*pi)).^(1/3);
    radius_nn = 1000 * (read_volume.Volume * 3 / (4 * pi)).^(1 / 3);
    % the radius is calculated approximating the quantified volumes with be the volumes of a sphere
    Mean_radius_nm = mean(radius_nn)   % displays the mean radius of the segmented volumes
    %please note that the radius of the segmented volumes is a good
    %approximation of the full width at half maximum of the fluorescence object
    Variance_radius_n = std(radius_nn)  % displays the variance of the radius of the segmented volumes
    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% determination of
    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% the distance between the segmented volumes and sorting them between volumes belonging to the same cluster and volumes not belonging to the same cluster.
    % center_of_mass=xlsread(filename,5);% open page where the coordinate of the center of mass of the segmented volumes are listed
    % 
    % total_num=size(center_of_mass); % again number of detected spots just to check if it match with the previous values (not necessary)
    % total_resolved_spots = total_num(1) % displays the total number of volumes quantified
    % %center_of_mass=xlsread(filename,5); % open the page of the file where the values of the centre of mass of the detected volumes are stored and write them into a matrix
    % distances(1:siz_point(1),1:siz_point(1))=0;% produce the matrix were the distances between detected volumes will be stored
    % cluster(1:siz_point(1),1:siz_point(1))=0;%produce a matrix where will be stored the information wether two segmented volumes are close together (distance fulfil cluster parameter) cluster (i,j)=1
    % %or not cluster(i,j)=0;
    % 
    % %%calculates the matrix of the distance between segmented volumes i and j as the distance between their centre of mass
    % %the factor 1000 converts the distance from micronmeter to nanometer
    % for i=1:siz_point(1)
    % for j=1:siz_point(1)
    % 
    % distances(i,j)= sqrt( (center_of_mass(i,1)-center_of_mass(j,1))^2 + (center_of_mass(i,2)-center_of_mass(j,2))^2 + (center_of_mass(i,3)-center_of_mass(j,3))^2 )*1000;
    % end
    % end
    % 
    % %Here is determined if two volumes are clustering together i.e. if their distance <= the sum of their radius multiplied for clust_par defined above, or not
    % for i=1:siz_point(1)
    % for j=1:siz_point(1)
    % 
    % if i==j
    %     cluster(i,j)=0;
    % else
    %     if  distances(i,j)<=clast_par*(radius_nn(i)+radius_nn(j))
    %     cluster(i,j)=1;
    %     end
    % end
    % 
    % end
    % end
    % cluster3=cluster; %%%here the cluster matrix is doubled for keep avaliable its information at the end of the script

    % Read the data from Sheet 5 (Center of Homogeneous Mass)
    center_of_mass = readtable(filename, 'Sheet', 5);

    % Display the modified variable names (these are the names used by MATLAB)
    disp('Modified Variable Names:');
    disp(center_of_mass.Properties.VariableNames);

    % Display the original column headers (as stored in the Excel file)
    disp('Original Column Names:');
    disp(center_of_mass.Properties.VariableDescriptions);

    % Define the starting row (based on the repeating headers)
    %start_row = 3; % Adjust this according to your actual data structure

    % Read the table, starting from the correct row
    %center_of_mass = readtable(filename, 'Sheet', 5, 'Range', sprintf('A%d', start_row));

    % Optional: Remove rows that are still unwanted (if any)
    % For example, if there are rows with unwanted headers in the middle of the data:
    %center_of_mass = center_of_mass(~ismember(center_of_mass{:, 1}, {'CenterofHomogeneousMass'}), :);
    %center_of_mass = center_of_mass(~ismember(center_of_mass{:, 1}, {'CenterofHomogeneousMassX'}), :);

    % Use the modified column names to extract the data
    x_coords = center_of_mass{:, 'CenterOfHomogeneousMass'};  % Extract X coordinates
    y_coords = center_of_mass{:, 'Var2'};  % Adjust as needed for Y coordinates
    z_coords = center_of_mass{:, 'Var3'};  % Adjust as needed for Z coordinates

    % Determine the number of detected spots (volumes)
    siz_point = size(center_of_mass);  % Size of the table, number of rows (spots)
    total_resolved_spots = siz_point(1);  % Total number of volumes quantified
    disp(['Total number of volumes quantified: ', num2str(total_resolved_spots)])

    % Create a matrix to store the distances between detected volumes
    disp('Initializing distance matrix...');
    distances = zeros(total_resolved_spots, total_resolved_spots);  % Initialize distance matrix

    % Create a matrix to store cluster information (1 for close volumes, 0 for far volumes)
    disp('Initializing cluster matrix...');
    cluster = zeros(total_resolved_spots, total_resolved_spots);  % Initialize cluster matrix

    % Loop to calculate the distance matrix between detected volumes
    % The factor 1000 converts the distance from micrometers to nanometers
    disp('Calculating distance matrix...');
    for i = 1:total_resolved_spots
        for j = 1:total_resolved_spots
            distances(i,j) = sqrt( (x_coords(i) - x_coords(j))^2 + (y_coords(i) - y_coords(j))^2 + (z_coords(i) - z_coords(j))^2 ) * 1000;  % Calculate distance in nm
        end
    end

    % Loop to calculate whether two volumes belong to the same cluster
    % Check if distance between volumes is less than or equal to the sum of their radii
    disp('Determining cluster assignments...');
    % ERROR_NAME_SHOULD_BE_clast_par.    clust_par = 1.5;  % Define your clustering parameter (this is an example, adjust as necessary)
    for i = 1:total_resolved_spots
        for j = 1:total_resolved_spots
            if i == j
                cluster(i,j) = 0;  % No clustering with itself
            else
                if distances(i,j) <= clast_par * (radius_nn(i) + radius_nn(j))  % Check if the distance is less than the sum of radii
                    cluster(i,j) = 1;  % Assign to the same cluster
                end
            end
        end
    end

    cluster3 = cluster;  % Store the cluster matrix for further use
    %disp('Cluster Matrix (1 indicates volumes in the same cluster, 0 otherwise):');
    %disp(cluster3);

    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    % determination of the number of clusters the cluster size i.e. how many
    % volumes are contained into one cluster and other cluster information

    numclast=0;
    numclast2(1:siz_point(1),1)=1;
    numclast3(1:siz_point(1),1:siz_point(1))=0;
    num=0; % initial number of clusters
    cluster_record(1:siz_point(1),1:siz_point(1))=0;% matrix where the label of the cluster members should be stored i.e the label
    %of the volumes of cluster members. Please note that the column number of cluster_record matrix is also the label of the first volume find for a given
    %cluster if this column is not made only by zero values

    % searching and isolating clusters
    for i=1:siz_point(1)
        count=0;  % how many volumes are clustering with an individual volume
        count3=0; % how many volumes are belonging to the same cluster
        if sum( cluster(i,:))>=1
        num=num+1;
        %     ind= sum( cluster(i,:));

            for j=1:siz_point(1)
                if  cluster(i,j)==1
                    count=count+1;
                    count3=count3+1;
                    cluster(i,j)=0;% every time if a point on the cluster matrix is found to belong to a specific cluster it is set to
                    %zero in order to not choose this point twice. At the end of the cluster search the cluster matrix should be a full zeros matrix
                    cluster(j,i)=0;% everytime if a point on the cluster matrix is found to belong to a specific cluster is set to
                    %zero in order to the choose this point twice at the end the cluster matrix should be a full zeros matrix
                    vect(count)=j;% here are listed the label of the volumes clastering with the volume under analysis
                    cluster_record(count3,i)=j;
                end
            end
        end

        while count > 0;%%%% in this recursive loop are searched the volumes that were clustering with the first one and in case
            %they are found clustering with other volumes these are stored and this cycle is repeated until when no more clustering volume are found
           count2=count;
           count=0;
           vect2=vect;% saving the information contained in vect (is a variable size vector so overwriting is not to suggest)

           clear vect % delete this variable in a way it can be defined newly in the following part of the loop

           for ss=1:count2
                for j=1:siz_point(1)
                if  cluster(vect2(ss),j)==1
                 count=count+1;

                 cluster(vect2(ss),j)=0;
                 cluster(j,vect2(ss))=0;
                vect(count)=j;
                ll=0; %%ll=0 a volume is for the first time found to be member of a given cluster; ll=1 the volume was already found to be part of the cluster
                % and it will be not counted twice
                     for ms=1:count3
                         if cluster_record(ms,i)==j
                             ll=1;
                         end
                     end
                  if ll==0
                count3=count3+1;
                cluster_record(count3,i)=j;
                  end
                  ll=0;

                end
                end
            end
            clear vect2 %  this variable vector will be redefined at the beginning of the loop
            count;
        end

        if num > 0 % if a cluster is found
            if count3>0
                numb(num)=count3+1;% 1 comes from the fact that a cluster is always found starting from a volume that is not included
                %in the counting routine
            end
        end

    end
    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%dysplaying the information
    sit=size(numb);
    number_of_cluster=sit(2)% the number of columns of the variable numb gives the total number of clusters
    total_number_of_spot_in_cluster= sum(numb)% the total number of volumes belonging to the one of the clusters (it must be inferior to the total number of volumes quantified!!!)
    average_number_of_spot_in_cluster= mean(numb)%  mean number of volumes belonging to a cluster
    dispersion_number_of_spot_in_cluster= std(numb)% variance of the number of volumes belonging to a cluster
    frequent_number_of_spot_in_cluster= median(numb)% median of the number of volumes belonging to a cluster

    indd = 0;

    % Preallocate dist and disti arrays with a sensible size limit
    max_cluster_size = siz_point(1);  % Max number of volumes per cluster, assuming worst case

    % Preallocate clust_info for efficiency
    clust_info = zeros(siz_point(1), 8); % Preallocate space for cluster info

    for l = 1:siz_point(1)

        % Display progress
        fprintf('Processing cluster %d of %d\n', l, siz_point(1));

        % Initialize variables
        cluster_member = [];
        if sum(cluster_record(:, l)) >= 1
            indd = indd + 1;
            cluster_member = [l; cluster_record(cluster_record(:, l) >= 1, l)]; % All labels in the cluster
        end

        if ~isempty(cluster_member)
            got = numel(cluster_member); % Number of volumes in the cluster
            rel = 0; % To keep track of valid distances
            xx = 0; yy = 0; zz = 0;

            % Sum x, y, z positions (in nm, factor 1000)
            cluster_coords = [center_of_mass{cluster_member, 'CenterOfHomogeneousMass'}, ...
                              center_of_mass{cluster_member, 'Var2'}, ...
                              center_of_mass{cluster_member, 'Var3'}] * 1000; % Coordinates in nm

            xx = sum(cluster_coords(:, 1));
            yy = sum(cluster_coords(:, 2));
            zz = sum(cluster_coords(:, 3));

            % Vectorized distance calculation between all points in the cluster
            % We calculate the pairwise distance matrix for all points in the cluster
            dist_matrix = pdist2(cluster_coords, cluster_coords); % pdist2 is highly optimized

            % Extract non-zero distances and store them in disti
            disti = dist_matrix(dist_matrix > 0); % Convert to vector and remove zeros

            % Collect statistical information
            clust_info(indd, 1) = numb(indd); % Number of volumes in the cluster
            clust_info(indd, 2) = 0.5 * mean(disti); % Mean distance between volumes
            clust_info(indd, 3) = std(disti / 2); % Estimate of variance
            clust_info(indd, 4) = max(disti); % Max distance
            clust_info(indd, 5) = min(disti); % Min distance
            clust_info(indd, 6) = xx / got; % X coordinate of the cluster center of mass
            clust_info(indd, 7) = yy / got; % Y coordinate of the cluster center of mass
            clust_info(indd, 8) = zz / got; % Z coordinate of the cluster center of mass

            % Clear temporary variables for next iteration
            clear dist_matrix disti cluster_coords
        end
    end

    % --------- Preparing the final results and save them in the Excel files ---------
    %%% the labels of cluster members are saved in a two column matrix the
    %%% second column contains the label of the first member of the cluster and
    %%% it repeats its value until when the same cluster is under consideration.
    %%% The first column displays the label of the other volumes belonging to
    %%% the clusters
    a=0;
    for i=1:siz_point(1)
        if sum(cluster_record(:,i))>=1
            for j=1:siz_point(1)

                if cluster_record(j,i)>=1
                    a=a+1;
                    position(a,1)=cluster_record(j,i);
                    position(a,2)=i;
                end
            end
        end
    end
    position2= position-1; % since the labelling in Imaris starts from number 0 and not from one in order to find the same cluster in Imaris to the
    % to the label values must be substructed 1
    %
    %%%%%%%% printing in the output file the data  quantified till now
    %%%%%%
    names={'number of particles', 'mean distance', 'variance of distance', 'max distance', 'min distance', 'center_x', 'center_y', 'center_z' };

    names2={'Mean_radius_nm', 'Variance_radius_n ', 'Total_resolved_spots' };
    values=[Mean_radius_nm , Variance_radius_n  total_resolved_spots];
    % xlswrite([num2str(clast_par) '_output_' filename '.xlsx'],names2,1,'A1')
    % xlswrite([num2str(clast_par) '_output_' filename '.xlsx'],values,1,'A2')
    % xlswrite([num2str(clast_par) '_output_' filename '.xlsx'],clust_info,2,'A2')
    % xlswrite([num2str(clast_par) '_output_' filename '.xlsx'],names,2,'A1')
    % xlswrite([num2str(clast_par) '_output_' filename '.xlsx'],position2,3,'A1')
    % Write 'names2' and 'values' to the first sheet
    %DEBUG: writetable(table(values), [num2str(clast_par) '_output_' filename '.xlsx'], 'Sheet', 1, 'WriteRowNames', false, 'WriteVariableNames', false);
    %DEBUG: writetable(table(names2), [num2str(clast_par) '_output_' filename '.xlsx'], 'Sheet', 1, 'WriteRowNames', false);
    % Create a table where the first row is the header and the second is the data
    output_table = [names2; num2cell(values)];
    disp(output_table);
    % Write the table to the Excel file, starting at cell A1
    writetable(cell2table(output_table), [num2str(clast_par) '_output_' filename '.xlsx'], 'Sheet', 1, 'WriteRowNames', false, 'WriteVariableNames', false);
    % % Saving as an older .xls format (Excel 97-2003 format)
    % writetable(cell2table(output_table), [num2str(clast_par) '_output_' filename '.xls'], 'Sheet', 1, 'WriteRowNames', false, 'WriteVariableNames', false);
    % Save as CSV file
    writetable(cell2table(output_table), [num2str(clast_par) '_output_' filename '_Sheet1.csv'], 'WriteRowNames', false, 'WriteVariableNames', false);

    % Convert 'clust_info' into a table if it's not one already and write to the second sheet
    clust_info_table = array2table(clust_info, 'VariableNames', names);  % Convert 'clust_info' to table with proper column names
    writetable(clust_info_table, [num2str(clast_par) '_output_' filename '.xlsx'], 'Sheet', 2, 'WriteRowNames', false);
    % Assuming position2 is a matrix that needs to be written to the third sheet
    writetable(array2table(position2), [num2str(clast_par) '_output_' filename '.xlsx'], 'Sheet', 3, 'WriteRowNames', false, 'WriteVariableNames', false);

    %%%% quantification of the distance between different clusters
    %%%%% the distance between clusters is defined to be the distance between their centre of
    %%%%% mass
    dist_clust(1:sit(2),1:sit(2))=0;
    cc1=0;
    for i=1:sit(2)
        for j=1:sit(2)
            if i==j
                dist_clust(i,j)=100000;% the fact that the distance of one cluster respect itself is zero would make more complicate
                %to determine the minimum distance between a cluster and it neighbours
            else

                cc1=cc1+1;
                dist_clust(i,j)=sqrt( (clust_info(i,6)-clust_info(j,6))^2 + (clust_info(i,7)-clust_info(j,7))^2 + (clust_info(i,8)-clust_info(j,8))^2 );

                dist_clust2(cc1)=dist_clust(i,j);
            end
        end
    end

    mindist_cluster=min(dist_clust);% the minimum distance between one cluster and the other ones
    mean_dist_cluster=mean(mindist_cluster);% the mean distance between closer clusters
    %%% exporting the latest information
    % xlswrite([num2str(clast_par) '_output_' filename '.xlsx'], dist_clust,4,'A1')
    % xlswrite([num2str(clast_par) '_output_' filename '.xlsx'],mean_dist_cluster,5,'A1')
    % xlswrite([num2str(clast_par) '_output_' filename '.xlsx'],mindist_cluster',5,'B1')
    % names3={'number_of_cluster', 'total_number_of_spot_in_cluster ', 'average_number_of_spot_in_cluster', 'dispersion_number_of_spot_in_cluster','frequent_number_of_spot_in_cluster' }
    % values3=[number_of_cluster total_number_of_spot_in_cluster average_number_of_spot_in_cluster dispersion_number_of_spot_in_cluster frequent_number_of_spot_in_cluster];
    % values3=values3';
    % xlswrite([num2str(clast_par) '_output_' filename '.xlsx'],names3,6,'A1')
    % xlswrite([num2str(clast_par) '_output_' filename '.xlsx'],values3',6,'A2')

    % Exporting the latest information
    writetable(table(dist_clust), [num2str(clast_par) '_output_' filename '.xlsx'], 'Sheet', 4, 'WriteRowNames', true);
    writetable(table(mean_dist_cluster), [num2str(clast_par) '_output_' filename '.xlsx'], 'Sheet', 5, 'WriteRowNames', true);
    writetable(table(mindist_cluster'), [num2str(clast_par) '_output_' filename '.xlsx'], 'Sheet', 5, 'WriteRowNames', true, 'WriteVariableNames', false);

    % Save other information about clusters
    names3 = {'number_of_cluster', 'total_number_of_spot_in_cluster', 'average_number_of_spot_in_cluster', ...
        'dispersion_number_of_spot_in_cluster', 'frequent_number_of_spot_in_cluster'};
    values3 = [number_of_cluster, total_number_of_spot_in_cluster, average_number_of_spot_in_cluster, ...
        dispersion_number_of_spot_in_cluster, frequent_number_of_spot_in_cluster];

    %DEBUG: writetable(table(values3), [num2str(clast_par) '_output_' filename '.xlsx'], 'Sheet', 6, 'WriteRowNames', true, 'WriteVariableNames', false);
    %DEBUG: writetable(table(names3), [num2str(clast_par) '_output_' filename '.xlsx'], 'Sheet', 6, 'WriteRowNames', true);
    % Create a table where the first row is the header and the second row is the data
    output_table3 = [names3; num2cell(values3)];
    disp(output_table3);
    % Write the combined table to the Excel file, starting from cell A1 of Sheet 6
    writetable(cell2table(output_table3), [num2str(clast_par) '_output_' filename '.xlsx'], 'Sheet', 6, 'WriteRowNames', false, 'WriteVariableNames', false);
    % % Saving as an older .xls format (Excel 97-2003 format)
    % writetable(cell2table(output_table3), [num2str(clast_par) '_output_' filename '.xls'], 'Sheet', 6, 'WriteRowNames', false, 'WriteVariableNames', false);
    % Save as CSV file
    writetable(cell2table(output_table3), [num2str(clast_par) '_output_' filename '_Sheet6.csv'], 'WriteRowNames', false, 'WriteVariableNames', false);

AGR-Kühler defekt – Symptome, Reparatur & Kosten (AGR散热器故障 – 症状、维修与费用)

https://www.motointegrator.de/blog/agr-kuehler-defekt-symptome-reparatur-kosten/

Pos. ATU-Nr. Dienstleistung / Teilebezeichnung Menge Betrag EUR Brutto

1 TEXT00 KV für AGR-Ventil erneuern 2 1EIN60 AGR-KÜHLER – AUS- UND EINBAUEN 1

3 FREMD0 AGR Kühler-Ventil (WM: 301.14.54) 1

4 NO6931 NORAUTO KUEHLERSCH. READYMI G12 EVO 5L 2

Werkstatt Alborz KfZ-Meisterbetrieb (https://alborz.gmbh/)

Hr. Mostafa

Kelloggstr. 8, 22045 HH

040-6570539 (0179-68372)

AGR散热器(也称为废气散热器)属于废气再循环(AGR)系统的一部分。它们的作用是将废气流中的一部分在返回进气道之前降温。为此,AGR散热器集成在冷却水循环系统中。故障的AGR散热器可能会导致高温负荷、废气残留物和因老化造成的磨损。其症状非常明显。一旦出现这些症状,应立即更换AGR散热器。

目录

* 10种常见的AGR散热器故障症状
* AGR散热器维修与费用
* 常见问题解答
   - AGR散热器的作用是什么?
   - 故障的AGR散热器能继续使用吗?
   - 如果AGR散热器坏了,会发生什么?

10种常见的废气再循环(AGR)散热器故障症状

一个故障的AGR散热器可能会表现出多种症状,这些症状不仅会影响车辆的性能,还可能导致严重的故障。准确了解AGR散热器故障的迹象非常重要,以便及时发现和修复可能的损坏。以下是一些可能表明AGR散热器故障的常见症状:

* 涡轮增压器的增压压力下降
* 冷却液流失,伴随发动机温度升高
* 废气压力无法控制地泄漏
* 在发动机运转时,冷却水缓慢流失
* 发动机重新启动后,可能会对气缸、活塞和连杆造成损害
* 车内有废气味
* 发动机关闭后,冷却系统的压力下降
* AGR散热器上可见泄漏,尤其是在废气侧
* 发动机区域温度异常
* 发动机舱内有明显的蒸汽或烟雾

一个故障的AGR散热器通过明显的症状表现出来。一个明显的故障迹象是冷却水的缓慢流失,伴随着更高的发动机温度。通常,这种流失最初是不可察觉的,因为在发动机运转时,废气背压高于冷却系统的压力。一旦发动机关闭,压力下降,冷却液便会进入发动机。重新启动发动机时,可能会对气缸、活塞和连杆造成损害。另一个症状是在车内出现废气味。

AGR散热器维修和费用

当出现AGR散热器故障症状时,应立即更换散热器。否则,可能会造成严重的损坏,甚至导致发动机严重损坏。尤其是当冷却水流失严重时,可能会导致发动机过热,从而损坏气缸盖垫片和气缸盖。

AGR散热器故障的维修费用根据车辆型号差异较大。通常,需要拆卸多个部件,如废气管、油回流管或柴油颗粒过滤器,以便接触到散热器。一旦散热器暴露出来,就可以进行更换。建议不要使用二手AGR散热器。

如果废气散热器故障,AGR阀通常也需要更换,因为它们通常是一个整体部件。如果这些部件是分开安装的,或者只是AGR阀损坏,则不需要更换两个部件。

一个新的AGR散热器原厂价格通常在750欧元到1500欧元之间。此外,还需要支付新冷却液的材料费用——大约三到五升。每升冷却液大约8到10欧元,总费用大约为25欧元到50欧元。更换散热器的工作时间也需要计算在内。此过程中还包括AGR阀的重新学习。整体工作时间大约为四到六小时。根据工时费,工作费用大约为400欧元到600欧元。最终费用根据车辆型号不同而有所不同。总体而言,预计费用大约为1100欧元到2200欧元。

通过在Motointegrator的在线商店购买新的AGR散热器和相关组件,您可以节省费用。我们的商品包括约500万个汽车配件和配件,来自顶级品牌和优质制造商,价格优惠。欢迎您亲自来看看并在我们的在线商店中选购。

常见问题解答

AGR散热器的作用是什么?

AGR散热器,也称为废气散热器,是废气再循环系统(AGR)的一个重要组成部分,旨在在废气被重新引导进入进气管之前对部分废气流进行冷却。它集成在冷却水循环系统中。一个故障的AGR散热器可能会导致过高的热负荷、废气残留物和因老化造成的磨损。这些明显的症状表明,必须立即更换AGR散热器。

故障的AGR散热器可以继续使用吗?

通常情况下,如果AGR阀故障,短期内不会产生重大影响。然而,如果发生严重故障,导致发动机进入紧急模式,则应立即前往修理厂。

如果AGR散热器坏了,会发生什么?

一个故障的AGR散热器可能会导致发动机过热、废气残留物堆积以及加速老化和磨损。

AGR-Kühler (auch Abgaskühler genannt) gehören zur Abgasrückführung (AGR). Sie sorgen dafür, dass die Temperaturen eines Teils des Abgasstroms vor der Rückführung in den Ansaugtrakt heruntergekühlt werden. Dafür ist der AGR-Kühler in den Kühlwasserkreislauf integriert. Ein defekter AGR-Kühler kann zu hohen thermischen Belastungen, Abgasrückständen und altersbedingtem Verschleiß führen. Die Symptome sind eindeutig. Treten diese auf, sollte der AGR-Kühler umgehend ausgetauscht werden.

Inhaltsverzeichnis

* 10 häufige Symptome eines defekten AGR-Kühlers
* AGR-Kühler reparieren und Kosten
* FAQs
   - Was macht ein AGR-Kühler?
   - Kann man mit defektem AGR-Kühler fahren?
   - Was passiert, wenn der AGR-Kühler kaputt ist?

AGR-Kühler defekt – Symptome, Reparatur & Kosten

AGR-Kühler (auch Abgaskühler genannt) gehören zur Abgasrückführung (AGR). Sie sorgen dafür, dass die Temperaturen eines Teils des Abgasstroms vor der Rückführung in den Ansaugtrakt heruntergekühlt werden. Dafür ist der AGR-Kühler in den Kühlwasserkreislauf integriert. Ein defekter AGR-Kühler kann zu hohen thermischen Belastungen, Abgasrückständen und altersbedingtem Verschleiß führen. Die Symptome sind eindeutig. Treten diese auf, sollte der AGR-Kühler umgehend ausgetauscht werden.

10 häufige Symptome eines defekten AGR-Kühlers

Ein fehlerhafter AGR-Kühler kann sich durch eine Vielzahl von Symptomen bemerkbar machen, die nicht nur die Leistung des Fahrzeugs beeinträchtigen, sondern auch zu ernsthaften Problemen führen können. Ein genaues Verständnis der Anzeichen eines AGR-Kühler-Defekts ist von entscheidender Bedeutung, um mögliche Schäden zu erkennen und zu beheben. Hier sind einige häufige Symptome, die auf einen defekten AGR-Kühler hinweisen können:

* Abfallender Ladedruck des Turboladers
* Verlust von Kühlflüssigkeit in Verbindung mit erhöhter Motortemperatur
* Unkontrolliertes Entweichen des Abgasdrucks
* Schleichender Verlust von Kühlwasser beim laufenden Motor
* Mögliche Schäden an Zylinder, Kolben und Pleuel nach erneutem Start des Motors
* Abgasgeruch im Fahrzeuginnenraum
* Druckabfall im Kühlsystem nach Abstellen des Motors
* Sichtbare Undichtigkeiten am AGR-Kühler, besonders auf der Abgasseite
* Temperaturanomalien im Motorbereich
* Auffällige Dampf- oder Rauchentwicklung im Motorraum

Ein defekter AGR-Kühler zeigt sich durch eindeutige Symptome. Ein sicheres Anzeichen für einen Defekt am Abgaskühler ist ein schleichender Verlust des Kühlwassers in Verbindung mit einer höheren Motortemperatur. Meist bleibt dieser Verlust zunächst unentdeckt, weil bei einem laufenden Motor der Abgasgegendruck höher als der Druck im Kühlsystem ist. Sobald der Motor abgestellt wird, sinkt der Druck und die Kühlflüssigkeit gelangt in den Motor. Wird der Motor wieder gestartet, kann es zu Schäden an Zylinder, Kolben und Pleuel kommen. Ein weiteres Symptom ist Abgasgeruch im Fahrzeuginnenraum. AGR-Kühler reparieren und Kosten

Treten Symptome eines defekten AGR-Kühlers auf, sollte umgehend der Kühler ausgetauscht werden. Anderenfalls besteht die Gefahr von Schäden bis hin zu einem kapitalen Motorschaden. Das gilt vor allem dann, wenn der Kühlwasserverlust so stark ist, dass der Motor überhitzt. Dann kann es zu Schäden an der Zylinderkopfdichtung und dem Zylinderkopf kommen.

Die Kosten für einen defekten AGR-Kühler können je nach Fahrzeugmodell stark variieren. Meist müssen verschiedene Komponenten wie das Abgasrohr, die Ölrücklaufleitung oder der Dieselpartikelfilter demontiert werden, um an den Kühler zu gelangen. Sobald der Kühler freigelegt ist, kann er ersetzt werden. Die Verwendung von gebrauchten AGR-Kühlern empfiehlt sich nicht.

Bei einem defekten Abgaskühler wird auch das AGR-Ventil ersetzt, da es sich hier oft um ein Bauteil handelt. Anders sieht das aus, wenn die Komponenten getrennt verbaut wurden oder lediglich das AGR-Ventil beschädigt ist. In diesem Fall müssen nicht beide Bauteile gewechselt werden.

Ein neuer AGR-Kühler kostet im Original zwischen 750 Euro und 1.500 Euro. Dazu kommen die Materialkosten für das neue Kühlwasser – ca. drei bis fünf Liter. Zusammengerechnet kommen hierfür pro Liter etwa acht bis zehn Euro zusammen. In Summe ergeben sich ca. 25 Euro bis 50 Euro. Der Arbeitsaufwand für den Wechsel des Kühlers muss ebenfalls dazugerechnet werden. Hier fließt auch der Anlernvorgang des AGR-Ventils rein. Insgesamt müssen für den Aufwand vier bis sechs Stunden veranschlagt werden. Je nach Stundensatz kann es zu Arbeitskosten von ca. 400 Euro bis 600 Euro kommen. Die finalen Kosten variieren je nach Fahrzeugmodell. Insgesamt kann mit Kosten von rund 1.100 Euro und 2.200 Euro gerechnet werden.

Sie können Kosten sparen, indem Sie den neuen AGR-Kühler und dessen Komponenten günstig im Online-Shop von Motointegrator kaufen. In unserem Sortiment mit ca. 5 Millionen Kfz-Ersatzteilen und Zubehör finden Sie eine große Auswahl an Kfz-Ersatzteilen von Top-Marken und Premium-Herstellern zu einem günstigen Preis. Überzeugen Sie sich selbst davon und stöbern Sie in unserem Online-Shop. FAQs Was macht ein AGR-Kühler?

Der AGR-Kühler, auch als Abgaskühler bekannt, ist ein wesentlicher Bestandteil der Abgasrückführung (AGR), der dazu dient, einen Teil des Abgasstroms vor seiner Rückführung in den Ansaugtrakt zu kühlen. Er ist in den Kühlwasserkreislauf integriert. Bei einem defekten AGR-Kühler können erhöhte thermische Belastungen, Rückstände im Abgas und altersbedingter Verschleiß auftreten. Diese klar erkennbaren Symptome weisen auf Probleme hin, die einen sofortigen Austausch des AGR-Kühlers erfordern. Kann man mit defektem AGR-Kühler fahren?

Bei einem defekten AGR-Ventil sind normalerweise keine kurzfristigen Auswirkungen zu erwarten. Allerdings sollte bei einer schwerwiegenden Fehlfunktion, die das Notlaufprogramm des Motors aktiviert, sofort eine Werkstatt aufgesucht werden. Was passiert, wenn der AGR-Kühler kaputt ist?

Ein defekter AGR-Kühler kann zu einer Überhitzung des Motors führen, Rückstände im Abgas zurücklassen und altersbedingten Verschleiß begünstigen.

How to Fix NTFS Mount Errors and File System Corruption on Linux

jhuang@WS-2290C:/dev$ sudo mount /dev/sdg1 /media/jhuang/Elements $MFTMirr does not match $MFT (record 3). Failed to mount ‘/dev/sdg1’: Input/output error NTFS is either inconsistent, or there is a hardware fault, or it’s a SoftRAID/FakeRAID hardware. In the first case run chkdsk /f on Windows then reboot into Windows twice. The usage of the /f parameter is very important! If the device is a SoftRAID/FakeRAID then first activate it and mount a different device under the /dev/mapper/ directory, (e.g. /dev/mapper/nvidia_eahaabcc1). Please see the ‘dmraid’ documentation for more details.

The error indicates an issue with the NTFS file system on /dev/sdg1. Specifically, it points to a corruption in the $MFT (Master File Table) or a potential hardware issue. This is a common problem when NTFS file systems are not properly shut down or there are issues during file system operations. Solutions to try:

  1. Run chkdsk on Windows

    Since the error specifically mentions running chkdsk /f on Windows, you can try the following:

    • Unmount the drive if it’s currently mounted:

      sudo umount /dev/sdg1
    • Use a Windows machine (or dual-boot with Windows) to check the NTFS file system.

    • Once you connect the drive to a Windows machine, open a Command Prompt as Administrator and run:

      chkdsk /f X:

      Replace X: with the actual drive letter corresponding to your NTFS partition.

    • After chkdsk finishes, restart Windows twice as suggested in the error message to ensure the system fully clears any issues.

    • Reconnect the drive to your Linux machine and try mounting it again:

      sudo mount /dev/sdg1 /media/jhuang/Elements
  2. Check for Hardware Issues

    If chkdsk does not resolve the issue, there could be a problem with the hardware itself:

    • Try a different USB port (if it’s a USB drive) or different computer to rule out issues with your current setup.
    • If the drive shows signs of failure (slow access, clicking sounds, etc.), it could be a hardware issue.
  3. Use ntfsfix on Linux

    If you do not have access to a Windows machine, you can try using ntfsfix, which is a tool available on Linux that can sometimes resolve minor NTFS inconsistencies. It’s not as powerful as chkdsk but might help in some cases.

    • Install ntfs-3g (if not already installed):

      sudo apt-get install ntfs-3g
    • Run ntfsfix:

      sudo ntfsfix /dev/sdg1
    • Try mounting the partition again:

      sudo mount /dev/sdg1 /media/jhuang/Elements
  4. Check for RAID/FakeRAID Configuration

    If you are using RAID or FakeRAID (software RAID) on the device, this can cause issues with mounting the NTFS partition. The error message mentions the possibility of a FakeRAID, so you may need to use dmraid or ensure that the RAID is properly set up.

    • Check RAID status:

      sudo dmraid -s
    • If a RAID setup is detected, you might need to activate the RAID configuration and mount the device using /dev/mapper/, not directly via /dev/sdg1.

    • If you are unsure, refer to the documentation of your RAID setup for how to correctly mount the device.

  5. Backup Data (if possible)

    If the partition is still inaccessible after these steps, you may want to try data recovery tools like TestDisk or PhotoRec (both of which can handle NTFS partitions) to recover any important files before you attempt to format the drive.

  6. Summary:

    • Run chkdsk /f on Windows to repair the NTFS file system.
    • If the drive has hardware issues, consider using a different port or machine.
    • Try using ntfsfix on Linux as a last resort.
    • If using RAID, verify and manage the RAID setup.

What is rlog in DESeq2?

The regularized log transformation (rlog) in DESeq2 is a function designed to stabilize the variance in RNA-seq count data, making it more suitable for downstream statistical analyses. This transformation is particularly useful for reducing the heteroscedasticity (variance increases with the mean) commonly seen in RNA-seq data.

The function used in DESeq2 to apply the regularized log transformation is rlog(). It performs the transformation by modeling the counts as a function of size factors (accounting for library size differences) and then applying a regularization technique to smooth out the variance across genes, which helps with downstream visualizations and clustering.

Key points:

  • Purpose: The main goal of rlog is to stabilize variance across genes and make RNA-seq data more comparable, which helps to visualize relationships between samples and identify patterns of gene expression in techniques like principal component analysis (PCA) and clustering.

  • Variance Stabilization: Raw RNA-seq counts often show greater variance at higher expression levels. The rlog transformation reduces this effect, making low and high-expression genes more comparable in terms of variance.

  • Function Call:

      rlog_counts <- rlog(dds, blind = FALSE)
    
        #dds: A DESeqDataSet object containing your RNA-seq count data.
        #blind: If TRUE, the transformation ignores the experimental design (useful for exploratory analysis). If FALSE, it incorporates the experimental design into the transformation (recommended for differential expression analysis).

Example Workflow in DESeq2:

    library(DESeq2)

    # Assuming dds is a DESeqDataSet object created from raw count data
    dds <- DESeq(dds)

    # Apply the rlog transformation
    rlog_counts <- rlog(dds, blind = FALSE)

    # Use rlog-transformed data for PCA
    plotPCA(rlog_counts)

Conclusion:

The regularized log transformation is an important step in RNA-seq data analysis when you want to visualize the relationships between samples or perform clustering, as it stabilizes variance and removes the mean-dependent variability present in raw count data.

Virus Genome Analysis Pipeline: Hybrid Capture, DAMIAN Blastn, and VRAP Mapping for Measles (麻疹) Sample

Measles_S1_on_OR854811_reads_coverage

  1. DAMIAN calculations

    cd /mnt/nvme0n1p1/blast
    
    # -- Measles --
    #I think it would be good to first analyse the sample in DAMIAN Blastn and then map it to the closest related genome resulting from DAMIAN to see if we cover the entire sequence.
    damian.rb --host human3 --type rna -1 /home/jhuang/DATA/Data_Damian/241213_VH00358_120_AAG523FM5_measles/p20578/N24_539_1A_measles_S1_R1_001.fastq.gz -2 /home/jhuang/DATA/Data_Damian/241213_VH00358_120_AAG523FM5_measles/p20578/N24_539_1A_measles_S1_R2_001.fastq.gz --sample N24_539_1A_measles_S1_blastn --blastn progressive --blastp never --min_contiglength 500 --threads 40 --force
    damian_report.rb
    
    #Please find enclosed two links to further samples that require analysis with DAMIAN. In each case, the host is a human. The samples comprise three liver/serum samples (upper link below)  and 12 samples from the ring trial (lower link below).
    #This is merely an initial evaluation of samples with DAMIAN, using Blastn.
    
    # -- Leber/Serum --
    #Bei den Lexogen CORALL Libraries muss bei der Analyse darauf geachtet werden, dass diese in Read 1 standardmäßig einen UMI (12 nt) haben und die ersten 9 nt von Read 2 haben eine erhöhte Fehlerrate beim mapping (durch das Priming-Prinzip laut Hersteller) – die könnte man vor der weiteren Analyse trimmen.
    
    #cutadapt -u 12 -U 9 -o output_R1_trimmed.fastq -p output_R2_trimmed.fastq input_R1.fastq input_R2.fastq
    #
    #Explanation:
    #
    #    -u 12: Trim 12 bases from the start of Read 1 (this is for the UMI).
    #    -U 9: Trim 9 bases from the start of Read 2 (this is to deal with the higher error rate in the first 9 bases of Read 2).
    #    -o output_R1_trimmed.fastq: Output file for Read 1 after trimming.
    #    -p output_R2_trimmed.fastq: Output file for Read 2 after trimming.
    #
    #Why this works:
    #
    #    The -u option applies trimming to Read 1, while the -U option applies trimming to Read 2. This ensures that the correct bases are trimmed from each read, without affecting the other.
    #
    #Now when you run this command, Read 1 will have 12 bases trimmed, and Read 2 will have 9 bases trimmed, as expected.
    
    cutadapt -u 12 -U -9 -o Leber1_Corall_wo_Dnase_S5_R1_001.fastq -p Leber1_Corall_wo_Dnase_S5_R2_001.fastq p20575_L3/Leber1_Corall_wo_Dnase_S5_R1_001.fastq.gz p20575_L3/Leber1_Corall_wo_Dnase_S5_R2_001.fastq.gz
    cutadapt -u 12 -U -9 -o Leber2_Corall_wo_Dnase_S6_R1_001.fastq -p Leber2_Corall_wo_Dnase_S6_R2_001.fastq ./p20576_L3/Leber2_Corall_wo_Dnase_S6_R1_001.fastq.gz ./p20576_L3/Leber2_Corall_wo_Dnase_S6_R2_001.fastq.gz
    cutadapt -u 12 -U -9 -o Serum_Corall_wo_Dnase_S7_R1_001.fastq -p Serum_Corall_wo_Dnase_S7_R2_001.fastq ./p20577_L3/Serum_Corall_wo_Dnase_S7_R1_001.fastq.gz ./p20577_L3/Serum_Corall_wo_Dnase_S7_R2_001.fastq.gz
    
    for sample in Leber1_Corall_wo_Dnase_S5 Leber2_Corall_wo_Dnase_S6 Serum_Corall_wo_Dnase_S7; do
    java -jar /home/jhuang/Tools/Trimmomatic-0.36/trimmomatic-0.36.jar PE -threads 16 ${sample}_R1_001.fastq ${sample}_R2_001.fastq trimmed/${sample}_R1.fastq.gz trimmed/${sample}_unpaired_R1.fastq.gz trimmed/${sample}_R2.fastq.gz trimmed/${sample}_unpaired_R2.fastq.gz ILLUMINACLIP:/home/jhuang/Tools/Trimmomatic-0.36/adapters/TruSeq3-PE-2.fa:2:30:10:8:TRUE LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36 AVGQUAL:20;
    done
    
    fastqc ./trimmed/Leber1_Corall_wo_Dnase_S5_R1.fastq.gz
    fastqc ./trimmed/Leber1_Corall_wo_Dnase_S5_R2.fastq.gz
    
    damian.rb --host human3 --type rna -1 /home/jhuang/DATA/Data_Damian/241213_VH00358_120_AAG523FM5_Liver/trimmed/Leber1_Corall_wo_Dnase_S5_R1.fastq.gz -2 /home/jhuang/DATA/Data_Damian/241213_VH00358_120_AAG523FM5_Liver/trimmed/Leber1_Corall_wo_Dnase_S5_R2.fastq.gz --sample Leber1_Corall_wo_Dnase_S5_blastn --blastn progressive --blastp never --min_contiglength 500 --threads 60 --force
    damian_report.rb -a 37 -n Leber1_Corall_wo_Dnase_S5_blastn
    
    damian.rb --host human3 --type rna -1 /home/jhuang/DATA/Data_Damian/241213_VH00358_120_AAG523FM5_Liver/trimmed/Leber2_Corall_wo_Dnase_S6_R1.fastq.gz -2 /home/jhuang/DATA/Data_Damian/241213_VH00358_120_AAG523FM5_Liver/trimmed/Leber2_Corall_wo_Dnase_S6_R2.fastq.gz --sample Leber2_Corall_wo_Dnase_S6_blastn --blastn progressive --blastp never --min_contiglength 500 --threads 60 --force
    damian_report.rb -a 39 -n Leber2_Corall_wo_Dnase_S6_blastn
    
    damian.rb --host human3 --type rna -1 /home/jhuang/DATA/Data_Damian/241213_VH00358_120_AAG523FM5_Liver/trimmed/Serum_Corall_wo_Dnase_S7_R1.fastq.gz -2 /home/jhuang/DATA/Data_Damian/241213_VH00358_120_AAG523FM5_Liver/trimmed/Serum_Corall_wo_Dnase_S7_R2.fastq.gz --sample Serum_Corall_wo_Dnase_S7_blastn --blastn progressive --blastp never --min_contiglength 500 --threads 60 --force
    damian_report.rb -a 41 -n Serum_Corall_wo_Dnase_S7_blastn
    
    zip -r Serum_Corall_wo_Dnase_S7_blastn.zip Serum_Corall_wo_Dnase_S7_blastn
    
    # -- Ringversuch --
    #Hier waren die DNA-Libraries der Samples ungewöhnlich niedrig konzentriert (das Kit ist normalerweise sehr robust für niedrigen Input, von ChIP-Proben auch z.B.) und die Read-Anzahl ist auch niedriger als angepeilt. Vielleicht klärt die bioinformatische Analyse das ja etwas auf – bzw. wie viel DNA würdet ihr aus dem Ausgangsmaterial erwarten (waren es Zell-Spikeins in der RNA?).. Die RNA-Libraries aus den gleichen Proben sehen jeweils ok aus, Readanzahl wurde in etwa erreicht, die Duplikationsrate ist relativ hoch, was bei low input aber auch normal sein kann.
    
    damian.rb --host human3 --type rna -1 /home/jhuang/DATA/Data_Damian/241213_VH00358_120_AAG523FM5_Ringversuch/p20585/07_RV1_RNA_S7_R1_001.fastq.gz -2 /home/jhuang/DATA/Data_Damian/241213_VH00358_120_AAG523FM5_Ringversuch/p20585/07_RV1_RNA_S7_R2_001.fastq.gz --sample 07_RV1_RNA_S7_blastn --blastn progressive --blastp never --min_contiglength 500 --threads 100 --force
    damian_report.rb
    damian.rb --host human3 --type rna -1 /home/jhuang/DATA/Data_Damian/241213_VH00358_120_AAG523FM5_Ringversuch/p20586/08_RV2_RNA_S8_R1_001.fastq.gz -2 /home/jhuang/DATA/Data_Damian/241213_VH00358_120_AAG523FM5_Ringversuch/p20586/08_RV2_RNA_S8_R2_001.fastq.gz --sample 08_RV2_RNA_S8_blastn --blastn progressive --blastp never --min_contiglength 500 --threads 100 --force
    damian_report.rb
    damian.rb --host human3 --type rna -1 /home/jhuang/DATA/Data_Damian/241213_VH00358_120_AAG523FM5_Ringversuch/p20587/09_RV3_RNA_S9_R1_001.fastq.gz -2 /home/jhuang/DATA/Data_Damian/241213_VH00358_120_AAG523FM5_Ringversuch/p20587/09_RV3_RNA_S9_R2_001.fastq.gz --sample 09_RV3_RNA_S9_blastn --blastn progressive --blastp never --min_contiglength 500 --threads 100 --force
    damian_report.rb
    damian.rb --host human3 --type rna -1 /home/jhuang/DATA/Data_Damian/241213_VH00358_120_AAG523FM5_Ringversuch/p20588/10_RV4_RNA_S10_R1_001.fastq.gz -2 /home/jhuang/DATA/Data_Damian/241213_VH00358_120_AAG523FM5_Ringversuch/p20588/10_RV4_RNA_S10_R2_001.fastq.gz --sample 10_RV4_RNA_S10_blastn --blastn progressive --blastp never --min_contiglength 500 --threads 100 --force
    damian_report.rb
    damian.rb --host human3 --type rna -1 /home/jhuang/DATA/Data_Damian/241213_VH00358_120_AAG523FM5_Ringversuch/p20589/11_RV5_RNA_S11_R1_001.fastq.gz -2 /home/jhuang/DATA/Data_Damian/241213_VH00358_120_AAG523FM5_Ringversuch/p20589/11_RV5_RNA_S11_R2_001.fastq.gz --sample 11_RV5_RNA_S11_blastn --blastn progressive --blastp never --min_contiglength 500 --threads 100 --force
    damian_report.rb
    damian.rb --host human3 --type rna -1 /home/jhuang/DATA/Data_Damian/241213_VH00358_120_AAG523FM5_Ringversuch/p20590/12_RV6_RNA_S12_R1_001.fastq.gz -2 /home/jhuang/DATA/Data_Damian/241213_VH00358_120_AAG523FM5_Ringversuch/p20590/12_RV6_RNA_S12_R2_001.fastq.gz --sample 12_RV6_RNA_S12_blastn --blastn progressive --blastp never --min_contiglength 500 --threads 100 --force
    damian_report.rb
    #END
    
    cd jhuang@hamm:/mnt/h1/jhuang/blast
    damian.rb --host human3 --type dna -1 /home/jhuang/DATA/Data_Damian/241213_VH00358_120_AAG523FM5_Ringversuch/p20579/01_RV1_DNA_S1_R1_001.fastq.gz -2 /home/jhuang/DATA/Data_Damian/241213_VH00358_120_AAG523FM5_Ringversuch/p20579/01_RV1_DNA_S1_R2_001.fastq.gz --sample 01_RV1_DNA_S1_blastn --blastn progressive --blastp never --min_contiglength 500 --threads 55 --force
    damian_report.rb
    damian.rb --host human3 --type dna -1 /home/jhuang/DATA/Data_Damian/241213_VH00358_120_AAG523FM5_Ringversuch/p20580/02_RV2_DNA_S2_R1_001.fastq.gz -2 /home/jhuang/DATA/Data_Damian/241213_VH00358_120_AAG523FM5_Ringversuch/p20580/02_RV2_DNA_S2_R2_001.fastq.gz --sample 02_RV2_DNA_S2_blastn --blastn progressive --blastp never --min_contiglength 500 --threads 55 --force
    damian_report.rb
    damian.rb --host human3 --type dna -1 /home/jhuang/DATA/Data_Damian/241213_VH00358_120_AAG523FM5_Ringversuch/p20581/03_RV3_DNA_S3_R1_001.fastq.gz -2 /home/jhuang/DATA/Data_Damian/241213_VH00358_120_AAG523FM5_Ringversuch/p20581/03_RV3_DNA_S3_R2_001.fastq.gz --sample 03_RV3_DNA_S3_blastn --blastn progressive --blastp never --min_contiglength 500 --threads 55 --force
    damian_report.rb
    damian.rb --host human3 --type dna -1 /home/jhuang/DATA/Data_Damian/241213_VH00358_120_AAG523FM5_Ringversuch/p20582/04_RV4_DNA_S4_R1_001.fastq.gz -2 /home/jhuang/DATA/Data_Damian/241213_VH00358_120_AAG523FM5_Ringversuch/p20582/04_RV4_DNA_S4_R2_001.fastq.gz --sample 04_RV4_DNA_S4_blastn --blastn progressive --blastp never --min_contiglength 500 --threads 55 --force
    damian_report.rb
    damian.rb --host human3 --type dna -1 /home/jhuang/DATA/Data_Damian/241213_VH00358_120_AAG523FM5_Ringversuch/p20583/05_RV5_DNA_S5_R1_001.fastq.gz -2 /home/jhuang/DATA/Data_Damian/241213_VH00358_120_AAG523FM5_Ringversuch/p20583/05_RV5_DNA_S5_R2_001.fastq.gz --sample 05_RV5_DNA_S5_blastn --blastn progressive --blastp never --min_contiglength 500 --threads 55 --force
    damian_report.rb
    damian.rb --host human3 --type dna -1 /home/jhuang/DATA/Data_Damian/241213_VH00358_120_AAG523FM5_Ringversuch/p20584/06_RV6_DNA_S6_R1_001.fastq.gz -2 /home/jhuang/DATA/Data_Damian/241213_VH00358_120_AAG523FM5_Ringversuch/p20584/06_RV6_DNA_S6_R2_001.fastq.gz --sample 06_RV6_DNA_S6_blastn --blastn progressive --blastp never --min_contiglength 500 --threads 55 --force
    damian_report.rb
    #END
    
    #Sent: DNA_S1, S2, S3, RNA_S7, S8, S9, S10.
    #TODOs: DNA_S4, S5, S6, RNA_S11, S12.
  2. Trimming

    mkdir trimmed
    for sample in N24_539_1A_measles_S1; do
    java -jar /home/jhuang/Tools/Trimmomatic-0.36/trimmomatic-0.36.jar PE -threads 16 ${sample}_R1_001.fastq.gz ${sample}_R2_001.fastq.gz trimmed/${sample}_R1.fastq.gz trimmed/${sample}_unpaired_R1.fastq.gz trimmed/${sample}_R2.fastq.gz trimmed/${sample}_unpaired_R2.fastq.gz ILLUMINACLIP:/home/jhuang/Tools/Trimmomatic-0.36/adapters/TruSeq3-PE-2.fa:2:30:10:8:TRUE LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36 AVGQUAL:20;
    done
  3. Download all virus genomes (18644)

    mv datasets /usr/local/bin/
    chmod +x /usr/local/bin/datasets
    #datasets download virus genome --complete-only --assembly-source refseq under ~/REFs
    datasets download virus genome taxon "Viruses" --complete-only --refseq
    #To check for RefSeq data only, look for NC_, NM_, or similar prefixes in sequence headers and identifiers.
    wget -r -np -nH --cut-dirs=3 ftp://ftp.ncbi.nlm.nih.gov/genomes/Viruses/
    [Option1] sudo apt install ncbi-entrez-direct
    [Option1] esearch -db nucleotide -query "txid3052345[Organism:exp]" | efetch -format fasta > virus_genome_3052345.fasta  # 23788
     ~/Scripts/filter_fasta.py virus_genome_3052345.fasta virus_complete_genome_3052345.fasta
        import sys
        from Bio import SeqIO
    
        # Get input and output file paths from command line arguments
        input_file = sys.argv[1]
        output_file = sys.argv[2]
    
        # Open the output file to write filtered sequences
        with open(output_file, "w") as out_handle:
            # Parse the multi-line FASTA file
            for record in SeqIO.parse(input_file, "fasta"):
                # Check if 'complete genome' is in the header (description)
                if "complete genome" in record.description:
                    # Write the entire record (header + multi-line sequence) to the output file
                    SeqIO.write(record, out_handle, "fasta")
    
        print(f"Complete genome sequences saved to {output_file}")
    [Option1] grep "complete genome" virus_complete_genome_3052345.fasta | wc -l  #917
    [Option2] https://www.ebi.ac.uk/ena/browser/view/3052345
    [Option2] grep "complete genome" ena_3052345_sequence.fasta | wc -l  #820
  4. The commends for more comprehensive blast annotation, in order to get the closest related genome.

    ln -s ~/Tools/vrap/ .
    conda activate /home/jhuang/miniconda3/envs/vrap
    vrap/vrap.py  -1 trimmed/N24_539_1A_measles_S1_R1.fastq.gz -2 trimmed/N24_539_1A_measles_S1_R2.fastq.gz  -o N24_539_1A_measles_S1 --bt2idx=/home/jhuang/REFs/genome  --host=/home/jhuang/REFs/genome.fa --virus=/home/jhuang/REFs/ncbi_dataset/data/genomic.fna --nt=/mnt/nvme0n1p1/blast/nt --nr=/mnt/nvme0n1p1/blast/nr  -t 100 -l 200  -g
    [Optional] vrap/vrap.py  -1 trimmed/N24_539_1A_measles_S1_R1.fastq.gz -2 trimmed/N24_539_1A_measles_S1_R2.fastq.gz  -o N24_539_1A_measles_S1_3052345 --bt2idx=/home/jhuang/REFs/genome  --host=/home/jhuang/REFs/genome.fa --virus=/home/jhuang/REFs/virus_complete_genome_3052345.fasta --nt=/mnt/nvme0n1p1/blast/nt --nr=/mnt/nvme0n1p1/blast/nr  -t 100 -l 200  -g
    # * 4 nt_dbs (--virus, --host, download_db.py(nucleotide, currently alphaherpesvirus), nt), 2 prot_db (download_db.py(protein, currently alphaherpesvirus), nr) for blast, save under ./blast/db/virus, ./blast/db/host, vrap/database/viral_db/viral_nucleotide, vrap/database/viral_db/viral_protein
    # * 1 bowtie_database for host removal (--host), save under ./bowtie/host.
    # * bowtie run before assembly
    # * blast run after assembly for the contigs, therefore it does not exist the taxfilt step in vrap.
    # * checking the order of the databases for annotation step, namely which database will be taken firstly for annotionn after setting --virus?
    # * If --host is for both bowtie and blastn.
    # * if only --bt2idx define, only bowtie, no blastn! == no ""--host=/home/jhuang/REFs/genome.fa"" still has the host-removal step!
  5. Using the bowtie of vrap to map the reads on ref_genome/reference.fasta (The reference refers to the closest related genome found from the list generated by vrap)

    vrap/vrap.py  -1 trimmed/N24_539_1A_measles_S1_R1.fastq.gz -2 trimmed/N24_539_1A_measles_S1_R2.fastq.gz  -o N24_539_1A_measles_S1_on_OR854811 --host /home/jhuang/Downloads/OR854811.fasta   -t 100 -l 200  -g
    cd bowtie
    mv mapped mapped.sam
    samtools view -S -b mapped.sam > mapped.bam
    samtools sort mapped.bam -o mapped_sorted.bam
    samtools index mapped_sorted.bam
    samtools view -H mapped_sorted.bam
    samtools flagstat mapped_sorted.bam
    #5755885 + 0 in total (QC-passed reads + QC-failed reads)
    0 + 0 secondary
    0 + 0 supplementary
    0 + 0 duplicates
    4457643 + 0 mapped (77.44% : N/A)
    1300648 + 0 paired in sequencing
    650324 + 0 read1
    650324 + 0 read2
    637608 + 0 properly paired (49.02% : N/A)
    901044 + 0 with itself and mate mapped
    77271 + 0 singletons (5.94% : N/A)
    0 + 0 with mate mapped to a different chr
    0 + 0 with mate mapped to a different chr (mapQ>=5)
    #4457643 of 5755885 (77.44%) can map the reference.
    bamCoverage -b mapped_sorted.bam -o ../../Measles_S1_on_OR854811_reads_coverage.bw
  6. Show the bw on IGV

TODO: mapping virus reads should use TopHat2, BWA or Bowtie2 could not detect splice events.

RNA sequencing (RNA-seq)

  • Library preparation for strand specific RNA sequencing was carried out using the NEXTflex Directional RNA-Seq Kit (Bioo Scientific) according to the manufacturer’s instructions.

  • Libraries were sequenced on the Illumina HiSeq 2500 platform.

  • To allow detection of splice events that extend over the origin, reads were mapped to two concatenated copies of the MCVSyn or MCVSyn-hpko genomes using TopHat2 v 2.0.13 [94].

  • The positions of mapped reads and junctions were subsequently collapsed back on unit-length genomes.

  • From the resulting SAM files, we counted the number of unspliced reads that extended over splice sites of junctions detected by TopHat to determine splice site efficacy and frequency of individual junctions.

  • To estimate transcript abundance, for each combination of splice junctions that mapped within either the major early or late transcription cassettes we calculated a relative strand-specific combinatorial frequency value by multiplying observed frequency values for individual donor sites.

  • The relative ratio of late to early transcripts was subsequently estimated by calculation of normalized RPKM (reads per kilobase per million mapped reads) for each of the transcripts.

  • Based on the Juliane paper, “A Comprehensive Analysis of Replicating Merkel Cell Polyomavirus Genomes Delineates the Viral Transcription Program and Suggests a Role for mcv-miR-M1 in Episomal Persistence,” particularly in the “RNA sequencing (RNA-seq)” methods section, they used strand-specific RNA sequencing. This explains why they are able to separate the coverage into blue (positive strand) and red (negative strand). The original text reads: “Library preparation for strand-specific RNA sequencing was carried out using the NEXTflex Directional RNA-Seq Kit (Bioo Scientific) according to the manufacturer’s instructions.”

  • Unfortunately, your RNA-seq data is not strand-specific. Additionally, the primary focus of your sequencing is on the human genome, while the virus is a by-product, meaning we only have a very limited number of virus reads. For example, on average, there are only about 10 reads for VP1 and VP2 per sample. For this reason, creating a plot similar to the one in Juliane’s paper, with separate coverage for positive and negative strands, is not feasible.